Many thanks to both Mike and Alexandre. I'll take a look at those tools. Lux seems like a good option. Thanks again,
Francisco

On 27/09/2013, at 09:33, Michael Sokolov wrote:

> You might be interested in Lux (http://luxdb.org), which is designed for
> indexing and querying XML using Solr and Lucene. It can run index-supported
> XPath/XQuery over your documents, and you can define arbitrary XPath indexes.
>
> -Mike
>
> On 9/27/13 6:28 AM, Francisco Fernandez wrote:
>> Hi, I'm a newbie trying to index PubMed texts obtained as XML with a
>> structure similar to:
>>
>> http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&retmode=xml&id=23864173,22073418
>>
>> The nodes I need to extract, expressed as XPaths, would be:
>>
>> //PubmedArticle/MedlineCitation/PMID
>> //PubmedArticle/MedlineCitation/DateCreated/Year
>> //PubmedArticle/MedlineCitation/Article/ArticleTitle
>> //PubmedArticle/MedlineCitation/Article/Abstract/AbstractText
>> //PubmedArticle/MedlineCitation/MeshHeadingList/MeshHeading
>>
>> I think a way to index them in Solr is to create another XML structure
>> similar to:
>> <add>
>> <doc>
>> <field name="id">PMID</field>
>> <field name="year_i">Year</field>
>> <field name="name">ArticleTitle</field>
>> <field name="abstract_s">AbstractText</field>
>> <field name="cat">MeshHeading1</field>
>> <field name="cat">MeshHeading2</field>
>> </doc>
>> </add>
>>
>> where "PMID" = '23864173' and "ArticleTitle" = 'Cost-effectiveness of
>> low-molecular-weight heparin compared with aspirin for prophylaxis against
>> venous thromboembolism after total joint arthroplasty', and so on.
>> With that structure I would post it to Solr by running the following command
>> in the documents folder:
>> java -jar post.jar *.xml
>>
>> I'm wondering if there is a more direct way to perform the same task that
>> does not involve an 'iterate->parse->restructure->write to disk->post' cycle.
>> Many thanks
>>
>> Francisco
>
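
For reference, the restructuring step described in the original question can be done in memory, without writing intermediate files. The following is only a minimal sketch in Python, not a tested indexing pipeline. The function name pubmed_to_solr_add, the FIELD_PATHS table, and the SAMPLE record are made up for illustration; the Solr field names are the ones proposed in the question, and the sample is a heavily trimmed stand-in for a real efetch response.

```python
import xml.etree.ElementTree as ET

# Tiny stand-in for an efetch response (assumption: real records carry many
# more elements and attributes than shown here).
SAMPLE = """<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>23864173</PMID>
      <DateCreated><Year>2013</Year></DateCreated>
      <Article>
        <ArticleTitle>Example title</ArticleTitle>
        <Abstract><AbstractText>Example abstract.</AbstractText></Abstract>
      </Article>
      <MeshHeadingList>
        <MeshHeading><DescriptorName>Aspirin</DescriptorName></MeshHeading>
        <MeshHeading><DescriptorName>Heparin</DescriptorName></MeshHeading>
      </MeshHeadingList>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>"""

# Solr field name -> path relative to each PubmedArticle element.
# Field names follow the <add><doc> layout proposed in the question.
FIELD_PATHS = [
    ("id", "MedlineCitation/PMID"),
    ("year_i", "MedlineCitation/DateCreated/Year"),
    ("name", "MedlineCitation/Article/ArticleTitle"),
    ("abstract_s", "MedlineCitation/Article/Abstract/AbstractText"),
    ("cat", "MedlineCitation/MeshHeadingList/MeshHeading"),
]


def pubmed_to_solr_add(pubmed_xml):
    """Return a Solr <add> message (as a string) built from PubMed XML."""
    root = ET.fromstring(pubmed_xml)
    add = ET.Element("add")
    for article in root.iter("PubmedArticle"):
        doc = ET.SubElement(add, "doc")
        for field_name, path in FIELD_PATHS:
            # findall returns every match, so repeated nodes such as
            # MeshHeading become repeated <field name="cat"> entries.
            for node in article.findall(path):
                # Join all nested text and collapse runs of whitespace.
                text = " ".join("".join(node.itertext()).split())
                if text:
                    field = ET.SubElement(doc, "field", name=field_name)
                    field.text = text
    return ET.tostring(add, encoding="unicode")


if __name__ == "__main__":
    print(pubmed_to_solr_add(SAMPLE))
```

The returned string could then be sent to Solr's XML update handler over HTTP rather than saved and posted with post.jar; the exact URL and commit settings depend on the installation. Note that each MeshHeading is flattened to the text of all its child elements, which may or may not be the desired value for the "cat" field.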