Re: Pubmed XML indexing

Michael Sokolov Fri, 27 Sep 2013 05:34:03 -0700

You might be interested in Lux (http://luxdb.org), which is designed for indexing and querying XML using Solr and Lucene. It can run index-supported XPath/XQuery over your documents, and you can define arbitrary XPath indexes.


-Mike


On 9/27/13 6:28 AM, Francisco Fernandez wrote:

Hi, I'm a newbie trying to index PubMed texts obtained as XML with a 
structure similar to:

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&retmode=xml&id=23864173,22073418

The nodes I need to extract, expressed as XPaths, would be:

//PubmedArticle/MedlineCitation/PMID
//PubmedArticle/MedlineCitation/DateCreated/Year
//PubmedArticle/MedlineCitation/Article/ArticleTitle
//PubmedArticle/MedlineCitation/Article/Abstract/AbstractText
//PubmedArticle/MedlineCitation/MeshHeadingList/MeshHeading

I think one way to index them in Solr is to create another XML structure similar 
to:
<add>
<doc>
  <field name="id">PMID</field>
  <field name="year_i">Year</field>
  <field name="name">ArticleTitle</field>
  <field name="abstract_s">AbstractText</field>
  <field name="cat">MeshHeading1</field>
  <field name="cat">MeshHeading2</field>
</doc>
</add>

Here "PMID" would be '23864173', "ArticleTitle" would be 'Cost-effectiveness of 
low-molecular-weight heparin compared with aspirin for prophylaxis against venous thromboembolism 
after total joint arthroplasty', and so on.
With that structure in place, I would post the files to Solr by running the following command in 
the documents folder:
java -jar post.jar *.xml
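
The mapping described above can be sketched in a few lines of Python using only the standard library. This is a minimal illustration, not a definitive implementation: the sample XML is invented and heavily trimmed, the names FIELDS, text_of and to_solr_add are hypothetical, and joining several AbstractText elements into one value is an assumption made so that a single-valued abstract_s field receives one value.

```python
import xml.etree.ElementTree as ET

# Invented, heavily trimmed sample that only mimics the element structure
# of an efetch response; the values are placeholders, not real PubMed data.
SAMPLE = """<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>1</PMID>
      <DateCreated><Year>2013</Year></DateCreated>
      <Article>
        <ArticleTitle>Example title</ArticleTitle>
        <Abstract><AbstractText>Example abstract.</AbstractText></Abstract>
      </Article>
      <MeshHeadingList>
        <MeshHeading><DescriptorName>Heading one</DescriptorName></MeshHeading>
        <MeshHeading><DescriptorName>Heading two</DescriptorName></MeshHeading>
      </MeshHeadingList>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>"""

# (Solr field name, path relative to PubmedArticle, multi-valued?)
# Field names follow the <add> structure proposed in the message.
FIELDS = [
    ("id", "MedlineCitation/PMID", False),
    ("year_i", "MedlineCitation/DateCreated/Year", False),
    ("name", "MedlineCitation/Article/ArticleTitle", False),
    ("abstract_s", "MedlineCitation/Article/Abstract/AbstractText", False),
    ("cat", "MedlineCitation/MeshHeadingList/MeshHeading", True),
]


def text_of(element):
    """Return all text inside an element, with whitespace collapsed."""
    return " ".join("".join(element.itertext()).split())


def to_solr_add(pubmed_xml):
    """Convert a PubmedArticleSet string into a Solr <add> XML string."""
    root = ET.fromstring(pubmed_xml)
    add = ET.Element("add")
    for article in root.iter("PubmedArticle"):
        doc = ET.SubElement(add, "doc")
        for name, path, multi in FIELDS:
            values = [text_of(node) for node in article.findall(path)]
            if not multi and values:
                # Assumption: merge repeated nodes into one value.
                values = [" ".join(values)]
            for value in values:
                ET.SubElement(doc, "field", name=name).text = value
    return ET.tostring(add, encoding="unicode")


if __name__ == "__main__":
    print(to_solr_add(SAMPLE))
```

Because the result is built in memory, it could in principle be sent in the body of an HTTP POST to Solr's XML update handler rather than written to disk first, which would shorten the cycle to 'parse->restructure->post'.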

I'm wondering whether there is a more direct way to perform the same task, one that does not involve an 
'iterate->parse->restructure->write to disk->post' cycle.
Many thanks

Francisco
