Did you look at DataImportHandler? There is also Flume, I think.

Regards,
    Alex

On 27 Sep 2013 17:28, "Francisco Fernandez" <fra...@gmail.com> wrote:
> Hi, I'm a newbie trying to index PubMed texts obtained as XML with a
> structure similar to:
>
> http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&retmode=xml&id=23864173,22073418
>
> The nodes I need to extract, expressed as XPaths, would be:
>
> //PubmedArticle/MedlineCitation/PMID
> //PubmedArticle/MedlineCitation/DateCreated/Year
> //PubmedArticle/MedlineCitation/Article/ArticleTitle
> //PubmedArticle/MedlineCitation/Article/Abstract/AbstractText
> //PubmedArticle/MedlineCitation/MeshHeadingList/MeshHeading
>
> I think one way to index them in Solr is to create another XML structure
> similar to:
> <add>
>   <doc>
>     <field name="id">PMID</field>
>     <field name="year_i">Year</field>
>     <field name="name">ArticleTitle</field>
>     <field name="abstract_s">AbstractText</field>
>     <field name="cat">MeshHeading1</field>
>     <field name="cat">MeshHeading2</field>
>   </doc>
> </add>
>
> where "PMID" = '23864173' and "ArticleTitle" = 'Cost-effectiveness of
> low-molecular-weight heparin compared with aspirin for prophylaxis against
> venous thromboembolism after total joint arthroplasty', and so on.
> With that structure I would post it to Solr by running the following
> command in the documents folder:
> java -jar post.jar *.xml
>
> I'm wondering if there is a more direct way to perform the same task that
> does not involve an 'iterate -> parse -> restructure -> write to disk -> post'
> cycle.
> Many thanks
>
> Francisco
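
If you do end up scripting the restructuring step yourself, it can at least be done in memory, so nothing has to be written to disk before posting. Below is a minimal sketch in Python using only the standard library. The function name pubmed_to_solr_add is made up for illustration, the Solr field names are the ones from the quoted message, and taking the DescriptorName text inside each MeshHeading is an assumption about which value is wanted for "cat". Joining several AbstractText nodes into one string is also an assumption.

```python
import xml.etree.ElementTree as ET


def _text(node):
    """Return the node's full text content (inline markup included), or ''."""
    if node is None:
        return ""
    return "".join(node.itertext()).strip()


def _add_field(doc, name, value):
    """Append <field name="...">value</field> to doc, skipping empty values."""
    if value:
        ET.SubElement(doc, "field", name=name).text = value


def pubmed_to_solr_add(pubmed_xml):
    """Convert a PubMed efetch XML string into a Solr <add> message (hypothetical helper)."""
    root = ET.fromstring(pubmed_xml)
    add = ET.Element("add")
    for article in root.iter("PubmedArticle"):
        citation = article.find("MedlineCitation")
        if citation is None:
            continue
        doc = ET.SubElement(add, "doc")
        _add_field(doc, "id", _text(citation.find("PMID")))
        _add_field(doc, "year_i", _text(citation.find("DateCreated/Year")))
        _add_field(doc, "name", _text(citation.find("Article/ArticleTitle")))
        # Assumption: structured abstracts (several AbstractText nodes) are
        # joined into a single string for the abstract_s field.
        parts = [_text(n) for n in citation.findall("Article/Abstract/AbstractText")]
        _add_field(doc, "abstract_s", " ".join(p for p in parts if p))
        # Assumption: the wanted category is the DescriptorName of each heading.
        for heading in citation.findall("MeshHeadingList/MeshHeading"):
            _add_field(doc, "cat", _text(heading.find("DescriptorName")))
    return ET.tostring(add, encoding="unicode")
```

The returned string is an ordinary Solr XML update message, so it could be sent straight to the update handler over HTTP instead of going through post.jar. The exact URL depends on your installation.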