Re: How to use DataImportHandler with ExtractingRequestHandler?

Sascha Szott Thu, 03 Sep 2009 10:50:28 -0700

Hi Khai,

A few weeks ago, I faced the same problem.

In my case, this workaround helped (assuming you're using Solr 1.3):
For each row, extract the content from the corresponding PDF file using
a parser library of your choice (I suggest Apache PDFBox, or Apache Tika
in case you need to process other file types as well), put it between


        <foo><![CDATA[

and

        ]]></foo>

and store it in a text file. To keep the relationship between a file and
its corresponding database row, use the primary key as the file name
(with the .xml extension that the url attribute below expects).

Within data-config.xml, use the XPathEntityProcessor as follows (replace
dbRow and primaryKey with the name of your parent entity and of its
primary key column, respectively):


<entity name="pdfcontent"
        processor="XPathEntityProcessor"
        forEach="/foo"
        url="${dbRow.primaryKey}.xml">
  <field column="pdftext" xpath="/foo"/>
</entity>
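
For context, here is a rough sketch of how that entity could sit inside a complete data-config.xml. It is only an illustration under assumptions: the data source names (db, files), the JDBC driver, URL and credentials, the base path, and the SQL query are all placeholders that are not part of the setup described above.

```xml
<dataConfig>
  <!-- Assumed JDBC source for the table; every attribute value is a placeholder -->
  <dataSource name="db" type="JdbcDataSource"
              driver="org.example.Driver"
              url="jdbc:example://localhost/mydb"
              user="solr" password="secret"/>
  <!-- Assumed file source pointing at the directory that holds the generated files -->
  <dataSource name="files" type="FileDataSource"
              basePath="/path/to/extracted" encoding="UTF-8"/>
  <document>
    <!-- Parent entity: one row per database record (placeholder query) -->
    <entity name="dbRow" dataSource="db"
            query="SELECT primaryKey FROM mytable">
      <!-- Child entity: reads <primaryKey>.xml for the current row -->
      <entity name="pdfcontent" dataSource="files"
              processor="XPathEntityProcessor"
              forEach="/foo"
              url="${dbRow.primaryKey}.xml">
        <field column="pdftext" xpath="/foo"/>
      </entity>
    </entity>
  </document>
</dataConfig>
```

The child entity is nested inside the parent so that ${dbRow.primaryKey} is resolved once for each database row.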

And, by the way, in Solr 1.4 you do not have to put your content between
XML tags: use the PlainTextEntityProcessor instead of XPathEntityProcessor.
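
A minimal sketch of what the Solr 1.4 variant might look like, assuming the extracted text is stored as plain <primaryKey>.txt files and that a FileDataSource (or another Reader-based data source) has been declared under the hypothetical name files. The .txt extension and the pdftext target field are assumptions as well. PlainTextEntityProcessor exposes the whole file content in an implicit column called plainText.

```xml
<!-- Child entity of dbRow; "files" is an assumed data source name -->
<entity name="pdfcontent" dataSource="files"
        processor="PlainTextEntityProcessor"
        url="${dbRow.primaryKey}.txt">
  <!-- plainText is the implicit column holding the entire file content -->
  <field column="plainText" name="pdftext"/>
</entity>
```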


Best,
Sascha

Khai Doan wrote:

Hi all,

My name is Khai.  I have a table in a relational database.  I have
successfully used DataImportHandler to import this data into Apache Solr.
However, one of the columns stores the location of a PDF file.  How can I
configure DataImportHandler to use ExtractingRequestHandler to extract the
content of the PDF?

Thanks!

Khai Doan
