Hi Khai,

A few weeks ago, I faced the same problem.

In my case, this workaround helped (assuming you're using Solr 1.3): for each row, extract the content from the corresponding PDF file using a parser library of your choice (I suggest Apache PDFBox, or Apache Tika in case you need to process other file types as well), and put it between

        <foo><![CDATA[

and

        ]]></foo>

and store it in an XML file. To keep the relationship between a file and its corresponding database row, use the primary key as the file name (for example, primaryKey.xml, which is what the url attribute below expects).
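
In case it helps, here is a rough Python sketch of the wrapping step only. It assumes you already have the extracted text as a string (from PDFBox, Tika, or whatever you use); the function names wrap_in_cdata and write_row_file are made up for this example. It also handles two things that tend to bite with PDF text: a literal "]]>" inside the content, which would end the CDATA section early, and control characters that XML 1.0 does not allow.

```python
import re
from pathlib import Path

# Characters that are NOT allowed in an XML 1.0 document.
_INVALID_XML_CHARS = re.compile(
    "[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def wrap_in_cdata(text: str, tag: str = "foo") -> str:
    """Wrap extracted text in <tag><![CDATA[ ... ]]></tag>."""
    # Drop characters an XML parser would reject.
    cleaned = _INVALID_XML_CHARS.sub("", text)
    # "]]>" cannot appear inside CDATA, so split it across two sections.
    safe = cleaned.replace("]]>", "]]]]><![CDATA[>")
    return f"<{tag}><![CDATA[{safe}]]></{tag}>"


def write_row_file(primary_key, text: str, out_dir: str = ".") -> Path:
    """Write the wrapped text to <out_dir>/<primary_key>.xml (hypothetical layout)."""
    path = Path(out_dir) / f"{primary_key}.xml"
    path.write_text(wrap_in_cdata(text), encoding="utf-8")
    return path
```

You would call write_row_file(row_id, extracted_text) once per database row, so that the file name matches the primary key.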

In data-config.xml, nest an entity that uses the XPathEntityProcessor inside the entity that reads your table (replace dbRow and primaryKey with your entity name and key column, respectively):

<entity name="pdfcontent"
        processor="XPathEntityProcessor"
        forEach="/foo"
        url="${dbRow.primaryKey}.xml">
  <field column="pdftext" xpath="/foo"/>
</entity>


By the way, in Solr 1.4 you do not have to wrap your content in XML tags: use the PlainTextEntityProcessor instead of the XPathEntityProcessor.

Best,
Sascha

Khai Doan wrote:
Hi all,

My name is Khai.  I have a table in a relational database.  I have
successfully used DataImportHandler to import this data into Apache Solr.
However, one of the columns stores the location of a PDF file.  How can I
configure DataImportHandler to use ExtractingRequestHandler to extract the
content of the PDF?

Thanks!

Khai Doan

