Hi Khai,
A few weeks ago I faced the same problem.
In my case, this workaround helped (assuming you're using Solr 1.3):
For each row, extract the content from the corresponding PDF file using
a parser library of your choice (I suggest Apache PDFBox, or Apache Tika
in case you need to process other file types as well), put it between
<foo><![CDATA[
and
]]></foo>
and store it in a file. To keep the relationship between a file and
its corresponding database row, use the primary key as the file name
(with the .xml extension, as expected by the configuration below).
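To illustrate, here is a sketch of what one such file might look like. The file name 42.xml and the primary key 42 are made up for the example. One thing to watch for: if the extracted text happens to contain the sequence ]]> it would end the CDATA section early, so it has to be split across two CDATA sections, as in the last line of the text below.

```xml
<!-- Hypothetical file 42.xml, belonging to the database row with primary key 42 -->
<foo><![CDATA[
Extracted text of the PDF goes here. Characters such as < and & need no escaping inside CDATA.
A literal "]]>" in the text is written as: ]]]]><![CDATA[>
]]></foo>
```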
Within data-config.xml, use the XPathEntityProcessor as follows (replace
dbRow and primaryKey with the name of your parent entity and its primary
key column, respectively):
<entity name="pdfcontent"
        processor="XPathEntityProcessor"
        forEach="/foo"
        url="${dbRow.primaryKey}.xml">
  <field column="pdftext" xpath="/foo"/>
</entity>
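For context, this is roughly how the entity above could sit inside a complete data-config.xml. Treat it as a sketch under assumptions, not as my exact setup: the data source names (db, files), the JDBC driver, URL and credentials, the query, the table name and the basePath are all placeholders you would have to replace. The second data source is a FileDataSource, so the relative url of the inner entity is resolved against basePath.

```xml
<dataConfig>
  <!-- Placeholder JDBC settings: replace driver, url, user and password with your own -->
  <dataSource name="db" type="JdbcDataSource"
              driver="your.jdbc.Driver" url="jdbc:yourdb://localhost/yourdb"
              user="user" password="password"/>
  <!-- Assumed directory that holds the generated <primaryKey>.xml files -->
  <dataSource name="files" type="FileDataSource"
              basePath="/path/to/extracted/" encoding="UTF-8"/>
  <document>
    <!-- Parent entity: one row per database record (query and table name are assumptions) -->
    <entity name="dbRow" dataSource="db"
            query="SELECT primaryKey FROM yourTable">
      <field column="primaryKey" name="id"/>
      <!-- Child entity: reads the file named after the current row's primary key -->
      <entity name="pdfcontent" dataSource="files"
              processor="XPathEntityProcessor"
              forEach="/foo"
              url="${dbRow.primaryKey}.xml">
        <field column="pdftext" xpath="/foo"/>
      </entity>
    </entity>
  </document>
</dataConfig>
```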
By the way, in Solr 1.4 you do not have to wrap your content in XML
tags: use the PlainTextEntityProcessor instead of the XPathEntityProcessor.
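A minimal sketch of the Solr 1.4 variant, assuming the extracted text is stored as plain <primaryKey>.txt files (the .txt extension is my assumption). As far as I know, PlainTextEntityProcessor puts the whole file content into an implicit column called plainText, which is then mapped to the Solr field:

```xml
<entity name="pdfcontent"
        processor="PlainTextEntityProcessor"
        url="${dbRow.primaryKey}.txt">
  <!-- The entire file content is exposed as the implicit column "plainText" -->
  <field column="plainText" name="pdftext"/>
</entity>
```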
Best,
Sascha
Khai Doan wrote:
Hi all,
My name is Khai. I have a table in a relational database. I have
successfully used DataImportHandler to import this data into Apache Solr.
However, one of the columns stores the location of a PDF file. How can I
configure DataImportHandler to use ExtractingRequestHandler to extract the
content of the PDF?
Thanks!
Khai Doan