bug in ExtractingRequestHandler with PDFs and metadata field Category

Andras Balogh Thu, 07 Jul 2011 06:41:08 -0700

Hi,

I think this is a bug, but before reporting it to the issue tracker I thought I would ask here first.

The problem is that I have a PDF file which, among other metadata fields like Author, CreatedDate etc., has a metadata field Category (I can see all the metadata fields with tika-app.jar started in GUI mode).

In my Solr schema I also have a "Category" field, among other fields, and a field called "text" that holds the extracted text from the PDF.

I added <str name="uprefix">metadata_</str> so that all PDF metadata fields should be saved in Solr as "metadata_something" fields.

The problem is that the "Category" metadata field from the PDF is for some reason not prefixed with "metadata_", so Solr merges the "Category" field I have in the schema with the Category metadata from the PDF, and I get an error like:

"multiple values encountered for non multiValued field Category"

I fixed this by patching tika-parsers.jar so that it ignores the Category metadata in

org.apache.tika.parser.pdf.PDFParser

but this is not a good solution (I don't need that Category metadata, so it works for me).
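
For reference, the handler setup described above can be sketched roughly as the solrconfig.xml fragment below. Only the uprefix line is taken from this post; the handler name, the class string and the fmap.content mapping into the "text" field are assumptions based on the stock example configuration and may differ from the real config.

```xml
<!-- Sketch only: everything except the uprefix line is assumed -->
<requestHandler name="/update/extract" class="solr.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <!-- assumed: send the extracted body text to the schema field "text" -->
    <str name="fmap.content">text</str>
    <!-- from this post: prefix intended for the PDF metadata fields -->
    <str name="uprefix">metadata_</str>
  </lst>
</requestHandler>
```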


So let me know whether this should be reported as a bug or not.

Regards,
Andras.
