Hi,

I think this is a bug, but before reporting it to the issue tracker I thought I would ask here first. I have a PDF file which, among other metadata fields such as Author and CreatedDate, has a metadata field named Category (I can see all the metadata fields with tika-app.jar started in GUI mode). My Solr schema also has a "Category" field among its other fields, plus a field called "text" that holds the text extracted from the PDF.
I added <str name="uprefix">metadata_</str>, expecting all PDF metadata fields to be stored in Solr as "metadata_something" fields. However, the "Category" metadata field from the PDF is for some reason not prefixed with "metadata_". Solr merges it into the "Category" field I have in the schema, and I get an error like:
"multiple values encountered for non multiValued field Category"
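
To illustrate the setup described above, here is a minimal sketch of such an extraction handler configuration. Only the uprefix line is taken from my setup; the handler name and the fmap.content mapping are assumptions added to make the fragment complete, not a copy of my solrconfig.xml.

```xml
<!-- solrconfig.xml (sketch): handler name and fmap.content are assumptions -->
<requestHandler name="/update/extract"
                class="solr.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <!-- assumed: send the extracted body text to the "text" field -->
    <str name="fmap.content">text</str>
    <!-- from my setup: prefix I expected on every PDF metadata field -->
    <str name="uprefix">metadata_</str>
  </lst>
</requestHandler>
```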
I worked around this by patching tika-parsers.jar so that org.apache.tika.parser.pdf.PDFParser ignores the Category metadata, but this is not a good solution (I don't need that Category metadata, so it works for me).

Please let me know whether this should be reported as a bug.

Regards,
Andras.





