[
https://issues.apache.org/jira/browse/TIKA-1252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13922321#comment-13922321
]
Alexandre Madurell edited comment on TIKA-1252 at 3/6/14 11:10 AM:
---
Hi, [~talli...@apache.org],
I was checking the spec again, and page 17 describes the difference
between Bag and Seq. Beats me why Adobe would choose an unordered array over an
ordered array for the Author field in Acrobat's document properties form. In
any case, as you mentioned, it means the parser has to check for both before
falling back to PDDocumentInformation's getAuthor().
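A minimal sketch of that lookup order, assuming the XMP packet is available as a string and the document-information author is passed in as a plain fallback value. The class and method names (CreatorExtractor, extractCreators) are hypothetical and use only the JDK's XML parser; this is not Tika's or PDFBox's actual code.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: collects dc:creator entries from either container type.
final class CreatorExtractor {
    static final String DC = "http://purl.org/dc/elements/1.1/";
    static final String RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";

    // fallbackAuthor stands in for the value of PDDocumentInformation's getAuthor().
    static List<String> extractCreators(String xmp, String fallbackAuthor) {
        List<String> creators = new ArrayList<>();
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // required for the NS-based lookups below
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xmp)));
            NodeList creatorNodes = doc.getElementsByTagNameNS(DC, "creator");
            for (int i = 0; i < creatorNodes.getLength(); i++) {
                Element creator = (Element) creatorNodes.item(i);
                // Check the ordered container first, then the unordered one.
                for (String container : new String[] {"Seq", "Bag"}) {
                    NodeList containers = creator.getElementsByTagNameNS(RDF, container);
                    for (int j = 0; j < containers.getLength(); j++) {
                        NodeList items = ((Element) containers.item(j))
                                .getElementsByTagNameNS(RDF, "li");
                        for (int k = 0; k < items.getLength(); k++) {
                            String text = items.item(k).getTextContent().trim();
                            if (!text.isEmpty()) {
                                creators.add(text);
                            }
                        }
                    }
                }
            }
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XMP packet", e);
        }
        // Nothing found in XMP: fall back to the single document-information author.
        if (creators.isEmpty() && fallbackAuthor != null) {
            creators.add(fallbackAuthor);
        }
        return creators;
    }
}
```

With this approach a Bag-wrapped and a Seq-wrapped creator list yield the same result, and the fallback value is only used when the XMP packet carries no creator entries at all.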
I've just checked Acrobat XI, and it still exports its XMP with a Bag wrapper
instead of a Seq one. I'll open a ticket on Adobe's bugbase.
In the meantime, I modified the XSLT file I was using to automate the metadata
insertion so it uses the rdf:Seq, and will re-process the entire collection
(I will probably add PDFBox to the next implementation of our automated
metadata insertion workflow, thanks again for the tip!).
Have a great one!
Tika is not indexing all authors of a PDF
-
Key: TIKA-1252
URL: https://issues.apache.org/jira/browse/TIKA-1252
Project: Tika
Issue Type: Bug
Components: metadata, parser
Affects Versions: 1.4
Environment: Ubuntu 12.04 (x64) Solr 4.6.0 (Amazon Web Services,
Bitnami Stack)
Reporter: Alexandre Madurell
Assignee: Tim Allison
Attachments: Sample (Acrobat 4.x).pdf, Sample (Acrobat 5.x).pdf,
Sample-One-Author.pdf, Sample-Two-Authors.pdf, Sample.pdf, Sample.xmp,
XMP-Import-with-Seq.jpg
When submitting a PDF with this information in its XMP metadata:
...
<dc:creator>
  <rdf:Bag>
    <rdf:li>Author 1</rdf:li>
    <rdf:li>Author 2</rdf:li>
  </rdf:Bag>
</dc:creator>
...
Only the first one appears in the collection:
...
author:[Author 1],
author_s:Author 1,
...
In spite of having set the field to multiValued in the Solr schema:
<field name="author" type="text_general" indexed="true" stored="true"
multiValued="true"/>
Let me know if there's any further specific information I could provide.
Thanks in advance!