[ https://issues.apache.org/jira/browse/TIKA-1252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13922321#comment-13922321 ]
Alexandre Madurell commented on TIKA-1252:
------------------------------------------

Hi again, [~talli...@apache.org],

I was checking the specs doc again, and I read on page 17 the difference between Bag and Seq. Beats me why Adobe would choose an unordered array over an ordered array for the Author field in Acrobat's document properties form. In any case, as you mentioned, it makes it necessary to check on both before falling back to PDDocumentInformation's getAuthor().

I've just checked Acrobat XI, and it still exports its XMP with a Bag wrapper instead of a Seq one. I'll open a ticket on Adobe's bugbase.

In the meantime, I modified the XSLT file I was using to automate the metadata insertion so it uses the <rdf:Seq>, and will re-process the entire collection (I will probably add PDFBox to the next implementation of our automated metadata insertion workflow, thanks again for the tip!).

Have a great one!

> Tika is not indexing all authors of a PDF
> -----------------------------------------
>
>                 Key: TIKA-1252
>                 URL: https://issues.apache.org/jira/browse/TIKA-1252
>             Project: Tika
>          Issue Type: Bug
>          Components: metadata, parser
>    Affects Versions: 1.4
>         Environment: Ubuntu 12.04 (x64) Solr 4.6.0 (Amazon Web Services, Bitnami Stack)
>            Reporter: Alexandre Madurell
>            Assignee: Tim Allison
>         Attachments: Sample (Acrobat 4.x).pdf, Sample (Acrobat 5.x).pdf, Sample-One-Author.pdf, Sample-Two-Authors.pdf, Sample.pdf, Sample.xmp, XMP-Import-with-Seq.jpg
>
>
> When submitting a PDF with this information in its XMP metadata:
> ...
>     <dc:creator>
>       <rdf:Bag>
>         <rdf:li>Author 1</rdf:li>
>         <rdf:li>Author 2</rdf:li>
>       </rdf:Bag>
>     </dc:creator>
> ...
> Only the first one appears in the collection:
> ...
>     "author":["Author 1"],
>     "author_s":"Author 1",
> ...
> In spite of having set the field to multiValued in the Solr schema:
>     <field name="author" type="text_general" indexed="true" stored="true" multiValued="true"/>
> Let me know if there's any further specific information I could provide.
> Thanks in advance!



--
This message was sent by Atlassian JIRA
(v6.2#6252)
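
For illustration only, here is a rough sketch of the "check on both before falling back" idea from the comment above. It is not Tika's or PDFBox's actual code: the class name CreatorExtractor, the method extractCreators and the fallbackAuthor parameter are all made up for this example, and fallbackAuthor merely stands in for whatever PDDocumentInformation's getAuthor() would return. It uses only the JDK's DOM parser and assumes the XMP packet is available as a well-formed XML string with the usual rdf and dc namespace declarations.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Tika or PDFBox.
class CreatorExtractor {
    static final String DC = "http://purl.org/dc/elements/1.1/";
    static final String RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";

    // Collects every rdf:li under dc:creator, whether wrapped in rdf:Seq or rdf:Bag.
    // fallbackAuthor is an assumption: it plays the role of the Info dictionary author.
    static List<String> extractCreators(String xmp, String fallbackAuthor) {
        Document doc;
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // required for the *NS lookups below
            doc = factory.newDocumentBuilder().parse(new InputSource(new StringReader(xmp)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XMP", e);
        }

        List<String> authors = new ArrayList<>();
        NodeList creators = doc.getElementsByTagNameNS(DC, "creator");
        for (int i = 0; i < creators.getLength(); i++) {
            Element creator = (Element) creators.item(i);
            // Look for the ordered array first, then the unordered one.
            for (String container : new String[] {"Seq", "Bag"}) {
                NodeList arrays = creator.getElementsByTagNameNS(RDF, container);
                for (int j = 0; j < arrays.getLength(); j++) {
                    NodeList items = ((Element) arrays.item(j)).getElementsByTagNameNS(RDF, "li");
                    for (int k = 0; k < items.getLength(); k++) {
                        String name = items.item(k).getTextContent().trim();
                        if (!name.isEmpty()) {
                            authors.add(name);
                        }
                    }
                }
            }
        }

        // Nothing in the XMP: fall back to the single author string, if there is one.
        if (authors.isEmpty() && fallbackAuthor != null) {
            authors.add(fallbackAuthor);
        }
        return authors;
    }

    public static void main(String[] args) {
        String xmp = "<rdf:RDF xmlns:rdf=\"" + RDF + "\" xmlns:dc=\"" + DC + "\">"
                + "<rdf:Description><dc:creator><rdf:Bag>"
                + "<rdf:li>Author 1</rdf:li><rdf:li>Author 2</rdf:li>"
                + "</rdf:Bag></dc:creator></rdf:Description></rdf:RDF>";
        System.out.println(extractCreators(xmp, "Fallback Author"));
    }
}
```

With a helper along these lines, a file exported by Acrobat with the Bag wrapper and one re-processed with the Seq wrapper would yield the same list of authors, and the single-valued author string would only be used when the XMP carries no dc:creator entries at all.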