[ https://issues.apache.org/jira/browse/TIKA-1252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13919763#comment-13919763 ]

Tim Allison commented on TIKA-1252:
-----------------------------------

This grabs both authors:
{noformat}
org.apache.jempbox.xmp.XMPMetadata xmp = document.getDocumentCatalog().getMetadata().exportXMPMetadata();
XMPSchemaDublinCore pdfdc = xmp.getDublinCoreSchema();
List<String> creators = pdfdc.getBagList("dc:creator");
if (creators != null){
    for (String c : creators){
        addMetadata(metadata, TikaCoreProperties.CREATOR, c);
    }
}
{noformat}

The full output with this small mod is:
{noformat}
Period : October-December
dc:subject : 
meta:save-date : 2014-03-04T08:08:31Z
Type : Article
subject : 
Author : Sample Author 1
Author : Sample Author 2
dcterms:created : 2014-03-04T08:05:57Z
date : 2014-03-04T08:08:31Z
Month : 1999-10
creator : Sample Author 1
creator : Sample Author 2
Creation-Date : 2014-03-04T08:05:57Z
title : Sample Title
meta:author : Sample Author 1
meta:author : Sample Author 2
created : Tue Mar 04 03:05:57 EST 2014
Page : 99
meta:keyword : 
dc:format : PDF Version 1.6
Sequence : 2
xmp:CreatorTool : Adobe Acrobat 10.0
Keywords : 
dc:title : Sample Title
Last-Save-Date : 2014-03-04T08:08:31Z
CitationName : AUTHOR 1, Sample; AUTHOR 2, Sample
meta:creation-date : 2014-03-04T08:05:57Z
dcterms:modified : 2014-03-04T08:08:31Z
Volume : 1
Number : 9
dc:creator : Sample Author 1
dc:creator : Sample Author 2
pdf:PDFVersion : 1.6
Last-Modified : 2014-03-04T08:08:31Z
Related : 0
Citation : AUTHOR 1, Sample; AUTHOR 2, Sample (1999). <em>Sample Title</em>, Journal of Sample Organization, ORG - Sample Organization, October-December, Vol. 1, No. 9, p.99
modified : 2014-03-04T08:08:31Z
xmpTPg:NPages : 1
pdf:encrypted : false
Edition : ENES
producer : Acrobat Web Capture 10.0
Language : EN
Content-Type : application/pdf
{noformat}

The current Tika code relies on PDFBox's PDDocumentInformation's getAuthor(), which returns a single value.
{noformat}
PDDocumentInformation info = document.getDocumentInformation();
...
addMetadata(metadata, TikaCoreProperties.CREATOR, info.getAuthor());
{noformat}

[~alexandre.madur...@gmail.com], would you be able to attach another pdf where you've inserted only one author? (Not bagged) I'd like to add both pdf files to a unit test for this issue's fix.

My proposal is to try to get the creator bagList, and if that returns null, fall back to our existing code to get the author. Sound good?

> Tika is not indexing all authors of a PDF
> -----------------------------------------
>
> Key: TIKA-1252
> URL: https://issues.apache.org/jira/browse/TIKA-1252
> Project: Tika
> Issue Type: Bug
> Components: metadata, parser
> Affects Versions: 1.4
> Environment: Ubuntu 12.04 (x64) Solr 4.6.0 (Amazon Web Services,
> Bitnami Stack)
> Reporter: Alexandre Madurell
> Attachments: Sample.pdf, Sample.xmp
>
>
> When submitting a PDF with this information in its XMP metadata:
> ...
> <dc:creator>
>   <rdf:Bag>
>     <rdf:li>Author 1</rdf:li>
>     <rdf:li>Author 2</rdf:li>
>   </rdf:Bag>
> </dc:creator>
> ...
> Only the first one appears in the collection:
> ...
> "author":["Author 1"],
> "author_s":"Author 1",
> ...
> In spite of having set the field to multiValued in the Solr schema:
> <field name="author" type="text_general" indexed="true" stored="true"
> multiValued="true"/>
> Let me know if there's any further specific information I could provide.
> Thanks in advance!

--
This message was sent by Atlassian JIRA
(v6.2#6252)
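
For reference, the fallback proposed in the comment (use the dc:creator bag when it is present, otherwise use the single author value) might look roughly like the sketch below. It is deliberately independent of PDFBox and JempBox so that it runs on its own. The class name CreatorFallback and the method resolveCreators are hypothetical and are not part of Tika. The two parameters stand in for the result of getBagList("dc:creator") and info.getAuthor() respectively. As in the proposal, only a null bag triggers the fallback; whether an empty bag should also fall back is left open here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not Tika code: illustrates the proposed fallback only.
public class CreatorFallback {

    /**
     * @param baggedCreators stand-in for the dc:creator bag list; may be null
     * @param infoAuthor     stand-in for the single author value; may be null
     * @return the creators to record as metadata, never null
     */
    static List<String> resolveCreators(List<String> baggedCreators, String infoAuthor) {
        if (baggedCreators != null) {
            // Bag present: keep every entry, in order.
            return new ArrayList<>(baggedCreators);
        }
        if (infoAuthor != null) {
            // No bag: fall back to the single-valued author.
            return Collections.singletonList(infoAuthor);
        }
        // Neither source has an author.
        return Collections.emptyList();
    }

    public static void main(String[] args) {
        // Bagged case, as in the attached sample PDF.
        System.out.println(resolveCreators(
                Arrays.asList("Sample Author 1", "Sample Author 2"), "Sample Author 1"));
        // Non-bagged case: only the single author value is available.
        System.out.println(resolveCreators(null, "Sample Author 1"));
    }
}
```

In the parser itself, each returned value would then be passed to addMetadata(metadata, TikaCoreProperties.CREATOR, c) as in the loop shown in the comment.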