[ https://issues.apache.org/jira/browse/TIKA-1252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13919763#comment-13919763 ]

Tim Allison commented on TIKA-1252:
-----------------------------------

This grabs both authors:

{noformat}
// Read the XMP packet instead of the single-valued document info dictionary.
org.apache.jempbox.xmp.XMPMetadata xmp =
        document.getDocumentCatalog().getMetadata().exportXMPMetadata();
XMPSchemaDublinCore pdfdc = xmp.getDublinCoreSchema();
// dc:creator is stored as an rdf:Bag, so it can hold several authors.
List<String> creators = pdfdc.getBagList("dc:creator");
if (creators != null) {
    for (String c : creators) {
        addMetadata(metadata, TikaCoreProperties.CREATOR, c);
    }
}
{noformat}

The full output with this small mod is:
{noformat}
Period : October-December
dc:subject : 
meta:save-date : 2014-03-04T08:08:31Z
Type : Article
subject : 
Author : Sample Author 1
Author : Sample Author 2
dcterms:created : 2014-03-04T08:05:57Z
date : 2014-03-04T08:08:31Z
Month : 1999-10
creator : Sample Author 1
creator : Sample Author 2
Creation-Date : 2014-03-04T08:05:57Z
title : Sample Title
meta:author : Sample Author 1
meta:author : Sample Author 2
created : Tue Mar 04 03:05:57 EST 2014
Page : 99
meta:keyword : 
dc:format : PDF Version 1.6
Sequence : 2
xmp:CreatorTool : Adobe Acrobat 10.0
Keywords : 
dc:title : Sample Title
Last-Save-Date : 2014-03-04T08:08:31Z
CitationName : AUTHOR 1, Sample; AUTHOR 2, Sample
meta:creation-date : 2014-03-04T08:05:57Z
dcterms:modified : 2014-03-04T08:08:31Z
Volume : 1
Number : 9
dc:creator : Sample Author 1
dc:creator : Sample Author 2
pdf:PDFVersion : 1.6
Last-Modified : 2014-03-04T08:08:31Z
Related : 0
Citation : AUTHOR 1, Sample; AUTHOR 2, Sample (1999). <em>Sample Title</em>, Journal of Sample Organization, ORG - Sample Organization, October-December, Vol. 1, No. 9, p.99
modified : 2014-03-04T08:08:31Z
xmpTPg:NPages : 1
pdf:encrypted : false
Edition : ENES
producer : Acrobat Web Capture 10.0
Language : EN
Content-Type : application/pdf
{noformat}


The current Tika code relies on getAuthor() from PDFBox's PDDocumentInformation,
which returns a single value.
{noformat}
PDDocumentInformation info = document.getDocumentInformation();
...
addMetadata(metadata, TikaCoreProperties.CREATOR, info.getAuthor());
{noformat}

[~alexandre.madur...@gmail.com], would you be able to attach another PDF in which
you've inserted only one author (not in a bag)? I'd like to add both PDF files
to a unit test for this issue's fix.

My proposal is to try to get the creator bagList, and if that returns null, 
fall back to our existing code to get the author.  Sound good?
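
To make the proposal concrete, here is a minimal sketch of the fallback logic on its own, without the PDFBox calls. The class and method names (CreatorFallback, resolveCreators) are hypothetical and not part of Tika; xmpCreators stands in for the result of getBagList("dc:creator") and infoAuthor for info.getAuthor().

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not Tika code: illustrates the proposed fallback only.
final class CreatorFallback {

    private CreatorFallback() {
    }

    /**
     * Prefers the XMP dc:creator bag; falls back to the single author
     * from the document info dictionary only when the bag is null.
     */
    static List<String> resolveCreators(List<String> xmpCreators, String infoAuthor) {
        if (xmpCreators != null) {
            // As the proposal is worded, a non-null bag wins even if it is empty.
            return new ArrayList<>(xmpCreators);
        }
        if (infoAuthor != null) {
            return Collections.singletonList(infoAuthor);
        }
        return Collections.emptyList();
    }
}
```

In the parser, each value returned would then be passed to addMetadata(metadata, TikaCoreProperties.CREATOR, ...). Whether an empty bag should also trigger the fallback is left open here.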

> Tika is not indexing all authors of a PDF
> -----------------------------------------
>
>                 Key: TIKA-1252
>                 URL: https://issues.apache.org/jira/browse/TIKA-1252
>             Project: Tika
>          Issue Type: Bug
>          Components: metadata, parser
>    Affects Versions: 1.4
>         Environment: Ubuntu 12.04 (x64) Solr 4.6.0 (Amazon Web Services, 
> Bitnami Stack)
>            Reporter: Alexandre Madurell
>         Attachments: Sample.pdf, Sample.xmp
>
>
> When submitting a PDF with this information in its XMP metadata:
> ...
>       <dc:creator>
>         <rdf:Bag>
>           <rdf:li>Author 1</rdf:li>
>           <rdf:li>Author 2</rdf:li>
>         </rdf:Bag>
>       </dc:creator>
> ...
> Only the first one appears in the collection:
> ...
>         "author":["Author 1"],
>         "author_s":"Author 1",
> ...
> In spite of having set the field to multiValued in the Solr schema:
> <field name="author" type="text_general" indexed="true" stored="true" 
> multiValued="true"/>
> Let me know if there's any further specific information I could provide.
> Thanks in advance! 



--
This message was sent by Atlassian JIRA
(v6.2#6252)