[ 
https://issues.apache.org/jira/browse/TIKA-1252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13919871#comment-13919871
 ] 

Tim Allison commented on TIKA-1252:
-----------------------------------

[~alexandre.madur...@gmail.com], oh...it doesn't matter whether it is valid or 
not if that is what Adobe is generating. :)  You may want to ask about this on 
the PDFBox users list and/or open an issue over there.  If you do open an 
issue, please link it to this one.

So, yes, on the Tika side, we'll have to code against both rdf:Seq and rdf:Bag.  
My proposal is to check the XMP for dc:creator and, if nothing exists there, 
fall back to PDDocumentInformation's getAuthor().  The limitation of this 
proposal is that there may be author information elsewhere in the PDF that we'd 
ignore whenever the XMP contains a creator.  Are we OK with this, or do we want 
potentially duplicate information (both getAuthor() and whatever we get from 
XMP)?
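
To make the proposal concrete, here is a rough sketch of the "XMP first, then 
getAuthor()" fallback. It is not Tika's or PDFBox's actual code: the class 
CreatorFallback and both method names are hypothetical, the XMP packet is 
assumed to be available as a String, and infoAuthor stands in for whatever 
PDDocumentInformation's getAuthor() returns. It uses only the JDK's DOM parser 
and collects every rdf:li under dc:creator, so it does not care whether the 
container is rdf:Seq or rdf:Bag.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, for illustration only.
final class CreatorFallback {
    static final String DC = "http://purl.org/dc/elements/1.1/";
    static final String RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";

    // Returns every non-empty rdf:li found under any dc:creator element,
    // regardless of the container type (rdf:Seq, rdf:Bag, ...).
    static List<String> creatorsFromXmp(String xmp) {
        List<String> out = new ArrayList<>();
        if (xmp == null || xmp.isEmpty()) {
            return out;
        }
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xmp)));
            NodeList creators = doc.getElementsByTagNameNS(DC, "creator");
            for (int i = 0; i < creators.getLength(); i++) {
                NodeList items = ((Element) creators.item(i))
                        .getElementsByTagNameNS(RDF, "li");
                for (int j = 0; j < items.getLength(); j++) {
                    String value = items.item(j).getTextContent().trim();
                    if (!value.isEmpty()) {
                        out.add(value);
                    }
                }
            }
        } catch (Exception e) {
            // Assumption: an unparseable packet is treated as "no creator".
            out.clear();
        }
        return out;
    }

    // XMP creators win; infoAuthor is used only when XMP yields nothing.
    static List<String> resolveAuthors(String xmp, String infoAuthor) {
        List<String> fromXmp = creatorsFromXmp(xmp);
        if (!fromXmp.isEmpty()) {
            return fromXmp;
        }
        if (infoAuthor == null || infoAuthor.trim().isEmpty()) {
            return Collections.emptyList();
        }
        return Collections.singletonList(infoAuthor.trim());
    }
}
```

With the XMP from the issue description, resolveAuthors(xmp, anything) would 
return both "Author 1" and "Author 2"; with no usable XMP it would return only 
the getAuthor() value.  This is exactly the limitation mentioned above: once 
the XMP has a creator, the getAuthor() value is never consulted.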

> Tika is not indexing all authors of a PDF
> -----------------------------------------
>
>                 Key: TIKA-1252
>                 URL: https://issues.apache.org/jira/browse/TIKA-1252
>             Project: Tika
>          Issue Type: Bug
>          Components: metadata, parser
>    Affects Versions: 1.4
>         Environment: Ubuntu 12.04 (x64) Solr 4.6.0 (Amazon Web Services, 
> Bitnami Stack)
>            Reporter: Alexandre Madurell
>         Attachments: Sample (Acrobat 4.x).pdf, Sample (Acrobat 5.x).pdf, 
> Sample-One-Author.pdf, Sample-Two-Authors.pdf, Sample.pdf, Sample.xmp, 
> XMP-Import-with-Seq.jpg
>
>
> When submitting a PDF with this information in its XMP metadata:
> ...
>       <dc:creator>
>         <rdf:Bag>
>           <rdf:li>Author 1</rdf:li>
>           <rdf:li>Author 2</rdf:li>
>         </rdf:Bag>
>       </dc:creator>
> ...
> Only the first one appears in the collection:
> ...
>         "author":["Author 1"],
>         "author_s":"Author 1",
> ...
> In spite of having set the field to multiValued in the Solr schema:
> <field name="author" type="text_general" indexed="true" stored="true" 
> multiValued="true"/>
> Let me know if there's any further specific information I could provide.
> Thanks in advance! 



