[ 
https://issues.apache.org/jira/browse/TIKA-1252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13922321#comment-13922321
 ] 

Alexandre Madurell commented on TIKA-1252:
------------------------------------------

Hi again, [~talli...@apache.org],

I was checking the specs doc again, and I read on page 17 the difference 
between Bag and Seq. Beats me why Adobe would choose an unordered array over an 
ordered array for the Author field in Acrobat's document properties form. In 
any case, as you mentioned, that makes it necessary to check both containers 
(Bag and Seq) before falling back to PDDocumentInformation's getAuthor().
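
To make that lookup order concrete, here is a minimal sketch in Python (standard library only). It is not Tika's or PDFBox's code: the function name creators_from_xmp and the fallback_author parameter are made up for illustration, with fallback_author standing in for whatever PDDocumentInformation's getAuthor() would return.

```python
import xml.etree.ElementTree as ET

NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def creators_from_xmp(xmp_xml, fallback_author=None):
    """Collect every dc:creator entry, whether wrapped in rdf:Seq or rdf:Bag.

    fallback_author is a hypothetical stand-in for the single author string
    stored in the PDF's document information dictionary.
    """
    root = ET.fromstring(xmp_xml)
    names = []
    for creator in root.iter("{%s}creator" % NS["dc"]):
        # Check the ordered container first, then the unordered one.
        for container in ("rdf:Seq", "rdf:Bag"):
            for li in creator.findall(container + "/rdf:li", NS):
                text = (li.text or "").strip()
                if text:
                    names.append(text)
    if names:
        return names
    # Nothing usable in the XMP packet: fall back to the single author string.
    return [fallback_author] if fallback_author else []


SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="">
    <dc:creator>
      <rdf:Bag>
        <rdf:li>Author 1</rdf:li>
        <rdf:li>Author 2</rdf:li>
      </rdf:Bag>
    </dc:creator>
  </rdf:Description>
</rdf:RDF>"""

print(creators_from_xmp(SAMPLE, fallback_author="Author 1"))
```

With the sample from this issue, both authors come back regardless of which wrapper is used, and the fallback only kicks in when the XMP has no dc:creator entries at all.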

I've just checked Acrobat XI, and it still exports its XMP with a Bag wrapper 
instead of a Seq one. I'll open a ticket on Adobe's bugbase.

In the meantime, I modified the XSLT file I was using to automate the metadata 
insertion so that it emits <rdf:Seq> instead of <rdf:Bag>, and I will re-process 
the entire collection. (I will probably add PDFBox to the next implementation of 
our automated metadata insertion workflow. Thanks again for the tip!)
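
For anyone hitting the same problem, the Bag-to-Seq rewrite can be sketched roughly as follows. This is a Python stand-in rather than the XSLT itself, and the function name bag_to_seq is made up for illustration. It assumes the XMP packet is available as a standalone XML string and only touches the container directly under dc:creator.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC = "http://purl.org/dc/elements/1.1/"


def bag_to_seq(xmp_xml):
    """Rename rdf:Bag to rdf:Seq under dc:creator, keeping the entries in order."""
    # Keep the conventional prefixes when the tree is serialized again.
    ET.register_namespace("rdf", RDF)
    ET.register_namespace("dc", DC)
    root = ET.fromstring(xmp_xml)
    for creator in root.iter("{%s}creator" % DC):
        for bag in creator.findall("{%s}Bag" % RDF):
            bag.tag = "{%s}Seq" % RDF
    return ET.tostring(root, encoding="unicode")


SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="">
    <dc:creator>
      <rdf:Bag>
        <rdf:li>Author 1</rdf:li>
        <rdf:li>Author 2</rdf:li>
      </rdf:Bag>
    </dc:creator>
  </rdf:Description>
</rdf:RDF>"""

print(bag_to_seq(SAMPLE))
```

The rdf:li children are left untouched, so the author order in the original Bag carries over to the Seq.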

Have a great one!

> Tika is not indexing all authors of a PDF
> -----------------------------------------
>
>                 Key: TIKA-1252
>                 URL: https://issues.apache.org/jira/browse/TIKA-1252
>             Project: Tika
>          Issue Type: Bug
>          Components: metadata, parser
>    Affects Versions: 1.4
>         Environment: Ubuntu 12.04 (x64) Solr 4.6.0 (Amazon Web Services, 
> Bitnami Stack)
>            Reporter: Alexandre Madurell
>            Assignee: Tim Allison
>         Attachments: Sample (Acrobat 4.x).pdf, Sample (Acrobat 5.x).pdf, 
> Sample-One-Author.pdf, Sample-Two-Authors.pdf, Sample.pdf, Sample.xmp, 
> XMP-Import-with-Seq.jpg
>
>
> When submitting a PDF with this information in its XMP metadata:
> ...
>       <dc:creator>
>         <rdf:Bag>
>           <rdf:li>Author 1</rdf:li>
>           <rdf:li>Author 2</rdf:li>
>         </rdf:Bag>
>       </dc:creator>
> ...
> Only the first one appears in the collection:
> ...
>         "author":["Author 1"],
>         "author_s":"Author 1",
> ...
> In spite of having set the field to multiValued in the Solr schema:
> <field name="author" type="text_general" indexed="true" stored="true" 
> multiValued="true"/>
> Let me know if there's any further specific information I could provide.
> Thanks in advance! 



--
This message was sent by Atlassian JIRA
(v6.2#6252)
