[jira] [Comment Edited] (TIKA-1252) Tika is not indexing all authors of a PDF

2014-03-06 Thread Alexandre Madurell (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13922321#comment-13922321
 ] 

Alexandre Madurell edited comment on TIKA-1252 at 3/6/14 11:10 AM:
---

Hi, [~talli...@apache.org],

I was checking the specs doc again, and I read on page 17 the difference 
between Bag and Seq. Beats me why Adobe would choose an unordered array over an 
ordered array for the Author field in Acrobat's document properties form. In 
any case, as you mentioned, it makes it necessary to check on both before 
falling back to PDDocumentInformation's getAuthor().
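
To make the "check both, then fall back" order concrete, here is a minimal
sketch in Python using only the standard library. It is not Tika's or PDFBox's
code: the function name extract_authors and the info_author parameter (standing
in for whatever PDDocumentInformation's getAuthor() would return) are
assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Sample packet body mirroring the structure quoted in the issue description.
SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="">
    <dc:creator>
      <rdf:Bag>
        <rdf:li>Author 1</rdf:li>
        <rdf:li>Author 2</rdf:li>
      </rdf:Bag>
    </dc:creator>
  </rdf:Description>
</rdf:RDF>"""


def extract_authors(xmp_xml, info_author=None):
    """Collect every dc:creator entry, whether wrapped in rdf:Seq or rdf:Bag."""
    root = ET.fromstring(xmp_xml)
    authors = []
    for creator in root.iter("{%s}creator" % NS["dc"]):
        # Look in the ordered container first, then in the unordered one.
        for container in ("rdf:Seq", "rdf:Bag"):
            for li in creator.findall(container + "/rdf:li", NS):
                text = (li.text or "").strip()
                if text:
                    authors.append(text)
    if authors:
        return authors
    # Hypothetical fallback: the single Author string from the PDF's
    # document information dictionary, if the caller supplied one.
    return [info_author] if info_author else []


print(extract_authors(SAMPLE))
```

With this order, the document information value is consulted only when the XMP
packet yields no creators at all.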

I've just checked Acrobat XI, and it still exports its XMP with a Bag wrapper 
instead of a Seq one. I'll open a ticket on Adobe's bugbase.

In the meantime, I modified the XSLT file I was using to automate the metadata 
insertion so it uses the rdf:Seq, and will re-process the entire collection 
(I will probably add PDFBox to the next implementation of our automated 
metadata insertion workflow, thanks again for the tip!).
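
For anyone without an XSLT step in their workflow, the same Bag-to-Seq rewrite
can be sketched in Python with the standard library. This is not the XSLT file
mentioned above; the function name creators_bag_to_seq and the sample input are
assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC = "http://purl.org/dc/elements/1.1/"

SAMPLE_BAG = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="">
    <dc:creator>
      <rdf:Bag>
        <rdf:li>Author 1</rdf:li>
        <rdf:li>Author 2</rdf:li>
      </rdf:Bag>
    </dc:creator>
  </rdf:Description>
</rdf:RDF>"""


def creators_bag_to_seq(xmp_xml):
    """Rename rdf:Bag to rdf:Seq directly under dc:creator; keep item order."""
    # Keep the usual prefixes when the tree is serialized again.
    ET.register_namespace("rdf", RDF)
    ET.register_namespace("dc", DC)
    root = ET.fromstring(xmp_xml)
    for creator in root.iter("{%s}creator" % DC):
        for child in creator:
            if child.tag == "{%s}Bag" % RDF:
                child.tag = "{%s}Seq" % RDF
    return ET.tostring(root, encoding="unicode")


print(creators_bag_to_seq(SAMPLE_BAG))
```

Only containers that sit directly under dc:creator are touched, so any other
rdf:Bag in the packet (dc:subject, for instance) is left as it was.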

Have a great one!


was (Author: alexandre.madur...@gmail.com):
Hi again, [~talli...@apache.org],

I was checking the specs doc again, and I read on page 17 the difference 
between Bag and Seq. Beats me why Adobe would choose an unordered array over an 
ordered array for the Author field in Acrobat's document properties form. In 
any case, as you mentioned, it makes it necessary to check on both before 
falling back to PDDocumentInformation's getAuthor().

I've just checked Acrobat XI, and it still exports its XMP with a Bag wrapper 
instead of a Seq one. I'll open a ticket on Adobe's bugbase.

In the meantime, I modified the XSLT file I was using to automate the metadata 
insertion so it uses the rdf:Seq, and will re-process the entire collection 
(I will probably add PDFBox to the next implementation of our automated 
metadata insertion workflow, thanks again for the tip!).

Have a great one!

 Tika is not indexing all authors of a PDF
 -----------------------------------------

                 Key: TIKA-1252
                 URL: https://issues.apache.org/jira/browse/TIKA-1252
             Project: Tika
          Issue Type: Bug
          Components: metadata, parser
    Affects Versions: 1.4
         Environment: Ubuntu 12.04 (x64) Solr 4.6.0 (Amazon Web Services, Bitnami Stack)
            Reporter: Alexandre Madurell
            Assignee: Tim Allison
         Attachments: Sample (Acrobat 4.x).pdf, Sample (Acrobat 5.x).pdf, Sample-One-Author.pdf, Sample-Two-Authors.pdf, Sample.pdf, Sample.xmp, XMP-Import-with-Seq.jpg


 When submitting a PDF with this information in its XMP metadata:
 ...
 <dc:creator>
   <rdf:Bag>
     <rdf:li>Author 1</rdf:li>
     <rdf:li>Author 2</rdf:li>
   </rdf:Bag>
 </dc:creator>
 ...
 Only the first one appears in the collection:
 ...
 "author":["Author 1"],
 "author_s":"Author 1",
 ...
 In spite of having set the field to multiValued in the Solr schema:
 <field name="author" type="text_general" indexed="true" stored="true" multiValued="true"/>
 Let me know if there's any further specific information I could provide.
 Thanks in advance! 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Comment Edited] (TIKA-1252) Tika is not indexing all authors of a PDF

2014-03-03 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13918643#comment-13918643
 ] 

Uwe Schindler edited comment on TIKA-1252 at 3/3/14 10:17 PM:
--

I did a quick check in 
[https://svn.apache.org/repos/asf/lucene/dev/trunk/solr/contrib/extraction/src/java/org/apache/solr/handler/extraction/SolrContentHandler.java]

Solr does not seem to remove duplicate keys (see {{addMetadata()}} and 
{{addField(String fname, String fval, String[] vals)}}). Furthermore, if the 
field is *not* multivalued, the data is concatenated with whitespace and put 
into *one* field (see line 226 ff).
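
As a rough illustration of the behaviour described above, here is a toy model
in Python. It is based only on this description, not on the Java source of
SolrContentHandler; the function name add_field and the dict-based document are
assumptions made for illustration.

```python
def add_field(doc, name, values, multivalued):
    """Toy model: keep every value, or join them when the field is single-valued."""
    if multivalued:
        # Repeated metadata keys are kept: every value lands in the list.
        doc.setdefault(name, []).extend(values)
    else:
        # Single-valued field: concatenate with whitespace into one string.
        joined = " ".join(values)
        existing = doc.get(name)
        doc[name] = joined if existing is None else existing + " " + joined
    return doc


print(add_field({}, "author", ["Author 1", "Author 2"], multivalued=True))
print(add_field({}, "author", ["Author 1", "Author 2"], multivalued=False))
```

Under this model both authors survive in either configuration, which is why a
result containing only "Author 1" points at what was handed to Solr rather than
at the handler itself.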

So this looks like a configuration problem or really a bug in TIKA.


was (Author: thetaphi):
I did a quick check in 
[https://svn.apache.org/repos/asf/lucene/dev/trunk/solr/contrib/extraction/src/java/org/apache/solr/handler/extraction/SolrContentHandler.java]

Solr does not seem to remove duplicate values (see {{addMetadata()}} and 
{{addField(String fname, String fval, String[] vals)}}). Furthermore, if the 
field is *not* multivalued, the data is concatenated with whitespace and put 
into *one* field (see line 226 ff).

So this looks like a configuration problem or really a bug in TIKA.

 Tika is not indexing all authors of a PDF
 -----------------------------------------

                 Key: TIKA-1252
                 URL: https://issues.apache.org/jira/browse/TIKA-1252
             Project: Tika
          Issue Type: Bug
          Components: metadata, parser
    Affects Versions: 1.4
         Environment: Ubuntu 12.04 (x64) Solr 4.6.0 (Amazon Web Services, Bitnami Stack)
            Reporter: Alexandre Madurell

 When submitting a PDF with this information in its XMP metadata:
 ...
 <dc:creator>
   <rdf:Bag>
     <rdf:li>Author 1</rdf:li>
     <rdf:li>Author 2</rdf:li>
   </rdf:Bag>
 </dc:creator>
 ...
 Only the first one appears in the collection:
 ...
 "author":["Author 1"],
 "author_s":"Author 1",
 ...
 In spite of having set the field to multiValued in the Solr schema:
 <field name="author" type="text_general" indexed="true" stored="true" multiValued="true"/>
 Let me know if there's any further specific information I could provide.
 Thanks in advance! 


