[jira] [Comment Edited] (TIKA-1252) Tika is not indexing all authors of a PDF

2014-03-06 Thread Alexandre Madurell (JIRA)

[
https://issues.apache.org/jira/browse/TIKA-1252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13922321#comment-13922321
]

Alexandre Madurell edited comment on TIKA-1252 at 3/6/14 11:10 AM:
---

Hi, [~talli...@apache.org],

I was checking the spec again, and on page 17 I read about the difference
between Bag and Seq. Beats me why Adobe would choose an unordered array over an
ordered array for the Author field in Acrobat's document properties form. In
any case, as you mentioned, it makes it necessary to check both before
falling back to PDDocumentInformation's getAuthor().
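
To make the "check both, then fall back" idea concrete, here is a minimal sketch using only the JDK's DOM parser. It is not Tika's or PDFBox's actual code: the class name XmpCreators, the method creators() and the fallbackAuthor parameter are hypothetical, and fallbackAuthor merely stands in for whatever PDDocumentInformation's getAuthor() would return.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, for illustration only.
class XmpCreators {
    static final String DC = "http://purl.org/dc/elements/1.1/";
    static final String RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";

    // Collects every rdf:li under dc:creator, looking in rdf:Seq first and
    // then in rdf:Bag. If nothing is found, the fallback value (assumed to
    // come from the PDF's document information) is used instead.
    static List<String> creators(String xmp, String fallbackAuthor) {
        List<String> result = new ArrayList<>();
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xmp)));
            NodeList creatorNodes = doc.getElementsByTagNameNS(DC, "creator");
            for (int i = 0; i < creatorNodes.getLength(); i++) {
                Element creator = (Element) creatorNodes.item(i);
                for (String container : new String[] {"Seq", "Bag"}) {
                    NodeList wrappers = creator.getElementsByTagNameNS(RDF, container);
                    for (int j = 0; j < wrappers.getLength(); j++) {
                        NodeList items = ((Element) wrappers.item(j))
                                .getElementsByTagNameNS(RDF, "li");
                        for (int k = 0; k < items.getLength(); k++) {
                            String text = items.item(k).getTextContent().trim();
                            if (!text.isEmpty()) {
                                result.add(text);
                            }
                        }
                    }
                }
            }
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XMP packet", e);
        }
        if (result.isEmpty() && fallbackAuthor != null) {
            result.add(fallbackAuthor);
        }
        return result;
    }
}
```

With a packet like the one in the issue description, this would return both authors whether Acrobat wrapped them in a Bag or a Seq.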

I've just checked Acrobat XI, and it still exports its XMP with a Bag wrapper
instead of a Seq one. I'll open a ticket on Adobe's bugbase.

In the meantime, I modified the XSLT file I was using to automate the metadata
insertion so it uses the rdf:Seq, and will re-process the entire collection
(I will probably add PDFBox to the next implementation of our automated
metadata insertion workflow, thanks again for the tip!).
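
The XSLT itself isn't shown in this thread, so as a rough illustration of the same Bag-to-Seq rewrite, here is a sketch in plain Java DOM instead. The class name CreatorBagToSeq and the method rewrite() are made up for this example; it only touches rdf:Bag elements inside dc:creator and leaves other Bags (e.g. under dc:subject) alone.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, for illustration only (not the XSLT from this thread).
class CreatorBagToSeq {
    static final String DC = "http://purl.org/dc/elements/1.1/";
    static final String RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";

    // Renames every rdf:Bag found inside dc:creator to rdf:Seq and returns
    // the serialized document without an XML declaration.
    static String rewrite(String xmp) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xmp)));

            // Take a snapshot first: DOM NodeLists are live and would change
            // underneath us while elements are being renamed.
            List<Element> bags = new ArrayList<>();
            NodeList creators = doc.getElementsByTagNameNS(DC, "creator");
            for (int i = 0; i < creators.getLength(); i++) {
                NodeList found = ((Element) creators.item(i))
                        .getElementsByTagNameNS(RDF, "Bag");
                for (int j = 0; j < found.getLength(); j++) {
                    bags.add((Element) found.item(j));
                }
            }
            for (Element bag : bags) {
                // Keep whatever prefix the packet already uses for RDF.
                String prefix = bag.getPrefix();
                String qname = (prefix == null || prefix.isEmpty())
                        ? "Seq" : prefix + ":Seq";
                doc.renameNode(bag, RDF, qname);
            }

            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not rewrite XMP packet", e);
        }
    }
}
```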

Have a great one!

was (Author: alexandre.madur...@gmail.com):
Hi again, [~talli...@apache.org],

I've just checked Acrobat XI, and it still exports its XMP with a Bag wrapper
instead of a Seq one. I'll open a ticket on Adobe's bugbase.

Have a great one!

Tika is not indexing all authors of a PDF
-

Key: TIKA-1252
URL: https://issues.apache.org/jira/browse/TIKA-1252
Project: Tika
Issue Type: Bug
Components: metadata, parser
Affects Versions: 1.4
Environment: Ubuntu 12.04 (x64) Solr 4.6.0 (Amazon Web Services,
Bitnami Stack)
Reporter: Alexandre Madurell
Assignee: Tim Allison
Attachments: Sample (Acrobat 4.x).pdf, Sample (Acrobat 5.x).pdf,
Sample-One-Author.pdf, Sample-Two-Authors.pdf, Sample.pdf, Sample.xmp,
XMP-Import-with-Seq.jpg

When submitting a PDF with this information in its XMP metadata:
...
<dc:creator>
  <rdf:Bag>
    <rdf:li>Author 1</rdf:li>
    <rdf:li>Author 2</rdf:li>
  </rdf:Bag>
</dc:creator>
...
Only the first one appears in the collection:
...
author:[Author 1],
author_s:Author 1,
...
In spite of having set the field to multiValued in the Solr schema:
<field name="author" type="text_general" indexed="true" stored="true"
multiValued="true"/>
Let me know if there's any further specific information I could provide.
Thanks in advance!

--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Comment Edited] (TIKA-1252) Tika is not indexing all authors of a PDF

2014-03-03 Thread Uwe Schindler (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-1252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13918643#comment-13918643
 ] 

Uwe Schindler edited comment on TIKA-1252 at 3/3/14 10:17 PM:
--

I did a quick check in 
[https://svn.apache.org/repos/asf/lucene/dev/trunk/solr/contrib/extraction/src/java/org/apache/solr/handler/extraction/SolrContentHandler.java]

Solr does not seem to remove duplicate keys (see {{addMetadata()}} and 
{{addField(String fname, String fval, String[] vals)}}). Furthermore, if the 
field is *not* multivalued, the data is concatenated with whitespace and put 
into *one* field (see line 226 ff).

So this looks like a configuration problem or really a bug in TIKA.
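
The behaviour described above can be pictured with a small sketch. This is not Solr's actual code from SolrContentHandler: the class FieldValueSketch and the method toFieldValues() are hypothetical, and the single-space separator is an assumption standing in for "concatenated with whitespace".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration of the described handling, not Solr source.
class FieldValueSketch {
    // For a multivalued field every incoming value is kept as its own entry;
    // otherwise all values are joined into one string (separator assumed).
    static List<String> toFieldValues(String[] vals, boolean multiValued) {
        if (multiValued) {
            return new ArrayList<>(Arrays.asList(vals));
        }
        return Collections.singletonList(String.join(" ", vals));
    }
}
```

Either way, both authors would show up in the index if both were handed over, which is why a single "Author 1" in a multivalued field points at what arrives from the extraction step rather than at this part of Solr.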


was (Author: thetaphi):
I did a quick check in 
[https://svn.apache.org/repos/asf/lucene/dev/trunk/solr/contrib/extraction/src/java/org/apache/solr/handler/extraction/SolrContentHandler.java]

Solr does not seem to remove duplicate values (see {{addMetadata()}} and 
{{addField(String fname, String fval, String[] vals)}}). Furthermore, if the 
field is *not* multivalued, the data is concatenated with whitespace and put 
into *one* field (see line 226 ff).

So this looks like a configuration problem or really a bug in TIKA.

