[jira] [Comment Edited] (TIKA-1252) Tika is not indexing all authors of a PDF

2014-03-06 Thread Alexandre Madurell (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13922321#comment-13922321
 ] 

Alexandre Madurell edited comment on TIKA-1252 at 3/6/14 11:10 AM:
---

Hi, [~talli...@apache.org],

I was checking the specs doc again, and on page 17 I read about the difference 
between Bag and Seq. Beats me why Adobe would choose an unordered array over an 
ordered array for the Author field in Acrobat's document properties form. In 
any case, as you mentioned, it makes it necessary to check for both before 
falling back to PDDocumentInformation's getAuthor().
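
To make the "check for both, then fall back" idea concrete, here is a rough 
sketch in Python using only the standard library. It is not Tika's or PDFBox's 
code: the function name extract_creators and the info_author parameter (a 
stand-in for whatever getAuthor() would return) are made up for illustration, 
and it only looks at the rdf:RDF part of the XMP.

```python
import xml.etree.ElementTree as ET

NS = {
    "dc": "http://purl.org/dc/elements/1.1/",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
}

SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="">
    <dc:creator>
      <rdf:Bag>
        <rdf:li>Author 1</rdf:li>
        <rdf:li>Author 2</rdf:li>
      </rdf:Bag>
    </dc:creator>
  </rdf:Description>
</rdf:RDF>"""


def extract_creators(xmp_xml, info_author=None):
    """Collect dc:creator entries from rdf:Seq or rdf:Bag (hypothetical helper).

    info_author is an assumed fallback value, standing in for the single
    author string stored in the PDF's document information dictionary.
    """
    root = ET.fromstring(xmp_xml)
    creators = []
    for creator in root.iter("{%s}creator" % NS["dc"]):
        # Accept either container type; Seq is checked first, then Bag.
        for container in ("rdf:Seq", "rdf:Bag"):
            for li in creator.findall("%s/rdf:li" % container, NS):
                text = (li.text or "").strip()
                if text:
                    creators.append(text)
    # Only fall back when the XMP yielded nothing at all.
    if not creators and info_author:
        creators.append(info_author)
    return creators


print(extract_creators(SAMPLE))  # ['Author 1', 'Author 2']
```

With this shape, the Bag that Acrobat exports and a hand-written Seq both 
yield every author, and the single-valued fallback is only used when the XMP 
has no creator entries.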

I've just checked Acrobat XI, and it still exports its XMP with a Bag wrapper 
instead of a Seq one. I'll open a ticket on Adobe's bugbase.

In the meantime, I modified the XSLT file I was using to automate the metadata 
insertion so it uses the rdf:Seq, and will re-process the entire collection 
(I will probably add PDFBox to the next implementation of our automated 
metadata insertion workflow, thanks again for the tip!).
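
For anyone who wants to do the same Bag-to-Seq rewrite without XSLT, this is a 
small Python sketch of the idea. It is not the XSLT file mentioned above; the 
function name bag_to_seq is hypothetical, and it assumes the input is just the 
rdf:RDF fragment (any xpacket wrapper around it is not preserved).

```python
import xml.etree.ElementTree as ET

DC = "http://purl.org/dc/elements/1.1/"
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

SAMPLE_BAG = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="">
    <dc:creator>
      <rdf:Bag>
        <rdf:li>Author 1</rdf:li>
        <rdf:li>Author 2</rdf:li>
      </rdf:Bag>
    </dc:creator>
  </rdf:Description>
</rdf:RDF>"""


def bag_to_seq(xmp_xml):
    """Rename rdf:Bag to rdf:Seq under dc:creator (hypothetical helper)."""
    # Keep the familiar prefixes in the serialized output.
    ET.register_namespace("dc", DC)
    ET.register_namespace("rdf", RDF)
    root = ET.fromstring(xmp_xml)
    for creator in root.iter("{%s}creator" % DC):
        for child in creator:
            if child.tag == "{%s}Bag" % RDF:
                # Only the wrapper changes; the rdf:li entries are untouched.
                child.tag = "{%s}Seq" % RDF
    return ET.tostring(root, encoding="unicode")


print(bag_to_seq(SAMPLE_BAG))
```

Only the wrapper under dc:creator is touched, so Bags used by other 
properties stay as they are.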

Have a great one!



 Tika is not indexing all authors of a PDF
 -

 Key: TIKA-1252
 URL: https://issues.apache.org/jira/browse/TIKA-1252
 Project: Tika
  Issue Type: Bug
  Components: metadata, parser
Affects Versions: 1.4
 Environment: Ubuntu 12.04 (x64) Solr 4.6.0 (Amazon Web Services, 
 Bitnami Stack)
Reporter: Alexandre Madurell
Assignee: Tim Allison
 Attachments: Sample (Acrobat 4.x).pdf, Sample (Acrobat 5.x).pdf, 
 Sample-One-Author.pdf, Sample-Two-Authors.pdf, Sample.pdf, Sample.xmp, 
 XMP-Import-with-Seq.jpg


 When submitting a PDF with this information in its XMP metadata:
 ...
   <dc:creator>
     <rdf:Bag>
       <rdf:li>Author 1</rdf:li>
       <rdf:li>Author 2</rdf:li>
     </rdf:Bag>
   </dc:creator>
 ...
 Only the first one appears in the collection:
 ...
 author:[Author 1],
 author_s:Author 1,
 ...
 In spite of having set the field to multiValued in the Solr schema:
 <field name="author" type="text_general" indexed="true" stored="true" 
 multiValued="true"/>
 Let me know if there's any further specific information I could provide.
 Thanks in advance! 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (TIKA-1252) Tika is not indexing all authors of a PDF

2014-03-04 Thread Alexandre Madurell (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexandre Madurell updated TIKA-1252:
-

Attachment: Sample.xmp
Sample.pdf

Thanks so much!

Attached is a blank sample PDF with the XMP metadata imported into it (just as 
we do with the full documents).

In the meantime, I'll try modifying the schema and XMP data so we use a custom 
field for the document authors (those who wrote the article, book review, 
letter to the editor, etc.) and leave Acrobat's creator field for the publisher 
(single entry). If that works, we can check if there's any difference in the 
parser's code for custom and non-custom fields. 

Thanks again! I'll get back with the results of the test ASAP.



[jira] [Commented] (TIKA-1252) Tika is not indexing all authors of a PDF

2014-03-04 Thread Alexandre Madurell (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13919858#comment-13919858
 ] 

Alexandre Madurell commented on TIKA-1252:
--

Hello, Tim Allison

I've created a couple of files with a single author (Acrobat 5.x and Acrobat 
4.x), but it is always wrapped in a bag when I export the .xmp:

{code:xml}
<dc:creator>
  <rdf:Bag>
    <rdf:li>Single Author</rdf:li>
  </rdf:Bag>
</dc:creator>
{code}

I'm attaching both anyway.

Also, I've tried importing an XMP which uses {code:xml}<rdf:Seq>{code} instead 
of {code:xml}<rdf:Bag>{code} and Acrobat seems to keep it and display it in its 
properties panel. I'm attaching both PDFs (one author, two authors, with Seq) 
and a screenshot of the properties panel.

This definitely lets me go ahead with indexing our documents.

As to your last comment, {code:xml}<rdf:Bag>{code} is definitely what came out 
of Acrobat X when exporting the XMP from a clean, brand-new PDF (after typing 
the Author in the properties panel), so I guess it is worth checking for both.

I'll also take a good look at PDFBox (I've just checked out the repo's trunk).

P.S. This community is AWESOME!!! (I'm not used to receiving comments faster 
than I can reply to them... -twice!- thrice!) :)



[jira] [Updated] (TIKA-1252) Tika is not indexing all authors of a PDF

2014-03-04 Thread Alexandre Madurell (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexandre Madurell updated TIKA-1252:
-

Attachment: Sample (Acrobat 4.x).pdf
Sample (Acrobat 5.x).pdf
Sample-One-Author.pdf
Sample-Two-Authors.pdf
XMP-Import-with-Seq.jpg



[jira] [Updated] (TIKA-1232) Add PDF version to PDFParser output

2014-03-04 Thread Alexandre Madurell (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexandre Madurell updated TIKA-1232:
-

Attachment: Sample 10.x.pdf
Sample 9.x.pdf
Sample 8.x.pdf
Sample 7.x.pdf
Sample 6.x.pdf
Sample 5.x.pdf
Sample 4.x.pdf

Here they are:
Sample 4.x.pdf (PDF Version 1.3)
Sample 5.x.pdf (PDF Version 1.4)
Sample 6.x.pdf (PDF Version 1.5)
Sample 7.x.pdf (PDF Version 1.6)
Sample 8.x.pdf (PDF Version 1.7)
Sample 9.x.pdf (PDF Version 1.7 Adobe Extension Level 3)
Sample 10.x.pdf (PDF Version 1.7 Adobe Extension Level 8)

Sample 11.x.pdf coming up next

 Add PDF version to PDFParser output
 ---

 Key: TIKA-1232
 URL: https://issues.apache.org/jira/browse/TIKA-1232
 Project: Tika
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.5
 Environment: JDK6
Reporter: William Palmer
Assignee: Tim Allison
Priority: Minor
 Attachments: Sample 10.x.pdf, Sample 4.x.pdf, Sample 5.x.pdf, Sample 
 6.x.pdf, Sample 7.x.pdf, Sample 8.x.pdf, Sample 9.x.pdf, TIKA-1232v1.patch, 
 TIKA-1232v2.patch, pdfversion.patch


 I'd like to identify the PDF version of files. This is not currently reported 
 by the PDFParser, although the information is available via PDFBox.  I have 
 attached a patch that adds the format version to the Metadata object.
 However, I am not familiar enough with the Tika source to know if an 
 alternative metadata key should be used, or this new one added.
 Comments welcome.
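
 As a rough illustration of where the version number lives, the Python sketch 
 below reads it from the %PDF-x.y header at the start of a file. This is not 
 the attached patch and not PDFBox's code; the function name 
 pdf_header_version is made up for the example. It also ignores two things 
 that matter for some of the samples in this issue: a /Version entry in the 
 document catalog can override the header, and the Adobe Extension Level is 
 stored in the catalog rather than in the header.

```python
import re

# The header normally sits at the very start of the file, e.g. b"%PDF-1.4".
HEADER_RE = re.compile(rb"%PDF-(\d+\.\d+)")


def pdf_header_version(data):
    """Return the x.y version from the %PDF- header, or None (hypothetical helper)."""
    # Search only the first kilobyte, since some files carry leading junk.
    match = HEADER_RE.search(data[:1024])
    return match.group(1).decode("ascii") if match else None


print(pdf_header_version(b"%PDF-1.4\n%\xe2\xe3\xcf\xd3\n"))  # 1.4
```

 To try it on a real file, pass the first bytes of the file read in binary 
 mode.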





[jira] [Updated] (TIKA-1232) Add PDF version to PDFParser output

2014-03-04 Thread Alexandre Madurell (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexandre Madurell updated TIKA-1232:
-

Attachment: Sample 11.x PDFA-1b.pdf

I'm having trouble outputting to other PDF/A formats (MarkInfo missing, etc.). 
I'll keep checking as soon as I can. In the meantime, here's a PDF/A-1b. BTW: 
the regular Acrobat XI format is the same as Acrobat X (PDF Version 1.7 Adobe 
Extension Level 8).



[jira] [Commented] (TIKA-1252) Tika is not indexing all authors of a PDF

2014-03-03 Thread Alexandre Madurell (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13918450#comment-13918450
 ] 

Alexandre Madurell commented on TIKA-1252:
--

Hmmm... maybe I need to build a DublinCoreAdapter on top of Tika's Metadata 
class as mentioned here? 
http://lucene.472066.n3.nabble.com/Metadata-use-by-Apache-Java-projects-td645477.html#a645484

Kind of a newbie here... any help is appreciated.



[jira] [Created] (TIKA-1252) Tika is not indexing all authors of a PDF

2014-03-02 Thread Alexandre Madurell (JIRA)
Alexandre Madurell created TIKA-1252:


 Summary: Tika is not indexing all authors of a PDF
 Key: TIKA-1252
 URL: https://issues.apache.org/jira/browse/TIKA-1252
 Project: Tika
  Issue Type: Bug
  Components: metadata, parser
Affects Versions: 1.4
 Environment: Ubuntu 12.04 (x64) Solr 4.6.0 (Amazon Web Services, 
Bitnami Stack)
Reporter: Alexandre Madurell


When submitting a PDF with this information in its XMP metadata:
...
  <dc:creator>
    <rdf:Bag>
      <rdf:li>Author 1</rdf:li>
      <rdf:li>Author 2</rdf:li>
    </rdf:Bag>
  </dc:creator>
...
Only the first one appears in the collection:
...
author:[Author 1],
author_s:Author 1,
...

In spite of having set the field to multiValued in the Solr schema:

<field name="author" type="text_general" indexed="true" stored="true" 
multiValued="true"/>

Let me know if there's any further specific information I could provide.

Thanks in advance! 


