[jira] [Comment Edited] (TIKA-1252) Tika is not indexing all authors of a PDF
[ https://issues.apache.org/jira/browse/TIKA-1252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13922321#comment-13922321 ]

Alexandre Madurell edited comment on TIKA-1252 at 3/6/14 11:10 AM:
-------------------------------------------------------------------

Hi, [~talli...@apache.org],

I was checking the specs doc again, and on page 17 I read about the difference between Bag and Seq. Beats me why Adobe would choose an unordered array over an ordered array for the Author field in Acrobat's document properties form. In any case, as you mentioned, it makes it necessary to check for both before falling back to PDDocumentInformation's getAuthor().

I've just checked Acrobat XI, and it still exports its XMP with a Bag wrapper instead of a Seq one. I'll open a ticket on Adobe's bugbase.

In the meantime, I modified the XSLT file I was using to automate the metadata insertion so that it uses rdf:Seq, and I will re-process the entire collection (I will probably add PDFBox to the next implementation of our automated metadata insertion workflow, thanks again for the tip!).

Have a great one!
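The lookup order described in this comment (rdf:Seq, then rdf:Bag, then the document information dictionary) can be sketched roughly as follows. This is a minimal Python illustration using only the standard library, not Tika's or PDFBox's actual code; the function name xmp_creators, the info_author parameter and the SAMPLE packet are hypothetical and exist only for this example.

```python
import xml.etree.ElementTree as ET

NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Hypothetical XMP packet modelled on the Bag-wrapped export described in this issue.
SAMPLE = """<x:xmpmeta xmlns:x="adobe:ns:meta/">
  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
    <rdf:Description rdf:about="" xmlns:dc="http://purl.org/dc/elements/1.1/">
      <dc:creator>
        <rdf:Bag>
          <rdf:li>Author 1</rdf:li>
          <rdf:li>Author 2</rdf:li>
        </rdf:Bag>
      </dc:creator>
    </rdf:Description>
  </rdf:RDF>
</x:xmpmeta>"""


def xmp_creators(xmp_xml, info_author=None):
    """Collect every dc:creator entry, whichever RDF container wraps it.

    info_author stands in for the single author string of the PDF
    information dictionary and is used only when XMP yields nothing.
    """
    root = ET.fromstring(xmp_xml)
    creators = []
    for creator in root.iter("{%s}creator" % NS["dc"]):
        # Look in the ordered container first, then the unordered one.
        for container in ("Seq", "Bag"):
            for li in creator.findall("rdf:%s/rdf:li" % container, NS):
                text = (li.text or "").strip()
                if text:
                    creators.append(text)
    if creators:
        return creators
    return [info_author] if info_author else []


print(xmp_creators(SAMPLE))
```

Checking both containers costs one extra lookup per dc:creator element, and the fallback is only reached when neither container yields a value.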
Tika is not indexing all authors of a PDF
-----------------------------------------

Key: TIKA-1252
URL: https://issues.apache.org/jira/browse/TIKA-1252
Project: Tika
Issue Type: Bug
Components: metadata, parser
Affects Versions: 1.4
Environment: Ubuntu 12.04 (x64)
             Solr 4.6.0 (Amazon Web Services, Bitnami Stack)
Reporter: Alexandre Madurell
Assignee: Tim Allison
Attachments: Sample (Acrobat 4.x).pdf, Sample (Acrobat 5.x).pdf, Sample-One-Author.pdf, Sample-Two-Authors.pdf, Sample.pdf, Sample.xmp, XMP-Import-with-Seq.jpg

When submitting a PDF with this information in its XMP metadata:

...
<dc:creator>
  <rdf:Bag>
    <rdf:li>Author 1</rdf:li>
    <rdf:li>Author 2</rdf:li>
  </rdf:Bag>
</dc:creator>
...

Only the first one appears in the collection:

...
author:[Author 1],
author_s:Author 1,
...

In spite of having set the field to multiValued in the Solr schema:

<field name="author" type="text_general" indexed="true" stored="true" multiValued="true"/>

Let me know if there's any further specific information I could provide.

Thanks in advance!

--
This message was sent by Atlassian JIRA
(v6.2#6252)
[jira] [Updated] (TIKA-1252) Tika is not indexing all authors of a PDF
[ https://issues.apache.org/jira/browse/TIKA-1252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alexandre Madurell updated TIKA-1252:
------------------------------------

Attachment: Sample.xmp
            Sample.pdf

Thanks so much!

Here is a blank sample PDF with the XMP metadata imported into it (just like we do with the full documents).

In the meantime, I'll try modifying the schema and XMP data so that we use a custom field for the document authors (those who wrote the article, book review, letter to the editor, etc.) and leave Acrobat's creator field for the publisher (single entry). If that works, we can check whether there's any difference in the parser's code for custom and non-custom fields.

Thanks again! I'll get back with the results of the test ASAP.
[jira] [Commented] (TIKA-1252) Tika is not indexing all authors of a PDF
[ https://issues.apache.org/jira/browse/TIKA-1252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13919858#comment-13919858 ]

Alexandre Madurell commented on TIKA-1252:
------------------------------------------

Hello, Tim Allison,

I've created a couple of files with a single author (Acrobat 5.x and Acrobat 4.x), but the author is always wrapped in a Bag when I export the .xmp:

{code:xml}
<dc:creator>
  <rdf:Bag>
    <rdf:li>Single Author</rdf:li>
  </rdf:Bag>
</dc:creator>
{code}

I'm attaching both, anyway.

Also, I've tried importing an XMP which uses {code:xml}rdf:Seq{code} instead of {code:xml}rdf:Bag{code}, and Acrobat seems to keep it and display it in its properties panel. I'm attaching both PDFs (one author, two authors, with Seq) and a screenshot of the properties panel. This definitely lets me go ahead with indexing our documents.

As to your last comment, {code:xml}rdf:Bag{code} is definitely what came out of Acrobat X when exporting the XMP from a clean, brand-new PDF (after typing the Author in the properties panel), so I guess it is worth checking for both.

I'll also take a good look at PDFBox (I've just checked out the repo's trunk).

P.S. This community is AWESOME!!! (I'm not used to receiving comments faster than I can reply to them... -twice!- thrice!) :)
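For readers who want to try the same Bag-to-Seq change without XSLT, here is a rough Python stand-in for it. It is not the XSLT file mentioned in this thread (which was not posted); the function name creators_bag_to_seq and the BAG_XMP sample are hypothetical, and only the standard library is used.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC = "http://purl.org/dc/elements/1.1/"

# Hypothetical input modelled on the Bag-wrapped export shown in this issue.
BAG_XMP = """<x:xmpmeta xmlns:x="adobe:ns:meta/">
  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
    <rdf:Description rdf:about="" xmlns:dc="http://purl.org/dc/elements/1.1/">
      <dc:creator>
        <rdf:Bag>
          <rdf:li>Author 1</rdf:li>
          <rdf:li>Author 2</rdf:li>
        </rdf:Bag>
      </dc:creator>
    </rdf:Description>
  </rdf:RDF>
</x:xmpmeta>"""


def creators_bag_to_seq(xmp_xml):
    """Rename rdf:Bag to rdf:Seq, but only directly under dc:creator."""
    # Keep the conventional prefixes when the tree is serialized again.
    ET.register_namespace("x", "adobe:ns:meta/")
    ET.register_namespace("rdf", RDF)
    ET.register_namespace("dc", DC)
    root = ET.fromstring(xmp_xml)
    for creator in root.iter("{%s}creator" % DC):
        for bag in creator.findall("{%s}Bag" % RDF):
            bag.tag = "{%s}Seq" % RDF  # the rdf:li children keep their order
    return ET.tostring(root, encoding="unicode")


print(creators_bag_to_seq(BAG_XMP))
```

Only the container element is renamed, so the author entries and their order are left untouched. Other Bag-valued properties, if present, are not affected.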
[jira] [Updated] (TIKA-1252) Tika is not indexing all authors of a PDF
[ https://issues.apache.org/jira/browse/TIKA-1252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alexandre Madurell updated TIKA-1252:
------------------------------------

Attachment: Sample (Acrobat 4.x).pdf
            Sample (Acrobat 5.x).pdf
            Sample-One-Author.pdf
            Sample-Two-Authors.pdf
            XMP-Import-with-Seq.jpg
[jira] [Updated] (TIKA-1232) Add PDF version to PDFParser output
[ https://issues.apache.org/jira/browse/TIKA-1232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alexandre Madurell updated TIKA-1232:
------------------------------------

Attachment: Sample 10.x.pdf
            Sample 9.x.pdf
            Sample 8.x.pdf
            Sample 7.x.pdf
            Sample 6.x.pdf
            Sample 5.x.pdf
            Sample 4.x.pdf

Here they are:

Sample 4.x.pdf (PDF Version 1.3)
Sample 5.x.pdf (PDF Version 1.4)
Sample 6.x.pdf (PDF Version 1.5)
Sample 7.x.pdf (PDF Version 1.6)
Sample 8.x.pdf (PDF Version 1.7)
Sample 9.x.pdf (PDF Version 1.7 Adobe Extension Level 3)
Sample 10.x.pdf (PDF Version 1.7 Adobe Extension Level 8)

Sample 11.x.pdf coming up next.

Add PDF version to PDFParser output
-----------------------------------

Key: TIKA-1232
URL: https://issues.apache.org/jira/browse/TIKA-1232
Project: Tika
Issue Type: Improvement
Components: parser
Affects Versions: 1.5
Environment: JDK6
Reporter: William Palmer
Assignee: Tim Allison
Priority: Minor
Attachments: Sample 10.x.pdf, Sample 4.x.pdf, Sample 5.x.pdf, Sample 6.x.pdf, Sample 7.x.pdf, Sample 8.x.pdf, Sample 9.x.pdf, TIKA-1232v1.patch, TIKA-1232v2.patch, pdfversion.patch

I'd like to identify the PDF version of files; this is not currently reported by the PDFParser, although the information is available via PDFBox. I have attached a patch that adds the format version to the Metadata object. However, I am not familiar enough with the Tika source to know whether an alternative metadata key should be used or this new one added.

Comments welcome.
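As a rough cross-check for the versions listed for these samples, the base version can be read straight from a file's %PDF-x.y header line. The sketch below is a standalone Python illustration, not the attached patch and not PDFBox; the function name pdf_header_version is hypothetical. It looks only at the header, so it would not see a /Version override in the document catalog, and it would not report an Adobe Extension Level, which (to my understanding) is recorded in the catalog rather than in the header.

```python
import re

# Matches the version in a header such as b"%PDF-1.7".
_HEADER = re.compile(rb"%PDF-(\d)\.(\d)")


def pdf_header_version(data):
    """Return the header version as a string such as "1.4", or None."""
    # The header is expected at the very start of the file; allow some
    # leading junk by searching the first kilobyte only.
    match = _HEADER.search(data[:1024])
    if match is None:
        return None
    major = match.group(1).decode("ascii")
    minor = match.group(2).decode("ascii")
    return "%s.%s" % (major, minor)


# Hypothetical header bytes, standing in for the first bytes of a real file.
print(pdf_header_version(b"%PDF-1.4\n%\xe2\xe3\xcf\xd3\n"))
```

For a real file, the input would be the first bytes read in binary mode, for example `open(path, "rb").read(1024)`.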
[jira] [Updated] (TIKA-1232) Add PDF version to PDFParser output
[ https://issues.apache.org/jira/browse/TIKA-1232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alexandre Madurell updated TIKA-1232:
------------------------------------

Attachment: Sample 11.x PDFA-1b.pdf

I'm having trouble outputting to other PDF/A formats (MarkInfo missing, etc.). I'll keep checking as soon as I can. In the meantime, here's a PDF/A-1b.

BTW: the regular Acrobat XI format is the same as Acrobat X (PDF Version 1.7 Adobe Extension Level 8).
[jira] [Commented] (TIKA-1252) Tika is not indexing all authors of a PDF
[ https://issues.apache.org/jira/browse/TIKA-1252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13918450#comment-13918450 ]

Alexandre Madurell commented on TIKA-1252:
------------------------------------------

Hmmm... maybe I need to build a DublinCoreAdapter on top of Tika's Metadata class as mentioned here?

http://lucene.472066.n3.nabble.com/Metadata-use-by-Apache-Java-projects-td645477.html#a645484

Kind of a newbie here... any help is appreciated.
[jira] [Created] (TIKA-1252) Tika is not indexing all authors of a PDF
Alexandre Madurell created TIKA-1252:
-------------------------------------

Summary: Tika is not indexing all authors of a PDF
Key: TIKA-1252
URL: https://issues.apache.org/jira/browse/TIKA-1252
Project: Tika
Issue Type: Bug
Components: metadata, parser
Affects Versions: 1.4
Environment: Ubuntu 12.04 (x64)
             Solr 4.6.0 (Amazon Web Services, Bitnami Stack)
Reporter: Alexandre Madurell

When submitting a PDF with this information in its XMP metadata:

...
<dc:creator>
  <rdf:Bag>
    <rdf:li>Author 1</rdf:li>
    <rdf:li>Author 2</rdf:li>
  </rdf:Bag>
</dc:creator>
...

Only the first one appears in the collection:

...
author:[Author 1],
author_s:Author 1,
...

In spite of having set the field to multiValued in the Solr schema:

<field name="author" type="text_general" indexed="true" stored="true" multiValued="true"/>

Let me know if there's any further specific information I could provide.

Thanks in advance!
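One way to confirm that the Solr side of this report is fine, independently of Tika, is to index a document that carries two author values directly. The sketch below only builds the JSON body for such an update; the helper name solr_author_doc, the document id and the assumption that the schema's unique key field is called "id" are all hypothetical, and the request itself (a POST with a JSON content type to the core's update handler) is left out.

```python
import json


def solr_author_doc(doc_id, authors):
    """Build a JSON update body with one document and a multi-valued author."""
    # A multiValued field is sent as a JSON array of values.
    doc = {"id": doc_id, "author": list(authors)}
    return json.dumps([doc])


payload = solr_author_doc("sample-1", ["Author 1", "Author 2"])
print(payload)
```

If both values come back when this document is queried, the schema is behaving as configured and the missing author is being dropped earlier, during extraction.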