[jira] [Commented] (TIKA-1232) Add PDF version to PDFParser output
[ https://issues.apache.org/jira/browse/TIKA-1232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14125384#comment-14125384 ] Andrew Jackson commented on TIKA-1232: -- Looks like this is fixed and in the 1.6 release - thank you. Can the 'Fix version' on this ticket be updated accordingly?

Add PDF version to PDFParser output
---
Key: TIKA-1232
URL: https://issues.apache.org/jira/browse/TIKA-1232
Project: Tika
Issue Type: Improvement
Components: parser
Affects Versions: 1.5
Environment: JDK6
Reporter: William Palmer
Assignee: Tim Allison
Priority: Minor
Attachments: Sample 10.x.pdf, Sample 11.x PDFA-1b.pdf, Sample 4.x.pdf, Sample 5.x.pdf, Sample 6.x.pdf, Sample 7.x.pdf, Sample 8.x.pdf, Sample 9.x.pdf, TIKA-1232v1.patch, TIKA-1232v2.patch, pdfversion.patch, testComment.pdf

I'd like to identify the PDF version of files, this is not currently reported by the PDFParser although the information is available via PDFBox. I have attached a patch that adds the format version to the Metadata object. However, I am not familiar enough with the Tika source to know if an alternative metadata key should be used, or this new one added. Comments welcome.

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1232) Add PDF version to PDFParser output
[ https://issues.apache.org/jira/browse/TIKA-1232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14040816#comment-14040816 ] Tyler Palsulich commented on TIKA-1232: --- Hey [~talli...@mitre.org]. A couple -- TIKA-758 will need an update (I put up a patch a few days ago corresponding to an older version upgrade, too) and the workaround in TIKA-1325 can be removed (since PDFBOX-2122 is resolved).
[jira] [Commented] (TIKA-1232) Add PDF version to PDFParser output
[ https://issues.apache.org/jira/browse/TIKA-1232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13930114#comment-13930114 ] William Palmer commented on TIKA-1232: -- Thanks, Tim and everyone.
[jira] [Commented] (TIKA-1232) Add PDF version to PDFParser output
[ https://issues.apache.org/jira/browse/TIKA-1232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13922736#comment-13922736 ] Tim Allison commented on TIKA-1232: --- Fixed in r1574959. Reopen if any tweaks remain to be made. Thank you, all, for your contributions!
[jira] [Commented] (TIKA-1232) Add PDF version to PDFParser output
[ https://issues.apache.org/jira/browse/TIKA-1232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13920698#comment-13920698 ] Andrew Jackson commented on TIKA-1232: -- Does anyone have a copy of Acrobat 9.1? That version uses Adobe Extension Level 5, so we'd need that to get the full set of recent versions. I'll have a dig around for suitable files for the versions that aren't covered yet, but most of the stuff I have access to is not re-licensable.
[jira] [Commented] (TIKA-1232) Add PDF version to PDFParser output
[ https://issues.apache.org/jira/browse/TIKA-1232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13920741#comment-13920741 ] Tim Allison commented on TIKA-1232: --- That would be great! Yes, please make sure that your contributions are consistent with the Apache License 2.0. Thank you, [~alexandre.madur...@gmail.com] for all of your testing files!
[jira] [Commented] (TIKA-1232) Add PDF version to PDFParser output
[ https://issues.apache.org/jira/browse/TIKA-1232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13919861#comment-13919861 ] Tim Allison commented on TIKA-1232: --- Also, if anyone can create or share some test files, that would be great. Thank you!
[jira] [Commented] (TIKA-1232) Add PDF version to PDFParser output
[ https://issues.apache.org/jira/browse/TIKA-1232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13908291#comment-13908291 ] William Palmer commented on TIKA-1232: -- Hi Tim and Andy, Thanks - your code works on my test files. One question though - it appears that dc:format should be a MIME type. Should the Extended-Content-Type dc:format therefore be an actual MIME type with a version, like application/pdf; version=A-1a, with A-1a overriding the pdf:PDFVersion 1.4? Thanks for this - much appreciated!
[jira] [Commented] (TIKA-1232) Add PDF version to PDFParser output
[ https://issues.apache.org/jira/browse/TIKA-1232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13908402#comment-13908402 ] Andrew Jackson commented on TIKA-1232: -- Going by my original intention, then I would prefer the one additional dc:format to be of the form:
{code}
application/pdf; version=1.4
application/pdf; version=A-1a
application/pdf; version=1.7 Adobe Extension Level 3
{code}
[jira] [Commented] (TIKA-1232) Add PDF version to PDFParser output
[ https://issues.apache.org/jira/browse/TIKA-1232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13908407#comment-13908407 ] Tim Allison commented on TIKA-1232: --- Thank you. Will make change. Does anyone happen to have shareable pdfs to test the new metadata? Or, could someone fabricate some, perhaps? To the Tika community, are we ok with having multiple dc:formats in the metadata?
[jira] [Commented] (TIKA-1232) Add PDF version to PDFParser output
[ https://issues.apache.org/jira/browse/TIKA-1232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13895528#comment-13895528 ] Thomas Ledoux commented on TIKA-1232: - Regarding XMP output from Tika and the inclusion of the version: in the case of PDF, special ontologies are defined. Namely, in the http://ns.adobe.com/pdf/1.3/ namespace, there is a pdf:PDFVersion property. It can even be refined in the case of PDF/A, where the conformance level can be given using the http://www.aiim.org/pdfa/ns/id/ namespace in the property pdfaid:conformance (see TN0008). There are similar properties pdfx:GTS_PDFXVersion and pdfx:GTS_PDFXConformance in the http://ns.adobe.com/pdfx/1.3 namespace for PDF/X files. However, all these properties are only available for PDF formats and would break the idea of having a generic metadata map exposed by Tika. So I agree with Andrew's proposal of using a version parameter in the mimetype, which is allowed in XMP. Indeed, the XMP definition of the value of dc:format is a MIMEType following IETF RFC 2045 section 5.1. Finally, in order to prevent the confusion of client code that Andrew raises, we could take advantage of the repeatability of the dc:format attribute and output 2 dc:formats: the first being the normal Content-Type and the second being the Extended-Content-Type.
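For readers following the thread, the "two dc:format values" idea from the comment above could look roughly like the sketch below. It deliberately uses a plain Java map as a stand-in for a multi-valued metadata object rather than Tika's own Metadata class; the class name MultiValuedFormatSketch and the helper describePdf are hypothetical names made up for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a multi-valued metadata map; not Tika's Metadata class.
final class MultiValuedFormatSketch {
    private final Map<String, List<String>> values = new LinkedHashMap<>();

    // Appends a value instead of overwriting, so a key can repeat.
    void add(String name, String value) {
        List<String> list = values.get(name);
        if (list == null) {
            list = new ArrayList<>();
            values.put(name, list);
        }
        list.add(value);
    }

    // All values recorded for a key, in insertion order.
    List<String> getValues(String name) {
        List<String> list = values.get(name);
        return list == null
                ? Collections.<String>emptyList()
                : Collections.unmodifiableList(list);
    }

    // First value only: what a client that expects a single value would see.
    String get(String name) {
        List<String> list = getValues(name);
        return list.isEmpty() ? null : list.get(0);
    }

    // Records the plain type first and the extended type second, as proposed.
    static MultiValuedFormatSketch describePdf(String extendedContentType) {
        MultiValuedFormatSketch m = new MultiValuedFormatSketch();
        m.add("dc:format", "application/pdf");
        m.add("dc:format", extendedContentType);
        return m;
    }
}
```

With this ordering, code that reads only the first dc:format still sees the plain application/pdf, while code that asks for all values also gets the versioned form.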
[jira] [Commented] (TIKA-1232) Add PDF version to PDFParser output
[ https://issues.apache.org/jira/browse/TIKA-1232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13894376#comment-13894376 ] Andrew Jackson commented on TIKA-1232: -- Great! For (1), very happy for that code to go to PDFBox. I'm pretty sure PDFBox doesn't already do anything along those lines, but I am not all that familiar with that codebase so it's worth checking first. As for (2), I've only tested on a fairly small number of PDFs because only the more recent versions of the Adobe tools actually make use of extension levels, and even then, only when necessary. I ran that code against a web archive corpus containing around 2 billion resources, including many millions of PDFs, but because that dataset only ran up to 2010, I found a grand total of eight PDFs that used Adobe Extension Level 3. It worked fine on those! Finally, on the metadata property scheme, I feel the 'right place' is as a parameter on the Content Type, but I accept that may confuse client code (i.e. people assuming type.equals("application/pdf") should always work, even though that would be no good for other types like HTML due to the charset parameter). Note that the parameter approach also allows you to do version detection in Tika's [custom-mimetypes.xml|https://github.com/openplanets/nanite/blob/master/nanite-core/src/main/resources/org/apache/tika/mime/custom-mimetypes.xml#L357], which I find rather handy. Of course, you are also welcome to take any of those signatures if they are of interest.
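The client-code concern raised above (an exact string comparison against application/pdf failing once a parameter is appended) can be illustrated with a small sketch. It uses plain string handling only and is not taken from Tika; BaseTypeSketch, baseType and isPdf are hypothetical names chosen for the example.

```java
import java.util.Locale;

// Hypothetical helper, not part of Tika: compares content types by base type only.
final class BaseTypeSketch {

    // Drops everything from the first ';' onward and normalises case,
    // so "application/pdf; version=1.4" becomes "application/pdf".
    static String baseType(String contentType) {
        int semi = contentType.indexOf(';');
        String base = semi < 0 ? contentType : contentType.substring(0, semi);
        return base.trim().toLowerCase(Locale.ROOT);
    }

    // True for a PDF content type with or without parameters.
    static boolean isPdf(String contentType) {
        return "application/pdf".equals(baseType(contentType));
    }
}
```

A client written this way keeps working whether or not a version parameter is present, in the same way that robust HTML handling already has to ignore the charset parameter.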
[jira] [Commented] (TIKA-1232) Add PDF version to PDFParser output
[ https://issues.apache.org/jira/browse/TIKA-1232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13893380#comment-13893380 ] Tim Allison commented on TIKA-1232: --- Interesting. Thank you, [~johanvanderknijff] and [~anjackson]. I personally like Extended-Content-Type, but following (http://wiki.apache.org/tika/MetadataRoadmap), is there someone more familiar with Dublin Core and/or XMP who could recommend appropriate tags? Many apologies if either one of those recommends Extended-Content-Type :).
[jira] [Commented] (TIKA-1232) Add PDF version to PDFParser output
[ https://issues.apache.org/jira/browse/TIKA-1232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13893426#comment-13893426 ] Tim Allison commented on TIKA-1232: --- [~anjackson], yes, I'd like to add your code if others agree that it would be useful. No need for a formal patch. I'll take your GitHub code nearly directly. Two items: 1) Would you be interested in contributing your extension-level extraction code to PDFBox if it doesn't currently exist there? (I haven't checked, but I assume you wouldn't reinvent the wheel.) I think that would be more at home within PDFBox. 2) How much testing have you done for potential exceptions thrown by PDFBox on pdfs in the wild when grabbing this new metadata (cf. null pointer checks around date parsing in current metadata code and TIKA-1226, TIKA-1232, TIKA-1233)? Thank you, again.
[jira] [Commented] (TIKA-1232) Add PDF version to PDFParser output
[ https://issues.apache.org/jira/browse/TIKA-1232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13892146#comment-13892146 ] Tim Allison commented on TIKA-1232: --- How about Application-Version to follow the deprecated example in org.apache.tika.metadata.MSOffice? Tika community, is there a more appropriate label for this? I didn't find anything relevant in TikaCoreProperties. Thank you.
[jira] [Commented] (TIKA-1232) Add PDF version to PDFParser output
[ https://issues.apache.org/jira/browse/TIKA-1232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13892149#comment-13892149 ] Johan van der Knijff commented on TIKA-1232: One thing to watch out for is that PDF has two places where you can define the version: the file header and, from PDF 1.4 onward, the catalog dictionary in the trailer. The two can differ (in which case the latter has precedence). See p. 39 of ISO 32000: http://www.adobe.com/devnet/acrobat/pdfs/PDF32000_2008.pdf On top of that, PDF 1.7 also adds Extension Levels (p. 108); maybe those should be included as well?
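The header-versus-catalog point above can be sketched in a few lines of plain Java. This is only an illustration, not the code from any of the attached patches and not a PDFBox API: PdfVersionSketch and its methods are hypothetical names, and the rule it applies (use the catalog version when it is later than the header version, otherwise the header version) is my reading of ISO 32000-1, so treat it as an assumption.

```java
// Hypothetical sketch; not part of Tika or PDFBox.
final class PdfVersionSketch {

    // Extracts "1.4" from a header line such as "%PDF-1.4"; null if no header is found.
    static String headerVersion(String firstLine) {
        String marker = "%PDF-";
        int at = firstLine == null ? -1 : firstLine.indexOf(marker);
        if (at < 0) {
            return null;
        }
        int start = at + marker.length();
        int end = start;
        // Accept digits and dots only, e.g. "1.7".
        while (end < firstLine.length()
                && (Character.isDigit(firstLine.charAt(end)) || firstLine.charAt(end) == '.')) {
            end++;
        }
        return end > start ? firstLine.substring(start, end) : null;
    }

    // Compares dotted version strings numerically, component by component.
    static int compare(String a, String b) {
        String[] x = a.split("\\.");
        String[] y = b.split("\\.");
        for (int i = 0; i < Math.max(x.length, y.length); i++) {
            int xi = i < x.length ? Integer.parseInt(x[i]) : 0;
            int yi = i < y.length ? Integer.parseInt(y[i]) : 0;
            if (xi != yi) {
                return Integer.compare(xi, yi);
            }
        }
        return 0;
    }

    // Assumed rule: the catalog version wins only when it is later than the header version.
    static String effectiveVersion(String header, String catalog) {
        if (catalog == null) {
            return header;
        }
        if (header == null) {
            return catalog;
        }
        return compare(catalog, header) > 0 ? catalog : header;
    }
}
```

In a real parser the two inputs would come from the first bytes of the file and from the document catalog's Version entry respectively; how those are obtained is left out here on purpose.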
[jira] [Commented] (TIKA-1232) Add PDF version to PDFParser output
[ https://issues.apache.org/jira/browse/TIKA-1232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13892210#comment-13892210 ] Andrew Jackson commented on TIKA-1232: -- Yes, you can't identify 1.7 PDF or the PDF/A variants unless you do a bit more parsing. In case it helps, here's the code I wrote to do that (and also extract other metadata of interest to me): https://github.com/openplanets/nanite/blob/master/nanite-ext/src/main/java/uk/bl/wa/tika/parser/pdf/pdfbox/PDFParser.java#L253 I couldn't do what I wanted by sub-classing the Tika code, so I copied the PDFParser and augmented it. If there is interest in taking this code into Tika I'd be willing to spend time turning it into a proper patch. As for how to record the result, this is definitely not the Application-Version. A modern version of Adobe Distiller can output various versions of PDF, because it chooses the version of the format based on the needs of the current document, i.e. if a document only requires PDF 1.4 features, it will output a PDF 1.4 and not just default to the latest version (AFAICT). My preference would be to use a version parameter on the content type. It's not a formally standardised approach, but has been adopted in a few places (e.g. [Java plugin versions|http://docs.oracle.com/javase/7/docs/technotes/guides/plugin/developer_guide/faq/basics.html#version]). In this case, you'd have something like:
{quote}
application/pdf; version=1.4
application/pdf; version=1.7 Adobe Extension Level 5
etc...
{quote}
although to avoid causing trouble for code that relies on the 'Content-Type' property, I have so far chosen to use a new property for this purpose (called 'Extended-Content-Type').
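As a rough illustration of the version-parameter form described in the comment above, the sketch below assembles the extended content type string from a version and an optional extension level. It is not the code linked from the nanite repository; ExtendedContentTypeSketch and extendedType are hypothetical names, and treating an extension level of 0 as "none" is an assumption made for the example.

```java
// Hypothetical sketch; not the nanite or Tika implementation.
final class ExtendedContentTypeSketch {

    // Builds e.g. "application/pdf; version=1.7 Adobe Extension Level 5".
    // An extension level of 0 or less is assumed to mean "no extension level".
    static String extendedType(String version, int adobeExtensionLevel) {
        StringBuilder sb = new StringBuilder("application/pdf; version=").append(version);
        if (adobeExtensionLevel > 0) {
            sb.append(" Adobe Extension Level ").append(adobeExtensionLevel);
        }
        return sb.toString();
    }
}
```

One thing to keep in mind with this shape: the extension-level variant puts spaces inside the parameter value, which a strict RFC 2045 parser would only accept as a quoted string, so a separate property such as Extended-Content-Type is a reasonably safe home for it.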