[ https://issues.apache.org/jira/browse/TIKA-1232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13892210#comment-13892210 ]
Andrew Jackson commented on TIKA-1232:
--------------------------------------

Yes, you can't identify PDF versions above 1.7 or the PDF/A variants unless you do a bit more parsing. In case it helps, here's the code I wrote to do that (and also to extract other metadata of interest to me):

https://github.com/openplanets/nanite/blob/master/nanite-ext/src/main/java/uk/bl/wa/tika/parser/pdf/pdfbox/PDFParser.java#L253

I couldn't do what I wanted by sub-classing the Tika code, so I copied the PDFParser and augmented it. If there is interest in taking this code into Tika, I'd be willing to spend time turning it into a proper patch.

As for how to record the result, this is definitely not the Application-Version. A modern version of Adobe Distiller can output various versions of PDF, because it chooses the version of the format based on the needs of the current document. That is, if a document only requires PDF 1.4 features, it will output a PDF 1.4 file and not just default to the latest version (AFAICT).

My preference would be to use a version parameter on the content type. It's not a formally standardised approach, but it has been adopted in a few places (e.g. [Java plugin versions|http://docs.oracle.com/javase/7/docs/technotes/guides/plugin/developer_guide/faq/basics.html#version]). In this case, you'd have something like:

{quote}
application/pdf; version=1.4
application/pdf; version="1.7 Adobe Extension Level 5"
etc...
{quote}

However, to avoid causing trouble for code that relies on the 'Content-Type' property, I have so far chosen to use a new property for this purpose (called 'Extended-Content-Type').
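To make the proposed format concrete, here is a minimal sketch of how such a value could be assembled and stored next to the untouched 'Content-Type'. It uses only the JDK: the class name ExtendedContentType, the format helper, and the plain Map standing in for Tika's Metadata object are all hypothetical and are not taken from Tika or from the linked parser. The quoting rule (quote anything that is not a simple token) is an assumption based on the two example values above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration only; not part of Tika.
final class ExtendedContentType {

    // Property name as proposed in the comment above.
    static final String PROPERTY = "Extended-Content-Type";

    private ExtendedContentType() {}

    // Builds "<baseType>; version=<version>", quoting the version when it
    // contains anything other than letters, digits, '.', '+', '_' or '-'.
    static String format(String baseType, String version) {
        boolean plainToken = version.matches("[A-Za-z0-9.+_-]+");
        String value = plainToken
                ? version
                : "\"" + version.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
        return baseType + "; version=" + value;
    }

    public static void main(String[] args) {
        // A plain map stands in for the parser's metadata object (assumption).
        Map<String, String> metadata = new LinkedHashMap<>();
        metadata.put("Content-Type", "application/pdf");
        metadata.put(PROPERTY, format("application/pdf", "1.7 Adobe Extension Level 5"));
        metadata.forEach((key, value) -> System.out.println(key + ": " + value));
    }
}
```

Keeping the versioned value in a separate property means existing code that compares 'Content-Type' against "application/pdf" continues to work unchanged.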
> Add PDF version to PDFParser output
> -----------------------------------
>
>                 Key: TIKA-1232
>                 URL: https://issues.apache.org/jira/browse/TIKA-1232
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 1.5
>         Environment: JDK6
>            Reporter: William Palmer
>            Assignee: Tim Allison
>            Priority: Minor
>         Attachments: pdfversion.patch
>
>
> I'd like to identify the PDF version of files, this is not currently reported
> by the PDFParser although the information is available via PDFBox. I have
> attached a patch that adds the format version to the Metadata object.
> However, I am not familiar enough with the Tika source to know if an
> alternative metadata key should be used, or this new one added.
> Comments welcome.

--
This message was sent by Atlassian JIRA
(v6.1.5#6160)