[jira] [Commented] (TIKA-1232) Add PDF version to PDFParser output

2014-09-08 Thread Andrew Jackson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14125384#comment-14125384
 ] 

Andrew Jackson commented on TIKA-1232:
--

Looks like this is fixed and in the 1.6 release - thank you. Can the 'Fix 
version' on this ticket be updated accordingly?

 Add PDF version to PDFParser output
 ---

 Key: TIKA-1232
 URL: https://issues.apache.org/jira/browse/TIKA-1232
 Project: Tika
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.5
 Environment: JDK6
Reporter: William Palmer
Assignee: Tim Allison
Priority: Minor
 Attachments: Sample 10.x.pdf, Sample 11.x PDFA-1b.pdf, Sample 
 4.x.pdf, Sample 5.x.pdf, Sample 6.x.pdf, Sample 7.x.pdf, Sample 8.x.pdf, 
 Sample 9.x.pdf, TIKA-1232v1.patch, TIKA-1232v2.patch, pdfversion.patch, 
 testComment.pdf


 I'd like to identify the PDF version of files, this is not currently reported 
 by the PDFParser although the information is available via PDFBox.  I have 
 attached a patch that adds the format version to the Metadata object.
 However, I am not familiar enough with the Tika source to know if an 
 alternative metadata key should be used, or this new one added.
 Comments welcome.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1232) Add PDF version to PDFParser output

2014-06-23 Thread William Palmer (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14040623#comment-14040623
 ] 

William Palmer commented on TIKA-1232:
--

I am currently out of the office and will be back on Thursday 26th June 2014.

Any FOI requests should be sent to foi-enquir...@bl.uk.





[jira] [Commented] (TIKA-1232) Add PDF version to PDFParser output

2014-06-23 Thread Tyler Palsulich (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14040816#comment-14040816
 ] 

Tyler Palsulich commented on TIKA-1232:
---

Hey [~talli...@mitre.org]. A couple -- TIKA-758 will need an update (I put up a 
patch a few days ago corresponding to an older version upgrade, too) and the 
workaround in TIKA-1325 can be removed (since PDFBOX-2122 is resolved). 


[jira] [Commented] (TIKA-1232) Add PDF version to PDFParser output

2014-06-04 Thread Johan van der Knijff (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14017879#comment-14017879
 ] 

Johan van der Knijff commented on TIKA-1232:


I'm currently away and unable to respond to your message. I will be back the 
16th of June.

Best regards,

Johan




[jira] [Commented] (TIKA-1232) Add PDF version to PDFParser output

2014-03-11 Thread William Palmer (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13930114#comment-13930114
 ] 

William Palmer commented on TIKA-1232:
--

Thanks Tim & everyone


[jira] [Commented] (TIKA-1232) Add PDF version to PDFParser output

2014-03-06 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13922736#comment-13922736
 ] 

Tim Allison commented on TIKA-1232:
---

Fixed in r1574959.  Reopen if any tweaks remain to be made.  Thank you, all, for 
your contributions!


[jira] [Commented] (TIKA-1232) Add PDF version to PDFParser output

2014-03-05 Thread Andrew Jackson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13920698#comment-13920698
 ] 

Andrew Jackson commented on TIKA-1232:
--

Does anyone have a copy of Acrobat 9.1? That version uses Adobe Extension Level 
5, so we'd need that to get the full set of recent versions. I'll have a dig 
around for suitable files for the versions that aren't covered yet, but most of 
the stuff I have access to is not re-licensable.


[jira] [Commented] (TIKA-1232) Add PDF version to PDFParser output

2014-03-05 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13920741#comment-13920741
 ] 

Tim Allison commented on TIKA-1232:
---

That would be great!  Yes, please make sure that your contributions are 
consistent with the Apache License 2.0.  Thank you, 
[~alexandre.madur...@gmail.com] for all of your testing files! 


[jira] [Commented] (TIKA-1232) Add PDF version to PDFParser output

2014-03-04 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13919861#comment-13919861
 ] 

Tim Allison commented on TIKA-1232:
---

Also, if anyone can create or share some test files, that would be great.  
Thank you!


[jira] [Commented] (TIKA-1232) Add PDF version to PDFParser output

2014-03-04 Thread William Palmer (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13919863#comment-13919863
 ] 

William Palmer commented on TIKA-1232:
--

I am currently out of the office and will be back on Monday 11th March 2014.

Any FOI requests should be sent to foi-enquir...@bl.uk.






[jira] [Commented] (TIKA-1232) Add PDF version to PDFParser output

2014-02-21 Thread William Palmer (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13908291#comment-13908291
 ] 

William Palmer commented on TIKA-1232:
--

Hi Tim & Andy,

Thanks - your code works on my test files.  One question though - it appears 
that dc:format should be a mimetype, therefore should the Extended-Content-Type 
dc:format be an actual mimetype with version like application/pdf; 
version=A-1a, with A-1a overriding the pdf:PDFVersion 1.4?  

Thanks for this - much appreciated!


[jira] [Commented] (TIKA-1232) Add PDF version to PDFParser output

2014-02-21 Thread Andrew Jackson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13908402#comment-13908402
 ] 

Andrew Jackson commented on TIKA-1232:
--

Going by my original intention, then I would prefer the one additional 
dc:format to be of the form:

{code}
application/pdf; version=1.4
application/pdf; version=A-1a
application/pdf; version=1.7 Adobe Extension Level 3
{code}
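
A minimal sketch of how strings of that shape could be assembled. The class and method names are hypothetical and not part of Tika or PDFBox, and letting the PDF/A conformance replace the plain version number is an assumption taken from the question earlier in this thread, not a settled rule.

```java
/** Hypothetical helper for illustration only; not a Tika or PDFBox class. */
final class ExtendedContentType {
    private ExtendedContentType() {}

    /**
     * Builds an "application/pdf; version=..." string.
     *
     * @param pdfVersion      plain PDF version, e.g. "1.4"
     * @param pdfaConformance PDF/A part and level such as "1a", or null if not PDF/A
     * @param extensionLevel  Adobe extension level, or 0 if none is declared
     */
    static String format(String pdfVersion, String pdfaConformance, int extensionLevel) {
        StringBuilder sb = new StringBuilder("application/pdf; version=");
        if (pdfaConformance != null) {
            // Assumption: the PDF/A conformance overrides the plain version.
            sb.append("A-").append(pdfaConformance);
        } else {
            sb.append(pdfVersion);
            if (extensionLevel > 0) {
                sb.append(" Adobe Extension Level ").append(extensionLevel);
            }
        }
        return sb.toString();
    }
}
```

Note that RFC 2045 only allows spaces in a parameter value when the value is a quoted-string, so a strict MIME parser may expect the extension-level form to be quoted.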


[jira] [Commented] (TIKA-1232) Add PDF version to PDFParser output

2014-02-21 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13908407#comment-13908407
 ] 

Tim Allison commented on TIKA-1232:
---

Thank you.  Will make change.  Does anyone happen to have shareable pdfs to 
test the new metadata?  Or, could someone fabricate some, perhaps?

To the Tika community, are we ok with having multiple dc:formats in the 
metadata?


[jira] [Commented] (TIKA-1232) Add PDF version to PDFParser output

2014-02-08 Thread Thomas Ledoux (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13895528#comment-13895528
 ] 

Thomas Ledoux commented on TIKA-1232:
-

Regarding XMP output from Tika and the inclusion of version, in the case of PDF, 
special ontologies are defined.
Namely, in the http://ns.adobe.com/pdf/1.3/ namespace, there is a 
pdf:PDFVersion property.
It can even be refined in the case of PDF/A where the conformance level can be 
given using the http://www.aiim.org/pdfa/ns/id/ namespace in the property 
pdfaid:conformance (see TN0008). There are similar properties 
pdfx:GTS_PDFXVersion and pdfx:GTS_PDFXConformance in the 
http://ns.adobe.com/pdfx/1.3 namespace for PDF/X files.

However, all these properties are only available for PDF formats and will break 
the idea of having a generic metadata map exposed by Tika.
So I agree with Andrew's proposal of using a version parameter in the mimetype, 
which is allowed in XMP.
Indeed, the XMP definition of the value of dc:format is a MIMEType following 
IETF RFC 2045 section 5.1. 

Finally, in order to prevent the confusion of client code that Andrew raises, 
we could take advantage of the repeatability of the dc:format attribute and 
output 2 dc:formats : the first being the normal Content-Type and the second 
being the Extended-Content-Type.
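
A rough sketch of what a repeated dc:format could look like to client code. The class below is a hypothetical stand-in for a multi-valued metadata map, written against plain Java 8 collections so it runs without Tika on the classpath; it is not Tika's own Metadata class.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical multi-valued metadata map, for illustration only. */
final class MultiValuedMetadata {
    private final Map<String, List<String>> values = new LinkedHashMap<>();

    /** Appends a value under the given name, keeping earlier values. */
    void add(String name, String value) {
        values.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
    }

    /** Returns all values for the name in insertion order, or an empty list. */
    List<String> getValues(String name) {
        return values.getOrDefault(name, Collections.<String>emptyList());
    }
}
```

With this shape, a parser would add "application/pdf" first and "application/pdf; version=1.4" second under dc:format, so a client that only reads the first value still sees the plain content type.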



[jira] [Commented] (TIKA-1232) Add PDF version to PDFParser output

2014-02-07 Thread Andrew Jackson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13894376#comment-13894376
 ] 

Andrew Jackson commented on TIKA-1232:
--

Great!

For (1), very happy for that code to go to PDFBox. I'm pretty sure PDFBox 
doesn't already do anything along those lines, but I am not all that familiar 
with that codebase so it's worth checking first.

As for (2), I've only tested on a fairly small number of PDFs because only the 
more recent versions of the Adobe tools actually make use of them, and even 
then, only when necessary. I ran that code against a web archive corpus 
containing around 2 billion resources, including many millions of PDFs, but 
because that dataset only ran up to 2010, I found a grand total of eight PDFs 
that used Adobe Extension Level 3. It worked fine on those!

Finally, on the metadata property scheme, I feel the 'right place' is as a 
parameter on the Content Type, but I accept that may confuse client code (i.e. 
people assuming type.equals("application/pdf") should always work, even though 
that would be no good for other types like HTML due to the charset parameter). 
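
To illustrate the client-side concern, here is a small sketch of comparing only the base type and ignoring any parameters. The class and method names are hypothetical, and the parsing is deliberately naive (it splits on the first semicolon and does not handle quoted strings or comments).

```java
import java.util.Locale;

/** Hypothetical helper for illustration only; not a Tika class. */
final class MediaTypes {
    private MediaTypes() {}

    /** Returns the type/subtype part of a content type, lower-cased, without parameters. */
    static String baseType(String contentType) {
        int semi = contentType.indexOf(';');
        String base = semi >= 0 ? contentType.substring(0, semi) : contentType;
        return base.trim().toLowerCase(Locale.ROOT);
    }

    /** True when the base type is application/pdf, whatever parameters follow. */
    static boolean isPdf(String contentType) {
        return "application/pdf".equals(baseType(contentType));
    }
}
```

Client code written this way keeps working whether or not a version or charset parameter is present.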

Note that the parameter approach also allows you to do version detection in 
Tika's 
[custom-mimetypes.xml|https://github.com/openplanets/nanite/blob/master/nanite-core/src/main/resources/org/apache/tika/mime/custom-mimetypes.xml#L357],
 which I find rather handy. Of course, you are also welcome to take any of 
those signatures if they are of interest.


[jira] [Commented] (TIKA-1232) Add PDF version to PDFParser output

2014-02-06 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13893380#comment-13893380
 ] 

Tim Allison commented on TIKA-1232:
---

Interesting.  Thank you, [~johanvanderknijff] and [~anjackson].  I personally 
like Extended-Content-Type, but following 
(http://wiki.apache.org/tika/MetadataRoadmap), is there someone more familiar 
with Dublin Core and/or XMP who could recommend appropriate tags?  Many 
apologies if either one of those recommends Extended-Content-Type :).


[jira] [Commented] (TIKA-1232) Add PDF version to PDFParser output

2014-02-06 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13893426#comment-13893426
 ] 

Tim Allison commented on TIKA-1232:
---

[~anjackson], yes, I'd like to add your code if others agree that it would be 
useful.  No need for a formal patch.  I'll take your github code nearly 
directly.

Two items:
  1) Would you be interested in contributing your extension-level extraction 
code to PDFBox if it doesn't currently exist there (I haven't checked but I 
assume you wouldn't reinvent the wheel)?  I think that would be more at home 
within PDFBox.
  2) How much testing have you done for potential exceptions thrown by PDFBox 
on pdfs in the wild when grabbing this new metadata (cf. null pointer checks 
around date parsing in current metadata code and TIKA-1226, TIKA-1232, 
TIKA-1233)?

Thank you, again.


[jira] [Commented] (TIKA-1232) Add PDF version to PDFParser output

2014-02-05 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13892146#comment-13892146
 ] 

Tim Allison commented on TIKA-1232:
---

How about Application-Version to follow the deprecated example in 
org.apache.tika.metadata.MSOffice?

Tika Community,
  Is there a more appropriate label for this?  I didn't find anything relevant 
in TikaCoreProperties.  Thank you.


[jira] [Commented] (TIKA-1232) Add PDF version to PDFParser output

2014-02-05 Thread Johan van der Knijff (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13892149#comment-13892149
 ] 

Johan van der Knijff commented on TIKA-1232:


One thing to watch out for is that PDF has two places where you can define the 
version: the file header and, from PDF 1.4 onward, the catalog dictionary in 
the trailer. The two can differ (in which case the latter has precedence). 
See p. 39 of ISO 32000: 

http://www.adobe.com/devnet/acrobat/pdfs/PDF32000_2008.pdf

On top of that PDF 1.7 also adds Extension Levels (p.108), maybe those should 
be included as well?
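
A minimal sketch of resolving the two version sources, assuming both have already been read as plain strings such as "1.4". The helper is hypothetical and not part of Tika or PDFBox. It also rests on one reading of ISO 32000-1, namely that the catalog's Version entry applies when it names a later version than the header, so the sketch picks the later of the two.

```java
/** Hypothetical helper for illustration only; not a Tika or PDFBox class. */
final class PdfVersionResolver {
    private PdfVersionResolver() {}

    /**
     * Returns the catalog version when it is present and later than the
     * header version; otherwise returns the header version.
     */
    static String effectiveVersion(String headerVersion, String catalogVersion) {
        if (catalogVersion == null || catalogVersion.isEmpty()) {
            return headerVersion;
        }
        return compare(catalogVersion, headerVersion) > 0 ? catalogVersion : headerVersion;
    }

    /** Compares dotted version strings numerically, so "1.10" sorts after "1.9". */
    static int compare(String a, String b) {
        String[] pa = a.split("\\.");
        String[] pb = b.split("\\.");
        for (int i = 0; i < Math.max(pa.length, pb.length); i++) {
            int x = i < pa.length ? Integer.parseInt(pa[i]) : 0;
            int y = i < pb.length ? Integer.parseInt(pb[i]) : 0;
            if (x != y) {
                return Integer.compare(x, y);
            }
        }
        return 0;
    }
}
```

Extension Levels are not covered here; they would need to be read separately from the catalog's extensions dictionary.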



[jira] [Commented] (TIKA-1232) Add PDF version to PDFParser output

2014-02-05 Thread Andrew Jackson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13892210#comment-13892210
 ] 

Andrew Jackson commented on TIKA-1232:
--

Yes, you can't identify >1.7 PDF or the PDF/A variants unless you do a bit 
more parsing. In case it helps, here's the code I wrote to do that (and also 
extract other metadata of interest to me):

https://github.com/openplanets/nanite/blob/master/nanite-ext/src/main/java/uk/bl/wa/tika/parser/pdf/pdfbox/PDFParser.java#L253

I couldn't do what I wanted by sub-classing the Tika code, so I copied the 
PDFParser and augmented it. If there is interest in taking this code into Tika 
I'd be willing to spend time turning it into a proper patch.

As for how to record the result, this is definitely not the 
Application-Version. A modern version of Adobe Distiller can output various 
versions of PDF, because it chooses the version of the format based on the 
needs of the current document. i.e. if a document only requires PDF 1.4 
features, it will output a PDF 1.4 and not just default to the latest version 
(AFAICT).

My preference would be to use a version parameter on the content type. It's not 
a formally standardised approach, but has been adopted in a few places (e.g. 
[Java plugin 
versions|http://docs.oracle.com/javase/7/docs/technotes/guides/plugin/developer_guide/faq/basics.html#version])

In this case, you'd have something like:

{quote}
application/pdf; version=1.4
application/pdf; version=1.7 Adobe Extension Level 5
etc...
{quote}

although to avoid causing trouble for code that relies on the 'Content-Type' 
property, I have so far chosen to use a new property for this purpose (called 
'Extended-Content-Type'). 
