[jira] [Commented] (TIKA-1285) Upgrade to PDFBox 2.0.0 when available

2016-03-29 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15215855#comment-15215855
 ] 

Tim Allison commented on TIKA-1285:
---

I opened TIKA-1912 to track this issue.

> Upgrade to PDFBox 2.0.0 when available
> --
>
> Key: TIKA-1285
> URL: https://issues.apache.org/jira/browse/TIKA-1285
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.6
>Reporter: Jeremy Anderson
> Fix For: 1.13
>
> Attachments: TIKA-1285.patch, TIKA-1285_rev1641423.patch, 
> TIKA-1285v3.patch, pdfbox_reports_2_0_0_20150709.zip, 
> testPDF_childAttachments.pdf
>
>
> This issue is to track fixes required when upgrading the PDFbox dependency to 
> 2.0.0 Final once it's available, and using PDFBox's daily build before then.
> See TIKA-1268 comment.
> Relates to PDFBOX-1893



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1285) Upgrade to PDFBox 2.0.0 when available

2016-03-28 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15214111#comment-15214111
 ] 

Tim Allison commented on TIKA-1285:
---

As I mentioned on the pdfbox dev list, I'm hesitant to waste your time by 
submitting issues for truncated files.  If AR can't parse it, I wouldn't expect 
PDFBox to have much luck.  

However, the classic parser in 1.8 was able to get some text+metadata out of 
some truncated files.

If you go to my last pre-release-2.0.0 reports zip here: 
https://github.com/tballison/share/blob/master/pdfbox_comparisons/reports_pdfbox_2_0_20160310.zip?raw=true

there's a file called textLostFromACausedByNewExceptionsInB.xlsx.  That 
documents what text 1.8.11 (with the classic parser) was able to extract from 
files that 2.0.0 (with the non-sequential parser) was not.  Nearly all of the 
"new" exceptions in 2.0.0 were caused by truncated files.

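For anyone who wants to build a similar report from their own runs, here is a rough sketch of the selection logic. It assumes the per-file results of both runs are already in memory; `ExtractResult` and `LostTextReport` are hypothetical names for illustration and do not reflect the layout of the actual xlsx.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical per-file outcome of one extraction run; not the format of the real reports. */
final class ExtractResult {
    final String text;       // extracted text, empty if none
    final String exception;  // exception class name, or null if parsing succeeded

    ExtractResult(String text, String exception) {
        this.text = text;
        this.exception = exception;
    }
}

final class LostTextReport {
    /**
     * Returns file name -> text that run A extracted, for files where run B
     * hit an exception that run A did not.
     */
    static Map<String, String> textLostFromA(Map<String, ExtractResult> runA,
                                             Map<String, ExtractResult> runB) {
        Map<String, String> lost = new LinkedHashMap<>();
        for (Map.Entry<String, ExtractResult> entry : runA.entrySet()) {
            ExtractResult a = entry.getValue();
            ExtractResult b = runB.get(entry.getKey());
            boolean newExceptionInB = b != null && b.exception != null && a.exception == null;
            if (newExceptionInB && !a.text.isEmpty()) {
                lost.put(entry.getKey(), a.text);
            }
        }
        return lost;
    }
}
```

Here run A would be 1.8.11 and run B would be 2.0.0. Files that failed in both runs are left out on purpose, since they are not "new" exceptions.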


[jira] [Commented] (TIKA-1285) Upgrade to PDFBox 2.0.0 when available

2016-03-28 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15214107#comment-15214107
 ] 

Tim Allison commented on TIKA-1285:
---

Y, that's what I was thinking about doing with shading+relocating.



[jira] [Commented] (TIKA-1285) Upgrade to PDFBox 2.0.0 when available

2016-03-25 Thread John Hewson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15212052#comment-15212052
 ] 

John Hewson commented on TIKA-1285:
---

It would be better to open JIRA issues for problem PDFs so that we can improve 
the 2.0 parser.



[jira] [Commented] (TIKA-1285) Upgrade to PDFBox 2.0.0 when available

2016-03-25 Thread John Hewson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15212049#comment-15212049
 ] 

John Hewson commented on TIKA-1285:
---

The parser and the rest of PDFBox are tightly coupled, so it's not possible to 
switch out the 2.0 parser for the 1.8 parser. You'd have to switch out the 
whole of PDFBox, which of course you could do.



[jira] [Commented] (TIKA-1285) Upgrade to PDFBox 2.0.0 when available

2016-03-23 Thread Luis Filipe Nassif (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15208360#comment-15208360
 ] 

Luis Filipe Nassif commented on TIKA-1285:
--

If the PDFBox team could distribute an o.a.pdfbox18, that would be great!



[jira] [Commented] (TIKA-1285) Upgrade to PDFBox 2.0.0 when available

2016-03-23 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15208296#comment-15208296
 ] 

Tim Allison commented on TIKA-1285:
---

Y, I've been thinking about this, too.  I wonder if we could shade/relocate 
PDFBox 1.8 ourselves, or perhaps ask our PDFBox colleagues to distribute a 
shaded+relocated 1.8 (o.a.pdfbox18...) that we could call with PDFParser18 or 
something.

If we can get the shading to work, this would be a perfect use case for the 
back-off composite parser (still in planning stages)-- if there's an exception 
with PDFBox 2.0.0, retry with PDFBox 1.8.x.
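
A minimal sketch of the back-off idea, assuming it can be reduced to "try each parser in order, return the first result that does not throw". `TextExtractor` and `BackOffExtractor` are hypothetical names, not Tika or PDFBox classes; the delegates stand in for wrappers around PDFBox 2.0.0 and a relocated 1.8.x.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical stand-in for a parser wrapper; not a Tika or PDFBox interface. */
interface TextExtractor {
    String extract(byte[] pdfBytes) throws Exception;
}

/** Tries each delegate in order and returns the first result that does not throw. */
final class BackOffExtractor implements TextExtractor {
    private final List<TextExtractor> delegates;

    BackOffExtractor(TextExtractor... delegates) {
        if (delegates.length == 0) {
            throw new IllegalArgumentException("at least one delegate is required");
        }
        this.delegates = new ArrayList<>(Arrays.asList(delegates));
    }

    @Override
    public String extract(byte[] pdfBytes) throws Exception {
        Exception first = null;
        for (TextExtractor delegate : delegates) {
            try {
                return delegate.extract(pdfBytes);
            } catch (Exception e) {
                // Remember the first failure; attach later ones as suppressed.
                if (first == null) {
                    first = e;
                } else {
                    first.addSuppressed(e);
                }
            }
        }
        throw first; // every delegate failed
    }
}
```

The first exception is rethrown when every delegate fails, so the caller still sees the error from the preferred parser. This only works if both PDFBox versions can sit on the same classpath, which is what the shading+relocating would have to solve.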



[jira] [Commented] (TIKA-1285) Upgrade to PDFBox 2.0.0 when available

2016-03-23 Thread Luis Filipe Nassif (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15208292#comment-15208292
 ] 

Luis Filipe Nassif commented on TIKA-1285:
--

Hi [~talli...@apache.org]

Is there any magic/recommendation for using both PDFBox 2.0 and 1.8 in the same 
app? Running ExtractText externally? Is there a better way? I am still 
interested in parsing truncated and damaged PDF files...



[jira] [Commented] (TIKA-1285) Upgrade to PDFBox 2.0.0 when available

2016-03-22 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15207398#comment-15207398
 ] 

Tim Allison commented on TIKA-1285:
---

1.13...not sure of timeframe for that



[jira] [Commented] (TIKA-1285) Upgrade to PDFBox 2.0.0 when available

2016-03-22 Thread Ben McCann (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15207164#comment-15207164
 ] 

Ben McCann commented on TIKA-1285:
--

Thanks so much Tim! Do you know what Tika release this will be a part of?

> Upgrade to PDFBox 2.0.0 when available
> --
>
> Key: TIKA-1285
> URL: https://issues.apache.org/jira/browse/TIKA-1285
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.6
>Reporter: Jeremy Anderson
>Priority: Minor
> Attachments: TIKA-1285.patch, TIKA-1285_rev1641423.patch, 
> TIKA-1285v3.patch, pdfbox_reports_2_0_0_20150709.zip, 
> testPDF_childAttachments.pdf
>
>
> This issue is to track fixes required when upgrading the PDFbox dependency to 
> 2.0.0 Final once it's available, and using PDFBox's daily build before then.
> See TIKA-1268 comment.
> Relates to PDFBOX-1893



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1285) Upgrade to PDFBox 2.0.0 when available

2016-03-22 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15207056#comment-15207056
 ] 

Hudson commented on TIKA-1285:
--

UNSTABLE: Integrated in tika-2.x #57 (See 
[https://builds.apache.org/job/tika-2.x/57/])
TIKA-1285 -- upgrade PDFBox to 2.0.0 in 2.x (tallison: rev 
7bc3eae94d79bbbf5dc50143c404af22c02446bc)
* tika-parser-modules/tika-parser-pdf-module/pom.xml
* tika-parser-modules/tika-parser-pdf-module/src/test/java/org/apache/tika/parser/pdf/PDFParserTest.java
* tika-parser-modules/tika-parser-pdf-module/src/main/java/org/apache/tika/parser/pdf/PDFEncodedStringDecoder.java
* tika-bundle/pom.xml
* tika-parser-modules/tika-parser-pdf-module/src/main/resources/org/apache/tika/parser/pdf/PDFParser.properties
* tika-parser-modules/pom.xml
* tika-parser-modules/tika-parser-xmp-commons/pom.xml
* tika-parser-bundles/tika-parser-pdf-bundle/pom.xml
* tika-parser-modules/tika-parser-multimedia-module/src/main/java/org/apache/tika/parser/font/TrueTypeParser.java
* tika-parser-modules/tika-parser-pdf-module/src/main/java/org/apache/tika/parser/pdf/PDFParserConfig.java
* tika-parser-modules/tika-parser-pdf-module/src/main/java/org/apache/tika/parser/pdf/PDF2XHTML.java
* tika-parser-modules/tika-parser-multimedia-module/src/main/java/org/apache/tika/parser/font/AdobeFontMetricParser.java
* CHANGES.txt
* tika-parser-modules/tika-parser-multimedia-module/src/test/java/org/apache/tika/parser/image/ImageParserTest.java
* tika-parser-modules/tika-parser-pdf-module/src/main/java/org/apache/tika/parser/pdf/PDFParser.java




[jira] [Commented] (TIKA-1285) Upgrade to PDFBox 2.0.0 when available

2016-03-22 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15206865#comment-15206865
 ] 

Tim Allison commented on TIKA-1285:
---

Finally, tika-devs, for the sake of tests, I followed PDFBox's test-scope 
inclusion of imageio:
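{code}
<dependency>
  <groupId>com.github.jai-imageio</groupId>
  <artifactId>jai-imageio-core</artifactId>
  <version>1.3.1</version>
  <scope>test</scope>
</dependency>
{code}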

If we don't want to include this even in the test scope, I'm happy taking it 
out.  We'll have to modify a unit test or two, but it will be trivial.



[jira] [Commented] (TIKA-1285) Upgrade to PDFBox 2.0.0 when available

2016-03-22 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15206855#comment-15206855
 ] 

Tim Allison commented on TIKA-1285:
---

We'll see what Hudson says, but I just pushed the mods to Tika's 2.x branch as 
well.

A few notes:
1) XMPBox is currently designed to handle PDF/A.  There were exceptions on 
roughly 40% of XMPs extracted from our test corpus.  We'll stick with jempbox 
1.8.x for now for XMP parsing.  We may consider migrating to Adobe's xmpcore.  
If anyone wants to help make XMPBox more robust, that'd be a huge service.  
Ref: [this 
email|https://mail-archives.apache.org/mod_mbox/pdfbox-dev/201603.mbox/%3C56DF3F6F.8000201%40lehmi.de%3E]

2) PDFBox 2.0 has gotten rid of the classic parser, and now all parsing is done 
by the non-sequential parser.  In my opinion, the PDFBox devs put a tremendous 
amount of work  into making this new parser quite robust.  However, for 
truncated or other truly damaged files, users may have some luck with the 
classic parser in 1.8.x.

3) PDFBox 2.0 no longer extracts TIFF files. See [this 
exchange|https://mail-archives.apache.org/mod_mbox/pdfbox-dev/201507.mbox/%3c559cca2c.7050...@t-online.de%3e],
 and consider adding the optional dependencies to handle TIFF, JPEG 2000 and ...

Other than those major points, in my opinion, PDFBox 2.0.0 should fix quite a 
few issues and is far more robust for bidi documents.

Many thanks to the PDFBox devs, especially [~lehmi], [~msahyoun] and [~tilman], 
for their work on PDFBox and for their collaboration on the eval process...more 
work remains on the latter. :)



[jira] [Commented] (TIKA-1285) Upgrade to PDFBox 2.0.0 when available

2016-03-21 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15205739#comment-15205739
 ] 

Hudson commented on TIKA-1285:
--

SUCCESS: Integrated in tika-trunk-jdk1.7 #932 (See 
[https://builds.apache.org/job/tika-trunk-jdk1.7/932/])
TIKA-1285 -- upgrade to PDFBox 2.0.0 -- for now turn off tests with (tallison: 
rev 9ebf066dd96783c952f4c2a37a2a02af2b0c5aa0)
* tika-parsers/src/test/java/org/apache/tika/parser/image/ImageParserTest.java




[jira] [Commented] (TIKA-1285) Upgrade to PDFBox 2.0.0 when available

2016-03-21 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15205676#comment-15205676
 ] 

Hudson commented on TIKA-1285:
--

UNSTABLE: Integrated in tika-trunk-jdk1.7 #931 (See 
[https://builds.apache.org/job/tika-trunk-jdk1.7/931/])
TIKA-1285 -- upgrade to PDFBox 2.0.0 (tallison: rev 
98eb56ec78f2e1d27de644f4f6647ea1cfbc930b)
* tika-parsers/src/test/java/org/apache/tika/parser/pdf/PDFParserTest.java
* tika-parsers/src/main/java/org/apache/tika/parser/font/TrueTypeParser.java
* tika-parsers/src/main/resources/org/apache/tika/parser/pdf/PDFParser.properties
* tika-bundle/pom.xml
* tika-parsers/pom.xml
* tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDFEncodedStringDecoder.java
* tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDFParser.java
* tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDFParserConfig.java
* CHANGES.txt
* tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDF2XHTML.java
* tika-parsers/src/main/java/org/apache/tika/parser/font/AdobeFontMetricParser.java




[jira] [Commented] (TIKA-1285) Upgrade to PDFBox 2.0.0 when available

2016-03-21 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15205425#comment-15205425
 ] 

Tim Allison commented on TIKA-1285:
---

PDFBox 2.0.0 was released this morning.  Will upgrade Tika over the next few 
days.



[jira] [Commented] (TIKA-1285) Upgrade to PDFBox 2.0.0 when available

2015-10-23 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14971696#comment-14971696
 ] 

Tim Allison commented on TIKA-1285:
---

Finished comparison of ~100k docs: 
[here|https://github.com/tballison/share/blob/master/pdfbox_comparisons/pdfbox_1_8_10V2_0_20151023.zip]



[jira] [Commented] (TIKA-1285) Upgrade to PDFBox 2.0.0 when available

2015-10-13 Thread Timo Boehme (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14954946#comment-14954946
 ] 

Timo Boehme commented on TIKA-1285:
---

Did you try the new memory settings? You can define a maximum amount of main 
memory for storing PDF streams; if more is required, PDFBox can use a 
temporary file (see {{load(File file, MemoryUsageSetting memUsageSetting)}} in 
{{PDDocument}}).
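
To illustrate the general idea only (this is not PDFBox's implementation), here is a small stdlib sketch of a buffer that keeps data in main memory up to a limit and then spills to a temporary file. `SpillingBuffer` and its methods are made-up names for this example.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

/**
 * Illustration only: buffers bytes in main memory up to a limit, then
 * moves everything to a temporary file. Not PDFBox's implementation.
 */
final class SpillingBuffer extends OutputStream {
    private final long maxMemoryBytes;
    private ByteArrayOutputStream memory = new ByteArrayOutputStream();
    private Path tempFile;         // null until the limit is exceeded
    private OutputStream fileOut;  // open only after spilling
    private long size;

    SpillingBuffer(long maxMemoryBytes) {
        this.maxMemoryBytes = maxMemoryBytes;
    }

    @Override
    public void write(int b) throws IOException {
        if (fileOut == null && size + 1 > maxMemoryBytes) {
            spill();
        }
        if (fileOut != null) {
            fileOut.write(b);
        } else {
            memory.write(b);
        }
        size++;
    }

    /** Copies the in-memory bytes to a temp file and releases the heap copy. */
    private void spill() throws IOException {
        tempFile = Files.createTempFile("spill", ".tmp");
        fileOut = Files.newOutputStream(tempFile);
        memory.writeTo(fileOut);
        memory = null;
    }

    boolean isOnDisk() {
        return tempFile != null;
    }

    long size() {
        return size;
    }

    @Override
    public void close() throws IOException {
        if (fileOut != null) {
            fileOut.close();
            Files.deleteIfExists(tempFile);
        }
    }
}
```

With PDFBox 2.0 itself none of this needs to be written by hand. As far as I know, the equivalent is to pass something like `MemoryUsageSetting.setupMixed(maxMainMemoryBytes)` to the `load` method mentioned above, but please check the Javadoc for the exact factory methods.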



[jira] [Commented] (TIKA-1285) Upgrade to PDFBox 2.0.0 when available

2015-10-13 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14954947#comment-14954947
 ] 

Tim Allison commented on TIKA-1285:
---

Y, that's the first thing on my todo list on our wrapper -- integrate the 
MemoryUsageSetting, which is very, very cool. I should have a chance to add 
that by the end of this week, and then we'll see.



[jira] [Commented] (TIKA-1285) Upgrade to PDFBox 2.0.0 when available

2015-10-13 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14954860#comment-14954860
 ] 

Tim Allison commented on TIKA-1285:
---

Thank you for testing the dev wrapper and PDFBox 2.0, and thank you for the 
comments over on github.

Out of curiosity, what type of testing did you do?  How many docs?  How did you 
compare, etc?

My sense is that my Linux vm is killing the batch process quite a bit more 
often with 2.0 than with 1.8.x...because of memory issues.

What type of load were you running?  Did you see any memory issues?



[jira] [Commented] (TIKA-1285) Upgrade to PDFBox 2.0.0 when available

2015-10-13 Thread Ben McCann (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14955248#comment-14955248
 ] 

Ben McCann commented on TIKA-1285:
--

I didn't really do any load or memory testing. My testing was focused on 
accuracy of converting pdfs to text on a few hundred documents.



[jira] [Commented] (TIKA-1285) Upgrade to PDFBox 2.0.0 when available

2015-10-10 Thread Ben McCann (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14952137#comment-14952137
 ] 

Ben McCann commented on TIKA-1285:
--

I did a bunch of testing today. It works pretty much as well as 1.8 did. One 
issue caused me some trouble: it seems to be inserting extraneous spaces. See 
https://issues.apache.org/jira/browse/PDFBOX-3019
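
For anyone comparing output between versions, a check along these lines can separate pure whitespace differences from real text changes. It is only a sketch, and `WhitespaceDiff` is a made-up helper, not part of Tika or PDFBox.

```java
/** Hypothetical helper for comparing text extracted by two parser versions. */
final class WhitespaceDiff {
    /** True if the two strings differ, but only in their whitespace. */
    static boolean differsOnlyByWhitespace(String oldText, String newText) {
        return !oldText.equals(newText)
                && stripWhitespace(oldText).equals(stripWhitespace(newText));
    }

    private static String stripWhitespace(String s) {
        return s.replaceAll("\\s+", "");
    }
}
```

Documents flagged by this check are likely candidates for the extraneous-space problem, while the rest have differences worth reading by hand.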



[jira] [Commented] (TIKA-1285) Upgrade to PDFBox 2.0.0 when available

2015-10-09 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14950223#comment-14950223
 ] 

Tim Allison commented on TIKA-1285:
---

No problem at all...I still need to run against our batch as well. :(



[jira] [Commented] (TIKA-1285) Upgrade to PDFBox 2.0.0 when available

2015-10-08 Thread Arkady Zalkowitsch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14949845#comment-14949845
 ] 

Arkady Zalkowitsch commented on TIKA-1285:
--

Ok, I will do this tomorrow. I have project release today. =P
Thanks a lot!



[jira] [Commented] (TIKA-1285) Upgrade to PDFBox 2.0.0 when available

2015-10-05 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14943339#comment-14943339
 ] 

Tim Allison commented on TIKA-1285:
---

Thank you, [~b...@benmccann.com]!  The more eyes we have on this the better for 
both projects.

An updated working wrapper is available 
[here|https://github.com/tballison/tika/tree/pdfbox2_0].  Some cleanup 
remains...

[~arkadyzalko], would you be willing to run this on your batch of docs and let 
us know what you find?



[jira] [Commented] (TIKA-1285) Upgrade to PDFBox 2.0.0 when available

2015-10-03 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14942248#comment-14942248
 ] 

Tim Allison commented on TIKA-1285:
---

Completely agree. If I update the PDFBox 2.0 branch of Tika on my github site, 
would you be willing to run tests on your documents?



[jira] [Commented] (TIKA-1285) Upgrade to PDFBox 2.0.0 when available

2015-10-03 Thread Ben McCann (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14942382#comment-14942382
 ] 

Ben McCann commented on TIKA-1285:
--

Yeah, that'd be great



[jira] [Commented] (TIKA-1285) Upgrade to PDFBox 2.0.0 when available

2015-10-02 Thread Arkady Zalkowitsch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14941850#comment-14941850
 ] 

Arkady Zalkowitsch commented on TIKA-1285:
--

I've opened an issue that should be resolved once you guys upgrade PDFBox:
https://issues.apache.org/jira/browse/PDFBOX-3004

Good luck ;)



[jira] [Commented] (TIKA-1285) Upgrade to PDFBox 2.0.0 when available

2015-10-02 Thread Ben McCann (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14942086#comment-14942086
 ] 

Ben McCann commented on TIKA-1285:
--

I expect a PDFBox 2.0 RC soon. There are only 5 issues still open marked as Fix 
Version 2.0: 
https://issues.apache.org/jira/browse/PDFBOX-2883?jql=project%20%3D%20PDFBOX%20AND%20resolution%20%3D%20Unresolved%20AND%20fixVersion%20%3D%202.0.0%20ORDER%20BY%20priority%20DESC

It'd probably be worth testing against the latest PDFBox again now, so we can 
give them a heads-up about any issues we know of.



[jira] [Commented] (TIKA-1285) Upgrade to PDFBox 2.0.0 when available

2015-07-20 Thread jayesh (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14633637#comment-14633637
 ] 

jayesh commented on TIKA-1285:
--

Any idea, guys, when we can accommodate PDFBox 2.0 in Tika?

Thanks.



[jira] [Commented] (TIKA-1285) Upgrade to PDFBox 2.0.0 when available

2015-07-20 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14633643#comment-14633643
 ] 

Tim Allison commented on TIKA-1285:
---

Still hammering out some issues. If regression tests go well, I'd say a few 
weeks after PDFBox 2.0 is released.  There's still quite a bit of important 
performance work underway in PDFBox.

Are there specific features that 2.0 has that you need?



[jira] [Commented] (TIKA-1285) Upgrade to PDFBox 2.0.0 when available

2015-07-20 Thread jayesh (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14633655#comment-14633655
 ] 

jayesh commented on TIKA-1285:
--

org.apache.fontbox.ttf.TrueTypeFont initializeTable
SEVERE: An error occured when reading table hmtx
java.io.EOFException


org.apache.fontbox.util.FontManager findTTFontname
WARNING: Font not found: Verdana

After googling, I found that the above errors and some others were fixed in 
PDFBox 2.0. 
Hence I was curious to know when that will be available in Tika.



[jira] [Commented] (TIKA-1285) Upgrade to PDFBox 2.0.0 when available

2014-09-04 Thread Jeremy Anderson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14121991#comment-14121991
 ] 

Jeremy Anderson commented on TIKA-1285:
---

Updated patch to include fixes as of revision 1621674 on Sept 4th.  Major fixes 
include syncing up to a PDFBox snapshot taken after Jempbox was replaced by 
XmpBox.

XmpBox still requires some refinement to properly handle all of the XMP 
packages encountered by Tika's unit tests.  Some of these cases have been 
commented out until DomXmpParser can resolve them.

These issues are not yet reported in the PDFBOX JIRA, as I'm not sure how to 
proceed with them.  The common DomXmpParser issues encountered are:
* Invalid array definition, expecting Alt and found nothing [prefix=dc; 
name=title]
* Invalid array type, expecting Seq and found Bag [prefix=dc; name=creator]
* No type defined for {http://ns.adobe.com/pdf/1.3/}Trapped
* Cannot find a definition for the namespace http://ns.adobe.com/pdfx/1.3/
* xmp should start with a processing instruction
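
To make the array-type complaints above concrete, here is a rough sketch using 
only the JDK's DOM parser. It is not XmpBox's or Tika's code; the class name 
LenientDcReader and its values() method are hypothetical, made up for this 
illustration. It reads a dc:* property without caring whether the values sit 
in rdf:Alt, rdf:Seq, rdf:Bag, or no container at all, which is roughly the 
leniency the failing unit tests would need.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Hypothetical helper for illustration only; not part of Tika or XmpBox. */
public class LenientDcReader {
    static final String DC = "http://purl.org/dc/elements/1.1/";
    static final String RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";

    /** Returns the values of a Dublin Core property, whatever container holds them. */
    public static List<String> values(String xmp, String property) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        // XMP packets should not carry a DOCTYPE, so refuse one outright.
        factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
        Document doc = factory.newDocumentBuilder()
                .parse(new ByteArrayInputStream(xmp.getBytes(StandardCharsets.UTF_8)));

        List<String> out = new ArrayList<>();
        NodeList props = doc.getElementsByTagNameNS(DC, property);
        for (int i = 0; i < props.getLength(); i++) {
            Element prop = (Element) props.item(i);
            // rdf:li items are collected regardless of the Alt/Seq/Bag wrapper.
            NodeList items = prop.getElementsByTagNameNS(RDF, "li");
            if (items.getLength() == 0) {
                // No array wrapper at all: fall back to the plain text value.
                String text = prop.getTextContent().trim();
                if (!text.isEmpty()) {
                    out.add(text);
                }
            } else {
                for (int j = 0; j < items.getLength(); j++) {
                    out.add(items.item(j).getTextContent().trim());
                }
            }
        }
        return out;
    }
}
```

A strict parser would reject a dc:creator held in an rdf:Bag or a dc:title 
with no rdf:Alt; this sketch simply returns the text in both cases. Whether 
that leniency belongs in the parser or in the caller is a separate question.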


The patch works in conjunction with PDFBOX-2318.


> Upgrade to PDFBox 2.0.0 when available
> --
>
> Key: TIKA-1285
> URL: https://issues.apache.org/jira/browse/TIKA-1285
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.6
>Reporter: Jeremy Anderson
>Priority: Minor
> Attachments: TIKA-1285.patch
>
>
> This issue is to track fixes required when upgrading the PDFbox dependency to 
> 2.0.0 Final once it's available, and using PDFBox's daily build before then.
> See TIKA-1268 comment.
> Relates to PDFBOX-1893


