[jira] [Commented] (TIKA-1285) Upgrade to PDFBox 2.0.0 when available
[ https://issues.apache.org/jira/browse/TIKA-1285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15215855#comment-15215855 ]

Tim Allison commented on TIKA-1285:

I opened TIKA-1912 to track this issue.

> Upgrade to PDFBox 2.0.0 when available
>
> Key: TIKA-1285
> URL: https://issues.apache.org/jira/browse/TIKA-1285
> Project: Tika
> Issue Type: Improvement
> Components: parser
> Affects Versions: 1.6
> Reporter: Jeremy Anderson
> Fix For: 1.13
>
> Attachments: TIKA-1285.patch, TIKA-1285_rev1641423.patch, TIKA-1285v3.patch, pdfbox_reports_2_0_0_20150709.zip, testPDF_childAttachments.pdf
>
> This issue is to track fixes required when upgrading the PDFbox dependency to 2.0.0 Final once it's available, and using PDFBox's daily build before then.
> See TIKA-1268 comment.
> Relates to PDFBOX-1893

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1285) Upgrade to PDFBox 2.0.0 when available
[ https://issues.apache.org/jira/browse/TIKA-1285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15214111#comment-15214111 ]

Tim Allison commented on TIKA-1285:

As I mentioned on the pdfbox dev list, I'm hesitant to waste your time by submitting issues for truncated files. If AR can't parse it, I wouldn't expect PDFBox to have much luck. However, the classic parser in 1.8 was able to get some text+metadata out of some truncated files.

In my last pre-release 2.0.0 reports zip (https://github.com/tballison/share/blob/master/pdfbox_comparisons/reports_pdfbox_2_0_20160310.zip?raw=true) there's a file called textLostFromACausedByNewExceptionsInB.xlsx. It documents the text that 1.8.11 (with the classic parser) was able to extract from files that 2.0.0 (with the non-sequential parser) could not handle. Nearly all of the "new" exceptions in 2.0.0 were caused by truncated files.
[jira] [Commented] (TIKA-1285) Upgrade to PDFBox 2.0.0 when available
[ https://issues.apache.org/jira/browse/TIKA-1285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15214107#comment-15214107 ]

Tim Allison commented on TIKA-1285:

Y, that's what I was thinking about doing with shading+relocating.
[jira] [Commented] (TIKA-1285) Upgrade to PDFBox 2.0.0 when available
[ https://issues.apache.org/jira/browse/TIKA-1285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15212052#comment-15212052 ]

John Hewson commented on TIKA-1285:

It would be better to open JIRA issues for problem PDFs so that we can improve the 2.0 parser.
[jira] [Commented] (TIKA-1285) Upgrade to PDFBox 2.0.0 when available
[ https://issues.apache.org/jira/browse/TIKA-1285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15212049#comment-15212049 ]

John Hewson commented on TIKA-1285:

The parser and the rest of PDFBox are tightly coupled, so it's not possible to switch out the 2.0 parser for the 1.8 parser. You'd have to switch out the whole of PDFBox, which of course you could do.
[jira] [Commented] (TIKA-1285) Upgrade to PDFBox 2.0.0 when available
[ https://issues.apache.org/jira/browse/TIKA-1285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15208360#comment-15208360 ]

Luis Filipe Nassif commented on TIKA-1285:

If the PDFBox team could distribute an o.a.pdfbox18, that would be great!
[jira] [Commented] (TIKA-1285) Upgrade to PDFBox 2.0.0 when available
[ https://issues.apache.org/jira/browse/TIKA-1285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15208296#comment-15208296 ]

Tim Allison commented on TIKA-1285:

Y, I've been thinking about this, too. I wonder if we could shade/relocate PDFBox 1.8 ourselves, or perhaps ask our PDFBox colleagues to distribute a shaded+relocated 1.8 (o.a.pdfbox18...) that we could call with PDFParser18 or something. If we can get the shading to work, this would be a perfect use case for the back-off composite parser (still in planning stages): if there's an exception with PDFBox 2.0.0, retry with PDFBox 1.8.x.
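The back-off idea described in the comment above can be sketched in plain Java. This is only an illustration under stated assumptions: `TextExtractor`, `BackOffExtractor`, and `BackOffDemo` are hypothetical names, not Tika or PDFBox classes, and the two lambdas merely stand in for a PDFBox 2.0.0 based parser and a relocated PDFBox 1.8.x based parser.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

/** Demo entry point: the first extractor fails, the second one is used instead. */
final class BackOffDemo {
    public static void main(String[] args) throws Exception {
        // Stand-in for a strict parser that rejects a truncated file (assumption).
        TextExtractor strict = bytes -> { throw new IOException("truncated file"); };
        // Stand-in for a more lenient parser that salvages some text (assumption).
        TextExtractor lenient = bytes -> "partial text";
        System.out.println(new BackOffExtractor(strict, lenient).extract(new byte[0]));
    }
}

/** Hypothetical extractor interface; not part of Tika or PDFBox. */
interface TextExtractor {
    String extract(byte[] pdfBytes) throws Exception;
}

/** Tries each delegate in order and returns the first successful result. */
final class BackOffExtractor implements TextExtractor {
    private final List<TextExtractor> delegates;

    BackOffExtractor(TextExtractor... delegates) {
        this.delegates = Arrays.asList(delegates);
    }

    @Override
    public String extract(byte[] pdfBytes) throws Exception {
        Exception first = null;
        for (TextExtractor delegate : delegates) {
            try {
                return delegate.extract(pdfBytes);
            } catch (Exception e) {
                // Remember the first failure and attach later ones to it.
                if (first == null) {
                    first = e;
                } else {
                    first.addSuppressed(e);
                }
            }
        }
        if (first == null) {
            throw new IllegalStateException("no extractors configured");
        }
        throw first;
    }
}
```

Keeping the first exception as the primary error and attaching later ones as suppressed exceptions means a caller still sees why the preferred parser failed, even when every fallback fails as well.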
[jira] [Commented] (TIKA-1285) Upgrade to PDFBox 2.0.0 when available
[ https://issues.apache.org/jira/browse/TIKA-1285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15208292#comment-15208292 ]

Luis Filipe Nassif commented on TIKA-1285:

Hi [~talli...@apache.org]

Is there any magic/recommendation for using both PDFBox 2.0 and 1.8 in the same app? Running ExtractText externally? Is there a better way? I am still interested in parsing truncated and damaged PDF files...
[jira] [Commented] (TIKA-1285) Upgrade to PDFBox 2.0.0 when available
[ https://issues.apache.org/jira/browse/TIKA-1285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15207398#comment-15207398 ]

Tim Allison commented on TIKA-1285:

1.13...not sure of timeframe for that
[jira] [Commented] (TIKA-1285) Upgrade to PDFBox 2.0.0 when available
[ https://issues.apache.org/jira/browse/TIKA-1285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15207164#comment-15207164 ] Ben McCann commented on TIKA-1285: -- Thanks so much Tim! Do you know what Tika release this will be a part of? > Upgrade to PDFBox 2.0.0 when available > -- > > Key: TIKA-1285 > URL: https://issues.apache.org/jira/browse/TIKA-1285 > Project: Tika > Issue Type: Improvement > Components: parser >Affects Versions: 1.6 >Reporter: Jeremy Anderson >Priority: Minor > Attachments: TIKA-1285.patch, TIKA-1285_rev1641423.patch, > TIKA-1285v3.patch, pdfbox_reports_2_0_0_20150709.zip, > testPDF_childAttachments.pdf > > > This issue is to track fixes required when upgrading the PDFbox dependency to > 2.0.0 Final once it's available, and using PDFBox's daily build before then. > See TIKA-1268 comment. > Relates to PDFBOX-1893 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1285) Upgrade to PDFBox 2.0.0 when available
[ https://issues.apache.org/jira/browse/TIKA-1285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15207056#comment-15207056 ]

Hudson commented on TIKA-1285:

UNSTABLE: Integrated in tika-2.x #57 (See [https://builds.apache.org/job/tika-2.x/57/])
TIKA-1285 -- upgrade PDFBox to 2.0.0 in 2.x (tallison: rev 7bc3eae94d79bbbf5dc50143c404af22c02446bc)
* tika-parser-modules/tika-parser-pdf-module/pom.xml
* tika-parser-modules/tika-parser-pdf-module/src/test/java/org/apache/tika/parser/pdf/PDFParserTest.java
* tika-parser-modules/tika-parser-pdf-module/src/main/java/org/apache/tika/parser/pdf/PDFEncodedStringDecoder.java
* tika-bundle/pom.xml
* tika-parser-modules/tika-parser-pdf-module/src/main/resources/org/apache/tika/parser/pdf/PDFParser.properties
* tika-parser-modules/pom.xml
* tika-parser-modules/tika-parser-xmp-commons/pom.xml
* tika-parser-bundles/tika-parser-pdf-bundle/pom.xml
* tika-parser-modules/tika-parser-multimedia-module/src/main/java/org/apache/tika/parser/font/TrueTypeParser.java
* tika-parser-modules/tika-parser-pdf-module/src/main/java/org/apache/tika/parser/pdf/PDFParserConfig.java
* tika-parser-modules/tika-parser-pdf-module/src/main/java/org/apache/tika/parser/pdf/PDF2XHTML.java
* tika-parser-modules/tika-parser-multimedia-module/src/main/java/org/apache/tika/parser/font/AdobeFontMetricParser.java
* CHANGES.txt
* tika-parser-modules/tika-parser-multimedia-module/src/test/java/org/apache/tika/parser/image/ImageParserTest.java
* tika-parser-modules/tika-parser-pdf-module/src/main/java/org/apache/tika/parser/pdf/PDFParser.java
[jira] [Commented] (TIKA-1285) Upgrade to PDFBox 2.0.0 when available
[ https://issues.apache.org/jira/browse/TIKA-1285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15206865#comment-15206865 ]

Tim Allison commented on TIKA-1285:

Finally, tika-devs, for the sake of tests, I followed PDFBox's test-scope inclusion of imageio:
{code}
<dependency>
  <groupId>com.github.jai-imageio</groupId>
  <artifactId>jai-imageio-core</artifactId>
  <version>1.3.1</version>
  <scope>test</scope>
</dependency>
{code}
If we don't want to include this even in the test scope, I'm happy taking it out. We'll have to modify a unit test or two, but it will be trivial.
[jira] [Commented] (TIKA-1285) Upgrade to PDFBox 2.0.0 when available
[ https://issues.apache.org/jira/browse/TIKA-1285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15206855#comment-15206855 ]

Tim Allison commented on TIKA-1285:

We'll see what Hudson says, but I just pushed the mods to Tika's 2.x branch as well.

A few notes:

1) XMPBox is currently designed to handle PDF/A. There were exceptions on roughly 40% of XMPs extracted from our test corpus. We'll stick with jempbox 1.8.x for now for XMP parsing. We may consider migrating to Adobe's xmpcore. If anyone wants to help make XMPBox more robust, that'd be a huge service. Ref: [this email|https://mail-archives.apache.org/mod_mbox/pdfbox-dev/201603.mbox/%3C56DF3F6F.8000201%40lehmi.de%3E]

2) PDFBox 2.0 has gotten rid of the classic parser, and now all parsing is done by the non-sequential parser. In my opinion, the PDFBox devs put a tremendous amount of work into making this new parser quite robust. However, for truncated or other truly damaged files, users may have some luck with the classic parser in 1.8.x.

3) PDFBox 2.0 no longer extracts tiff files. See [this exchange|https://mail-archives.apache.org/mod_mbox/pdfbox-dev/201507.mbox/%3c559cca2c.7050...@t-online.de%3e], and consider adding the optional dependencies to handle Tiffs, jpeg2000 and ...

Other than those major points, in my opinion, PDFBox 2.0.0 should fix quite a few issues and is far more robust for bidi documents.

Many thanks to the PDFBox devs, especially [~lehmi], [~msahyoun] and [~tilman], for their work on PDFBox and for their collaboration on the eval process; more work remains on the latter. :)
[jira] [Commented] (TIKA-1285) Upgrade to PDFBox 2.0.0 when available
[ https://issues.apache.org/jira/browse/TIKA-1285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15205739#comment-15205739 ]

Hudson commented on TIKA-1285:

SUCCESS: Integrated in tika-trunk-jdk1.7 #932 (See [https://builds.apache.org/job/tika-trunk-jdk1.7/932/])
TIKA-1285 -- upgrade to PDFBox 2.0.0 -- for now turn off tests with (tallison: rev 9ebf066dd96783c952f4c2a37a2a02af2b0c5aa0)
* tika-parsers/src/test/java/org/apache/tika/parser/image/ImageParserTest.java
[jira] [Commented] (TIKA-1285) Upgrade to PDFBox 2.0.0 when available
[ https://issues.apache.org/jira/browse/TIKA-1285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15205676#comment-15205676 ]

Hudson commented on TIKA-1285:

UNSTABLE: Integrated in tika-trunk-jdk1.7 #931 (See [https://builds.apache.org/job/tika-trunk-jdk1.7/931/])
TIKA-1285 -- upgrade to PDFBox 2.0.0 (tallison: rev 98eb56ec78f2e1d27de644f4f6647ea1cfbc930b)
* tika-parsers/src/test/java/org/apache/tika/parser/pdf/PDFParserTest.java
* tika-parsers/src/main/java/org/apache/tika/parser/font/TrueTypeParser.java
* tika-parsers/src/main/resources/org/apache/tika/parser/pdf/PDFParser.properties
* tika-bundle/pom.xml
* tika-parsers/pom.xml
* tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDFEncodedStringDecoder.java
* tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDFParser.java
* tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDFParserConfig.java
* CHANGES.txt
* tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDF2XHTML.java
* tika-parsers/src/main/java/org/apache/tika/parser/font/AdobeFontMetricParser.java
[jira] [Commented] (TIKA-1285) Upgrade to PDFBox 2.0.0 when available
[ https://issues.apache.org/jira/browse/TIKA-1285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15205425#comment-15205425 ]

Tim Allison commented on TIKA-1285:

PDFBox 2.0.0 was released this morning. Will upgrade Tika over the next few days.
[jira] [Commented] (TIKA-1285) Upgrade to PDFBox 2.0.0 when available
[ https://issues.apache.org/jira/browse/TIKA-1285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14971696#comment-14971696 ]

Tim Allison commented on TIKA-1285:

Finished comparison of ~100k docs: [here|https://github.com/tballison/share/blob/master/pdfbox_comparisons/pdfbox_1_8_10V2_0_20151023.zip]
[jira] [Commented] (TIKA-1285) Upgrade to PDFBox 2.0.0 when available
[ https://issues.apache.org/jira/browse/TIKA-1285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14954946#comment-14954946 ]

Timo Boehme commented on TIKA-1285:

Did you try the new memory settings? You can define a maximum amount of main memory for storing PDF streams, and if more is required PDFBox can use a temporary file (see {{load(File file, MemoryUsageSetting memUsageSetting)}} in {{PDDocument}}).
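To make the "main memory first, temporary file afterwards" behaviour concrete, here is a minimal sketch of the general technique in plain Java. It is not PDFBox's implementation and does not use the PDFBox API; `SpillBuffer` and its methods are hypothetical names chosen for illustration, and the real entry point remains the {{load(File file, MemoryUsageSetting memUsageSetting)}} method mentioned in the comment above.

```java
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;

/** Hypothetical buffer: stays in memory up to a limit, then moves to a temp file. */
final class SpillBuffer extends OutputStream {
    private final long maxMemoryBytes;
    private ByteArrayOutputStream memory = new ByteArrayOutputStream();
    private OutputStream fileOut; // non-null once the buffer has spilled to disk
    private File tempFile;
    private long size;

    SpillBuffer(long maxMemoryBytes) {
        this.maxMemoryBytes = maxMemoryBytes;
    }

    @Override
    public void write(int b) throws IOException {
        // Spill before the write that would exceed the in-memory limit.
        if (fileOut == null && size + 1 > maxMemoryBytes) {
            spill();
        }
        if (fileOut != null) {
            fileOut.write(b);
        } else {
            memory.write(b);
        }
        size++;
    }

    /** Copies everything buffered so far into a temp file and releases the heap copy. */
    private void spill() throws IOException {
        tempFile = File.createTempFile("spill", ".tmp");
        tempFile.deleteOnExit();
        fileOut = new FileOutputStream(tempFile);
        memory.writeTo(fileOut);
        memory = null;
    }

    boolean isSpilled() {
        return fileOut != null;
    }

    long size() {
        return size;
    }

    @Override
    public void close() throws IOException {
        if (fileOut != null) {
            fileOut.close();
        }
        if (tempFile != null) {
            tempFile.delete();
        }
    }

    public static void main(String[] args) throws IOException {
        SpillBuffer buffer = new SpillBuffer(1024); // keep at most 1 KiB on the heap
        buffer.write(new byte[4096]);               // larger than the limit, so it spills
        System.out.println("spilled=" + buffer.isSpilled() + ", size=" + buffer.size());
        buffer.close();
    }
}
```

The trade-off is the usual one: a low limit protects the heap when many large documents are processed in a batch, at the cost of slower disk-backed access for documents that exceed it.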
[jira] [Commented] (TIKA-1285) Upgrade to PDFBox 2.0.0 when available
[ https://issues.apache.org/jira/browse/TIKA-1285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14954947#comment-14954947 ]

Tim Allison commented on TIKA-1285:

Y, that's the first thing on my todo list on our wrapper -- integrate the MemoryUsageSetting, which is very, very cool. I should have a chance to add that by the end of this week, and then we'll see.
[jira] [Commented] (TIKA-1285) Upgrade to PDFBox 2.0.0 when available
[ https://issues.apache.org/jira/browse/TIKA-1285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14954860#comment-14954860 ]

Tim Allison commented on TIKA-1285:

Thank you for testing the dev wrapper and PDFBox 2.0, and thank you for the comments over on github.

Out of curiosity, what type of testing did you do? How many docs? How did you compare, etc.?

My sense is that my Linux VM is killing the batch process quite a bit more often with 2.0 than with 1.8.x because of memory issues. What type of load were you running? Did you see any memory issues?
[jira] [Commented] (TIKA-1285) Upgrade to PDFBox 2.0.0 when available
[ https://issues.apache.org/jira/browse/TIKA-1285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14955248#comment-14955248 ]

Ben McCann commented on TIKA-1285:

I didn't really do any load or memory testing. My testing focused on the accuracy of converting PDFs to text, on a few hundred documents.
[jira] [Commented] (TIKA-1285) Upgrade to PDFBox 2.0.0 when available
[ https://issues.apache.org/jira/browse/TIKA-1285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14952137#comment-14952137 ]

Ben McCann commented on TIKA-1285:

I did a bunch of testing today. It works pretty much as well as 1.8 did. The one issue that caused me some trouble is that it seems to be inserting extraneous spaces. See https://issues.apache.org/jira/browse/PDFBOX-3019
[jira] [Commented] (TIKA-1285) Upgrade to PDFBox 2.0.0 when available
[ https://issues.apache.org/jira/browse/TIKA-1285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14950223#comment-14950223 ]

Tim Allison commented on TIKA-1285:

No problem at all...I still need to run against our batch as well. :(
[jira] [Commented] (TIKA-1285) Upgrade to PDFBox 2.0.0 when available
[ https://issues.apache.org/jira/browse/TIKA-1285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14949845#comment-14949845 ]

Arkady Zalkowitsch commented on TIKA-1285:

Ok, I will do this tomorrow. I have a project release today. =P Thanks a lot!
[jira] [Commented] (TIKA-1285) Upgrade to PDFBox 2.0.0 when available
[ https://issues.apache.org/jira/browse/TIKA-1285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14943339#comment-14943339 ]

Tim Allison commented on TIKA-1285:

Thank you, [~b...@benmccann.com]! The more eyes we have on this the better for both projects.

Updated working wrapper is available [here|https://github.com/tballison/tika/tree/pdfbox2_0]. Some clean up remains...

[~arkadyzalko], would you be willing to run this on your batch of docs and let us know what you find?
[jira] [Commented] (TIKA-1285) Upgrade to PDFBox 2.0.0 when available
[ https://issues.apache.org/jira/browse/TIKA-1285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14942248#comment-14942248 ]

Tim Allison commented on TIKA-1285:

Completely agree. If I update the PDFBox 2.0 branch of Tika on my github site, would you be willing to run tests on your documents?
[jira] [Commented] (TIKA-1285) Upgrade to PDFBox 2.0.0 when available
[ https://issues.apache.org/jira/browse/TIKA-1285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14942382#comment-14942382 ] Ben McCann commented on TIKA-1285: -- Yeah, that'd be great > Upgrade to PDFBox 2.0.0 when available > -- > > Key: TIKA-1285 > URL: https://issues.apache.org/jira/browse/TIKA-1285 > Project: Tika > Issue Type: Improvement > Components: parser >Affects Versions: 1.6 >Reporter: Jeremy Anderson >Priority: Minor > Attachments: TIKA-1285.patch, TIKA-1285_rev1641423.patch, > TIKA-1285v3.patch, pdfbox_reports_2_0_0_20150709.zip, > testPDF_childAttachments.pdf > > > This issue is to track fixes required when upgrading the PDFbox dependency to > 2.0.0 Final once it's available, and using PDFBox's daily build before then. > See TIKA-1268 comment. > Relates to PDFBOX-1893 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1285) Upgrade to PDFBox 2.0.0 when available
[ https://issues.apache.org/jira/browse/TIKA-1285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14941850#comment-14941850 ] Arkady Zalkowitsch commented on TIKA-1285: -- I've opened an issue that should get resolved once you upgrade PDFBox: https://issues.apache.org/jira/browse/PDFBOX-3004 Good luck ;) > Upgrade to PDFBox 2.0.0 when available > -- > > Key: TIKA-1285 > URL: https://issues.apache.org/jira/browse/TIKA-1285 > Project: Tika > Issue Type: Improvement > Components: parser >Affects Versions: 1.6 >Reporter: Jeremy Anderson >Priority: Minor > Attachments: TIKA-1285.patch, TIKA-1285_rev1641423.patch, > TIKA-1285v3.patch, pdfbox_reports_2_0_0_20150709.zip, > testPDF_childAttachments.pdf > > > This issue is to track fixes required when upgrading the PDFbox dependency to > 2.0.0 Final once it's available, and using PDFBox's daily build before then. > See TIKA-1268 comment. > Relates to PDFBOX-1893 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1285) Upgrade to PDFBox 2.0.0 when available
[ https://issues.apache.org/jira/browse/TIKA-1285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14942086#comment-14942086 ] Ben McCann commented on TIKA-1285: -- I expect a Pdfbox 2.0 RC soon. There are only 5 issues still open marked as Fix Version 2.0 - https://issues.apache.org/jira/browse/PDFBOX-2883?jql=project%20%3D%20PDFBOX%20AND%20resolution%20%3D%20Unresolved%20AND%20fixVersion%20%3D%202.0.0%20ORDER%20BY%20priority%20DESC It'd probably be worth testing against the latest pdfbox again now to be able to give them a heads up if there are any issues we know of > Upgrade to PDFBox 2.0.0 when available > -- > > Key: TIKA-1285 > URL: https://issues.apache.org/jira/browse/TIKA-1285 > Project: Tika > Issue Type: Improvement > Components: parser >Affects Versions: 1.6 >Reporter: Jeremy Anderson >Priority: Minor > Attachments: TIKA-1285.patch, TIKA-1285_rev1641423.patch, > TIKA-1285v3.patch, pdfbox_reports_2_0_0_20150709.zip, > testPDF_childAttachments.pdf > > > This issue is to track fixes required when upgrading the PDFbox dependency to > 2.0.0 Final once it's available, and using PDFBox's daily build before then. > See TIKA-1268 comment. > Relates to PDFBOX-1893 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1285) Upgrade to PDFBox 2.0.0 when available
[ https://issues.apache.org/jira/browse/TIKA-1285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14633637#comment-14633637 ] jayesh commented on TIKA-1285: -- Any idea when Tika will be able to accommodate PDFBox 2.0? Thanks. Upgrade to PDFBox 2.0.0 when available -- Key: TIKA-1285 URL: https://issues.apache.org/jira/browse/TIKA-1285 Project: Tika Issue Type: Improvement Components: parser Affects Versions: 1.6 Reporter: Jeremy Anderson Priority: Minor Attachments: TIKA-1285.patch, TIKA-1285_rev1641423.patch, TIKA-1285v3.patch, pdfbox_reports_2_0_0_20150709.zip, testPDF_childAttachments.pdf This issue is to track fixes required when upgrading the PDFbox dependency to 2.0.0 Final once it's available, and using PDFBox's daily build before then. See TIKA-1268 comment. Relates to PDFBOX-1893 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1285) Upgrade to PDFBox 2.0.0 when available
[ https://issues.apache.org/jira/browse/TIKA-1285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14633643#comment-14633643 ] Tim Allison commented on TIKA-1285: --- Still hammering out some issues. If regression tests go well, I'd say a few weeks after PDFBox 2.0 is released. There's still quite a bit of important performance work going on in PDFBox. Are there specific features in 2.0 that you need? Upgrade to PDFBox 2.0.0 when available -- Key: TIKA-1285 URL: https://issues.apache.org/jira/browse/TIKA-1285 Project: Tika Issue Type: Improvement Components: parser Affects Versions: 1.6 Reporter: Jeremy Anderson Priority: Minor Attachments: TIKA-1285.patch, TIKA-1285_rev1641423.patch, TIKA-1285v3.patch, pdfbox_reports_2_0_0_20150709.zip, testPDF_childAttachments.pdf This issue is to track fixes required when upgrading the PDFbox dependency to 2.0.0 Final once it's available, and using PDFBox's daily build before then. See TIKA-1268 comment. Relates to PDFBOX-1893 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1285) Upgrade to PDFBox 2.0.0 when available
[ https://issues.apache.org/jira/browse/TIKA-1285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14633655#comment-14633655 ] jayesh commented on TIKA-1285: --
org.apache.fontbox.ttf.TrueTypeFont initializeTable
SEVERE: An error occured when reading table hmtx
java.io.EOFException
org.apache.fontbox.util.FontManager findTTFontname
WARNING: Font not found: Verdana
After some googling, I found that the errors above, along with some others, were fixed in PDFBox 2.0. Hence I was curious to know when that will be available in Tika.
Upgrade to PDFBox 2.0.0 when available -- Key: TIKA-1285 URL: https://issues.apache.org/jira/browse/TIKA-1285 Project: Tika Issue Type: Improvement Components: parser Affects Versions: 1.6 Reporter: Jeremy Anderson Priority: Minor Attachments: TIKA-1285.patch, TIKA-1285_rev1641423.patch, TIKA-1285v3.patch, pdfbox_reports_2_0_0_20150709.zip, testPDF_childAttachments.pdf This issue is to track fixes required when upgrading the PDFbox dependency to 2.0.0 Final once it's available, and using PDFBox's daily build before then. See TIKA-1268 comment. Relates to PDFBOX-1893 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1285) Upgrade to PDFBox 2.0.0 when available
[ https://issues.apache.org/jira/browse/TIKA-1285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14121991#comment-14121991 ] Jeremy Anderson commented on TIKA-1285: ---
Updated patch to include fixes as of revision 1621674 on Sept 4th. Major fixes include syncing up to a snapshot of PDFBox taken after Jempbox was replaced by XmpBox. XmpBox still requires some refinement to properly handle all of the XMP packages encountered by Tika's unit tests. Some of these cases have been commented out until DomXmpParser can resolve them. The issues have not yet been reported in JIRA for PDFBOX, as I'm not familiar with how to proceed for them.
The common DomXmpParser issues encountered:
* Invalid array definition, expecting Alt and found nothing [prefix=dc; name=title]
* Invalid array type, expecting Seq and found Bag [prefix=dc; name=creator]
* No type defined for {http://ns.adobe.com/pdf/1.3/}Trapped
* Cannot find a definition for the namespace http://ns.adobe.com/pdfx/1.3/
* xmp should start with a processing instruction
Patch works in conjunction with PDFBOX-2318
Upgrade to PDFBox 2.0.0 when available -- Key: TIKA-1285 URL: https://issues.apache.org/jira/browse/TIKA-1285 Project: Tika Issue Type: Improvement Components: parser Affects Versions: 1.6 Reporter: Jeremy Anderson Priority: Minor Attachments: TIKA-1285.patch This issue is to track fixes required when upgrading the PDFbox dependency to 2.0.0 Final once it's available, and using PDFBox's daily build before then. See TIKA-1268 comment. Relates to PDFBOX-1893 -- This message was sent by Atlassian JIRA (v6.3.4#6332)