[jira] [Commented] (TIKA-1830) Upgrade to PDFBox 1.8.11 when available
[ https://issues.apache.org/jira/browse/TIKA-1830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15098412#comment-15098412 ] Tilman Hausherr commented on TIKA-1830: --- Another possibility is that the change I mentioned has different implications depending on what JDK is used. Btw these files don't have errors with the non sequential parser. > Upgrade to PDFBox 1.8.11 when available > --- > > Key: TIKA-1830 > URL: https://issues.apache.org/jira/browse/TIKA-1830 > Project: Tika > Issue Type: Improvement >Reporter: Tim Allison > Attachments: reports_pdfbox_1_8_11-rc1.zip > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (TIKA-1830) Upgrade to PDFBox 1.8.11 when available
[ https://issues.apache.org/jira/browse/TIKA-1830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15096866#comment-15096866 ] Tilman Hausherr edited comment on TIKA-1830 at 1/14/16 5:05 PM: I can't reproduce the difference for the file 074531.pdf. ExtractText returns identical results, which makes me doubt the entire test :-( (edit: the same holds for 362980.pdf, 058103.pdf, and 760707.pdf) I can reproduce the difference for 290377.pdf; this is because of a change in decompression (rev 1709182) that tries to squeeze as much as possible from corrupt streams. There may be some differences due to a bugfix related to "article beads". This will mean improved results for files with correct beads, but worse results for files where bead rectangles are incorrect.
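To illustrate what "squeezing as much as possible from corrupt streams" can mean in general, here is a minimal sketch using only the JDK's `java.util.zip` classes. It is not the code from rev 1709182; the class and method names (`LenientInflateDemo`, `inflateLeniently`, `deflate`) are made up for this example. The idea is simply to keep whatever bytes were decoded before the stream turned out to be truncated or corrupt, instead of discarding everything.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical demo class; not part of PDFBox or Tika.
class LenientInflateDemo {

    /** Inflates zlib data and returns whatever could be decoded, even if the input is damaged. */
    static byte[] inflateLeniently(byte[] compressed) {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[1024];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                if (n == 0) {
                    break; // no progress: input exhausted (truncated stream) or a dictionary is needed
                }
                out.write(buf, 0, n);
            }
        } catch (DataFormatException e) {
            // Corrupt data: fall through and keep the bytes decoded so far.
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }

    /** Helper that produces well-formed zlib data for the demo. */
    static byte[] deflate(byte[] raw) {
        Deflater deflater = new Deflater();
        deflater.setInput(raw);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[1024];
        while (!deflater.finished()) {
            out.write(buf, 0, deflater.deflate(buf));
        }
        deflater.end();
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] raw = "BT /F1 12 Tf (some page text) Tj ET".getBytes(StandardCharsets.US_ASCII);
        byte[] packed = deflate(raw);
        byte[] truncated = Arrays.copyOf(packed, packed.length / 2);

        System.out.println("intact:    " + inflateLeniently(packed).length + " bytes recovered");
        System.out.println("truncated: " + inflateLeniently(truncated).length + " bytes recovered");
    }
}
```

A strict decoder would reject the truncated input outright. The lenient one returns a prefix of the original data, which is why extracted text can differ between versions for damaged files such as the one mentioned above.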
[jira] [Comment Edited] (TIKA-1830) Upgrade to PDFBox 1.8.11 when available
[ https://issues.apache.org/jira/browse/TIKA-1830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15098465#comment-15098465 ] Tim Allison edited comment on TIKA-1830 at 1/14/16 5:50 PM: Y, 074531.pdf has uncovered a Tika issue. I can reproduce the exception with {{Tika.getInputStream()}}, but there is no problem if I call {{new FileInputStream}} or {{Files.newInputStream()}}. Were there any changes to stream manipulation (mark/reset, etc.) in 1.8.11 vs 1.8.10? I confirmed that {{Tika.getInputStream}} works with 1.8.10 but not 1.8.11... interesting... Also confirmed that this problem does not happen in 1.8.11 with the NonSequentialParser, only with the classic parser.
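For readers less familiar with the mark/reset behaviour asked about above, here is a small JDK-only sketch. It does not reproduce the exception and does not use Tika or PDFBox; `MarkResetDemo` and `peekAndRewind` are names invented for the example. It only shows why the kind of stream handed to a parser can matter: a plain `FileInputStream` reports `markSupported() == false`, while a buffering wrapper lets a reader look ahead and rewind.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of Tika or PDFBox.
class MarkResetDemo {

    /** Reads n bytes ahead, rewinds with reset(), and returns the first byte again. */
    static int peekAndRewind(InputStream in, int n) {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream does not support mark/reset");
        }
        try {
            in.mark(n);               // remember the current position
            for (int i = 0; i < n; i++) {
                in.read();            // look ahead
            }
            in.reset();               // rewind to the marked position
            return in.read();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] data = "%PDF-1.4".getBytes(StandardCharsets.US_ASCII);
        InputStream buffered = new BufferedInputStream(new ByteArrayInputStream(data));
        System.out.println((char) peekAndRewind(buffered, 4));
    }
}
```

If a parser relies on this rewind behaviour, two wrappers around the same file can behave differently even though the bytes are identical.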
[jira] [Commented] (TIKA-1830) Upgrade to PDFBox 1.8.11 when available
[ https://issues.apache.org/jira/browse/TIKA-1830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15098515#comment-15098515 ] Tim Allison commented on TIKA-1830: --- Doh. Right. Thank you.
[jira] [Comment Edited] (TIKA-1830) Upgrade to PDFBox 1.8.11 when available
[ https://issues.apache.org/jira/browse/TIKA-1830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15098465#comment-15098465 ] Tim Allison edited comment on TIKA-1830 at 1/14/16 5:37 PM: Y, 074531.pdf has uncovered a Tika issue. I can reproduce the exception with {{Tika.getInputStream()}}, but there is no problem if I call {{new FileInputStream}} or {{Files.newInputStream()}}. Were there any changes to stream manipulation (mark/reset, etc.) in 1.8.11 vs 1.8.10? I confirmed that {{Tika.getInputStream}} works with 1.8.10 but not 1.8.11... interesting...
[jira] [Commented] (TIKA-1830) Upgrade to PDFBox 1.8.11 when available
[ https://issues.apache.org/jira/browse/TIKA-1830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15098465#comment-15098465 ] Tim Allison commented on TIKA-1830: --- Y, 074531.pdf has uncovered a Tika issue. I can reproduce the exception with Tika.getInputStream(), but there is no problem if I call new FileInputStream or Files.newInputStream().
[jira] [Commented] (TIKA-1830) Upgrade to PDFBox 1.8.11 when available
[ https://issues.apache.org/jira/browse/TIKA-1830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15098503#comment-15098503 ] Tilman Hausherr commented on TIKA-1830: --- Not that, but the change I mentioned https://svn.apache.org/viewvc?view=revision=date=1709182 may play a role.
RE: Tika questions on StackOverflow
On Wed, 13 Jan 2016, Allison, Timothy B. wrote:
> Are there other consumer lists we should be following? Elastic Search?

I think Elastic Search only has a forum-type thingy; this should probably let you see Tika posts there (not that frequent):
https://discuss.elastic.co/search?q=tika%20category%3A6%20order%3Alatest

Otherwise Alfresco, Nutch and StormCrawler are probably the next biggest open source users, I guess?

Nick
[jira] [Commented] (TIKA-1824) Tika 2.0 - Create Initial Parser Modules
[ https://issues.apache.org/jira/browse/TIKA-1824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15097884#comment-15097884 ] Nick Burch commented on TIKA-1824: -- Tika already supports using a custom classloader for loading parser + detector classes + spi files - http://tika.apache.org/1.11/api/org/apache/tika/config/TikaConfig.html#TikaConfig%28java.lang.ClassLoader%29 > Tika 2.0 - Create Initial Parser Modules > - > > Key: TIKA-1824 > URL: https://issues.apache.org/jira/browse/TIKA-1824 > Project: Tika > Issue Type: Improvement >Affects Versions: 2.0 >Reporter: Bob Paulin >Assignee: Bob Paulin > > Create initial break down of parser modules. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
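As background on what loading "spi files" through a custom classloader means, here is a sketch of the JDK's generic `java.util.ServiceLoader` mechanism. It is not Tika's own loading code; the Tika entry point is the `TikaConfig(ClassLoader)` constructor linked above. The names `SpiClassLoaderDemo`, `DemoParser` and `loadAll` are made up for the example.

```java
import java.net.URL;
import java.net.URLClassLoader;
import java.util.ArrayList;
import java.util.List;
import java.util.ServiceLoader;

// Hypothetical demo class; it does not use or mirror Tika's internals.
class SpiClassLoaderDemo {

    /** Stand-in service interface; a real setup would use the library's parser or detector type. */
    public interface DemoParser {
        String name();
    }

    /** Looks up META-INF/services entries for the service type through the given classloader. */
    static <T> List<T> loadAll(Class<T> service, ClassLoader loader) {
        List<T> found = new ArrayList<>();
        for (T impl : ServiceLoader.load(service, loader)) {
            found.add(impl);
        }
        return found;
    }

    public static void main(String[] args) {
        // A custom loader; in a modular setup its URLs would point at the parser module jars.
        ClassLoader custom = new URLClassLoader(new URL[0], SpiClassLoaderDemo.class.getClassLoader());
        List<DemoParser> parsers = loadAll(DemoParser.class, custom);
        System.out.println("implementations visible to this loader: " + parsers.size());
    }
}
```

Which implementations are found depends entirely on what the supplied loader can see, which is the property that makes a per-module classloader useful.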
[jira] [Commented] (TIKA-1830) Upgrade to PDFBox 1.8.11 when available
[ https://issues.apache.org/jira/browse/TIKA-1830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15098112#comment-15098112 ] Tim Allison commented on TIKA-1830: --- Argh...I'll rerun the 1.8.10 batch and see what we get.
[jira] [Commented] (TIKA-1830) Upgrade to PDFBox 1.8.11 when available
[ https://issues.apache.org/jira/browse/TIKA-1830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15098427#comment-15098427 ] Tim Allison commented on TIKA-1830: --- I just tested casting a null object that started life as a null String, and it seems not to throw an NPE. This is probably a Tika issue. I can replicate the exception via our commandline but the file works fine in our GUI...bizarre... More digging...
[jira] [Commented] (TIKA-1830) Upgrade to PDFBox 1.8.11 when available
[ https://issues.apache.org/jira/browse/TIKA-1830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15098393#comment-15098393 ] Tim Allison commented on TIKA-1830: --- Finished the rerun...and the results look the same. Question: On PDFBOX-3193, you've set affected versions to 1.8.10 and 1.8.11. Are you sure that that affects 1.8.10? The discovery of that wouldn't have happened unless I was actually running 1.8.11. In 1.8.10, 074531.pdf has ~30k words. When I run 1.8.11 as a unit test within our PDFParser wrapper, I also get ~30k words. However, when I rerun our batch wrapper around 1.8.11 on this file, I get the same exception in a rerun as I did in the original run (reported in the reports attached yesterday). The exception is:
{noformat}
java.lang.NullPointerException
    at org.apache.pdfbox.pdfparser.BaseParser.parseCOSArray(BaseParser.java:1077)
    at org.apache.pdfbox.pdfparser.BaseParser.parseDirObject(BaseParser.java:1275)
    at org.apache.pdfbox.pdfparser.BaseParser.parseCOSArray(BaseParser.java:1066)
    at org.apache.pdfbox.pdfparser.PDFStreamParser.parseNextToken(PDFStreamParser.java:276)
    at org.apache.pdfbox.pdfparser.PDFStreamParser.access$000(PDFStreamParser.java:49)
    at org.apache.pdfbox.pdfparser.PDFStreamParser$1.tryNext(PDFStreamParser.java:193)
    at org.apache.pdfbox.pdfparser.PDFStreamParser$1.hasNext(PDFStreamParser.java:205)
    at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:256)
    at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:236)
    at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:216)
    at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:471)
    at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:395)
    at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:354)
    at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:148)
    at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:148)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
{noformat}
I get the same exception when I run this in our batch code with 1 consumer or 10 consumers...so it isn't a multithreading issue... will dig some more. As a side note, I thought I wasn't comparing contents if there was an exception in one of the files...I need to fix my SQL to make sure this is the case.
[jira] [Commented] (TIKA-1830) Upgrade to PDFBox 1.8.11 when available
[ https://issues.apache.org/jira/browse/TIKA-1830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15098401#comment-15098401 ] Tilman Hausherr commented on TIKA-1830: --- {quote} On PDFBOX-3193, you've set affected versions to 1.8.10 and 1.8.11. Are you sure that that affects 1.8.10? The discovery of that wouldn't have happened unless I was actually running 1.8.11. {quote} Indeed, sorry.
[jira] [Comment Edited] (TIKA-1830) Upgrade to PDFBox 1.8.11 when available
[ https://issues.apache.org/jira/browse/TIKA-1830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15098401#comment-15098401 ] Tilman Hausherr edited comment on TIKA-1830 at 1/14/16 5:02 PM: {quote} On PDFBOX-3193, you've set affected versions to 1.8.10 and 1.8.11. Are you sure that that affects 1.8.10? The discovery of that wouldn't have happened unless I was actually running 1.8.11. {quote} Indeed, sorry. Fixed.
[jira] [Commented] (TIKA-1830) Upgrade to PDFBox 1.8.11 when available
[ https://issues.apache.org/jira/browse/TIKA-1830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15098418#comment-15098418 ] Tilman Hausherr commented on TIKA-1830: --- The line at {{BaseParser.java:1077}} is {code} COSInteger number = (COSInteger)po.remove( po.size() -1 ); {code} po is never null, it is created earlier. Or would there be an NPE if {{po.remove}} returns null?
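The question about the cast can be checked with plain Java, independent of PDFBox. The sketch below uses a `List<Object>` and `Integer` as stand-ins for the PDFBox types in the quoted line (an assumption made only to keep the example self-contained); `NullCastDemo` and `castLast` are made-up names. Casting a null reference is legal in Java and yields null; a NullPointerException only arises later if the result is dereferenced or unboxed.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class; List<Object> and Integer stand in for the PDFBox types.
class NullCastDemo {

    /** Mirrors the shape of the quoted line: remove the last element and cast it. */
    static Integer castLast(List<Object> items) {
        return (Integer) items.remove(items.size() - 1);
    }

    public static void main(String[] args) {
        List<Object> po = new ArrayList<>();
        po.add(null);

        Integer number = castLast(po); // the cast itself succeeds on null
        System.out.println("cast result: " + number);

        try {
            int unboxed = number;      // unboxing null is what throws
            System.out.println(unboxed);
        } catch (NullPointerException e) {
            System.out.println("NPE on use, not on cast");
        }
    }
}
```

So a null returned by `remove` would not by itself explain an NPE on the cast, unless the receiver is null or the call throws internally.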
WMF extraction
Hi, POI will have a WMF module (org.apache.poi.hwmf.*) in the next beta. Looking over the govdocs collection, those embedded WMFs might contain interesting information for Tika. Although my main goal is to integrate the rendering for common sl, it shouldn't be too laborious to provide something afterwards. Should the output be part of the embedding document, e.g. ppt, or does it make sense to crawl over various extensions and extract that metadata separately? (I haven't checked how the parsers are called, so this might be nonsense ...) Andi
[GitHub] tika pull request: tika_2.x
GitHub user kulkarniachyut opened a pull request:

    https://github.com/apache/tika/pull/70

    tika_2.x

    test

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/apache/tika 2.x

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/tika/pull/70.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #70

commit 64f606e1d5154154673df62e2067e35ee5026087
Author: Bob Paulin
Date:   2015-12-01T03:47:16Z

    Created 2.x Branch

    git-svn-id: https://svn.apache.org/repos/asf/tika/branches/2.x@1717371 13f79535-47bb-0310-9956-ffa450edef68

commit 1034ba7605d37bc80839282852f3569ee08346f3
Author: Bob Paulin
Date:   2015-12-01T04:08:25Z

    Moved Versions to 2.0-SNAPSHOT

    git-svn-id: https://svn.apache.org/repos/asf/tika/branches/2.x@1717374 13f79535-47bb-0310-9956-ffa450edef68

commit 04bd7e34d26b25f5108cab4683acc2b60d96b848
Author: Nick Burch
Date:   2015-12-01T23:58:32Z

    Change the default LoadErrorHandler for Tika 2.x to be warn (TIKA-1805)

    git-svn-id: https://svn.apache.org/repos/asf/tika/branches/2.x@1717557 13f79535-47bb-0310-9956-ffa450edef68

commit f49f155a2bd5688a5c88b18bf19d9b7a2c9bd1de
Author: Nick Burch
Date:   2015-12-02T00:33:37Z

    Change what CLIRR checks against - we expect breakages vs Tika Core 1.0, that is why it is 2.0!

    git-svn-id: https://svn.apache.org/repos/asf/tika/branches/2.x@1717559 13f79535-47bb-0310-9956-ffa450edef68

commit cd50cd1843f12c1fc08af26d4ba5cf6d19c3452f
Author: Nick Burch
Date:   2015-12-02T00:33:41Z

    TIKA-1805 Notify via LoadErrorHandler if DefaultParser or DefaultDetector could not find any implementations of their service classes

    git-svn-id: https://svn.apache.org/repos/asf/tika/branches/2.x@1717560 13f79535-47bb-0310-9956-ffa450edef68

commit 5ab8e2b3e5fb66d5f1c38c6536742dfbe71d564d
Author: Bob Paulin
Date:   2015-12-03T23:48:49Z

    TIKA-1807 - Adding PAX-Exam to parent to allow standard test framework versions.

    git-svn-id: https://svn.apache.org/repos/asf/tika/branches/2.x@1717881 13f79535-47bb-0310-9956-ffa450edef68

commit f28173035134ebba1e1696cfb225f53b258e8af6
Author: Bob Paulin
Date:   2015-12-13T19:34:17Z

    TIKA-1809 Enhanced Tika OSGi Service with test for core bundle.

    git-svn-id: https://svn.apache.org/repos/asf/tika/branches/2.x@1719820 13f79535-47bb-0310-9956-ffa450edef68

commit cc4a2bb1a924336d0fd7f79013bcab64588c8d13
Author: Bob Paulin
Date:   2015-12-13T19:35:33Z

    TIKA-1810 - Tika Parser Module Parent POM

    git-svn-id: https://svn.apache.org/repos/asf/tika/branches/2.x@1719822 13f79535-47bb-0310-9956-ffa450edef68

commit db9c6c28985b468af9ee8128e6402f9842f88048
Author: Bob Paulin
Date:   2015-12-13T19:37:44Z

    TIKA-1811 - Tika Multimedia Module

    git-svn-id: https://svn.apache.org/repos/asf/tika/branches/2.x@1719824 13f79535-47bb-0310-9956-ffa450edef68

commit 19c8259be0fbb38e2953a97ecf300da76bb2bfab
Author: Bob Paulin
Date:   2015-12-13T19:40:03Z

    TIKA-1811 - Fixed ignores for svn.

    git-svn-id: https://svn.apache.org/repos/asf/tika/branches/2.x@1719825 13f79535-47bb-0310-9956-ffa450edef68

commit b3c979bf9f29880882a3be25b69962f833cbf002
Author: Bob Paulin
Date:   2015-12-14T01:32:23Z

    TIKA-1810 - Added tika parser modules to the parent pom

    git-svn-id: https://svn.apache.org/repos/asf/tika/branches/2.x@1719854 13f79535-47bb-0310-9956-ffa450edef68

commit 00b2b9e97ace292aa41cc95203b2306549694a71
Author: Bob Paulin
Date:   2015-12-28T23:10:16Z

    TIKA-1818 - Decouple test documents from parsers so they can be reused.

    git-svn-id: https://svn.apache.org/repos/asf/tika/branches/2.x@1722027 13f79535-47bb-0310-9956-ffa450edef68

commit a43670685274490d2ff7424e8ae0c3ea1bd29c93
Author: Bob Paulin
Date:   2015-12-28T23:14:03Z

    TIKA-1818 - Added ignores to test project.

    git-svn-id: https://svn.apache.org/repos/asf/tika/branches/2.x@1722028 13f79535-47bb-0310-9956-ffa450edef68

commit 016c52fdaaf7250f6b762e8356447598bcd873a4
Author: Bob Paulin
Date:   2015-12-28T23:22:46Z

    TIKA-1812 - Moving multimedia sources to module.

    git-svn-id: https://svn.apache.org/repos/asf/tika/branches/2.x@1722029 13f79535-47bb-0310-9956-ffa450edef68

commit f7109c58b744beca7b579920369a0588def6dde9
Author: Bob Paulin
Date:   2015-12-28T23:26:00Z

    TIKA-1812 - Copying the multimedia module classes back into tika-parsers with the maven shade plugin. This will allow creation of an uber jar.

    git-svn-id:
[GitHub] tika pull request: tika_2.x
Github user kulkarniachyut closed the pull request at: https://github.com/apache/tika/pull/70