[jira] [Commented] (TIKA-1422) org.apache.tika.parser.mail.RFC822ParserTest fails
[ https://issues.apache.org/jira/browse/TIKA-1422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14173537#comment-14173537 ]

Hong-Thai Nguyen commented on TIKA-1422:

I'm not using Tesseract

org.apache.tika.parser.mail.RFC822ParserTest fails
--
Key: TIKA-1422
URL: https://issues.apache.org/jira/browse/TIKA-1422
Project: Tika
Issue Type: Bug
Components: parser
Reporter: Chris A. Mattmann
Assignee: Chris A. Mattmann
Fix For: 1.7
Attachments: TIKA-1422.Mattmann.100114.patch.txt, TIKA-1422.Mattmann.100414.patch.txt, TIKA-1422.palsulich.100414.patch, TIKA-1422.palsulich.100714.patch

I'm seeing test failures from:
{noformat}
Results :
Failed tests: testMultipart(org.apache.tika.parser.mail.RFC822ParserTest): (..)
Tests run: 538, Failures: 1, Errors: 0, Skipped: 1
{noformat}
CentOS6 VM image, running:
{noformat}
[mattmann@memex tika]$ java -version
java version 1.7.0_67
Java(TM) SE Runtime Environment (build 1.7.0_67-b01)
Java HotSpot(TM) 64-Bit Server VM (build 24.65-b04, mixed mode)
[mattmann@memex tika]$ mvn -version
Apache Maven 3.2.1 (ea8b2b07643dbb1b84b6d16e1f08391b666bc1e9; 2014-02-14T09:37:52-08:00)
Maven home: /usr/share/apache-maven
Java version: 1.7.0_65, vendor: Oracle Corporation
Java home: /data/home/mattmann/dist/jdk1.7.0_65/jre
Default locale: en_US, platform encoding: UTF-8
OS name: linux, version: 2.6.32-431.23.3.el6.centos.plus.x86_64, arch: amd64, family: unix
[mattmann@memex tika]$
{noformat}
Here are the surefire reports - no clue what's up here:
{noformat}
[mattmann@memex tika]$ more tika-parsers/target/surefire-reports/org.apache.tika.parser.mail.RFC822ParserTest.txt
---
Test set: org.apache.tika.parser.mail.RFC822ParserTest
---
Tests run: 8, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 0.699 sec FAILURE!
testMultipart(org.apache.tika.parser.mail.RFC822ParserTest) Time elapsed: 0.152 sec FAILURE!
org.mockito.exceptions.verification.TooManyActualInvocations:
xHTMLContentHandler.startElement( "http://www.w3.org/1999/xhtml", "div", "div", isA(org.xml.sax.Attributes) );
Wanted 4 times but was 5
at org.apache.tika.parser.mail.RFC822ParserTest.testMultipart(RFC822ParserTest.java:87)
Caused by: org.mockito.exceptions.cause.UndesiredInvocation:
Undesired invocation:
at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
at org.apache.tika.sax.SafeContentHandler.startElement(SafeContentHandler.java:264)
at org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:254)
at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
at org.apache.tika.sax.xpath.MatchingContentHandler.startElement(MatchingContentHandler.java:60)
at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
at org.apache.tika.sax.SafeContentHandler.startElement(SafeContentHandler.java:264)
at org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:254)
at org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:284)
at org.apache.tika.parser.ocr.TesseractOCRParser.extractOutput(TesseractOCRParser.java:243)
at org.apache.tika.parser.ocr.TesseractOCRParser.parse(TesseractOCRParser.java:155)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:247)
at org.apache.tika.parser.mail.MailContentHandler.body(MailContentHandler.java:102)
at org.apache.james.mime4j.parser.MimeStreamParser.parse(MimeStreamParser.java:133)
at org.apache.tika.parser.mail.RFC822Parser.parse(RFC822Parser.java:76)
at org.apache.tika.parser.mail.RFC822ParserTest.testMultipart(RFC822ParserTest.java:84)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
at
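The trace in the report above suggests that the fifth `div` reaches the handler through `TesseractOCRParser.extractOutput`, so the count the test verifies appears to depend on whether OCR runs. The following is only a toy model of that effect, not Tika or Mockito code: `DivCountSketch`, `DivCounter`, `emitPart` and `countDivs` are hypothetical names, and the assumption that each parsed part contributes exactly one `div` is made purely for illustration. It relies only on the SAX classes shipped with the JDK.

```java
import org.xml.sax.Attributes;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

public class DivCountSketch {
    static final String XHTML = "http://www.w3.org/1999/xhtml";

    /** Counts startElement events for XHTML div elements (hypothetical helper). */
    static final class DivCounter extends DefaultHandler {
        int divs = 0;

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if (XHTML.equals(uri) && "div".equals(localName)) {
                divs++;
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            // Nothing to count on close.
        }
    }

    /** Stand-in for a parser that wraps its output in one div (assumption). */
    static void emitPart(DivCounter handler) {
        handler.startElement(XHTML, "div", "div", new AttributesImpl());
        handler.endElement(XHTML, "div", "div");
    }

    /** Number of div events seen for the given parts, plus one if OCR also emits output. */
    static int countDivs(int mailParts, boolean ocrAvailable) {
        DivCounter counter = new DivCounter();
        for (int i = 0; i < mailParts; i++) {
            emitPart(counter);
        }
        if (ocrAvailable) {
            emitPart(counter); // the extra, environment-dependent invocation
        }
        return counter.divs;
    }

    public static void main(String[] args) {
        System.out.println(countDivs(4, false)); // prints 4
        System.out.println(countDivs(4, true));  // prints 5
    }
}
```

In this model a fixed expectation of 4 holds only on machines where the optional parser stays silent, which is one way a test can pass for some contributors and fail for others.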
1.7 release?
Hi, Any news on the 1.7 release? or at least a 1.6.1 release that includes the fix for broken ODF parsing… --- Best regards, Andrzej Bialecki
Re: 1.7 release?
Hi Andrzej, We are eager for the 1.7 release too. I'm still having a build problem with TIKA-1422 on my machine. If anyone can build successfully on Windows, I have no objection to releasing 1.7. Thanks, On Thu, Oct 16, 2014 at 10:51 AM, Andrzej Białecki a...@getopt.org wrote: Hi, Any news on the 1.7 release? or at least a 1.6.1 release that includes the fix for broken ODF parsing… --- Best regards, Andrzej Bialecki -- Hong-Thai
[jira] [Commented] (TIKA-1422) org.apache.tika.parser.mail.RFC822ParserTest fails
[ https://issues.apache.org/jira/browse/TIKA-1422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14173839#comment-14173839 ] Tyler Palsulich commented on TIKA-1422: --- Can you check what {{%ErrorLevel%}} is when you try to run Tesseract from command line?
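The exit-status check requested in the comment above can also be done from Java, which reports the same value that `%ErrorLevel%` holds on Windows after a command finishes. This is a minimal sketch using only `java.lang.ProcessBuilder`; the class name `ExitCodeSketch`, the helper `exitCodeOf`, and the default command `tesseract` with no arguments are assumptions for illustration, and the sketch does not reproduce how Tika itself probes for Tesseract.

```java
import java.io.IOException;
import java.util.OptionalInt;

public class ExitCodeSketch {

    /**
     * Runs the command and returns its exit code, or an empty OptionalInt
     * when the program cannot be started (for example, it is not on the PATH).
     */
    static OptionalInt exitCodeOf(String... command) {
        try {
            Process process = new ProcessBuilder(command)
                    .inheritIO() // show the tool's own output and avoid blocking on a full pipe
                    .start();
            return OptionalInt.of(process.waitFor());
        } catch (IOException e) {
            return OptionalInt.empty(); // could not launch the program at all
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return OptionalInt.empty();
        }
    }

    public static void main(String[] args) {
        // Assumption: with no arguments given, probe a bare "tesseract" invocation.
        String[] command = args.length > 0 ? args : new String[] {"tesseract"};
        OptionalInt code = exitCodeOf(command);
        if (code.isPresent()) {
            System.out.println("exit code: " + code.getAsInt());
        } else {
            System.out.println("could not start: " + command[0]);
        }
    }
}
```

Separating "could not start" from "started and returned a code" matters here, because a machine without Tesseract and a machine with a misbehaving Tesseract would otherwise look the same.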
[jira] [Updated] (TIKA-1442) Upgrade to PDFBox 1.8.8
[ https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr updated TIKA-1442: -- Attachment: (was: pdfbox_1_8_6V1_8_8-SNAPSHOT.xlsx) Upgrade to PDFBox 1.8.8 --- Key: TIKA-1442 URL: https://issues.apache.org/jira/browse/TIKA-1442 Project: Tika Issue Type: Improvement Reporter: Tim Allison Assignee: Tim Allison Fix For: 1.7 Given the regressions we identified in PDFBox 1.8.7, we should upgrade to 1.8.8 as soon as it is ready. I'm tempted to call this a blocker on Tika 1.7. Let's use this issue to carry on the discussion of regression testing (if any further discussion is necessary) or any other prep that needs to happen before 1.8.8's release. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1442) Upgrade to PDFBox 1.8.8
[ https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14173983#comment-14173983 ] Tilman Hausherr commented on TIKA-1442: --- After some more research, I was able to decode 5 more files (the cause was not the LZW filter, see ). However 7 other files are really corrupt, portions of the files are blank when shown in AR: 115/115269.pdf 211/211876.pdf 268/268346.pdf 389/389474.pdf 443/443752.pdf 698/698813.pdf 846/846759.pdf
[jira] [Updated] (TIKA-1442) Upgrade to PDFBox 1.8.8
[ https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr updated TIKA-1442: -- Attachment: pdfbox_1_8_6V1_8_8-SNAPSHOT.xlsx