[jira] [Commented] (TIKA-1422) org.apache.tika.parser.mail.RFC822ParserTest fails

2014-10-16 Thread Hong-Thai Nguyen (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14173537#comment-14173537
 ] 

Hong-Thai Nguyen commented on TIKA-1422:


I'm not using Tesseract

 org.apache.tika.parser.mail.RFC822ParserTest fails
 --

 Key: TIKA-1422
 URL: https://issues.apache.org/jira/browse/TIKA-1422
 Project: Tika
  Issue Type: Bug
  Components: parser
Reporter: Chris A. Mattmann
Assignee: Chris A. Mattmann
 Fix For: 1.7

 Attachments: TIKA-1422.Mattmann.100114.patch.txt, 
 TIKA-1422.Mattmann.100414.patch.txt, TIKA-1422.palsulich.100414.patch, 
 TIKA-1422.palsulich.100714.patch


 I'm seeing test failures from:
 {noformat}
 Results :
 Failed tests:   testMultipart(org.apache.tika.parser.mail.RFC822ParserTest): 
 (..)
 Tests run: 538, Failures: 1, Errors: 0, Skipped: 1
 {noformat}
 CentOS6 VM image, running:
 {noformat}
 [mattmann@memex tika]$ java -version
 java version 1.7.0_67
 Java(TM) SE Runtime Environment (build 1.7.0_67-b01)
 Java HotSpot(TM) 64-Bit Server VM (build 24.65-b04, mixed mode)
 [mattmann@memex tika]$ mvn -version
 Apache Maven 3.2.1 (ea8b2b07643dbb1b84b6d16e1f08391b666bc1e9; 2014-02-14T09:37:52-08:00)
 Maven home: /usr/share/apache-maven
 Java version: 1.7.0_65, vendor: Oracle Corporation
 Java home: /data/home/mattmann/dist/jdk1.7.0_65/jre
 Default locale: en_US, platform encoding: UTF-8
 OS name: linux, version: 2.6.32-431.23.3.el6.centos.plus.x86_64, arch: amd64, family: unix
 [mattmann@memex tika]$ 
 {noformat}
 Here are the surefire reports - no clue what's up here:
 {noformat}
 [mattmann@memex tika]$ more tika-parsers/target/surefire-reports/org.apache.tika.parser.mail.RFC822ParserTest.txt
 ---
 Test set: org.apache.tika.parser.mail.RFC822ParserTest
 ---
 Tests run: 8, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 0.699 sec  FAILURE!
 testMultipart(org.apache.tika.parser.mail.RFC822ParserTest)  Time elapsed: 0.152 sec   FAILURE!
 org.mockito.exceptions.verification.TooManyActualInvocations: 
 xHTMLContentHandler.startElement(
 "http://www.w3.org/1999/xhtml",
 "div",
 "div",
 isA(org.xml.sax.Attributes)
 );
 Wanted 4 times but was 5
   at org.apache.tika.parser.mail.RFC822ParserTest.testMultipart(RFC822ParserTest.java:87)
 Caused by: org.mockito.exceptions.cause.UndesiredInvocation: 
 Undesired invocation:
   at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
   at org.apache.tika.sax.SafeContentHandler.startElement(SafeContentHandler.java:264)
   at org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:254)
   at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
   at org.apache.tika.sax.xpath.MatchingContentHandler.startElement(MatchingContentHandler.java:60)
   at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
   at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
   at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
   at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
   at org.apache.tika.sax.SafeContentHandler.startElement(SafeContentHandler.java:264)
   at org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:254)
   at org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:284)
   at org.apache.tika.parser.ocr.TesseractOCRParser.extractOutput(TesseractOCRParser.java:243)
   at org.apache.tika.parser.ocr.TesseractOCRParser.parse(TesseractOCRParser.java:155)
   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:247)
   at org.apache.tika.parser.mail.MailContentHandler.body(MailContentHandler.java:102)
   at org.apache.james.mime4j.parser.MimeStreamParser.parse(MimeStreamParser.java:133)
   at org.apache.tika.parser.mail.RFC822Parser.parse(RFC822Parser.java:76)
   at org.apache.tika.parser.mail.RFC822ParserTest.testMultipart(RFC822ParserTest.java:84)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   at java.lang.reflect.Method.invoke(Method.java:606)
   at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
   at 
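
The trace above shows the unexpected fifth startElement call for div arriving through TesseractOCRParser.extractOutput, that is, an additional parser writing its own div into the handler that the test verifies. A minimal JDK-only sketch of that effect follows. It uses neither Tika nor Mockito; DivCountSketch, DivCounter, emitParts and countDivs are hypothetical names, and the split into four mail parts plus one OCR div is an assumption read off the "Wanted 4 times but was 5" message.

```java
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

class DivCountSketch {

    static final String XHTML = "http://www.w3.org/1999/xhtml";

    /** Counts startElement calls for XHTML div elements, in the spirit of the mocked handler. */
    static class DivCounter extends DefaultHandler {
        int divs = 0;

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if (XHTML.equals(uri) && "div".equals(localName)) {
                divs++;
            }
        }
    }

    /** Hypothetical stand-in for a parser that wraps each part it handles in one div. */
    static void emitParts(DefaultHandler handler, int parts) throws SAXException {
        for (int i = 0; i < parts; i++) {
            handler.startElement(XHTML, "div", "div", new AttributesImpl());
            handler.endElement(XHTML, "div", "div");
        }
    }

    /** Returns the number of div start events seen for the given setup. */
    static int countDivs(int mailParts, boolean ocrParserActive) {
        DivCounter counter = new DivCounter();
        try {
            emitParts(counter, mailParts);
            if (ocrParserActive) {
                // Assumption: an OCR parser on the classpath contributes one more div.
                emitParts(counter, 1);
            }
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return counter.divs;
    }

    public static void main(String[] args) {
        System.out.println("without OCR parser: " + countDivs(4, false)); // 4
        System.out.println("with OCR parser:    " + countDivs(4, true));  // 5
    }
}
```

A test that asserts an exact invocation count is therefore sensitive to which parsers happen to be active in the environment where it runs.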
 

1.7 release?

2014-10-16 Thread Andrzej Białecki
Hi,

Any news on the 1.7 release? or at least a 1.6.1 release that includes the fix 
for broken ODF parsing…

---
Best regards,

Andrzej Bialecki



Re: 1.7 release?

2014-10-16 Thread Hong-Thai Nguyen
Hi Andrzej,

We are eager for the 1.7 release too.
I'm having a build problem with TIKA-1422 on my machine. If anyone can build
successfully on Windows, I have no objection to releasing 1.7.

Thanks,

On Thu, Oct 16, 2014 at 10:51 AM, Andrzej Białecki a...@getopt.org wrote:

 Hi,

 Any news on the 1.7 release? or at least a 1.6.1 release that includes the
 fix for broken ODF parsing…

 ---
 Best regards,

 Andrzej Bialecki




-- 
--
Hong-Thai


[jira] [Commented] (TIKA-1422) org.apache.tika.parser.mail.RFC822ParserTest fails

2014-10-16 Thread Tyler Palsulich (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14173839#comment-14173839
 ] 

Tyler Palsulich commented on TIKA-1422:
---

Can you check what {{%ErrorLevel%}} is when you try to run Tesseract from the
command line? 
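
For reference, {{%ErrorLevel%}} holds the exit code of the last command run in a Windows shell. A minimal Java sketch of reading the same value programmatically follows. It relies only on java.lang.ProcessBuilder; the class name, the method name, the -1 sentinel and the "tesseract -v" example command are illustrative assumptions, not Tika's actual detection code.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;
import java.util.List;

class ExitCodeCheck {

    /**
     * Runs a command and returns its exit code (the value %ErrorLevel% would show),
     * or -1 if the process could not be started, read, or waited for.
     * The -1 sentinel is a choice made for this sketch only.
     */
    static int exitCodeOf(List<String> command) {
        ProcessBuilder builder = new ProcessBuilder(command);
        builder.redirectErrorStream(true); // merge stderr into stdout
        try {
            Process process = builder.start();
            // Drain the output so the child cannot block on a full pipe.
            try (InputStream out = process.getInputStream()) {
                byte[] buffer = new byte[4096];
                while (out.read(buffer) != -1) {
                    // output is discarded
                }
            }
            return process.waitFor();
        } catch (IOException e) {
            return -1; // for example, the executable is not on the PATH
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return -1;
        }
    }

    public static void main(String[] args) {
        // Assumption: "tesseract -v" serves only as an example command here.
        List<String> command = args.length > 0
                ? Arrays.asList(args)
                : Arrays.asList("tesseract", "-v");
        System.out.println("exit code: " + exitCodeOf(command));
    }
}
```

Comparing this value on a machine with and without Tesseract installed would show whether a launch failure and a normal run can be told apart by exit code alone.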


[jira] [Updated] (TIKA-1442) Upgrade to PDFBox 1.8.8

2014-10-16 Thread Tilman Hausherr (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated TIKA-1442:
--
Attachment: (was: pdfbox_1_8_6V1_8_8-SNAPSHOT.xlsx)

 Upgrade to PDFBox 1.8.8
 ---

 Key: TIKA-1442
 URL: https://issues.apache.org/jira/browse/TIKA-1442
 Project: Tika
  Issue Type: Improvement
Reporter: Tim Allison
Assignee: Tim Allison
 Fix For: 1.7


 Given the regressions we identified in PDFBox 1.8.7, we should upgrade to 
 1.8.8 as soon as it is ready.  I'm tempted to call this a blocker on Tika 
 1.7.  Let's use this issue to carry on the discussion of regression testing 
 (if any further discussion is necessary) or any other prep that needs to 
 happen before 1.8.8's release.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1442) Upgrade to PDFBox 1.8.8

2014-10-16 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14173983#comment-14173983
 ] 

Tilman Hausherr commented on TIKA-1442:
---

After some more research, I was able to decode 5 more files (the cause was not 
the LZW filter, see ). However, 7 other files really are corrupt; portions of 
the files are blank when shown in AR:

115/115269.pdf
211/211876.pdf
268/268346.pdf
389/389474.pdf
443/443752.pdf
698/698813.pdf
846/846759.pdf



[jira] [Updated] (TIKA-1442) Upgrade to PDFBox 1.8.8

2014-10-16 Thread Tilman Hausherr (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated TIKA-1442:
--
Attachment: pdfbox_1_8_6V1_8_8-SNAPSHOT.xlsx

 Upgrade to PDFBox 1.8.8
 ---

 Key: TIKA-1442
 URL: https://issues.apache.org/jira/browse/TIKA-1442
 Project: Tika
  Issue Type: Improvement
Reporter: Tim Allison
Assignee: Tim Allison
 Fix For: 1.7

 Attachments: pdfbox_1_8_6V1_8_8-SNAPSHOT.xlsx


 Given the regressions we identified in PDFBox 1.8.7, we should upgrade to 
 1.8.8 as soon as it is ready.  I'm tempted to call this a blocker on Tika 
 1.7.  Let's use this issue to carry on the discussion of regression testing 
 (if any further discussion is necessary) or any other prep that needs to 
 happen before 1.8.8's release.


