Re: 1.7 release?

2014-10-21 Thread Oleg Tikhonov
Taken, thanks. In progress ...

On Tue, Oct 21, 2014 at 8:54 AM, Mattmann, Chris A (3980) 
chris.a.mattm...@jpl.nasa.gov wrote:

 Trunk is the current checkout/branch:

 http://svn.apache.org/repos/asf/tika/trunk


 ++
 Chris Mattmann, Ph.D.
 Chief Architect
 Instrument Software and Science Data Systems Section (398)
 NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
 Office: 168-519, Mailstop: 168-527
 Email: chris.a.mattm...@nasa.gov
 WWW:  http://sunset.usc.edu/~mattmann/
 ++
 Adjunct Associate Professor, Computer Science Department
 University of Southern California, Los Angeles, CA 90089 USA
 ++






 -Original Message-
 From: Oleg Tikhonov olegtikho...@gmail.com
 Reply-To: dev@tika.apache.org dev@tika.apache.org
 Date: Monday, October 20, 2014 at 10:16 PM
 To: dev@tika.apache.org dev@tika.apache.org
 Subject: Re: 1.7 release?

 Hi, I can take this on.
 What is the trunk?
 
 
 Thanks,
 Oleg
 
 On Tue, Oct 21, 2014 at 6:21 AM, Mattmann, Chris A (3980) 
 chris.a.mattm...@jpl.nasa.gov wrote:
 
  Hmm any idea why this is failing on Windows? Tyler P. and
  I were talking the other day - maybe we shouldn't run the
  tests from TIKA-1422 unless Tesseract is installed? Thoughts?
 
 
 
 
 
 
 
  -Original Message-
  From: Hong-Thai Nguyen thaicha...@gmail.com
  Reply-To: dev@tika.apache.org dev@tika.apache.org
  Date: Thursday, October 16, 2014 at 2:03 AM
  To: dev@tika.apache.org dev@tika.apache.org
  Subject: Re: 1.7 release?
 
  Hi Andrzej,
  
  We are eager for the 1.7 release too.
  I'm having a problem compiling TIKA-1422 on my side. If anyone can build
  it successfully on Windows, I have no objection to releasing 1.7.
  
  Thanks,
  
  On Thu, Oct 16, 2014 at 10:51 AM, Andrzej Białecki a...@getopt.org wrote:
  
   Hi,
  
   Any news on the 1.7 release? or at least a 1.6.1 release that includes
   the fix for broken ODF parsing...
  
   ---
   Best regards,
  
   Andrzej Bialecki
  
  
  
  
  --
  --
  Hong-Thai
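
The suggestion quoted in this thread, to skip the TIKA-1422 tests unless Tesseract is installed, could be sketched roughly as follows. This is only an illustration under stated assumptions, not the code from any of the attached patches: the class name TesseractCheck and the helper isOnPath are hypothetical, and the check assumes the OCR binary is called "tesseract" (or "tesseract.exe" on Windows) and is found through the PATH environment variable.

```java
import java.io.File;
import java.util.regex.Pattern;

// Hypothetical helper (not part of Tika): looks for an executable on a PATH-style string.
final class TesseractCheck {
    private TesseractCheck() {}

    // Returns true if `name` (or name + ".exe") is an executable file in one of
    // the directories listed in `pathValue`, which uses the platform path separator.
    static boolean isOnPath(String name, String pathValue) {
        if (name == null || pathValue == null || pathValue.isEmpty()) {
            return false;
        }
        for (String dir : pathValue.split(Pattern.quote(File.pathSeparator))) {
            if (dir.isEmpty()) {
                continue;
            }
            for (String candidate : new String[] {name, name + ".exe"}) {
                File file = new File(dir, candidate);
                if (file.isFile() && file.canExecute()) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Assumption: the binary name is "tesseract" and it is resolved via PATH.
        boolean available = isOnPath("tesseract", System.getenv("PATH"));
        System.out.println(available
                ? "Tesseract found: OCR-dependent tests can run"
                : "Tesseract not found: OCR-dependent tests should be skipped");
    }
}
```

In a JUnit 4 test, the result could be passed to org.junit.Assume.assumeTrue(...) at the start of the OCR-dependent test methods, so that they are reported as skipped rather than failed on machines without Tesseract.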
 
 




[jira] [Updated] (TIKA-1422) org.apache.tika.parser.mail.RFC822ParserTest fails

2014-10-21 Thread Oleg Tikhonov (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Oleg Tikhonov updated TIKA-1422:

Attachment: TIKA-1422.oleg.20141021.patch

Imports of the image parsers were missing in the TesseractOCRParser unit test.

 org.apache.tika.parser.mail.RFC822ParserTest fails
 --

 Key: TIKA-1422
 URL: https://issues.apache.org/jira/browse/TIKA-1422
 Project: Tika
  Issue Type: Bug
  Components: parser
Reporter: Chris A. Mattmann
Assignee: Chris A. Mattmann
 Fix For: 1.7

 Attachments: TIKA-1422.Mattmann.100114.patch.txt, 
 TIKA-1422.Mattmann.100414.patch.txt, TIKA-1422.oleg.20141021.patch, 
 TIKA-1422.palsulich.100414.patch, TIKA-1422.palsulich.100714.patch


 I'm seeing test failures from:
 {noformat}
 Results :
 Failed tests:   testMultipart(org.apache.tika.parser.mail.RFC822ParserTest): (..)
 Tests run: 538, Failures: 1, Errors: 0, Skipped: 1
 {noformat}
 CentOS6 VM image, running:
 {noformat}
 [mattmann@memex tika]$ java -version
 java version 1.7.0_67
 Java(TM) SE Runtime Environment (build 1.7.0_67-b01)
 Java HotSpot(TM) 64-Bit Server VM (build 24.65-b04, mixed mode)
 [mattmann@memex tika]$ mvn -version
 Apache Maven 3.2.1 (ea8b2b07643dbb1b84b6d16e1f08391b666bc1e9; 2014-02-14T09:37:52-08:00)
 Maven home: /usr/share/apache-maven
 Java version: 1.7.0_65, vendor: Oracle Corporation
 Java home: /data/home/mattmann/dist/jdk1.7.0_65/jre
 Default locale: en_US, platform encoding: UTF-8
 OS name: linux, version: 2.6.32-431.23.3.el6.centos.plus.x86_64, arch: amd64, family: unix
 [mattmann@memex tika]$ 
 {noformat}
 Here are the surefire reports - no clue what's up here:
 {noformat}
 [mattmann@memex tika]$ more tika-parsers/target/surefire-reports/org.apache.tika.parser.mail.RFC822ParserTest.txt
 ---
 Test set: org.apache.tika.parser.mail.RFC822ParserTest
 ---
 Tests run: 8, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 0.699 sec  FAILURE!
 testMultipart(org.apache.tika.parser.mail.RFC822ParserTest)  Time elapsed: 0.152 sec   FAILURE!
 org.mockito.exceptions.verification.TooManyActualInvocations: 
 xHTMLContentHandler.startElement(
     "http://www.w3.org/1999/xhtml",
     "div",
     "div",
     isA(org.xml.sax.Attributes)
 );
 Wanted 4 times but was 5
   at org.apache.tika.parser.mail.RFC822ParserTest.testMultipart(RFC822ParserTest.java:87)
 Caused by: org.mockito.exceptions.cause.UndesiredInvocation: 
 Undesired invocation:
   at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
   at org.apache.tika.sax.SafeContentHandler.startElement(SafeContentHandler.java:264)
   at org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:254)
   at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
   at org.apache.tika.sax.xpath.MatchingContentHandler.startElement(MatchingContentHandler.java:60)
   at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
   at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
   at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
   at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
   at org.apache.tika.sax.SafeContentHandler.startElement(SafeContentHandler.java:264)
   at org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:254)
   at org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:284)
   at org.apache.tika.parser.ocr.TesseractOCRParser.extractOutput(TesseractOCRParser.java:243)
   at org.apache.tika.parser.ocr.TesseractOCRParser.parse(TesseractOCRParser.java:155)
   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:247)
   at org.apache.tika.parser.mail.MailContentHandler.body(MailContentHandler.java:102)
   at org.apache.james.mime4j.parser.MimeStreamParser.parse(MimeStreamParser.java:133)
   at org.apache.tika.parser.mail.RFC822Parser.parse(RFC822Parser.java:76)
   at org.apache.tika.parser.mail.RFC822ParserTest.testMultipart(RFC822ParserTest.java:84)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   at java.lang.reflect.Method.invoke(Method.java:606)
   at org.junit.runners.model.FrameworkMethod$1

[jira] [Comment Edited] (TIKA-1422) org.apache.tika.parser.mail.RFC822ParserTest fails

2014-10-21 Thread Oleg Tikhonov (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14178018#comment-14178018
 ] 

Oleg Tikhonov edited comment on TIKA-1422 at 10/21/14 6:19 AM:
---

Imports of the image parsers were missing in the TesseractOCRParser unit test.

Env:
Windows 7, PE, x64. 
java version 1.7.0_11
Java(TM) SE Runtime Environment (build 1.7.0_11-b21)
Java HotSpot(TM) 64-Bit Server VM (build 23.6-b04, mixed mode)

Output:
After importing the image parsers:
[INFO] 
[INFO] Building Apache Tika 1.7-SNAPSHOT
[INFO] 
[INFO]
[INFO] --- maven-clean-plugin:2.4.1:clean (default-clean) @ tika ---
[INFO] Deleting E:\work_dir\tika\tika-site\target
[INFO]
[INFO] --- maven-remote-resources-plugin:1.2.1:process (default) @ tika ---
[INFO]
[INFO] --- maven-site-plugin:3.0:attach-descriptor (attach-descriptor) @ tika ---
[INFO]
[INFO] --- maven-install-plugin:2.3.1:install (default-install) @ tika ---
[INFO] Installing E:\work_dir\tika\tika-site\pom.xml to \.m2\repository\org\apache\tika\tika\1.7-SNAPSHOT\tika-1.7-SNAPSHOT.pom
[INFO] 
[INFO] Reactor Summary:
[INFO]
[INFO] Apache Tika parent ....................... SUCCESS [1.093s]
[INFO] Apache Tika core ......................... SUCCESS [14.594s]
[INFO] Apache Tika parsers ...................... SUCCESS [49.359s]
[INFO] Apache Tika XMP .......................... SUCCESS [1.161s]
[INFO] Apache Tika serialization ................ SUCCESS [1.311s]
[INFO] Apache Tika application .................. SUCCESS [11.725s]
[INFO] Apache Tika OSGi bundle .................. SUCCESS [19.826s]
[INFO] Apache Tika server ....................... SUCCESS [15.705s]
[INFO] Apache Tika translate .................... SUCCESS [1.476s]
[INFO] Apache Tika examples ..................... SUCCESS [2.231s]
[INFO] Apache Tika Java-7 Components ............ SUCCESS [1.429s]
[INFO] Apache Tika .............................. SUCCESS [0.029s]
[INFO] 
[INFO] BUILD SUCCESS
[INFO] 
[INFO] Total time: 2:00.578s
[INFO] Finished at: Tue Oct 21 08:12:17 IST 2014
[INFO] Final Memory: 67M/1156M
[INFO] 



was (Author: olegt):
Were missing imports of image parsers in the TesseractOCRParser unit test.


Re: 1.7 release?

2014-10-21 Thread Oleg Tikhonov
Please give the newest patch a try.
Cheers,
Oleg

 
 





Re: 1.7 release?

2014-10-21 Thread Mattmann, Chris A (3980)
Thanks Oleg, will try tomorrow, Los Angeles time!

 
 






Re: 1.7 release?

2014-10-21 Thread Oleg Tikhonov
Sorry!!!

  
  
 
 
 




[jira] [Commented] (TIKA-1423) Build a parser to extract data from GRIB formats

2014-10-21 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14178036#comment-14178036
 ] 

Lewis John McGibbney commented on TIKA-1423:


Hi [~vinegh], how is this coming along? Would you like a hand? It would be great to 
get this into Tika 1.7.

 Build a parser to extract data from GRIB formats
 

 Key: TIKA-1423
 URL: https://issues.apache.org/jira/browse/TIKA-1423
 Project: Tika
  Issue Type: New Feature
  Components: metadata, mime, parser
Affects Versions: 1.6
Reporter: Vineet Ghatge
Priority: Critical
  Labels: features, newbie
 Fix For: 1.7

 Attachments: GribParser.java, 
 NLDAS_FORA0125_H.A20130112.1200.002.grb, gdas1.forecmwf.2014062612.grib2


 The Arctic dataset contains a MIME format called GRIB - General 
 Regularly-distributed Information in Binary form 
 http://en.wikipedia.org/wiki/GRIB . GRIB is a well-known, concise data 
 format used in meteorology to store historical and forecast weather data. 
 There are 2 different types of the format - GRIB 0, GRIB 2.  
 The focus will be on GRIB 2, which is the most prevalent. Each GRIB record 
 intended for either transmission or storage contains a single parameter with 
 values located at an array of grid points, or represented as a set of 
 spectral coefficients, for a single level (or layer), encoded as a continuous 
 bit stream. Logical divisions of the record are designated as sections, 
 each of which provides control information and/or data. A GRIB record 
 consists of six sections, two of which are optional: 
  
 (0) Indicator Section 
 (1) Product Definition Section (PDS) 
 (2) Grid Description Section (GDS) - optional 
 (3) Bit Map Section (BMS) - optional 
 (4) Binary Data Section (BDS) 
 (5) '' (ASCII Characters)
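
As a rough illustration of what the Indicator Section (0) in the list above gives a parser to work with, the sketch below checks the first eight bytes of a record. It is not taken from the attached GribParser.java. It assumes, based on my understanding of the WMO layout, that a record starts with the four ASCII characters "GRIB" and that the eighth byte holds the edition number; the class name GribIndicator and the method edition are hypothetical.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper (not the attached GribParser.java): inspects the Indicator Section.
final class GribIndicator {
    private GribIndicator() {}

    // Returns the GRIB edition number read from the eighth byte of the record,
    // or -1 if the data is too short or does not start with the ASCII marker "GRIB".
    static int edition(byte[] header) {
        if (header == null || header.length < 8) {
            return -1;
        }
        String magic = new String(header, 0, 4, StandardCharsets.US_ASCII);
        if (!"GRIB".equals(magic)) {
            return -1;
        }
        // Mask to read the byte as an unsigned value.
        return header[7] & 0xFF;
    }

    public static void main(String[] args) {
        // Made-up eight-byte header for illustration only, not real GRIB data.
        byte[] sample = {'G', 'R', 'I', 'B', 0, 0, 0, 2};
        System.out.println("Edition: " + edition(sample));
    }
}
```

A check of this kind could back MIME detection for GRIB files before any section beyond the first is parsed; the remaining sections would need the full specification.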



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: Tika 1.6 update in Maven Central?

2014-10-21 Thread Lewis John Mcgibbney
Hi Chris,

On Mon, Oct 20, 2014 at 11:37 PM, dev-digest-h...@tika.apache.org wrote:


 We do need to make a 1.7 release. I'd like to get TIKA-1422 fully
 working on Windows first.

 Do any of the other devs have things we should get into 1.7?


I would very much like to see
https://issues.apache.org/jira/browse/TIKA-1423 get into 1.7. We are nearly
there; we merely need to write unit tests, document methods, build this
into a patch and submit it to Jira for review. I will work with Vineet to
get this straightened out.
Thanks
Lewis


Re: Tika 1.6 update in Maven Central?

2014-10-21 Thread Mattmann, Chris A (3980)
Thanks Lewis!




[jira] [Assigned] (TIKA-1423) Build a parser to extract data from GRIB formats

2014-10-21 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney reassigned TIKA-1423:
--

Assignee: Lewis John McGibbney



[jira] [Commented] (TIKA-1423) Build a parser to extract data from GRIB formats

2014-10-21 Thread Vineet Ghatge (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14178052#comment-14178052
 ] 

Vineet Ghatge commented on TIKA-1423:
-

Hey [~lewismc], I am working on it and will post my updates this week. You can 
assign this to me. 



[jira] [Commented] (TIKA-1422) org.apache.tika.parser.mail.RFC822ParserTest fails

2014-10-21 Thread Hong-Thai Nguyen (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14178186#comment-14178186
 ] 

Hong-Thai Nguyen commented on TIKA-1422:


Applied the latest fix in r1633325 with some formatting. Thanks.


tika-trunk-jdk1.7 - Build # 273 - Failure

2014-10-21 Thread Apache Jenkins Server
The Apache Jenkins build system has built tika-trunk-jdk1.7 (build #273)

Status: Failure

Check console output at https://builds.apache.org/job/tika-trunk-jdk1.7/273/ to 
view the results.

[jira] [Commented] (TIKA-1422) org.apache.tika.parser.mail.RFC822ParserTest fails

2014-10-21 Thread Hudson (JIRA)

[ https://issues.apache.org/jira/browse/TIKA-1422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14178197#comment-14178197 ]

Hudson commented on TIKA-1422:
--

FAILURE: Integrated in tika-trunk-jdk1.7 #273 (See [https://builds.apache.org/job/tika-trunk-jdk1.7/273/])
TIKA-1422 - Apply fix of [~olegt] in Windows (thaichat04: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1633325)
* /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/ocr/TesseractOCRParser.java
* /tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/mail/RFC822ParserTest.java


 org.apache.tika.parser.mail.RFC822ParserTest fails
 --

 Key: TIKA-1422
 URL: https://issues.apache.org/jira/browse/TIKA-1422
 Project: Tika
  Issue Type: Bug
  Components: parser
Reporter: Chris A. Mattmann
Assignee: Chris A. Mattmann
 Fix For: 1.7

 Attachments: TIKA-1422.Mattmann.100114.patch.txt, 
 TIKA-1422.Mattmann.100414.patch.txt, TIKA-1422.oleg.20141021.patch, 
 TIKA-1422.palsulich.100414.patch, TIKA-1422.palsulich.100714.patch


 I'm seeing test failures from:
 {noformat}
 Results :
 Failed tests:   testMultipart(org.apache.tika.parser.mail.RFC822ParserTest): 
 (..)
 Tests run: 538, Failures: 1, Errors: 0, Skipped: 1
 {noformat}
 CentOS6 VM image, running:
 {noformat}
 [mattmann@memex tika]$ java -version
 java version 1.7.0_67
 Java(TM) SE Runtime Environment (build 1.7.0_67-b01)
 Java HotSpot(TM) 64-Bit Server VM (build 24.65-b04, mixed mode)
 [mattmann@memex tika]$ mvn -version
 Apache Maven 3.2.1 (ea8b2b07643dbb1b84b6d16e1f08391b666bc1e9; 
 2014-02-14T09:37:52-08:00)
 Maven home: /usr/share/apache-maven
 Java version: 1.7.0_65, vendor: Oracle Corporation
 Java home: /data/home/mattmann/dist/jdk1.7.0_65/jre
 Default locale: en_US, platform encoding: UTF-8
 OS name: linux, version: 2.6.32-431.23.3.el6.centos.plus.x86_64, arch: 
 amd64, family: unix
 [mattmann@memex tika]$ 
 {noformat}
 Here are the surefire reports - no clue what's up here:
 {noformat}
 [mattmann@memex tika]$ more tika-parsers/target/surefire-reports/org.apache.tika.parser.mail.RFC822ParserTest.txt
 ---
 Test set: org.apache.tika.parser.mail.RFC822ParserTest
 ---
 Tests run: 8, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 0.699 sec <<< FAILURE!
 testMultipart(org.apache.tika.parser.mail.RFC822ParserTest)  Time elapsed: 0.152 sec  <<< FAILURE!
 org.mockito.exceptions.verification.TooManyActualInvocations: 
 xHTMLContentHandler.startElement(
     "http://www.w3.org/1999/xhtml",
     "div",
     "div",
     isA(org.xml.sax.Attributes)
 );
 Wanted 4 times but was 5
   at org.apache.tika.parser.mail.RFC822ParserTest.testMultipart(RFC822ParserTest.java:87)
 Caused by: org.mockito.exceptions.cause.UndesiredInvocation: 
 Undesired invocation:
   at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
   at org.apache.tika.sax.SafeContentHandler.startElement(SafeContentHandler.java:264)
   at org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:254)
   at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
   at org.apache.tika.sax.xpath.MatchingContentHandler.startElement(MatchingContentHandler.java:60)
   at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
   at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
   at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
   at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
   at org.apache.tika.sax.SafeContentHandler.startElement(SafeContentHandler.java:264)
   at org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:254)
   at org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:284)
   at org.apache.tika.parser.ocr.TesseractOCRParser.extractOutput(TesseractOCRParser.java:243)
   at org.apache.tika.parser.ocr.TesseractOCRParser.parse(TesseractOCRParser.java:155)
   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:247)
   at org.apache.tika.parser.mail.MailContentHandler.body(MailContentHandler.java:102)
   at org.apache.james.mime4j.parser.MimeStreamParser.parse(MimeStreamParser.java:133)
   at org.apache.tika.parser.mail.RFC822Parser.parse(RFC822Parser.java:76)
   at org.apache.tika.parser.mail.RFC822ParserTest.testMultipart(RFC822ParserTest.java:84
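
For readers unfamiliar with the Mockito message in the report above: "Wanted 4 times but was 5" means the test verified an exact number of startElement(..., "div", ...) calls on the handler and one more call arrived than expected. The "Undesired invocation" frames show that extra call passing through TesseractOCRParser.extractOutput. Below is a minimal, JDK-only sketch of the same kind of check. It is not the real Tika test and does not use Mockito; the DivCounter class, emitDivs and verifyTimes are hypothetical names made up for illustration.

```java
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical stand-in for the mocked handler: counts XHTML <div> start events.
final class DivCounter extends DefaultHandler {
    static final String XHTML = "http://www.w3.org/1999/xhtml";
    private int divs;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (XHTML.equals(uri) && "div".equals(localName)) {
            divs++;
        }
    }

    int divs() {
        return divs;
    }

    // Fires `count` <div> start/end pairs, as a parser emitting body parts might.
    static void emitDivs(DefaultHandler handler, int count) throws SAXException {
        for (int i = 0; i < count; i++) {
            handler.startElement(XHTML, "div", "div", new AttributesImpl());
            handler.endElement(XHTML, "div", "div");
        }
    }

    // Produces a message of the same shape as the exact-count failure quoted above.
    static String verifyTimes(int wanted, int actual) {
        return wanted == actual ? "ok" : "Wanted " + wanted + " times but was " + actual;
    }

    public static void main(String[] args) throws SAXException {
        DivCounter handler = new DivCounter();
        emitDivs(handler, 4); // the parts the test expects
        emitDivs(handler, 1); // one extra part, e.g. contributed by an OCR parser
        System.out.println(verifyTimes(4, handler.divs()));
    }
}
```

An exact-count check like this depends on which parsers are active on the machine, which would be consistent with the test passing on some build hosts and failing on others.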

[jira] [Comment Edited] (TIKA-1422) org.apache.tika.parser.mail.RFC822ParserTest fails

2014-10-21 Thread Hong-Thai Nguyen (JIRA)

[ https://issues.apache.org/jira/browse/TIKA-1422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14178186#comment-14178186 ]

Hong-Thai Nguyen edited comment on TIKA-1422 at 10/21/14 9:48 AM:
--

Applied latest fix on r1633325 & r161 with some formatting. Thank


was (Author: thaichat04):
Applied latest fix on r1633325 with some formatting. Thank

[jira] [Commented] (TIKA-1422) org.apache.tika.parser.mail.RFC822ParserTest fails

2014-10-21 Thread Hudson (JIRA)

[ https://issues.apache.org/jira/browse/TIKA-1422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14178226#comment-14178226 ]

Hudson commented on TIKA-1422:
--

SUCCESS: Integrated in tika-trunk-jdk1.6 #253 (See [https://builds.apache.org/job/tika-trunk-jdk1.6/253/])
TIKA-1422 - Fixing build & minor refactory of naming test class (thaichat04: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=161)
* /tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/mail/RFC822ParserTest.java
* /tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/ocr/TesseractOCRParserTest.java
* /tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/ocr/TesseractOCRTest.java
TIKA-1422 - Apply fix of [~olegt] in Windows (thaichat04: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1633325)
* /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/ocr/TesseractOCRParser.java
* /tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/mail/RFC822ParserTest.java


[jira] [Commented] (TIKA-1422) org.apache.tika.parser.mail.RFC822ParserTest fails

2014-10-21 Thread Hudson (JIRA)

[ https://issues.apache.org/jira/browse/TIKA-1422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14178257#comment-14178257 ]

Hudson commented on TIKA-1422:
--

SUCCESS: Integrated in tika-trunk-jdk1.7 #274 (See [https://builds.apache.org/job/tika-trunk-jdk1.7/274/])
TIKA-1422 - Fixing build & minor refactory of naming test class (thaichat04: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=161)
* /tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/mail/RFC822ParserTest.java
* /tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/ocr/TesseractOCRParserTest.java
* /tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/ocr/TesseractOCRTest.java


[jira] [Created] (TIKA-1452) parser.parse() throws exception after which the procesed file is not getting renamed/moved/deleted

2014-10-21 Thread Abhishek (JIRA)
Abhishek created TIKA-1452:
--

 Summary: parser.parse() throws exception after which the procesed 
file is not getting renamed/moved/deleted
 Key: TIKA-1452
 URL: https://issues.apache.org/jira/browse/TIKA-1452
 Project: Tika
  Issue Type: Bug
  Components: detector, metadata, parser
Affects Versions: 1.6
 Environment: jre6
Reporter: Abhishek


I am passing a file as an input stream to the parser.parse() method while using 
the Apache Tika library to convert a file to text. The method throws an 
exception (displayed below), but the input stream is closed successfully in the 
finally block. Then, while renaming the file, the File.renameTo method from 
java.io returns false. I am not able to rename/delete/move the file despite 
successfully closing the inputStream. I am afraid another instance of the file 
is created while the parser.parse() method processes the file, and that it does 
not get closed by the time the exception is thrown. Is that possible? If so, 
what should I do to rename or delete the file?

The Exception thrown while checking the content type is

java.lang.NoClassDefFoundError: Could not initialize class com.adobe.xmp.impl.XMPMetaParser
at com.adobe.xmp.XMPMetaFactory.parseFromBuffer(XMPMetaFactory.java:160)
at com.adobe.xmp.XMPMetaFactory.parseFromBuffer(XMPMetaFactory.java:144)
at com.drew.metadata.xmp.XmpReader.extract(XmpReader.java:106)
at com.drew.imaging.jpeg.JpegMetadataReader.extractMetadataFromJpegSegmentReader(JpegMetadataReader.java:112)
at com.drew.imaging.jpeg.JpegMetadataReader.readMetadata(JpegMetadataReader.java:71)
at org.apache.tika.parser.image.ImageMetadataExtractor.parseJpeg(ImageMetadataExtractor.java:91)
at org.apache.tika.parser.jpeg.JpegParser.parse(JpegParser.java:56)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:121)
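
One way to narrow a report like this down is sketched below, under stated assumptions. NoClassDefFoundError is an Error rather than an Exception, so a catch (Exception e) block does not see it, although the finally block still runs. Also, java.nio.file.Files.move throws an IOException that says why a move failed, where File.renameTo only returns false. Files.move needs Java 7 and the lambda in main needs Java 8, so this does not run on the reporter's jre6 as written. The Step interface and the processThenMove method are hypothetical stand-ins for the parser.parse() call, not Tika API.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

final class ParseThenMove {
    // Hypothetical stand-in for parser.parse(stream, ...).
    interface Step {
        void run(InputStream in) throws Exception;
    }

    // Runs the step, always closes the stream, then moves the file.
    // Returns the Throwable raised by the step, or null if it succeeded.
    static Throwable processThenMove(Path source, Path target, Step step) throws IOException {
        Throwable failure = null;
        InputStream in = Files.newInputStream(source);
        try {
            step.run(in);
        } catch (Throwable t) {
            // Throwable, so that Errors such as NoClassDefFoundError are recorded too.
            failure = t;
        } finally {
            in.close();
        }
        // Unlike File.renameTo, this throws with a reason if the move fails.
        Files.move(source, target, StandardCopyOption.REPLACE_EXISTING);
        return failure;
    }

    public static void main(String[] args) throws IOException {
        Path source = Files.createTempFile("tika-demo", ".jpg");
        Path target = source.resolveSibling(source.getFileName() + ".done");
        Throwable failure = processThenMove(source, target, in -> {
            throw new NoClassDefFoundError("simulated missing dependency");
        });
        System.out.println("step failed with: " + failure);
        System.out.println("moved: " + Files.exists(target));
        Files.delete(target);
    }
}
```

If the move still fails with the stream closed, the exception message from Files.move should at least indicate whether another open handle or a permissions problem is involved.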



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1311) Centralize JSON handling of Metadata

2014-10-21 Thread Hudson (JIRA)

[ https://issues.apache.org/jira/browse/TIKA-1311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14178360#comment-14178360 ]

Hudson commented on TIKA-1311:
--

SUCCESS: Integrated in tika-trunk-jdk1.7 #275 (See [https://builds.apache.org/job/tika-trunk-jdk1.7/275/])
clean up from TIKA-1311 (tallison: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1633357)
* /tika/trunk/tika-app/src/main/java/org/apache/tika/io


 Centralize JSON handling of Metadata
 

 Key: TIKA-1311
 URL: https://issues.apache.org/jira/browse/TIKA-1311
 Project: Tika
  Issue Type: Task
Reporter: Tim Allison
Priority: Minor
 Fix For: 1.6

 Attachments: TIKA-1311.patch


 When json was initially added to TIKA CLI (TIKA-213), there was a 
 recommendation to centralize JSON handling of Metadata, potentially putting 
 it in core.  On a recent bug fix (TIKA-1291), the same recommendation was 
 repeated especially noting that we now handle JSON/Metadata differently in 
 CLI and server.
 Let's centralize JSON handling in core and use GSON.  We should add a 
 serializer and a deserializer so that users don't have to reinvent that wheel.




[jira] [Commented] (TIKA-1302) Let's run Tika against a large batch of docs nightly

2014-10-21 Thread Andrew Jackson (JIRA)

[ https://issues.apache.org/jira/browse/TIKA-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14178361#comment-14178361 ]

Andrew Jackson commented on TIKA-1302:
--

Okay, so the c.300,000 exceptions are here: 
https://www.dropbox.com/s/ka19fguaxflp725/parse_errors.csv.gz?dl=0 - let me 
know if you'd like it placed elsewhere (it's 14MB of compressed CSV).

This conversation has helped me spot a gap in our code. We currently do a 
Tika.detect() before we do a Tika.parse(), and only do the latter if the former 
succeeded. Sadly, the version of the code that I used to generate this data did 
not record the Tika exception for the .detect() step, only the .parse() step. 
This will explain why there are no hung-thread events in this result set - the 
interrupted .detect() was not recorded properly.  We'll be re-running this scan 
soonish, so I'll make sure the next version records all the exceptions. IIRC, 
from looking at the MIME types, the permanent hangs were mostly ZIPs, Office 
documents, and maybe some PDFs.

Note that the CSV includes the Content-Type from the .detect() step, and this 
should indicate which module was run on the resource (i.e. whatever the Tika 
1.5 mapping was for that MIME type). I don't think we changed the parse 
configuration significantly, so it seems HTML and XHTML and XML should all have 
gone through the HtmlParser (I'm not 100% sure about this, and will try to 
check).

I'm not sure it's worth giving you all the SAX exceptions, as there are a lot 
of repeats of the same problems. I think a random sample of about 50,000 should 
be plenty. Does that sound okay to you?
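
The gap described above (failures in the .detect() step going unrecorded) can be sketched roughly as follows. This is only an illustration of recording both stages, not the code used for the scan: the Detect and Parse interfaces and the StageRecorder class are hypothetical stand-ins for the Tika.detect() and Tika.parse() calls, chosen so that the example runs with the JDK alone.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.TimeoutException;

final class StageRecorder {
    // Hypothetical stand-ins for the Tika.detect() and Tika.parse() calls.
    interface Detect {
        String run(byte[] content) throws Exception;
    }

    interface Parse {
        String run(byte[] content, String mimeType) throws Exception;
    }

    // One row per failure: which stage failed, and why.
    static final class Failure {
        final String stage;
        final Exception cause;

        Failure(String stage, Exception cause) {
            this.stage = stage;
            this.cause = cause;
        }
    }

    final List<Failure> failures = new ArrayList<>();

    // Returns the extracted text, or null if either stage failed (the failure is recorded).
    String process(byte[] content, Detect detect, Parse parse) {
        String mimeType;
        try {
            mimeType = detect.run(content);
        } catch (Exception e) {
            failures.add(new Failure("detect", e)); // the step that went unrecorded
            return null;
        }
        try {
            return parse.run(content, mimeType);
        } catch (Exception e) {
            failures.add(new Failure("parse", e));
            return null;
        }
    }

    public static void main(String[] args) {
        StageRecorder recorder = new StageRecorder();
        recorder.process(new byte[0],
                c -> { throw new TimeoutException("detect timed out"); },
                (c, m) -> "text");
        recorder.process(new byte[0],
                c -> "text/html",
                (c, m) -> { throw new IllegalStateException("bad markup"); });
        for (Failure f : recorder.failures) {
            System.out.println(f.stage + ": " + f.cause);
        }
    }
}
```

Keeping the stage name next to each exception means hung or interrupted detections show up in the same result set as parse errors.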

 Let's run Tika against a large batch of docs nightly
 

 Key: TIKA-1302
 URL: https://issues.apache.org/jira/browse/TIKA-1302
 Project: Tika
  Issue Type: Improvement
  Components: cli, general, server
Reporter: Tim Allison

 Many thanks to [~lewismc] for TIKA-1301!  Once we get nightly builds up and 
 running again, it might be fun to run Tika regularly against a large set of 
 docs and report metrics.
 One excellent candidate corpus is govdocs1: 
 http://digitalcorpora.org/corpora/files.
 Any other candidate corpora?  
 [~willp-bl], have anything handy you'd like to contribute? 
 [http://www.openplanetsfoundation.org/blogs/2014-03-21-tika-ride-characterising-web-content-nanite]
  ;) 




[jira] [Commented] (TIKA-1452) parser.parse() throws exception after which the procesed file is not getting renamed/moved/deleted

2014-10-21 Thread Nick Burch (JIRA)

[ https://issues.apache.org/jira/browse/TIKA-1452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14178362#comment-14178362 ]

Nick Burch commented on TIKA-1452:
--

Can you provide a junit test case that shows how to reproduce the issue?

Also, have you thought about fixing your underlying missing dependency issue?


[jira] [Comment Edited] (TIKA-1302) Let's run Tika against a large batch of docs nightly

2014-10-21 Thread Andrew Jackson (JIRA)

[ https://issues.apache.org/jira/browse/TIKA-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14178361#comment-14178361 ]

Andrew Jackson edited comment on TIKA-1302 at 10/21/14 12:59 PM:
-

Okay, so the c.300,000 exceptions are here: 
https://www.dropbox.com/s/ka19fguaxflp725/parse_errors.csv.gz?dl=0 - let me 
know if you'd like it placed elsewhere (it's 14MB of compressed CSV).

This conversation has helped me spot a gap in our code. We currently do a 
Tika.detect() before we do a Tika.parse(), and only do the latter if the former 
succeeded. Sadly, the version of the code that I used to generate this data did 
not record the Tika exception for the .detect() step, only the .parse() step. 
This will explain why there are no hung-thread events in this result set - the 
interrupted .detect() was not recorded properly.  We'll be re-running this scan 
soonish, so I'll make sure the next version records all the exceptions. IIRC, 
from looking at the MIME types, the permanent hangs were mostly ZIPs, Office 
documents, and maybe some PDFs.

Note that the CSV includes the Content-Type from the .detect() step, and this 
should indicate which module was run on the resource (i.e. whatever the Tika 
1.5 mapping was for that MIME type). I don't think we changed the parse 
configuration significantly, so it seems HTML and XHTML and XML should all have 
gone through the HtmlParser (I'm not 100% sure about this, and will try to 
check).

I'm not sure it's worth giving you all the SAX exceptions, as there are a lot 
of repeats of the same problems. I think a random sample of about 50,000 should 
be plenty. Does that sound okay to you?

EDIT: Oh, and I meant to say, I'm glad to hear about [~gostep] and 
[~talli...@apache.org]'s efforts to run this on GovDocs, and would be 
interested in comparing results. We already publish format profile data about 
web archives, and would love to have more data to refer to.



Re: svn commit: r1633325 - in /tika/trunk/tika-parsers/src: main/java/org/apache/tika/parser/ocr/TesseractOCRParser.java test/java/org/apache/tika/parser/mail/RFC822ParserTest.java

2014-10-21 Thread Mattmann, Chris A (3980)
Hi Hong-Thai,

These commits look strange to me - it looks like it subtracts the
whole files (and the unit test removed the test file, renamed it,
and then added what largely looks like the same file, back?)

Any idea what's up?

Cheers,
Chris

++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++






-Original Message-
From: thaicha...@apache.org thaicha...@apache.org
Reply-To: dev@tika.apache.org dev@tika.apache.org
Date: Tuesday, October 21, 2014 at 2:32 AM
To: comm...@tika.apache.org comm...@tika.apache.org
Subject: svn commit: r1633325 - in /tika/trunk/tika-parsers/src:
main/java/org/apache/tika/parser/ocr/TesseractOCRParser.java
test/java/org/apache/tika/parser/mail/RFC822ParserTest.java

Author: thaichat04
Date: Tue Oct 21 09:32:06 2014
New Revision: 1633325

URL: http://svn.apache.org/r1633325
Log:
TIKA-1422 - Apply fix of [~olegt] in Windows

Modified:
    tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/ocr/TesseractOCRParser.java
    tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/mail/RFC822ParserTest.java

Modified: tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/ocr/TesseractOCRParser.java
URL: http://svn.apache.org/viewvc/tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/ocr/TesseractOCRParser.java?rev=1633325&r1=1633324&r2=1633325&view=diff
==

--- tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/ocr/TesseractOCRParser.java (original)
+++ tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/ocr/TesseractOCRParser.java Tue Oct 21 09:32:06 2014
@@ -26,11 +26,11 @@ import java.io.IOException;
 import java.io.InputStream;
 import java.io.InputStreamReader;
 import java.io.Reader;
+import java.util.ArrayList;
 import java.util.HashSet;
+import java.util.List;
 import java.util.Map;
 import java.util.Set;
-import java.util.List;
-import java.util.ArrayList;
 import java.util.concurrent.Callable;
 import java.util.concurrent.ExecutionException;
 import java.util.concurrent.FutureTask;
@@ -45,20 +45,23 @@ import org.apache.tika.io.TemporaryResou
 import org.apache.tika.io.TikaInputStream;
 import org.apache.tika.metadata.Metadata;
 import org.apache.tika.mime.MediaType;
-import org.apache.tika.parser.Parser;
 import org.apache.tika.parser.AbstractParser;
 import org.apache.tika.parser.ParseContext;
+import org.apache.tika.parser.Parser;
 import org.apache.tika.parser.external.ExternalParser;
+import org.apache.tika.parser.image.ImageParser;
+import org.apache.tika.parser.image.PSDParser;
+import org.apache.tika.parser.image.TiffParser;
+import org.apache.tika.parser.jpeg.JpegParser;
 import org.apache.tika.sax.XHTMLContentHandler;
 import org.xml.sax.ContentHandler;
 import org.xml.sax.SAXException;
 
 /**
- * TesseractOCRParser powered by tesseract-ocr engine.
- * To enable this parser, create a {@link TesseractOCRConfig}
- * object and pass it through a ParseContext.
- * Tesseract-ocr must be installed and on system path or
- * the path to its root folder must be provided:
+ * TesseractOCRParser powered by tesseract-ocr engine. To enable this parser,
+ * create a {@link TesseractOCRConfig} object and pass it through a
+ * ParseContext. Tesseract-ocr must be installed and on system path or the path
+ * to its root folder must be provided:
  * <p>
  * TesseractOCRConfig config = new TesseractOCRConfig();<br>
  * //Needed if tesseract is not on system path<br>
@@ -69,226 +72,231 @@ import org.xml.sax.SAXException;
  * 
  */
 public class TesseractOCRParser extends AbstractParser {
-  
-  private static final long serialVersionUID = 1L;
-  
-  private static final Set<MediaType> SUPPORTED_TYPES = getTypes();
-  
-  private static Set<MediaType> getTypes() {
-  HashSet<MediaType> supportedTypes = new HashSet<MediaType>();
-  
-  supportedTypes.add(MediaType.image("png"));
-  supportedTypes.add(MediaType.image("jpeg"));
-  supportedTypes.add(MediaType.image("tiff"));
-  supportedTypes.add(MediaType.image("x-ms-bmp"));
-  supportedTypes.add(MediaType.image("gif"));
-  
-  return supportedTypes;
-  }
-  
-  @Override
-  public Set<MediaType> getSupportedTypes(ParseContext arg0) {
-  return SUPPORTED_TYPES;
-  }
-
-private void setEnv(TesseractOCRConfig config, ProcessBuilder pb) {
-

Re: Tika 1.6 update in Maven Central?

2014-10-21 Thread Aeham Abushwashi
Thanks Chris. Any ideas when that is likely to happen?
I'm trying to determine whether I can wait for a 1.7 release. If not, I
think my only option to avoid the uncontrolled build up of tmp files (when
processing .7z archives) would be to go back to 1.5.

Regards,
Aeham


parser.parse() throws exception after which the procesed file is not getting renamed/moved.

2014-10-21 Thread Tony Braganza
I am passing a file as an input stream to the parser.parse() method while using
the Apache Tika library to convert a file to text. The method throws an
exception (displayed below), but the input stream is closed successfully in the
finally block. Then, while renaming the file, the File.renameTo method from
java.io returns false. I am not able to rename/delete/move the file despite
successfully closing the inputStream. I am afraid another instance of the file
is created while the parser.parse() method processes the file, and that it
doesn't get closed by the time the exception is thrown. Is that possible? If
so, what should I do to rename or delete the file?

The Exception thrown while checking the content type is

java.lang.NoClassDefFoundError: Could not initialize class com.adobe.xmp.impl.XMPMetaParser
at com.adobe.xmp.XMPMetaFactory.parseFromBuffer(XMPMetaFactory.java:160)
at com.adobe.xmp.XMPMetaFactory.parseFromBuffer(XMPMetaFactory.java:144)
at com.drew.metadata.xmp.XmpReader.extract(XmpReader.java:106)
at com.drew.imaging.jpeg.JpegMetadataReader.extractMetadataFromJpegSegmentReader(JpegMetadataReader.java:112)
at com.drew.imaging.jpeg.JpegMetadataReader.readMetadata(JpegMetadataReader.java:71)
at org.apache.tika.parser.image.ImageMetadataExtractor.parseJpeg(ImageMetadataExtractor.java:91)
at org.apache.tika.parser.jpeg.JpegParser.parse(JpegParser.java:56)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:121)
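
For anyone trying to narrow this down, here is a small sketch of the close-then-move pattern using only the JDK. It is not a confirmed fix for the problem above: the StreamProcessor interface is a hypothetical stand-in for the parser.parse() call, and the class name is made up. It illustrates two points that may help with diagnosis. NoClassDefFoundError is an Error rather than an Exception, so a plain catch (Exception e) does not see it, and Files.move reports why a move failed instead of just returning false as File.renameTo does.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical helper: process a file, always close it, then move it. */
final class ProcessThenMove {

    /** Stand-in for the call to parser.parse(...). */
    interface StreamProcessor {
        void process(InputStream in) throws Exception;
    }

    /**
     * Processes the file, closes the stream in all cases, then moves the file.
     * Returns the failure raised during processing, or null if there was none.
     */
    static Throwable processThenMove(Path source, Path target, StreamProcessor processor)
            throws IOException {
        Throwable failure = null;
        // try-with-resources closes the stream before the catch block runs.
        try (InputStream in = Files.newInputStream(source)) {
            processor.process(in);
        } catch (Exception | LinkageError e) {
            // NoClassDefFoundError is a LinkageError, not an Exception.
            failure = e;
        }
        // Throws an IOException describing the reason if the move fails,
        // for example because another handle to the file is still open.
        Files.move(source, target, StandardCopyOption.REPLACE_EXISTING);
        return failure;
    }
}
```

If Files.move still fails after the stream is closed, the exception message should at least indicate whether the file is in use by another handle.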



--
View this message in context: 
http://lucene.472066.n3.nabble.com/parser-parse-throws-exception-after-which-the-procesed-file-is-not-getting-renamed-moved-tp4165153.html
Sent from the Apache Tika - Development mailing list archive at Nabble.com.


Re: svn commit: r1633325 - in /tika/trunk/tika-parsers/src: main/java/org/apache/tika/parser/ocr/TesseractOCRParser.java test/java/org/apache/tika/parser/mail/RFC822ParserTest.java

2014-10-21 Thread Hong-Thai Nguyen
Hi Chris,

Yes, I made a mistake in this commit: I missed a renamed file and broke the
build. The next commit corrects it:
Revision: 161
Author: thaichat04
Date: Tuesday, 21 October 2014 11:47:54
Message:
TIKA-1422 - Fixing build & minor refactory of naming test class

Modified :
/tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/mail/RFC822ParserTest.java
Added :
/tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/ocr/TesseractOCRParserTest.java
Deleted :
/tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/ocr/TesseractOCRTest.java

Please 'pull' the latest again, then tell me if it is OK.

Sorry


Re: svn commit: r1633325 - in /tika/trunk/tika-parsers/src: main/java/org/apache/tika/parser/ocr/TesseractOCRParser.java test/java/org/apache/tika/parser/mail/RFC822ParserTest.java

2014-10-21 Thread Mattmann, Chris A (3980)
No worries Hong-Thai! Will update and test, thanks!

++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++







[jira] [Commented] (TIKA-1422) org.apache.tika.parser.mail.RFC822ParserTest fails

2014-10-21 Thread Tyler Palsulich (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14178866#comment-14178866
 ] 

Tyler Palsulich commented on TIKA-1422:
---

{code}
Results :

Failed tests:   testMultipart(org.apache.tika.parser.mail.RFC822ParserTest): 
(..)

Tests run: 546, Failures: 1, Errors: 0, Skipped: 4
{code}
{code}
Wanted 5 times but was 4
at 
org.apache.tika.parser.mail.RFC822ParserTest.testMultipart(RFC822ParserTest.java:93)
Caused by: org.mockito.exceptions.cause.TooLittleInvocations:
{code}

Still getting a failing test with Tesseract 3.02.02 installed on Mac. Will look 
into this more tomorrow. But, thank you, [~o...@apache.org] and [~thaichat04]!

 org.apache.tika.parser.mail.RFC822ParserTest fails
 --

 Key: TIKA-1422
 URL: https://issues.apache.org/jira/browse/TIKA-1422
 Project: Tika
  Issue Type: Bug
  Components: parser
Reporter: Chris A. Mattmann
Assignee: Chris A. Mattmann
 Fix For: 1.7

 Attachments: TIKA-1422.Mattmann.100114.patch.txt, 
 TIKA-1422.Mattmann.100414.patch.txt, TIKA-1422.oleg.20141021.patch, 
 TIKA-1422.palsulich.100414.patch, TIKA-1422.palsulich.100714.patch


 I'm seeing test failures from:
 {noformat}
 Results :
 Failed tests:   testMultipart(org.apache.tika.parser.mail.RFC822ParserTest): 
 (..)
 Tests run: 538, Failures: 1, Errors: 0, Skipped: 1
 {noformat}
 CentOS6 VM image, running:
 {noformat}
 [mattmann@memex tika]$ java -version
 java version 1.7.0_67
 Java(TM) SE Runtime Environment (build 1.7.0_67-b01)
 Java HotSpot(TM) 64-Bit Server VM (build 24.65-b04, mixed mode)
 [mattmann@memex tika]$ mvn -version
 Apache Maven 3.2.1 (ea8b2b07643dbb1b84b6d16e1f08391b666bc1e9; 
 2014-02-14T09:37:52-08:00)
 Maven home: /usr/share/apache-maven
 Java version: 1.7.0_65, vendor: Oracle Corporation
 Java home: /data/home/mattmann/dist/jdk1.7.0_65/jre
 Default locale: en_US, platform encoding: UTF-8
 OS name: linux, version: 2.6.32-431.23.3.el6.centos.plus.x86_64, arch: 
 amd64, family: unix
 [mattmann@memex tika]$ 
 {noformat}
 Here are the surefire reports - no clue what's up here:
 {noformat}
 [mattmann@memex tika]$ more 
 tika-parsers/target/surefire-reports/org.apache.tika.parser.mail.RFC822ParserTest.txt
  
 ---
 Test set: org.apache.tika.parser.mail.RFC822ParserTest
 ---
 Tests run: 8, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 0.699 sec <<< FAILURE!
 testMultipart(org.apache.tika.parser.mail.RFC822ParserTest)  Time elapsed: 0.152 sec  <<< FAILURE!
 org.mockito.exceptions.verification.TooManyActualInvocations: 
 xHTMLContentHandler.startElement(
 "http://www.w3.org/1999/xhtml",
 "div",
 "div",
 isA(org.xml.sax.Attributes)
 );
 Wanted 4 times but was 5
   at 
 org.apache.tika.parser.mail.RFC822ParserTest.testMultipart(RFC822ParserTest.java:87)
 Caused by: org.mockito.exceptions.cause.UndesiredInvocation: 
 Undesired invocation:
   at 
 org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
   at 
 org.apache.tika.sax.SafeContentHandler.startElement(SafeContentHandler.java:264)
   at 
 org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:254)
   at 
 org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
   at 
 org.apache.tika.sax.xpath.MatchingContentHandler.startElement(MatchingContentHandler.java:60)
   at 
 org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
   at 
 org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
   at 
 org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
   at 
 org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
   at 
 org.apache.tika.sax.SafeContentHandler.startElement(SafeContentHandler.java:264)
   at 
 org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:254)
   at 
 org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:284)
   at 
 org.apache.tika.parser.ocr.TesseractOCRParser.extractOutput(TesseractOCRParser.java:243)
   at 
 org.apache.tika.parser.ocr.TesseractOCRParser.parse(TesseractOCRParser.java:155)
   at 
 org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:247)
   at 
 org.apache.tika.parser.mail.MailContentHandler.body(MailContentHandler.java:102)
   at 
 org.apache.james.mime4j.parser.MimeStreamParser.parse(MimeStreamParser.java:133)
   at org.apache.tika.parser.mail.RFC822Parser.parse(RFC822Parser.java:76

[jira] [Commented] (TIKA-1446) CHM parser : wrong decompression of aligned blocks

2014-10-21 Thread Bin Hawking (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14178889#comment-14178889
 ] 

Bin Hawking commented on TIKA-1446:
---

The attachment above is my fix. It contains the old and new code from 
tika-1.6\tika-parsers\src\main\java\org\apache\tika\parser\chm\

Please use diff to see my changes. 

This fix addresses TIKA-1430, TIKA-1446, TIKA-1447, and TIKA-1448.

NOTE: My fix is not well tested and may be incomplete. Also, because I am adding 
new features to the CHM parser for my own application, including parsing HHK and 
HHC files for more metadata, my revisions contain some distractions that 
are not applicable to the original Tika project. Sorry for the inconvenience.


 CHM parser : wrong decompression of aligned blocks
 --

 Key: TIKA-1446
 URL: https://issues.apache.org/jira/browse/TIKA-1446
 Project: Tika
  Issue Type: Bug
Affects Versions: 1.7
Reporter: Bin Hawking
Priority: Critical
 Attachments: chm.zip


 If an embedded file contains aligned blocks, the parser outputs garbled or 
 empty text for that file.
 I have fixed it myself by correcting decompressAlignedBlock() and its 
 preparation methods. The bug is mostly due to misuse of the main tree, align 
 tree, and length tree. Some of the trees are also built incorrectly.





[jira] [Updated] (TIKA-1446) CHM parser : wrong decompression of aligned blocks

2014-10-21 Thread Bin Hawking (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bin Hawking updated TIKA-1446:
--
Attachment: chm.zip

 CHM parser : wrong decompression of aligned blocks
 --

 Key: TIKA-1446
 URL: https://issues.apache.org/jira/browse/TIKA-1446
 Project: Tika
  Issue Type: Bug
Affects Versions: 1.7
Reporter: Bin Hawking
Priority: Critical
 Attachments: chm.zip


 If an embedded file contains aligned blocks, the parser outputs garbled or 
 empty text for that file.
 I have fixed it myself by correcting decompressAlignedBlock() and its 
 preparation methods. The bug is mostly due to misuse of the main tree, align 
 tree, and length tree. Some of the trees are also built incorrectly.





[jira] [Updated] (TIKA-1446) CHM parser : wrong decompression of aligned blocks

2014-10-21 Thread Bin Hawking (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bin Hawking updated TIKA-1446:
--
Attachment: (was: chm.zip)

 CHM parser : wrong decompression of aligned blocks
 --

 Key: TIKA-1446
 URL: https://issues.apache.org/jira/browse/TIKA-1446
 Project: Tika
  Issue Type: Bug
Affects Versions: 1.7
Reporter: Bin Hawking
Priority: Critical

 If an embedded file contains aligned blocks, the parser outputs garbled or 
 empty text for that file.
 I have fixed it myself by correcting decompressAlignedBlock() and its 
 preparation methods. The bug is mostly due to misuse of the main tree, align 
 tree, and length tree. Some of the trees are also built incorrectly.





import (re)ordering?

2014-10-21 Thread Allison, Timothy B.
All,
  I have IntelliJ set to order imports by javax, java, then other.  I think 
this is the most common pattern in Tika.  Is it OK if I make these 
(meaningless/formatting) changes when I commit other changes?
  Thank you.

   Best,

  Tim


[jira] [Created] (TIKA-1453) fails to parse RFC3464 documents

2014-10-21 Thread Rob Tulloh (JIRA)
Rob Tulloh created TIKA-1453:


 Summary: fails to parse RFC3464 documents
 Key: TIKA-1453
 URL: https://issues.apache.org/jira/browse/TIKA-1453
 Project: Tika
  Issue Type: Bug
Affects Versions: 1.5
Reporter: Rob Tulloh
Priority: Minor


Tika 1.5 does not support content-type message/delivery-status

http://tools.ietf.org/html/rfc3464

Notes from Oracle indicate that javamail now supports this RFC.

curl -H Content-Type:message/delivery-status -T /tmp/xxx 
http://localhost:9998/tika

Produces

2014-10-21_21:06:40.23890 Oct 21, 2014 4:06:40 PM 
org.apache.tika.server.TikaResource logRequest
2014-10-21_21:06:40.23894 INFO: tika (message/delivery-status)
2014-10-21_21:06:40.23994 Oct 21, 2014 4:06:40 PM 
org.apache.tika.server.TikaResource$3 write
2014-10-21_21:06:40.23995 WARNING: tika: Text extraction failed
2014-10-21_21:06:40.23996 org.apache.tika.exception.TikaException: Unexpected 
RuntimeException from org.apache.tika.server.TikaResource$1@dae96f2
2014-10-21_21:06:40.23997   at 
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
2014-10-21_21:06:40.23997   at 
org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
2014-10-21_21:06:40.23998   at 
org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:136)
2014-10-21_21:06:40.23998   at 
org.apache.tika.server.TikaResource$3.write(TikaResource.java:196)
2014-10-21_21:06:40.23999   at 
org.apache.cxf.jaxrs.provider.BinaryDataProvider.writeTo(BinaryDataProvider.java:118)
2014-10-21_21:06:40.23999   at 
org.apache.cxf.jaxrs.utils.JAXRSUtils.writeMessageBody(JAXRSUtils.java:1317)
2014-10-21_21:06:40.24000   at 
org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.serializeMessage(JAXRSOutInterceptor.java:284)
2014-10-21_21:06:40.24000   at 
org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.processResponse(JAXRSOutInterceptor.java:157)
2014-10-21_21:06:40.24001   at 
org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.handleMessage(JAXRSOutInterceptor.java:86)
2014-10-21_21:06:40.24002   at 
org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:272)
2014-10-21_21:06:40.24003   at 
org.apache.cxf.interceptor.OutgoingChainInterceptor.handleMessage(OutgoingChainInterceptor.java:77)
2014-10-21_21:06:40.24004   at 
org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:272)
2014-10-21_21:06:40.24007   at 
org.apache.cxf.transport.ChainInitiationObserver.onMessage(ChainInitiationObserver.java:121)
2014-10-21_21:06:40.24008   at 
org.apache.cxf.transport.http_jetty.JettyHTTPDestination.serviceRequest(JettyHTTPDestination.java:355)
2014-10-21_21:06:40.24009   at 
org.apache.cxf.transport.http_jetty.JettyHTTPDestination.doService(JettyHTTPDestination.java:319)
2014-10-21_21:06:40.24009   at 
org.apache.cxf.transport.http_jetty.JettyHTTPHandler.handle(JettyHTTPHandler.java:72)
2014-10-21_21:06:40.24010   at 
org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1088)
2014-10-21_21:06:40.24010   at 
org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1024)
2014-10-21_21:06:40.24011   at 
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
2014-10-21_21:06:40.24011   at 
org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
2014-10-21_21:06:40.24012   at 
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
2014-10-21_21:06:40.24013   at 
org.eclipse.jetty.server.Server.handle(Server.java:370)
2014-10-21_21:06:40.24014   at 
org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:494)
2014-10-21_21:06:40.24014   at 
org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:971)
2014-10-21_21:06:40.24015   at 
org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:1033)
2014-10-21_21:06:40.24015   at 
org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:651)
2014-10-21_21:06:40.24016   at 
org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)
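
For reproducing this from Java rather than curl, a rough equivalent of the request above might look like the following sketch. It uses only the JDK's HttpURLConnection; the class and method names are made up for illustration, and the endpoint and content type are the ones from the curl command. curl's -T option uploads the file with a PUT request, which is what this mirrors.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper mirroring: curl -H Content-Type:... -T file url */
final class TikaServerPut {

    /** Configures, but does not yet open, a PUT request with the given content type. */
    static HttpURLConnection prepare(URL endpoint, String contentType) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) endpoint.openConnection();
        conn.setRequestMethod("PUT");
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type", contentType);
        return conn;
    }

    /** Uploads the file as the request body and returns the HTTP status code. */
    static int send(HttpURLConnection conn, Path file) throws IOException {
        try (OutputStream out = conn.getOutputStream()) {
            Files.copy(file, out);
        }
        try {
            return conn.getResponseCode();
        } finally {
            conn.disconnect();
        }
    }
}
```

Calling prepare with http://localhost:9998/tika and message/delivery-status, then send with the path to the sample file, should exercise the same server code path as the curl command, assuming a tika-server instance is listening on that port.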






Re: import (re)ordering?

2014-10-21 Thread Nick Burch

On Tue, 21 Oct 2014, Allison, Timothy B. wrote:
I have Intellij set to order imports by javax, java, then other.  I 
think this is the most common pattern in Tika.  Is it ok if I make these 
(meaningless/formatting) changes when I commit other changes?


The only downside of this is that the top of the commit email is then 
all noise, so it's more likely that people will end up skipping the 
review of the meat of the commit.


It's not always possible, but where you can, it's generally best to split 
tidy-up / formatting commits (whitespace, imports, formatting, etc.) 
from ones that touch functionality.


Cheers
Nick


[jira] [Commented] (TIKA-1451) Add Recursive Metadata Parser Wrapper output to tika-app and gui

2014-10-21 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14179422#comment-14179422
 ] 

Hudson commented on TIKA-1451:
--

SUCCESS: Integrated in tika-trunk-jdk1.7 #276 (See 
[https://builds.apache.org/job/tika-trunk-jdk1.7/276/])
TIKA-1451 add RecursiveParserWrapper output to CLI and GUI (tallison: 
http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1633499)
* /tika/trunk/CHANGES.txt
* /tika/trunk/tika-app/src/main/java/org/apache/tika/cli/TikaCLI.java
* /tika/trunk/tika-app/src/main/java/org/apache/tika/gui/TikaGUI.java
* /tika/trunk/tika-app/src/test/java/org/apache/tika/cli/TikaCLITest.java
* /tika/trunk/tika-app/src/test/resources/test-data/test_recursive_embedded.docx
* 
/tika/trunk/tika-serialization/src/main/java/org/apache/tika/metadata/serialization/JsonMetadata.java
* 
/tika/trunk/tika-serialization/src/main/java/org/apache/tika/metadata/serialization/JsonMetadataBase.java
* 
/tika/trunk/tika-serialization/src/main/java/org/apache/tika/metadata/serialization/JsonMetadataList.java
* 
/tika/trunk/tika-serialization/src/test/java/org/apache/tika/metadata/serialization/JsonMetadataListTest.java
* 
/tika/trunk/tika-serialization/src/test/java/org/apache/tika/metadata/serialization/JsonMetadataTest.java


 Add Recursive Metadata Parser Wrapper output to tika-app and gui
 

 Key: TIKA-1451
 URL: https://issues.apache.org/jira/browse/TIKA-1451
 Project: Tika
  Issue Type: Improvement
Reporter: Tim Allison
Priority: Minor
 Fix For: 1.7

 Attachments: integrate_recursive_metadata_wrapper.patch


 It would be helpful to expose the output of the recursive metadata parser 
 wrapper in the gui and in the command line for tika-app.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1451) Add Recursive Metadata Parser Wrapper output to tika-app and gui

2014-10-21 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14179441#comment-14179441
 ] 

Hudson commented on TIKA-1451:
--

SUCCESS: Integrated in tika-trunk-jdk1.6 #255 (See 
[https://builds.apache.org/job/tika-trunk-jdk1.6/255/])
TIKA-1451 add RecursiveParserWrapper output to CLI and GUI (tallison: 
http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1633499)
* /tika/trunk/CHANGES.txt
* /tika/trunk/tika-app/src/main/java/org/apache/tika/cli/TikaCLI.java
* /tika/trunk/tika-app/src/main/java/org/apache/tika/gui/TikaGUI.java
* /tika/trunk/tika-app/src/test/java/org/apache/tika/cli/TikaCLITest.java
* /tika/trunk/tika-app/src/test/resources/test-data/test_recursive_embedded.docx
* /tika/trunk/tika-serialization/src/main/java/org/apache/tika/metadata/serialization/JsonMetadata.java
* /tika/trunk/tika-serialization/src/main/java/org/apache/tika/metadata/serialization/JsonMetadataBase.java
* /tika/trunk/tika-serialization/src/main/java/org/apache/tika/metadata/serialization/JsonMetadataList.java
* /tika/trunk/tika-serialization/src/test/java/org/apache/tika/metadata/serialization/JsonMetadataListTest.java
* /tika/trunk/tika-serialization/src/test/java/org/apache/tika/metadata/serialization/JsonMetadataTest.java




