Re: 1.7 release?
Taken. Thanks. In progress ...

On Tue, Oct 21, 2014 at 8:54 AM, Mattmann, Chris A (3980) chris.a.mattm...@jpl.nasa.gov wrote:

Trunk is the current checkout/branch: http://svn.apache.org/repos/asf/tika/trunk

++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW: http://sunset.usc.edu/~mattmann/
++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++

-Original Message-
From: Oleg Tikhonov olegtikho...@gmail.com
Reply-To: dev@tika.apache.org
Date: Monday, October 20, 2014 at 10:16 PM
To: dev@tika.apache.org
Subject: Re: 1.7 release?

Hi, I can try this on. What is a trunk? Thanks, Oleg

On Tue, Oct 21, 2014 at 6:21 AM, Mattmann, Chris A (3980) chris.a.mattm...@jpl.nasa.gov wrote:

Hmm, any idea why this is failing on Windows? Tyler P. and I were talking the other day - maybe we shouldn't run the tests from TIKA-1422 unless Tesseract is installed? Thoughts?

-Original Message-
From: Hong-Thai Nguyen thaicha...@gmail.com
Reply-To: dev@tika.apache.org
Date: Thursday, October 16, 2014 at 2:03 AM
To: dev@tika.apache.org
Subject: Re: 1.7 release?

Hi Andrzej, we are impatient for the 1.7 release too. I'm having a compile problem with TIKA-1422 on my side. If anyone can build successfully on Windows, I have no objection to releasing 1.7. Thanks,

On Thu, Oct 16, 2014 at 10:51 AM, Andrzej Białecki a...@getopt.org wrote:

Hi, any news on the 1.7 release? Or at least a 1.6.1 release that includes the fix for broken ODF parsing...

---
Best regards, Andrzej Bialecki

--
Hong-Thai
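The suggestion quoted above (only run the OCR-dependent tests when Tesseract is installed) could be approached by probing for the binary before the test runs. The following is a minimal sketch under stated assumptions, not Tika's actual check: the class name `TesseractCheck` and the method `canStart` are hypothetical, and it assumes the executable is on the PATH. It uses only the JDK (Java 9+ for `Redirect.DISCARD`).

```java
import java.io.IOException;
import java.util.concurrent.TimeUnit;

// Hypothetical helper, not part of Tika: reports whether an external
// command can be launched at all on this machine.
final class TesseractCheck {
    private TesseractCheck() {}

    static boolean canStart(String... command) {
        try {
            Process process = new ProcessBuilder(command)
                    .redirectErrorStream(true)
                    // Discard output so the child never blocks on a full pipe.
                    .redirectOutput(ProcessBuilder.Redirect.DISCARD)
                    .start();
            // Give the process a few seconds to finish, then clean up.
            if (!process.waitFor(5, TimeUnit.SECONDS)) {
                process.destroyForcibly();
            }
            return true; // the binary was found and could be started
        } catch (IOException e) {
            return false; // binary missing or not executable
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false; // interrupted while waiting: treat as unavailable
        }
    }
}
```

In a JUnit 4 test this could guard the OCR assertions with `Assume.assumeTrue(TesseractCheck.canStart("tesseract", "-v"))` (the command name and flag are assumptions here), so the test is reported as skipped rather than failed on machines without Tesseract.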
[jira] [Updated] (TIKA-1422) org.apache.tika.parser.mail.RFC822ParserTest fails
[ https://issues.apache.org/jira/browse/TIKA-1422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Oleg Tikhonov updated TIKA-1422:

    Attachment: TIKA-1422.oleg.20141021.patch

Were missing imports of image parsers in the TesseractOCRParser unit test.

org.apache.tika.parser.mail.RFC822ParserTest fails
--

Key: TIKA-1422
URL: https://issues.apache.org/jira/browse/TIKA-1422
Project: Tika
Issue Type: Bug
Components: parser
Reporter: Chris A. Mattmann
Assignee: Chris A. Mattmann
Fix For: 1.7
Attachments: TIKA-1422.Mattmann.100114.patch.txt, TIKA-1422.Mattmann.100414.patch.txt, TIKA-1422.oleg.20141021.patch, TIKA-1422.palsulich.100414.patch, TIKA-1422.palsulich.100714.patch

I'm seeing test failures from:

{noformat}
Results :

Failed tests:
  testMultipart(org.apache.tika.parser.mail.RFC822ParserTest): (..)

Tests run: 538, Failures: 1, Errors: 0, Skipped: 1
{noformat}

CentOS6 VM image, running:

{noformat}
[mattmann@memex tika]$ java -version
java version 1.7.0_67
Java(TM) SE Runtime Environment (build 1.7.0_67-b01)
Java HotSpot(TM) 64-Bit Server VM (build 24.65-b04, mixed mode)
[mattmann@memex tika]$ mvn -version
Apache Maven 3.2.1 (ea8b2b07643dbb1b84b6d16e1f08391b666bc1e9; 2014-02-14T09:37:52-08:00)
Maven home: /usr/share/apache-maven
Java version: 1.7.0_65, vendor: Oracle Corporation
Java home: /data/home/mattmann/dist/jdk1.7.0_65/jre
Default locale: en_US, platform encoding: UTF-8
OS name: linux, version: 2.6.32-431.23.3.el6.centos.plus.x86_64, arch: amd64, family: unix
[mattmann@memex tika]$
{noformat}

Here are the surefire reports - no clue what's up here:

{noformat}
[mattmann@memex tika]$ more tika-parsers/target/surefire-reports/org.apache.tika.parser.mail.RFC822ParserTest.txt
---
Test set: org.apache.tika.parser.mail.RFC822ParserTest
---
Tests run: 8, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 0.699 sec FAILURE!
testMultipart(org.apache.tika.parser.mail.RFC822ParserTest)  Time elapsed: 0.152 sec  FAILURE!
org.mockito.exceptions.verification.TooManyActualInvocations:
xHTMLContentHandler.startElement(
    "http://www.w3.org/1999/xhtml",
    "div",
    "div",
    isA(org.xml.sax.Attributes)
);
Wanted 4 times but was 5
    at org.apache.tika.parser.mail.RFC822ParserTest.testMultipart(RFC822ParserTest.java:87)
Caused by: org.mockito.exceptions.cause.UndesiredInvocation:
Undesired invocation:
    at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
    at org.apache.tika.sax.SafeContentHandler.startElement(SafeContentHandler.java:264)
    at org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:254)
    at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
    at org.apache.tika.sax.xpath.MatchingContentHandler.startElement(MatchingContentHandler.java:60)
    at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
    at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
    at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
    at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
    at org.apache.tika.sax.SafeContentHandler.startElement(SafeContentHandler.java:264)
    at org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:254)
    at org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:284)
    at org.apache.tika.parser.ocr.TesseractOCRParser.extractOutput(TesseractOCRParser.java:243)
    at org.apache.tika.parser.ocr.TesseractOCRParser.parse(TesseractOCRParser.java:155)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:247)
    at org.apache.tika.parser.mail.MailContentHandler.body(MailContentHandler.java:102)
    at org.apache.james.mime4j.parser.MimeStreamParser.parse(MimeStreamParser.java:133)
    at org.apache.tika.parser.mail.RFC822Parser.parse(RFC822Parser.java:76)
    at org.apache.tika.parser.mail.RFC822ParserTest.testMultipart(RFC822ParserTest.java:84)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.junit.runners.model.FrameworkMethod$1
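The report above says the test verified four `startElement` calls for `div` but saw five, and the trace suggests the extra call arrived via `TesseractOCRParser.extractOutput`. The sketch below reproduces only the shape of that failure with the JDK's SAX classes; it does not use Mockito or Tika. `DivCounter` and its `emit` method are hypothetical stand-ins, and the "one extra div when OCR is active" behaviour is an assumption read off the stack trace.

```java
import org.xml.sax.Attributes;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration: counts XHTML <div> start events, the same
// event the failing verification was counting.
final class DivCounter extends DefaultHandler {
    static final String XHTML = "http://www.w3.org/1999/xhtml";
    private int divs;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (XHTML.equals(uri) && "div".equals(localName)) {
            divs++;
        }
    }

    int divs() {
        return divs;
    }

    // Stand-in for a parser run: one div per body part, plus one more
    // when an OCR step also writes its output into the same handler.
    static int emit(int bodyParts, boolean ocrActive) {
        DivCounter handler = new DivCounter();
        Attributes none = new AttributesImpl();
        int total = bodyParts + (ocrActive ? 1 : 0);
        for (int i = 0; i < total; i++) {
            handler.startElement(XHTML, "div", "div", none);
        }
        return handler.divs();
    }
}
```

With these assumptions, `DivCounter.emit(4, false)` yields 4 and `DivCounter.emit(4, true)` yields 5, which is why an exact-count assertion can pass on one machine and fail on another where an optional component is present.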
[jira] [Comment Edited] (TIKA-1422) org.apache.tika.parser.mail.RFC822ParserTest fails
[ https://issues.apache.org/jira/browse/TIKA-1422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14178018#comment-14178018 ]

Oleg Tikhonov edited comment on TIKA-1422 at 10/21/14 6:19 AM:
---

Imports of the image parsers were missing in the TesseractOCRParser unit test.

Env: Windows 7, PE, x64.
java version 1.7.0_11
Java(TM) SE Runtime Environment (build 1.7.0_11-b21)
Java HotSpot(TM) 64-Bit Server VM (build 23.6-b04, mixed mode)

Output after importing the image parsers:

[INFO]
[INFO] Building Apache Tika 1.7-SNAPSHOT
[INFO]
[INFO]
[INFO] --- maven-clean-plugin:2.4.1:clean (default-clean) @ tika ---
[INFO] Deleting E:\work_dir\tika\tika-site\target
[INFO]
[INFO] --- maven-remote-resources-plugin:1.2.1:process (default) @ tika ---
[INFO]
[INFO] --- maven-site-plugin:3.0:attach-descriptor (attach-descriptor) @ tika ---
[INFO]
[INFO] --- maven-install-plugin:2.3.1:install (default-install) @ tika ---
[INFO] Installing E:\work_dir\tika\tika-site\pom.xml to \.m2\repository\org\apache\tika\tika\1.7-SNAPSHOT\tika-1.7-SNAPSHOT.pom
[INFO]
[INFO] Reactor Summary:
[INFO]
[INFO] Apache Tika parent ................ SUCCESS [1.093s]
[INFO] Apache Tika core .................. SUCCESS [14.594s]
[INFO] Apache Tika parsers ............... SUCCESS [49.359s]
[INFO] Apache Tika XMP ................... SUCCESS [1.161s]
[INFO] Apache Tika serialization ......... SUCCESS [1.311s]
[INFO] Apache Tika application ........... SUCCESS [11.725s]
[INFO] Apache Tika OSGi bundle ........... SUCCESS [19.826s]
[INFO] Apache Tika server ................ SUCCESS [15.705s]
[INFO] Apache Tika translate ............. SUCCESS [1.476s]
[INFO] Apache Tika examples .............. SUCCESS [2.231s]
[INFO] Apache Tika Java-7 Components ..... SUCCESS [1.429s]
[INFO] Apache Tika ....................... SUCCESS [0.029s]
[INFO]
[INFO] BUILD SUCCESS
[INFO]
[INFO] Total time: 2:00.578s
[INFO] Finished at: Tue Oct 21 08:12:17 IST 2014
[INFO] Final Memory: 67M/1156M
[INFO]

was (Author: olegt):
Were missing imports of image parsers in the TesseractOCRParser unit test.
Re: 1.7 release?
Please take a try with newest patch.

Cheers, Oleg
Re: 1.7 release?
Thanks Oleg, will try tomorrow, my time (Los Angeles)!

Chris Mattmann, Ph.D.
Re: 1.7 release?
Sorry!!!

On Tue, Oct 21, 2014 at 9:37 AM, Mattmann, Chris A (3980) chris.a.mattm...@jpl.nasa.gov wrote:

Thanks Oleg, will try tomorrow for me Los angeles time!
[jira] [Commented] (TIKA-1423) Build a parser to extract data from GRIB formats
[ https://issues.apache.org/jira/browse/TIKA-1423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14178036#comment-14178036 ]

Lewis John McGibbney commented on TIKA-1423:

Hi [~vinegh], how is this coming on? Would you like a hand? It would be great to get this into Tika 1.7.

Build a parser to extract data from GRIB formats

Key: TIKA-1423
URL: https://issues.apache.org/jira/browse/TIKA-1423
Project: Tika
Issue Type: New Feature
Components: metadata, mime, parser
Affects Versions: 1.6
Reporter: Vineet Ghatge
Priority: Critical
Labels: features, newbie
Fix For: 1.7
Attachments: GribParser.java, NLDAS_FORA0125_H.A20130112.1200.002.grb, gdas1.forecmwf.2014062612.grib2

The Arctic dataset contains a MIME format called GRIB - General Regularly-distributed Information in Binary form (http://en.wikipedia.org/wiki/GRIB). GRIB is a well-known, concise data format used in meteorology to store historical and forecast weather data. There are 2 different types of the format, GRIB 0 and GRIB 2. The focus will be on GRIB 2, which is the most prevalent.

Each GRIB record intended for either transmission or storage contains a single parameter with values located at an array of grid points, or represented as a set of spectral coefficients, for a single level (or layer), encoded as a continuous bit stream. Logical divisions of the record are designated as sections, each of which provides control information and/or data. A GRIB record consists of six sections, two of which are optional:

(0) Indicator Section
(1) Product Definition Section (PDS)
(2) Grid Description Section (GDS) - optional
(3) Bit Map Section (BMS) - optional
(4) Binary Data Section (BDS)
(5) '7777' (ASCII Characters)

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
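As a small illustration of the Indicator Section listed in the issue description above, the sketch below checks whether a byte buffer starts a GRIB record and reads its edition number. It is not taken from the attached GribParser.java. The class name `GribIndicator` is hypothetical, and the layout it relies on (the ASCII characters "GRIB" in octets 1-4 and the edition number in octet 8) is stated here from general knowledge of the format rather than from this thread. It does not handle files that carry other bytes before the first "GRIB" marker.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper: inspects the first 8 octets of a GRIB record.
final class GribIndicator {
    private GribIndicator() {}

    // Returns the edition number (octet 8) if the buffer begins with
    // the ASCII marker "GRIB"; returns -1 otherwise.
    static int edition(byte[] header) {
        if (header == null || header.length < 8) {
            return -1;
        }
        String magic = new String(header, 0, 4, StandardCharsets.US_ASCII);
        if (!"GRIB".equals(magic)) {
            return -1;
        }
        return header[7] & 0xFF; // octet 8, read as an unsigned byte
    }
}
```

A MIME detector or parser could use a check of this kind to decide early whether a stream is worth handing to a full GRIB reader, and to tell editions apart before reading the remaining sections.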
Re: Tika 1.6 update in Maven Central?
Hi Chris,

On Mon, Oct 20, 2014 at 11:37 PM, dev-digest-h...@tika.apache.org wrote:

We do need to make a 1.7 release. I'd like to get TIKA-1422 fully working on Windows first. Any one of the other devs having things we should get into 1.7?

I would very much like to see https://issues.apache.org/jira/browse/TIKA-1423 get into 1.7. We are nearly there; we merely need to write unit tests, document methods, build this into a patch and submit it to Jira for review. I will work with Vineet to get this straightened out.

Thanks
Lewis
Re: Tika 1.6 update in Maven Central?
Thanks Lewis!

Chris Mattmann, Ph.D.
[jira] [Assigned] (TIKA-1423) Build a parser to extract data from GRIB formats
[ https://issues.apache.org/jira/browse/TIKA-1423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lewis John McGibbney reassigned TIKA-1423:
--

    Assignee: Lewis John McGibbney
[jira] [Commented] (TIKA-1423) Build a parser to extract data from GRIB formats
[ https://issues.apache.org/jira/browse/TIKA-1423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14178052#comment-14178052 ]

Vineet Ghatge commented on TIKA-1423:
-

Hey [~lewismc] I am working on it, I will post my updates this week. You can assign this to me.
[jira] [Commented] (TIKA-1422) org.apache.tika.parser.mail.RFC822ParserTest fails
[ https://issues.apache.org/jira/browse/TIKA-1422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14178186#comment-14178186 ]

Hong-Thai Nguyen commented on TIKA-1422:

Applied the latest fix in r1633325, with some formatting. Thanks.
tika-trunk-jdk1.7 - Build # 273 - Failure
The Apache Jenkins build system has built tika-trunk-jdk1.7 (build #273) Status: Failure Check console output at https://builds.apache.org/job/tika-trunk-jdk1.7/273/ to view the results.
[jira] [Commented] (TIKA-1422) org.apache.tika.parser.mail.RFC822ParserTest fails
[ https://issues.apache.org/jira/browse/TIKA-1422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14178197#comment-14178197 ]

Hudson commented on TIKA-1422:
--
FAILURE: Integrated in tika-trunk-jdk1.7 #273 (See [https://builds.apache.org/job/tika-trunk-jdk1.7/273/])
TIKA-1422 - Apply fix of [~olegt] in Windows (thaichat04: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1633325)
* /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/ocr/TesseractOCRParser.java
* /tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/mail/RFC822ParserTest.java

org.apache.tika.parser.mail.RFC822ParserTest fails
--
Key: TIKA-1422
URL: https://issues.apache.org/jira/browse/TIKA-1422
Project: Tika
Issue Type: Bug
Components: parser
Reporter: Chris A. Mattmann
Assignee: Chris A. Mattmann
Fix For: 1.7
Attachments: TIKA-1422.Mattmann.100114.patch.txt, TIKA-1422.Mattmann.100414.patch.txt, TIKA-1422.oleg.20141021.patch, TIKA-1422.palsulich.100414.patch, TIKA-1422.palsulich.100714.patch

I'm seeing test failures from:

{noformat}
Results :

Failed tests:
  testMultipart(org.apache.tika.parser.mail.RFC822ParserTest): (..)

Tests run: 538, Failures: 1, Errors: 0, Skipped: 1
{noformat}

CentOS6 VM image, running:

{noformat}
[mattmann@memex tika]$ java -version
java version "1.7.0_67"
Java(TM) SE Runtime Environment (build 1.7.0_67-b01)
Java HotSpot(TM) 64-Bit Server VM (build 24.65-b04, mixed mode)
[mattmann@memex tika]$ mvn -version
Apache Maven 3.2.1 (ea8b2b07643dbb1b84b6d16e1f08391b666bc1e9; 2014-02-14T09:37:52-08:00)
Maven home: /usr/share/apache-maven
Java version: 1.7.0_65, vendor: Oracle Corporation
Java home: /data/home/mattmann/dist/jdk1.7.0_65/jre
Default locale: en_US, platform encoding: UTF-8
OS name: "linux", version: "2.6.32-431.23.3.el6.centos.plus.x86_64", arch: "amd64", family: "unix"
[mattmann@memex tika]$
{noformat}

Here are the surefire reports - no clue what's up here:

{noformat}
[mattmann@memex tika]$ more tika-parsers/target/surefire-reports/org.apache.tika.parser.mail.RFC822ParserTest.txt
---
Test set: org.apache.tika.parser.mail.RFC822ParserTest
---
Tests run: 8, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 0.699 sec FAILURE!
testMultipart(org.apache.tika.parser.mail.RFC822ParserTest) Time elapsed: 0.152 sec FAILURE!
org.mockito.exceptions.verification.TooManyActualInvocations:
xHTMLContentHandler.startElement(
    "http://www.w3.org/1999/xhtml",
    "div",
    "div",
    isA(org.xml.sax.Attributes)
);
Wanted 4 times but was 5
    at org.apache.tika.parser.mail.RFC822ParserTest.testMultipart(RFC822ParserTest.java:87)
Caused by: org.mockito.exceptions.cause.UndesiredInvocation:
Undesired invocation:
    at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
    at org.apache.tika.sax.SafeContentHandler.startElement(SafeContentHandler.java:264)
    at org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:254)
    at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
    at org.apache.tika.sax.xpath.MatchingContentHandler.startElement(MatchingContentHandler.java:60)
    at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
    at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
    at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
    at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
    at org.apache.tika.sax.SafeContentHandler.startElement(SafeContentHandler.java:264)
    at org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:254)
    at org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:284)
    at org.apache.tika.parser.ocr.TesseractOCRParser.extractOutput(TesseractOCRParser.java:243)
    at org.apache.tika.parser.ocr.TesseractOCRParser.parse(TesseractOCRParser.java:155)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:247)
    at org.apache.tika.parser.mail.MailContentHandler.body(MailContentHandler.java:102)
    at org.apache.james.mime4j.parser.MimeStreamParser.parse(MimeStreamParser.java:133)
    at org.apache.tika.parser.mail.RFC822Parser.parse(RFC822Parser.java:76)
    at org.apache.tika.parser.mail.RFC822ParserTest.testMultipart(RFC822ParserTest.java:84
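
A note on the trace above: it reaches TesseractOCRParser.parse from inside RFC822ParserTest, so the number of startElement calls the mock sees may depend on whether a tesseract binary happens to be present on the build machine. The following is only a minimal sketch of how a test could probe for an external command before relying on it. TesseractCheck and isRunnable are hypothetical names chosen for illustration; they are not part of Tika, and this is not the fix that was committed for TIKA-1422.

```java
import java.io.IOException;
import java.io.InputStream;

final class TesseractCheck {
    /**
     * Returns true if the command could be started and ran to completion.
     * A missing executable makes ProcessBuilder.start() throw IOException.
     * (Hypothetical helper, not a Tika API.)
     */
    static boolean isRunnable(String... command) {
        try {
            Process process = new ProcessBuilder(command)
                    .redirectErrorStream(true) // merge stderr into stdout
                    .start();
            // Close the child's stdin so it cannot wait for input from us.
            process.getOutputStream().close();
            // Drain the output so the child cannot block on a full pipe.
            try (InputStream out = process.getInputStream()) {
                byte[] buffer = new byte[4096];
                while (out.read(buffer) != -1) {
                    // discard
                }
            }
            process.waitFor();
            return true;
        } catch (IOException e) {
            return false; // not installed, or not on the PATH
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

A test could call TesseractCheck.isRunnable("tesseract", "-v") and skip its OCR-dependent assertions when the result is false, for example through JUnit's Assume.assumeTrue. The check deliberately ignores the exit code; it only answers whether the executable can be launched at all.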
[jira] [Comment Edited] (TIKA-1422) org.apache.tika.parser.mail.RFC822ParserTest fails
[ https://issues.apache.org/jira/browse/TIKA-1422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14178186#comment-14178186 ]

Hong-Thai Nguyen edited comment on TIKA-1422 at 10/21/14 9:48 AM:
--
Applied latest fix on r1633325 r161 with some formatting. Thank

was (Author: thaichat04):
Applied latest fix on r1633325 with some formatting. Thank
[jira] [Commented] (TIKA-1422) org.apache.tika.parser.mail.RFC822ParserTest fails
[ https://issues.apache.org/jira/browse/TIKA-1422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14178226#comment-14178226 ]

Hudson commented on TIKA-1422:
--
SUCCESS: Integrated in tika-trunk-jdk1.6 #253 (See [https://builds.apache.org/job/tika-trunk-jdk1.6/253/])
TIKA-1422 - Fixing build minor refactory of naming test class (thaichat04: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=161)
* /tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/mail/RFC822ParserTest.java
* /tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/ocr/TesseractOCRParserTest.java
* /tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/ocr/TesseractOCRTest.java
TIKA-1422 - Apply fix of [~olegt] in Windows (thaichat04: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1633325)
* /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/ocr/TesseractOCRParser.java
* /tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/mail/RFC822ParserTest.java
[jira] [Commented] (TIKA-1422) org.apache.tika.parser.mail.RFC822ParserTest fails
[ https://issues.apache.org/jira/browse/TIKA-1422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14178257#comment-14178257 ]

Hudson commented on TIKA-1422:
--
SUCCESS: Integrated in tika-trunk-jdk1.7 #274 (See [https://builds.apache.org/job/tika-trunk-jdk1.7/274/])
TIKA-1422 - Fixing build minor refactory of naming test class (thaichat04: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=161)
* /tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/mail/RFC822ParserTest.java
* /tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/ocr/TesseractOCRParserTest.java
* /tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/ocr/TesseractOCRTest.java
[jira] [Created] (TIKA-1452) parser.parse() throws exception after which the procesed file is not getting renamed/moved/deleted
Abhishek created TIKA-1452:
--
Summary: parser.parse() throws exception after which the procesed file is not getting renamed/moved/deleted
Key: TIKA-1452
URL: https://issues.apache.org/jira/browse/TIKA-1452
Project: Tika
Issue Type: Bug
Components: detector, metadata, parser
Affects Versions: 1.6
Environment: jre6
Reporter: Abhishek

I am passing a file as an input stream to the parser.parse() method while using the Apache Tika library to convert a file to text. The method throws an exception (displayed below), but the input stream is closed successfully in the finally block. Then, while renaming the file, the File.renameTo method from java.io returns false. I am not able to rename/delete/move the file despite successfully closing the inputStream. I am afraid another instance of the file is created while parser.parse() processes it, and that it does not get closed when the exception is thrown. Is that possible? If so, what should I do to rename or delete the file?

The exception thrown while checking the content type is:

java.lang.NoClassDefFoundError: Could not initialize class com.adobe.xmp.impl.XMPMetaParser
    at com.adobe.xmp.XMPMetaFactory.parseFromBuffer(XMPMetaFactory.java:160)
    at com.adobe.xmp.XMPMetaFactory.parseFromBuffer(XMPMetaFactory.java:144)
    at com.drew.metadata.xmp.XmpReader.extract(XmpReader.java:106)
    at com.drew.imaging.jpeg.JpegMetadataReader.extractMetadataFromJpegSegmentReader(JpegMetadataReader.java:112)
    at com.drew.imaging.jpeg.JpegMetadataReader.readMetadata(JpegMetadataReader.java:71)
    at org.apache.tika.parser.image.ImageMetadataExtractor.parseJpeg(ImageMetadataExtractor.java:91)
    at org.apache.tika.parser.jpeg.JpegParser.parse(JpegParser.java:56)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:121)

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1311) Centralize JSON handling of Metadata
[ https://issues.apache.org/jira/browse/TIKA-1311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14178360#comment-14178360 ]

Hudson commented on TIKA-1311:
--
SUCCESS: Integrated in tika-trunk-jdk1.7 #275 (See [https://builds.apache.org/job/tika-trunk-jdk1.7/275/])
clean up from TIKA-1311 (tallison: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1633357)
* /tika/trunk/tika-app/src/main/java/org/apache/tika/io

Centralize JSON handling of Metadata
--
Key: TIKA-1311
URL: https://issues.apache.org/jira/browse/TIKA-1311
Project: Tika
Issue Type: Task
Reporter: Tim Allison
Priority: Minor
Fix For: 1.6
Attachments: TIKA-1311.patch

When json was initially added to TIKA CLI (TIKA-213), there was a recommendation to centralize JSON handling of Metadata, potentially putting it in core. On a recent bug fix (TIKA-1291), the same recommendation was repeated especially noting that we now handle JSON/Metadata differently in CLI and server. Let's centralize JSON handling in core and use GSON. We should add a serializer and a deserializer so that users don't have to reinvent that wheel.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1302) Let's run Tika against a large batch of docs nightly
[ https://issues.apache.org/jira/browse/TIKA-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14178361#comment-14178361 ]

Andrew Jackson commented on TIKA-1302:
--
Okay, so the c.300,000 exceptions are here: https://www.dropbox.com/s/ka19fguaxflp725/parse_errors.csv.gz?dl=0 - let me know if you'd like it placed elsewhere (it's 14MB of compressed CSV).

This conversation has helped me spot a gap in our code. We currently do a Tika.detect() before we do a Tika.parse(), and only do the latter if the former succeeded. Sadly, the version of the code that I used to generate this data did not record the Tika exception for the .detect() step, only the .parse() step. This will explain why there are no hung-thread events in this result set - the interrupted .detect() was not recorded properly. We'll be re-running this scan soonish, so I'll make sure the next version records all the exceptions. IIRC, from looking at the MIME types, the permanent hangs were mostly ZIPs, Office documents, and maybe some PDFs.

Note that the CSV includes the Content-Type from the .detect() step, and this should indicate which module was run on the resource (i.e. whatever the Tika 1.5 mapping was for that MIME type). I don't think we changed the parse configuration significantly, so it seems HTML and XHTML and XML should all have gone through the HtmlParser (I'm not 100% sure about this, and will try to check).

I'm not sure it's worth giving you all the SAX exceptions, as there are a lot of repeats of the same problems. I think a random sample of about 50,000 should be plenty. Does that sound okay to you?

Let's run Tika against a large batch of docs nightly
--
Key: TIKA-1302
URL: https://issues.apache.org/jira/browse/TIKA-1302
Project: Tika
Issue Type: Improvement
Components: cli, general, server
Reporter: Tim Allison

Many thanks to [~lewismc] for TIKA-1301! Once we get nightly builds up and running again, it might be fun to run Tika regularly against a large set of docs and report metrics. One excellent candidate corpus is govdocs1: http://digitalcorpora.org/corpora/files. Any other candidate corpora? [~willp-bl], have anything handy you'd like to contribute? [http://www.openplanetsfoundation.org/blogs/2014-03-21-tika-ride-characterising-web-content-nanite] ;)

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
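
On the gap described in the comment above (exceptions from the .detect() step were not recorded, so hung threads never showed up in the data), one way to close it is to run every stage through the same wrapper that logs an outcome whether the stage succeeds, throws, or times out. This is a rough sketch under assumptions, not the code used for the scan: StageRecorder, Outcome and the stage names are hypothetical, and the Callable arguments stand in for the real Tika.detect() and Tika.parse() calls.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class StageRecorder {
    /** One row of the log: which stage ran, and what went wrong (null if nothing did). */
    static final class Outcome {
        final String stage;
        final String error;

        Outcome(String stage, String error) {
            this.stage = stage;
            this.error = error;
        }

        boolean ok() {
            return error == null;
        }
    }

    private final ExecutorService pool = Executors.newCachedThreadPool();
    private final List<Outcome> log = new ArrayList<Outcome>();

    /** Runs one stage with a time limit and always records its outcome. Returns null on failure. */
    <T> T run(String stage, Callable<T> work, long timeoutMillis) {
        Future<T> future = pool.submit(work);
        try {
            T result = future.get(timeoutMillis, TimeUnit.MILLISECONDS);
            log.add(new Outcome(stage, null));
            return result;
        } catch (TimeoutException e) {
            future.cancel(true); // interrupt the hung worker
            log.add(new Outcome(stage, "timeout after " + timeoutMillis + " ms"));
        } catch (ExecutionException e) {
            log.add(new Outcome(stage, String.valueOf(e.getCause())));
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            log.add(new Outcome(stage, "interrupted"));
        }
        return null;
    }

    List<Outcome> log() {
        return log;
    }

    void shutdown() {
        pool.shutdownNow();
    }
}
```

With this shape, a caller would run the "detect" stage first and only run the "parse" stage when detect returned a non-null type, yet a failed or hung detect still leaves a row in the log. Note that cancelling a Future only interrupts the worker thread; code that ignores interrupts would keep running, so a real harness may need a separate process to contain permanent hangs.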
[jira] [Commented] (TIKA-1452) parser.parse() throws exception after which the procesed file is not getting renamed/moved/deleted
[ https://issues.apache.org/jira/browse/TIKA-1452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14178362#comment-14178362 ]

Nick Burch commented on TIKA-1452:
--
Can you provide a junit test case that shows how to reproduce the issue? Also, have you thought about fixing your underlying missing dependency issue?
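
On the rename question in this report: File.renameTo only returns false and gives no reason. A minimal sketch of an alternative follows, assuming Java 7 or later (the report lists jre6, where java.nio.file is not available). ProcessThenMove and Step are hypothetical names, and the Step argument is a stand-in for the real parser.parse() call. The sketch does not explain why the rename fails in the reporter's environment; it only makes the failure reason visible and makes sure the stream is closed even when an Error is thrown.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

final class ProcessThenMove {
    /** Stand-in for the real parse call; hypothetical, not a Tika API. */
    interface Step {
        void apply(InputStream in) throws Exception;
    }

    /**
     * Runs the step on the file, always closes the stream, then moves the file.
     * Returns the Throwable raised by the step, or null if it succeeded.
     */
    static Throwable processThenMove(Path source, Path target, Step step) throws IOException {
        Throwable failure = null;
        // try-with-resources closes the stream before the catch block runs.
        try (InputStream in = Files.newInputStream(source)) {
            step.apply(in);
        } catch (Throwable t) {
            // NoClassDefFoundError is an Error, so "catch (Exception e)" would not see it.
            failure = t;
        }
        // Unlike File.renameTo, Files.move throws an IOException describing the failure.
        Files.move(source, target, StandardCopyOption.REPLACE_EXISTING);
        return failure;
    }
}
```

If Files.move throws here, the exception message (for example a FileSystemException naming the file) is the detail that a false return value from renameTo hides, and it would be useful to include in a reproducing test case.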
[jira] [Comment Edited] (TIKA-1302) Let's run Tika against a large batch of docs nightly
[ https://issues.apache.org/jira/browse/TIKA-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14178361#comment-14178361 ]

Andrew Jackson edited comment on TIKA-1302 at 10/21/14 12:59 PM:
--
Okay, so the c.300,000 exceptions are here: https://www.dropbox.com/s/ka19fguaxflp725/parse_errors.csv.gz?dl=0 - let me know if you'd like it placed elsewhere (it's 14MB of compressed CSV).

This conversation has helped me spot a gap in our code. We currently do a Tika.detect() before we do a Tika.parse(), and only do the latter if the former succeeded. Sadly, the version of the code that I used to generate this data did not record the Tika exception for the .detect() step, only the .parse() step. This will explain why there are no hung-thread events in this result set - the interrupted .detect() was not recorded properly. We'll be re-running this scan soonish, so I'll make sure the next version records all the exceptions. IIRC, from looking at the MIME types, the permanent hangs were mostly ZIPs, Office documents, and maybe some PDFs.

Note that the CSV includes the Content-Type from the .detect() step, and this should indicate which module was run on the resource (i.e. whatever the Tika 1.5 mapping was for that MIME type). I don't think we changed the parse configuration significantly, so it seems HTML and XHTML and XML should all have gone through the HtmlParser (I'm not 100% sure about this, and will try to check).

I'm not sure it's worth giving you all the SAX exceptions, as there are a lot of repeats of the same problems. I think a random sample of about 50,000 should be plenty. Does that sound okay to you?

EDIT: Oh, and I meant to say, I'm glad to hear about [~gostep] and [~talli...@apache.org]'s efforts to run this on GovDocs, and would be interested in comparing results. We already publish format profile data about web archives, and would love to have more data to refer to.
Re: svn commit: r1633325 - in /tika/trunk/tika-parsers/src: main/java/org/apache/tika/parser/ocr/TesseractOCRParser.java test/java/org/apache/tika/parser/mail/RFC822ParserTest.java
Hi Hong-Thai,

These commits look strange to me - it looks like it subtracts the whole files (and the unit test removed the test file, renamed it, and then added what largely looks like the same file, back?)

Any idea what's up?

Cheers,
Chris

++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW: http://sunset.usc.edu/~mattmann/
++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++

-Original Message-
From: thaicha...@apache.org thaicha...@apache.org
Reply-To: dev@tika.apache.org dev@tika.apache.org
Date: Tuesday, October 21, 2014 at 2:32 AM
To: comm...@tika.apache.org comm...@tika.apache.org
Subject: svn commit: r1633325 - in /tika/trunk/tika-parsers/src: main/java/org/apache/tika/parser/ocr/TesseractOCRParser.java test/java/org/apache/tika/parser/mail/RFC822ParserTest.java

Author: thaichat04
Date: Tue Oct 21 09:32:06 2014
New Revision: 1633325

URL: http://svn.apache.org/r1633325
Log:
TIKA-1422 - Apply fix of [~olegt] in Windows

Modified:
tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/ocr/TesseractOCRParser.java
tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/mail/RFC822ParserTest.java

Modified: tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/ocr/TesseractOCRParser.java
URL: http://svn.apache.org/viewvc/tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/ocr/TesseractOCRParser.java?rev=1633325&r1=1633324&r2=1633325&view=diff
==
--- tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/ocr/TesseractOCRParser.java (original)
+++ tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/ocr/TesseractOCRParser.java Tue Oct 21 09:32:06 2014
@@ -26,11 +26,11 @@ import java.io.IOException;
 import java.io.InputStream;
 import java.io.InputStreamReader;
 import java.io.Reader;
+import java.util.ArrayList;
 import java.util.HashSet;
+import java.util.List;
 import java.util.Map;
 import java.util.Set;
-import java.util.List;
-import java.util.ArrayList;
 import java.util.concurrent.Callable;
 import java.util.concurrent.ExecutionException;
 import java.util.concurrent.FutureTask;
@@ -45,20 +45,23 @@ import org.apache.tika.io.TemporaryResou
 import org.apache.tika.io.TikaInputStream;
 import org.apache.tika.metadata.Metadata;
 import org.apache.tika.mime.MediaType;
-import org.apache.tika.parser.Parser;
 import org.apache.tika.parser.AbstractParser;
 import org.apache.tika.parser.ParseContext;
+import org.apache.tika.parser.Parser;
 import org.apache.tika.parser.external.ExternalParser;
+import org.apache.tika.parser.image.ImageParser;
+import org.apache.tika.parser.image.PSDParser;
+import org.apache.tika.parser.image.TiffParser;
+import org.apache.tika.parser.jpeg.JpegParser;
 import org.apache.tika.sax.XHTMLContentHandler;
 import org.xml.sax.ContentHandler;
 import org.xml.sax.SAXException;

 /**
- * TesseractOCRParser powered by tesseract-ocr engine.
- * To enable this parser, create a {@link TesseractOCRConfig}
- * object and pass it through a ParseContext.
- * Tesseract-ocr must be installed and on system path or
- * the path to its root folder must be provided:
+ * TesseractOCRParser powered by tesseract-ocr engine. To enable this parser,
+ * create a {@link TesseractOCRConfig} object and pass it through a
+ * ParseContext. Tesseract-ocr must be installed and on system path or the path
+ * to its root folder must be provided:
  * <p>
  * TesseractOCRConfig config = new TesseractOCRConfig();<br>
  * //Needed if tesseract is not on system path<br>
@@ -69,226 +72,231 @@ import org.xml.sax.SAXException;
  *
  */
 public class TesseractOCRParser extends AbstractParser {
-
-    private static final long serialVersionUID = 1L;
-
-    private static final Set<MediaType> SUPPORTED_TYPES = getTypes();
-
-    private static Set<MediaType> getTypes() {
-        HashSet<MediaType> supportedTypes = new HashSet<MediaType>();
-
-        supportedTypes.add(MediaType.image("png"));
-        supportedTypes.add(MediaType.image("jpeg"));
-        supportedTypes.add(MediaType.image("tiff"));
-        supportedTypes.add(MediaType.image("x-ms-bmp"));
-        supportedTypes.add(MediaType.image("gif"));
-
-        return supportedTypes;
-    }
-
-    @Override
-    public Set<MediaType> getSupportedTypes(ParseContext arg0) {
-        return SUPPORTED_TYPES;
-    }
-
-private void setEnv(TesseractOCRConfig config, ProcessBuilder pb) {
-
Re: Tika 1.6 update in Maven Central?
Thanks Chris. Any ideas when that is likely to happen? I'm trying to determine whether I can wait for a 1.7 release. If not, I think my only option to avoid the uncontrolled build up of tmp files (when processing .7z archives) would be to go back to 1.5. Regards, Aeham
parser.parse() throws exception after which the processed file is not getting renamed/moved.
I am passing a file as an input stream to the parser.parse() method while using the Apache Tika library to convert the file to text. The method throws an exception (shown below), but the input stream is closed successfully in the finally block. Afterwards, when I try to rename the file, File.renameTo from java.io returns false. I cannot rename, delete, or move the file even though the input stream was closed successfully.

I suspect that another handle to the file is opened while parser.parse() processes it, and that this handle is not closed by the time the exception is thrown. Is that possible? If so, what should I do to rename or delete the file?

The exception thrown while checking the content type is:

java.lang.NoClassDefFoundError: Could not initialize class com.adobe.xmp.impl.XMPMetaParser
    at com.adobe.xmp.XMPMetaFactory.parseFromBuffer(XMPMetaFactory.java:160)
    at com.adobe.xmp.XMPMetaFactory.parseFromBuffer(XMPMetaFactory.java:144)
    at com.drew.metadata.xmp.XmpReader.extract(XmpReader.java:106)
    at com.drew.imaging.jpeg.JpegMetadataReader.extractMetadataFromJpegSegmentReader(JpegMetadataReader.java:112)
    at com.drew.imaging.jpeg.JpegMetadataReader.readMetadata(JpegMetadataReader.java:71)
    at org.apache.tika.parser.image.ImageMetadataExtractor.parseJpeg(ImageMetadataExtractor.java:91)
    at org.apache.tika.parser.jpeg.JpegParser.parse(JpegParser.java:56)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:121)

--
View this message in context: http://lucene.472066.n3.nabble.com/parser-parse-throws-exception-after-which-the-procesed-file-is-not-getting-renamed-moved-tp4165153.html
Sent from the Apache Tika - Development mailing list archive at Nabble.com.
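One detail in the report above is worth illustrating: NoClassDefFoundError is an Error, not an Exception, so a catch (Exception e) block would not see it. The following is only a minimal sketch, not Tika code. The Extractor interface and the processAndMove method are hypothetical stand-ins for the parser.parse() call, and the sketch does not establish whether Tika itself keeps a second handle open. It shows a try-with-resources block that closes the stream on every path, followed by Files.move, which throws an IOException naming the cause instead of returning false as File.renameTo does.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class SafeMoveAfterParse {

    /** Hypothetical stand-in for a call such as parser.parse(stream, ...). */
    public interface Extractor {
        void extract(InputStream in) throws Exception;
    }

    /**
     * Runs the extractor, always closes the stream, then moves the file.
     * Returns true if extraction succeeded; the move is attempted either way.
     */
    public static boolean processAndMove(Path source, Path target, Extractor extractor)
            throws IOException {
        boolean extracted = true;
        try (InputStream in = Files.newInputStream(source)) {
            extractor.extract(in);
        } catch (Throwable t) {
            // NoClassDefFoundError is an Error, so catch (Exception e) would miss it.
            // Catching Throwable is broad; a real application may want to rethrow
            // some errors (for example OutOfMemoryError) after cleaning up.
            extracted = false;
            System.err.println("Extraction failed: " + t);
        }
        // The stream opened above is closed at this point. Files.move reports
        // the reason for a failure through an IOException.
        Files.move(source, target, StandardCopyOption.REPLACE_EXISTING);
        return extracted;
    }

    public static void main(String[] args) throws IOException {
        Path source = Files.createTempFile("tika-demo", ".jpg");
        Path target = source.resolveSibling(source.getFileName() + ".done");
        boolean extracted = processAndMove(source, target, in -> {
            throw new NoClassDefFoundError("simulated failure");
        });
        System.out.println("extracted=" + extracted + ", moved=" + Files.exists(target));
        Files.deleteIfExists(target);
    }
}
```

If the move still fails with this structure, the exception message from Files.move (for example, a message that another process is using the file) is usually a better starting point than the bare false from File.renameTo.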
Re: svn commit: r1633325 - in /tika/trunk/tika-parsers/src: main/java/org/apache/tika/parser/ocr/TesseractOCRParser.java test/java/org/apache/tika/parser/mail/RFC822ParserTest.java
Hi Chris,

Yes, I made a mistake in this commit: I missed a file rename and broke the build. The next commit corrects it:

Revision: 161
Author: thaichat04
Date: Tuesday, 21 October 2014 11:47:54
Message: TIKA-1422 - Fixing build minor refactory of naming test class
Modified : /tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/mail/RFC822ParserTest.java
Added : /tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/ocr/TesseractOCRParserTest.java
Deleted : /tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/ocr/TesseractOCRTest.java

Please 'pull' the latest again and tell me if it is OK. Sorry.

On Tue, Oct 21, 2014 at 3:49 PM, Mattmann, Chris A (3980) chris.a.mattm...@jpl.nasa.gov wrote:

Hi Hong-Thai,

These commits look strange to me - it looks like it subtracts the whole files (and the unit test removed the test file, renamed it, and then added what largely looks like the same file, back?) Any idea what's up?

Cheers,
Chris

++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW: http://sunset.usc.edu/~mattmann/
++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++

-Original Message-
From: thaicha...@apache.org thaicha...@apache.org
Reply-To: dev@tika.apache.org dev@tika.apache.org
Date: Tuesday, October 21, 2014 at 2:32 AM
To: comm...@tika.apache.org comm...@tika.apache.org
Subject: svn commit: r1633325 - in /tika/trunk/tika-parsers/src: main/java/org/apache/tika/parser/ocr/TesseractOCRParser.java test/java/org/apache/tika/parser/mail/RFC822ParserTest.java

Author: thaichat04
Date: Tue Oct 21 09:32:06 2014
New Revision: 1633325
URL: http://svn.apache.org/r1633325
Log: TIKA-1422 - Apply fix of [~olegt] in Windows

Modified:
tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/ocr/TesseractOCRParser.java
tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/mail/RFC822ParserTest.java

Modified: tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/ocr/TesseractOCRParser.java
URL: http://svn.apache.org/viewvc/tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/ocr/TesseractOCRParser.java?rev=1633325&r1=1633324&r2=1633325&view=diff
==
--- tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/ocr/TesseractOCRParser.java (original)
+++ tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/ocr/TesseractOCRParser.java Tue Oct 21 09:32:06 2014
@@ -26,11 +26,11 @@ import java.io.IOException;
 import java.io.InputStream;
 import java.io.InputStreamReader;
 import java.io.Reader;
+import java.util.ArrayList;
 import java.util.HashSet;
+import java.util.List;
 import java.util.Map;
 import java.util.Set;
-import java.util.List;
-import java.util.ArrayList;
 import java.util.concurrent.Callable;
 import java.util.concurrent.ExecutionException;
 import java.util.concurrent.FutureTask;
@@ -45,20 +45,23 @@ import org.apache.tika.io.TemporaryResou
 import org.apache.tika.io.TikaInputStream;
 import org.apache.tika.metadata.Metadata;
 import org.apache.tika.mime.MediaType;
-import org.apache.tika.parser.Parser;
 import org.apache.tika.parser.AbstractParser;
 import org.apache.tika.parser.ParseContext;
+import org.apache.tika.parser.Parser;
 import org.apache.tika.parser.external.ExternalParser;
+import org.apache.tika.parser.image.ImageParser;
+import org.apache.tika.parser.image.PSDParser;
+import org.apache.tika.parser.image.TiffParser;
+import org.apache.tika.parser.jpeg.JpegParser;
 import org.apache.tika.sax.XHTMLContentHandler;
 import org.xml.sax.ContentHandler;
 import org.xml.sax.SAXException;

 /**
- * TesseractOCRParser powered by tesseract-ocr engine.
- * To enable this parser, create a {@link TesseractOCRConfig}
- * object and pass it through a ParseContext.
- * Tesseract-ocr must be installed and on system path or
- * the path to its root folder must be provided:
+ * TesseractOCRParser powered by tesseract-ocr engine. To enable this parser,
+ * create a {@link TesseractOCRConfig} object and pass it through a
+ * ParseContext. Tesseract-ocr must be installed and on system path or the path
+ * to its root folder must be provided:
  * <p>
  * TesseractOCRConfig config = new TesseractOCRConfig();<br>
  * //Needed if tesseract is not on system path<br>
@@ -69,226 +72,231 @@ import org.xml.sax.SAXException;
  *
  */
 public class TesseractOCRParser extends AbstractParser {
-
-    private static final long
Re: svn commit: r1633325 - in /tika/trunk/tika-parsers/src: main/java/org/apache/tika/parser/ocr/TesseractOCRParser.java test/java/org/apache/tika/parser/mail/RFC822ParserTest.java
No worries Hong-Thai! Will update and test, thanks!

++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW: http://sunset.usc.edu/~mattmann/
++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++

-Original Message-
From: Hong-Thai Nguyen thaicha...@gmail.com
Reply-To: dev@tika.apache.org dev@tika.apache.org
Date: Tuesday, October 21, 2014 at 6:57 AM
To: dev@tika.apache.org dev@tika.apache.org
Subject: Re: svn commit: r1633325 - in /tika/trunk/tika-parsers/src: main/java/org/apache/tika/parser/ocr/TesseractOCRParser.java test/java/org/apache/tika/parser/mail/RFC822ParserTest.java

Hi Chris,

Yes, I made a mistake in this commit: I missed a file rename and broke the build. The next commit corrects it:

Revision: 161
Author: thaichat04
Date: Tuesday, 21 October 2014 11:47:54
Message: TIKA-1422 - Fixing build minor refactory of naming test class
Modified : /tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/mail/RFC822ParserTest.java
Added : /tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/ocr/TesseractOCRParserTest.java
Deleted : /tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/ocr/TesseractOCRTest.java

Please 'pull' the latest again and tell me if it is OK. Sorry.

On Tue, Oct 21, 2014 at 3:49 PM, Mattmann, Chris A (3980) chris.a.mattm...@jpl.nasa.gov wrote:

Hi Hong-Thai,

These commits look strange to me - it looks like it subtracts the whole files (and the unit test removed the test file, renamed it, and then added what largely looks like the same file, back?) Any idea what's up?

Cheers,
Chris

++
Chris Mattmann, Ph.D.
[jira] [Commented] (TIKA-1422) org.apache.tika.parser.mail.RFC822ParserTest fails
[ https://issues.apache.org/jira/browse/TIKA-1422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14178866#comment-14178866 ]

Tyler Palsulich commented on TIKA-1422:
---

{code}
Results :

Failed tests:
  testMultipart(org.apache.tika.parser.mail.RFC822ParserTest): (..)

Tests run: 546, Failures: 1, Errors: 0, Skipped: 4
{code}

{code}
Wanted 5 times but was 4
    at org.apache.tika.parser.mail.RFC822ParserTest.testMultipart(RFC822ParserTest.java:93)
Caused by: org.mockito.exceptions.cause.TooLittleInvocations:
{code}

Still getting a failing test with Tesseract 3.02.02 installed on Mac. Will look into this more tomorrow. But, thank you, [~o...@apache.org] and [~thaichat04]!

org.apache.tika.parser.mail.RFC822ParserTest fails
--

Key: TIKA-1422
URL: https://issues.apache.org/jira/browse/TIKA-1422
Project: Tika
Issue Type: Bug
Components: parser
Reporter: Chris A. Mattmann
Assignee: Chris A. Mattmann
Fix For: 1.7
Attachments: TIKA-1422.Mattmann.100114.patch.txt, TIKA-1422.Mattmann.100414.patch.txt, TIKA-1422.oleg.20141021.patch, TIKA-1422.palsulich.100414.patch, TIKA-1422.palsulich.100714.patch

I'm seeing test failures from:

{noformat}
Results :

Failed tests:
  testMultipart(org.apache.tika.parser.mail.RFC822ParserTest): (..)

Tests run: 538, Failures: 1, Errors: 0, Skipped: 1
{noformat}

CentOS6 VM image, running:

{noformat}
[mattmann@memex tika]$ java -version
java version "1.7.0_67"
Java(TM) SE Runtime Environment (build 1.7.0_67-b01)
Java HotSpot(TM) 64-Bit Server VM (build 24.65-b04, mixed mode)
[mattmann@memex tika]$ mvn -version
Apache Maven 3.2.1 (ea8b2b07643dbb1b84b6d16e1f08391b666bc1e9; 2014-02-14T09:37:52-08:00)
Maven home: /usr/share/apache-maven
Java version: 1.7.0_65, vendor: Oracle Corporation
Java home: /data/home/mattmann/dist/jdk1.7.0_65/jre
Default locale: en_US, platform encoding: UTF-8
OS name: "linux", version: "2.6.32-431.23.3.el6.centos.plus.x86_64", arch: "amd64", family: "unix"
[mattmann@memex tika]$
{noformat}

Here are the surefire reports - no clue what's up here:

{noformat}
[mattmann@memex tika]$ more tika-parsers/target/surefire-reports/org.apache.tika.parser.mail.RFC822ParserTest.txt
---
Test set: org.apache.tika.parser.mail.RFC822ParserTest
---
Tests run: 8, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 0.699 sec FAILURE!
testMultipart(org.apache.tika.parser.mail.RFC822ParserTest) Time elapsed: 0.152 sec FAILURE!
org.mockito.exceptions.verification.TooManyActualInvocations:
xHTMLContentHandler.startElement(
    "http://www.w3.org/1999/xhtml",
    "div",
    "div",
    isA(org.xml.sax.Attributes)
);
Wanted 4 times but was 5
    at org.apache.tika.parser.mail.RFC822ParserTest.testMultipart(RFC822ParserTest.java:87)
Caused by: org.mockito.exceptions.cause.UndesiredInvocation:
Undesired invocation:
    at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
    at org.apache.tika.sax.SafeContentHandler.startElement(SafeContentHandler.java:264)
    at org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:254)
    at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
    at org.apache.tika.sax.xpath.MatchingContentHandler.startElement(MatchingContentHandler.java:60)
    at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
    at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
    at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
    at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
    at org.apache.tika.sax.SafeContentHandler.startElement(SafeContentHandler.java:264)
    at org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:254)
    at org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:284)
    at org.apache.tika.parser.ocr.TesseractOCRParser.extractOutput(TesseractOCRParser.java:243)
    at org.apache.tika.parser.ocr.TesseractOCRParser.parse(TesseractOCRParser.java:155)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:247)
    at org.apache.tika.parser.mail.MailContentHandler.body(MailContentHandler.java:102)
    at org.apache.james.mime4j.parser.MimeStreamParser.parse(MimeStreamParser.java:133)
    at org.apache.tika.parser.mail.RFC822Parser.parse(RFC822Parser.java:76
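
The two reports in this issue disagree on the expected number of div start events (4 versus 5), and the extra invocation in the trace comes from TesseractOCRParser.extractOutput. That suggests, although the thread does not confirm it, that the count depends on whether the external tesseract binary actually runs on the build machine. The following is a minimal sketch of how a test could probe for an external tool before choosing its expectations or skipping. The class and method names are hypothetical, and the "-v" argument is an assumption about the installed tesseract version.

```java
import java.io.IOException;
import java.util.concurrent.TimeUnit;

public class ExternalToolProbe {

    /**
     * Returns true if the command could be started and exited within ten seconds.
     * The exit code is ignored; only "can this program be launched" is checked.
     * Assumes the command prints little output, since the output is not drained.
     */
    public static boolean canRun(String... command) {
        Process process = null;
        try {
            process = new ProcessBuilder(command).redirectErrorStream(true).start();
            return process.waitFor(10, TimeUnit.SECONDS);
        } catch (IOException e) {
            // Typically thrown when the executable is not on the path.
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            if (process != null) {
                process.destroy();
            }
        }
    }

    public static void main(String[] args) {
        // "tesseract -v" is assumed to print version information and exit.
        System.out.println("tesseract runnable: " + canRun("tesseract", "-v"));
    }
}
```

With JUnit 4, a test could call org.junit.Assume.assumeTrue(ExternalToolProbe.canRun("tesseract", "-v")) at its start so that it is reported as skipped, rather than failed, on machines without the binary.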
[jira] [Commented] (TIKA-1446) CHM parser : wrong decompression of aligned blocks
[ https://issues.apache.org/jira/browse/TIKA-1446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14178889#comment-14178889 ]

Bin Hawking commented on TIKA-1446:
---

The attachment above is my fix. It contains the old and new versions of the code in tika-1.6\tika-parsers\src\main\java\org\apache\tika\parser\chm\. Please use diff to see my changes. This fix addresses TIKA-1430, TIKA-1446, TIKA-1447, and TIKA-1448.

NOTE: My fix is not well tested and may be incomplete. Also, because I am adding new features to the CHM parser for my own application, including parsing HHK and HHC files for more metadata, my revisions contain some unrelated changes that do not apply to the original Tika project. Sorry for the inconvenience.

CHM parser : wrong decompression of aligned blocks
--

Key: TIKA-1446
URL: https://issues.apache.org/jira/browse/TIKA-1446
Project: Tika
Issue Type: Bug
Affects Versions: 1.7
Reporter: Bin Hawking
Priority: Critical
Attachments: chm.zip

If an embedded file contains aligned blocks, the parser outputs garbled or empty text for that file. I have fixed it myself by correcting decompressAlignedBlock() and its preparation methods. The bug is mostly due to misuse of the main tree, aligned tree, and length tree. Some of the trees are also built incorrectly.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-1446) CHM parser : wrong decompression of aligned blocks
[ https://issues.apache.org/jira/browse/TIKA-1446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bin Hawking updated TIKA-1446:
--

Attachment: chm.zip

CHM parser : wrong decompression of aligned blocks
--

Key: TIKA-1446
URL: https://issues.apache.org/jira/browse/TIKA-1446
Project: Tika
Issue Type: Bug
Affects Versions: 1.7
Reporter: Bin Hawking
Priority: Critical
Attachments: chm.zip

If an embedded file contains aligned blocks, the parser outputs garbled or empty text for that file. I have fixed it myself by correcting decompressAlignedBlock() and its preparation methods. The bug is mostly due to misuse of the main tree, aligned tree, and length tree. Some of the trees are also built incorrectly.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-1446) CHM parser : wrong decompression of aligned blocks
[ https://issues.apache.org/jira/browse/TIKA-1446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bin Hawking updated TIKA-1446:
--

Attachment: (was: chm.zip)

CHM parser : wrong decompression of aligned blocks
--

Key: TIKA-1446
URL: https://issues.apache.org/jira/browse/TIKA-1446
Project: Tika
Issue Type: Bug
Affects Versions: 1.7
Reporter: Bin Hawking
Priority: Critical

If an embedded file contains aligned blocks, the parser outputs garbled or empty text for that file. I have fixed it myself by correcting decompressAlignedBlock() and its preparation methods. The bug is mostly due to misuse of the main tree, aligned tree, and length tree. Some of the trees are also built incorrectly.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
import (re)ordering?
All, I have Intellij set to order imports by javax, java, then other. I think this is the most common pattern in Tika. Is it ok if I make these (meaningless/formatting) changes when I commit other changes? Thank you. Best, Tim
[jira] [Created] (TIKA-1453) fails to parse RFC3464 documents
Rob Tulloh created TIKA-1453:

Summary: fails to parse RFC3464 documents
Key: TIKA-1453
URL: https://issues.apache.org/jira/browse/TIKA-1453
Project: Tika
Issue Type: Bug
Affects Versions: 1.5
Reporter: Rob Tulloh
Priority: Minor

Tika 1.5 does not support content-type message/delivery-status.

http://tools.ietf.org/html/rfc3464

Notes from Oracle indicate that javamail now supports this RFC.

curl -H Content-Type:message/delivery-status -T /tmp/xxx http://localhost:9998/tika

Produces

2014-10-21_21:06:40.23890 Oct 21, 2014 4:06:40 PM org.apache.tika.server.TikaResource logRequest
2014-10-21_21:06:40.23894 INFO: tika (message/delivery-status)
2014-10-21_21:06:40.23994 Oct 21, 2014 4:06:40 PM org.apache.tika.server.TikaResource$3 write
2014-10-21_21:06:40.23995 WARNING: tika: Text extraction failed
2014-10-21_21:06:40.23996 org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.server.TikaResource$1@dae96f2
2014-10-21_21:06:40.23997 at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
2014-10-21_21:06:40.23997 at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
2014-10-21_21:06:40.23998 at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:136)
2014-10-21_21:06:40.23998 at org.apache.tika.server.TikaResource$3.write(TikaResource.java:196)
2014-10-21_21:06:40.23999 at org.apache.cxf.jaxrs.provider.BinaryDataProvider.writeTo(BinaryDataProvider.java:118)
2014-10-21_21:06:40.23999 at org.apache.cxf.jaxrs.utils.JAXRSUtils.writeMessageBody(JAXRSUtils.java:1317)
2014-10-21_21:06:40.24000 at org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.serializeMessage(JAXRSOutInterceptor.java:284)
2014-10-21_21:06:40.24000 at org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.processResponse(JAXRSOutInterceptor.java:157)
2014-10-21_21:06:40.24001 at org.apache.cxf.jaxrs.interceptor.JAXRSOutInterceptor.handleMessage(JAXRSOutInterceptor.java:86)
2014-10-21_21:06:40.24002 at org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:272)
2014-10-21_21:06:40.24003 at org.apache.cxf.interceptor.OutgoingChainInterceptor.handleMessage(OutgoingChainInterceptor.java:77)
2014-10-21_21:06:40.24004 at org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:272)
2014-10-21_21:06:40.24007 at org.apache.cxf.transport.ChainInitiationObserver.onMessage(ChainInitiationObserver.java:121)
2014-10-21_21:06:40.24008 at org.apache.cxf.transport.http_jetty.JettyHTTPDestination.serviceRequest(JettyHTTPDestination.java:355)
2014-10-21_21:06:40.24009 at org.apache.cxf.transport.http_jetty.JettyHTTPDestination.doService(JettyHTTPDestination.java:319)
2014-10-21_21:06:40.24009 at org.apache.cxf.transport.http_jetty.JettyHTTPHandler.handle(JettyHTTPHandler.java:72)
2014-10-21_21:06:40.24010 at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1088)
2014-10-21_21:06:40.24010 at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1024)
2014-10-21_21:06:40.24011 at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
2014-10-21_21:06:40.24011 at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
2014-10-21_21:06:40.24012 at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
2014-10-21_21:06:40.24013 at org.eclipse.jetty.server.Server.handle(Server.java:370)
2014-10-21_21:06:40.24014 at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:494)
2014-10-21_21:06:40.24014 at org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:971)
2014-10-21_21:06:40.24015 at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:1033)
2014-10-21_21:06:40.24015 at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:651)
2014-10-21_21:06:40.24016 at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
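
For context on what a message/delivery-status body looks like: RFC 3464 defines it as groups of header-style fields separated by blank lines, where the first group holds per-message fields and each later group describes one recipient. The following is a minimal sketch of reading that layout with the standard library only. It is not how Tika or JavaMail handles the type, the class name is hypothetical, and it simplifies by keeping only the last value when a field name repeats within a group and by treating field names as case-sensitive.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class DeliveryStatusSketch {

    /**
     * Splits a message/delivery-status body into field groups.
     * Group 0 holds the per-message fields; each later group is one recipient.
     */
    public static List<Map<String, String>> parse(String body) {
        List<Map<String, String>> groups = new ArrayList<>();
        Map<String, String> current = new LinkedHashMap<>();
        String lastName = null;
        for (String line : body.split("\r?\n", -1)) {
            if (line.trim().isEmpty()) {
                // A blank line ends the current field group.
                if (!current.isEmpty()) {
                    groups.add(current);
                    current = new LinkedHashMap<>();
                }
                lastName = null;
            } else if ((line.startsWith(" ") || line.startsWith("\t")) && lastName != null) {
                // Folded continuation of the previous field value.
                current.put(lastName, current.get(lastName) + " " + line.trim());
            } else {
                int colon = line.indexOf(':');
                if (colon > 0) {
                    lastName = line.substring(0, colon).trim();
                    current.put(lastName, line.substring(colon + 1).trim());
                }
            }
        }
        if (!current.isEmpty()) {
            groups.add(current);
        }
        return groups;
    }

    public static void main(String[] args) {
        // Illustrative body; the host and address are made-up example values.
        String body = "Reporting-MTA: dns; mail.example.com\r\n"
                + "\r\n"
                + "Final-Recipient: rfc822; user@example.org\r\n"
                + "Action: failed\r\n"
                + "Status: 5.1.1\r\n";
        for (Map<String, String> group : parse(body)) {
            System.out.println(group);
        }
    }
}
```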
Re: import (re)ordering?
On Tue, 21 Oct 2014, Allison, Timothy B. wrote:

I have Intellij set to order imports by javax, java, then other. I think this is the most common pattern in Tika. Is it ok if I make these (meaningless/formatting) changes when I commit other changes?

The only downside of this is that the top of the commit message is then all noise, so it's more likely that people will end up skipping the review of the meat of the commit.

It's not always possible, but where you can, it's generally best to split tidy-up / formatting commits (whitespace, imports, formatting etc.) from ones that touch functionality.

Cheers
Nick
[jira] [Commented] (TIKA-1451) Add Recursive Metadata Parser Wrapper output to tika-app and gui
[ https://issues.apache.org/jira/browse/TIKA-1451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14179422#comment-14179422 ]

Hudson commented on TIKA-1451:
--

SUCCESS: Integrated in tika-trunk-jdk1.7 #276 (See [https://builds.apache.org/job/tika-trunk-jdk1.7/276/])
TIKA-1451 add RecursiveParserWrapper output to CLI and GUI (tallison: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1633499)
* /tika/trunk/CHANGES.txt
* /tika/trunk/tika-app/src/main/java/org/apache/tika/cli/TikaCLI.java
* /tika/trunk/tika-app/src/main/java/org/apache/tika/gui/TikaGUI.java
* /tika/trunk/tika-app/src/test/java/org/apache/tika/cli/TikaCLITest.java
* /tika/trunk/tika-app/src/test/resources/test-data/test_recursive_embedded.docx
* /tika/trunk/tika-serialization/src/main/java/org/apache/tika/metadata/serialization/JsonMetadata.java
* /tika/trunk/tika-serialization/src/main/java/org/apache/tika/metadata/serialization/JsonMetadataBase.java
* /tika/trunk/tika-serialization/src/main/java/org/apache/tika/metadata/serialization/JsonMetadataList.java
* /tika/trunk/tika-serialization/src/test/java/org/apache/tika/metadata/serialization/JsonMetadataListTest.java
* /tika/trunk/tika-serialization/src/test/java/org/apache/tika/metadata/serialization/JsonMetadataTest.java

Add Recursive Metadata Parser Wrapper output to tika-app and gui

Key: TIKA-1451
URL: https://issues.apache.org/jira/browse/TIKA-1451
Project: Tika
Issue Type: Improvement
Reporter: Tim Allison
Priority: Minor
Fix For: 1.7
Attachments: integrate_recursive_metadata_wrapper.patch

It would be helpful to expose the output of the recursive metadata parser wrapper in the gui and in the command line for tika-app.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1451) Add Recursive Metadata Parser Wrapper output to tika-app and gui
[ https://issues.apache.org/jira/browse/TIKA-1451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14179441#comment-14179441 ]

Hudson commented on TIKA-1451:
--

SUCCESS: Integrated in tika-trunk-jdk1.6 #255 (See [https://builds.apache.org/job/tika-trunk-jdk1.6/255/])
TIKA-1451 add RecursiveParserWrapper output to CLI and GUI (tallison: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1633499)
* /tika/trunk/CHANGES.txt
* /tika/trunk/tika-app/src/main/java/org/apache/tika/cli/TikaCLI.java
* /tika/trunk/tika-app/src/main/java/org/apache/tika/gui/TikaGUI.java
* /tika/trunk/tika-app/src/test/java/org/apache/tika/cli/TikaCLITest.java
* /tika/trunk/tika-app/src/test/resources/test-data/test_recursive_embedded.docx
* /tika/trunk/tika-serialization/src/main/java/org/apache/tika/metadata/serialization/JsonMetadata.java
* /tika/trunk/tika-serialization/src/main/java/org/apache/tika/metadata/serialization/JsonMetadataBase.java
* /tika/trunk/tika-serialization/src/main/java/org/apache/tika/metadata/serialization/JsonMetadataList.java
* /tika/trunk/tika-serialization/src/test/java/org/apache/tika/metadata/serialization/JsonMetadataListTest.java
* /tika/trunk/tika-serialization/src/test/java/org/apache/tika/metadata/serialization/JsonMetadataTest.java

Add Recursive Metadata Parser Wrapper output to tika-app and gui

Key: TIKA-1451
URL: https://issues.apache.org/jira/browse/TIKA-1451
Project: Tika
Issue Type: Improvement
Reporter: Tim Allison
Priority: Minor
Fix For: 1.7
Attachments: integrate_recursive_metadata_wrapper.patch

It would be helpful to expose the output of the recursive metadata parser wrapper in the gui and in the command line for tika-app.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)