[jira] [Commented] (TIKA-1302) Let's run Tika against a large batch of docs nightly
[ https://issues.apache.org/jira/browse/TIKA-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14194543#comment-14194543 ]

Tim Allison commented on TIKA-1302:
---

[~anjackson], the google docs link is down at the moment, so I can't see the full doc. If there is any way to capture the full stacktrace so that we can compare with our govdocs1 runs, that would be fantastic. You can see our current output format comparing two versions of PDFBox over on TIKA-1442. This is ongoing work (from my perspective), and there's no need to rush. Whichever option is easier for you...thank you for sharing!

{quote}
I don't think we changed the parse configuration significantly, so it seems HTML and XHTML and XML should all have gone through the HtmlParser (I'm not 100% sure about this, and will try to check).
{quote}

Y, if you could check, I'd be interested. I think the default behavior would be to send XML through the DcXMLParser, which is far stricter than the default HtmlParser. You can see by our choice on tika-server, though, that at least one dev prefers to have our HtmlParser handle xml. :)

Thank you, again!

Let's run Tika against a large batch of docs nightly

Key: TIKA-1302
URL: https://issues.apache.org/jira/browse/TIKA-1302
Project: Tika
Issue Type: Improvement
Components: cli, general, server
Reporter: Tim Allison

Many thanks to [~lewismc] for TIKA-1301! Once we get nightly builds up and running again, it might be fun to run Tika regularly against a large set of docs and report metrics. One excellent candidate corpus is govdocs1: http://digitalcorpora.org/corpora/files. Any other candidate corpora?

[~willp-bl], have anything handy you'd like to contribute? [http://www.openplanetsfoundation.org/blogs/2014-03-21-tika-ride-characterising-web-content-nanite] ;)

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (TIKA-1463) TesseractOCRParser does work in Windows
Hong-Thai Nguyen created TIKA-1463:
--

Summary: TesseractOCRParser does work in Windows
Key: TIKA-1463
URL: https://issues.apache.org/jira/browse/TIKA-1463
Project: Tika
Issue Type: Bug
Reporter: Hong-Thai Nguyen

STR:
* Case 1:
** Setting tesseractPath to C:\Program Files (x86)\Tesseract-OCR
** the checking available Tesseract command returns always false
* Case 2:
** Even setting to no wildcard in tesseractPath, say C:\Tesseract-OCR
** the checking running command of tesseract on Windows is not correct: C:\Tesseract-OCR\tesseract, it must be C:\Tesseract-OCR\tesseract.exe
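The two cases in this report (an install path containing spaces, and a missing .exe suffix on Windows) can be illustrated with a small JDK-only sketch. It is not the change that went into Tika: the class TesseractCommandSketch and its methods are hypothetical names introduced here, and the os.name value is passed in as a parameter so the logic can be exercised on any platform.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical helper for illustration only; not part of Tika.
final class TesseractCommandSketch {

    /** Builds the executable path from an install directory and an os.name value. */
    static String executable(String tesseractDir, String osName) {
        boolean windows = osName.toLowerCase(Locale.ROOT).startsWith("windows");
        String name = windows ? "tesseract.exe" : "tesseract";
        if (tesseractDir.isEmpty()) {
            return name; // no directory configured: rely on the PATH lookup
        }
        String separator = windows ? "\\" : "/";
        String dir = tesseractDir.endsWith(separator) ? tesseractDir : tesseractDir + separator;
        return dir + name;
    }

    /** Returns the command as a list so that each element stays a single argument. */
    static List<String> command(String tesseractDir, String osName, String... args) {
        List<String> cmd = new ArrayList<>();
        cmd.add(executable(tesseractDir, osName));
        cmd.addAll(Arrays.asList(args));
        return cmd;
    }
}
```

Passing the resulting list to `new ProcessBuilder(list)` keeps a directory such as C:\Program Files (x86)\Tesseract-OCR as one argument, which is the usual way to avoid splitting a path on its spaces.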
[jira] [Created] (TIKA-1464) Too many open files in system when parsing thousands of files
Tim Barrett created TIKA-1464:
-

Summary: Too many open files in system when parsing thousands of files
Key: TIKA-1464
URL: https://issues.apache.org/jira/browse/TIKA-1464
Project: Tika
Issue Type: Bug
Components: parser
Affects Versions: 1.6
Environment: Os-X 10.10, Windows 8.1 (probably all op systems)
Reporter: Tim Barrett
Priority: Blocker

Our big data project parses many thousands of different kinds of files sequentially. Up to and including Tika 1.5 this has been trouble free and Tika has been a pleasure to use. The files parsed are PDF, MSOffice and MSG files in roughly equal measure.

We switched to Tika 1.6 last week and this was a good enhancement for us as a number of files (MSOffice) that previously failed to parse do now parse correctly under Tika 1.6. However we have seen that a Too many open files in system exception is raised somewhere above 1 files having been parsed. On a windows server this exception is not raised but the system eventually begins to crawl.

Watching the system's behaviour with the apache tmp files we see that the apache tika files *are* being deleted from the file system, but lsof is showing all these files as remaining open by the running process using Tika. It would appear that the files are being deleted but handles to these files are not being cleared.
[jira] [Commented] (TIKA-1464) Too many open files in system when parsing thousands of files
[ https://issues.apache.org/jira/browse/TIKA-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14194685#comment-14194685 ]

Nick Burch commented on TIKA-1464:
--

Firstly, make sure you're closing the InputStream / TikaInputStream after parsing.

Secondly, try with a recent nightly build / build from svn, and see if that solves it. There have been some library upgrades that'll be in 1.7, which may help, but you'll need to use a nightly / snapshot build until 1.7 gets released (soonish).

Too many open files in system when parsing thousands of files

Key: TIKA-1464
URL: https://issues.apache.org/jira/browse/TIKA-1464
Project: Tika
Issue Type: Bug
Components: parser
Affects Versions: 1.6
Environment: Os-X 10.10, Windows 8.1 (probably all op systems)
Reporter: Tim Barrett
Priority: Blocker
Labels: TooManyOpenFilesInSystem
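The first suggestion, closing every stream after parsing, is commonly written with try-with-resources. The sketch below uses only the JDK and makes no claim about the reporter's code: StreamClosingSketch, consume() and countingSource() are hypothetical names, and consume() merely drains the stream where a real application would call a parser.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Hypothetical sketch for illustration only; consume() stands in for a real parse call.
final class StreamClosingSketch {

    /** Stand-in for parsing: drains the stream and returns the number of bytes read. */
    static long consume(InputStream in) throws IOException {
        byte[] buffer = new byte[8192];
        long total = 0;
        int read;
        while ((read = in.read(buffer)) != -1) {
            total += read;
        }
        return total;
    }

    /** Opens, consumes and closes each source in turn; returns the total bytes read. */
    static long processAll(List<Supplier<InputStream>> sources) {
        long total = 0;
        for (Supplier<InputStream> source : sources) {
            // try-with-resources closes the stream even if consume() throws
            try (InputStream in = source.get()) {
                total += consume(in);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
        return total;
    }

    /** In-memory source that counts close() calls, so the behaviour can be checked. */
    static Supplier<InputStream> countingSource(byte[] data, AtomicInteger closed) {
        return () -> new ByteArrayInputStream(data) {
            @Override
            public void close() throws IOException {
                closed.incrementAndGet();
                super.close();
            }
        };
    }
}
```

Opening the stream inside the loop body keeps at most one handle open at a time when files are processed sequentially, which is the pattern the comment asks the reporter to confirm.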
[jira] [Commented] (TIKA-1464) Too many open files in system when parsing thousands of files
[ https://issues.apache.org/jira/browse/TIKA-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14194691#comment-14194691 ]

Tim Barrett commented on TIKA-1464:
---

I double checked the input stream closing thoroughly before reporting. Finally clauses which close the input streams exist all over the place in the code so it's pretty robust as far as that is concerned. Also please note that open files within the process remain stable under versions pre 1.6.
[jira] [Commented] (TIKA-1463) TesseractOCRParser does work in Windows
[ https://issues.apache.org/jira/browse/TIKA-1463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14194694#comment-14194694 ]

Hong-Thai Nguyen commented on TIKA-1463:
---

Fixed in r1636382
[jira] [Updated] (TIKA-1463) TesseractOCRParser does not work in Windows
[ https://issues.apache.org/jira/browse/TIKA-1463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hong-Thai Nguyen updated TIKA-1463:
---

Summary: TesseractOCRParser does not work in Windows (was: TesseractOCRParser does work in Windows)
[jira] [Updated] (TIKA-1463) TesseractOCRParser does not work in Windows
[ https://issues.apache.org/jira/browse/TIKA-1463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hong-Thai Nguyen updated TIKA-1463:
---

Description:
STR:
* Case 1:
** Setting tesseractPath to a common installation path of Tesseract: C:\Program Files (x86)\Tesseract-OCR
** the checking available Tesseract command returns always false
* Case 2:
** Even setting to no space value in tesseractPath, says C:\Tesseract-OCR
** the checking running command of tesseract on Windows is not correct: C:\Tesseract-OCR\tesseract, it must be C:\Tesseract-OCR\tesseract.exe

was:
STR:
* Case 1:
** Setting tesseractPath to C:\Program Files (x86)\Tesseract-OCR
** the checking available Tesseract command returns always false
* Case 2:
** Even setting to no wildcard in tesseractPath, say C:\Tesseract-OCR
** the checking running command of tesseract on Windows is not correct: C:\Tesseract-OCR\tesseract, it must be C:\Tesseract-OCR\tesseract.exe
[jira] [Closed] (TIKA-1463) TesseractOCRParser does not work in Windows
[ https://issues.apache.org/jira/browse/TIKA-1463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hong-Thai Nguyen closed TIKA-1463.
--

Resolution: Fixed
[jira] [Commented] (TIKA-1463) TesseractOCRParser does not work in Windows
[ https://issues.apache.org/jira/browse/TIKA-1463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14194712#comment-14194712 ]

Hudson commented on TIKA-1463:
--

SUCCESS: Integrated in tika-trunk-jdk1.7 #297 (See [https://builds.apache.org/job/tika-trunk-jdk1.7/297/])
TIKA-1463 - Fix tesseractPath in Windows (thaichat04: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1636382)
* /tika/trunk/tika-core/src/main/java/org/apache/tika/parser/external/ExternalParser.java
* /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/ocr/TesseractOCRParser.java
* /tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/ocr/TesseractOCRParserTest.java
[jira] [Commented] (TIKA-1463) TesseractOCRParser does not work in Windows
[ https://issues.apache.org/jira/browse/TIKA-1463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14194733#comment-14194733 ]

Hudson commented on TIKA-1463:
--

SUCCESS: Integrated in tika-trunk-jdk1.6 #277 (See [https://builds.apache.org/job/tika-trunk-jdk1.6/277/])
TIKA-1463 - Fix tesseractPath in Windows (thaichat04: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1636382)
* /tika/trunk/tika-core/src/main/java/org/apache/tika/parser/external/ExternalParser.java
* /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/ocr/TesseractOCRParser.java
* /tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/ocr/TesseractOCRParserTest.java
[jira] [Commented] (TIKA-1464) Too many open files in system when parsing thousands of files
[ https://issues.apache.org/jira/browse/TIKA-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14194753#comment-14194753 ]

Tim Barrett commented on TIKA-1464:
---

Built using 1.7-SNAPSHOT from http://repository.apache.org/snapshots - there appear now to be fewer open files, but the amount still grows and eventually reaches too many open files.
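Growth in open handles like this can also be watched from inside the JVM, as a complement to lsof. The sketch below is an assumption-laden illustration rather than anything used in this thread: OpenFileCountSketch is a hypothetical class, and it relies on the JDK-specific com.sun.management.UnixOperatingSystemMXBean, which is only available on Unix-like platforms with a HotSpot/OpenJDK runtime. Elsewhere (for example on Windows) the method returns -1.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

// Hypothetical sketch for illustration only; depends on a JDK-specific management bean.
final class OpenFileCountSketch {

    /** Returns the number of file descriptors open in this JVM, or -1 if unavailable. */
    static long openFileDescriptorCount() {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        if (os instanceof com.sun.management.UnixOperatingSystemMXBean) {
            return ((com.sun.management.UnixOperatingSystemMXBean) os).getOpenFileDescriptorCount();
        }
        return -1L; // platform or runtime does not expose the counter
    }
}
```

Logging this value every few hundred files while parsing would show whether the count levels off or keeps climbing between Tika versions.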
Re: Review Request 27414: GRIB Parser for TIKA
On Oct. 31, 2014, 3:22 p.m., Lewis McGibbney wrote:
File Attachment: GribParser - GribParser.java
https://reviews.apache.org/r/27414/#fcomment48

Is this always available? What happens if we read an InputStream and not a File? Can we still populate Metadata.RESOURCE_NAME_KEY?

Tyler Palsulich wrote:
I think a better solution would be to create a TikaInputStream, then grab a temporary file from that. See TikaInputStream#get(InputStream, TemporaryResources) and TikaInputStream#getFile(). Then the Parser won't be dependent on a Metadata field.

+1 Tyler

- Lewis

---
This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/27414/#review59346
---

On Nov. 2, 2014, 3:17 p.m., Vineet Ghatge Hemantkumar wrote:

---
This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/27414/
---

(Updated Nov. 2, 2014, 3:17 p.m.)

Review request for tika, Lewis McGibbney, Chris Mattmann, and Tyler Palsulich.

Bugs: tika-1423
https://issues.apache.org/jira/browse/tika-1423

Repository: tika

Description
---
GRIB Parser Patch

Diffs
-
trunk/tika-parsers/pom.xml 1635045

Diff: https://reviews.apache.org/r/27414/diff/

Testing
---
To test the parser in place:
1. Download the patch and the three files - GribParserTest.java, GribParser.java and gdas.forecmwf
2. Put GribParser.java in the following folder - tika-parsers/src/main/java/org/apache/tika/parser/grib. You will need to create a folder named grib here
3. Put GribParserTest.java in the following folder - tika-parsers/src/test/java/org/apache/tika/parser/grib
4. Put the resource file in the following location - tika-parsers/src/test/resources/test-documents/
5. Apply the patch and build.

File Attachments

ParserTestFile
https://reviews.apache.org/media/uploaded/files/2014/10/31/840fcf4b-d67f-4ed5-8e7c-52d49c74c9d0__GribParserTest.java
GribParser
https://reviews.apache.org/media/uploaded/files/2014/10/31/2f897768-d61e-4985-a254-4a45fc821524__GribParser.java
Resource file
https://reviews.apache.org/media/uploaded/files/2014/10/31/a47d7101-98d7-4833-94f3-cdf31351e19e__gdas1.forecmwf.2014062612.grib2

Thanks,
Vineet Ghatge Hemantkumar
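The review comment in this message suggests obtaining a temporary file from the incoming stream (via TikaInputStream) instead of relying on a file name stored in Metadata. The underlying idea, spooling an InputStream to a temporary file so that a file-based library can read it, can be sketched with the JDK alone. This is an analogue under stated assumptions, not TikaInputStream's implementation and not the code under review: TempFileSpoolSketch and both of its methods are hypothetical names.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical JDK-only analogue for illustration; not how Tika implements it.
final class TempFileSpoolSketch {

    /** Copies the stream into a fresh temporary file and returns its path. */
    static Path spool(InputStream in) {
        try {
            Path tmp = Files.createTempFile("grib-sketch-", ".tmp");
            // createTempFile already created the file, so the copy must replace it
            Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
            return tmp;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Spools the stream, reports the file size, and always removes the temp file. */
    static long spoolMeasureAndDelete(InputStream in) {
        Path tmp = spool(in);
        try {
            return Files.size(tmp); // a file-based reader would open tmp here instead
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            try {
                Files.deleteIfExists(tmp);
            } catch (IOException ignored) {
                // best effort cleanup in this sketch
            }
        }
    }
}
```

Cleaning up in a finally block matters here for the same reason raised in TIKA-1464: temporary files that are created per document but never released accumulate over a long batch run.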
[jira] [Created] (TIKA-1465) Implement extraction of non-global variables from netCDF3 and netCDF4
Lewis John McGibbney created TIKA-1465:
--

Summary: Implement extraction of non-global variables from netCDF3 and netCDF4
Key: TIKA-1465
URL: https://issues.apache.org/jira/browse/TIKA-1465
Project: Tika
Issue Type: Improvement
Components: parser
Affects Versions: 1.6
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
Fix For: 1.7

Speaking to Eric Nienhouse at the ongoing NSF funded Polar Cyberinfrastructure hackathon in NYC, we became aware that variable parameters contained within netCDF3 and netCDF4 are just as valuable as (if not more valuable than) global attribute values. AFAIK, right now we only extract global attributes; however, we could extend the support to cater for the above observations.
Re: Review Request 27414: GRIB Parser for TIKA
On Nov. 2, 2014, 5:39 p.m., Tyler Palsulich wrote:
File Attachment: GribParser - GribParser.java
https://reviews.apache.org/r/27414/#fcomment51

Need a corresponding `xhtml.endElement(ul);`.

Corrected!

- Vineet Ghatge

---
This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/27414/#review59526
---
Re: Review Request 27414: GRIB Parser for TIKA
On Nov. 2, 2014, 5:39 p.m., Tyler Palsulich wrote:
File Attachment: GribParser - GribParser.java
https://reviews.apache.org/r/27414/#fcomment52

Need a corresponding `xhtml.endElement(ul);`.

Corrected!

- Vineet Ghatge

---
This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/27414/#review59526
---
Re: Review Request 27414: GRIB Parser for TIKA
On Oct. 31, 2014, 3:22 p.m., Lewis McGibbney wrote:
File Attachment: GribParser - GribParser.java
https://reviews.apache.org/r/27414/#fcomment49

Formatting and TikaException message is not correct. I would suggest that we stick to GRIB parse error. Additionally, I don't know if it is wise for us to have such a long try catch scenario!

Added multiple try catch blocks for different sections.

On Oct. 31, 2014, 3:22 p.m., Lewis McGibbney wrote:
File Attachment: GribParser - GribParser.java
https://reviews.apache.org/r/27414/#fcomment50

Of course this is not 100% correct as in this case the underlying library is being used to parse GRIB2 files... correct?

Corrected the links, and yes, we are using it to parse GRIB2 files.

- Vineet Ghatge

---
This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/27414/#review59346
---
Re: Review Request 27414: GRIB Parser for TIKA
On Nov. 2, 2014, 3:01 a.m., Chris Mattmann wrote:
trunk/tika-parsers/pom.xml, line 84
https://reviews.apache.org/r/27414/diff/1/?file=745304#file745304line84

shouldn't this replace the above dependency?

I am not sure if there are components which depend on it. I know that netcdf still depends on the old version of the jar.

- Vineet Ghatge

---
This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/27414/#review59513
---
[jira] [Commented] (TIKA-1464) Too many open files in system when parsing thousands of files
[ https://issues.apache.org/jira/browse/TIKA-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14195206#comment-14195206 ]

Luis Filipe Nassif commented on TIKA-1464:
--

You can attach the file leak detector agent (http://file-leak-detector.kohsuke.org/) to the running JVM to track where the handles were opened. The http server feature was very useful to me.
[jira] [Commented] (TIKA-1463) TesseractOCRParser does not work in Windows
[ https://issues.apache.org/jira/browse/TIKA-1463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14195222#comment-14195222 ]

Luis Filipe Nassif commented on TIKA-1463:
--

Curious, because here I can run tesseract command on windows 7 without the .exe extension.
Parse Html with Tika
Dear All,

I am Phuong Linh. I am using Tika to extract content from HTML files for search, but HtmlParser cannot parse all the tags of the HTML. (I fetch the HTML pages with Nutch, then use Tika to extract the important information, and then use Solr to search.)

Can you tell me what I can do to parse all the tags of the HTML?

Thanks in advance!

Regards,
Tang Thi Phuong Linh.

--
P.Linh
RE: Parse Html with Tika
From: Linh Tang
Sent: November 3, 2014 2:30:46pm PST
To: dev@tika.apache.org
Subject: Parse Html with Tika

Dear All, I am Phuong Linh, I am using Tika to extract content form Html file to search. But HtmlParser cannot parse all tag of Html.

I'm not sure what you mean by cannot parse all tag of Html. Do you have an example of an HTML page, and text that isn't being extracted?

-- Ken

( I get Html page by Nutch, then use Tika to extract the important information, after then use Solr to search.) Can you tell me what i can do to parse all tag of html. Thanks advance! Regards, Tang Thi Phuong Linh. -- P.Linh

--
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions training
Hadoop, Cascading, Cassandra Solr
Review Request 27562: GRIB Parser for TIKA
---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/27562/
---

Review request for tika, Lewis McGibbney, Chris Mattmann, Tyler Palsulich, and Vineet Ghatge Hemantkumar.

Bugs: tika-1423
    https://issues.apache.org/jira/browse/tika-1423

Repository: tika

Description
---
GRIB Parser Patch

Diffs
---
  ./trunk/tika-parsers/pom.xml 1636144
  ./trunk/tika-parsers/src/main/java/org/apache/tika/parser/grib/GribParser.java PRE-CREATION
  ./trunk/tika-parsers/src/main/resources/META-INF/services/org.apache.tika.parser.Parser 1636144
  ./trunk/tika-parsers/src/test/java/org/apache/tika/parser/grib/GribParserTest.java PRE-CREATION
  ./trunk/tika-parsers/src/test/resources/test-documents/gdas1.forecmwf.2014062612.grib2 UNKNOWN

Diff: https://reviews.apache.org/r/27562/diff/

Testing
---
update for #27414

Thanks,
Chris Mattmann
Re: Review Request 27562: GRIB Parser for TIKA
---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/27562/
---

(Updated Nov. 4, 2014, 5:17 a.m.)

Review request for tika, Lewis McGibbney, Chris Mattmann, Tyler Palsulich, and Vineet Ghatge Hemantkumar.

Bugs: tika-1423
    https://issues.apache.org/jira/browse/tika-1423

Repository: tika

Description
---
GRIB Parser Patch

Diffs
---
  ./trunk/tika-parsers/pom.xml 1636144
  ./trunk/tika-parsers/src/main/java/org/apache/tika/parser/grib/GribParser.java PRE-CREATION
  ./trunk/tika-parsers/src/main/resources/META-INF/services/org.apache.tika.parser.Parser 1636144
  ./trunk/tika-parsers/src/test/java/org/apache/tika/parser/grib/GribParserTest.java PRE-CREATION
  ./trunk/tika-parsers/src/test/resources/test-documents/gdas1.forecmwf.2014062612.grib2 UNKNOWN

Diff: https://reviews.apache.org/r/27562/diff/

Testing (updated)
---
update for #27414

FYI for the life of me, I can't get the unit test to pass (was this working for you @Vinegh?)

Patch is fully up to date with trunk and compiles at least.

Thanks,
Chris Mattmann
Re: Review Request 27562: GRIB Parser for TIKA
---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/27562/#review59728
---

Patch is looking good. I am testing

- Lewis McGibbney
Re: Review Request 27562: GRIB Parser for TIKA
> On Nov. 4, 2014, 5:23 a.m., Lewis McGibbney wrote:
> > Patch is looking good. I am testing

Yes @mattmann, the unit test passes for me

- Vineet Ghatge

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/27562/#review59728
---
Re: Review Request 27414: GRIB Parser for TIKA
---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/27414/
---

(Updated Nov. 4, 2014, 5:36 a.m.)

Review request for tika, Lewis McGibbney, Chris Mattmann, and Tyler Palsulich.

Bugs: tika-1423
    https://issues.apache.org/jira/browse/tika-1423

Repository: tika

Description
---
GRIB Parser Patch

Diffs
---
  trunk/tika-parsers/pom.xml 1635045

Diff: https://reviews.apache.org/r/27414/diff/

Testing
---
To test the parser in place:
1. Download the patch and three files - GribParserTest.java, GribParser.java and gdas.forecmwf
2. Put GribParser.java in the following folder - tika-parsers/src/main/java/org/apache/tika/parser/grib. You will need to create a folder named grib here
3. Put GribParserTest.java in the following folder - tika-parsers/src/test/java/org/apache/tika/parser/grib
4. Put the resource file in the following location - tika-parsers/src/test/resources/test-documents/
5. Apply the patch and build.

File Attachments
---
ParserTestFile
  https://reviews.apache.org/media/uploaded/files/2014/10/31/840fcf4b-d67f-4ed5-8e7c-52d49c74c9d0__GribParserTest.java
GribParser
  https://reviews.apache.org/media/uploaded/files/2014/10/31/2f897768-d61e-4985-a254-4a45fc821524__GribParser.java
Resource file
  https://reviews.apache.org/media/uploaded/files/2014/10/31/a47d7101-98d7-4833-94f3-cdf31351e19e__gdas1.forecmwf.2014062612.grib2

Thanks,
Vineet Ghatge Hemantkumar
Re: Review Request 27562: GRIB Parser for TIKA
> On Nov. 4, 2014, 5:23 a.m., Lewis McGibbney wrote:
> > Patch is looking good. I am testing
>
> Vineet Ghatge Hemantkumar wrote:
>     Yes @mattmann, the unit test passes for me

what is the grib file please? Where can I find it?

- Lewis

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/27562/#review59728
---
Re: Review Request 27562: GRIB Parser for TIKA
> On Nov. 4, 2014, 5:23 a.m., Lewis McGibbney wrote:
> > Patch is looking good. I am testing
>
> Vineet Ghatge Hemantkumar wrote:
>     Yes @mattmann, the unit test passes for me
>
> Lewis McGibbney wrote:
>     what is the grib file please? Where can I find it?

This is the grib file - gdas1.forecmwf.2014062612.grib2 and this is under the following location ./trunk/tika-parsers/src/test/resources/test-documents/

- Vineet Ghatge

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/27562/#review59728
---
Re: Review Request 27562: GRIB Parser for TIKA
---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/27562/#review59734
---

OK, test is also failing for me with Tika trunk as follows

---
Test set: org.apache.tika.parser.grib.GribParserTest
---
Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 0.076 sec FAILURE!
testParseGlobalMetadata(org.apache.tika.parser.grib.GribParserTest) Time elapsed: 0.075 sec ERROR!
org.apache.tika.exception.TikaException: NetCDF parse error
    at org.apache.tika.parser.grib.GribParser.parse(GribParser.java:134)
    at org.apache.tika.parser.grib.GribParserTest.testParseGlobalMetadata(GribParserTest.java:52)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
    at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
    at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
    at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
    at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:271)
    at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:70)
    at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50)
    at org.junit.runners.ParentRunner$3.run(ParentRunner.java:238)
    at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:63)
    at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:236)
    at org.junit.runners.ParentRunner.access$000(ParentRunner.java:53)
    at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:229)
    at org.junit.runners.ParentRunner.run(ParentRunner.java:309)
    at org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:236)
    at org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:134)
    at org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:113)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.maven.surefire.util.ReflectionUtils.invokeMethodWithArray(ReflectionUtils.java:189)
    at org.apache.maven.surefire.booter.ProviderFactory$ProviderProxy.invoke(ProviderFactory.java:165)
    at org.apache.maven.surefire.booter.ProviderFactory.invokeProvider(ProviderFactory.java:85)
    at org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:103)
    at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:74)
Caused by: java.io.IOException: Cant read gdas1.forecmwf.2014062612.grib2: not a valid CDM file.
    at ucar.nc2.NetcdfFile.open(NetcdfFile.java:803)
    at ucar.nc2.NetcdfFile.openInMemory(NetcdfFile.java:719)
    at org.apache.tika.parser.grib.GribParser.parse(GribParser.java:86)
    ... 30 more

We need to do more work here. Also, there are a number of issues which need to be addressed and carried over from the previous issue, I feel.

- Lewis McGibbney
Re: Review Request 27414: GRIB Parser for TIKA
---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/27414/
---

(Updated Nov. 4, 2014, 5:48 a.m.)

Review request for tika, Lewis McGibbney, Chris Mattmann, and Tyler Palsulich.

Bugs: tika-1423
    https://issues.apache.org/jira/browse/tika-1423

Repository: tika

Description
---
GRIB Parser Patch

Diffs
---
  trunk/tika-parsers/pom.xml 1635045

Diff: https://reviews.apache.org/r/27414/diff/

Thanks,
Vineet Ghatge Hemantkumar
Re: Review Request 27562: GRIB Parser for TIKA
> On Nov. 4, 2014, 5:45 a.m., Lewis McGibbney wrote:
> > OK, test is also failing for me with Tika trunk as follows
> > [...]
> > org.apache.tika.exception.TikaException: NetCDF parse error
> > [...]
> > Caused by: java.io.IOException: Cant read gdas1.forecmwf.2014062612.grib2: not a valid CDM file.
> > [...]
> > We need to do more work here. Also, there are a number of issues which need to be addressed and carried over from the previous issue, I feel.

I will see why it's failing.

- Vineet Ghatge

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/27562/#review59734
---
Re: Review Request 27562: GRIB Parser for TIKA
> On Nov. 4, 2014, 5:45 a.m., Lewis McGibbney wrote:
> > OK, test is also failing for me with Tika trunk as follows
> > [...]
> > org.apache.tika.exception.TikaException: NetCDF parse error
> > [...]
> > Caused by: java.io.IOException: Cant read gdas1.forecmwf.2014062612.grib2: not a valid CDM file.
> > [...]
> > We need to do more work here. Also, there are a number of issues which need to be addressed and carried over from the previous issue, I feel.
>
> Vineet Ghatge Hemantkumar wrote:
>     I will see why it's failing.

In the patch, I see that

    metadata.set(Metadata.RESOURCE_NAME_KEY, "gdas1.forecmwf.2014062612.grib2");

this should be the following

    metadata.set(Metadata.RESOURCE_NAME_KEY, "/test-documents/gdas1.forecmwf.2014062612.grib2");

- Vineet Ghatge

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/27562/#review59734
---
Re: Review Request 27562: GRIB Parser for TIKA
> On Nov. 4, 2014, 5:45 a.m., Lewis McGibbney wrote:
> > OK, test is also failing for me with Tika trunk as follows
> > [...]
> > org.apache.tika.exception.TikaException: NetCDF parse error
> > [...]
> > Caused by: java.io.IOException: Cant read gdas1.forecmwf.2014062612.grib2: not a valid CDM file.
> > [...]
> > We need to do more work here. Also, there are a number of issues which need to be addressed and carried over from the previous issue, I feel.
>
> Vineet Ghatge Hemantkumar wrote:
>     I will see why it's failing.
>
> Vineet Ghatge Hemantkumar wrote:
>     In the patch, I see that
>     metadata.set(Metadata.RESOURCE_NAME_KEY, "gdas1.forecmwf.2014062612.grib2");
>     this should be the following
>     metadata.set(Metadata.RESOURCE_NAME_KEY, "/test-documents/gdas1.forecmwf.2014062612.grib2");

Further, I am using the netcdfAll 4.5 jar, which is not what is in the repo. When I run from the command line using the following commands, it works:

    javac -cp .:netcdfAll-4.5.jar:tika-app-1.7-SNAPSHOT.jar:junit-4.11.jar Grib.java
    java -cp .:netcdfAll-4.5.jar:tika-app-1.7-SNAPSHOT.jar:junit-4.11.jar Grib

I am trying to rebuild in Tika.

- Vineet Ghatge

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/27562/#review59734
---
Re: Parse Html with Tika
Hi Linh

You can specify a mapper to control what the html parser will filter or not. See https://github.com/DigitalPebble/storm-crawler/commit/27364cb7ddb3998f973ab6e09f384e28cc5b7639 for an example.

Julien

On Monday, 3 November 2014, Linh Tang ttplinh2...@gmail.com wrote:

> Dear All,
> I am Phuong Linh, I am using Tika to extract content form Html file to search. But HtmlParser cannot parse all tag of Html.
> ( I get Html page by Nutch, then use Tika to extract the important information, after then use Solr to search.) Can you tell me what i can do to parse all tag of html.
> Thanks advance!
> Regards,
> Tang Thi Phuong Linh.
> --
> P.Linh

--
Open Source Solutions for Text Engineering
http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
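As a rough illustration of the mapper approach, the sketch below registers Tika's IdentityHtmlMapper in the ParseContext so that HtmlParser passes elements through instead of applying its default filtering. It may or may not match what the linked commit does. The class name KeepAllTagsSketch, the helper parseKeepingTags and the sample HTML are made up for this example, and the code assumes that tika-core and tika-parsers are on the classpath.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.html.HtmlMapper;
import org.apache.tika.parser.html.HtmlParser;
import org.apache.tika.parser.html.IdentityHtmlMapper;
import org.apache.tika.sax.ToXMLContentHandler;

public class KeepAllTagsSketch {

    /** Parses the given HTML and returns the XHTML output as a string. */
    public static String parseKeepingTags(String html) throws Exception {
        ParseContext context = new ParseContext();
        // Swap the default mapper, which drops many elements, for one
        // that lets elements and attributes through.
        context.set(HtmlMapper.class, IdentityHtmlMapper.INSTANCE);

        // Serializes the SAX events from the parser back to markup.
        ToXMLContentHandler handler = new ToXMLContentHandler();
        try (InputStream in = new ByteArrayInputStream(
                html.getBytes(StandardCharsets.UTF_8))) {
            new HtmlParser().parse(in, handler, new Metadata(), context);
        }
        return handler.toString();
    }

    public static void main(String[] args) throws Exception {
        String html = "<html><body><div class=\"price\">"
                + "<span>42</span></div></body></html>";
        System.out.println(parseKeepingTags(html));
    }
}
```

If only some tags are needed, a custom HtmlMapper implementation can be registered in the same way in place of IdentityHtmlMapper.INSTANCE.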
Re: Review Request 27562: GRIB Parser for TIKA
> On Nov. 4, 2014, 5:45 a.m., Lewis McGibbney wrote:
> > OK, test is also failing for me with Tika trunk as follows
> > [...]
> > org.apache.tika.exception.TikaException: NetCDF parse error
> > [...]
> > Caused by: java.io.IOException: Cant read gdas1.forecmwf.2014062612.grib2: not a valid CDM file.
> > [...]
> > We need to do more work here. Also, there are a number of issues which need to be addressed and carried over from the previous issue, I feel.
>
> Vineet Ghatge Hemantkumar wrote:
>     I will see why it's failing.
>
> Vineet Ghatge Hemantkumar wrote:
>     In the patch, I see that
>     metadata.set(Metadata.RESOURCE_NAME_KEY, "gdas1.forecmwf.2014062612.grib2");
>     this should be the following
>     metadata.set(Metadata.RESOURCE_NAME_KEY, "/test-documents/gdas1.forecmwf.2014062612.grib2");
>
> Vineet Ghatge Hemantkumar wrote:
>     Further, I am using the netcdfAll 4.5 jar, which is not what is in the repo. When I run from the command line using the following commands, it works:
>     javac -cp .:netcdfAll-4.5.jar:tika-app-1.7-SNAPSHOT.jar:junit-4.11.jar Grib.java
>     java -cp .:netcdfAll-4.5.jar:tika-app-1.7-SNAPSHOT.jar:junit-4.11.jar Grib
>     I am trying to rebuild in Tika.

I am trying to apply the patch and it keeps erroring out, saying that this is not a valid patch?

- Vineet Ghatge

---
This is an automatically generated e-mail. To reply,