[jira] [Commented] (TIKA-1302) Let's run Tika against a large batch of docs nightly

2014-11-03 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14194543#comment-14194543 ] Tim Allison commented on TIKA-1302: --- [~anjackson], the google docs link is down at the

[jira] [Created] (TIKA-1463) TesseractOCRParser does work in Windows

2014-11-03 Thread Hong-Thai Nguyen (JIRA)
Hong-Thai Nguyen created TIKA-1463: -- Summary: TesseractOCRParser does work in Windows Key: TIKA-1463 URL: https://issues.apache.org/jira/browse/TIKA-1463 Project: Tika Issue Type: Bug

[jira] [Created] (TIKA-1464) Too many open files in system when parsing thousands of files

2014-11-03 Thread Tim Barrett (JIRA)
Tim Barrett created TIKA-1464: - Summary: Too many open files in system when parsing thousands of files Key: TIKA-1464 URL: https://issues.apache.org/jira/browse/TIKA-1464 Project: Tika Issue

[jira] [Commented] (TIKA-1464) Too many open files in system when parsing thousands of files

2014-11-03 Thread Nick Burch (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14194685#comment-14194685 ] Nick Burch commented on TIKA-1464: -- Firstly, make sure you're closing the InputStream /
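The advice in the comment above (close each InputStream once a file has been handled) could look roughly like the following when looping over many files. This is a minimal sketch using only the JDK, not code from the issue: the class name `StreamClosingSketch` and the `consume` method are hypothetical, and `consume` merely counts bytes where a real application would hand the stream to a parser. It assumes Java 7 or later for try-with-resources.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

public class StreamClosingSketch {

    // Hypothetical stand-in for a parser call: it only counts the bytes read.
    static long consume(InputStream in) throws IOException {
        long count = 0;
        byte[] buffer = new byte[8192];
        int read;
        while ((read = in.read(buffer)) != -1) {
            count += read;
        }
        return count;
    }

    // Opens one stream per file and closes it before moving to the next file,
    // so the number of open file handles stays constant however long the list is.
    static long processAll(List<Path> files) throws IOException {
        long total = 0;
        for (Path file : files) {
            try (InputStream in = Files.newInputStream(file)) {
                total += consume(in);
            } // 'in' is closed here, even if consume() throws
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("sketch", ".txt");
        Files.write(tmp, "hello".getBytes(StandardCharsets.UTF_8));
        System.out.println(processAll(Arrays.asList(tmp, tmp)));
        Files.delete(tmp);
    }
}
```

If descriptors still leak with this pattern in place, the remaining handles are presumably held somewhere other than the caller's own streams, which is the situation the later comments in this thread go on to discuss.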

[jira] [Commented] (TIKA-1464) Too many open files in system when parsing thousands of files

2014-11-03 Thread Tim Barrett (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14194691#comment-14194691 ] Tim Barrett commented on TIKA-1464: --- I double checked the input stream closing thoroughly

[jira] [Commented] (TIKA-1463) TesseractOCRParser does work in Windows

2014-11-03 Thread Hong-Thai Nguyen (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14194694#comment-14194694 ] Hong-Thai Nguyen commented on TIKA-1463: Fixed in r1636382 TesseractOCRParser

[jira] [Updated] (TIKA-1463) TesseractOCRParser does not work in Windows

2014-11-03 Thread Hong-Thai Nguyen (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hong-Thai Nguyen updated TIKA-1463: --- Summary: TesseractOCRParser does not work in Windows (was: TesseractOCRParser does work in

[jira] [Updated] (TIKA-1463) TesseractOCRParser does not work in Windows

2014-11-03 Thread Hong-Thai Nguyen (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hong-Thai Nguyen updated TIKA-1463: --- Description: STR: * Case 1: ** Setting tesseractPath to a common installation path of

[jira] [Closed] (TIKA-1463) TesseractOCRParser does not work in Windows

2014-11-03 Thread Hong-Thai Nguyen (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hong-Thai Nguyen closed TIKA-1463. -- Resolution: Fixed TesseractOCRParser does not work in Windows

[jira] [Commented] (TIKA-1463) TesseractOCRParser does not work in Windows

2014-11-03 Thread Hudson (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14194712#comment-14194712 ] Hudson commented on TIKA-1463: -- SUCCESS: Integrated in tika-trunk-jdk1.7 #297 (See

[jira] [Commented] (TIKA-1463) TesseractOCRParser does not work in Windows

2014-11-03 Thread Hudson (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14194733#comment-14194733 ] Hudson commented on TIKA-1463: -- SUCCESS: Integrated in tika-trunk-jdk1.6 #277 (See

[jira] [Commented] (TIKA-1464) Too many open files in system when parsing thousands of files

2014-11-03 Thread Tim Barrett (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14194753#comment-14194753 ] Tim Barrett commented on TIKA-1464: --- Built using 1.7-SNAPSHOT from

Re: Review Request 27414: GRIB Parser for TIKA

2014-11-03 Thread Lewis McGibbney
On Oct. 31, 2014, 3:22 p.m., Lewis McGibbney wrote: File Attachment: GribParser - GribParser.java https://reviews.apache.org/r/27414/#fcomment48 Is this always available? What happens if we read an InputStream and not a File? Can we still populate Metadata.RESOURCE_NAME_KEY?

[jira] [Created] (TIKA-1465) Implement extraction of non-global variables from netCDF3 and netCDF4

2014-11-03 Thread Lewis John McGibbney (JIRA)
Lewis John McGibbney created TIKA-1465: -- Summary: Implement extraction of non-global variables from netCDF3 and netCDF4 Key: TIKA-1465 URL: https://issues.apache.org/jira/browse/TIKA-1465

Re: Review Request 27414: GRIB Parser for TIKA

2014-11-03 Thread Vineet Ghatge Hemantkumar
On Nov. 2, 2014, 5:39 p.m., Tyler Palsulich wrote: File Attachment: GribParser - GribParser.java https://reviews.apache.org/r/27414/#fcomment51 Need a corresponding `xhtml.endElement(ul);`. Corrected! - Vineet Ghatge ---

Re: Review Request 27414: GRIB Parser for TIKA

2014-11-03 Thread Vineet Ghatge Hemantkumar
On Nov. 2, 2014, 5:39 p.m., Tyler Palsulich wrote: File Attachment: GribParser - GribParser.java https://reviews.apache.org/r/27414/#fcomment52 Need a corresponding `xhtml.endElement(ul);`. Corrected! - Vineet Ghatge ---
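The review point above (every started element needs a matching `endElement`) can be illustrated with plain SAX from the JDK. This is only a sketch under stated assumptions: it does not use Tika's own handler classes or the GribParser code under review, and `BalancedElementsSketch`, `BalanceChecker`, and `emitList` are hypothetical names introduced for illustration.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

public class BalancedElementsSketch {

    // Hypothetical handler that records open elements and rejects mismatched end tags.
    static class BalanceChecker extends DefaultHandler {
        private final Deque<String> open = new ArrayDeque<String>();

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            open.push(qName);
        }

        @Override
        public void endElement(String uri, String localName, String qName) throws SAXException {
            if (open.isEmpty() || !open.peek().equals(qName)) {
                throw new SAXException("Unbalanced end tag: " + qName);
            }
            open.pop();
        }

        boolean isBalanced() {
            return open.isEmpty();
        }
    }

    // Emits <ul><li></li> and closes the list only when closeList is true.
    // Returns whether every started element was ended.
    static boolean emitList(boolean closeList) throws SAXException {
        BalanceChecker handler = new BalanceChecker();
        Attributes none = new AttributesImpl();
        handler.startElement("", "ul", "ul", none);
        handler.startElement("", "li", "li", none);
        handler.endElement("", "li", "li");
        if (closeList) {
            handler.endElement("", "ul", "ul"); // the call the reviewer asked for
        }
        return handler.isBalanced();
    }
}
```

Here `emitList(true)` leaves no element open, while `emitList(false)` leaves `ul` open, which is the kind of omission the review comment flags.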

Re: Review Request 27414: GRIB Parser for TIKA

2014-11-03 Thread Vineet Ghatge Hemantkumar
On Oct. 31, 2014, 3:22 p.m., Lewis McGibbney wrote: File Attachment: GribParser - GribParser.java https://reviews.apache.org/r/27414/#fcomment49 Formatting and TikaException message is not correct. I would suggest that we stick to GRIB parse error. Additionally, I don't

Re: Review Request 27414: GRIB Parser for TIKA

2014-11-03 Thread Vineet Ghatge Hemantkumar
On Nov. 2, 2014, 3:01 a.m., Chris Mattmann wrote: trunk/tika-parsers/pom.xml, line 84 https://reviews.apache.org/r/27414/diff/1/?file=745304#file745304line84 shouldn't this replace the above dependency? I am not sure if there are components which depend on it. I know that netcdf

[jira] [Commented] (TIKA-1464) Too many open files in system when parsing thousands of files

2014-11-03 Thread Luis Filipe Nassif (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14195206#comment-14195206 ] Luis Filipe Nassif commented on TIKA-1464: -- You can attach the file leak detector

[jira] [Commented] (TIKA-1463) TesseractOCRParser does not work in Windows

2014-11-03 Thread Luis Filipe Nassif (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-1463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14195222#comment-14195222 ] Luis Filipe Nassif commented on TIKA-1463: -- Curious, because here I can run

Parse Html with Tika

2014-11-03 Thread Linh Tang
Dear All, I am Phuong Linh. I am using Tika to extract content from HTML files for search, but HtmlParser cannot parse all HTML tags. (I fetch HTML pages with Nutch, then use Tika to extract the important information, and then use Solr to search.) Can you tell me what I can do to parse all tags of

RE: Parse Html with Tika

2014-11-03 Thread Ken Krugler
From: Linh Tang Sent: November 3, 2014 2:30:46pm PST To: dev@tika.apache.org Subject: Parse Html with Tika Dear All, I am Phuong Linh, I am using Tika to extract content from Html file to search. But HtmlParser cannot parse all tag of Html. I'm not sure what you mean by cannot

Review Request 27562: GRIB Parser for TIKA

2014-11-03 Thread Chris Mattmann
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/27562/ --- Review request for tika, Lewis McGibbney, Chris Mattmann, Tyler Palsulich, and

Re: Review Request 27562: GRIB Parser for TIKA

2014-11-03 Thread Chris Mattmann
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/27562/ --- (Updated Nov. 4, 2014, 5:17 a.m.) Review request for tika, Lewis McGibbney,

Re: Review Request 27562: GRIB Parser for TIKA

2014-11-03 Thread Lewis McGibbney
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/27562/#review59728 --- Patch is looking good. I am testing - Lewis McGibbney On Nov. 4,

Re: Review Request 27562: GRIB Parser for TIKA

2014-11-03 Thread Vineet Ghatge Hemantkumar
On Nov. 4, 2014, 5:23 a.m., Lewis McGibbney wrote: Patch is looking good. I am testing Yes @mattmann, the unit test passes for me - Vineet Ghatge --- This is an automatically generated e-mail. To reply, visit:

Re: Review Request 27414: GRIB Parser for TIKA

2014-11-03 Thread Vineet Ghatge Hemantkumar
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/27414/ --- (Updated Nov. 4, 2014, 5:36 a.m.) Review request for tika, Lewis McGibbney,

Re: Review Request 27562: GRIB Parser for TIKA

2014-11-03 Thread Lewis McGibbney
On Nov. 4, 2014, 5:23 a.m., Lewis McGibbney wrote: Patch is looking good. I am testing Vineet Ghatge Hemantkumar wrote: Yes @mattmann, the unit test passes for me what is the grib file please? Where can I find it? - Lewis

Re: Review Request 27562: GRIB Parser for TIKA

2014-11-03 Thread Vineet Ghatge Hemantkumar
On Nov. 4, 2014, 5:23 a.m., Lewis McGibbney wrote: Patch is looking good. I am testing Vineet Ghatge Hemantkumar wrote: Yes @mattmann, the unit test passes for me Lewis McGibbney wrote: what is the grib file please? Where can I find it? This is the grib file -

Re: Review Request 27562: GRIB Parser for TIKA

2014-11-03 Thread Lewis McGibbney
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/27562/#review59734 --- OK, test is also failing for me with Tika trunk as follows 1

Re: Review Request 27414: GRIB Parser for TIKA

2014-11-03 Thread Vineet Ghatge Hemantkumar
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/27414/ --- (Updated Nov. 4, 2014, 5:48 a.m.) Review request for tika, Lewis McGibbney,

Re: Review Request 27562: GRIB Parser for TIKA

2014-11-03 Thread Vineet Ghatge Hemantkumar
On Nov. 4, 2014, 5:45 a.m., Lewis McGibbney wrote: OK, test is also failing for me with Tika trunk as follows 1 --- 2 Test set: org.apache.tika.parser.grib.GribParserTest 3

Re: Review Request 27562: GRIB Parser for TIKA

2014-11-03 Thread Vineet Ghatge Hemantkumar
On Nov. 4, 2014, 5:45 a.m., Lewis McGibbney wrote: OK, test is also failing for me with Tika trunk as follows 1 --- 2 Test set: org.apache.tika.parser.grib.GribParserTest 3

Re: Review Request 27562: GRIB Parser for TIKA

2014-11-03 Thread Vineet Ghatge Hemantkumar
On Nov. 4, 2014, 5:45 a.m., Lewis McGibbney wrote: OK, test is also failing for me with Tika trunk as follows 1 --- 2 Test set: org.apache.tika.parser.grib.GribParserTest 3

Re: Parse Html with Tika

2014-11-03 Thread Julien Nioche
Hi Linh You can specify a mapper to control what the html parser will filter or not. see https://github.com/DigitalPebble/storm-crawler/commit/27364cb7ddb3998f973ab6e09f384e28cc5b7639 for an example Julien On Monday, 3 November 2014, Linh Tang ttplinh2...@gmail.com wrote: Dear All, I am

Re: Review Request 27562: GRIB Parser for TIKA

2014-11-03 Thread Vineet Ghatge Hemantkumar
On Nov. 4, 2014, 5:45 a.m., Lewis McGibbney wrote: OK, test is also failing for me with Tika trunk as follows 1 --- 2 Test set: org.apache.tika.parser.grib.GribParserTest 3