[jira] [Commented] (TIKA-1302) Let's run Tika against a large batch of docs nightly

2014-11-03 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14194543#comment-14194543
 ] 

Tim Allison commented on TIKA-1302:
---

[~anjackson], the Google Docs link is down at the moment, so I can't see the 
full doc.  If there is any way to capture the full stack traces so that we can 
compare them with our govdocs1 runs, that would be fantastic.  You can see our 
current output format, comparing two versions of PDFBox, over on TIKA-1442. This 
is ongoing work (from my perspective), and there's no need to rush.  Whichever 
option is easier for you... thank you for sharing!

{quote}
 I don't think we changed the parse configuration significantly, so it seems 
HTML and XHTML and XML should all have gone through the HtmlParser (I'm not 
100% sure about this, and will try to check).
{quote}

Yes, if you could check, I'd be interested.  I think the default behavior would 
be to send XML through the DcXMLParser, which is far stricter than the default 
HtmlParser.  You can see from our choice on tika-server, though, that at least 
one dev prefers to have our HtmlParser handle XML. :)

Thank you, again!


 Let's run Tika against a large batch of docs nightly
 

 Key: TIKA-1302
 URL: https://issues.apache.org/jira/browse/TIKA-1302
 Project: Tika
  Issue Type: Improvement
  Components: cli, general, server
Reporter: Tim Allison

 Many thanks to [~lewismc] for TIKA-1301!  Once we get nightly builds up and 
 running again, it might be fun to run Tika regularly against a large set of 
 docs and report metrics.
 One excellent candidate corpus is govdocs1: 
 http://digitalcorpora.org/corpora/files.
 Any other candidate corpora?  
 [~willp-bl], have anything handy you'd like to contribute? 
 [http://www.openplanetsfoundation.org/blogs/2014-03-21-tika-ride-characterising-web-content-nanite]
  ;) 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (TIKA-1463) TesseractOCRParser does work in Windows

2014-11-03 Thread Hong-Thai Nguyen (JIRA)
Hong-Thai Nguyen created TIKA-1463:
--

 Summary: TesseractOCRParser does work in Windows
 Key: TIKA-1463
 URL: https://issues.apache.org/jira/browse/TIKA-1463
 Project: Tika
  Issue Type: Bug
Reporter: Hong-Thai Nguyen


STR:
* Case 1:
** Setting tesseractPath to C:\Program Files (x86)\Tesseract-OCR
** the checking available Tesseract command returns always false

* Case 2:
** Even setting to no wildcard in tesseractPath, say C:\Tesseract-OCR
** the checking  running command of tesseract on Windows is not correct: 
C:\Tesseract-OCR\tesseract, it must be C:\Tesseract-OCR\tesseract.exe
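
The two cases above can be illustrated with a small sketch. This is not the change that was committed for this issue; the class and method names are hypothetical. It assumes, without confirmation from the report, that a path containing spaces fails when the command is handled as one whitespace-separated string, so the command is kept as a list of tokens, and the `.exe` suffix is added on Windows.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative only: not the code committed for TIKA-1463.
final class TesseractCommandSketch {

    /** Joins the configured directory and the platform-specific executable name. */
    public static String executablePath(String tesseractPath, boolean isWindows) {
        String name = isWindows ? "tesseract.exe" : "tesseract";
        if (tesseractPath == null || tesseractPath.isEmpty()) {
            return name; // no directory configured: leave lookup to the PATH
        }
        char last = tesseractPath.charAt(tesseractPath.length() - 1);
        if (last == '\\' || last == '/') {
            return tesseractPath + name;
        }
        return tesseractPath + (isWindows ? "\\" : "/") + name;
    }

    /** Builds the command as a list so the executable path stays a single token. */
    public static List<String> command(String tesseractPath, boolean isWindows, String... args) {
        List<String> cmd = new ArrayList<String>();
        cmd.add(executablePath(tesseractPath, isWindows));
        cmd.addAll(Arrays.asList(args));
        return cmd;
    }

    public static void main(String[] args) {
        // Prints the tokens; a caller could hand the list to new ProcessBuilder(cmd).
        System.out.println(
                command("C:\\Program Files (x86)\\Tesseract-OCR", true, "input.png", "output"));
    }
}
```

Passing the resulting list to `ProcessBuilder` avoids any re-splitting of the path on spaces, which is why a list is used here rather than a single command string.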





[jira] [Created] (TIKA-1464) Too many open files in system when parsing thousands of files

2014-11-03 Thread Tim Barrett (JIRA)
Tim Barrett created TIKA-1464:
-

 Summary: Too many open files in system when parsing thousands of 
files
 Key: TIKA-1464
 URL: https://issues.apache.org/jira/browse/TIKA-1464
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.6
 Environment: Os-X 10.10, Windows 8.1 (probably all op systems)
Reporter: Tim Barrett
Priority: Blocker


Our big data project parses many thousands of different kinds of files 
sequentially. Up to and including Tika 1.5 this has been trouble free and Tika 
has been a pleasure to use. The files parsed are PDF, MSOffice and MSG files in 
roughly equal measure.

We switched to Tika 1.6 last week and this was a good enhancement for us as a 
number of files (MSOffice) that previously failed to parse do now parse 
correctly under Tika 1.6.

However we have seen that a Too many open files in system exception is raised 
somewhere above 1 files having been parsed. On a windows server this 
exception is not raised but the system eventually begins to crawl.

Watching the system's behaviour with the apache tmp files we see that the 
apache tika files *are* being deleted from the file system, but lsof is showing 
all these files as remaining open by the running process using Tika. It would 
appear that the files are being deleted but handles to these files are not 
being cleared.
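
As a complement to watching lsof from outside, the handle count can also be sampled from inside the JVM. The following is a minimal sketch, assuming a HotSpot/OpenJDK-style JVM on a Unix-like system where the platform MXBean implements `com.sun.management.UnixOperatingSystemMXBean`; the class name `OpenFileProbe` is hypothetical, and on other platforms the method simply returns -1.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

// Illustrative helper for logging the process's open handle count between parses.
final class OpenFileProbe {

    /** Returns the number of open file descriptors, or -1 if the JVM does not expose it. */
    public static long openFileDescriptors() {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        if (os instanceof com.sun.management.UnixOperatingSystemMXBean) {
            return ((com.sun.management.UnixOperatingSystemMXBean) os)
                    .getOpenFileDescriptorCount();
        }
        return -1L; // e.g. Windows, or a JVM without this MXBean
    }

    public static void main(String[] args) {
        System.out.println("open file descriptors: " + openFileDescriptors());
    }
}
```

Logging this value every few hundred files would show whether the count climbs steadily, as the report describes, or stays flat.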





[jira] [Commented] (TIKA-1464) Too many open files in system when parsing thousands of files

2014-11-03 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14194685#comment-14194685
 ] 

Nick Burch commented on TIKA-1464:
--

Firstly, make sure you're closing the InputStream / TikaInputStream after 
parsing.

Secondly, try a recent nightly build / build from svn, and see if that 
solves it. There have been some library upgrades that will be in 1.7, which may 
help, but you'll need to use a nightly / snapshot build until 1.7 gets released 
(soonish).
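
The first point can be sketched as follows. This is a minimal illustration using only the JDK, with a hypothetical `StreamConsumer` interface standing in for the actual parse call; it assumes Java 7 or later for try-with-resources (on Java 6 the same effect needs an explicit finally block).

```java
import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.List;

// Illustrative only: shows one stream per file, closed before the next file is opened.
final class SequentialParseSketch {

    /** Hypothetical stand-in for whatever consumes the stream (e.g. a parser call). */
    interface StreamConsumer {
        void consume(InputStream in) throws IOException;
    }

    /** Processes the files in order and returns how many completed without an IOException. */
    public static int parseAll(List<File> files, StreamConsumer parser) {
        int ok = 0;
        for (File file : files) {
            // The stream is closed when this block exits, whether or not consume() throws.
            try (InputStream in = new BufferedInputStream(new FileInputStream(file))) {
                parser.consume(in);
                ok++;
            } catch (IOException e) {
                System.err.println("Failed on " + file + ": " + e.getMessage());
            }
        }
        return ok;
    }
}
```

A failure on one file is logged and the loop continues, so a single bad document does not leave a stream open or stop the batch.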

 Too many open files in system when parsing thousands of files
 -

 Key: TIKA-1464
 URL: https://issues.apache.org/jira/browse/TIKA-1464
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.6
 Environment: Os-X 10.10, Windows 8.1 (probably all op systems)
Reporter: Tim Barrett
Priority: Blocker
  Labels: TooManyOpenFilesInSystem



[jira] [Commented] (TIKA-1464) Too many open files in system when parsing thousands of files

2014-11-03 Thread Tim Barrett (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14194691#comment-14194691
 ] 

Tim Barrett commented on TIKA-1464:
---

I double-checked the input stream closing thoroughly before reporting. Finally 
clauses that close the input streams exist throughout the code, so it's pretty 
robust as far as that is concerned. Also, please note that the number of open 
files within the process remains stable under versions before 1.6.



[jira] [Commented] (TIKA-1463) TesseractOCRParser does work in Windows

2014-11-03 Thread Hong-Thai Nguyen (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14194694#comment-14194694
 ] 

Hong-Thai Nguyen commented on TIKA-1463:


Fixed in r1636382



[jira] [Updated] (TIKA-1463) TesseractOCRParser does not work in Windows

2014-11-03 Thread Hong-Thai Nguyen (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hong-Thai Nguyen updated TIKA-1463:
---
Summary: TesseractOCRParser does not work in Windows  (was: 
TesseractOCRParser does work in Windows)



[jira] [Updated] (TIKA-1463) TesseractOCRParser does not work in Windows

2014-11-03 Thread Hong-Thai Nguyen (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hong-Thai Nguyen updated TIKA-1463:
---
Description: 
STR:
* Case 1:
** Set tesseractPath to a common Tesseract installation path:  
C:\Program Files (x86)\Tesseract-OCR
** The check for an available Tesseract command always returns false

* Case 2:
** Even with a tesseractPath that contains no spaces, say C:\Tesseract-OCR
** the command used to run tesseract on Windows is not correct: 
C:\Tesseract-OCR\tesseract; it must be C:\Tesseract-OCR\tesseract.exe

  was:
STR:
* Case 1:
** Setting tesseractPath to C:\Program Files (x86)\Tesseract-OCR
** the checking available Tesseract command returns always false

* Case 2:
** Even setting to no wildcard in tesseractPath, say C:\Tesseract-OCR
** the checking  running command of tesseract on Windows is not correct: 
C:\Tesseract-OCR\tesseract, it must be C:\Tesseract-OCR\tesseract.exe




[jira] [Closed] (TIKA-1463) TesseractOCRParser does not work in Windows

2014-11-03 Thread Hong-Thai Nguyen (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hong-Thai Nguyen closed TIKA-1463.
--
Resolution: Fixed



[jira] [Commented] (TIKA-1463) TesseractOCRParser does not work in Windows

2014-11-03 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14194712#comment-14194712
 ] 

Hudson commented on TIKA-1463:
--

SUCCESS: Integrated in tika-trunk-jdk1.7 #297 (See 
[https://builds.apache.org/job/tika-trunk-jdk1.7/297/])
TIKA-1463 - Fix tesseractPath in Windows (thaichat04: 
http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1636382)
* 
/tika/trunk/tika-core/src/main/java/org/apache/tika/parser/external/ExternalParser.java
* 
/tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/ocr/TesseractOCRParser.java
* 
/tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/ocr/TesseractOCRParserTest.java




[jira] [Commented] (TIKA-1463) TesseractOCRParser does not work in Windows

2014-11-03 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14194733#comment-14194733
 ] 

Hudson commented on TIKA-1463:
--

SUCCESS: Integrated in tika-trunk-jdk1.6 #277 (See 
[https://builds.apache.org/job/tika-trunk-jdk1.6/277/])
TIKA-1463 - Fix tesseractPath in Windows (thaichat04: 
http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1636382)
* 
/tika/trunk/tika-core/src/main/java/org/apache/tika/parser/external/ExternalParser.java
* 
/tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/ocr/TesseractOCRParser.java
* 
/tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/ocr/TesseractOCRParserTest.java




[jira] [Commented] (TIKA-1464) Too many open files in system when parsing thousands of files

2014-11-03 Thread Tim Barrett (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14194753#comment-14194753
 ] 

Tim Barrett commented on TIKA-1464:
---

Built using 1.7-SNAPSHOT from http://repository.apache.org/snapshots - there 
now appear to be fewer open files, but the number still grows and eventually 
hits the "too many open files" limit.



Re: Review Request 27414: GRIB Parser for TIKA

2014-11-03 Thread Lewis McGibbney


 On Oct. 31, 2014, 3:22 p.m., Lewis McGibbney wrote:
  File Attachment: GribParser - GribParser.java
  https://reviews.apache.org/r/27414/#fcomment48
 
  Is this always available? What happens if we read an InputStream and 
  not a File? Can we still populate Metadata.RESOURCE_NAME_KEY?
 
 Tyler Palsulich wrote:
 I think a better solution would be to create a TikaInputStream, then grab 
 a temporary file from that. See TikaInputStream#get(InputStream, 
 TemporaryResources) and TikaInputStream#getFile(). Then the Parser won't be 
 dependent on a Metadata field.

+1 Tyler
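
For readers unfamiliar with the pattern Tyler describes, here is a JDK-only sketch of the underlying idea: spool the incoming stream to a temporary file, hand the file to code that needs one, and always clean up. In Tika itself this is what TikaInputStream#get(InputStream, TemporaryResources) and TikaInputStream#getFile() are for; the class `SpoolToTempFileSketch` and the `FileAction` interface below are hypothetical stand-ins, not Tika API.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Illustrative only: a stand-in for the temp-file handling that TikaInputStream provides.
final class SpoolToTempFileSketch {

    /** Hypothetical callback for a library that can only read from a file. */
    interface FileAction<T> {
        T apply(Path file) throws IOException;
    }

    /** Copies the stream to a temp file, runs the action on it, then deletes the file. */
    public static <T> T withTempFile(InputStream in, FileAction<T> action) throws IOException {
        Path tmp = Files.createTempFile("spool-", ".tmp");
        try {
            // createTempFile already created the file, so allow the copy to overwrite it.
            Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
            return action.apply(tmp);
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

With this shape the parser no longer depends on the caller having supplied a file name in the metadata, which is the concern raised in the review comment.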


- Lewis


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/27414/#review59346
---


On Nov. 2, 2014, 3:17 p.m., Vineet Ghatge Hemantkumar wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/27414/
 ---
 
 (Updated Nov. 2, 2014, 3:17 p.m.)
 
 
 Review request for tika, Lewis McGibbney, Chris Mattmann, and Tyler Palsulich.
 
 
 Bugs: tika-1423
 https://issues.apache.org/jira/browse/tika-1423
 
 
 Repository: tika
 
 
 Description
 ---
 
 GRIB Parser Patch
 
 
 Diffs
 -
 
   trunk/tika-parsers/pom.xml 1635045 
 
 Diff: https://reviews.apache.org/r/27414/diff/
 
 
 Testing
 ---
 
 To test the parser in place:
 1. Download the patch and three files - GribParserTest.java, GribParser.java 
 and gdas.forecmwf
 2. Put GribParser.java in the following folder - 
 tika-parsers/src/main/java/org/apache/tika/parser/grib. You will need to 
 create a folder named grib here
 3. Put GribParserTest.java in the following folder - 
 tika-parsers/src/test/java/org/apache/tika/parser/grib
 4. Put the resource file in the following location - 
 tika-parsers/src/test/resources/test-documents/
 5. Apply the patch and build.
 
 
 File Attachments
 
 
 ParserTestFile
   
 https://reviews.apache.org/media/uploaded/files/2014/10/31/840fcf4b-d67f-4ed5-8e7c-52d49c74c9d0__GribParserTest.java
 GribParser
   
 https://reviews.apache.org/media/uploaded/files/2014/10/31/2f897768-d61e-4985-a254-4a45fc821524__GribParser.java
 Resource file
   
 https://reviews.apache.org/media/uploaded/files/2014/10/31/a47d7101-98d7-4833-94f3-cdf31351e19e__gdas1.forecmwf.2014062612.grib2
 
 
 Thanks,
 
 Vineet Ghatge Hemantkumar
 




[jira] [Created] (TIKA-1465) Implement extraction of non-global variables from netCDF3 and netCDF4

2014-11-03 Thread Lewis John McGibbney (JIRA)
Lewis John McGibbney created TIKA-1465:
--

 Summary: Implement extraction of non-global variables from netCDF3 
and netCDF4
 Key: TIKA-1465
 URL: https://issues.apache.org/jira/browse/TIKA-1465
 Project: Tika
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.6
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
 Fix For: 1.7


Speaking to Eric Nienhouse at the ongoing NSF-funded Polar Cyberinfrastructure 
hackathon in NYC, we became aware that the variable parameters contained within 
netCDF3 and netCDF4 files are just as valuable as (if not more valuable than) 
global attribute values. 
AFAIK, right now we only extract global attributes; however, we could extend the 
support to cover variables as well.  





Re: Review Request 27414: GRIB Parser for TIKA

2014-11-03 Thread Vineet Ghatge Hemantkumar


 On Nov. 2, 2014, 5:39 p.m., Tyler Palsulich wrote:
  File Attachment: GribParser - GribParser.java
  https://reviews.apache.org/r/27414/#fcomment51
 
  Need a corresponding `xhtml.endElement(ul);`.

Corrected!
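
The balancing rule behind this review comment can be shown with plain SAX from the JDK. This is a sketch, not the GribParser code: `BalancedListSketch` and `DepthTracker` are hypothetical names, and the standard `ContentHandler` interface is used here in place of Tika's XHTML handler.

```java
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Illustrative only: every startElement is paired with an endElement.
final class BalancedListSketch {

    static final String XHTML = "http://www.w3.org/1999/xhtml";

    /** Counts open elements so an unbalanced event sequence is easy to detect. */
    static final class DepthTracker extends DefaultHandler {
        int depth = 0;
        int maxDepth = 0;

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            depth++;
            maxDepth = Math.max(maxDepth, depth);
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            depth--;
        }
    }

    /** Emits a ul element with one li per item, closing each element it opens. */
    public static void writeList(ContentHandler out, String... items) throws SAXException {
        AttributesImpl none = new AttributesImpl();
        out.startElement(XHTML, "ul", "ul", none);
        for (String item : items) {
            out.startElement(XHTML, "li", "li", none);
            out.characters(item.toCharArray(), 0, item.length());
            out.endElement(XHTML, "li", "li");
        }
        out.endElement(XHTML, "ul", "ul"); // the call that was missing in the reviewed patch
    }
}
```

Running `writeList` against a `DepthTracker` and checking that `depth` ends at zero is a cheap way to catch a missing end tag in a unit test.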


- Vineet Ghatge


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/27414/#review59526
---






Re: Review Request 27414: GRIB Parser for TIKA

2014-11-03 Thread Vineet Ghatge Hemantkumar


 On Nov. 2, 2014, 5:39 p.m., Tyler Palsulich wrote:
  File Attachment: GribParser - GribParser.java
  https://reviews.apache.org/r/27414/#fcomment52
 
  Need a corresponding `xhtml.endElement(ul);`.

Corrected!


- Vineet Ghatge


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/27414/#review59526
---






Re: Review Request 27414: GRIB Parser for TIKA

2014-11-03 Thread Vineet Ghatge Hemantkumar


 On Oct. 31, 2014, 3:22 p.m., Lewis McGibbney wrote:
  File Attachment: GribParser - GribParser.java
  https://reviews.apache.org/r/27414/#fcomment49
 
  Formatting and TikaException message are not correct. I would suggest 
  that we stick to GRIB parse error.
  Additionally, I don't know if it is wise for us to have such a long 
  try/catch block!

Added multiple try/catch blocks for the different sections.


 On Oct. 31, 2014, 3:22 p.m., Lewis McGibbney wrote:
  File Attachment: GribParser - GribParser.java
  https://reviews.apache.org/r/27414/#fcomment50
 
  Of course this is not 100% correct, as in this case the underlying 
  library is being used to parse GRIB2 files... correct?

Corrected the links, and yes, we are using it to parse GRIB2 files.


- Vineet Ghatge


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/27414/#review59346
---






Re: Review Request 27414: GRIB Parser for TIKA

2014-11-03 Thread Vineet Ghatge Hemantkumar


 On Nov. 2, 2014, 3:01 a.m., Chris Mattmann wrote:
  trunk/tika-parsers/pom.xml, line 84
  https://reviews.apache.org/r/27414/diff/1/?file=745304#file745304line84
 
  Shouldn't this replace the above dependency?

I am not sure whether there are components which depend on it. I know that netcdf 
still depends on the old version of the jar.


- Vineet Ghatge


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/27414/#review59513
---






[jira] [Commented] (TIKA-1464) Too many open files in system when parsing thousands of files

2014-11-03 Thread Luis Filipe Nassif (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14195206#comment-14195206
 ] 

Luis Filipe Nassif commented on TIKA-1464:
--

You can attach the file leak detector agent 
(http://file-leak-detector.kohsuke.org/) to the running JVM to track where the 
handles were opened. Its HTTP server feature was very useful to me.

 Too many open files in system when parsing thousands of files
 -

 Key: TIKA-1464
 URL: https://issues.apache.org/jira/browse/TIKA-1464
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.6
 Environment: Os-X 10.10, Windows 8.1 (probably all op systems)
Reporter: Tim Barrett
Priority: Blocker
  Labels: TooManyOpenFilesInSystem

 Our big data project parses many thousands of different kinds of files 
 sequentially. Up to and including Tika 1.5 this has been trouble free and 
 Tika has been a pleasure to use. The files parsed are PDF, MSOffice and MSG 
 files in roughly equal measure.
 We switched to Tika 1.6 last week and this was a good enhancement for us as a 
 number of files (MSOffice) that previously failed to parse do now parse 
 correctly under Tika 1.6.
 However we have seen that a Too many open files in system exception is raised 
 somewhere above 1 files having been parsed. On a windows server this 
 exception is not raised but the system eventually begins to crawl.
 Watching the system's behaviour with the apache tmp files we see that the 
 apache tika files *are* being deleted from the file system, but lsof is 
 showing all these files as remaining open by the running process using Tika. 
 It would appear that the files are being deleted but handles to these files 
 are not being cleared.
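
 The symptom described above (temp files gone from disk while lsof still lists 
 them as open) is what an unclosed stream typically looks like on a Unix-like 
 system. As a minimal sketch, and not a diagnosis of Tika 1.6 itself, the 
 plain-JDK example below shows the pattern of scoping each stream with 
 try-with-resources so that its handle is released before the next file is 
 opened. The names SequentialReader, countBytes and countAll are hypothetical; 
 in a real pipeline the body of the try block would be the call into the parser.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Hypothetical helper: processes files one after another and
// releases each file handle before moving on to the next file.
class SequentialReader {

    // Reads one file to the end and returns its size in bytes.
    static long countBytes(Path file) throws IOException {
        // try-with-resources closes the stream even if read() throws
        try (InputStream in = Files.newInputStream(file)) {
            byte[] buffer = new byte[8192];
            long total = 0;
            int n;
            while ((n = in.read(buffer)) != -1) {
                total += n;
            }
            return total;
        }
    }

    // Processes the files sequentially; at most one handle is open at a time.
    static long countAll(List<Path> files) throws IOException {
        long total = 0;
        for (Path file : files) {
            total += countBytes(file);
        }
        return total;
    }
}
```

 If handles still accumulate with this pattern in the calling code, the leak is 
 more likely inside a library, which is where a tool that records the stack 
 trace of each open call becomes useful.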



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1463) TesseractOCRParser does not work in Windows

2014-11-03 Thread Luis Filipe Nassif (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14195222#comment-14195222
 ] 

Luis Filipe Nassif commented on TIKA-1463:
--

Curious, because here I can run the tesseract command on Windows 7 without the 
.exe extension.

 TesseractOCRParser does not work in Windows
 ---

 Key: TIKA-1463
 URL: https://issues.apache.org/jira/browse/TIKA-1463
 Project: Tika
  Issue Type: Bug
Reporter: Hong-Thai Nguyen

 STR:
 * Case 1:
 ** Set tesseractPath to a common Tesseract installation path:  
 C:\Program Files (x86)\Tesseract-OCR
 ** The check for an available Tesseract command always returns false
 * Case 2:
 ** Even when tesseractPath is set to a value without spaces, say C:\Tesseract-OCR
 ** the command used to check for tesseract on Windows is not correct: 
 C:\Tesseract-OCR\tesseract; it must be C:\Tesseract-OCR\tesseract.exe
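
 The two cases above can be sketched as follows. This is an illustration under 
 assumptions, not the TesseractOCRParser code: the class TesseractCommand and 
 its methods are hypothetical. It appends .exe when the OS name indicates 
 Windows and keeps the executable as a single list element, so that a directory 
 containing spaces is not split into several arguments. The comment on this 
 issue reports that the bare name also works on Windows 7, so the suffix is 
 best read as a safeguard rather than a confirmed fix.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical helper: builds the command used to probe for tesseract.
class TesseractCommand {

    // Returns the executable path for the given install directory,
    // adding ".exe" when osName looks like Windows. An empty directory
    // means "rely on the PATH".
    static String resolve(String tesseractPath, String osName) {
        boolean windows = osName.toLowerCase(Locale.ROOT).startsWith("windows");
        String separator = windows ? "\\" : "/";
        String dir = tesseractPath;
        if (!dir.isEmpty() && !dir.endsWith(separator)) {
            dir = dir + separator;
        }
        return dir + (windows ? "tesseract.exe" : "tesseract");
    }

    // One list element per argument: a path with spaces stays one argument.
    static List<String> command(String tesseractPath, String osName, String... args) {
        List<String> cmd = new ArrayList<>();
        cmd.add(resolve(tesseractPath, osName));
        cmd.addAll(Arrays.asList(args));
        return cmd;
    }
}
```

 The resulting list can be passed directly to new ProcessBuilder(list), with 
 System.getProperty("os.name") supplying the osName argument.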





Parse Html with Tika

2014-11-03 Thread Linh Tang
Dear All,

I am Phuong Linh,
I am using Tika to extract content from Html files for search. But HtmlParser
cannot parse all tags of Html.  (I get the Html pages with Nutch, then use Tika to
extract the important information, and then use Solr to search.)
Can you tell me what I can do to parse all tags of Html?

Thanks in advance!

Regards,
Tang Thi Phuong Linh.
-- 
P.Linh


RE: Parse Html with Tika

2014-11-03 Thread Ken Krugler

 From: Linh Tang
 Sent: November 3, 2014 2:30:46pm PST
 To: dev@tika.apache.org
 Subject: Parse Html with Tika
 
 Dear All,
 
 I am Phuong Linh,
 I am using Tika to extract content form Html file to search. But HtmlParser
 cannot parse all tag of Html.  

I'm not sure what you mean by "cannot parse all tag of Html".

Do you have an example of an HTML page, and text that isn't being extracted?

-- Ken

 ( I get Html page by Nutch, then use Tika to
 extract the important information, after then use Solr to search.)
 Can you tell me what i can do to parse all tag of html.
 
 Thanks advance!
 
 Regards,
 Tang Thi Phuong Linh.
 -- 
 P.Linh

--
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr












Review Request 27562: GRIB Parser for TIKA

2014-11-03 Thread Chris Mattmann

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/27562/
---

Review request for tika, Lewis McGibbney, Chris Mattmann, Tyler Palsulich, and 
Vineet Ghatge Hemantkumar.


Bugs: tika-1423
https://issues.apache.org/jira/browse/tika-1423


Repository: tika


Description
---

GRIB Parser Patch


Diffs
-

  ./trunk/tika-parsers/pom.xml 1636144 
  
./trunk/tika-parsers/src/main/java/org/apache/tika/parser/grib/GribParser.java 
PRE-CREATION 
  
./trunk/tika-parsers/src/main/resources/META-INF/services/org.apache.tika.parser.Parser
 1636144 
  
./trunk/tika-parsers/src/test/java/org/apache/tika/parser/grib/GribParserTest.java
 PRE-CREATION 
  
./trunk/tika-parsers/src/test/resources/test-documents/gdas1.forecmwf.2014062612.grib2
 UNKNOWN 

Diff: https://reviews.apache.org/r/27562/diff/


Testing
---

update for #27414


Thanks,

Chris Mattmann



Re: Review Request 27562: GRIB Parser for TIKA

2014-11-03 Thread Chris Mattmann

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/27562/
---

(Updated Nov. 4, 2014, 5:17 a.m.)


Review request for tika, Lewis McGibbney, Chris Mattmann, Tyler Palsulich, and 
Vineet Ghatge Hemantkumar.


Bugs: tika-1423
https://issues.apache.org/jira/browse/tika-1423


Repository: tika


Description
---

GRIB Parser Patch


Diffs
-

  ./trunk/tika-parsers/pom.xml 1636144 
  
./trunk/tika-parsers/src/main/java/org/apache/tika/parser/grib/GribParser.java 
PRE-CREATION 
  
./trunk/tika-parsers/src/main/resources/META-INF/services/org.apache.tika.parser.Parser
 1636144 
  
./trunk/tika-parsers/src/test/java/org/apache/tika/parser/grib/GribParserTest.java
 PRE-CREATION 
  
./trunk/tika-parsers/src/test/resources/test-documents/gdas1.forecmwf.2014062612.grib2
 UNKNOWN 

Diff: https://reviews.apache.org/r/27562/diff/


Testing (updated)
---

update for #27414

FYI for the life of me, I can't get the unit test to pass (was this working for 
you @Vinegh?)
Patch is fully up to date with trunk and compiles at least.


Thanks,

Chris Mattmann



Re: Review Request 27562: GRIB Parser for TIKA

2014-11-03 Thread Lewis McGibbney

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/27562/#review59728
---


Patch is looking good. I am testing

- Lewis McGibbney






Re: Review Request 27562: GRIB Parser for TIKA

2014-11-03 Thread Vineet Ghatge Hemantkumar


 On Nov. 4, 2014, 5:23 a.m., Lewis McGibbney wrote:
  Patch is looking good. I am testing

Yes @mattmann, the unit test passes for me


- Vineet Ghatge


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/27562/#review59728
---






Re: Review Request 27414: GRIB Parser for TIKA

2014-11-03 Thread Vineet Ghatge Hemantkumar

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/27414/
---

(Updated Nov. 4, 2014, 5:36 a.m.)


Review request for tika, Lewis McGibbney, Chris Mattmann, and Tyler Palsulich.


Bugs: tika-1423
https://issues.apache.org/jira/browse/tika-1423


Repository: tika


Description
---

GRIB Parser Patch


Diffs
-

  trunk/tika-parsers/pom.xml 1635045 

Diff: https://reviews.apache.org/r/27414/diff/


Testing
---

To test the parser in place:
1. Download the patch and the three files: GribParserTest.java, GribParser.java, and 
gdas.forecmwf
2. Put GribParser.java in the following folder: 
tika-parsers/src/main/java/org/apache/tika/parser/grib. You will need to create a 
folder named grib here.
3. Put GribParserTest.java in the following folder: 
tika-parsers/src/test/java/org/apache/tika/parser/grib
4. Put the resource file in the following location: 
tika-parsers/src/test/resources/test-documents/
5. Apply the patch and build.


File Attachments


ParserTestFile
  
https://reviews.apache.org/media/uploaded/files/2014/10/31/840fcf4b-d67f-4ed5-8e7c-52d49c74c9d0__GribParserTest.java
GribParser
  
https://reviews.apache.org/media/uploaded/files/2014/10/31/2f897768-d61e-4985-a254-4a45fc821524__GribParser.java
Resource file
  
https://reviews.apache.org/media/uploaded/files/2014/10/31/a47d7101-98d7-4833-94f3-cdf31351e19e__gdas1.forecmwf.2014062612.grib2


Thanks,

Vineet Ghatge Hemantkumar



Re: Review Request 27562: GRIB Parser for TIKA

2014-11-03 Thread Lewis McGibbney


 On Nov. 4, 2014, 5:23 a.m., Lewis McGibbney wrote:
  Patch is looking good. I am testing
 
 Vineet Ghatge Hemantkumar wrote:
 Yes @mattmann, the unit test passes for me

What is the GRIB file, please? Where can I find it?


- Lewis


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/27562/#review59728
---






Re: Review Request 27562: GRIB Parser for TIKA

2014-11-03 Thread Vineet Ghatge Hemantkumar


 On Nov. 4, 2014, 5:23 a.m., Lewis McGibbney wrote:
  Patch is looking good. I am testing
 
 Vineet Ghatge Hemantkumar wrote:
 Yes @mattmann, the unit test passes for me
 
 Lewis McGibbney wrote:
 what is the grib file please? Where can I find it?

The GRIB file is gdas1.forecmwf.2014062612.grib2, and it is in the following 
location:
./trunk/tika-parsers/src/test/resources/test-documents/


- Vineet Ghatge


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/27562/#review59728
---






Re: Review Request 27562: GRIB Parser for TIKA

2014-11-03 Thread Lewis McGibbney

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/27562/#review59734
---


OK, the test is also failing for me with Tika trunk, as follows:

---
Test set: org.apache.tika.parser.grib.GribParserTest
---
Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 0.076 sec <<< FAILURE!
testParseGlobalMetadata(org.apache.tika.parser.grib.GribParserTest)  Time elapsed: 0.075 sec  <<< ERROR!
org.apache.tika.exception.TikaException: NetCDF parse error
    at org.apache.tika.parser.grib.GribParser.parse(GribParser.java:134)
    at org.apache.tika.parser.grib.GribParserTest.testParseGlobalMetadata(GribParserTest.java:52)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
    at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
    at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
    at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
    at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:271)
    at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:70)
    at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50)
    at org.junit.runners.ParentRunner$3.run(ParentRunner.java:238)
    at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:63)
    at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:236)
    at org.junit.runners.ParentRunner.access$000(ParentRunner.java:53)
    at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:229)
    at org.junit.runners.ParentRunner.run(ParentRunner.java:309)
    at org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:236)
    at org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:134)
    at org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:113)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.maven.surefire.util.ReflectionUtils.invokeMethodWithArray(ReflectionUtils.java:189)
    at org.apache.maven.surefire.booter.ProviderFactory$ProviderProxy.invoke(ProviderFactory.java:165)
    at org.apache.maven.surefire.booter.ProviderFactory.invokeProvider(ProviderFactory.java:85)
    at org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:103)
    at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:74)
Caused by: java.io.IOException: Cant read gdas1.forecmwf.2014062612.grib2: not a valid CDM file.
    at ucar.nc2.NetcdfFile.open(NetcdfFile.java:803)
    at ucar.nc2.NetcdfFile.openInMemory(NetcdfFile.java:719)
    at org.apache.tika.parser.grib.GribParser.parse(GribParser.java:86)
    ... 30 more

We need to do more work here; also, there are a number of issues which need to 
be addressed and carried over from the previous issue, I feel.
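
One cheap check that may help narrow down a "not a valid CDM file" error is to 
confirm that the bytes being handed over really are a GRIB message. Both GRIB 
edition 1 and edition 2 begin with the four ASCII characters "GRIB" and carry 
the edition number in octet 8. The sketch below is a hypothetical diagnostic 
(the class GribHeader is not part of Tika or NetCDF-Java) and assumes the 
message starts at byte 0, which is not true for files with a leading bulletin 
header.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical diagnostic, independent of Tika and NetCDF-Java.
class GribHeader {

    // Returns the GRIB edition number found in the first 8 bytes,
    // or -1 if the "GRIB" indicator is missing or the input is too short.
    static int edition(byte[] head) {
        if (head == null || head.length < 8) {
            return -1;
        }
        String magic = new String(head, 0, 4, StandardCharsets.US_ASCII);
        if (!"GRIB".equals(magic)) {
            return -1;
        }
        // Octet 8 (index 7) holds the edition number in both editions.
        return head[7] & 0xFF;
    }
}
```

Feeding it the first eight bytes of the test resource as it appears on the 
test classpath would show whether the file was altered on its way into the 
build; a result of 2 is what a .grib2 file should give.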

- Lewis McGibbney



Re: Review Request 27414: GRIB Parser for TIKA

2014-11-03 Thread Vineet Ghatge Hemantkumar

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/27414/
---

(Updated Nov. 4, 2014, 5:48 a.m.)


Review request for tika, Lewis McGibbney, Chris Mattmann, and Tyler Palsulich.


Bugs: tika-1423
https://issues.apache.org/jira/browse/tika-1423


Repository: tika


Description
---

GRIB Parser Patch


Diffs
-

  trunk/tika-parsers/pom.xml 1635045 

Diff: https://reviews.apache.org/r/27414/diff/


Testing
---

To test the parser in place:
1. Download the patch and the three files: GribParserTest.java, GribParser.java, and 
gdas.forecmwf
2. Put GribParser.java in the following folder: 
tika-parsers/src/main/java/org/apache/tika/parser/grib. You will need to create a 
folder named grib here.
3. Put GribParserTest.java in the following folder: 
tika-parsers/src/test/java/org/apache/tika/parser/grib
4. Put the resource file in the following location: 
tika-parsers/src/test/resources/test-documents/
5. Apply the patch and build.


File Attachments


ParserTestFile
  
https://reviews.apache.org/media/uploaded/files/2014/10/31/840fcf4b-d67f-4ed5-8e7c-52d49c74c9d0__GribParserTest.java
GribParser
  
https://reviews.apache.org/media/uploaded/files/2014/10/31/2f897768-d61e-4985-a254-4a45fc821524__GribParser.java
Resource file
  
https://reviews.apache.org/media/uploaded/files/2014/10/31/a47d7101-98d7-4833-94f3-cdf31351e19e__gdas1.forecmwf.2014062612.grib2


Thanks,

Vineet Ghatge Hemantkumar



Re: Review Request 27562: GRIB Parser for TIKA

2014-11-03 Thread Vineet Ghatge Hemantkumar


 On Nov. 4, 2014, 5:45 a.m., Lewis McGibbney wrote:
  OK, test is also failing for me with Tika trunk as follows
  
  [...]
   
   We need to do more work here, also there are a number of issue which need 
  to be addressed and carried over from th previous issue I feel.

I will see why it's failing.


- Vineet Ghatge


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/27562/#review59734
---



Re: Review Request 27562: GRIB Parser for TIKA

2014-11-03 Thread Vineet Ghatge Hemantkumar


 On Nov. 4, 2014, 5:45 a.m., Lewis McGibbney wrote:
  OK, test is also failing for me with Tika trunk as follows
  
  [...]
   
   We need to do more work here, also there are a number of issue which need 
  to be addressed and carried over from th previous issue I feel.
 
 Vineet Ghatge Hemantkumar wrote:
 I will see why its failing.

In the patch, I see metadata.set(Metadata.RESOURCE_NAME_KEY, 
"gdas1.forecmwf.2014062612.grib2"); this should be the following:
metadata.set(Metadata.RESOURCE_NAME_KEY, 
"/test-documents/gdas1.forecmwf.2014062612.grib2");


- Vineet Ghatge


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/27562/#review59734
---



Re: Review Request 27562: GRIB Parser for TIKA

2014-11-03 Thread Vineet Ghatge Hemantkumar


 On Nov. 4, 2014, 5:45 a.m., Lewis McGibbney wrote:
  OK, test is also failing for me with Tika trunk as follows
  
  [...]
   
   We need to do more work here, also there are a number of issue which need 
  to be addressed and carried over from th previous issue I feel.
 
 Vineet Ghatge Hemantkumar wrote:
 I will see why its failing.
 
 Vineet Ghatge Hemantkumar wrote:
 In the patch, I see that metadata.set(Metadata.RESOURCE_NAME_KEY, 
 gdas1.forecmwf.2014062612.grib2); this shoudl be the following
 metadata.set(Metadata.RESOURCE_NAME_KEY, 
 /test-documents/gdas1.forecmwf.2014062612.grib2);

Further, I am using the netcdfAll 4.5 jar, which is not what is in the repo. 
When I run from the command line using the following commands, it works:

javac -cp .:netcdfAll-4.5.jar:tika-app-1.7-SNAPSHOT.jar:junit-4.11.jar Grib.java
java -cp .:netcdfAll-4.5.jar:tika-app-1.7-SNAPSHOT.jar:junit-4.11.jar Grib

I am trying to rebuild in Tika.


- Vineet Ghatge


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/27562/#review59734
---



Re: Parse Html with Tika

2014-11-03 Thread Julien Nioche
Hi Linh

You can specify a mapper to control what the html parser will filter or not.

see
https://github.com/DigitalPebble/storm-crawler/commit/27364cb7ddb3998f973ab6e09f384e28cc5b7639
for an example
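
A minimal sketch of the mapper idea, not taken from the linked commit: it
assumes Tika's IdentityHtmlMapper is acceptable for your use case (it passes
every element and attribute through instead of applying the default "safe"
filtering). The class name KeepAllTags and the sample markup are made up for
illustration, and the Tika parser jars are assumed to be on the classpath.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.html.HtmlMapper;
import org.apache.tika.parser.html.HtmlParser;
import org.apache.tika.parser.html.IdentityHtmlMapper;
import org.apache.tika.sax.ToXMLContentHandler;

// Hypothetical helper class, for illustration only.
final class KeepAllTags {

    // Parses an HTML string and returns the XHTML that Tika emits.
    static String parse(String html) throws Exception {
        ParseContext context = new ParseContext();
        // Swap the default mapper for one that keeps every element.
        context.set(HtmlMapper.class, IdentityHtmlMapper.INSTANCE);

        // Serializes the SAX events back to markup so the kept tags are visible.
        ToXMLContentHandler handler = new ToXMLContentHandler();
        byte[] bytes = html.getBytes(StandardCharsets.UTF_8);
        try (InputStream in = new ByteArrayInputStream(bytes)) {
            new HtmlParser().parse(in, handler, new Metadata(), context);
        }
        return handler.toString();
    }

    public static void main(String[] args) throws Exception {
        String html = "<html><body><div><span>hello</span></div></body></html>";
        System.out.println(parse(html));
    }
}
```

If you only need a few extra tags, you can instead write your own HtmlMapper
and pass that in the ParseContext in the same way.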

Julien

On Monday, 3 November 2014, Linh Tang ttplinh2...@gmail.com wrote:

 Dear All,

 I am Phuong Linh,
 I am using Tika to extract content from HTML files for search, but HtmlParser
 does not parse all HTML tags. (I fetch the HTML pages with Nutch, then use Tika
 to extract the important information, and then use Solr to search.)
 Can you tell me what I can do to parse all HTML tags?

 Thanks in advance!

 Regards,
 Tang Thi Phuong Linh.
 --
 P.Linh



-- 

Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble


Re: Review Request 27562: GRIB Parser for TIKA

2014-11-03 Thread Vineet Ghatge Hemantkumar


 On Nov. 4, 2014, 5:45 a.m., Lewis McGibbney wrote:
  OK, test is also failing for me with Tika trunk as follows
  
  ---
  Test set: org.apache.tika.parser.grib.GribParserTest
  ---
  Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 0.076 sec  FAILURE!
  testParseGlobalMetadata(org.apache.tika.parser.grib.GribParserTest)  Time elapsed: 0.075 sec  ERROR!
  org.apache.tika.exception.TikaException: NetCDF parse error
      at org.apache.tika.parser.grib.GribParser.parse(GribParser.java:134)
      at org.apache.tika.parser.grib.GribParserTest.testParseGlobalMetadata(GribParserTest.java:52)
      at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
      at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
      at java.lang.reflect.Method.invoke(Method.java:606)
      at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
      at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
      at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
      at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
      at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:271)
      at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:70)
      at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50)
      at org.junit.runners.ParentRunner$3.run(ParentRunner.java:238)
      at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:63)
      at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:236)
      at org.junit.runners.ParentRunner.access$000(ParentRunner.java:53)
      at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:229)
      at org.junit.runners.ParentRunner.run(ParentRunner.java:309)
      at org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:236)
      at org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:134)
      at org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:113)
      at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
      at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
      at java.lang.reflect.Method.invoke(Method.java:606)
      at org.apache.maven.surefire.util.ReflectionUtils.invokeMethodWithArray(ReflectionUtils.java:189)
      at org.apache.maven.surefire.booter.ProviderFactory$ProviderProxy.invoke(ProviderFactory.java:165)
      at org.apache.maven.surefire.booter.ProviderFactory.invokeProvider(ProviderFactory.java:85)
      at org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:103)
      at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:74)
  Caused by: java.io.IOException: Cant read gdas1.forecmwf.2014062612.grib2: not a valid CDM file.
      at ucar.nc2.NetcdfFile.open(NetcdfFile.java:803)
      at ucar.nc2.NetcdfFile.openInMemory(NetcdfFile.java:719)
      at org.apache.tika.parser.grib.GribParser.parse(GribParser.java:86)
      ... 30 more
   
   We need to do more work here; also, I feel there are a number of issues
  from the previous review that still need to be carried over and addressed.
 
 Vineet Ghatge Hemantkumar wrote:
 I will see why it's failing.
 
 Vineet Ghatge Hemantkumar wrote:
 In the patch, I see
 metadata.set(Metadata.RESOURCE_NAME_KEY, "gdas1.forecmwf.2014062612.grib2");
 this should be the following:
 metadata.set(Metadata.RESOURCE_NAME_KEY, "/test-documents/gdas1.forecmwf.2014062612.grib2");
 
 Vineet Ghatge Hemantkumar wrote:
 Further, I am using the netcdfAll 4.5 jar, which is not the version in the
 repo. When I run from the command line using the following commands, it works:

   javac -cp .:netcdfAll-4.5.jar:tika-app-1.7-SNAPSHOT.jar:junit-4.11.jar Grib.java
   java -cp .:netcdfAll-4.5.jar:tika-app-1.7-SNAPSHOT.jar:junit-4.11.jar Grib

 I am trying to rebuild within Tika.

I am trying to apply the patch, but it keeps failing with an error saying that
it is not a valid patch.


- Vineet Ghatge


---
This is an automatically generated e-mail. To reply,