[jira] [Commented] (NUTCH-1454) parsing chm failed

2014-01-02 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13860803#comment-13860803
 ] 

Tejas Patil commented on NUTCH-1454:


TIKA-1122 is fixed, and I have verified that 'parsechecker' works with the
fix. Upgrading to Tika 1.5 (not yet released) should resolve this for Nutch.

 parsing chm failed
 ------------------

              Key: NUTCH-1454
              URL: https://issues.apache.org/jira/browse/NUTCH-1454
          Project: Nutch
       Issue Type: Bug
       Components: parser
 Affects Versions: 1.5.1
         Reporter: Sebastian Nagel
         Priority: Minor
          Fix For: 1.9


 (reported by Jan Riewe, see 
 http://lucene.472066.n3.nabble.com/CHM-Files-and-Tika-td3999735.html)
 Nutch fails to parse chm files with:
 {quote}
 ERROR tika.TikaParser - Can't retrieve Tika parser for mime-type application/vnd.ms-htmlhelp
 {quote}
 Tested with chm test files from Tika:
 {code}
 % bin/nutch parsechecker file:/.../tika/trunk/tika-parsers/src/test/resources/test-documents/testChm.chm
 {code}
 Tika parses this document (but does not extract any content).



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (NUTCH-1454) parsing chm failed

2013-03-05 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13594411#comment-13594411
 ] 

Tejas Patil commented on NUTCH-1454:


A few observations about this issue:
1. Nutch detects the correct mime type for the document. The error occurs
later, while parsing the content.
2. Even when running tika-app standalone (i.e. not via Nutch), I could not get
a single chm file parsed (I tried 10-15 different chm files of various sizes).
I added this observation to a
[relevant jira in the Tika project|https://issues.apache.org/jira/browse/TIKA-245?focusedCommentId=13594074&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13594074],
but there has been no reply so far.
3. People in the Tika community have observed that the chm4j library performs
better than Tika's own chm parser implementation. Anyone in dire need of
crawling and parsing chm documents can use this library. Ideally we would use
it in Nutch, but since only a very small percentage of users need to parse
chm, we should refrain from doing so.

 parsing chm failed
 ------------------

              Key: NUTCH-1454
              URL: https://issues.apache.org/jira/browse/NUTCH-1454
          Project: Nutch
       Issue Type: Bug
       Components: parser
 Affects Versions: 1.5.1
         Reporter: Sebastian Nagel
         Priority: Minor
          Fix For: 1.7


