Tika parser does not work properly with unwanted file types that pass the filters 
in Nutch
----------------------------------------------------------------------------------------

                 Key: NUTCH-1281
                 URL: https://issues.apache.org/jira/browse/NUTCH-1281
             Project: Nutch
          Issue Type: Improvement
          Components: parser
            Reporter: behnam nikbakht


When parse-plugins.xml contains this mapping:
<mimeType name="*">
        <plugin id="parse-tika" />
</mimeType>
every file that passes all the filters, including unwanted file types, is
referred to the Tika parser. For some file types, such as .flv, the Tika parser
hangs, and this causes the parse job to fail. So whenever such files get through
regex-urlfilter and the other filters, the parse job fails.
To fix this, I suggest adding some properties for the valid file types and
using code like this in TikaParser.java:


public ParseResult getParse(Content content) {
                String mimeType = content.getContentType();

+               // Only these MIME types are handed to Tika; everything else is skipped.
+               String[] validTypes = new String[] {
+                       "application/pdf", "application/x-tika-msoffice",
+                       "application/x-tika-ooxml",
+                       "application/vnd.oasis.opendocument.text", "text/plain",
+                       "application/rtf", "application/rss+xml",
+                       "application/x-bzip2", "application/x-gzip",
+                       "application/x-javascript", "application/javascript",
+                       "text/javascript", "application/x-shockwave-flash",
+                       "application/zip", "text/xml", "application/xml" };
+               boolean valid = false;
+               for (int k = 0; k < validTypes.length; k++) {
+                       // equalsIgnoreCase also copes with a null mimeType
+                       if (validTypes[k].equalsIgnoreCase(mimeType)) {
+                               valid = true;
+                               break;
+                       }
+               }
+               if (!valid)
+                       return new ParseStatus(ParseStatus.NOTPARSED,
+                                       "Can't parse for unwanted filetype " + mimeType)
+                                       .getEmptyParseResult(content.getUrl(), getConf());

                URL base;
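
The suggestion above also mentions moving the valid file types into properties. A minimal, standalone sketch of how such a configurable whitelist might look is given here. The class name MimeTypeWhitelist and the idea of a single comma-separated property value are assumptions made for illustration; neither is part of Nutch. Inside TikaParser the value would presumably be read from the plugin's Configuration (for example via getConf()) under a property name that would still have to be chosen.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical helper (not part of Nutch): a configurable MIME type whitelist. */
final class MimeTypeWhitelist {
    private final Set<String> allowed = new HashSet<>();

    /** Builds the whitelist from a comma-separated property value. */
    static MimeTypeWhitelist fromProperty(String value) {
        MimeTypeWhitelist w = new MimeTypeWhitelist();
        if (value != null) {
            for (String t : value.split(",")) {
                String n = normalize(t);
                if (!n.isEmpty()) {
                    w.allowed.add(n);
                }
            }
        }
        return w;
    }

    /** Lower-cases the type and drops parameters such as "; charset=UTF-8". */
    private static String normalize(String mimeType) {
        if (mimeType == null) {
            return "";
        }
        int semi = mimeType.indexOf(';');
        String base = semi >= 0 ? mimeType.substring(0, semi) : mimeType;
        return base.trim().toLowerCase(Locale.ROOT);
    }

    /** Returns true only if the (normalized) type is on the whitelist. */
    boolean isAllowed(String mimeType) {
        return allowed.contains(normalize(mimeType));
    }

    public static void main(String[] args) {
        // Example value; a real deployment would read it from configuration.
        MimeTypeWhitelist w = fromProperty("application/pdf, text/plain,application/zip");
        System.out.println(w.isAllowed("Text/Plain; charset=UTF-8"));
        System.out.println(w.isAllowed("video/x-flv"));
    }
}
```

With a helper like this, the check in getParse would reduce to a single isAllowed(mimeType) call, and the list could be changed without recompiling the plugin. A set lookup also avoids the linear scan over the array.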
