Tika parser does not work properly with unwanted file types that pass the filters in Nutch
------------------------------------------------------------------------------------------

                 Key: NUTCH-1281
                 URL: https://issues.apache.org/jira/browse/NUTCH-1281
             Project: Nutch
          Issue Type: Improvement
          Components: parser
            Reporter: behnam nikbakht

When parse-plugins.xml contains this property:

    <mimeType name="*">
      <plugin id="parse-tika" />
    </mimeType>

all unwanted files that pass through all the filters are referred to Tika. For some file types, such as .flv, the Tika parser has problems and hangs, which causes the parse job to fail. So whenever files of these types get past regex-urlfilter and the other filters, the parse job fails.

To solve this, I suggest adding some properties for the valid file types and using code like this in TikaParser.java:

       public ParseResult getParse(Content content) {
         String mimeType = content.getContentType();

    +    String[] validTypes = new String[] {
    +        "application/pdf", "application/x-tika-msoffice",
    +        "application/x-tika-ooxml",
    +        "application/vnd.oasis.opendocument.text", "text/plain",
    +        "application/rtf", "application/rss+xml",
    +        "application/x-bzip2", "application/x-gzip",
    +        "application/x-javascript", "application/javascript",
    +        "text/javascript", "application/x-shockwave-flash",
    +        "application/zip", "text/xml", "application/xml" };
    +    boolean valid = false;
    +    for (int k = 0; k < validTypes.length; k++) {
    +      // equalsIgnoreCase also copes with a null mimeType
    +      if (validTypes[k].equalsIgnoreCase(mimeType)) {
    +        valid = true;
    +        break;
    +      }
    +    }
    +    if (!valid)
    +      return new ParseStatus(ParseStatus.NOTPARSED,
    +          "Can't parse for unwanted filetype " + mimeType)
    +          .getEmptyParseResult(content.getUrl(), getConf());

         URL base;
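
The suggestion to move the valid file types into properties could look roughly like the sketch below. This is only an illustration under assumptions, not the actual Nutch code: the class name MimeTypeWhitelist and the idea of a single comma-separated property value are hypothetical, and in Nutch the value would presumably be read from the configuration returned by getConf() under some property name that does not exist yet. The sketch also strips parameters such as "; charset=UTF-8" before comparing, which the hard-coded loop above does not do.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical helper: decides whether a MIME type should be handed to the parser. */
public class MimeTypeWhitelist {

    private final Set<String> allowed;

    /** Builds the whitelist from a comma-separated property value (assumed format). */
    public MimeTypeWhitelist(String commaSeparated) {
        Set<String> types = new HashSet<>();
        if (commaSeparated != null) {
            for (String t : commaSeparated.split(",")) {
                String normalized = normalize(t);
                if (!normalized.isEmpty()) {
                    types.add(normalized);
                }
            }
        }
        this.allowed = Collections.unmodifiableSet(types);
    }

    /** Trims, lower-cases, and drops parameters such as "; charset=UTF-8". */
    static String normalize(String mimeType) {
        if (mimeType == null) {
            return "";
        }
        int semi = mimeType.indexOf(';');
        String base = semi >= 0 ? mimeType.substring(0, semi) : mimeType;
        return base.trim().toLowerCase(Locale.ROOT);
    }

    /** Returns true only if the (normalized) type is on the whitelist. */
    public boolean isAllowed(String mimeType) {
        return allowed.contains(normalize(mimeType));
    }

    public static void main(String[] args) {
        MimeTypeWhitelist whitelist =
            new MimeTypeWhitelist("application/pdf, text/plain,application/zip");
        System.out.println(whitelist.isAllowed("Application/PDF"));           // true
        System.out.println(whitelist.isAllowed("text/plain; charset=UTF-8")); // true
        System.out.println(whitelist.isAllowed("video/x-flv"));               // false
        System.out.println(whitelist.isAllowed(null));                        // false
    }
}
```

With something like this, getParse() could return the NOTPARSED status whenever isAllowed(mimeType) is false, and the list could be changed without recompiling the plugin.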