[ 
https://issues.apache.org/jira/browse/NUTCH-1281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13211316#comment-13211316
 ] 

behnam nikbakht commented on NUTCH-1281:
----------------------------------------

The problem is that actual MIME types cannot be properly filtered until 
fetching or parsing starts. There are many file types, so we cannot filter 
all of them in advance, and there may also be bugs in the Tika parser for 
some file types. So we can filter them in TikaParser against a list of valid 
file types.
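
A minimal sketch of such a check, assuming the valid types come from a 
comma-separated configuration value rather than a hard-coded array. The 
class name MimeTypeWhitelist and the idea of a dedicated property are 
hypothetical and not part of Nutch or Tika; the sketch uses only the JDK.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical helper for illustration; not part of Nutch or Tika. */
final class MimeTypeWhitelist {

    private final Set<String> allowed;

    /** Builds the whitelist from a comma-separated list, e.g. a config value. */
    MimeTypeWhitelist(String commaSeparatedTypes) {
        Set<String> types = new HashSet<>();
        if (commaSeparatedTypes != null) {
            for (String type : commaSeparatedTypes.split(",")) {
                String normalized = normalize(type);
                if (!normalized.isEmpty()) {
                    types.add(normalized);
                }
            }
        }
        this.allowed = Collections.unmodifiableSet(types);
    }

    /** Returns true only if the MIME type is present and whitelisted. */
    boolean isAllowed(String mimeType) {
        return mimeType != null && allowed.contains(normalize(mimeType));
    }

    /** Lower-cases, trims, and drops parameters such as "; charset=UTF-8". */
    private static String normalize(String mimeType) {
        int semicolon = mimeType.indexOf(';');
        String base = semicolon >= 0 ? mimeType.substring(0, semicolon) : mimeType;
        return base.trim().toLowerCase(Locale.ROOT);
    }
}
```

In getParse, the result of isAllowed(content.getContentType()) could then 
decide whether to return the NOTPARSED status proposed in the issue. A set 
lookup also avoids scanning the whole array for every document.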
                
> tika parser not work properly with unwanted file types that passed from 
> filters in nutch
> ----------------------------------------------------------------------------------------
>
>                 Key: NUTCH-1281
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1281
>             Project: Nutch
>          Issue Type: Improvement
>          Components: parser
>            Reporter: behnam nikbakht
>
> When this mapping is set in parse-plugins.xml:
> <mimeType name="*">
>         <plugin id="parse-tika" />
> </mimeType>
> all unwanted files that pass every filter are referred to Tika.
> For some file types, such as .flv, the Tika parser hangs and causes the 
> parse job to fail.
> So if files of these types pass regex-urlfilter and the other filters, the 
> parse job fails.
> To address this, I suggest adding a property for valid file types and 
> using code like this in TikaParser.java:
> public ParseResult getParse(Content content) {
>               String mimeType = content.getContentType();
> +             // MIME types that parse-tika is allowed to handle
> +             String[] validTypes = new String[] {
> +                     "application/pdf", "application/x-tika-msoffice",
> +                     "application/x-tika-ooxml",
> +                     "application/vnd.oasis.opendocument.text", "text/plain",
> +                     "application/rtf", "application/rss+xml",
> +                     "application/x-bzip2", "application/x-gzip",
> +                     "application/x-javascript", "application/javascript",
> +                     "text/javascript", "application/x-shockwave-flash",
> +                     "application/zip", "text/xml", "application/xml" };
> +             boolean valid = false;
> +             for (int k = 0; k < validTypes.length; k++) {
> +                     // equalsIgnoreCase returns false for a null mimeType
> +                     if (validTypes[k].equalsIgnoreCase(mimeType))
> +                             valid = true;
> +             }
> +             if (!valid)
> +                     return new ParseStatus(ParseStatus.NOTPARSED,
> +                                     "Can't parse for unwanted filetype " + mimeType)
> +                                     .getEmptyParseResult(content.getUrl(), getConf());
>
>               URL base;

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
