[ 
https://issues.apache.org/jira/browse/NUTCH-1043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13064065#comment-13064065
 ] 

Julien Nioche commented on NUTCH-1043:
--------------------------------------

My point here is not to add every suffix that the default parser 
configuration cannot handle, but only to add the most common ones to the 
URL filters, in line with what the existing URL filters already contain, i.e.

-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
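To illustrate how a rule like this behaves, here is a minimal sketch in Python. It assumes the rule is applied as an unanchored regex search against the full URL and that the leading "-" only marks the rule as an exclusion; the helper name is_excluded, the example.com URLs and the EXTENDED pattern with js/JS added are hypothetical, not taken from the Nutch configuration.

```python
import re

# Suffix pattern copied from the rule quoted above, without the leading "-"
# (assumed here to mean "exclude URLs matching this pattern").
EXISTING = (
    r"\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz"
    r"|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$"
)

# Hypothetical extension: the same list with js/JS appended.
EXTENDED = EXISTING.replace("|bmp|BMP)", "|bmp|BMP|js|JS)")


def is_excluded(url, pattern):
    """Return True if the URL ends in one of the suffixes in the pattern."""
    return re.search(pattern, url) is not None


if __name__ == "__main__":
    for url in ("http://example.com/logo.gif", "http://example.com/app.js"):
        print(url, is_excluded(url, EXISTING), is_excluded(url, EXTENDED))
```

Because the pattern is anchored with "$", a URL such as http://example.com/app.js?v=2 would still pass either rule, since the suffix is no longer at the end of the string.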

@Markus -> the suffix urlfilter already covers some of the things that you've 
listed. Are you not using it? It is not activated by default, but people can 
enable it if they need the extended list of suffixes.

I am not against adding more suffixes to the URL filters but would rather keep 
it as simple and as close to the existing rules as possible. What do you think?


> Add pattern for filtering .js in default url filters
> ----------------------------------------------------
>
>                 Key: NUTCH-1043
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1043
>             Project: Nutch
>          Issue Type: Task
>    Affects Versions: 1.4, 2.0
>            Reporter: Julien Nioche
>            Priority: Minor
>             Fix For: 1.4, 2.0
>
>
> The JavaScript parser is not used by default as it is extremely noisy. 
> However, the default URL filters do not filter out URLs ending in .js and 
> the default parser (Tika) can't parse them. In a nutshell, we are fetching 
> URLs that we know can't be parsed.
> I suggest that we add a regex to the default URL filters. If people are 
> interested in fetching and parsing .js files they can activate the plugin in 
> their conf and remove the regex in the URL filters.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
