[ https://issues.apache.org/jira/browse/NUTCH-1043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1043:
---------------------------------

    Patch Info: [Patch Available]

> Add pattern for filtering .js in default url filters
> ----------------------------------------------------
>
>                 Key: NUTCH-1043
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1043
>             Project: Nutch
>          Issue Type: Task
>    Affects Versions: 1.4, 2.0
>            Reporter: Julien Nioche
>            Priority: Minor
>             Fix For: 1.4, 2.0
>
>         Attachments: NUTCH-1043.patch
>
>
> The Javascript parser is not used by default as it is extremely noisy,
> however the default URL filters do not filter out URLs ending in .js and the
> default parser (Tika) can't parse them. In a nutshell we are fetching URLS
> that we know can't be parsed.
> I suggest that we add a regex to the default URL filters. If people are
> interested in fetching and parsing .js files they can activate the plugin in
> their conf and remove the regex in the URL filters.
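
For illustration only, here is a minimal Python sketch of the kind of rule the issue proposes. It is not the content of NUTCH-1043.patch: the pattern `\.(js|JS)$`, the `RULES` list, and the `accept` helper are all assumptions made up for this example. The "first matching rule decides, `-` rejects, `+` accepts" behaviour is modelled on how a regex URL filter file is commonly described, and may differ in detail from the real plugin.

```python
import re

# Hypothetical rule list in the style of a regex URL filter file.
# "-" means reject, "+" means accept; the first rule that matches decides.
RULES = [
    ("-", re.compile(r"\.(js|JS)$")),  # assumed pattern: URLs ending in .js
    ("+", re.compile(r".")),           # accept everything else
]


def accept(url: str) -> bool:
    """Return True if the URL would be kept by the sketch rules above."""
    for sign, pattern in RULES:
        if pattern.search(url):
            return sign == "+"
    # No rule matched at all: treat the URL as rejected.
    return False


if __name__ == "__main__":
    for url in (
        "http://example.com/index.html",
        "http://example.com/static/app.js",
        "http://example.com/data.json",
    ):
        print(url, "->", "keep" if accept(url) else "skip")
```

Because the pattern is anchored at the end of the URL, a script served with a query string (for example `app.js?v=2`) would still pass this sketch. Whether that case should also be filtered is a design decision for the actual patch.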