[ https://issues.apache.org/jira/browse/NUTCH-612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Andrzej Bialecki closed NUTCH-612.
-----------------------------------
Resolution: Fixed
Assignee: Andrzej Bialecki
> URL filtering is always disabled in Generator when invoked by Crawl
> -------------------------------------------------------------------
>
> Key: NUTCH-612
> URL: https://issues.apache.org/jira/browse/NUTCH-612
> Project: Nutch
> Issue Type: Bug
> Components: generator
> Affects Versions: 1.0.0
> Reporter: Susam Pal
> Assignee: Andrzej Bialecki
> Fix For: 1.0.0
>
> Attachments: NUTCH-612v0.1.patch
>
>
> When a crawl is run with the 'bin/nutch crawl' command, Generator performs
> no URL filtering even if 'crawl.generate.filter' is set to true in the
> configuration file.
> The problem is that in the Generator's generate method, the following code
> unconditionally sets the filter value of the job to whatever is passed to it:
> {code}job.setBoolean(CRAWL_GENERATE_FILTER, filter);{code}
> Crawl.java always passes false for this argument, so the value configured
> in 'crawl.generate.filter' is overwritten.
> This has been fixed by exposing an overloaded generate method that takes
> only the 5 arguments that Crawl needs to set. This overloaded method reads
> 'crawl.generate.filter' from the configuration and sets the filter value
> accordingly.
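
The behaviour described above can be illustrated with a minimal, self-contained Java sketch. It is not the Nutch code and is not taken from the attached patch. A plain Map stands in for the job configuration, the class GenerateFilterSketch and its helper getBoolean are hypothetical, the other generate arguments are omitted, and the default of true for an unset property is an assumption.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for Generator; not the real Nutch class.
class GenerateFilterSketch {
    static final String CRAWL_GENERATE_FILTER = "crawl.generate.filter";

    // Hypothetical helper that mimics reading a boolean property from a configuration.
    static boolean getBoolean(Map<String, String> conf, String key, boolean defaultValue) {
        String value = conf.get(key);
        return value == null ? defaultValue : Boolean.parseBoolean(value);
    }

    // Behaviour from the report: the explicit argument unconditionally
    // overwrites whatever the configuration said. Returns the value the job ends up with.
    static boolean generate(Map<String, String> conf, boolean filter) {
        Map<String, String> job = new HashMap<>(conf);
        job.put(CRAWL_GENERATE_FILTER, Boolean.toString(filter));
        return getBoolean(job, CRAWL_GENERATE_FILTER, true);
    }

    // Shape of the fix: an overload without the filter argument reads the
    // configured value (assumed default: true) and delegates to the method above.
    static boolean generate(Map<String, String> conf) {
        boolean filter = getBoolean(conf, CRAWL_GENERATE_FILTER, true);
        return generate(conf, filter);
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put(CRAWL_GENERATE_FILTER, "true");

        // A caller that hard-codes false, as Crawl did: the configured true is lost.
        System.out.println(generate(conf, false)); // prints "false"

        // A caller using the overload: the configured value is respected.
        System.out.println(generate(conf));        // prints "true"
    }
}
```

With this shape, a caller such as Crawl that has no opinion on filtering uses the shorter overload, while callers that need to override the configuration can still pass the flag explicitly.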