I am trying to crawl a URL that has a space in it. NUTCH-661 suggests that this can be fixed with a urlnormalizer plugin: https://issues.apache.org/jira/browse/NUTCH-661
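To be concrete about the result I am after: every whitespace character in the URL should be replaced with %20. Here is a minimal sketch of that substitution in plain Java, using only java.util.regex. It is not Nutch's actual normalizer code, and the class name, method name, and example URL are made up for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical stand-alone sketch; this is NOT part of Nutch.
public class SpaceNormalizerSketch {
    // Matches any single whitespace character (space, tab, etc.).
    private static final Pattern WHITESPACE = Pattern.compile("\\s");

    // Replaces every whitespace character in the URL with the literal text %20.
    static String normalize(String url) {
        return WHITESPACE.matcher(url).replaceAll("%20");
    }

    public static void main(String[] args) {
        // Example URL is an assumption, used only to show the expected effect.
        System.out.println(normalize("http://example.com/my page.html"));
        // prints http://example.com/my%20page.html
    }
}
```

If the regex normalizer applied the same pattern and substitution, I would expect the URL to reach the fetcher in the encoded form.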
I am using the urlnormalizer plugins (urlnormalizer-(pass|regex|basic)), and I put the rule below in the conf/regex-normalize.xml file:

    <regex>
      <pattern>\s</pattern>
      <substitution>%20</substitution>
    </regex>

But the URL with the space is still not getting crawled. Any hints as to what needs to be added to conf/regex-normalize.xml to make Nutch crawl URLs with spaces?

-------
Thanks/Regards,
Parvez
