I am trying to crawl a URL that has a space in it.

NUTCH-661 suggests that this can be fixed with a urlnormalizer plugin:
https://issues.apache.org/jira/browse/NUTCH-661

I am using the urlnormalizer plugins (urlnormalizer-(pass|regex|basic)) and I
put the rule below in the conf/regex-normalize.xml file:

<regex>
  <pattern>\s</pattern>
  <substitution>%20</substitution>
</regex>
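
As a sanity check on the pattern itself, the substitution this rule describes can be sketched in plain Java. This is only a minimal illustration of the regex; the class name, the normalize helper, and the example.com URL are hypothetical and not part of Nutch, and it says nothing about when or whether Nutch actually applies the rule.

```java
import java.util.regex.Pattern;

public class SpaceNormalizeSketch {
    // Same pattern as in the rule above: any single whitespace character.
    private static final Pattern WHITESPACE = Pattern.compile("\\s");

    // Hypothetical helper (not a Nutch API): replace each whitespace
    // character with the percent-encoded space "%20".
    public static String normalize(String url) {
        return WHITESPACE.matcher(url).replaceAll("%20");
    }

    public static void main(String[] args) {
        // Placeholder URL, used only for illustration.
        System.out.println(normalize("http://example.com/my file.html"));
        // prints "http://example.com/my%20file.html"
    }
}
```

So the pattern and substitution do what is intended when applied to a URL string on their own.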


But the URL with the space is still not getting crawled.

Any hints as to what needs to be added to the conf/regex-normalize.xml
file to make Nutch crawl URLs with spaces?

-------
Thanks/Regards,
Parvez
