I've run into a small issue with my deployment of Nutch. Some of the sites I
crawl use characters such as æøå in their URLs, and these never seem to
parse properly. Is there any way to get around this? I tried adding the
Unicode escapes (as '\u00e5' and so on) to regex-normalize.xml, but I suspect
the characters are already misdecoded when the pages are fetched, so the
normalizer never sees them as e.g. U+00E5. Any suggestions would be much
appreciated.
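
To illustrate what I suspect is happening, here is a minimal Python sketch. It
is not Nutch code, and the example path and helper names are made up. It shows
how the same path percent-encodes under UTF-8 and ISO-8859-1, and how UTF-8
bytes read with the wrong charset no longer contain U+00E5 at all, which
would explain why a rule written for '\u00e5' never matches.

```python
from urllib.parse import quote


def encode_path(path: str, charset: str = "utf-8") -> str:
    """Percent-encode non-ASCII characters in a URL path, leaving '/' alone."""
    return quote(path, safe="/", encoding=charset)


def misdecode(text: str) -> str:
    """Simulate UTF-8 bytes being wrongly read as ISO-8859-1."""
    return text.encode("utf-8").decode("iso-8859-1")


path = "/nyheter/blåbær"  # hypothetical path containing å and æ

print(encode_path(path))                # UTF-8 form of the path
print(encode_path(path, "iso-8859-1"))  # ISO-8859-1 form of the path
print("\u00e5" in misdecode("å"))       # is U+00E5 still present after misdecoding?
```

If the last line prints False, a pattern written for '\u00e5' has nothing to
match, and the fix would have to happen where the bytes are decoded rather
than in the normalizer.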
