I've noticed when running 'nutch invertlinks' that the URLs in the link database are not normalized according to the rules defined in regex-urlnormalize.xml. At first I thought this was a configuration error, but after some debugging I found that normalization does not occur on the first run of 'invertlinks'; on subsequent runs it DOES occur.
I think the root of the problem is in the crawl/LinkDb.java file. There is a comment saying "if we don't run the mergeJob, perform normalization/filtering now", followed by a check for whether the linkDb directory exists; if it does, normalization/filtering is skipped. I believe this directory will always exist at that point, because the 'invert' function creates a lock file inside it first, and if the directory didn't exist beforehand, creating the lock file also creates the directory. That, in turn, prevents normalization/filtering from ever running on the first invocation.

A simple work-around is to run the 'invertlinks' command twice: the second run invokes the mergeJob, and normalization/filtering then occurs. I think this is a bug, but I'd like someone more familiar with the code to take a look.

Thanks, Eric Severance - http://esev.com/
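To make the suspected sequence concrete, here is a minimal, self-contained sketch of the logic I'm describing. It uses java.nio on the local filesystem as a stand-in for Hadoop's FileSystem (whose create() call makes missing parent directories implicitly); the class and method names are mine, not Nutch's.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class LinkDbBugSketch {

    // Mimics the order of operations I believe invert() follows: the lock
    // file is created inside the linkDb directory first, which also brings
    // the directory into existence, so the later existence check can never
    // see a missing linkDb -- even on a first-ever run.
    static boolean normalizationSkipped() throws IOException {
        Path linkDb = Files.createTempDirectory("nutch-demo").resolve("linkdb");

        // Side effect under scrutiny: creating the lock file creates the
        // linkDb directory as well (Hadoop's FileSystem.create() does the
        // parent-directory creation implicitly; done explicitly here).
        Files.createDirectories(linkDb);
        Files.createFile(linkDb.resolve(".locked"));

        // "if we don't run the mergeJob, perform normalization/filtering now"
        // -- but this check now always sees an existing directory, so the
        // direct normalization branch is never taken on the first run.
        return Files.exists(linkDb);
    }

    public static void main(String[] args) throws IOException {
        System.out.println(normalizationSkipped()
            ? "linkDb directory exists -> direct normalization skipped"
            : "linkDb directory absent -> normalize/filter now");
    }
}
```

If this reading of the code is right, the check would need to distinguish "directory created only as a side effect of locking" from "a real pre-existing linkDb", e.g. by testing for the linkDb's data rather than the directory itself.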