I've noticed when running 'nutch invertlinks' that the URLs in the
link database are not normalized according to the rules defined in
regex-urlnormalize.xml.  At first, I thought this was a configuration
error, but after debugging I realized that if this is the first time
I'm running 'invertlinks', the normalization does not occur.  However,
on subsequent runs the normalization DOES occur.

I think the root of the problem is in the crawl/LinkDb.java file.
There is a comment that "if we don't run the mergeJob, perform
normalization/filtering now", then there is a check to see if the
linkDb directory exists, and if it does, normalization/filtering will
be skipped.  I believe this check will always find the directory,
because earlier in the 'invert' function a lock file is created inside
it, and if the directory didn't exist before, creating the lock file
also creates the directory.  This, then, prevents
normalization/filtering from ever occurring on the first run.
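To illustrate what I think is happening (this is a stand-alone sketch
using java.nio.file, not the actual Hadoop FileSystem calls in
LinkDb.java, and the paths and helper names are made up):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class LockOrderDemo {
    // Hypothetical stand-in for the lock step in 'invert': like the
    // Hadoop filesystem, creating the lock file also creates any
    // missing parent directories as a side effect.
    static void createLockFile(Path linkDb) throws IOException {
        Files.createDirectories(linkDb);           // linkDb now exists
        Files.createFile(linkDb.resolve(".locked"));
    }

    public static void main(String[] args) throws IOException {
        // Buggy order: lock first, then test whether linkDb existed.
        Path linkDb = Files.createTempDirectory("demo").resolve("linkdb");
        createLockFile(linkDb);
        boolean skip = Files.exists(linkDb);       // always true by now
        System.out.println("buggy check skips normalization: " + skip);

        // Possible fix: record whether linkDb existed BEFORE locking.
        Path linkDb2 = Files.createTempDirectory("demo").resolve("linkdb");
        boolean existedBefore = Files.exists(linkDb2); // false on first run
        createLockFile(linkDb2);
        System.out.println("pre-lock check skips normalization: " + existedBefore);
    }
}
```

So simply remembering the directory's existence before the lock file is
created would, I believe, make the first run normalize/filter as
expected.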

A simple work-around is to run the 'invertlinks' command twice.  This
invokes the mergeJob and normalization/filtering then occurs.
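Concretely, something like the following (the crawl/linkdb and
crawl/segments paths are just placeholders for your own layout):

```shell
# First run: builds the linkdb but, due to the bug above, skips
# normalization/filtering.
bin/nutch invertlinks crawl/linkdb -dir crawl/segments

# Second run: the linkdb now exists, so the mergeJob runs and
# normalization/filtering is applied.
bin/nutch invertlinks crawl/linkdb -dir crawl/segments
```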

I think this is a bug, but I'd like someone more familiar with the
code to take a look.

Thanks,
Eric Severance - http://esev.com/
