I'm wondering why we don't have the option to normalize URLs when merging
segments. It should work the same way as mergedb and mergelinkdb.

For instance, let's say I have two crawled URLs:
http://auto.yahoo.com/index.php?auto=BMW&sort=desc
http://auto.yahoo.com/index.php?auto=BMW
The page content is the same, but the display differs because of the sort
parameter, so I don't need to index the page twice.
I would then normalize the URLs to remove the extra parameter (sort=) and
thus reduce duplicate content, i.e.
http://auto.yahoo.com/index.php?auto=BMW&sort=desc would become
http://auto.yahoo.com/index.php?auto=BMW
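The normalization described above can be sketched as follows (a minimal,
standalone illustration, not Nutch's actual implementation; in Nutch this
kind of rule would live in a URL normalizer plugin such as
urlnormalizer-regex, and the STRIP_PARAMS set here is an assumed list of
parameters that don't affect page content):

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Assumption: these query parameters change only the display, not the content.
STRIP_PARAMS = {"sort"}

def normalize(url):
    """Drop content-irrelevant query parameters so duplicate pages collapse
    to a single canonical URL."""
    parts = urlsplit(url)
    # Keep only the parameters that actually select different content.
    kept = [(k, v) for k, v in parse_qsl(parts.query) if k not in STRIP_PARAMS]
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(kept), parts.fragment)
    )
```

With this rule, both example URLs normalize to the same canonical form, so
the page would only be fetched and indexed once.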

The un-normalized URL will be removed when I merge my crawldb and linkdb, so
we should do the same on the segments. I don't see the point of keeping
crawl_generate, parse_data, etc. entries for a URL that no longer exists in
the crawldb.

Maybe I'm missing something here; please help me understand.
_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
