Lucifersam wrote:
> Andrzej Bialecki wrote:
>   
>> Lucifersam wrote:
>>     
>>> Finally - I seem to have a problem with identical pages with different
>>> urls
>>> - i.e.
>>>
>>> http://website/
>>> http://website/default.htm
>>>
>>> I was under the impression that these would be removed by the dedup
>>> process,
>>> but this does not seem to be working. Is there something I'm missing? 
>>>       
>> Most likely the pages are slightly different - you can save them to 
>> files, and then run a diff utility to check for differences.
>>
>>     
>
> You're right, there was a small difference in the HTML concerning some
> timing comment, e.g.:
>
> <!--Exec time = 265.625-->
>
> As this is not strictly content - is there a simple way to ignore anything
> within comments when looking at the content of a page?
>   


You can provide your own implementation of the Signature class (please 
see the javadocs for that class) and then configure it in nutch-site.xml.
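
For illustration, here is a minimal sketch, assuming the 0.8.x API where 
org.apache.nutch.crawl.Signature has a single calculate(Content, Parse) 
method; the package and class names below are made up for the example. It 
simply strips HTML comments, like the one you quoted, from the raw page 
before hashing it:

package org.example.nutch;   // hypothetical package

import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

import org.apache.nutch.crawl.Signature;
import org.apache.nutch.parse.Parse;
import org.apache.nutch.protocol.Content;

public class CommentIgnoringSignature extends Signature {

  public byte[] calculate(Content content, Parse parse) {
    // Drop HTML comments such as <!--Exec time = 265.625--> from the
    // raw page bytes, then hash what is left.
    String html = new String(content.getContent());
    String stripped = html.replaceAll("(?s)<!--.*?-->", "");
    try {
      return MessageDigest.getInstance("MD5").digest(stripped.getBytes());
    } catch (NoSuchAlgorithmException e) {
      // MD5 is always present in practice; fall back to the raw bytes.
      return content.getContent();
    }
  }
}

To activate it, point the db.signature.class property in nutch-site.xml 
at the new class (that is the property name in 0.8+; verify it against 
your nutch-default.xml).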

A common trick is to use just the plain-text version of the page and to 
"normalize" it further: replace all whitespace with single spaces, bring 
all tokens to lowercase, optionally filter out all digits, and optionally 
remove all words that occur only once.
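
A rough sketch of that normalization in plain Java (the class and method 
names here are just an example, not Nutch code); you would then hash the 
resulting string, e.g. with MD5, inside your Signature.calculate():

import java.util.HashMap;
import java.util.Map;

public class TextNormalizer {

  // Lowercase the text, replace runs of digits and whitespace with
  // single spaces, then drop every word that occurs only once.
  public static String normalizeText(String plainText) {
    String lowered = plainText.toLowerCase()
        .replaceAll("[0-9]+", " ")   // optionally filter out digits
        .replaceAll("\\s+", " ")     // exactly single spaces
        .trim();
    String[] tokens = lowered.split(" ");

    // First pass: count how often each token occurs.
    Map<String, Integer> counts = new HashMap<String, Integer>();
    for (String t : tokens) {
      Integer c = counts.get(t);
      counts.put(t, c == null ? 1 : c + 1);
    }

    // Second pass: keep only tokens that occur more than once.
    StringBuilder sb = new StringBuilder();
    for (String t : tokens) {
      if (counts.get(t) > 1) {
        if (sb.length() > 0) sb.append(' ');
        sb.append(t);
      }
    }
    return sb.toString();
  }
}

If your Nutch version ships org.apache.nutch.crawl.TextProfileSignature, 
it implements a variant of this idea and may be enough as-is.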

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


