Re: near duplicates

2006-10-18 Thread John Casey
On 10/18/06, Isabel Drost <[EMAIL PROTECTED]> wrote: Find Me wrote: > How to eliminate near duplicates from the index? Someone suggested that I > could look at the TermVectors and do a comparison to remove the > duplicates. As an alternative you could also have a look at the paper "Detecting P
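
The term-vector comparison mentioned in the question could look roughly like the sketch below. It is plain Python, not the Lucene TermVector API: the helper names (`term_vector`, `cosine_similarity`, `drop_near_duplicates`), the whitespace tokenizer, and the 0.9 threshold are all assumptions made for illustration, and the pairwise loop is quadratic, so it only suits small collections.

```python
import math
from collections import Counter


def term_vector(text):
    """Build a term-frequency vector (hypothetical helper, whitespace tokens)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two term-frequency vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is similar to nothing
    return dot / (norm_a * norm_b)


def drop_near_duplicates(docs, threshold=0.9):
    """Keep the first document of every group of near duplicates.

    docs is an iterable of (doc_id, text) pairs; threshold is an
    assumed cut-off, not a recommended value.
    """
    kept = []  # (doc_id, vector) pairs that survived so far
    for doc_id, text in docs:
        vec = term_vector(text)
        if all(cosine_similarity(vec, other) < threshold for _, other in kept):
            kept.append((doc_id, vec))
    return [doc_id for doc_id, _ in kept]
```

Calling `drop_near_duplicates` on a list of `(doc_id, text)` pairs returns the ids that would stay in the index; the others are candidates for deletion.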

Re: The Nutch Crawler and the Web Link Graph

2006-08-16 Thread John Casey
Thanks, it works perfectly, although I ended up merging the segments rather than using your MapFileReader. On 8/15/06, Andrzej Bialecki <[EMAIL PROTECTED]> wrote: John Casey wrote: > Hi All, is there any way to extract the outlinks of a particular > webpage/URL? > I have

The Nutch Crawler and the Web Link Graph

2006-08-15 Thread John Casey
Hi All, is there any way to extract the outlinks of a particular webpage/URL? I have had a look at the LinkDBReader, but this will only give me a listing of pages that link to the page in question. Any ideas? I have been having a look in the segments directory and have been trying to read/parse the fil
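
For what it is worth, an inlink listing like the one LinkDBReader produces can in principle be inverted to recover outlinks, though only for links between pages that are already in the link database. The sketch below is plain Python over an in-memory dict, not the Nutch API; `invert_inlinks` and the example URLs are hypothetical.

```python
from collections import defaultdict


def invert_inlinks(inlinks):
    """Turn {target_url: [source_urls]} into {source_url: [target_urls]}.

    Only links whose target appears in the input are recovered, so the
    result can miss outlinks to pages that were never recorded.
    """
    outlinks = defaultdict(set)
    for target, sources in inlinks.items():
        for source in sources:
            outlinks[source].add(target)
    # Sort for stable, readable output.
    return {source: sorted(targets) for source, targets in outlinks.items()}


# Hypothetical example data.
inlinks = {
    "http://b.example/": ["http://a.example/"],
    "http://c.example/": ["http://a.example/", "http://b.example/"],
}
outlinks = invert_inlinks(inlinks)
```

Here `outlinks["http://a.example/"]` lists the pages that a.example links to, reconstructed purely from the inlink data.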