Thanks Stefan.
So one has to iterate and re-write the whole graph, and there is no way to just 
modify it on the fly by, for example, removing specific links/pages?

Thanks,
Otis

----- Original Message ----
From: Stefan Groschupf <[EMAIL PROTECTED]>
To: nutch-user@lucene.apache.org
Sent: Friday, July 7, 2006 1:52:24 AM
Subject: Re: [Nutch-general] Link db (traversal + modification)

Hi Otis,

the link graph lives in the linkdb.
I suggest writing a small map reduce tool that reads the existing
linkDb, filters out the pages you want to remove, and writes the
result back to disk.
This will be just a couple of lines of code.
The hadoop package comes with some nice map reduce examples.
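
As a rough illustration of the read / filter / write-back idea, here is
a minimal sketch in plain Java. It does not use the Nutch or Hadoop
APIs: the class LinkGraphFilter, its filter method, and the in-memory
map (target URL -> list of inlink URLs) are all hypothetical stand-ins
for the records a real job would read from the linkDb. Dropping targets
that end up with no inlinks is also an assumption, not something the
linkDb is known to require.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Predicate;

// Hypothetical stand-in for the filtering pass; not part of Nutch or Hadoop.
class LinkGraphFilter {

    /**
     * Builds a new graph (target URL -> inlink URLs) without rejected pages.
     * The input graph is not modified, mirroring "read, filter, write back".
     */
    static Map<String, List<String>> filter(Map<String, List<String>> linkGraph,
                                            Predicate<String> isRejected) {
        Map<String, List<String>> result = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> entry : linkGraph.entrySet()) {
            // Drop the whole record if the target page itself is rejected.
            if (isRejected.test(entry.getKey())) {
                continue;
            }
            // Keep only inlinks that do not come from rejected pages.
            List<String> kept = new ArrayList<>();
            for (String from : entry.getValue()) {
                if (!isRejected.test(from)) {
                    kept.add(from);
                }
            }
            // Assumption: a target with no remaining inlinks is not written out.
            if (!kept.isEmpty()) {
                result.put(entry.getKey(), kept);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, List<String>> graph = new LinkedHashMap<>();
        graph.put("http://a.example/", List.of("http://b.example/", "http://spam.example/"));
        graph.put("http://spam.example/", List.of("http://a.example/"));

        Set<String> spam = Set.of("http://spam.example/");
        System.out.println(filter(graph, spam::contains));
    }
}
```

In an actual map reduce job the per-record logic in the loop body would
sit in the map (or reduce) function, and the output would go to a new
linkDb directory that replaces the old one.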

Stefan


On 06.07.2006, at 22:47, <[EMAIL PROTECTED]> wrote:

> Hi,
>
> What's the best way to traverse the graph of all fetched pages and
> optionally modify it (e.g. remove a page because you know it's spam)?
> I looked at various Nutch classes, and only LinksDbReader looks
> like it lets you iterate through all links (and for each link get
> its inlinks).  Is this right?
>
> But how would one go about modifying the links db?
> Perhaps I should be asking about where/how the links db is stored  
> on disk, and whether one should just access and modify that data  
> directly on disk?
>
> Thanks,
> Otis
>
>
>


_______________________________________________
Nutch-general mailing list
Nutch-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-general


