Thanks Stefan. So one has to iterate and re-write the whole graph, and there is no way to just modify it on the fly by, for example, removing specific links/pages?
Thanks,
Otis

----- Original Message ----
From: Stefan Groschupf <[EMAIL PROTECTED]>
To: nutch-user@lucene.apache.org
Sent: Friday, July 7, 2006 1:52:24 AM
Subject: Re: [Nutch-general] Link db (traversal + modification)

Hi Otis,

the link graph lives in the linkdb. I suggest writing a small map reduce tool that reads the existing linkDb, filters out the pages you want to remove, and writes the result back to disk. This will be just a couple of lines of code. The hadoop package comes with some nice map reduce examples.

Stefan

On 06.07.2006, at 22:47, <[EMAIL PROTECTED]> wrote:

> Hi,
>
> What's the best way to traverse the graph of all fetched pages and
> optionally modify it (e.g. remove a page because you know it's spam)?
> I looked at various Nutch classes, and only LinksDbReader looks
> like it lets you iterate through all links (and for each link get
> its inlinks). Is this right?
>
> But how would one go about modifying the links db?
> Perhaps I should be asking about where/how the links db is stored
> on disk, and whether one should just access and modify that data
> directly on disk?
>
> Thanks,
> Otis
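
The filter-and-rewrite approach suggested in the quoted reply can be sketched roughly as follows. This is only an illustration in plain Java, not Nutch or Hadoop code: the link db is modelled as an in-memory map from a page URL to the URLs linking to it, and the class name `LinkDbFilterSketch`, the method `filterLinkDb`, and the `example` URLs are all hypothetical. A real tool would apply the same per-record check inside a map reduce job that reads the existing linkDb and writes a new copy.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

public class LinkDbFilterSketch {

    // Assumption: the link db is represented as "target URL -> URLs of pages linking to it".
    // The real linkdb stores richer records; this shape is for illustration only.
    static Map<String, List<String>> filterLinkDb(Map<String, List<String>> linkDb,
                                                  Set<String> blocked) {
        Map<String, List<String>> result = new TreeMap<>();
        for (Map.Entry<String, List<String>> entry : linkDb.entrySet()) {
            String url = entry.getKey();
            if (blocked.contains(url)) {
                continue; // drop the whole record for a blocked page
            }
            // Keep only inlinks that do not originate from a blocked page.
            List<String> kept = new ArrayList<>();
            for (String from : entry.getValue()) {
                if (!blocked.contains(from)) {
                    kept.add(from);
                }
            }
            result.put(url, kept);
        }
        return result; // a new map; the input is left untouched, as with a rewrite to disk
    }

    public static void main(String[] args) {
        Map<String, List<String>> linkDb = new TreeMap<>();
        linkDb.put("http://a.example/", List.of("http://b.example/", "http://spam.example/"));
        linkDb.put("http://b.example/", List.of("http://a.example/"));
        linkDb.put("http://spam.example/", List.of("http://a.example/"));

        Map<String, List<String>> cleaned =
                filterLinkDb(linkDb, Set.of("http://spam.example/"));
        System.out.println(cleaned);
    }
}
```

Note that the sketch never edits the input in place: it reads every record and emits a filtered copy, which is the "iterate and re-write" pattern asked about above.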