Hi,
The Hadoop IO system is read-only, so you cannot update a file in place.
However, I'm sure you can hack the linkdb creation code and add
the URL filter that is already used for the crawldb.
Maybe this is already in the code; if not, it would be a good addition,
since it would minimize the effect of spam links on the ranking.
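
To illustrate the idea of filtering URLs while the linkdb is being built, here is a minimal sketch in plain Java. It assumes nothing about the real Nutch code: the class FilteredInvertSketch, the method invert, and the spam.example pattern are all hypothetical, and no Hadoop or Nutch classes are used. It only models the logic of turning outlinks into inlinks while skipping rejected URLs.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;
import java.util.regex.Pattern;

// Hypothetical stand-in for filtering at linkdb creation time; not Nutch code.
class FilteredInvertSketch {

    // Inverts "page -> outlinks" into "target -> inlinks", skipping every
    // link whose source or target URL is rejected by the filter.
    static Map<String, List<String>> invert(Map<String, List<String>> outlinks,
                                            Predicate<String> accept) {
        Map<String, List<String>> inlinks = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> entry : outlinks.entrySet()) {
            String from = entry.getKey();
            if (!accept.test(from)) {
                continue; // links coming from a rejected page are dropped
            }
            for (String to : entry.getValue()) {
                if (accept.test(to)) {
                    inlinks.computeIfAbsent(to, k -> new ArrayList<>()).add(from);
                }
            }
        }
        return inlinks;
    }

    public static void main(String[] args) {
        // Assumed example rule: reject everything hosted under spam.example.
        Pattern spam = Pattern.compile("^https?://([^/]*\\.)?spam\\.example/.*");
        Predicate<String> accept = url -> !spam.matcher(url).matches();

        Map<String, List<String>> outlinks = new LinkedHashMap<>();
        outlinks.put("http://spam.example/", List.of("http://a.example/"));
        outlinks.put("http://b.example/",
                List.of("http://a.example/", "http://spam.example/x"));

        System.out.println(invert(outlinks, accept));
    }
}
```

In a real setup the predicate would be whatever URL filter the crawldb already uses; the regular expression here is only a placeholder for it.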
Stefan
On 07.07.2006, at 09:50, <[EMAIL PROTECTED]> wrote:
Thanks Stefan.
So one has to iterate and re-write the whole graph, and there is no
way to just modify it on the fly by, for example, removing specific
links/pages?
Thanks,
Otis
----- Original Message ----
From: Stefan Groschupf <[EMAIL PROTECTED]>
To: nutch-user@lucene.apache.org
Sent: Friday, July 7, 2006 1:52:24 AM
Subject: Re: [Nutch-general] Link db (traversal + modification)
Hi Otis,
The link graph lives in the linkdb.
I suggest writing a small MapReduce tool that reads the existing
linkdb, filters out the pages you want to remove, and writes the result
back to disk.
This will be just a couple of lines of code.
The Hadoop package comes with some nice MapReduce examples.
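
As a rough sketch of the read, filter, and write-back step described above, the following plain Java models the linkdb as an in-memory map from page URL to its inlinks. This is an assumption made for illustration: LinkDbFilterSketch and filter are hypothetical names, no Hadoop or Nutch classes are used, and a real tool would do the same per-record work inside a map or reduce function and write a new linkdb directory.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for a linkdb filter job; not Nutch or Hadoop code.
class LinkDbFilterSketch {

    // Builds a new inlink map without the banned pages.
    // The input map is never modified, mirroring the "rewrite, don't update" model.
    static Map<String, List<String>> filter(Map<String, List<String>> linkDb,
                                            Set<String> banned) {
        Map<String, List<String>> result = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> entry : linkDb.entrySet()) {
            if (banned.contains(entry.getKey())) {
                continue; // drop the banned page itself
            }
            List<String> kept = new ArrayList<>();
            for (String inlink : entry.getValue()) {
                if (!banned.contains(inlink)) {
                    kept.add(inlink); // keep only inlinks from non-banned pages
                }
            }
            result.put(entry.getKey(), kept);
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, List<String>> linkDb = new LinkedHashMap<>();
        linkDb.put("http://a.example/",
                List.of("http://b.example/", "http://spam.example/"));
        linkDb.put("http://spam.example/", List.of("http://a.example/"));

        System.out.println(filter(linkDb, Set.of("http://spam.example/")));
    }
}
```

The sketch removes a banned page both as an entry and as a source of inlinks to other pages, so a spam page stops contributing to the link graph in either direction.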
Stefan
On 06.07.2006, at 22:47, <[EMAIL PROTECTED]> wrote:
Hi,
What's the best way to traverse the graph of all fetched pages and
optionally modify it (e.g. remove a page because you know it's spam)?
I looked at various Nutch classes, and only LinksDbReader looks
like it lets you iterate through all links (and for each link get
its inlinks). Is this right?
But how would one go about modifying the links db?
Perhaps I should be asking about where/how the links db is stored
on disk, and whether one should just access and modify that data
directly on disk?
Thanks,
Otis