Well, it looks like the link I sent you goes to the 0.9 version of the Nutch API docs. There is a link error on the Nutch project site: the 0.7.2 doc link actually points to the 0.9 docs.
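For what it's worth, here is a rough sketch of the 'pruned' URL filter I suggested in my earlier message (quoted below). It is plain Java, not Nutch code: the PrunedUrlFilter class and its prune/filter methods are names made up for illustration, and the trailing-slash normalization is only an assumption about what should count as the same URL. Returning null to reject a URL is meant to mirror the convention Nutch's URL filters use, as far as I recall, so the logic could probably be moved into a real plugin.

```java
import java.util.Collection;
import java.util.HashSet;
import java.util.Set;

/**
 * Hypothetical helper, not part of Nutch: remembers URLs that were pruned
 * from the index so they can be rejected if the crawler discovers them again.
 */
class PrunedUrlFilter {
    // In-memory set of normalized URLs that must be ignored.
    private final Set<String> pruned = new HashSet<>();

    PrunedUrlFilter(Collection<String> prunedUrls) {
        for (String url : prunedUrls) {
            pruned.add(normalize(url));
        }
    }

    /** Records one more URL as pruned. */
    void prune(String url) {
        pruned.add(normalize(url));
    }

    /** Returns the URL unchanged if it is allowed, or null if it was pruned. */
    String filter(String url) {
        if (url == null) {
            return null;
        }
        return pruned.contains(normalize(url)) ? null : url;
    }

    // Assumed normalization: trim whitespace and drop one trailing slash.
    // Case is left alone because URL paths can be case-sensitive.
    private static String normalize(String url) {
        String s = url.trim();
        if (s.endsWith("/")) {
            s = s.substring(0, s.length() - 1);
        }
        return s;
    }
}
```

You would call prune(url) for each page you remove from the index and pass every newly discovered URL through filter(url) before it goes into a fetchlist. Since the list can get quite large, a real version would probably need to persist the set somewhere instead of keeping it all in memory.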
-- 
"Conscious decisions by conscious minds are what make reality real"

On 4/27/07, Briggs <[EMAIL PROTECTED]> wrote:
> Here is the link to the docs:
> http://lucene.apache.org/nutch/apidocs/index.html
>
> You would then need to create a filter of 'pruned' urls to ignore if
> they are discovered again. This list can get quite large, but I
> really don't know how else to do it. It would be cool if we could
> hack the crawldb (or webdb I believe in your version) to include a
> flag of 'good/bad' or something.
>
> On 4/27/07, Briggs <[EMAIL PROTECTED]> wrote:
> > Isn't this what you are looking for?
> >
> > org.apache.nutch.tools.PruneIndexTool.
> >
> > On 4/27/07, franklinb4u <[EMAIL PROTECTED]> wrote:
> > >
> > > hi Enis,
> > > This is franklin ..currently i m using nutch 0.7.2 for my crawling and
> > > indexing for my search engine...
> > > i read from ur message that u can delete a particular index directly?if so
> > > how its possible..i m desperately searching for a clue to do this one...
> > > my requirement is to delete the porn site's index from my crawled data...
> > > ur help is highly needed....
> > >
> > > expecting u to help me in this regards ..
> > >
> > > Thanks in advance..
> > > Franklin.S
> > >
> > > ogjunk-nutch wrote:
> > > >
> > > > Hi Enis,
> > > >
> > > > Right, I can easily delete the page from the Lucene index, though I'd
> > > > prefer to follow the Nutch protocol and avoid messing something up by
> > > > touching the index directly. However, I don't want that page to re-appear
> > > > in one of the subsequent fetches. Well, it won't re-appear, because it
> > > > will remain missing, but it would be great to be able to tell Nutch to
> > > > "forget it" "from everywhere". Is that doable?
> > > > I could read and re-write the *Db Maps, but that's a lot of IO... just to
> > > > get a couple of URLs erased. I'd prefer a friendly persuasion where Nutch
> > > > flags a given page as "forget this page as soon as possible" and it just
> > > > happens later on.
> > > >
> > > > Thanks,
> > > > Otis
> > > > . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
> > > > Simpy -- http://www.simpy.com/ - Tag - Search - Share
> > > >
> > > > ----- Original Message ----
> > > > From: Enis Soztutar <[EMAIL PROTECTED]>
> > > > To: [EMAIL PROTECTED]
> > > > Sent: Thursday, April 5, 2007 3:29:55 AM
> > > > Subject: Re: [Nutch-general] Removing pages from index immediately
> > > >
> > > > Since hadoop's map files are write once, it is not possible to delete
> > > > some urls from the crawldb and linkdb. The only thing you can do is to
> > > > create the map files once again without the deleted urls. But running
> > > > the crawl once more as you suggested seems more appropriate. Deleting
> > > > documents from the index is just lucene stuff.
> > > >
> > > > In your case it seems that every once in a while, you crawl the whole
> > > > site, and create the indexes and db's and then just throw the old one
> > > > out. And between two crawls you can delete the urls from the index.
> > > >
> > > > [EMAIL PROTECTED] wrote:
> > > >> Hi,
> > > >>
> > > >> I'd like to be able to immediately remove certain pages from Nutch
> > > >> (index, crawldb, linkdb...).
> > > >> The scenario is that I'm using Nutch to index a single site or a set of
> > > >> internal sites. Once in a while editors of the site remove a page from
> > > >> the site. When that happens, I want to update at least the index and
> > > >> ideally crawldb, linkdb, so that people searching the index don't get the
> > > >> missing page in results and end up going there, hitting the 404.
> > > >>
> > > >> I don't think there is a "direct" way to do this with Nutch, is there?
> > > >> If there really is no direct way to do this, I was thinking I'd just put
> > > >> the URL of the recently removed page into the first next fetchlist and
> > > >> then somehow get Nutch to immediately remove that page/URL once it hits a
> > > >> 404. How does that sound?
> > > >>
> > > >> Is there a way to configure Nutch to delete the page after it gets a 404
> > > >> for it even just once? I thought I saw the setting for that somewhere a
> > > >> few weeks ago, but now I can't find it.
> > > >>
> > > >> Thanks,
> > > >> Otis
> > > >> . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
> > > >> Simpy -- http://www.simpy.com/ - Tag - Search - Share
> > >
> > > --
> > > View this message in context:
> > > http://www.nabble.com/Re%3A--Nutch-general--Removing-pages-from-index-immediately-tf3530204.html#a10218273
> > > Sent from the Nutch - User mailing list archive at Nabble.com.

_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
