Isn't this what you are looking for?

org.apache.nutch.tools.PruneIndexTool.
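
For the crawldb/linkdb side, Enis's point in the quoted thread below is that the map files are write-once, so the only way to "delete" URLs is to write a fresh copy that leaves them out. Here is a minimal sketch of that idea in plain Java. It does not use the Nutch or Hadoop APIs; the class name, the method name and the in-memory sorted map standing in for a map file are all assumptions made for illustration only.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical illustration only: a sorted in-memory map stands in for a
// write-once map file keyed by URL. Nothing here is a Nutch or Hadoop API.
public class RewriteWithoutUrls {

    // Copies every entry of the old map into a new one, skipping the URLs
    // to be removed. The old map is never modified, which mirrors the
    // "write once, then rewrite" constraint described in the thread.
    static SortedMap<String, String> rewrite(SortedMap<String, String> old,
                                             Set<String> removed) {
        SortedMap<String, String> fresh = new TreeMap<>();
        for (Map.Entry<String, String> entry : old.entrySet()) {
            if (!removed.contains(entry.getKey())) {
                fresh.put(entry.getKey(), entry.getValue());
            }
        }
        // Hand back a read-only view so the new copy is also write-once.
        return Collections.unmodifiableSortedMap(fresh);
    }

    public static void main(String[] args) {
        SortedMap<String, String> crawlDb = new TreeMap<>();
        crawlDb.put("http://example.com/a", "fetched");
        crawlDb.put("http://example.com/b", "fetched");
        crawlDb.put("http://example.com/c", "unfetched");

        Set<String> removed =
            new HashSet<>(Arrays.asList("http://example.com/b"));

        SortedMap<String, String> rewritten = rewrite(crawlDb, removed);
        System.out.println(rewritten.keySet());
    }
}
```

The cost is the one Otis complains about below: every entry gets read and written again even if only a couple of URLs are dropped.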



On 4/27/07, franklinb4u <[EMAIL PROTECTED]> wrote:
>
> Hi Enis,
> This is Franklin. I am currently using Nutch 0.7.2 for crawling and
> indexing for my search engine.
> I read in your message that you can delete particular pages from the index
> directly. If so, how is that possible? I have been searching desperately
> for a clue on how to do this.
> My requirement is to delete the index entries for porn sites from my
> crawled data. Your help is badly needed.
>
> I hope you can help me with this.
>
> Thanks in advance..
> Franklin.S
>
>
> ogjunk-nutch wrote:
> >
> > Hi Enis,
> >
> > Right, I can easily delete the page from the Lucene index, though I'd
> > prefer to follow the Nutch protocol and avoid messing something up by
> > touching the index directly.  However, I don't want that page to re-appear
> > in one of the subsequent fetches.  Well, it won't re-appear, because it
> > will remain missing, but it would be great to be able to tell Nutch to
> > "forget it" "from everywhere".  Is that doable?
> > I could read and re-write the *Db Maps, but that's a lot of IO... just to
> > get a couple of URLs erased.  I'd prefer a friendly persuasion where Nutch
> > flags a given page as "forget this page as soon as possible" and it just
> > happens later on.
> >
> > Thanks,
> > Otis
> >  . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
> > Simpy -- http://www.simpy.com/  -  Tag  -  Search  -  Share
> >
> > ----- Original Message ----
> > From: Enis Soztutar <[EMAIL PROTECTED]>
> > To: [EMAIL PROTECTED]
> > Sent: Thursday, April 5, 2007 3:29:55 AM
> > Subject: Re: [Nutch-general] Removing pages from index immediately
> >
> > Since Hadoop's map files are write-once, it is not possible to delete
> > some URLs from the crawldb and linkdb. The only thing you can do is to
> > create the map files once again without the deleted URLs. But running
> > the crawl once more as you suggested seems more appropriate. Deleting
> > documents from the index is just Lucene stuff.
> >
> > In your case it seems that every once in a while, you crawl the whole
> > site, create the indexes and dbs, and then just throw the old ones
> > out. And between two crawls you can delete the URLs from the index.
> >
> > [EMAIL PROTECTED] wrote:
> >> Hi,
> >>
> >> I'd like to be able to immediately remove certain pages from Nutch
> >> (index, crawldb, linkdb...).
> >> The scenario is that I'm using Nutch to index a single site or a set of
> >> internal sites.  Once in a while editors of the site remove a page from
> >> the site.  When that happens, I want to update at least the index and
> >> ideally crawldb, linkdb, so that people searching the index don't get the
> >> missing page in results and end up going there, hitting the 404.
> >>
> >> I don't think there is a "direct" way to do this with Nutch, is there?
> >> If there really is no direct way to do this, I was thinking I'd just put
> >> the URL of the recently removed page into the first next fetchlist and
> >> then somehow get Nutch to immediately remove that page/URL once it hits a
> >> 404.  How does that sound?
> >>
> >> Is there a way to configure Nutch to delete the page after it gets a 404
> >> for it even just once?  I thought I saw the setting for that somewhere a
> >> few weeks ago, but now I can't find it.
> >>
> >> Thanks,
> >> Otis
> >
> >
> > _______________________________________________
> > Nutch-general mailing list
> > [email protected]
> > https://lists.sourceforge.net/lists/listinfo/nutch-general
> >
> >
> >
> >
> >
>
> --
> View this message in context: 
> http://www.nabble.com/Re%3A--Nutch-general--Removing-pages-from-index-immediately-tf3530204.html#a10218273
> Sent from the Nutch - User mailing list archive at Nabble.com.
>
>


-- 
"Conscious decisions by conscious minds are what make reality real"

