Re: [Nutch-general] Removing pages from index immediately

Briggs Fri, 27 Apr 2007 09:19:04 -0700

Here is the link to the docs: http://lucene.apache.org/nutch/apidocs/index.html


Once the pages are pruned, you would also need a URL filter listing the
'pruned' URLs so that they are ignored if they are discovered again.  This
list can get quite large, but I really don't know how else to do it.  It
would be cool if we could hack the crawldb (or webdb, I believe, in your
version) to include a 'good/bad' flag or something.
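
To make the 'pruned' list idea concrete, here is a minimal sketch in plain
Java.  It is not Nutch code: the class name PrunedUrlFilter and its
normalize() helper are made up for illustration, and the example.com URLs
are placeholders.  It only mirrors the convention Nutch URL filters use,
where filter() returns the URL to keep it and null to drop it.  Wiring
something like this into a real URL filter plugin is left as an assumption.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.Set;

/**
 * Hypothetical helper, not part of Nutch: remembers pruned URLs and
 * rejects them if they are seen again.
 */
final class PrunedUrlFilter {
    // Hash set so each lookup stays cheap even when the list grows large.
    private final Set<String> pruned = new HashSet<>();

    PrunedUrlFilter(Collection<String> prunedUrls) {
        for (String url : prunedUrls) {
            pruned.add(normalize(url));
        }
    }

    /** Trims whitespace and drops one trailing slash so trivial variants match. */
    static String normalize(String url) {
        String u = url.trim();
        if (u.length() > 1 && u.endsWith("/")) {
            u = u.substring(0, u.length() - 1);
        }
        return u;
    }

    /** Returns the URL unchanged if it is allowed, or null if it was pruned. */
    String filter(String url) {
        if (url == null) {
            return null;
        }
        return pruned.contains(normalize(url)) ? null : url;
    }

    public static void main(String[] args) {
        PrunedUrlFilter f = new PrunedUrlFilter(
                Arrays.asList("http://example.com/removed.html"));
        System.out.println(f.filter("http://example.com/removed.html"));
        System.out.println(f.filter("http://example.com/index.html"));
    }
}
```

An in-memory set is the simplest choice here; if the list really does get
huge, the same filter() contract could sit on top of a file-backed or
probabilistic structure instead.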


On 4/27/07, Briggs <[EMAIL PROTECTED]> wrote:
> Isn't this what you are looking for?
>
> org.apache.nutch.tools.PruneIndexTool.
>
>
>
> On 4/27/07, franklinb4u <[EMAIL PROTECTED]> wrote:
> >
> > Hi Enis,
> > This is Franklin. I am currently using Nutch 0.7.2 for crawling and
> > indexing for my search engine.
> > I read in your message that you can delete particular entries from the
> > index directly. If so, how is that possible? I am desperately searching
> > for a clue on how to do this.
> > My requirement is to remove the porn sites' index entries from my
> > crawled data. Your help is much needed.
> >
> > I am hoping you can help me with this.
> >
> > Thanks in advance.
> > Franklin.S
> >
> >
> > ogjunk-nutch wrote:
> > >
> > > Hi Enis,
> > >
> > > Right, I can easily delete the page from the Lucene index, though I'd
> > > prefer to follow the Nutch protocol and avoid messing something up by
> > > touching the index directly.  However, I don't want that page to re-appear
> > > in one of the subsequent fetches.  Well, it won't re-appear, because it
> > > will remain missing, but it would be great to be able to tell Nutch to
> > > "forget it" "from everywhere".  Is that doable?
> > > I could read and re-write the *Db Maps, but that's a lot of IO... just to
> > > get a couple of URLs erased.  I'd prefer a friendly persuasion where Nutch
> > > flags a given page as "forget this page as soon as possible" and it just
> > > happens later on.
> > >
> > > Thanks,
> > > Otis
> > >  . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
> > > Simpy -- http://www.simpy.com/  -  Tag  -  Search  -  Share
> > >
> > > ----- Original Message ----
> > > From: Enis Soztutar <[EMAIL PROTECTED]>
> > > To: [EMAIL PROTECTED]
> > > Sent: Thursday, April 5, 2007 3:29:55 AM
> > > Subject: Re: [Nutch-general] Removing pages from index immediately
> > >
> > > Since Hadoop's map files are write-once, it is not possible to delete
> > > some URLs from the crawldb and linkdb. The only thing you can do is to
> > > create the map files once again without the deleted URLs. But running
> > > the crawl once more as you suggested seems more appropriate. Deleting
> > > documents from the index is just Lucene stuff.
> > >
> > > In your case it seems that every once in a while, you crawl the whole
> > > site, create the indexes and DBs, and then just throw the old ones
> > > out. And between two crawls you can delete the URLs from the index.
> > >
> > > [EMAIL PROTECTED] wrote:
> > >> Hi,
> > >>
> > >> I'd like to be able to immediately remove certain pages from Nutch
> > >> (index, crawldb, linkdb...).
> > >> The scenario is that I'm using Nutch to index a single site or a set of
> > >> internal sites.  Once in a while editors of the site remove a page from
> > >> the site.  When that happens, I want to update at least the index and
> > >> ideally crawldb, linkdb, so that people searching the index don't get the
> > >> missing page in results and end up going there, hitting the 404.
> > >>
> > >> I don't think there is a "direct" way to do this with Nutch, is there?
> > >> If there really is no direct way to do this, I was thinking I'd just put
> > >> the URL of the recently removed page into the very next fetchlist and
> > >> then somehow get Nutch to immediately remove that page/URL once it hits a
> > >> 404.  How does that sound?
> > >>
> > >> Is there a way to configure Nutch to delete the page after it gets a 404
> > >> for it even just once?  I thought I saw the setting for that somewhere a
> > >> few weeks ago, but now I can't find it.
> > >>
> > >> Thanks,
> > >> Otis
> > >>  . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
> > >> Simpy -- http://www.simpy.com/  -  Tag  -  Search  -  Share
> > >>
> > >>
> > >>
> > >>
> > >
> > >
> >
> > --
> > View this message in context: 
> > http://www.nabble.com/Re%3A--Nutch-general--Removing-pages-from-index-immediately-tf3530204.html#a10218273
> > Sent from the Nutch - User mailing list archive at Nabble.com.
> >
> >
>
>
> --
> "Conscious decisions by conscious minds are what make reality real"
>


-- 
"Conscious decisions by conscious minds are what make reality real"

_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
