Hi Enis,

This is Franklin. I am currently using Nutch 0.7.2 for crawling and
indexing for my search engine.
I read in your message that you can delete a particular page from the
index directly. If so, how is that done? I have been searching hard for
a clue on how to do this.
My requirement is to delete the entries for porn sites from the index
built from my crawled data.
Your help would be greatly appreciated.

Thanks in advance,
Franklin.S
 

ogjunk-nutch wrote:
> 
> Hi Enis,
> 
> Right, I can easily delete the page from the Lucene index, though I'd
> prefer to follow the Nutch protocol and avoid messing something up by
> touching the index directly.  However, I don't want that page to re-appear
> in one of the subsequent fetches.  Well, it won't re-appear, because it
> will remain missing, but it would be great to be able to tell Nutch to
> "forget it" "from everywhere".  Is that doable?
> I could read and re-write the *Db Maps, but that's a lot of IO... just to
> get a couple of URLs erased.  I'd prefer a friendly persuasion where Nutch
> flags a given page as "forget this page as soon as possible" and it just
> happens later on.
> 
> Thanks,
> Otis
>  . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
> Simpy -- http://www.simpy.com/  -  Tag  -  Search  -  Share
> 
> ----- Original Message ----
> From: Enis Soztutar <[EMAIL PROTECTED]>
> To: [EMAIL PROTECTED]
> Sent: Thursday, April 5, 2007 3:29:55 AM
> Subject: Re: [Nutch-general] Removing pages from index immediately
> 
> Since Hadoop's map files are write-once, it is not possible to delete 
> some URLs from the crawldb and linkdb. The only thing you can do is to 
> create the map files once again without the deleted URLs. But running 
> the crawl once more as you suggested seems more appropriate. Deleting 
> documents from the index is just Lucene stuff.
> 
> In your case it seems that every once in a while you crawl the whole 
> site, create the indexes and dbs, and then just throw the old ones 
> out. Between two crawls you can delete the URLs from the index.
> 
> [EMAIL PROTECTED] wrote:
>> Hi,
>>
>> I'd like to be able to immediately remove certain pages from Nutch
>> (index, crawldb, linkdb...).
>> The scenario is that I'm using Nutch to index a single site or a set of
>> internal sites.  Once in a while editors of the site remove a page from
>> the site.  When that happens, I want to update at least the index and
>> ideally crawldb, linkdb, so that people searching the index don't get the
>> missing page in results and end up going there, hitting the 404.
>>
>> I don't think there is a "direct" way to do this with Nutch, is there?
>> If there really is no direct way to do this, I was thinking I'd just put
>> the URL of the recently removed page into the very next fetchlist and
>> then somehow get Nutch to immediately remove that page/URL once it hits a
>> 404.  How does that sound?
>>
>> Is there a way to configure Nutch to delete the page after it gets a 404
>> for it even just once?  I thought I saw the setting for that somewhere a
>> few weeks ago, but now I can't find it.
>>
>> Thanks,
>> Otis
>>  . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
>> Simpy -- http://www.simpy.com/  -  Tag  -  Search  -  Share
>>
>>
>>
>>   
> 
> 
> _______________________________________________
> Nutch-general mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/nutch-general
> 
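
For anyone following the thread, Enis's point that a write-once map file
can only be "edited" by writing a new copy without the unwanted URLs can
be sketched roughly as below. This is a minimal illustration, not Nutch
or Hadoop code: a java.util.TreeMap stands in for a map file, and the
class name WriteOnceRebuild, the method rebuildWithout, the status
strings, and the example.com URLs are all made up for the example.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical stand-in for rebuilding a write-once, sorted URL store.
public class WriteOnceRebuild {

    // Copies every entry of the old store into a fresh one, skipping the
    // URLs to forget. The old store is only read, never modified.
    static SortedMap<String, String> rebuildWithout(SortedMap<String, String> old,
                                                    Set<String> urlsToForget) {
        SortedMap<String, String> fresh = new TreeMap<>();
        for (Map.Entry<String, String> entry : old.entrySet()) {
            if (!urlsToForget.contains(entry.getKey())) {
                fresh.put(entry.getKey(), entry.getValue());
            }
        }
        // Hand back a read-only view so the new copy is also "write once".
        return Collections.unmodifiableSortedMap(fresh);
    }

    public static void main(String[] args) {
        // Assumed sample data: URL -> fetch status.
        SortedMap<String, String> crawlDb = new TreeMap<>();
        crawlDb.put("http://example.com/a", "fetched");
        crawlDb.put("http://example.com/b", "gone");
        crawlDb.put("http://example.com/c", "fetched");

        // URLs collected since the last rebuild that should be forgotten.
        Set<String> forget = new HashSet<>(Arrays.asList("http://example.com/b"));

        SortedMap<String, String> rebuilt = rebuildWithout(crawlDb, forget);
        System.out.println(rebuilt.keySet());
    }
}
```

The IO cost Otis mentions is visible here: every surviving entry is
copied even when only a couple of URLs are dropped, so it probably makes
sense to collect removals and apply them during the next regular rebuild.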

-- 
View this message in context: 
http://www.nabble.com/Re%3A--Nutch-general--Removing-pages-from-index-immediately-tf3530204.html#a10218273
Sent from the Nutch - User mailing list archive at Nabble.com.

