If you want to drop certain domains after a while, you would still need
to specify them somewhere, no? The urlfilter-domain plugin fits this
scenario well. After updating the crawldb you do indeed need to remove
them from the filter again when you want to crawl them later on.
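As a rough sketch (assuming the plugin's default behavior of only passing URLs whose domain is listed in `conf/domain-urlfilter.txt`; the file contents, crawldb, and segment paths below are placeholders):

```shell
# Hypothetical domain whitelist: only listed domains pass the filter.
# Dropping a domain later is just deleting its line here.
cat > conf/domain-urlfilter.txt <<'EOF'
example.com
example.net
EOF

# Re-run updatedb with filtering enabled so URLs from the
# now-unlisted domains are removed from the crawldb.
bin/nutch updatedb crawl/crawldb crawl/segments/20111116000000 -filter
```

Re-adding a domain later is the reverse: put its line back and it passes the filter again on the next update.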
If you find this process too tedious, perhaps you could reorganize the
way you crawl by using multiple crawl roots. For example, you have one
permanent crawl root (no ad-hoc filtering is done on this one), but on
the side you have ad-hoc crawldbs with crawl segments, all of which feed
the same Solr index. These side crawl roots can be deleted when done.
The hard part is that you have to run multiple Nutch deployments (with
different filtering settings) and you have to keep politeness and
redundancy in mind.
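A sketch of that layout, assuming a local Solr at the usual port; the directory names, segment timestamps, and exact argument order are placeholders, not a tested recipe:

```shell
# Permanent crawl root: no ad-hoc filtering applied.
bin/nutch updatedb crawl-main/crawldb crawl-main/segments/20111116000000
bin/nutch solrindex http://localhost:8983/solr \
  crawl-main/crawldb crawl-main/segments/20111116000000

# Ad-hoc side root with its own (stricter) filter configuration,
# indexing into the SAME Solr core as the permanent root.
bin/nutch updatedb crawl-adhoc/crawldb crawl-adhoc/segments/20111116010000 -filter
bin/nutch solrindex http://localhost:8983/solr \
  crawl-adhoc/crawldb crawl-adhoc/segments/20111116010000

# When the ad-hoc campaign is done, the side root can simply go away.
rm -r crawl-adhoc
```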
The last thing that comes to mind is that you can implement a custom
FetchSchedule. That way you keep all records in the crawldb, but you
decide in the implementation whether and when a URL gets (re)fetched.
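Wiring in such a class is a configuration change; the fragment below belongs inside the `<configuration>` element of `conf/nutch-site.xml`. The class name `com.example.SelectiveFetchSchedule` is a placeholder for your own implementation:

```shell
# Write the property fragment to a scratch file; paste it inside
# <configuration> in conf/nutch-site.xml (appending it after the
# closing tag would produce invalid XML).
cat > /tmp/fetch-schedule-fragment.xml <<'EOF'
<property>
  <name>db.fetch.schedule.class</name>
  <value>com.example.SelectiveFetchSchedule</value>
</property>
EOF
```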
On 11/16/2011 07:28 AM, Sudip Datta wrote:
This technique looks fine if filtering is to be done once (or to avoid
crawling certain URLs which have been erroneously injected into the
crawldb). But is it the right way to go in larger crawls? In a scenario
where certain domains are periodically injected to be crawled and/or
removed because they are no longer required (though they could be
reintroduced at a later stage), won't managing regex-urlfilter.txt
become tedious as the filters accumulate?
Isn't there a way to completely remove URLs (from certain domains) from
the crawldb when they are no longer required to be crawled, and to
inject them again at a later stage if need be?
Thanks,
--Sudip.
On Fri, Nov 11, 2011 at 2:00 AM, Markus Jelsma
<[email protected]> wrote:
Uh, the filter checker immediately produces output.
Interesting. What kind of output should I expect to see? So far it's been
running for a while with no output.
On Thu, Nov 10, 2011 at 1:51 PM, Markus Jelsma
<[email protected]> wrote:
You can use bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined
to test.
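A minimal invocation (assuming a local Nutch install with `bin/nutch` available): the checker reads candidate URLs from stdin and echoes each one back prefixed with `+` (accepted by the filters) or `-` (rejected), so with no input on stdin it will sit there silently.

```shell
# Pipe test URLs into the filter checker; it reads from stdin.
# With a "-^http://example\.org/" rule in regex-urlfilter.txt,
# the example.org URL should come back rejected.
printf '%s\n' \
  'http://example.org/page' \
  'http://example.com/page' \
| bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined
```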
Okay. So I would just put that above the +. line, right?
Thanks.
On Thu, Nov 10, 2011 at 10:42 AM, Markus Jelsma
<[email protected]> wrote:
If I want to remove example.org from my CrawlDB using regex filters,
I'll add:

-^http://example\.org/

and run updatedb with filtering enabled. The URLs will then be deleted.
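The effect of that rule can be previewed outside Nutch with plain grep, since the filter line is an ordinary regex with a leading `-` (reject) marker. The URL list below is hypothetical:

```shell
# urls.txt stands in for a handful of crawldb entries (hypothetical).
printf '%s\n' \
  'http://example.org/page1' \
  'http://example.com/page2' \
  'http://example.org/sub/page3' > urls.txt

# Same pattern as the "-" filter rule above: everything it matches
# is dropped; everything else survives the update.
grep -vE '^http://example\.org/' urls.txt
```

Only the example.com URL survives, which is exactly what updatedb with filtering would keep.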
On Thursday 10 November 2011 16:36:24 Bai Shen wrote:
Can you give me an example of how I would set my URL filter to do this?
Right now I'm just using the default.
On Mon, Oct 31, 2011 at 3:47 PM, Markus Jelsma
<[email protected]> wrote:
Hi
Write a regex URL filter and use it the next time you update the db;
the URLs will disappear. Be sure to back up the db first in case your
regex catches valid URLs. Nutch 1.5 will have an option to keep the
previous version of the DB after an update.
cheers
We accidentally injected some URLs into the crawl database and I need
to go remove them. From what I understand, in 1.4 I can view and
modify the URLs and indexes. But I can't seem to find any information
on how to do this.

Is there anything regarding this available?
--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350