If you want to drop certain domains after a while, you would still need
to specify them somewhere, no? The urlfilter-domain plugin fits this
scenario well. After updating the crawldb you do indeed need to remove
them from the filter again when you want to crawl them later on.
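As a rough sketch (assuming the plugin's default behavior of only passing URLs whose domain is listed in `conf/domain-urlfilter.txt`; the file contents, crawldb, and segment paths below are placeholders):

```shell
# Hypothetical domain whitelist: only listed domains pass the filter.
# Dropping a domain later is just deleting its line here.
cat > conf/domain-urlfilter.txt <<'EOF'
example.com
example.net
EOF

# Re-run updatedb with filtering enabled so URLs from the
# now-unlisted domains are removed from the crawldb.
bin/nutch updatedb crawl/crawldb crawl/segments/20111116000000 -filter
```

Re-adding a domain later is the reverse: put its line back and it passes the filter again on the next update.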
If you find this process too tedious, perhaps you could reorganize the
way you crawl by using multiple crawl roots. For example, you have one
permanent crawl root (no ad-hoc filtering is done on this one), but on
the side you have ad-hoc crawldbs with crawl segments, all of which feed
the same Solr index. These side crawl roots can be deleted when done.
The hard part is that you have to run multiple Nutch deployments (with
different filtering settings) and you have to keep politeness and
redundancy in mind.
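A sketch of that layout, assuming a local Solr at the usual port; the directory names, segment timestamps, and exact argument order are placeholders, not a tested recipe:

```shell
# Permanent crawl root: no ad-hoc filtering applied.
bin/nutch updatedb crawl-main/crawldb crawl-main/segments/20111116000000
bin/nutch solrindex http://localhost:8983/solr \
  crawl-main/crawldb crawl-main/segments/20111116000000

# Ad-hoc side root with its own (stricter) filter configuration,
# indexing into the SAME Solr core as the permanent root.
bin/nutch updatedb crawl-adhoc/crawldb crawl-adhoc/segments/20111116010000 -filter
bin/nutch solrindex http://localhost:8983/solr \
  crawl-adhoc/crawldb crawl-adhoc/segments/20111116010000

# When the ad-hoc campaign is done, the side root can simply go away.
rm -r crawl-adhoc
```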
The last thing that comes to mind is that you can implement a custom
FetchSchedule. That way you keep all records in the crawldb, but you
decide in the implementation whether and when a URL gets (re)fetched.
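Wiring in such a class is a configuration change; the fragment below belongs inside the `<configuration>` element of `conf/nutch-site.xml`. The class name `com.example.SelectiveFetchSchedule` is a placeholder for your own implementation:

```shell
# Write the property fragment to a scratch file; paste it inside
# <configuration> in conf/nutch-site.xml (appending it after the
# closing tag would produce invalid XML).
cat > /tmp/fetch-schedule-fragment.xml <<'EOF'
<property>
  <name>db.fetch.schedule.class</name>
  <value>com.example.SelectiveFetchSchedule</value>
</property>
EOF
```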
On 11/16/2011 07:28 AM, Sudip Datta wrote:
This technique looks fine if filtering is to be done once (or to avoid
crawling certain URLs which have been erroneously injected into the
crawldb). But is it the right way to go in larger crawls? In a scenario
where certain domains are periodically injected to be crawled and/or
removed because they are no longer required (though they could be
reintroduced at a later stage), won't managing regex-urlfilter.txt
become tedious as the filters accumulate?
Isn't there a way to completely remove URLs (from certain domains) from
the crawldb when they are no longer required to be crawled, and to
inject them again at a later stage if need be?
Thanks,
--Sudip.
On Fri, Nov 11, 2011 at 2:00 AM, Markus Jelsma
<[email protected]> wrote:
Uh, the filter checker immediately produces output.
Interesting. What kind of output should I expect to see? So far it's been
running for a while with no output.
On Thu, Nov 10, 2011 at 1:51 PM, Markus Jelsma
<[email protected]> wrote:
You can use bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined
to test.
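A minimal invocation (assuming a local Nutch install with `bin/nutch` available): the checker reads candidate URLs from stdin and echoes each one back prefixed with `+` (accepted by the filters) or `-` (rejected), so with no input on stdin it will sit there silently.

```shell
# Pipe test URLs into the filter checker; it reads from stdin.
# With a "-^http://example\.org/" rule in regex-urlfilter.txt,
# the example.org URL should come back rejected.
printf '%s\n' \
  'http://example.org/page' \
  'http://example.com/page' \
| bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined
```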
Okay. So I would just put that above the +. line, right?
Thanks.
On Thu, Nov 10, 2011 at 10:42 AM, Markus Jelsma
<[email protected]> wrote:
If I want to remove example.org from my CrawlDB using regex filters,
I'll add:

-^http://example\.org/

and run updatedb with filtering enabled. The URLs will then be deleted.
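The effect of that rule can be previewed outside Nutch with plain grep, since the filter line is an ordinary regex with a leading `-` (reject) marker. The URL list below is hypothetical:

```shell
# urls.txt stands in for a handful of crawldb entries (hypothetical).
printf '%s\n' \
  'http://example.org/page1' \
  'http://example.com/page2' \
  'http://example.org/sub/page3' > urls.txt

# Same pattern as the "-" filter rule above: everything it matches
# is dropped; everything else survives the update.
grep -vE '^http://example\.org/' urls.txt
```

Only the example.com URL survives, which is exactly what updatedb with filtering would keep.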
On Thursday 10 November 2011 16:36:24 Bai Shen wrote:
Can you give me an example of how I would set my URL filter to do this?
Right now I'm just using the default.
On Mon, Oct 31, 2011 at 3:47 PM, Markus Jelsma
<[email protected]> wrote:
Hi
Write a regex URL filter and use it the next time you update the db;
the URLs will disappear. Be sure to back up the db first in case your
regex catches valid URLs. Nutch 1.5 will have an option to keep the
previous version of the DB after an update.
cheers
We accidentally injected some URLs into the crawl database and I need
to go remove them. From what I understand, in 1.4 I can view and
modify the URLs and indexes. But I can't seem to find any information
on how to do this.

Is there anything regarding this available?
--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350