RE: purging low-scoring urls

2017-12-04 Thread Yossi Tamari
Forgot to say: a urlfilter can't do that, since its input is just the URL, 
without any metadata such as the score.

> -----Original Message-----
> From: Yossi Tamari [mailto:yossi.tam...@pipl.com]
> Sent: 04 December 2017 21:01
> To: user@nutch.apache.org; 'Michael Coffey' 
> Subject: RE: purging low-scoring urls
> 
> Hi Michael,
> 
> I think one way you can do it is using `readdb <crawldb> -dump new_crawldb
> -format crawldb -expr "score>0.03"`.
> You would then need to use hdfs commands to replace the existing
> <crawldb>/current with new_crawldb.
> Of course, I strongly recommend backing up the current crawldb before
> replacing it...
> 
>   Yossi.
> 
> > -----Original Message-----
> > From: Michael Coffey [mailto:mcof...@yahoo.com.INVALID]
> > Sent: 04 December 2017 20:38
> > To: User 
> > Subject: purging low-scoring urls
> >
> > Is it possible to purge low-scoring urls from the crawldb? My news crawl has
> > many thousands of zero-scoring urls and also many thousands of urls with
> > scores less than 0.03. These urls will never be fetched because they will
> > never make it into the generator's topN by score. So, all they do is make
> > the process slower.
> >
> > It seems like something an urlfilter could do, but I have not found any
> > documentation for any urlfilter that does it.




RE: purging low-scoring urls

2017-12-04 Thread Yossi Tamari
Hi Michael,

I think one way you can do it is using `readdb <crawldb> -dump new_crawldb
-format crawldb -expr "score>0.03"`.
You would then need to use hdfs commands to replace the existing
<crawldb>/current with new_crawldb.
Of course, I strongly recommend backing up the current crawldb before replacing 
it...
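
For illustration, the whole sequence might look roughly like the sketch below
(crawl/crawldb and new_crawldb are placeholder paths, and the exact layout of
the dumped output can vary, so verify everything before deleting anything):

  # Sketch only; crawl/crawldb and new_crawldb are placeholder paths.
  # 1. Back up the existing crawldb first.
  hdfs dfs -cp crawl/crawldb crawl/crawldb.bak

  # 2. Dump only the entries scoring above the threshold, in crawldb format.
  bin/nutch readdb crawl/crawldb -dump new_crawldb -format crawldb \
      -expr "score>0.03"

  # 3. Swap the filtered data in as the new 'current' directory.
  hdfs dfs -rm -r crawl/crawldb/current
  hdfs dfs -mv new_crawldb crawl/crawldb/current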

Yossi. 

> -----Original Message-----
> From: Michael Coffey [mailto:mcof...@yahoo.com.INVALID]
> Sent: 04 December 2017 20:38
> To: User 
> Subject: purging low-scoring urls
> 
> Is it possible to purge low-scoring urls from the crawldb? My news crawl has
> many thousands of zero-scoring urls and also many thousands of urls with
> scores less than 0.03. These urls will never be fetched because they will
> never make it into the generator's topN by score. So, all they do is make
> the process slower.
> 
> It seems like something an urlfilter could do, but I have not found any
> documentation for any urlfilter that does it.



crawlcomplete

2017-12-04 Thread Yossi Tamari
Hi,

I'm trying to understand some of the design decisions behind the
crawlcomplete tool. I find the concept itself very useful, but there are a
couple of behaviors that I don't understand:

1. URLs that resulted in a redirect (even a permanent one) are counted as
unfetched. That means that if I had a crawl with only one URL, and that URL
returned a redirect which was then fetched successfully, I would see 1 FETCHED
and 1 UNFETCHED in crawlcomplete, with no inherent way to know that my crawl
is actually 100% complete. My expectation would be for URLs that resulted in a
redirection either not to be counted at all (since they have been replaced by
new URLs) or to be counted in a separate group (which can then be ignored).

2. URLs that are db_gone are also counted as unfetched. It seems to me that
these URLs were "successfully" crawled: it is the reality of the web that
pages disappear over time, and knowing that this happened is useful. These
URLs do not need to be crawled again, so they should not be counted as
unfetched. I can see why counting them as FETCHED would be confusing, so maybe
the group names should be changed (COMPLETE and INCOMPLETE), or a new group
(GONE) added.

Are there good reasons for the current behavior?
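
In the meantime, a rough way to approximate the grouping described above is to
post-process the per-status counts from `readdb -stats` instead of relying on
crawlcomplete. A minimal sketch (crawl/crawldb is a placeholder path, and the
exact status labels in the -stats output may differ between Nutch versions):

  # Sketch only: bucket the readdb -stats counts into the proposed groups,
  # treating redirects as complete and listing db_gone separately.
  bin/nutch readdb crawl/crawldb -stats > stats.txt

  complete=$(grep -E 'db_fetched|db_notmodified|db_redir' stats.txt \
             | awk '{s += $NF} END {print s+0}')
  gone=$(grep 'db_gone' stats.txt | awk '{s += $NF} END {print s+0}')
  incomplete=$(grep 'db_unfetched' stats.txt | awk '{s += $NF} END {print s+0}')

  echo "COMPLETE:   $complete"
  echo "GONE:       $gone"
  echo "INCOMPLETE: $incomplete"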



   Yossi.



purging low-scoring urls

2017-12-04 Thread Michael Coffey
Is it possible to purge low-scoring urls from the crawldb? My news crawl has 
many thousands of zero-scoring urls and also many thousands of urls with scores 
less than 0.03. These urls will never be fetched because they will never make 
it into the generator's topN by score. So, all they do is make the process 
slower.

It seems like something an urlfilter could do, but I have not found any 
documentation for any urlfilter that does it.
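
For context, the cutoff referred to above is the generate step's -topN limit.
An illustrative invocation (paths and the topN value here are placeholders)
might look like:

  # Illustrative only; paths and the topN value are placeholders.
  # Only the topN highest-scoring URLs are selected for the next fetch cycle,
  # so entries scoring below the cutoff never get generated.
  bin/nutch generate crawl/crawldb crawl/segments -topN 50000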