We use Nutch for whole-web crawling, running the following tasks in an infinite loop (see the script sketch after the list):
1) bin/nutch generate db <segments_dir> -topN 1
2) bin/nutch fetch <segment>
3) bin/nutch updatedb db <segment>
4) bin/nutch analyze db <segment>
5) bin/nutch index <segment>
6) bin/nutch dedup segments
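A minimal sketch of that loop as a shell script; the db/ and segments/ paths and the way the newest segment is picked are assumptions, and exact argument forms vary between Nutch versions:

#!/bin/sh
# endless whole-web crawl cycle; paths are assumptions
while true; do
  # generate writes a new fetchlist segment under segments/
  bin/nutch generate db segments -topN 1
  # segment directories are named by timestamp, so the newest sorts last
  segment=segments/`ls segments | sort | tail -1`
  bin/nutch fetch "$segment"
  bin/nutch updatedb db "$segment"
  bin/nutch analyze db "$segment"
  bin/nutch index "$segment"
  bin/nutch dedup segments
done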
Sami Siren wrote:
> + if (k.contains(score)) {
> Since: 1.5
Ah, indeed. Fixed - thanks!
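(For context: "Since: 1.5" is presumably the javadoc tag of String.contains(CharSequence), which only exists on Java 5, while Nutch at the time still targeted Java 1.4; a 1.4-compatible form of the check would be k.indexOf(score) != -1.)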
--
Best regards,
Andrzej Bialecki
[EMAIL PROTECTED] wrote:
> Implement a reader for CrawlDB, loosely inspired by NUTCH-114 (thanks Stefan!).
> The reader offers similar functionality to the classic readdb command.
This looks great! Thanks, Andrzej.
I just ran it on a 50M page crawl. It took longer than I expected. The reduce
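If the new reader keeps the classic command's usage, the statistics quoted below would come from an invocation like this; the -stats option and the db path are assumptions:

bin/nutch readdb db -stats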
Doug Cutting wrote:
> I just ran it on a 50M page crawl.
FYI, here's the output:
051123 191703 TOTAL urls: 167780785
051123 191703 avg score: 1.152
051123 191703 max score: 47357.137
051123 191703 min score: 1.0
051123 191703 retry 0: 167780785
051123 191703 status 1