mapred crawl

2005-11-23 Thread Anton Potehin
We used Nutch for whole-web crawling. In an infinite loop we run these tasks:
1) bin/nutch generate db segmentsPath -topN 1
2) bin/nutch fetch segment name
3) bin/nutch updatedb db segment name
4) bin/nutch analyze db segment name
5) bin/nutch index segment name
6) bin/nutch dedup segments

Re: svn commit: r348431 - in /lucene/nutch/branches/mapred/src/java/org/apache/nutch/crawl: CrawlDatum.java CrawlDbReader.java

2005-11-23 Thread Andrzej Bialecki
Sami Siren wrote:
    + if (k.contains(score)) {
    Since: 1.5

Ah, indeed. Fixed - thanks!

--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web, Embedded Unix,
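
For readers following the review comment: "Since: 1.5" presumably refers to `String.contains`, which was added in Java 1.5 and so is unavailable on older JDKs. The thread does not show the actual fix. The following is only a minimal sketch of one pre-1.5 equivalent, assuming both operands are plain strings; the class and method names are hypothetical and not part of Nutch.

```java
// Hypothetical helper for illustration; not the change committed to Nutch.
final class SubstringCompat {
    private SubstringCompat() {}

    /**
     * Returns true when needle occurs somewhere in haystack.
     * Uses String.indexOf(String), which predates Java 1.5,
     * instead of String.contains(CharSequence).
     */
    static boolean containsSubstring(String haystack, String needle) {
        // indexOf returns -1 when the substring is absent.
        return haystack.indexOf(needle) >= 0;
    }

    public static void main(String[] args) {
        System.out.println(containsSubstring("avg score", "score"));
        System.out.println(containsSubstring("status 1", "score"));
    }
}
```

The `indexOf` form behaves the same as `contains` for string arguments, so a check like the quoted one could be swapped without changing the surrounding logic.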

Re: svn commit: r348431 - in /lucene/nutch/branches/mapred/src/java/org/apache/nutch/crawl: CrawlDatum.java CrawlDbReader.java

2005-11-23 Thread Doug Cutting
[EMAIL PROTECTED] wrote:
    Implement a reader for CrawlDB, loosely inspired by NUTCH-114 (thanks Stefan!). The reader offers similar functionality to the classic readdb command.

This looks great! Thanks, Andrzej.

I just ran it on a 50M page crawl. It took longer than I expected. The reduce

Re: svn commit: r348431 - in /lucene/nutch/branches/mapred/src/java/org/apache/nutch/crawl: CrawlDatum.java CrawlDbReader.java

2005-11-23 Thread Doug Cutting
Doug Cutting wrote:
    I just ran it on a 50M page crawl.

FYI, here's the output:

051123 191703 TOTAL urls: 167780785
051123 191703 avg score:1.152
051123 191703 max score:47357.137
051123 191703 min score:1.0
051123 191703 retry 0: 167780785
051123 191703 status 1
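
The statistics lines quoted above share a simple layout: two timestamp tokens, then a label and a value separated by a colon. As a rough illustration of how such output could be collected for comparison between runs, here is a small sketch. The class name and the parsing rules are assumptions inferred only from the lines shown in this message, not from the CrawlDbReader source.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical parser for illustration; the line layout is inferred from the quoted output.
final class CrawlDbStatsParser {
    private CrawlDbStatsParser() {}

    /** Maps each "label:value" entry to its value, keeping the original order. */
    static Map<String, String> parse(String[] lines) {
        Map<String, String> stats = new LinkedHashMap<>();
        for (String line : lines) {
            // Split off the two leading timestamp tokens; the rest is "label: value".
            String[] parts = line.trim().split("\\s+", 3);
            if (parts.length < 3) {
                continue; // not a statistics line
            }
            String rest = parts[2];
            int colon = rest.lastIndexOf(':');
            if (colon < 0) {
                continue; // no value on this line (for example, a truncated entry)
            }
            stats.put(rest.substring(0, colon).trim(), rest.substring(colon + 1).trim());
        }
        return stats;
    }

    public static void main(String[] args) {
        String[] lines = {
            "051123 191703 TOTAL urls: 167780785",
            "051123 191703 avg score:1.152",
            "051123 191703 max score:47357.137",
        };
        for (Map.Entry<String, String> e : parse(lines).entrySet()) {
            System.out.println(e.getKey() + " = " + e.getValue());
        }
    }
}
```

Values are kept as strings so the caller can decide whether a given entry is a count or a floating-point score.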