Re: [jira] [Created] (NUTCH-2155) Create a "crawl completeness" utility

Mattmann, Chris A (3980) Sun, 01 Nov 2015 08:06:06 -0800

Hey Aron, it isn’t yet - @MikeJ and @Sujen want to give it a whack?

Cheers,
Chris


++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++





-----Original Message-----
From: Aron Ahmadia <aahma...@continuum.io>
Reply-To: "dev@nutch.apache.org" <dev@nutch.apache.org>
Date: Sunday, November 1, 2015 at 7:01 AM
To: "dev@nutch.apache.org" <dev@nutch.apache.org>
Subject: Re: [jira] [Created] (NUTCH-2155) Create a "crawl completeness" utility

>
>
>
>Is this exposed to the REST API?  I might be able to plot this in memex
>explorer. 
>
>On Sunday, November 1, 2015, Sebastian Nagel (JIRA) <j...@apache.org>
>wrote:
>
>
>     [ https://issues.apache.org/jira/browse/NUTCH-2155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
>
>Sebastian Nagel reopened NUTCH-2155:
>------------------------------------
>
>When running the completion statistics on a CrawlDb, an exception is
>thrown:
>{noformat}
>% nutch crawlcomplete
>usage: CrawlCompletionStats inputDirs outDir host|domain [numOfReducer]
>% nutch crawlcomplete ./crawl/crawldb completion_stats domain
>Exception in thread "main" java.io.FileNotFoundException: File file:.../crawl/crawldb/old/data does not exist
>        at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:511)
>{noformat}
>I had to look into the code to figure out that the parameter
><inputDirs> is expected to be a comma-separated list of CrawlDb sequence
>files. The following command works:
>{noformat}
>% nutch crawlcomplete ./crawl/crawldb/current completion_stats domain
>{noformat}
>All Nutch tools and utils operating on CrawlDb take just the bare path
>without the current/ subdirectory. Shouldn't the crawlcomplete command
>behave the same?
>Passing more than one CrawlDb may be useful sometimes. However, crawls
>(and their dbs) are usually disjoint. If they are not, the completeness
>statistics are probably incorrect due to duplicates.
>
>> Create a "crawl completeness" utility
>> -------------------------------------
>>
>>                 Key: NUTCH-2155
>>                 URL: https://issues.apache.org/jira/browse/NUTCH-2155
>>             Project: Nutch
>>          Issue Type: Improvement
>>          Components: util
>>    Affects Versions: 1.10
>>            Reporter: Michael Joyce
>>            Assignee: Chris A. Mattmann
>>              Labels: memex
>>             Fix For: 1.11
>>
>>
>> I've found it useful to have a tool for dumping some "completeness"
>> information from a crawl, similar to what domainstats does but including
>> fetched and unfetched counts per domain/host. This is especially nice
>> when doing vertical crawls over a few domains, or just to see how much
>> of a host/domain you've covered with your crawl so far.
>
>
>
>--
>This message was sent by Atlassian JIRA
>(v6.3.4#6332)
>
>
>
>
>-- 
>_______________________________
>
>Aron Ahmadia
>
>Computational and Data Scientist
>
>
> <http://continuum.io>
>
>
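
On the path handling raised in the quoted JIRA comment: one way a tool could accept the bare CrawlDb path and still read from current/ is to normalize each entry of the comma-separated input list before use. The sketch below is only an illustration in Python, not Nutch code. The function name resolve_crawldb_inputs and its subdir parameter are hypothetical, and it assumes that no CrawlDb directory is itself named "current".

```python
import posixpath


def resolve_crawldb_inputs(input_dirs, subdir="current"):
    """Hypothetical helper: turn a comma-separated list of CrawlDb paths
    into paths that point at each CrawlDb's current/ subdirectory."""
    resolved = []
    for raw in input_dirs.split(","):
        # Drop surrounding whitespace and any trailing slash.
        path = raw.strip().rstrip("/")
        if not path:
            continue  # skip empty entries such as "a,,b"
        # Append current/ only if the caller did not already supply it.
        if posixpath.basename(path) != subdir:
            path = posixpath.join(path, subdir)
        resolved.append(path)
    return resolved


if __name__ == "__main__":
    print(resolve_crawldb_inputs("./crawl/crawldb"))
    print(resolve_crawldb_inputs("./crawl/crawldb/current"))
```

With this kind of normalization, both invocations shown in the quoted comment (with and without the current/ suffix) would resolve to the same input directory.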
