Hey Aron,

It isn’t yet. @MikeJ and @Sujen, want to give it a whack?

Cheers,
Chris
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

-----Original Message-----
From: Aron Ahmadia <aahma...@continuum.io>
Reply-To: "dev@nutch.apache.org" <dev@nutch.apache.org>
Date: Sunday, November 1, 2015 at 7:01 AM
To: "dev@nutch.apache.org" <dev@nutch.apache.org>
Subject: Re: [jira] [Created] (NUTCH-2155) Create a "crawl completeness" utility

>Is this exposed to the REST API? I might be able to plot this in memex
>explorer.
>
>On Sunday, November 1, 2015, Sebastian Nagel (JIRA) <j...@apache.org>
>wrote:
>
>[ https://issues.apache.org/jira/browse/NUTCH-2155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
>
>Sebastian Nagel reopened NUTCH-2155:
>------------------------------------
>
>When running the completion statistics on a CrawlDb, an exception is
>thrown
>{noformat}
>% nutch crawlcomplete
>usage: CrawlCompletionStats inputDirs outDir host|domain [numOfReducer]
>% nutch crawlcomplete ./crawl/crawldb completion_stats domain
>Exception in thread "main" java.io.FileNotFoundException: File file:.../crawl/crawldb/old/data does not exist
>        at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:511)
>{noformat}
>I had to take a look into the code to figure out that the parameter
><inputdirs> is expected as comma-separated list of CrawlDb sequence
>files.
>The following command works:
>{noformat}
>% nutch crawlcomplete ./crawl/crawldb/current completion_stats domain
>{noformat}
>All Nutch tools and utils operating on CrawlDb take just the bare path
>without the current/ subdirectory. Shouldn't the crawlcomplete command
>behave the same?
>To pass more than one CrawlDb may be useful sometimes. However, usually
>crawls (and their dbs) are disjoint. If they are not the completeness
>statistics are probably not correct due to duplicates.
>
>> Create a "crawl completeness" utility
>> -------------------------------------
>>
>> Key: NUTCH-2155
>> URL: https://issues.apache.org/jira/browse/NUTCH-2155
>> Project: Nutch
>> Issue Type: Improvement
>> Components: util
>> Affects Versions: 1.10
>> Reporter: Michael Joyce
>> Assignee: Chris A. Mattmann
>> Labels: memex
>> Fix For: 1.11
>>
>>
>> I've found it useful to have a tool for dumping some "completeness"
>> information from a crawl similar to how domainstats does but including
>> fetched and unfetched counts per domain/host. This is especially nice
>> when doing vertical crawls over a few domains or just to see how much
>> of a host/domain you've covered with your crawl so far.
>
>--
>This message was sent by Atlassian JIRA
>(v6.3.4#6332)
>
>--
>_______________________________
>
>Aron Ahmadia
>
>Computational and Data Scientist
>
> <http://continuum.io>