[jira] [Commented] (NUTCH-2155) Create a "crawl completeness" utility

2015-11-01 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14984806#comment-14984806
 ] 

Markus Jelsma commented on NUTCH-2155:
--

By `remove current` and `not require current`, I assume you mean not having it 
as an argument, i.e. crawl/crawldb rather than crawl/crawldb/current?


> Create a "crawl completeness" utility
> -
>
> Key: NUTCH-2155
> URL: https://issues.apache.org/jira/browse/NUTCH-2155
> Project: Nutch
>  Issue Type: Improvement
>  Components: util
>Affects Versions: 1.10
>Reporter: Michael Joyce
>Assignee: Chris A. Mattmann
>  Labels: memex
> Fix For: 1.11
>
>
> I've found it useful to have a tool for dumping some "completeness" 
> information from a crawl similar to how domainstats does but including 
> fetched and unfetched counts per domain/host. This is especially nice when 
> doing vertical crawls over a few domains or just to see how much of a 
> host/domain you've covered with your crawl so far.
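
The per-host tally described above can be sketched as follows. This is a minimal illustration, not the CrawlCompletionStats implementation: the class name, the `Counts` holder, and the boolean "fetched" flag (standing in for the CrawlDb status) are assumptions made for the example, and it runs on an in-memory map rather than on a real CrawlDb.

```java
import java.net.URI;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

class CompletenessSketch {
    /** Fetched/unfetched tallies for one host (hypothetical holder type). */
    static final class Counts {
        int fetched;
        int unfetched;
    }

    /**
     * Tallies fetched vs. unfetched URLs per host.
     * The Boolean value is an assumed stand-in for the CrawlDb fetch status.
     * URI.create throws IllegalArgumentException on malformed URLs.
     */
    static Map<String, Counts> countByHost(Map<String, Boolean> urlToFetched) {
        Map<String, Counts> result = new TreeMap<>();
        for (Map.Entry<String, Boolean> e : urlToFetched.entrySet()) {
            String host = URI.create(e.getKey()).getHost();
            if (host == null) {
                continue; // skip URLs without a host component
            }
            Counts c = result.computeIfAbsent(host, h -> new Counts());
            if (e.getValue()) {
                c.fetched++;
            } else {
                c.unfetched++;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Boolean> db = new LinkedHashMap<>();
        db.put("http://example.org/a", true);
        db.put("http://example.org/b", false);
        db.put("http://example.com/", true);
        // Print one line per host: host, fetched count, unfetched count.
        countByHost(db).forEach((host, c) ->
            System.out.println(host + "\t" + c.fetched + "\t" + c.unfetched));
    }
}
```

Grouping by domain instead of host would only change how the key is derived from the URL.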



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Reopened] (NUTCH-1911) Improve DomainStatistics tool command line parsing

2015-11-01 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel reopened NUTCH-1911:


The commit has been lost.

> Improve DomainStatistics tool command line parsing
> ---
>
> Key: NUTCH-1911
> URL: https://issues.apache.org/jira/browse/NUTCH-1911
> Project: Nutch
>  Issue Type: Bug
>  Components: util
>Affects Versions: 1.9, 2.2.1
>Reporter: Lewis John McGibbney
>Assignee: Chris A. Mattmann
>Priority: Trivial
> Fix For: 1.10
>
>
> The DomainStatistics tool could be improved based on the comments raised 
> in [this mail 
> thread|http://www.mail-archive.com/user%40nutch.apache.org/msg13028.html].
> For convenience, I've also pasted them below:
> {quote}
> You cannot just tell it where the crawldb is, you need to tell it where the 
> directory is, so specifying current is ok, but not part-*
> {quote}
> Patch should be trivial work





[jira] [Commented] (NUTCH-2155) Create a "crawl completeness" utility

2015-11-01 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14984542#comment-14984542
 ] 

Sebastian Nagel commented on NUTCH-2155:


I would opt to make the "crawlcomplete" utility consistent with "readdb", 
"generate", etc., i.e. without current/.
The main point: it must be obvious how to use a tool. This was already an issue 
with "domainstats", where it was solved by adding appropriate 
command-line help (NUTCH-1911; sorry, [~jo...@apache.org], I missed this). 
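
As an illustration of the "obvious how to use" point, here is a minimal sketch of argument checking that fails early with a usage message. The first usage line is the one the tool prints in this thread; the class name, the extra help lines, and the default of one reducer are assumptions for the example, not the actual Nutch code.

```java
import java.util.Arrays;
import java.util.List;

class CrawlCompleteArgsSketch {
    // First line as printed by the tool in this thread; the remaining help lines are hypothetical.
    static final String USAGE =
        "usage: CrawlCompletionStats inputDirs outDir host|domain [numOfReducer]\n"
      + "  inputDirs     comma-separated list of input paths\n"
      + "  outDir        output directory\n"
      + "  host|domain   aggregation mode\n"
      + "  numOfReducer  optional number of reducers";

    final List<String> inputDirs;
    final String outDir;
    final String mode;
    final int numReducers;

    private CrawlCompleteArgsSketch(List<String> inputDirs, String outDir, String mode, int numReducers) {
        this.inputDirs = inputDirs;
        this.outDir = outDir;
        this.mode = mode;
        this.numReducers = numReducers;
    }

    /** Validates the arguments and throws IllegalArgumentException carrying the usage text. */
    static CrawlCompleteArgsSketch parse(String[] args) {
        if (args.length < 3 || args.length > 4) {
            throw new IllegalArgumentException(USAGE);
        }
        String mode = args[2];
        if (!mode.equals("host") && !mode.equals("domain")) {
            throw new IllegalArgumentException("unknown mode '" + mode + "'\n" + USAGE);
        }
        int reducers = 1; // assumed default when numOfReducer is omitted
        if (args.length == 4) {
            try {
                reducers = Integer.parseInt(args[3]);
            } catch (NumberFormatException e) {
                throw new IllegalArgumentException("numOfReducer must be an integer\n" + USAGE);
            }
        }
        return new CrawlCompleteArgsSketch(Arrays.asList(args[0].split(",")), args[1], mode, reducers);
    }

    public static void main(String[] args) {
        try {
            CrawlCompleteArgsSketch parsed = parse(args);
            System.out.println("inputs=" + parsed.inputDirs + " out=" + parsed.outDir
                + " mode=" + parsed.mode + " reducers=" + parsed.numReducers);
        } catch (IllegalArgumentException e) {
            System.err.println(e.getMessage());
        }
    }
}
```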

> Create a "crawl completeness" utility
> -
>
> Key: NUTCH-2155
> URL: https://issues.apache.org/jira/browse/NUTCH-2155
> Project: Nutch
>  Issue Type: Improvement
>  Components: util
>Affects Versions: 1.10
>Reporter: Michael Joyce
>Assignee: Chris A. Mattmann
>  Labels: memex
> Fix For: 1.11
>
>
> I've found it useful to have a tool for dumping some "completeness" 
> information from a crawl similar to how domainstats does but including 
> fetched and unfetched counts per domain/host. This is especially nice when 
> doing vertical crawls over a few domains or just to see how much of a 
> host/domain you've covered with your crawl so far.





Re: [jira] [Created] (NUTCH-2155) Create a "crawl completeness" utility

2015-11-01 Thread Mattmann, Chris A (3980)
Hey Aron, it isn’t yet - @MikeJ and @Sujen want to give it a whack?

Cheers,
Chris

++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++





-Original Message-
From: Aron Ahmadia 
Reply-To: "dev@nutch.apache.org" 
Date: Sunday, November 1, 2015 at 7:01 AM
To: "dev@nutch.apache.org" 
Subject: Re: [jira] [Created] (NUTCH-2155) Create a "crawl completeness" utility

>
>
>
>Is this exposed to the REST API?  I might be able to plot this in memex
>explorer. 
>
>On Sunday, November 1, 2015, Sebastian Nagel (JIRA) wrote:
>
>
> [ 
>https://issues.apache.org/jira/browse/NUTCH-2155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
> ]
>
>Sebastian Nagel reopened NUTCH-2155:
>
>
>When running the completion statistics on a CrawlDb, an exception is
>thrown
>{noformat}
>% nutch crawlcomplete
>usage: CrawlCompletionStats inputDirs outDir host|domain [numOfReducer]
>% nutch crawlcomplete ./crawl/crawldb completion_stats domain
>Exception in thread "main" java.io.FileNotFoundException: File
>file:.../crawl/crawldb/old/data does not exist
>at 
>org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:511)
>{noformat}
>I had to take a look into the code to figure out that the inputDirs
>parameter is expected as a comma-separated list of CrawlDb sequence
>files. The following command works:
>{noformat}
>% nutch crawlcomplete ./crawl/crawldb/current completion_stats domain
>{noformat}
>All Nutch tools and utils operating on CrawlDb take just the bare path
>without the current/ subdirectory. Shouldn't the crawlcomplete command
>behave the same?
>Passing more than one CrawlDb may be useful sometimes. However, crawls
>(and their dbs) are usually disjoint. If they are not, the completeness
>statistics are probably incorrect due to duplicates.
>
>> Create a "crawl completeness" utility
>> -
>>
>> Key: NUTCH-2155
>> URL: https://issues.apache.org/jira/browse/NUTCH-2155
>> Project: Nutch
>>  Issue Type: Improvement
>>  Components: util
>>Affects Versions: 1.10
>>Reporter: Michael Joyce
>>Assignee: Chris A. Mattmann
>>  Labels: memex
>> Fix For: 1.11
>>
>>
>> I've found it useful to have a tool for dumping some "completeness"
>> information from a crawl similar to how domainstats does but including
>> fetched and unfetched counts per domain/host. This is especially nice
>> when doing vertical crawls over a few domains or just to see how much of
>> a host/domain you've covered with your crawl so far.
>
>
>
>
>
>
>
>-- 
>___
>
>Aron Ahmadia
>
>Computational and Data Scientist
>
>
> 
>
>



[jira] [Commented] (NUTCH-2150) Add ProtocolStatus Utility

2015-11-01 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14984433#comment-14984433
 ] 

Chris A. Mattmann commented on NUTCH-2150:
--

Again - the solution here is to remove the need for current?

> Add ProtocolStatus Utility
> --
>
> Key: NUTCH-2150
> URL: https://issues.apache.org/jira/browse/NUTCH-2150
> Project: Nutch
>  Issue Type: Improvement
>  Components: util
>Affects Versions: 1.10
>Reporter: Michael Joyce
>Assignee: Chris A. Mattmann
> Fix For: 1.11
>
>
> It would be nice to have a utility for dumping protocol status code 
> information for a crawl database. This will be a utility for getting a dump 
> of the protocol status codes that builds off of NUTCH-2129





[jira] [Commented] (NUTCH-2155) Create a "crawl completeness" utility

2015-11-01 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14984432#comment-14984432
 ] 

Chris A. Mattmann commented on NUTCH-2155:
--

Seb, shall we update it not to require current and then move forward? Thoughts? 
[~mjoyce]?

> Create a "crawl completeness" utility
> -
>
> Key: NUTCH-2155
> URL: https://issues.apache.org/jira/browse/NUTCH-2155
> Project: Nutch
>  Issue Type: Improvement
>  Components: util
>Affects Versions: 1.10
>Reporter: Michael Joyce
>Assignee: Chris A. Mattmann
>  Labels: memex
> Fix For: 1.11
>
>
> I've found it useful to have a tool for dumping some "completeness" 
> information from a crawl similar to how domainstats does but including 
> fetched and unfetched counts per domain/host. This is especially nice when 
> doing vertical crawls over a few domains or just to see how much of a 
> host/domain you've covered with your crawl so far.





Re: [jira] [Created] (NUTCH-2155) Create a "crawl completeness" utility

2015-11-01 Thread Aron Ahmadia
Is this exposed to the REST API?  I might be able to plot this in memex
explorer.

On Sunday, November 1, 2015, Sebastian Nagel (JIRA) wrote:

>
>  [
> https://issues.apache.org/jira/browse/NUTCH-2155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
> ]
>
> Sebastian Nagel reopened NUTCH-2155:
> 
>
> When running the completion statistics on a CrawlDb, an exception is thrown
> {noformat}
> % nutch crawlcomplete
> usage: CrawlCompletionStats inputDirs outDir host|domain [numOfReducer]
> % nutch crawlcomplete ./crawl/crawldb completion_stats domain
> Exception in thread "main" java.io.FileNotFoundException: File
> file:.../crawl/crawldb/old/data does not exist
> at
> org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:511)
> {noformat}
> I had to take a look into the code to figure out that the inputDirs
> parameter is expected as a comma-separated list of CrawlDb sequence files.
> The following command works:
> {noformat}
> % nutch crawlcomplete ./crawl/crawldb/current completion_stats domain
> {noformat}
> All Nutch tools and utils operating on CrawlDb take just the bare path
> without the current/ subdirectory. Shouldn't the crawlcomplete command
> behave the same?
> Passing more than one CrawlDb may be useful sometimes. However, crawls
> (and their dbs) are usually disjoint. If they are not, the completeness
> statistics are probably incorrect due to duplicates.
>
> > Create a "crawl completeness" utility
> > -
> >
> > Key: NUTCH-2155
> > URL: https://issues.apache.org/jira/browse/NUTCH-2155
> > Project: Nutch
> >  Issue Type: Improvement
> >  Components: util
> >Affects Versions: 1.10
> >Reporter: Michael Joyce
> >Assignee: Chris A. Mattmann
> >  Labels: memex
> > Fix For: 1.11
> >
> >
> > I've found it useful to have a tool for dumping some "completeness"
> > information from a crawl similar to how domainstats does but including
> > fetched and unfetched counts per domain/host. This is especially nice when
> > doing vertical crawls over a few domains or just to see how much of a
> > host/domain you've covered with your crawl so far.
>
>
>
>


-- 
___

Aron Ahmadia
Computational and Data Scientist



[jira] [Reopened] (NUTCH-2150) Add ProtocolStatus Utility

2015-11-01 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel reopened NUTCH-2150:


Fails with an exception (see 
[NUTCH-2155|https://issues.apache.org/jira/browse/NUTCH-2155?focusedCommentId=14984379]):
{noformat}
% nutch protocolstats
usage: ProtocolStatistics   [numOfReducer]
% nutch protocolstats ./crawl/crawldb/ protocol_stats
./crawl/crawldb/
protocol_stats
Exception in thread "main" java.io.FileNotFoundException: File 
file:.../crawl/crawldb/old/data does not exist
at 
org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:511)
{noformat}

> Add ProtocolStatus Utility
> --
>
> Key: NUTCH-2150
> URL: https://issues.apache.org/jira/browse/NUTCH-2150
> Project: Nutch
>  Issue Type: Improvement
>  Components: util
>Affects Versions: 1.10
>Reporter: Michael Joyce
>Assignee: Chris A. Mattmann
> Fix For: 1.11
>
>
> It would be nice to have a utility for dumping protocol status code 
> information for a crawl database. This will be a utility for getting a dump 
> of the protocol status codes that builds off of NUTCH-2129





[jira] [Reopened] (NUTCH-2155) Create a "crawl completeness" utility

2015-11-01 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel reopened NUTCH-2155:


When running the completion statistics on a CrawlDb, an exception is thrown
{noformat}
% nutch crawlcomplete
usage: CrawlCompletionStats inputDirs outDir host|domain [numOfReducer]
% nutch crawlcomplete ./crawl/crawldb completion_stats domain
Exception in thread "main" java.io.FileNotFoundException: File 
file:.../crawl/crawldb/old/data does not exist
at 
org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:511)
{noformat}
I had to take a look into the code to figure out that the inputDirs parameter 
is expected as a comma-separated list of CrawlDb sequence files. The following 
command works:
{noformat}
% nutch crawlcomplete ./crawl/crawldb/current completion_stats domain
{noformat}
All Nutch tools and utils operating on CrawlDb take just the bare path without 
the current/ subdirectory. Shouldn't the crawlcomplete command behave the same?
Passing more than one CrawlDb may be useful sometimes. However, crawls 
(and their dbs) are usually disjoint. If they are not, the completeness 
statistics are probably incorrect due to duplicates.
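
One way the tool could accept the bare CrawlDb path is to append the current/ subdirectory itself. The sketch below works on plain strings so that it is self-contained; the class and method names are hypothetical and this is not the actual fix. Accepting paths that already end in current/ is an assumption made for backward compatibility with the command that works today.

```java
import java.util.ArrayList;
import java.util.List;

class CrawlDbPathSketch {
    // Name of the subdirectory discussed in this thread.
    static final String CURRENT = "current";

    /** Returns the path of the current/ subdirectory for a bare CrawlDb path. */
    static String toCurrent(String crawlDb) {
        String p = crawlDb.trim();
        // Strip trailing slashes, but keep a lone "/" intact.
        while (p.length() > 1 && p.endsWith("/")) {
            p = p.substring(0, p.length() - 1);
        }
        // Leave paths that already point at current/ unchanged (assumed for compatibility).
        if (p.equals(CURRENT) || p.endsWith("/" + CURRENT)) {
            return p;
        }
        return p + "/" + CURRENT;
    }

    /** Splits a comma-separated inputDirs argument and resolves each non-empty entry. */
    static List<String> resolveInputDirs(String commaSeparated) {
        List<String> resolved = new ArrayList<>();
        for (String part : commaSeparated.split(",")) {
            if (!part.trim().isEmpty()) {
                resolved.add(toCurrent(part));
            }
        }
        return resolved;
    }

    public static void main(String[] args) {
        // Both spellings from the thread resolve to the same directory.
        System.out.println(resolveInputDirs("./crawl/crawldb,./crawl/crawldb/current"));
    }
}
```

In real code the same idea would operate on Hadoop paths rather than strings, and the existence of the resolved directory would still need to be checked on the file system.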

> Create a "crawl completeness" utility
> -
>
> Key: NUTCH-2155
> URL: https://issues.apache.org/jira/browse/NUTCH-2155
> Project: Nutch
>  Issue Type: Improvement
>  Components: util
>Affects Versions: 1.10
>Reporter: Michael Joyce
>Assignee: Chris A. Mattmann
>  Labels: memex
> Fix For: 1.11
>
>
> I've found it useful to have a tool for dumping some "completeness" 
> information from a crawl similar to how domainstats does but including 
> fetched and unfetched counts per domain/host. This is especially nice when 
> doing vertical crawls over a few domains or just to see how much of a 
> host/domain you've covered with your crawl so far.


