[jira] [Commented] (NUTCH-2155) Create a "crawl completeness" utility

2015-11-11 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15000667#comment-15000667
 ] 

Hudson commented on NUTCH-2155:
---

SUCCESS: Integrated in Nutch-trunk #3304 (See 
[https://builds.apache.org/job/Nutch-trunk/3304/])
NUTCH-2155 - Update crawlcomplete help and drop 'current' folder requirements 
(joyce: [http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1713885])
* trunk/src/java/org/apache/nutch/util/CrawlCompletionStats.java


> Create a "crawl completeness" utility
> -
>
> Key: NUTCH-2155
> URL: https://issues.apache.org/jira/browse/NUTCH-2155
> Project: Nutch
>  Issue Type: Improvement
>  Components: util
>Affects Versions: 1.10
>Reporter: Michael Joyce
>Assignee: Michael Joyce
>  Labels: memex
> Fix For: 1.11
>
> Attachments: NUTCH-2155_joyce_9Nov2015.patch
>
>
> I've found it useful to have a tool for dumping some "completeness" 
> information from a crawl similar to how domainstats does but including 
> fetched and unfetched counts per domain/host. This is especially nice when 
> doing vertical crawls over a few domains or just to see how much of a 
> host/domain you've covered with your crawl so far.
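
The per-host fetched/unfetched counts described above can be sketched in plain Java. This is only an illustration, not the utility's MapReduce code: the class name CompletenessSketch, the String[] {url, status} entry shape, and the sample URLs are assumptions made for the example.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: each entry is {url, status}, standing in for a crawldb record.
final class CompletenessSketch {

  /** Counts entries per "host STATUS" key; TreeMap keeps the keys sorted. */
  static Map<String, Long> countByHost(List<String[]> entries) {
    Map<String, Long> counts = new TreeMap<>();
    for (String[] entry : entries) {
      String host = URI.create(entry[0]).getHost();
      if (host == null) {
        continue; // skip URLs without a parseable host
      }
      counts.merge(host + " " + entry[1], 1L, Long::sum);
    }
    return counts;
  }

  public static void main(String[] args) {
    List<String[]> entries = Arrays.asList(
        new String[] {"http://www.spitzer.caltech.edu/", "FETCHED"},
        new String[] {"http://www.spitzer.caltech.edu/a", "UNFETCHED"},
        new String[] {"http://www.spitzer.caltech.edu/b", "UNFETCHED"});
    // Print one "count<TAB>host STATUS" line per key.
    countByHost(entries).forEach((key, n) -> System.out.println(n + "\t" + key));
  }
}
```

Grouping by domain instead of host would only change how the key is derived from the URL.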



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (NUTCH-2155) Create a "crawl completeness" utility

2015-11-09 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14997490#comment-14997490
 ] 

Sebastian Nagel commented on NUTCH-2155:


+1



[jira] [Commented] (NUTCH-2155) Create a "crawl completeness" utility

2015-11-02 Thread Michael Joyce (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14985431#comment-14985431
 ] 

Michael Joyce commented on NUTCH-2155:
--

+1 sounds good to me [~sebastien0], I will update it in a patch shortly

> Create a "crawl completeness" utility
> -
>
> Key: NUTCH-2155
> URL: https://issues.apache.org/jira/browse/NUTCH-2155
> Project: Nutch
>  Issue Type: Improvement
>  Components: util
>Affects Versions: 1.10
>Reporter: Michael Joyce
>Assignee: Chris A. Mattmann
>  Labels: memex
> Fix For: 1.11
>
>
> I've found it useful to have a tool for dumping some "completeness" 
> information from a crawl similar to how domainstats does but including 
> fetched and unfetched counts per domain/host. This is especially nice when 
> doing vertical crawls over a few domains or just to see how much of a 
> host/domain you've covered with your crawl so far.





[jira] [Commented] (NUTCH-2155) Create a "crawl completeness" utility

2015-11-02 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14985280#comment-14985280
 ] 

Sebastian Nagel commented on NUTCH-2155:


Yes, call it as
{noformat}
% nutch crawlcomplete crawl/crawldb ...
{noformat}
same as other tools:
{noformat}
% nutch readdb crawl/crawldb ...
% nutch inject crawl/crawldb seedurls/
{noformat}
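
A minimal sketch of what "without current/" could mean inside a tool: the user passes crawl/crawldb and the tool appends the subdirectory itself. The class and method names are hypothetical, and java.nio.file.Path stands in for Hadoop's Path so the example stays self-contained; the committed change may differ.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not taken from the Nutch sources.
final class CrawlDbPathSketch {

  // Assumption: the live crawldb data sits in a "current" subdirectory,
  // as in the crawl/crawldb/current path mentioned in this thread.
  static final String CURRENT_NAME = "current";

  /** Maps the user-supplied crawldb directory to the directory that is read. */
  static Path resolveInput(String crawlDbArg) {
    return Paths.get(crawlDbArg).resolve(CURRENT_NAME);
  }

  public static void main(String[] args) {
    // The caller only ever types crawl/crawldb.
    System.out.println(resolveInput("crawl/crawldb"));
  }
}
```

Keeping the subdirectory name inside the tool means the command line looks the same as for readdb and inject.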




[jira] [Commented] (NUTCH-2155) Create a "crawl completeness" utility

2015-11-01 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14984806#comment-14984806
 ] 

Markus Jelsma commented on NUTCH-2155:
--

By `remove current` and `not require current` you guys mean not having it as an 
argument, I assume? I.e. not crawl/crawldb/current but crawl/crawldb?




[jira] [Commented] (NUTCH-2155) Create a "crawl completeness" utility

2015-11-01 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14984542#comment-14984542
 ] 

Sebastian Nagel commented on NUTCH-2155:


I would opt to make the "crawlcomplete" utility consistent with "readdb", 
"generate", etc. -- without current/.
The main point: it must be obvious how to use a tool. This was already an issue 
with "domainstats", where it was solved by adding appropriate 
command-line help (NUTCH-1911 -- sorry, [~jo...@apache.org], I missed this). 



[jira] [Commented] (NUTCH-2155) Create a "crawl completeness" utility

2015-11-01 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14984432#comment-14984432
 ] 

Chris A. Mattmann commented on NUTCH-2155:
--

Seb, shall we update it not to require current and then move forward? Thoughts? 
[~mjoyce]?



[jira] [Commented] (NUTCH-2155) Create a "crawl completeness" utility

2015-10-30 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14983557#comment-14983557
 ] 

Hudson commented on NUTCH-2155:
---

SUCCESS: Integrated in Nutch-trunk #3299 (See 
[https://builds.apache.org/job/Nutch-trunk/3299/])
Fix for NUTCH-2155 Create a crawl completeness utility contributed by Michael 
Joyce  this closes #83 (mattmann: 
[http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1711560])
* trunk/CHANGES.txt
* trunk/src/bin/nutch
* trunk/src/java/org/apache/nutch/util/CrawlCompletionStats.java




[jira] [Commented] (NUTCH-2155) Create a "crawl completeness" utility

2015-10-30 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14983399#comment-14983399
 ] 

ASF GitHub Bot commented on NUTCH-2155:
---

Github user asfgit closed the pull request at:

https://github.com/apache/nutch/pull/83




[jira] [Commented] (NUTCH-2155) Create a "crawl completeness" utility

2015-10-28 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14979335#comment-14979335
 ] 

ASF GitHub Bot commented on NUTCH-2155:
---

Github user MJJoyce commented on a diff in the pull request:

https://github.com/apache/nutch/pull/83#discussion_r43325287
  
--- Diff: src/java/org/apache/nutch/util/CrawlCompletionStats.java ---
@@ -0,0 +1,189 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.nutch.util;
+
+import java.io.IOException;
+import java.net.URL;
+import java.text.SimpleDateFormat;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.conf.Configured;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.io.LongWritable;
+import org.apache.hadoop.io.Text;
+import org.apache.hadoop.mapreduce.Job;
+import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
+import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
+import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
+import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
+import org.apache.hadoop.mapreduce.Mapper;
+import org.apache.hadoop.mapreduce.Reducer;
+import org.apache.hadoop.util.Tool;
+import org.apache.hadoop.util.ToolRunner;
+import org.apache.nutch.crawl.CrawlDatum;
+import org.apache.nutch.util.NutchConfiguration;
+import org.apache.nutch.util.TimingUtil;
+import org.apache.nutch.util.URLUtil;
+
+/**
+ * Extracts some simple crawl completion stats from the crawldb
+ *
+ * Stats will be sorted by host/domain and will be of the form:
+ * 1   www.spitzer.caltech.edu FETCHED
+ * 50  www.spitzer.caltech.edu UNFETCHED
+ *
+ */
+public class CrawlCompletionStats extends Configured implements Tool {
+
+  private static final Logger LOG = LoggerFactory
+  .getLogger(CrawlCompletionStats.class);
+
+  private static final int MODE_HOST = 1;
+  private static final int MODE_DOMAIN = 2;
+
+  private int mode = 0;
+
+  public int run(String[] args) throws Exception {
+if (args.length < 2) {
+  System.out
+  .println("usage: CrawlCompletionStats inputDirs outDir host|domain [numOfReducer]");
+  return 1;
+}
+String inputDir = args[0];
+String outputDir = args[1];
+int numOfReducers = 1;
+
+if (args.length > 3) {
+  numOfReducers = Integer.parseInt(args[3]);
+}
+
+SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
+long start = System.currentTimeMillis();
+LOG.info("CrawlCompletionStats: starting at " + sdf.format(start));
+
+int mode = 0;
+String jobName = "CrawlCompletionStats";
+if (args[2].equals("host")) {
+  jobName = "Host CrawlCompletionStats";
+  mode = MODE_HOST;
+} else if (args[2].equals("domain")) {
+  jobName = "Domain CrawlCompletionStats";
+  mode = MODE_DOMAIN;
+}
+
+Configuration conf = getConf();
+conf.setInt("domain.statistics.mode", mode);
+conf.setBoolean("mapreduce.fileoutputcommitter.marksuccessfuljobs", false);
+
+Job job = Job.getInstance(conf, jobName);
+job.setJarByClass(CrawlCompletionStats.class);
+
+String[] inputDirsSpecs = inputDir.split(",");
+for (int i = 0; i < inputDirsSpecs.length; i++) {
+  FileInputFormat.addInputPath(job, new Path(inputDirsSpecs[i]));
+}
+
+job.setInputFormatClass(SequenceFileInputFormat.class);
+FileOutputFormat.setOutputPath(job, new Path(outputDir));
+job.setOutputFormatClass(TextOutputFormat.class);
+
+job.setMapOutputKeyClass(Text.class);
+job.setMapOutputValueClass(LongWritable.class);
+job.setOutputKeyClass(Text.class);
+job.setOutputValueClass

[jira] [Commented] (NUTCH-2155) Create a "crawl completeness" utility

2015-10-28 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14979327#comment-14979327
 ] 

ASF GitHub Bot commented on NUTCH-2155:
---

Github user lewismc commented on a diff in the pull request:

https://github.com/apache/nutch/pull/83#discussion_r43324772
  
--- Diff: src/java/org/apache/nutch/util/CrawlCompletionStats.java ---
@@ -0,0 +1,189 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.nutch.util;
+
+import java.io.IOException;
+import java.net.URL;
+import java.text.SimpleDateFormat;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.conf.Configured;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.io.LongWritable;
+import org.apache.hadoop.io.Text;
+import org.apache.hadoop.mapreduce.Job;
+import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
+import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
+import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
+import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
+import org.apache.hadoop.mapreduce.Mapper;
+import org.apache.hadoop.mapreduce.Reducer;
+import org.apache.hadoop.util.Tool;
+import org.apache.hadoop.util.ToolRunner;
+import org.apache.nutch.crawl.CrawlDatum;
+import org.apache.nutch.util.NutchConfiguration;
+import org.apache.nutch.util.TimingUtil;
+import org.apache.nutch.util.URLUtil;
+
+/**
+ * Extracts some simple crawl completion stats from the crawldb
+ *
+ * Stats will be sorted by host/domain and will be of the form:
+ * 1   www.spitzer.caltech.edu FETCHED
+ * 50  www.spitzer.caltech.edu UNFETCHED
+ *
+ */
+public class CrawlCompletionStats extends Configured implements Tool {
+
+  private static final Logger LOG = LoggerFactory
+  .getLogger(CrawlCompletionStats.class);
+
+  private static final int MODE_HOST = 1;
+  private static final int MODE_DOMAIN = 2;
+
+  private int mode = 0;
+
+  public int run(String[] args) throws Exception {
+if (args.length < 2) {
+  System.out
+  .println("usage: CrawlCompletionStats inputDirs outDir host|domain [numOfReducer]");
+  return 1;
+}
+String inputDir = args[0];
+String outputDir = args[1];
+int numOfReducers = 1;
+
+if (args.length > 3) {
+  numOfReducers = Integer.parseInt(args[3]);
+}
+
+SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
+long start = System.currentTimeMillis();
+LOG.info("CrawlCompletionStats: starting at " + sdf.format(start));
+
+int mode = 0;
+String jobName = "CrawlCompletionStats";
+if (args[2].equals("host")) {
+  jobName = "Host CrawlCompletionStats";
+  mode = MODE_HOST;
+} else if (args[2].equals("domain")) {
+  jobName = "Domain CrawlCompletionStats";
+  mode = MODE_DOMAIN;
+}
+
+Configuration conf = getConf();
+conf.setInt("domain.statistics.mode", mode);
+conf.setBoolean("mapreduce.fileoutputcommitter.marksuccessfuljobs", false);
+
+Job job = Job.getInstance(conf, jobName);
+job.setJarByClass(CrawlCompletionStats.class);
+
+String[] inputDirsSpecs = inputDir.split(",");
+for (int i = 0; i < inputDirsSpecs.length; i++) {
+  FileInputFormat.addInputPath(job, new Path(inputDirsSpecs[i]));
+}
+
+job.setInputFormatClass(SequenceFileInputFormat.class);
+FileOutputFormat.setOutputPath(job, new Path(outputDir));
+job.setOutputFormatClass(TextOutputFormat.class);
+
+job.setMapOutputKeyClass(Text.class);
+job.setMapOutputValueClass(LongWritable.class);
+job.setOutputKeyClass(Text.class);
+job.setOutputValueClass

[jira] [Commented] (NUTCH-2155) Create a "crawl completeness" utility

2015-10-28 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14979325#comment-14979325
 ] 

ASF GitHub Bot commented on NUTCH-2155:
---

Github user MJJoyce commented on a diff in the pull request:

https://github.com/apache/nutch/pull/83#discussion_r43324656
  
--- Diff: src/java/org/apache/nutch/util/CrawlCompletionStats.java ---
@@ -0,0 +1,189 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.nutch.util;
+
+import java.io.IOException;
+import java.net.URL;
+import java.text.SimpleDateFormat;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.conf.Configured;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.io.LongWritable;
+import org.apache.hadoop.io.Text;
+import org.apache.hadoop.mapreduce.Job;
+import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
+import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
+import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
+import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
+import org.apache.hadoop.mapreduce.Mapper;
+import org.apache.hadoop.mapreduce.Reducer;
+import org.apache.hadoop.util.Tool;
+import org.apache.hadoop.util.ToolRunner;
+import org.apache.nutch.crawl.CrawlDatum;
+import org.apache.nutch.util.NutchConfiguration;
+import org.apache.nutch.util.TimingUtil;
+import org.apache.nutch.util.URLUtil;
+
+/**
+ * Extracts some simple crawl completion stats from the crawldb
+ *
+ * Stats will be sorted by host/domain and will be of the form:
+ * 1   www.spitzer.caltech.edu FETCHED
+ * 50  www.spitzer.caltech.edu UNFETCHED
+ *
+ */
+public class CrawlCompletionStats extends Configured implements Tool {
+
+  private static final Logger LOG = LoggerFactory
+  .getLogger(CrawlCompletionStats.class);
+
+  private static final int MODE_HOST = 1;
+  private static final int MODE_DOMAIN = 2;
+
+  private int mode = 0;
+
+  public int run(String[] args) throws Exception {
+if (args.length < 2) {
--- End diff --

+1 I agree completely @lewismc. I got a bit lazy and stole some from 
domainstats (which is also in need of some commons-cli love). I'll try 
to throw a patch together and address some of these issues when I get some free 
time.


> Create a "crawl completeness" utility
> -
>
> Key: NUTCH-2155
> URL: https://issues.apache.org/jira/browse/NUTCH-2155
> Project: Nutch
>  Issue Type: Improvement
>  Components: util
>Affects Versions: 1.10
>Reporter: Michael Joyce
> Fix For: 1.12
>
>
> I've found it useful to have a tool for dumping some "completeness" 
> information from a crawl similar to how domainstats does but including 
> fetched and unfetched counts per domain/host. This is especially nice when 
> doing vertical crawls over a few domains or just to see how much of a 
> host/domain you've covered with your crawl so far.





[jira] [Commented] (NUTCH-2155) Create a "crawl completeness" utility

2015-10-28 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14979319#comment-14979319
 ] 

ASF GitHub Bot commented on NUTCH-2155:
---

Github user lewismc commented on a diff in the pull request:

https://github.com/apache/nutch/pull/83#discussion_r43324357
  
--- Diff: src/java/org/apache/nutch/util/CrawlCompletionStats.java ---
@@ -0,0 +1,189 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.nutch.util;
+
+import java.io.IOException;
+import java.net.URL;
+import java.text.SimpleDateFormat;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.conf.Configured;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.io.LongWritable;
+import org.apache.hadoop.io.Text;
+import org.apache.hadoop.mapreduce.Job;
+import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
+import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
+import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
+import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
+import org.apache.hadoop.mapreduce.Mapper;
+import org.apache.hadoop.mapreduce.Reducer;
+import org.apache.hadoop.util.Tool;
+import org.apache.hadoop.util.ToolRunner;
+import org.apache.nutch.crawl.CrawlDatum;
+import org.apache.nutch.util.NutchConfiguration;
+import org.apache.nutch.util.TimingUtil;
+import org.apache.nutch.util.URLUtil;
+
+/**
+ * Extracts some simple crawl completion stats from the crawldb
+ *
+ * Stats will be sorted by host/domain and will be of the form:
+ * 1   www.spitzer.caltech.edu FETCHED
+ * 50  www.spitzer.caltech.edu UNFETCHED
+ *
+ */
+public class CrawlCompletionStats extends Configured implements Tool {
+
+  private static final Logger LOG = LoggerFactory
+  .getLogger(CrawlCompletionStats.class);
+
+  private static final int MODE_HOST = 1;
+  private static final int MODE_DOMAIN = 2;
+
+  private int mode = 0;
+
+  public int run(String[] args) throws Exception {
+if (args.length < 2) {
--- End diff --

Not an absolute requirement but merely a suggestion: it would be GREAT to 
see commons-cli used here to prevent incorrect CLI usage. It also prints much 
more user-friendly output when invoked without options or with '-h'.
For excellent examples of where commons-cli is already used please see 
[here](https://github.com/apache/nutch/blob/trunk/src/java/org/apache/nutch/scoring/webgraph/WebGraph.java#L698-L760)
 @MJJoyce.
I like this job as well, it's a neat and quick way to see domain coverage.
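
To make the "incorrect CLI usage" concern concrete: the quoted diff guards with args.length < 2 but later reads args[2], so a two-argument call gets past the check and then fails with an ArrayIndexOutOfBoundsException. The sketch below shows a stricter check in plain Java rather than commons-cli, purely to stay dependency-free. The class ArgCheckSketch and the method parseMode are hypothetical names, and returning 0 for unusable input is an assumption of this example.

```java
// Hypothetical sketch of stricter argument handling; not the patch itself.
final class ArgCheckSketch {

  static final int MODE_HOST = 1;
  static final int MODE_DOMAIN = 2;

  /**
   * Returns MODE_HOST or MODE_DOMAIN for "inputDirs outDir host|domain ...",
   * or 0 when the mode argument is missing or unrecognized.
   */
  static int parseMode(String[] args) {
    if (args.length < 3) {
      return 0; // args[2] would not exist
    }
    if ("host".equals(args[2])) {
      return MODE_HOST;
    }
    if ("domain".equals(args[2])) {
      return MODE_DOMAIN;
    }
    return 0; // unknown mode string
  }

  public static void main(String[] args) {
    if (parseMode(args) == 0) {
      System.out.println(
          "usage: CrawlCompletionStats inputDirs outDir host|domain [numOfReducer]");
    }
  }
}
```

A caller can print the usage line and return a non-zero exit code whenever parseMode returns 0, instead of letting the job start with mode 0.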




[jira] [Commented] (NUTCH-2155) Create a "crawl completeness" utility

2015-10-28 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14979258#comment-14979258
 ] 

ASF GitHub Bot commented on NUTCH-2155:
---

GitHub user MJJoyce opened a pull request:

https://github.com/apache/nutch/pull/83

NUTCH-2155 - Add crawl completion utility

- Add simple crawl completion utility that reports count of fetched and
  unfetched pages per domain or host.
- Update "nutch" helper script with new utility command.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/MJJoyce/nutch NUTCH-2155

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/nutch/pull/83.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #83


commit 2534b7a32417e044f5c1c39f4409a6d6826eee69
Author: Michael Joyce 
Date:   2015-10-28T21:18:16Z

NUTCH-2155 - Add crawl completion util

- Add simple crawl completion utility that reports count of fetched and
  unfetched pages per domain or host.
- Update "nutch" helper script with new utility command.






[jira] [Commented] (NUTCH-2155) Create a "crawl completeness" utility

2015-10-28 Thread Michael Joyce (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14979196#comment-14979196
 ] 

Michael Joyce commented on NUTCH-2155:
--

Should have a first patch up shortly for review, folks.
