[jira] [Commented] (NUTCH-2155) Create a "crawl completeness" utility
[ https://issues.apache.org/jira/browse/NUTCH-2155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15000667#comment-15000667 ]

Hudson commented on NUTCH-2155:
-------------------------------

SUCCESS: Integrated in Nutch-trunk #3304 (See [https://builds.apache.org/job/Nutch-trunk/3304/])
NUTCH-2155 - Update crawlcomplete help and drop 'current' folder requirements (joyce: [http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1713885])
* trunk/src/java/org/apache/nutch/util/CrawlCompletionStats.java

> Create a "crawl completeness" utility
> -------------------------------------
>
>                 Key: NUTCH-2155
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2155
>             Project: Nutch
>          Issue Type: Improvement
>          Components: util
>    Affects Versions: 1.10
>            Reporter: Michael Joyce
>            Assignee: Michael Joyce
>              Labels: memex
>             Fix For: 1.11
>
>         Attachments: NUTCH-2155_joyce_9Nov2015.patch
>
>
> I've found it useful to have a tool for dumping some "completeness"
> information from a crawl similar to how domainstats does but including
> fetched and unfetched counts per domain/host. This is especially nice when
> doing vertical crawls over a few domains or just to see how much of a
> host/domain you've covered with your crawl so far.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (NUTCH-2155) Create a "crawl completeness" utility
[ https://issues.apache.org/jira/browse/NUTCH-2155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14997490#comment-14997490 ]

Sebastian Nagel commented on NUTCH-2155:
----------------------------------------

+1
[jira] [Commented] (NUTCH-2155) Create a "crawl completeness" utility
[ https://issues.apache.org/jira/browse/NUTCH-2155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14985431#comment-14985431 ]

Michael Joyce commented on NUTCH-2155:
--------------------------------------

+1 sounds good to me [~sebastien0], I will update it in a patch shortly

> Create a "crawl completeness" utility
> -------------------------------------
>
>                 Key: NUTCH-2155
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2155
>             Project: Nutch
>          Issue Type: Improvement
>          Components: util
>    Affects Versions: 1.10
>            Reporter: Michael Joyce
>            Assignee: Chris A. Mattmann
>              Labels: memex
>             Fix For: 1.11
>
>
> I've found it useful to have a tool for dumping some "completeness"
> information from a crawl similar to how domainstats does but including
> fetched and unfetched counts per domain/host. This is especially nice when
> doing vertical crawls over a few domains or just to see how much of a
> host/domain you've covered with your crawl so far.
[jira] [Commented] (NUTCH-2155) Create a "crawl completeness" utility
[ https://issues.apache.org/jira/browse/NUTCH-2155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14985280#comment-14985280 ]

Sebastian Nagel commented on NUTCH-2155:
----------------------------------------

Yes, call it as
{noformat}
% nutch crawlcomplete crawl/crawldb ...
{noformat}
same as other tools:
{noformat}
% nutch readdb crawl/crawldb ...
% nutch inject crawl/crawldb seedurls/
{noformat}
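
The convention confirmed in this comment (pass crawl/crawldb and let the tool locate the current/ subdirectory itself) could be sketched as below. This is only an illustration under stated assumptions: the class and method names are hypothetical, and java.nio.file stands in for the Hadoop Path API so the snippet runs without Hadoop on the classpath.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical sketch; not code from the NUTCH-2155 patch. */
final class CrawlDbPathSketch {

  // Name of the subdirectory discussed in this thread (crawl/crawldb/current).
  static final String CURRENT_NAME = "current";

  /** Appends the "current" subdirectory to the crawldb path given by the user. */
  static Path resolveCurrent(String crawlDb) {
    return Paths.get(crawlDb).resolve(CURRENT_NAME);
  }

  public static void main(String[] args) {
    // The user passes crawl/crawldb; the tool derives the directory it reads.
    System.out.println(resolveCurrent("crawl/crawldb"));
  }
}
```

With this shape the command line stays the same as for readdb and inject, and only the tool needs to know about the internal directory layout.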
[jira] [Commented] (NUTCH-2155) Create a "crawl completeness" utility
[ https://issues.apache.org/jira/browse/NUTCH-2155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14984806#comment-14984806 ]

Markus Jelsma commented on NUTCH-2155:
--------------------------------------

By `remove current` and `not require current` you guys mean not having it as an argument, I assume? I.e., not crawl/crawldb/current but crawl/crawldb?
[jira] [Commented] (NUTCH-2155) Create a "crawl completeness" utility
[ https://issues.apache.org/jira/browse/NUTCH-2155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14984542#comment-14984542 ]

Sebastian Nagel commented on NUTCH-2155:
----------------------------------------

I would opt to make the "crawlcomplete" utility consistent with "readdb", "generate", etc. -- without current/. The main point: it must be obvious how to use a tool. This was already an issue with "domainstats", where it was solved by adding appropriate command-line help (NUTCH-1911 -- sorry, [~jo...@apache.org], I missed this).
[jira] [Commented] (NUTCH-2155) Create a "crawl completeness" utility
[ https://issues.apache.org/jira/browse/NUTCH-2155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14984432#comment-14984432 ]

Chris A. Mattmann commented on NUTCH-2155:
------------------------------------------

Seb, shall we update it not to require current and then move forward? Thoughts? [~mjoyce]?
[jira] [Commented] (NUTCH-2155) Create a "crawl completeness" utility
[ https://issues.apache.org/jira/browse/NUTCH-2155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14983557#comment-14983557 ]

Hudson commented on NUTCH-2155:
-------------------------------

SUCCESS: Integrated in Nutch-trunk #3299 (See [https://builds.apache.org/job/Nutch-trunk/3299/])
Fix for NUTCH-2155 Create a crawl completeness utility contributed by Michael Joyce this closes #83 (mattmann: [http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1711560])
* trunk/CHANGES.txt
* trunk/src/bin/nutch
* trunk/src/java/org/apache/nutch/util/CrawlCompletionStats.java
[jira] [Commented] (NUTCH-2155) Create a "crawl completeness" utility
[ https://issues.apache.org/jira/browse/NUTCH-2155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14983399#comment-14983399 ]

ASF GitHub Bot commented on NUTCH-2155:
---------------------------------------

Github user asfgit closed the pull request at:

    https://github.com/apache/nutch/pull/83
[jira] [Commented] (NUTCH-2155) Create a "crawl completeness" utility
[ https://issues.apache.org/jira/browse/NUTCH-2155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14979335#comment-14979335 ]

ASF GitHub Bot commented on NUTCH-2155:
---------------------------------------

Github user MJJoyce commented on a diff in the pull request:

    https://github.com/apache/nutch/pull/83#discussion_r43325287

    --- Diff: src/java/org/apache/nutch/util/CrawlCompletionStats.java ---
    @@ -0,0 +1,189 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements. See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License. You may obtain a copy of the License at
    + *
    + * http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.nutch.util;
    +
    +import java.io.IOException;
    +import java.net.URL;
    +import java.text.SimpleDateFormat;
    +import org.slf4j.Logger;
    +import org.slf4j.LoggerFactory;
    +import org.apache.hadoop.conf.Configuration;
    +import org.apache.hadoop.conf.Configured;
    +import org.apache.hadoop.fs.Path;
    +import org.apache.hadoop.io.LongWritable;
    +import org.apache.hadoop.io.Text;
    +import org.apache.hadoop.mapreduce.Job;
    +import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    +import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
    +import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    +import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
    +import org.apache.hadoop.mapreduce.Mapper;
    +import org.apache.hadoop.mapreduce.Reducer;
    +import org.apache.hadoop.util.Tool;
    +import org.apache.hadoop.util.ToolRunner;
    +import org.apache.nutch.crawl.CrawlDatum;
    +import org.apache.nutch.util.NutchConfiguration;
    +import org.apache.nutch.util.TimingUtil;
    +import org.apache.nutch.util.URLUtil;
    +
    +/**
    + * Extracts some simple crawl completion stats from the crawldb
    + *
    + * Stats will be sorted by host/domain and will be of the form:
    + * 1 www.spitzer.caltech.edu FETCHED
    + * 50 www.spitzer.caltech.edu UNFETCHED
    + *
    + */
    +public class CrawlCompletionStats extends Configured implements Tool {
    +
    +  private static final Logger LOG = LoggerFactory
    +      .getLogger(CrawlCompletionStats.class);
    +
    +  private static final int MODE_HOST = 1;
    +  private static final int MODE_DOMAIN = 2;
    +
    +  private int mode = 0;
    +
    +  public int run(String[] args) throws Exception {
    +    if (args.length < 2) {
    +      System.out
    +          .println("usage: CrawlCompletionStats inputDirs outDir host|domain [numOfReducer]");
    +      return 1;
    +    }
    +    String inputDir = args[0];
    +    String outputDir = args[1];
    +    int numOfReducers = 1;
    +
    +    if (args.length > 3) {
    +      numOfReducers = Integer.parseInt(args[3]);
    +    }
    +
    +    SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
    +    long start = System.currentTimeMillis();
    +    LOG.info("CrawlCompletionStats: starting at " + sdf.format(start));
    +
    +    int mode = 0;
    +    String jobName = "CrawlCompletionStats";
    +    if (args[2].equals("host")) {
    +      jobName = "Host CrawlCompletionStats";
    +      mode = MODE_HOST;
    +    } else if (args[2].equals("domain")) {
    +      jobName = "Domain CrawlCompletionStats";
    +      mode = MODE_DOMAIN;
    +    }
    +
    +    Configuration conf = getConf();
    +    conf.setInt("domain.statistics.mode", mode);
    +    conf.setBoolean("mapreduce.fileoutputcommitter.marksuccessfuljobs", false);
    +
    +    Job job = Job.getInstance(conf, jobName);
    +    job.setJarByClass(CrawlCompletionStats.class);
    +
    +    String[] inputDirsSpecs = inputDir.split(",");
    +    for (int i = 0; i < inputDirsSpecs.length; i++) {
    +      FileInputFormat.addInputPath(job, new Path(inputDirsSpecs[i]));
    +    }
    +
    +    job.setInputFormatClass(SequenceFileInputFormat.class);
    +    FileOutputFormat.setOutputPath(job, new Path(outputDir));
    +    job.setOutputFormatClass(TextOutputFormat.class);
    +
    +    job.setMapOutputKeyClass(Text.class);
    +    job.setMapOutputValueClass(LongWritable.class);
    +    job.setOutputKeyClass(Text.class);
    +    job.setOutputValueClass
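
The class Javadoc in the diff describes output lines of the form "count host STATUS". As a rough illustration of that aggregation only, here is a stand-alone sketch that runs without Hadoop. Everything in it is an assumption made for the example: the class name, the method name, the use of a URL-to-boolean map as a stand-in for crawldb records, and the use of java.net.URI for host extraction (the patch imports Nutch's URLUtil instead). It is not the MapReduce job from the pull request.

```java
import java.net.URI;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical stand-alone sketch; not the Hadoop job from the patch. */
final class CompletenessSketch {

  /**
   * Counts URLs per "host STATUS" key. The input maps each URL to whether it
   * has been fetched; the TreeMap keeps the result sorted by host and status.
   */
  static Map<String, Long> countByHost(Map<String, Boolean> fetchedByUrl) {
    Map<String, Long> counts = new TreeMap<>();
    for (Map.Entry<String, Boolean> e : fetchedByUrl.entrySet()) {
      String host = URI.create(e.getKey()).getHost();
      if (host == null) {
        continue; // skip URLs without a host component
      }
      String key = host + " " + (e.getValue() ? "FETCHED" : "UNFETCHED");
      counts.merge(key, 1L, Long::sum);
    }
    return counts;
  }

  public static void main(String[] args) {
    Map<String, Boolean> sample = new LinkedHashMap<>();
    sample.put("http://www.spitzer.caltech.edu/", true);
    sample.put("http://www.spitzer.caltech.edu/images", false);
    sample.put("http://www.spitzer.caltech.edu/news", false);
    // Prints one "count host STATUS" line per key.
    countByHost(sample).forEach((key, n) -> System.out.println(n + " " + key));
  }
}
```

A domain mode would differ only in how the grouping key is derived from the URL; the counting step stays the same.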
[jira] [Commented] (NUTCH-2155) Create a "crawl completeness" utility
[ https://issues.apache.org/jira/browse/NUTCH-2155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14979327#comment-14979327 ]

ASF GitHub Bot commented on NUTCH-2155:
---------------------------------------

Github user lewismc commented on a diff in the pull request:

    https://github.com/apache/nutch/pull/83#discussion_r43324772

    --- Diff: src/java/org/apache/nutch/util/CrawlCompletionStats.java ---
[jira] [Commented] (NUTCH-2155) Create a "crawl completeness" utility
[ https://issues.apache.org/jira/browse/NUTCH-2155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14979325#comment-14979325 ]

ASF GitHub Bot commented on NUTCH-2155:
---------------------------------------

Github user MJJoyce commented on a diff in the pull request:

    https://github.com/apache/nutch/pull/83#discussion_r43324656

    --- Diff: src/java/org/apache/nutch/util/CrawlCompletionStats.java ---
    +  public int run(String[] args) throws Exception {
    +    if (args.length < 2) {
    --- End diff --

    +1 I agree completely @lewismc. I got a bit lazy and stole some from domainstats (which is also in need of some commons-cli love). I'll try to throw a patch together and address some of these issues when I get some free time.

> Create a "crawl completeness" utility
> -------------------------------------
>
>                 Key: NUTCH-2155
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2155
>             Project: Nutch
>          Issue Type: Improvement
>          Components: util
>    Affects Versions: 1.10
>            Reporter: Michael Joyce
>             Fix For: 1.12
>
>
> I've found it useful to have a tool for dumping some "completeness"
> information from a crawl similar to how domainstats does but including
> fetched and unfetched counts per domain/host. This is especially nice when
> doing vertical crawls over a few domains or just to see how much of a
> host/domain you've covered with your crawl so far.
[jira] [Commented] (NUTCH-2155) Create a "crawl completeness" utility
[ https://issues.apache.org/jira/browse/NUTCH-2155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14979319#comment-14979319 ]

ASF GitHub Bot commented on NUTCH-2155:
---------------------------------------

Github user lewismc commented on a diff in the pull request:

    https://github.com/apache/nutch/pull/83#discussion_r43324357

    --- Diff: src/java/org/apache/nutch/util/CrawlCompletionStats.java ---
    +  public int run(String[] args) throws Exception {
    +    if (args.length < 2) {
    --- End diff --

    Not an absolute requirement but merely a suggestion: it would be GREAT to see commons-cli used here to prevent incorrect CLI usage. It also prints much more user-friendly output when invoked without options or with '-h'. For excellent examples of where commons-cli is already used please see [here](https://github.com/apache/nutch/blob/trunk/src/java/org/apache/nutch/scoring/webgraph/WebGraph.java#L698-L760) @MJJoyce. I like this job as well, it's a neat and quick way to see domain coverage.
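
The review point above is that the argument check lets incorrect invocations through: the quoted run() tests for fewer than two arguments but later reads args[2]. A dependency-free sketch of stricter validation follows. It deliberately does not use commons-cli, which the reviewer recommends, so that the snippet stays self-contained; the class name, method name, and error messages are hypothetical, and only the usage string is taken from the diff.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper; not part of the NUTCH-2155 patch. */
final class CrawlCompleteArgs {

  // Usage string as printed by the run() method quoted in the diff.
  static final String USAGE =
      "usage: CrawlCompletionStats inputDirs outDir host|domain [numOfReducer]";

  private static final List<String> MODES = Arrays.asList("host", "domain");

  /** Returns null when the arguments are usable, otherwise an error message. */
  static String validate(String[] args) {
    // run() reads args[2], so three arguments are the real minimum.
    if (args.length < 3) {
      return USAGE;
    }
    if (!MODES.contains(args[2])) {
      return "unknown mode '" + args[2] + "'; " + USAGE;
    }
    if (args.length > 3) {
      try {
        if (Integer.parseInt(args[3]) < 1) {
          return "numOfReducer must be at least 1; " + USAGE;
        }
      } catch (NumberFormatException e) {
        return "numOfReducer is not an integer: " + args[3];
      }
    }
    return null;
  }

  public static void main(String[] args) {
    String error = validate(args);
    System.out.println(error == null ? "arguments look fine" : error);
  }
}
```

A commons-cli version would replace the positional checks with declared options and generated help text, which is the user-facing benefit the comment describes.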
[jira] [Commented] (NUTCH-2155) Create a "crawl completeness" utility
[ https://issues.apache.org/jira/browse/NUTCH-2155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14979258#comment-14979258 ]

ASF GitHub Bot commented on NUTCH-2155:
---------------------------------------

GitHub user MJJoyce opened a pull request:

    https://github.com/apache/nutch/pull/83

    NUTCH-2155 - Add crawl completion utility

    - Add simple crawl completion utility that reports count of fetch and
      unfetched pages per domain or host.
    - Update "nutch" helper script with new utility command.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/MJJoyce/nutch NUTCH-2155

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/nutch/pull/83.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #83

commit 2534b7a32417e044f5c1c39f4409a6d6826eee69
Author: Michael Joyce
Date:   2015-10-28T21:18:16Z

    NUTCH-2155 - Add crawl completion util

    - Add simple crawl completion utility that reports count of fetch and
      unfetched pages per domain or host.
    - Update "nutch" helper script with new utility command.
[jira] [Commented] (NUTCH-2155) Create a "crawl completeness" utility
[ https://issues.apache.org/jira/browse/NUTCH-2155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14979196#comment-14979196 ]

Michael Joyce commented on NUTCH-2155:
--------------------------------------

Should have a first patch up shortly for review folks