[jira] [Commented] (NUTCH-1800) Documentation for Nutch 1.X and 2.X REST APIs

2015-10-28 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14979896#comment-14979896
 ] 

Chris A. Mattmann commented on NUTCH-1800:
--

Awesome, Lewis. +1 to commit this and move forward. Cheers!

> Documentation for Nutch 1.X and 2.X REST APIs
> -
>
> Key: NUTCH-1800
> URL: https://issues.apache.org/jira/browse/NUTCH-1800
> Project: Nutch
>  Issue Type: Bug
>  Components: documentation, REST_api
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
> Fix For: 1.11, 2.3.1
>
> Attachments: NUTCH-1800.patch
>
>
> This issue should build on NUTCH-1769 with full Java documentation for all 
> classes in the following packages
> org.apache.nutch.api.*
> I am assigning this one to [~fjodor.vershinin] as he is doing an excellent 
> job on the REST API. His UML graphic in [0] and commentary show that he has 
> a good understanding of the REST API and its functionality.
> Thank you [~fjodor.vershinin], great work.
> [0] https://wiki.apache.org/nutch/NutchRESTAPI#UML_Graphic



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (NUTCH-1800) Documentation for Nutch 1.X and 2.X REST APIs

2015-10-28 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14979834#comment-14979834
 ] 

Lewis John McGibbney commented on NUTCH-1800:
-

Improvements can be seen here 
http://people.apache.org/~lewismc/miredot/#warnings



[jira] [Commented] (NUTCH-1800) Documentation for Nutch 1.X and 2.X REST APIs

2015-10-28 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14979831#comment-14979831
 ] 

Lewis John McGibbney commented on NUTCH-1800:
-

For those who want to see the docs, they are available here:
http://people.apache.org/~lewismc/miredot/
Once we publish the docs with the next release I'll remove this.
Thanks



[jira] [Commented] (NUTCH-1800) Documentation for Nutch 1.X and 2.X REST APIs

2015-10-28 Thread Sujen Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14979830#comment-14979830
 ] 

Sujen Shah commented on NUTCH-1800:
---

Thanks Lewis for this, it is going to be really helpful to the community :) 



[jira] [Updated] (NUTCH-1800) Documentation for Nutch 1.X and 2.X REST APIs

2015-10-28 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1800:

 Flags: Patch
Patch Info: Patch Available



[jira] [Commented] (NUTCH-1800) Documentation for Nutch 1.X and 2.X REST APIs

2015-10-28 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14979828#comment-14979828
 ] 

Lewis John McGibbney commented on NUTCH-1800:
-

[~sujenshah]



[jira] [Updated] (NUTCH-1800) Documentation for Nutch 1.X and 2.X REST APIs

2015-10-28 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1800:

Attachment: NUTCH-1800.patch

Patch for trunk. It currently uses my own license key (which is OK); we will 
switch this out for a more stable license key once Yves from Miredot gets back 
to us.

In order to see the REST API documentation, you need to 
 * download maven-ant-tasks from Maven Central and put it into $NUTCH_HOME/ivy/
 * execute 
{code}
ant -lib ivy restdocs
{code}

This is a bit messy simply because Miredot only has builds for Maven and 
Gradle. This patch works around that by using Maven Ant Tasks.

Excellent work to everyone who has been developing and augmenting the REST 
API... there are a bunch of improvements suggested by Miredot which will make 
our REST API documentation braw!



[jira] [Assigned] (NUTCH-1800) Documentation for Nutch 1.X and 2.X REST APIs

2015-10-28 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney reassigned NUTCH-1800:
---

Assignee: Lewis John McGibbney

> Documentation for Nutch 1.X and 2.X REST APIs
> -
>
> Key: NUTCH-1800
> URL: https://issues.apache.org/jira/browse/NUTCH-1800
> Project: Nutch
>  Issue Type: Bug
>  Components: documentation, REST_api
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
> Fix For: 1.11, 2.3.1
>
>
> This issue should build on NUTCH-1769 with full Java documentation for all 
> classes in the following packages
> org.apache.nutch.api.*
> I am assigning this one to [~fjodor.vershinin] as he is doing an excellent 
> job on the REST API. His UML graphic in [0] and commentary show that he has 
> a good understanding of the REST API and its functionality.
> Thank you [~fjodor.vershinin], great work.
> [0] https://wiki.apache.org/nutch/NutchRESTAPI#UML_Graphic





[jira] [Updated] (NUTCH-1800) Documentation for Nutch 1.X and 2.X REST APIs

2015-10-28 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1800:

Fix Version/s: (was: 2.4)
   2.3.1
   1.11

> Documentation for Nutch 1.X and 2.X REST APIs
> -
>
> Key: NUTCH-1800
> URL: https://issues.apache.org/jira/browse/NUTCH-1800
> Project: Nutch
>  Issue Type: Bug
>  Components: documentation, REST_api
>Reporter: Lewis John McGibbney
> Fix For: 1.11, 2.3.1
>
>
> This issue should build on NUTCH-1769 with full Java documentation for all 
> classes in the following packages
> org.apache.nutch.api.*
> I am assigning this one to [~fjodor.vershinin] as he is doing an excellent 
> job on the REST API. His UML graphic in [0] and commentary show that he has 
> a good understanding of the REST API and its functionality.
> Thank you [~fjodor.vershinin], great work.
> [0] https://wiki.apache.org/nutch/NutchRESTAPI#UML_Graphic





Re: MireDot user activation

2015-10-28 Thread lewis john mcgibbney
Good Evening Yves,

I'm contacting you on behalf of the Apache Nutch project management team.
Apache Nutch [0] is a top level open source software project at the Apache
Software Foundation licensed under the Apache License v2.0.
I am writing to request a license key for our project. I requested one
for the Apache Tika project a while back and we now release kick ass REST
documentation with each release. It would be my intention to do the same
with Apache Nutch if possible.
Thank you in advance for any feedback you have on this one; it is greatly
appreciated.
Lewis John McGibbney
(On behalf of the Apache Nutch Project Management Committee)

[0] http://nutch.apache.org

On Wed, Oct 28, 2015 at 9:56 PM, Yves Vandewoude 
wrote:

> Hi Lewis McGibbney,
>
> A new MireDot account has been created for you.
>
> Click the url below to activate your account and select a password!
>
>

http://people.apache.org/~lewismc || @hectorMcSpector ||
http://www.linkedin.com/in/lmcgibbney

  Apache Gora V.P || Apache Nutch PMC || Apache Any23 V.P ||
Apache OODT PMC
   Apache Open Climate Workbench PMC || Apache Tika PMC || Apache TAC
Apache Usergrid || Apache HTrace (incubating) || Apache CommonsRDF
(incubating)


[jira] [Created] (NUTCH-2156) Dump via Services end point

2015-10-28 Thread Sujen Shah (JIRA)
Sujen Shah created NUTCH-2156:
-

 Summary: Dump via Services end point 
 Key: NUTCH-2156
 URL: https://issues.apache.org/jira/browse/NUTCH-2156
 Project: Nutch
  Issue Type: Sub-task
  Components: REST_api
Reporter: Sujen Shah
Assignee: Sujen Shah
 Fix For: 1.12


Expose the ./bin/nutch dump command via the REST API. 

Please review the documentation of the API design at 
http://docs.apachenutchrestapi.apiary.io/# and give your feedback. 

Thank you all for your input :) 





[jira] [Commented] (NUTCH-2155) Create a "crawl completeness" utility

2015-10-28 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14979335#comment-14979335
 ] 

ASF GitHub Bot commented on NUTCH-2155:
---

Github user MJJoyce commented on a diff in the pull request:

https://github.com/apache/nutch/pull/83#discussion_r43325287
  
--- Diff: src/java/org/apache/nutch/util/CrawlCompletionStats.java ---
@@ -0,0 +1,189 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.nutch.util;
+
+import java.io.IOException;
+import java.net.URL;
+import java.text.SimpleDateFormat;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.conf.Configured;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.io.LongWritable;
+import org.apache.hadoop.io.Text;
+import org.apache.hadoop.mapreduce.Job;
+import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
+import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
+import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
+import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
+import org.apache.hadoop.mapreduce.Mapper;
+import org.apache.hadoop.mapreduce.Reducer;
+import org.apache.hadoop.util.Tool;
+import org.apache.hadoop.util.ToolRunner;
+import org.apache.nutch.crawl.CrawlDatum;
+import org.apache.nutch.util.NutchConfiguration;
+import org.apache.nutch.util.TimingUtil;
+import org.apache.nutch.util.URLUtil;
+
+/**
+ * Extracts some simple crawl completion stats from the crawldb
+ *
+ * Stats will be sorted by host/domain and will be of the form:
+ * 1   www.spitzer.caltech.edu FETCHED
+ * 50  www.spitzer.caltech.edu UNFETCHED
+ *
+ */
+public class CrawlCompletionStats extends Configured implements Tool {
+
+  private static final Logger LOG = LoggerFactory
+  .getLogger(CrawlCompletionStats.class);
+
+  private static final int MODE_HOST = 1;
+  private static final int MODE_DOMAIN = 2;
+
+  private int mode = 0;
+
+  public int run(String[] args) throws Exception {
+if (args.length < 2) {
+  System.out
+  .println("usage: CrawlCompletionStats inputDirs outDir host|domain [numOfReducer]");
+  return 1;
+}
+String inputDir = args[0];
+String outputDir = args[1];
+int numOfReducers = 1;
+
+if (args.length > 3) {
+  numOfReducers = Integer.parseInt(args[3]);
+}
+
+SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
+long start = System.currentTimeMillis();
+LOG.info("CrawlCompletionStats: starting at " + sdf.format(start));
+
+int mode = 0;
+String jobName = "CrawlCompletionStats";
+if (args[2].equals("host")) {
+  jobName = "Host CrawlCompletionStats";
+  mode = MODE_HOST;
+} else if (args[2].equals("domain")) {
+  jobName = "Domain CrawlCompletionStats";
+  mode = MODE_DOMAIN;
+}
+
+Configuration conf = getConf();
+conf.setInt("domain.statistics.mode", mode);
+conf.setBoolean("mapreduce.fileoutputcommitter.marksuccessfuljobs", false);
+
+Job job = Job.getInstance(conf, jobName);
+job.setJarByClass(CrawlCompletionStats.class);
+
+String[] inputDirsSpecs = inputDir.split(",");
+for (int i = 0; i < inputDirsSpecs.length; i++) {
+  FileInputFormat.addInputPath(job, new Path(inputDirsSpecs[i]));
+}
+
+job.setInputFormatClass(SequenceFileInputFormat.class);
+FileOutputFormat.setOutputPath(job, new Path(outputDir));
+job.setOutputFormatClass(TextOutputFormat.class);
+
+job.setMapOutputKeyClass(Text.class);
+job.setMapOutputValueClass(LongWritable.class);
+job.setOutputKeyClass(Text.class);
+job.setOutputValueClass
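
The Javadoc in the diff above describes output lines of the form "count, host, status". As a rough illustration of that aggregation only, here is a minimal plain-Java sketch. It is not the MapReduce job from the pull request: the `CrawlCompletionSketch` class, its `Entry` type and the `countByHost` method are hypothetical names invented for this example, the status is assumed to be available as a plain string, and `java.net.URI` is used for host extraction instead of whatever the patch itself does.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Plain-Java illustration only; the class in the diff runs as a Hadoop MapReduce job. */
final class CrawlCompletionSketch {

  /** Hypothetical stand-in for one crawldb record: a URL and a fetch status label. */
  static final class Entry {
    final String url;
    final String status;

    Entry(String url, String status) {
      this.url = url;
      this.status = status;
    }
  }

  /** Counts records per "host STATUS" key; the TreeMap keeps keys sorted by host. */
  static Map<String, Long> countByHost(List<Entry> entries) {
    Map<String, Long> counts = new TreeMap<>();
    for (Entry e : entries) {
      String host = URI.create(e.url).getHost();
      counts.merge(host + " " + e.status, 1L, Long::sum);
    }
    return counts;
  }

  public static void main(String[] args) {
    List<Entry> entries = Arrays.asList(
        new Entry("http://www.spitzer.caltech.edu/", "FETCHED"),
        new Entry("http://www.spitzer.caltech.edu/a", "UNFETCHED"),
        new Entry("http://www.spitzer.caltech.edu/b", "UNFETCHED"));
    // Print one "count<TAB>host STATUS" line per key, mirroring the Javadoc's layout.
    for (Map.Entry<String, Long> row : countByHost(entries).entrySet()) {
      System.out.println(row.getValue() + "\t" + row.getKey());
    }
  }
}
```

In the actual job the same grouping would presumably be split between a mapper that emits the key and a combiner/reducer that sums the counts; the in-memory map here only stands in for that shuffle.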

[GitHub] nutch pull request: NUTCH-2155 - Add crawl completion utility

2015-10-28 Thread MJJoyce
Github user MJJoyce commented on a diff in the pull request:

https://github.com/apache/nutch/pull/83#discussion_r43325287
  
--- Diff: src/java/org/apache/nutch/util/CrawlCompletionStats.java ---
@@ -0,0 +1,189 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.nutch.util;
+
+import java.io.IOException;
+import java.net.URL;
+import java.text.SimpleDateFormat;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.conf.Configured;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.io.LongWritable;
+import org.apache.hadoop.io.Text;
+import org.apache.hadoop.mapreduce.Job;
+import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
+import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
+import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
+import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
+import org.apache.hadoop.mapreduce.Mapper;
+import org.apache.hadoop.mapreduce.Reducer;
+import org.apache.hadoop.util.Tool;
+import org.apache.hadoop.util.ToolRunner;
+import org.apache.nutch.crawl.CrawlDatum;
+import org.apache.nutch.util.NutchConfiguration;
+import org.apache.nutch.util.TimingUtil;
+import org.apache.nutch.util.URLUtil;
+
+/**
+ * Extracts some simple crawl completion stats from the crawldb
+ *
+ * Stats will be sorted by host/domain and will be of the form:
+ * 1   www.spitzer.caltech.edu FETCHED
+ * 50  www.spitzer.caltech.edu UNFETCHED
+ *
+ */
+public class CrawlCompletionStats extends Configured implements Tool {
+
+  private static final Logger LOG = LoggerFactory
+  .getLogger(CrawlCompletionStats.class);
+
+  private static final int MODE_HOST = 1;
+  private static final int MODE_DOMAIN = 2;
+
+  private int mode = 0;
+
+  public int run(String[] args) throws Exception {
+if (args.length < 2) {
+  System.out
+  .println("usage: CrawlCompletionStats inputDirs outDir host|domain [numOfReducer]");
+  return 1;
+}
+String inputDir = args[0];
+String outputDir = args[1];
+int numOfReducers = 1;
+
+if (args.length > 3) {
+  numOfReducers = Integer.parseInt(args[3]);
+}
+
+SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
+long start = System.currentTimeMillis();
+LOG.info("CrawlCompletionStats: starting at " + sdf.format(start));
+
+int mode = 0;
+String jobName = "CrawlCompletionStats";
+if (args[2].equals("host")) {
+  jobName = "Host CrawlCompletionStats";
+  mode = MODE_HOST;
+} else if (args[2].equals("domain")) {
+  jobName = "Domain CrawlCompletionStats";
+  mode = MODE_DOMAIN;
+}
+
+Configuration conf = getConf();
+conf.setInt("domain.statistics.mode", mode);
+conf.setBoolean("mapreduce.fileoutputcommitter.marksuccessfuljobs", false);
+
+Job job = Job.getInstance(conf, jobName);
+job.setJarByClass(CrawlCompletionStats.class);
+
+String[] inputDirsSpecs = inputDir.split(",");
+for (int i = 0; i < inputDirsSpecs.length; i++) {
+  FileInputFormat.addInputPath(job, new Path(inputDirsSpecs[i]));
+}
+
+job.setInputFormatClass(SequenceFileInputFormat.class);
+FileOutputFormat.setOutputPath(job, new Path(outputDir));
+job.setOutputFormatClass(TextOutputFormat.class);
+
+job.setMapOutputKeyClass(Text.class);
+job.setMapOutputValueClass(LongWritable.class);
+job.setOutputKeyClass(Text.class);
+job.setOutputValueClass(LongWritable.class);
+
+job.setMapperClass(CrawlCompletionStatsMapper.class);
+job.setReducerClass(CrawlCompletionStatsReducer.class);
+job.setCombinerClass(CrawlCompletionStatsCombiner.class);
+job.setNumReduceTasks(nu
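
The usage string in the diff above names four positional arguments: `inputDirs outDir host|domain [numOfReducer]`. The following is a minimal sketch of how those arguments could be validated and parsed in isolation. It is not the code from the pull request: `StatsArgsSketch` and `parse` are hypothetical names, and two choices are assumptions made for this example, namely requiring at least three arguments (since the mode is read from the third one) and rejecting a mode other than `host` or `domain`.

```java
/** Hypothetical holder for the four positional arguments named in the usage string. */
final class StatsArgsSketch {
  static final int MODE_HOST = 1;
  static final int MODE_DOMAIN = 2;

  final String[] inputDirs;
  final String outputDir;
  final int mode;
  final int numOfReducers;

  private StatsArgsSketch(String[] inputDirs, String outputDir, int mode, int numOfReducers) {
    this.inputDirs = inputDirs;
    this.outputDir = outputDir;
    this.mode = mode;
    this.numOfReducers = numOfReducers;
  }

  static StatsArgsSketch parse(String[] args) {
    // args[2] is read below, so at least three arguments must be present.
    if (args.length < 3) {
      throw new IllegalArgumentException(
          "usage: CrawlCompletionStats inputDirs outDir host|domain [numOfReducer]");
    }
    int mode;
    if (args[2].equals("host")) {
      mode = MODE_HOST;
    } else if (args[2].equals("domain")) {
      mode = MODE_DOMAIN;
    } else {
      // Assumption for this sketch: an unknown mode is an error rather than mode 0.
      throw new IllegalArgumentException("mode must be host or domain: " + args[2]);
    }
    // The optional fourth argument overrides the default of one reducer.
    int numOfReducers = args.length > 3 ? Integer.parseInt(args[3]) : 1;
    // Multiple input directories are passed as one comma-separated argument.
    return new StatsArgsSketch(args[0].split(","), args[1], mode, numOfReducers);
  }
}
```

Keeping the parsing separate from job setup like this would make the argument handling easy to unit test without a Hadoop configuration.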

[jira] [Commented] (NUTCH-2155) Create a "crawl completeness" utility

2015-10-28 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14979327#comment-14979327
 ] 

ASF GitHub Bot commented on NUTCH-2155:
---

Github user lewismc commented on a diff in the pull request:

https://github.com/apache/nutch/pull/83#discussion_r43324772
  
--- Diff: src/java/org/apache/nutch/util/CrawlCompletionStats.java ---
@@ -0,0 +1,189 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.nutch.util;
+
+import java.io.IOException;
+import java.net.URL;
+import java.text.SimpleDateFormat;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.conf.Configured;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.io.LongWritable;
+import org.apache.hadoop.io.Text;
+import org.apache.hadoop.mapreduce.Job;
+import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
+import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
+import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
+import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
+import org.apache.hadoop.mapreduce.Mapper;
+import org.apache.hadoop.mapreduce.Reducer;
+import org.apache.hadoop.util.Tool;
+import org.apache.hadoop.util.ToolRunner;
+import org.apache.nutch.crawl.CrawlDatum;
+import org.apache.nutch.util.NutchConfiguration;
+import org.apache.nutch.util.TimingUtil;
+import org.apache.nutch.util.URLUtil;
+
+/**
+ * Extracts some simple crawl completion stats from the crawldb
+ *
+ * Stats will be sorted by host/domain and will be of the form:
+ * 1   www.spitzer.caltech.edu FETCHED
+ * 50  www.spitzer.caltech.edu UNFETCHED
+ *
+ */
+public class CrawlCompletionStats extends Configured implements Tool {
+
+  private static final Logger LOG = LoggerFactory
+  .getLogger(CrawlCompletionStats.class);
+
+  private static final int MODE_HOST = 1;
+  private static final int MODE_DOMAIN = 2;
+
+  private int mode = 0;
+
+  public int run(String[] args) throws Exception {
+if (args.length < 2) {
+  System.out
+  .println("usage: CrawlCompletionStats inputDirs outDir host|domain [numOfReducer]");
+  return 1;
+}
+String inputDir = args[0];
+String outputDir = args[1];
+int numOfReducers = 1;
+
+if (args.length > 3) {
+  numOfReducers = Integer.parseInt(args[3]);
+}
+
+SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
+long start = System.currentTimeMillis();
+LOG.info("CrawlCompletionStats: starting at " + sdf.format(start));
+
+int mode = 0;
+String jobName = "CrawlCompletionStats";
+if (args[2].equals("host")) {
+  jobName = "Host CrawlCompletionStats";
+  mode = MODE_HOST;
+} else if (args[2].equals("domain")) {
+  jobName = "Domain CrawlCompletionStats";
+  mode = MODE_DOMAIN;
+}
+
+Configuration conf = getConf();
+conf.setInt("domain.statistics.mode", mode);
+conf.setBoolean("mapreduce.fileoutputcommitter.marksuccessfuljobs", false);
+
+Job job = Job.getInstance(conf, jobName);
+job.setJarByClass(CrawlCompletionStats.class);
+
+String[] inputDirsSpecs = inputDir.split(",");
+for (int i = 0; i < inputDirsSpecs.length; i++) {
+  FileInputFormat.addInputPath(job, new Path(inputDirsSpecs[i]));
+}
+
+job.setInputFormatClass(SequenceFileInputFormat.class);
+FileOutputFormat.setOutputPath(job, new Path(outputDir));
+job.setOutputFormatClass(TextOutputFormat.class);
+
+job.setMapOutputKeyClass(Text.class);
+job.setMapOutputValueClass(LongWritable.class);
+job.setOutputKeyClass(Text.class);
+job.setOutputValueClass

[GitHub] nutch pull request: NUTCH-2155 - Add crawl completion utility

2015-10-28 Thread lewismc
Github user lewismc commented on a diff in the pull request:

https://github.com/apache/nutch/pull/83#discussion_r43324772
  
--- Diff: src/java/org/apache/nutch/util/CrawlCompletionStats.java ---
@@ -0,0 +1,189 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.nutch.util;
+
+import java.io.IOException;
+import java.net.URL;
+import java.text.SimpleDateFormat;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.conf.Configured;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.io.LongWritable;
+import org.apache.hadoop.io.Text;
+import org.apache.hadoop.mapreduce.Job;
+import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
+import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
+import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
+import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
+import org.apache.hadoop.mapreduce.Mapper;
+import org.apache.hadoop.mapreduce.Reducer;
+import org.apache.hadoop.util.Tool;
+import org.apache.hadoop.util.ToolRunner;
+import org.apache.nutch.crawl.CrawlDatum;
+import org.apache.nutch.util.NutchConfiguration;
+import org.apache.nutch.util.TimingUtil;
+import org.apache.nutch.util.URLUtil;
+
+/**
+ * Extracts some simple crawl completion stats from the crawldb
+ *
+ * Stats will be sorted by host/domain and will be of the form:
+ * 1   www.spitzer.caltech.edu FETCHED
+ * 50  www.spitzer.caltech.edu UNFETCHED
+ *
+ */
+public class CrawlCompletionStats extends Configured implements Tool {
+
+  private static final Logger LOG = LoggerFactory
+  .getLogger(CrawlCompletionStats.class);
+
+  private static final int MODE_HOST = 1;
+  private static final int MODE_DOMAIN = 2;
+
+  private int mode = 0;
+
+  public int run(String[] args) throws Exception {
+if (args.length < 2) {
+  System.out
+  .println("usage: CrawlCompletionStats inputDirs outDir host|domain [numOfReducer]");
+  return 1;
+}
+String inputDir = args[0];
+String outputDir = args[1];
+int numOfReducers = 1;
+
+if (args.length > 3) {
+  numOfReducers = Integer.parseInt(args[3]);
+}
+
+SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
+long start = System.currentTimeMillis();
+LOG.info("CrawlCompletionStats: starting at " + sdf.format(start));
+
+int mode = 0;
+String jobName = "CrawlCompletionStats";
+if (args[2].equals("host")) {
+  jobName = "Host CrawlCompletionStats";
+  mode = MODE_HOST;
+} else if (args[2].equals("domain")) {
+  jobName = "Domain CrawlCompletionStats";
+  mode = MODE_DOMAIN;
+}
+
+Configuration conf = getConf();
+conf.setInt("domain.statistics.mode", mode);
+conf.setBoolean("mapreduce.fileoutputcommitter.marksuccessfuljobs", false);
+
+Job job = Job.getInstance(conf, jobName);
+job.setJarByClass(CrawlCompletionStats.class);
+
+String[] inputDirsSpecs = inputDir.split(",");
+for (int i = 0; i < inputDirsSpecs.length; i++) {
+  FileInputFormat.addInputPath(job, new Path(inputDirsSpecs[i]));
+}
+
+job.setInputFormatClass(SequenceFileInputFormat.class);
+FileOutputFormat.setOutputPath(job, new Path(outputDir));
+job.setOutputFormatClass(TextOutputFormat.class);
+
+job.setMapOutputKeyClass(Text.class);
+job.setMapOutputValueClass(LongWritable.class);
+job.setOutputKeyClass(Text.class);
+job.setOutputValueClass(LongWritable.class);
+
+job.setMapperClass(CrawlCompletionStatsMapper.class);
+job.setReducerClass(CrawlCompletionStatsReducer.class);
+job.setCombinerClass(CrawlCompletionStatsCombiner.class);
+job.setNumReduceTasks(nu

[jira] [Commented] (NUTCH-2155) Create a "crawl completeness" utility

2015-10-28 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14979325#comment-14979325
 ] 

ASF GitHub Bot commented on NUTCH-2155:
---

Github user MJJoyce commented on a diff in the pull request:

https://github.com/apache/nutch/pull/83#discussion_r43324656
  
--- Diff: src/java/org/apache/nutch/util/CrawlCompletionStats.java ---
@@ -0,0 +1,189 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.nutch.util;
+
+import java.io.IOException;
+import java.net.URL;
+import java.text.SimpleDateFormat;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.conf.Configured;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.io.LongWritable;
+import org.apache.hadoop.io.Text;
+import org.apache.hadoop.mapreduce.Job;
+import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
+import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
+import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
+import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
+import org.apache.hadoop.mapreduce.Mapper;
+import org.apache.hadoop.mapreduce.Reducer;
+import org.apache.hadoop.util.Tool;
+import org.apache.hadoop.util.ToolRunner;
+import org.apache.nutch.crawl.CrawlDatum;
+import org.apache.nutch.util.NutchConfiguration;
+import org.apache.nutch.util.TimingUtil;
+import org.apache.nutch.util.URLUtil;
+
+/**
+ * Extracts some simple crawl completion stats from the crawldb
+ *
+ * Stats will be sorted by host/domain and will be of the form:
+ * 1   www.spitzer.caltech.edu FETCHED
+ * 50  www.spitzer.caltech.edu UNFETCHED
+ *
+ */
+public class CrawlCompletionStats extends Configured implements Tool {
+
+  private static final Logger LOG = LoggerFactory
+  .getLogger(CrawlCompletionStats.class);
+
+  private static final int MODE_HOST = 1;
+  private static final int MODE_DOMAIN = 2;
+
+  private int mode = 0;
+
+  public int run(String[] args) throws Exception {
+if (args.length < 2) {
--- End diff --

+1 I agree completely @lewismc. I got a bit lazy and stole some from 
domainstats (which is also in need of some commons-cli love). I'll try 
to throw a patch together and address some of these issues when I get some free 
time.


> Create a "crawl completeness" utility
> -
>
> Key: NUTCH-2155
> URL: https://issues.apache.org/jira/browse/NUTCH-2155
> Project: Nutch
>  Issue Type: Improvement
>  Components: util
>Affects Versions: 1.10
>Reporter: Michael Joyce
> Fix For: 1.12
>
>
> I've found it useful to have a tool for dumping some "completeness" 
> information from a crawl similar to how domainstats does but including 
> fetched and unfetched counts per domain/host. This is especially nice when 
> doing vertical crawls over a few domains or just to see how much of a 
> host/domain you've covered with your crawl so far.





[GitHub] nutch pull request: NUTCH-2155 - Add crawl completion utility

2015-10-28 Thread MJJoyce
Github user MJJoyce commented on a diff in the pull request:

https://github.com/apache/nutch/pull/83#discussion_r43324656
  
--- Diff: src/java/org/apache/nutch/util/CrawlCompletionStats.java ---
@@ -0,0 +1,189 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.nutch.util;
+
+import java.io.IOException;
+import java.net.URL;
+import java.text.SimpleDateFormat;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.conf.Configured;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.io.LongWritable;
+import org.apache.hadoop.io.Text;
+import org.apache.hadoop.mapreduce.Job;
+import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
+import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
+import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
+import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
+import org.apache.hadoop.mapreduce.Mapper;
+import org.apache.hadoop.mapreduce.Reducer;
+import org.apache.hadoop.util.Tool;
+import org.apache.hadoop.util.ToolRunner;
+import org.apache.nutch.crawl.CrawlDatum;
+import org.apache.nutch.util.NutchConfiguration;
+import org.apache.nutch.util.TimingUtil;
+import org.apache.nutch.util.URLUtil;
+
+/**
+ * Extracts some simple crawl completion stats from the crawldb
+ *
+ * Stats will be sorted by host/domain and will be of the form:
+ * 1   www.spitzer.caltech.edu FETCHED
+ * 50  www.spitzer.caltech.edu UNFETCHED
+ *
+ */
+public class CrawlCompletionStats extends Configured implements Tool {
+
+  private static final Logger LOG = LoggerFactory
+  .getLogger(CrawlCompletionStats.class);
+
+  private static final int MODE_HOST = 1;
+  private static final int MODE_DOMAIN = 2;
+
+  private int mode = 0;
+
+  public int run(String[] args) throws Exception {
+if (args.length < 2) {
--- End diff --

+1 I agree completely @lewismc. I got a bit lazy and stole some from 
domainstats (which is also in need of some commons-cli love). I'll try 
to throw a patch together and address some of these issues when I get some free 
time.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] nutch pull request: NUTCH-2155 - Add crawl completion utility

2015-10-28 Thread lewismc
Github user lewismc commented on a diff in the pull request:

https://github.com/apache/nutch/pull/83#discussion_r43324357
  
--- Diff: src/java/org/apache/nutch/util/CrawlCompletionStats.java ---
@@ -0,0 +1,189 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.nutch.util;
+
+import java.io.IOException;
+import java.net.URL;
+import java.text.SimpleDateFormat;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.conf.Configured;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.io.LongWritable;
+import org.apache.hadoop.io.Text;
+import org.apache.hadoop.mapreduce.Job;
+import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
+import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
+import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
+import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
+import org.apache.hadoop.mapreduce.Mapper;
+import org.apache.hadoop.mapreduce.Reducer;
+import org.apache.hadoop.util.Tool;
+import org.apache.hadoop.util.ToolRunner;
+import org.apache.nutch.crawl.CrawlDatum;
+import org.apache.nutch.util.NutchConfiguration;
+import org.apache.nutch.util.TimingUtil;
+import org.apache.nutch.util.URLUtil;
+
+/**
+ * Extracts some simple crawl completion stats from the crawldb
+ *
+ * Stats will be sorted by host/domain and will be of the form:
+ * 1   www.spitzer.caltech.edu FETCHED
+ * 50  www.spitzer.caltech.edu UNFETCHED
+ *
+ */
+public class CrawlCompletionStats extends Configured implements Tool {
+
+  private static final Logger LOG = LoggerFactory
+  .getLogger(CrawlCompletionStats.class);
+
+  private static final int MODE_HOST = 1;
+  private static final int MODE_DOMAIN = 2;
+
+  private int mode = 0;
+
+  public int run(String[] args) throws Exception {
+if (args.length < 2) {
--- End diff --

Not an absolute requirement but merely a suggestion: it would be GREAT to 
see commons-cli used here to prevent incorrect CLI usage. It also prints much 
more user-friendly output when invoked without options or with '-h'.
For excellent examples of where commons-cli is already used please see 
[here](https://github.com/apache/nutch/blob/trunk/src/java/org/apache/nutch/scoring/webgraph/WebGraph.java#L698-L760)
 @MJJoyce.
I like this job as well, it's a neat and quick way to see domain coverage.
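
The usage problem raised in this comment can be illustrated without any
library. The following is a minimal, dependency-free sketch; it is neither
commons-cli nor the code in the pull request. It assumes three positional
arguments (input dirs, output dir, and a mode of "host" or "domain"), and the
class name CrawlStatsArgs is hypothetical. With commons-cli, the hand-written
checks would be replaced by declared options and generated help text.

```java
/** Hypothetical holder for validated arguments; not part of Nutch. */
class CrawlStatsArgs {
  // Assumed argument order, for illustration only.
  static final String USAGE =
      "Usage: CrawlCompletionStats inputDirs outDir host|domain";

  final String inputDirs;
  final String outputDir;
  final String mode;

  private CrawlStatsArgs(String inputDirs, String outputDir, String mode) {
    this.inputDirs = inputDirs;
    this.outputDir = outputDir;
    this.mode = mode;
  }

  /** Checks the argument count and the mode before anything indexes into args. */
  static CrawlStatsArgs parse(String[] args) {
    if (args == null || args.length < 3) {
      throw new IllegalArgumentException(USAGE);
    }
    String mode = args[2];
    if (!"host".equals(mode) && !"domain".equals(mode)) {
      throw new IllegalArgumentException("Unknown mode '" + mode + "'. " + USAGE);
    }
    return new CrawlStatsArgs(args[0], args[1], mode);
  }
}
```

A caller would catch the IllegalArgumentException, print its message, and
return a non-zero exit code instead of failing later with an
ArrayIndexOutOfBoundsException.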




[jira] [Commented] (NUTCH-2155) Create a "crawl completeness" utility

2015-10-28 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14979319#comment-14979319
 ] 

ASF GitHub Bot commented on NUTCH-2155:
---

Github user lewismc commented on a diff in the pull request:

https://github.com/apache/nutch/pull/83#discussion_r43324357
  
--- Diff: src/java/org/apache/nutch/util/CrawlCompletionStats.java ---
@@ -0,0 +1,189 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.nutch.util;
+
+import java.io.IOException;
+import java.net.URL;
+import java.text.SimpleDateFormat;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.conf.Configured;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.io.LongWritable;
+import org.apache.hadoop.io.Text;
+import org.apache.hadoop.mapreduce.Job;
+import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
+import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
+import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
+import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
+import org.apache.hadoop.mapreduce.Mapper;
+import org.apache.hadoop.mapreduce.Reducer;
+import org.apache.hadoop.util.Tool;
+import org.apache.hadoop.util.ToolRunner;
+import org.apache.nutch.crawl.CrawlDatum;
+import org.apache.nutch.util.NutchConfiguration;
+import org.apache.nutch.util.TimingUtil;
+import org.apache.nutch.util.URLUtil;
+
+/**
+ * Extracts some simple crawl completion stats from the crawldb
+ *
+ * Stats will be sorted by host/domain and will be of the form:
+ * 1   www.spitzer.caltech.edu FETCHED
+ * 50  www.spitzer.caltech.edu UNFETCHED
+ *
+ */
+public class CrawlCompletionStats extends Configured implements Tool {
+
+  private static final Logger LOG = LoggerFactory
+  .getLogger(CrawlCompletionStats.class);
+
+  private static final int MODE_HOST = 1;
+  private static final int MODE_DOMAIN = 2;
+
+  private int mode = 0;
+
+  public int run(String[] args) throws Exception {
+if (args.length < 2) {
--- End diff --

Not an absolute requirement but merely a suggestion: it would be GREAT to 
see commons-cli used here to prevent incorrect CLI usage. It also prints much 
more user-friendly output when invoked without options or with '-h'.
For excellent examples of where commons-cli is already used please see 
[here](https://github.com/apache/nutch/blob/trunk/src/java/org/apache/nutch/scoring/webgraph/WebGraph.java#L698-L760)
 @MJJoyce.
I like this job as well, it's a neat and quick way to see domain coverage.


> Create a "crawl completeness" utility
> -
>
> Key: NUTCH-2155
> URL: https://issues.apache.org/jira/browse/NUTCH-2155
> Project: Nutch
>  Issue Type: Improvement
>  Components: util
>Affects Versions: 1.10
>Reporter: Michael Joyce
> Fix For: 1.12
>
>
> I've found it useful to have a tool for dumping some "completeness" 
> information from a crawl similar to how domainstats does but including 
> fetched and unfetched counts per domain/host. This is especially nice when 
> doing vertical crawls over a few domains or just to see how much of a 
> host/domain you've covered with your crawl so far.





[GitHub] nutch pull request: NUTCH-2155 - Add crawl completion utility

2015-10-28 Thread MJJoyce
GitHub user MJJoyce opened a pull request:

https://github.com/apache/nutch/pull/83

NUTCH-2155 - Add crawl completion utility

- Add simple crawl completion utility that reports counts of fetched and
  unfetched pages per domain or host.
- Update "nutch" helper script with new utility command.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/MJJoyce/nutch NUTCH-2155

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/nutch/pull/83.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #83


commit 2534b7a32417e044f5c1c39f4409a6d6826eee69
Author: Michael Joyce 
Date:   2015-10-28T21:18:16Z

NUTCH-2155 - Add crawl completion util

- Add simple crawl completion utility that reports counts of fetched and
  unfetched pages per domain or host.
- Update "nutch" helper script with new utility command.






[jira] [Commented] (NUTCH-2155) Create a "crawl completeness" utility

2015-10-28 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14979258#comment-14979258
 ] 

ASF GitHub Bot commented on NUTCH-2155:
---

GitHub user MJJoyce opened a pull request:

https://github.com/apache/nutch/pull/83

NUTCH-2155 - Add crawl completion utility

- Add simple crawl completion utility that reports counts of fetched and
  unfetched pages per domain or host.
- Update "nutch" helper script with new utility command.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/MJJoyce/nutch NUTCH-2155

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/nutch/pull/83.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #83


commit 2534b7a32417e044f5c1c39f4409a6d6826eee69
Author: Michael Joyce 
Date:   2015-10-28T21:18:16Z

NUTCH-2155 - Add crawl completion util

- Add simple crawl completion utility that reports counts of fetched and
  unfetched pages per domain or host.
- Update "nutch" helper script with new utility command.




> Create a "crawl completeness" utility
> -
>
> Key: NUTCH-2155
> URL: https://issues.apache.org/jira/browse/NUTCH-2155
> Project: Nutch
>  Issue Type: Improvement
>  Components: util
>Affects Versions: 1.10
>Reporter: Michael Joyce
> Fix For: 1.12
>
>
> I've found it useful to have a tool for dumping some "completeness" 
> information from a crawl similar to how domainstats does but including 
> fetched and unfetched counts per domain/host. This is especially nice when 
> doing vertical crawls over a few domains or just to see how much of a 
> host/domain you've covered with your crawl so far.





[jira] [Created] (NUTCH-2155) Create a "crawl completeness" utility

2015-10-28 Thread Michael Joyce (JIRA)
Michael Joyce created NUTCH-2155:


 Summary: Create a "crawl completeness" utility
 Key: NUTCH-2155
 URL: https://issues.apache.org/jira/browse/NUTCH-2155
 Project: Nutch
  Issue Type: Improvement
  Components: util
Affects Versions: 1.10
Reporter: Michael Joyce
 Fix For: 1.12


I've found it useful to have a tool for dumping some "completeness" information 
from a crawl similar to how domainstats does but including fetched and 
unfetched counts per domain/host. This is especially nice when doing vertical 
crawls over a few domains or just to see how much of a host/domain you've 
covered with your crawl so far.
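
The per-host counting described above can be sketched in plain Java. This is
an illustration only, not the MapReduce job from the patch: the class
CompletionCounts, its method, and the input shape (a URL mapped to a "fetched"
flag) are assumptions made for the example.

```java
import java.net.URI;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper; the real tool reads CrawlDatum records from the crawldb. */
class CompletionCounts {

  /** Returns counts keyed by "host STATUS", sorted by host and status. */
  static Map<String, Long> countByHost(Map<String, Boolean> crawl) {
    Map<String, Long> counts = new TreeMap<>();
    for (Map.Entry<String, Boolean> entry : crawl.entrySet()) {
      String host = URI.create(entry.getKey()).getHost();
      if (host == null) {
        continue; // skip URLs without a host component
      }
      String status = entry.getValue() ? "FETCHED" : "UNFETCHED";
      counts.merge(host + " " + status, 1L, Long::sum);
    }
    return counts;
  }
}
```

Printing each value followed by its key gives lines in the same shape as the
sample output in the class Javadoc (count, host, status).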





[jira] [Commented] (NUTCH-2155) Create a "crawl completeness" utility

2015-10-28 Thread Michael Joyce (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14979196#comment-14979196
 ] 

Michael Joyce commented on NUTCH-2155:
--

Should have a first patch up shortly for review, folks.

> Create a "crawl completeness" utility
> -
>
> Key: NUTCH-2155
> URL: https://issues.apache.org/jira/browse/NUTCH-2155
> Project: Nutch
>  Issue Type: Improvement
>  Components: util
>Affects Versions: 1.10
>Reporter: Michael Joyce
> Fix For: 1.12
>
>
> I've found it useful to have a tool for dumping some "completeness" 
> information from a crawl similar to how domainstats does but including 
> fetched and unfetched counts per domain/host. This is especially nice when 
> doing vertical crawls over a few domains or just to see how much of a 
> host/domain you've covered with your crawl so far.





[Nutch Wiki] Trivial Update of "NewScoringIndexingExample" by LewisJohnMcgibbney

2015-10-28 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.

The "NewScoringIndexingExample" page has been changed by LewisJohnMcgibbney:
https://wiki.apache.org/nutch/NewScoringIndexingExample?action=diff&rev1=7&rev2=8

  
  = Class Diagram =
  
+ Below is a thumbnail Class Diagram representing the Java Class ecosystem for 
WebGraph. 
+ You can click on the thumbnail for a much larger, downloadable picture.
+ 
+ [[attachment:NutchWebGraph.png|{{attachment:NutchWebGraph.png||width=100}}]] 
+ 


[Nutch Wiki] New attachment added to page NewScoringIndexingExample

2015-10-28 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page "NewScoringIndexingExample" for change 
notification. An attachment has been added to that page by LewisJohnMcgibbney. 
Following detailed information is available:

Attachment name: NutchWebGraph.png
Attachment size: 859412
Attachment link: 
https://wiki.apache.org/nutch/NewScoringIndexingExample?action=AttachFile&do=get&target=NutchWebGraph.png
Page link: https://wiki.apache.org/nutch/NewScoringIndexingExample


[Nutch Wiki] Trivial Update of "NewScoringIndexingExample" by LewisJohnMcgibbney

2015-10-28 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.

The "NewScoringIndexingExample" page has been changed by LewisJohnMcgibbney:
https://wiki.apache.org/nutch/NewScoringIndexingExample?action=diff&rev1=6&rev2=7

  
  '''N.B.''' This page and the functionality described within is only 
applicable and relevant to Nutch 1.X.
  
+ <>
+ 
+ = Introduction =
+ 
  Below is an example of running the new scoring and indexing systems from 
start to finish.  This was done with a sample of 1000 urls and I ran two 
different fetch cycles.  The first being 1000 urls and the second being the top 
2000 urls.  The loops job is optional but included for completeness.  In 
production we have actually removed that job.  This was done with a clean pull 
from Nutch trunk as of 2009-03-06 (right before 1.0 is set to be released).  If 
anybody has any problems running these commands or has questions send me an 
email or send one to the nutch users or dev list and I will reply.  Please send 
it to kubes at the apache address dot org.
  
+ = Workflow =
  
  {{{
  bin/nutch inject crawl/crawldb crawl/urls/
@@ -18, +23 @@

  }}}
  
  One thing to point out here is that WebGraph is meant to be used on larger 
web crawls to create web graphs.  By default it ignores outlinks to pages in 
the same domain, including subdomains, and pages with the same hostname.  It 
also limits to one outlink per page to links in the same page or the same 
domain.  All of these options are changeable through the following 
configuration options:
+ 
+ = Configuration =
  
  {{{
  
@@ -47, +54 @@

   
  
  }}}
+ 
+ = Additional WebGraph Classes =
  
  But by default if you are only crawling pages within a domain or within a set 
of subdomains, all outlinks will be ignored and you will come up with an empty 
webgraph.  This in turn will throw an error while processing through the 
LinkRank job.  The flip side is by NOT ignoring links to the same domain/host 
and by not limiting those links, the webgraph becomes much, much more dense and 
hence there are a lot more links to process, which probably won't affect 
relevancy as much.
  
@@ -163, +172 @@

  bin/nutch org.apache.nutch.indexer.field.FieldIndexer -fields 
crawl/fields/basicfields/ -fields crawl/fields/anchorfields/ -output 
crawl/indexes
  }}}
  
+ = Class Diagram =
+ 


[Nutch Wiki] Trivial Update of "NewScoringIndexingExample" by LewisJohnMcgibbney

2015-10-28 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.

The "NewScoringIndexingExample" page has been changed by LewisJohnMcgibbney:
https://wiki.apache.org/nutch/NewScoringIndexingExample?action=diff&rev1=5&rev2=6

  bin/nutch org.apache.nutch.scoring.webgraph.WebGraph -segment 
crawl/segments/20090306093949/ -segment crawl/segments/20090306100055/ 
-webgraphdb crawl/webgraphdb
  }}}
  
- One thing that has been brought up is the -segment flag on webgraph.  If you 
have more than one segment then you would have more than one segment flag as 
shown above.
+ One thing that has been brought up is the -segment flag on webgraph.  If you 
have more than one segment then you would use the -segmentDir flag available on 
the command line interface.
  
  {{{
  bin/nutch org.apache.nutch.scoring.webgraph.Loops -webgraphdb 
crawl/webgraphdb/


[jira] [Commented] (NUTCH-2132) Publisher/Subscriber model for Nutch to emit events

2015-10-28 Thread Aron Ahmadia (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14978963#comment-14978963
 ] 

Aron Ahmadia commented on NUTCH-2132:
-




Thanks for that guidance :)


> Publisher/Subscriber model for Nutch to emit events 
> 
>
> Key: NUTCH-2132
> URL: https://issues.apache.org/jira/browse/NUTCH-2132
> Project: Nutch
>  Issue Type: New Feature
>  Components: fetcher, REST_api
>Reporter: Sujen Shah
>  Labels: memex
> Fix For: 1.12
>
> Attachments: NUTCH-2132.patch, PubSub_routingkey.patch
>
>
> It would be nice to have a Pub/Sub model in Nutch to emit certain events (ex- 
> Fetcher events like fetch-start, fetch-end, a fetch report which may contain 
> data like outlinks of the current fetched url, score, etc). 
> A consumer of this functionality could use this data to generate real time 
> visualization and generate statistics of the crawl without having to wait for 
> the fetch round to finish. 
> The REST API could contain an endpoint which would respond with a url to 
> which a client could subscribe to get the fetcher events. 





[jira] [Commented] (NUTCH-2132) Publisher/Subscriber model for Nutch to emit events

2015-10-28 Thread Sujen Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14978952#comment-14978952
 ] 

Sujen Shah commented on NUTCH-2132:
---

Yes, this is taken care of in the second patch. Also, apply the second patch 
against trunk, not on top of the first one. 

> Publisher/Subscriber model for Nutch to emit events 
> 
>
> Key: NUTCH-2132
> URL: https://issues.apache.org/jira/browse/NUTCH-2132
> Project: Nutch
>  Issue Type: New Feature
>  Components: fetcher, REST_api
>Reporter: Sujen Shah
>  Labels: memex
> Fix For: 1.12
>
> Attachments: NUTCH-2132.patch, PubSub_routingkey.patch
>
>
> It would be nice to have a Pub/Sub model in Nutch to emit certain events (ex- 
> Fetcher events like fetch-start, fetch-end, a fetch report which may contain 
> data like outlinks of the current fetched url, score, etc). 
> A consumer of this functionality could use this data to generate real time 
> visualization and generate statistics of the crawl without having to wait for 
> the fetch round to finish. 
> The REST API could contain an endpoint which would respond with a url to 
> which a client could subscribe to get the fetcher events. 





[jira] [Commented] (NUTCH-2132) Publisher/Subscriber model for Nutch to emit events

2015-10-28 Thread Aron Ahmadia (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14978945#comment-14978945
 ] 

Aron Ahmadia commented on NUTCH-2132:
-

got it.

> Publisher/Subscriber model for Nutch to emit events 
> 
>
> Key: NUTCH-2132
> URL: https://issues.apache.org/jira/browse/NUTCH-2132
> Project: Nutch
>  Issue Type: New Feature
>  Components: fetcher, REST_api
>Reporter: Sujen Shah
>  Labels: memex
> Fix For: 1.12
>
> Attachments: NUTCH-2132.patch, PubSub_routingkey.patch
>
>
> It would be nice to have a Pub/Sub model in Nutch to emit certain events (ex- 
> Fetcher events like fetch-start, fetch-end, a fetch report which may contain 
> data like outlinks of the current fetched url, score, etc). 
> A consumer of this functionality could use this data to generate real time 
> visualization and generate statistics of the crawl without having to wait for 
> the fetch round to finish. 
> The REST API could contain an endpoint which would respond with a url to 
> which a client could subscribe to get the fetcher events. 





[jira] [Commented] (NUTCH-2132) Publisher/Subscriber model for Nutch to emit events

2015-10-28 Thread Aron Ahmadia (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14978944#comment-14978944
 ] 

Aron Ahmadia commented on NUTCH-2132:
-

I think the protection belongs in:

  public void publish(FetcherThreadEvent event) {
    publisher.publish(event);
  }

This call should only dispatch if publisher has been initialized.
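
One possible shape for that guard is sketched below. It is only an
illustration of the idea, not the code in either attached patch:
EventPublisher and GuardedFetcherPublisher are hypothetical stand-ins, and the
event is simplified to a String so the example is self-contained.

```java
/** Hypothetical stand-in for the publisher implementation in the patch. */
interface EventPublisher {
  void publish(String event);
}

/** Hypothetical wrapper showing the null check; not the patch's class. */
class GuardedFetcherPublisher {
  // Assumed to stay null when publishing is disabled or never initialized.
  private final EventPublisher publisher;

  GuardedFetcherPublisher(EventPublisher publisher) {
    this.publisher = publisher;
  }

  /** Dispatches only if a publisher was initialized; returns whether it did. */
  boolean publish(String event) {
    if (publisher == null) {
      return false; // nothing to dispatch to, so skip instead of throwing an NPE
    }
    publisher.publish(event);
    return true;
  }
}
```

With a guard like this, a fetch with publishing turned off would skip the
event rather than fail with the NullPointerException reported earlier in this
thread.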

> Publisher/Subscriber model for Nutch to emit events 
> 
>
> Key: NUTCH-2132
> URL: https://issues.apache.org/jira/browse/NUTCH-2132
> Project: Nutch
>  Issue Type: New Feature
>  Components: fetcher, REST_api
>Reporter: Sujen Shah
>  Labels: memex
> Fix For: 1.12
>
> Attachments: NUTCH-2132.patch, PubSub_routingkey.patch
>
>
> It would be nice to have a Pub/Sub model in Nutch to emit certain events (ex- 
> Fetcher events like fetch-start, fetch-end, a fetch report which may contain 
> data like outlinks of the current fetched url, score, etc). 
> A consumer of this functionality could use this data to generate real time 
> visualization and generate statistics of the crawl without having to wait for 
> the fetch round to finish. 
> The REST API could contain an endpoint which would respond with a url to 
> which a client could subscribe to get the fetcher events. 





[jira] [Commented] (NUTCH-2132) Publisher/Subscriber model for Nutch to emit events

2015-10-28 Thread Sujen Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14978942#comment-14978942
 ] 

Sujen Shah commented on NUTCH-2132:
---

Yes, the first patch does not have that property; it was incorporated in the 
second one. 

> Publisher/Subscriber model for Nutch to emit events 
> 
>
> Key: NUTCH-2132
> URL: https://issues.apache.org/jira/browse/NUTCH-2132
> Project: Nutch
>  Issue Type: New Feature
>  Components: fetcher, REST_api
>Reporter: Sujen Shah
>  Labels: memex
> Fix For: 1.12
>
> Attachments: NUTCH-2132.patch, PubSub_routingkey.patch
>
>
> It would be nice to have a Pub/Sub model in Nutch to emit certain events (ex- 
> Fetcher events like fetch-start, fetch-end, a fetch report which may contain 
> data like outlinks of the current fetched url, score, etc). 
> A consumer of this functionality could use this data to generate real time 
> visualization and generate statistics of the crawl without having to wait for 
> the fetch round to finish. 
> The REST API could contain an endpoint which would respond with a url to 
> which a client could subscribe to get the fetcher events. 





[jira] [Commented] (NUTCH-2132) Publisher/Subscriber model for Nutch to emit events

2015-10-28 Thread Aron Ahmadia (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14978938#comment-14978938
 ] 

Aron Ahmadia commented on NUTCH-2132:
-

I'm observing crashes when fetcher.publisher is set to false.  This is with 
your first patch applied but not the second.

fetch of http://www.google.com/ failed with: java.lang.NullPointerException
at 
org.apache.nutch.fetcher.FetcherThreadPublisher.publish(FetcherThreadPublisher.java:43)
at org.apache.nutch.fetcher.FetcherThread.run(FetcherThread.java:253)

fetch of http://aron.ahmadia.net/ failed with: java.lang.NullPointerException
at 
org.apache.nutch.fetcher.FetcherThreadPublisher.publish(FetcherThreadPublisher.java:43)
at org.apache.nutch.fetcher.FetcherThread.run(FetcherThread.java:253)




[jira] [Commented] (NUTCH-2132) Publisher/Subscriber model for Nutch to emit events

2015-10-28 Thread Sujen Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14978911#comment-14978911
 ] 

Sujen Shah commented on NUTCH-2132:
---

[~ahmadia], 
bq. One issue I'm having is that if I start a Nutch server with this patch and 
RMQ configured to publish, crawls fail unless a RMQ server is available.
- I'll look into this and upload a new patch. Thanks for pointing this out. 


bq. Barring that, is it possible to reconfigure the server using the config 
REST endpoint to not publish
- Yes, if you change the config parameter "fetcher.publisher" to false, it 
should stop publishing, and vice versa. I have tested the case where it is false 
by default and Nutch is then configured to start publishing; I have not tried 
the reverse. 



[jira] [Commented] (NUTCH-2132) Publisher/Subscriber model for Nutch to emit events

2015-10-28 Thread Aron Ahmadia (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14978908#comment-14978908
 ] 

Aron Ahmadia commented on NUTCH-2132:
-

Also, the vice-versa situation is important as well.  Can I start up a Nutch 
server that isn't configured to publish, then turn publishing on?



[jira] [Commented] (NUTCH-2132) Publisher/Subscriber model for Nutch to emit events

2015-10-28 Thread Aron Ahmadia (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14978902#comment-14978902
 ] 

Aron Ahmadia commented on NUTCH-2132:
-

[~sujenshah] - I'm reviewing this again now.  One issue I'm having is that if I 
start a Nutch server with this patch and RMQ configured to publish, crawls fail 
unless a RMQ server is available.

Since the point of a pub/sub model is that these sorts of messages are ephemeral 
(and the routing exchanges may go up and down), it should not be a fatal error 
for RMQ to go down.
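
One way to picture the non-fatal behaviour asked for here is a best-effort publish that logs and drops an event when the broker cannot be reached. This is only a sketch under assumptions: publish_best_effort and broker_down are hypothetical names, and the callable passed as `send` stands in for whatever RMQ client call the patch actually uses.

```python
import logging
from typing import Callable

log = logging.getLogger("fetcher.publisher")


def publish_best_effort(send: Callable[[dict], None], event: dict) -> bool:
    """Try to send one event; log and drop it if the broker is unreachable."""
    try:
        send(event)
        return True
    except (ConnectionError, OSError) as exc:
        # Events are ephemeral, so a lost message must not abort the crawl.
        log.warning("dropping event %s: %s", event.get("eventType"), exc)
        return False


def broker_down(event: dict) -> None:
    # Hypothetical sender that simulates RMQ being unavailable.
    raise ConnectionRefusedError("RMQ not reachable")


delivered = []
ok_when_up = publish_best_effort(delivered.append, {"eventType": "fetch-start"})
ok_when_down = publish_best_effort(broker_down, {"eventType": "fetch-end"})
```

The caller gets False back when the broker is down and can carry on fetching; whether dropped events should also be counted or retried is a separate design choice.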

Barring that, is it possible to reconfigure the server using the config REST 
endpoint to not publish?  Testing now...



[jira] [Commented] (NUTCH-2153) Nutch REST API (DB) uses POST instead of GET to request

2015-10-28 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14978854#comment-14978854
 ] 

Chris A. Mattmann commented on NUTCH-2153:
--

Yeah I think we may want to do something async here too and use GET. Let's 
think about this. It may be a 1.12+ improvement though. At a minimum I think we 
can update to GET for 1.11.

> Nutch REST API (DB) uses POST instead of GET to request
> ---
>
> Key: NUTCH-2153
> URL: https://issues.apache.org/jira/browse/NUTCH-2153
> Project: Nutch
>  Issue Type: Bug
>  Components: REST_api
>Affects Versions: 1.11
>Reporter: Aron Ahmadia
>Priority: Trivial
>  Labels: memex
>






[jira] [Commented] (NUTCH-2153) Nutch REST API (DB) uses POST instead of GET to request

2015-10-28 Thread Aron Ahmadia (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14978841#comment-14978841
 ] 

Aron Ahmadia commented on NUTCH-2153:
-

If it's asynchronous, use a POST and return a crawldb_job identifier that can 
be used to query if the job is complete.
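
A rough sketch of that asynchronous flow, written in Python for brevity: the POST handler starts the work and returns an identifier immediately, and a later GET asks for its state. Everything here is an assumption for illustration (the CrawlDbJobs class, the state names, the placeholder result); it does not describe how the Nutch /db or /job endpoints are implemented.

```python
import uuid
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Dict


class CrawlDbJobs:
    """Hypothetical in-memory registry for asynchronous crawldb jobs."""

    def __init__(self) -> None:
        self._pool = ThreadPoolExecutor(max_workers=2)
        self._jobs: Dict[str, Future] = {}

    def submit(self, request: dict) -> str:
        """Handle the POST: start the job and return its identifier at once."""
        job_id = "crawldb_job_" + uuid.uuid4().hex
        self._jobs[job_id] = self._pool.submit(self._compute_stats, request)
        return job_id

    def status(self, job_id: str) -> dict:
        """Handle the GET: report whether the job is done and, if so, its result."""
        future = self._jobs[job_id]
        if not future.done():
            return {"id": job_id, "state": "RUNNING"}
        return {"id": job_id, "state": "FINISHED", "result": future.result()}

    @staticmethod
    def _compute_stats(request: dict) -> dict:
        # Placeholder for the real (possibly slow, distributed) crawldb read.
        return {"crawlId": request["crawlId"], "totalUrls": "2"}


jobs = CrawlDbJobs()
job_id = jobs.submit({"type": "stats", "crawlId": "crawl01", "confId": "default"})
jobs._jobs[job_id].result(timeout=5)  # wait here only to make the demo deterministic
final = jobs.status(job_id)
```

A real client would poll status() instead of blocking on the future; the blocking call above exists only so the example finishes in a known state.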

I've got mixed feelings on the synchronous case.  I'm happy to follow 
Mattmann's guidance on this.



[jira] [Commented] (NUTCH-2154) Nutch REST API (DB) suffering NullPointerException

2015-10-28 Thread Aron Ahmadia (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14978824#comment-14978824
 ] 

Aron Ahmadia commented on NUTCH-2154:
-

Looks like the server assumes that "args" is always passed in the REST query, 
even though it is meant for optional arguments. With an empty "args" included, 
the same request succeeds:

In [3]: cc.jobClient.stats()
nutch.py: POST Endpoint: /db/crawldb
nutch.py: POST Request data: {'args': {}, 'type': 'stats', 'crawlId': 'crawl_aahmadia_2015-10-28T13_27_31.807902', 'confId': 'default'}
nutch.py: POST Request headers: {'Accept': 'application/json'}
nutch.py: Response headers: {'Date': 'Wed, 28 Oct 2015 17:27:36 GMT', 'Transfer-Encoding': 'chunked', 'Content-Type': 'application/json', 'Server': 'Jetty(8.1.15.v20140411)'}
nutch.py: Response status: 200
nutch.py: Response JSON: {u'status': {u'1': {u'count': u'2', u'statusValue': u'db_unfetched'}}, u'retry 0': u'2', u'avgScore': u'1.0', u'totalUrls': u'2', u'minScore': u'1.0', u'maxScore': u'1.0'}
Out[3]:
{u'avgScore': u'1.0',
 u'maxScore': u'1.0',
 u'minScore': u'1.0',
 u'retry 0': u'2',
 u'status': {u'1': {u'count': u'2', u'statusValue': u'db_unfetched'}},
 u'totalUrls': u'2'}

Perhaps the right thing to do is write some protection around args being 
undefined and initialize it as an empty hash map?
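
The guard suggested here amounts to defaulting a missing "args" to an empty map before the request is used. A small Python illustration of that idea, assuming a hypothetical helper name (with_default_args); the actual fix would presumably go into the Java resource class that handles /db/crawldb:

```python
from typing import Any, Dict


def with_default_args(request: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the request in which a missing or null args is {}."""
    normalized = dict(request)
    if normalized.get("args") is None:
        normalized["args"] = {}
    return normalized


# Same shape as the failing request in the report, which carried no "args" key.
failing = {"type": "stats", "crawlId": "crawl01", "confId": "default"}
safe = with_default_args(failing)
```

The helper copies the request rather than mutating it, so the caller's original payload is left as sent.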

> Nutch REST API (DB) suffering NullPointerException
> --
>
> Key: NUTCH-2154
> URL: https://issues.apache.org/jira/browse/NUTCH-2154
> Project: Nutch
>  Issue Type: Bug
>  Components: REST_api
>Affects Versions: 1.11
>Reporter: Aron Ahmadia
>Assignee: Chris A. Mattmann
>Priority: Minor
>  Labels: memex
> Fix For: 1.11
>
>
> Not sure what's causing this.  I tried this request both before and after a 
> crawl had completed.
> nutch.py: POST Endpoint: /db/crawldb
> nutch.py: POST Request data: {'type': 'stats', 'crawlId': 
> 'crawl_aahmadia_2015-10-28T13_17_15.034351', 'confId': 'default'}
> nutch.py: POST Request headers: {'Accept': 'application/json'}
> nutch.py: Response headers: {'Date': 'Wed, 28 Oct 2015 17:18:54 GMT', 
> 'Content-Length': '0', 'Server': 'Jetty(8.1.15.v20140411)'}
> nutch.py: Response status: 500
> nutch log:
> java.lang.NullPointerException
>   at org.apache.nutch.crawl.CrawlDbReader.query(CrawlDbReader.java:747)
>   at 
> org.apache.nutch.service.resources.DbResource.crawlDbStats(DbResource.java:95)
>   at 
> org.apache.nutch.service.resources.DbResource.readdb(DbResource.java:52)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:497)
>   at 
> org.apache.cxf.service.invoker.AbstractInvoker.performInvocation(AbstractInvoker.java:181)
>   at 
> org.apache.cxf.service.invoker.AbstractInvoker.invoke(AbstractInvoker.java:97)
>   at org.apache.cxf.jaxrs.JAXRSInvoker.invoke(JAXRSInvoker.java:200)
>   at org.apache.cxf.jaxrs.JAXRSInvoker.invoke(JAXRSInvoker.java:99)
>   at 
> org.apache.cxf.interceptor.ServiceInvokerInterceptor$1.run(ServiceInvokerInterceptor.java:59)
>   at 
> org.apache.cxf.interceptor.ServiceInvokerInterceptor.handleMessage(ServiceInvokerInterceptor.java:96)
>   at 
> org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:307)
>   at 
> org.apache.cxf.transport.ChainInitiationObserver.onMessage(ChainInitiationObserver.java:121)
>   at 
> org.apache.cxf.transport.http.AbstractHTTPDestination.invoke(AbstractHTTPDestination.java:251)
>   at 
> org.apache.cxf.transport.http_jetty.JettyHTTPDestination.doService(JettyHTTPDestination.java:261)
>   at 
> org.apache.cxf.transport.http_jetty.JettyHTTPHandler.handle(JettyHTTPHandler.java:70)
>   at 
> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1088)
>   at 
> org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1024)
>   at 
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
>   at 
> org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
>   at 
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
>   at org.eclipse.jetty.server.Server.handle(Server.java:370)
>   at 
> org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:494)
>   at 
> org.eclipse.jetty.server.AbstractHttpConnection.content(AbstractHttpConnection.java:982)
>   at 
> org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:1043)
>   at org.eclipse.jetty.htt

[jira] [Commented] (NUTCH-2153) Nutch REST API (DB) uses POST instead of GET to request

2015-10-28 Thread Sujen Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14978826#comment-14978826
 ] 

Sujen Shah commented on NUTCH-2153:
---

Hi [~ahmadia] and [~chrismattmann], 

Currently, while using Nutch REST services in local mode, the crawldb job gets 
executed pretty fast. But if the same is used in a distributed mode, the 
crawldb job can take up a fair amount of time. So issuing a GET request would 
make the client wait for a long time for the response. 
A POST request was used since the crawldb resource is created once a user 
issues a request and not precomputed (which is usually the case when a GET is 
used). The /db endpoint still requires development in the part where it can 
spin up threads for computation like the /job endpoint, and then provide a GET 
interface to query results.

I have tried to use the same concept in the commoncrawldump service as that 
might also take up time as the amount of data crawled increases. 

I would like to know your thoughts on how to handle such cases, where issuing 
a GET requires computation of the resource. 

Thanks!



[jira] [Commented] (NUTCH-2154) Nutch REST API (DB) suffering NullPointerException

2015-10-28 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14978811#comment-14978811
 ] 

Chris A. Mattmann commented on NUTCH-2154:
--

I have to respin 1.11 anyways, so I'll take a look at this real quick 
[~ahmadia] thanks!



[jira] [Assigned] (NUTCH-2154) Nutch REST API (DB) suffering NullPointerException

2015-10-28 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann reassigned NUTCH-2154:


Assignee: Chris A. Mattmann



[jira] [Updated] (NUTCH-2154) Nutch REST API (DB) suffering NullPointerException

2015-10-28 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-2154:
-
Fix Version/s: 1.11



[jira] [Updated] (NUTCH-2153) Nutch REST API (DB) uses POST instead of GET to request

2015-10-28 Thread Aron Ahmadia (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aron Ahmadia updated NUTCH-2153:

Affects Version/s: (was: 1.10)
   1.11



[jira] [Created] (NUTCH-2154) Nutch REST API (DB) suffering NullPointerException

2015-10-28 Thread Aron Ahmadia (JIRA)
Aron Ahmadia created NUTCH-2154:
---

 Summary: Nutch REST API (DB) suffering NullPointerException
 Key: NUTCH-2154
 URL: https://issues.apache.org/jira/browse/NUTCH-2154
 Project: Nutch
  Issue Type: Bug
  Components: REST_api
Affects Versions: 1.11
Reporter: Aron Ahmadia
Priority: Minor


Not sure what's causing this.  I tried this request both before and after a 
crawl had completed.

nutch.py: POST Endpoint: /db/crawldb
nutch.py: POST Request data: {'type': 'stats', 'crawlId': 'crawl_aahmadia_2015-10-28T13_17_15.034351', 'confId': 'default'}
nutch.py: POST Request headers: {'Accept': 'application/json'}
nutch.py: Response headers: {'Date': 'Wed, 28 Oct 2015 17:18:54 GMT', 'Content-Length': '0', 'Server': 'Jetty(8.1.15.v20140411)'}
nutch.py: Response status: 500

nutch log:

java.lang.NullPointerException
    at org.apache.nutch.crawl.CrawlDbReader.query(CrawlDbReader.java:747)
    at org.apache.nutch.service.resources.DbResource.crawlDbStats(DbResource.java:95)
    at org.apache.nutch.service.resources.DbResource.readdb(DbResource.java:52)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:497)
    at org.apache.cxf.service.invoker.AbstractInvoker.performInvocation(AbstractInvoker.java:181)
    at org.apache.cxf.service.invoker.AbstractInvoker.invoke(AbstractInvoker.java:97)
    at org.apache.cxf.jaxrs.JAXRSInvoker.invoke(JAXRSInvoker.java:200)
    at org.apache.cxf.jaxrs.JAXRSInvoker.invoke(JAXRSInvoker.java:99)
    at org.apache.cxf.interceptor.ServiceInvokerInterceptor$1.run(ServiceInvokerInterceptor.java:59)
    at org.apache.cxf.interceptor.ServiceInvokerInterceptor.handleMessage(ServiceInvokerInterceptor.java:96)
    at org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:307)
    at org.apache.cxf.transport.ChainInitiationObserver.onMessage(ChainInitiationObserver.java:121)
    at org.apache.cxf.transport.http.AbstractHTTPDestination.invoke(AbstractHTTPDestination.java:251)
    at org.apache.cxf.transport.http_jetty.JettyHTTPDestination.doService(JettyHTTPDestination.java:261)
    at org.apache.cxf.transport.http_jetty.JettyHTTPHandler.handle(JettyHTTPHandler.java:70)
    at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1088)
    at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1024)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
    at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
    at org.eclipse.jetty.server.Server.handle(Server.java:370)
    at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:494)
    at org.eclipse.jetty.server.AbstractHttpConnection.content(AbstractHttpConnection.java:982)
    at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:1043)
    at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:865)
    at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:240)
    at org.eclipse.jetty.server.AsyncHttpConnection.handle(AsyncHttpConnection.java:82)
    at org.eclipse.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:696)
    at org.eclipse.jetty.io.nio.SelectChannelEndPoint$1.run(SelectChannelEndPoint.java:53)
    at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
    at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
    at java.lang.Thread.run(Thread.java:745)






[jira] [Commented] (NUTCH-2153) Nutch REST API (DB) uses POST instead of GET to request

2015-10-28 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14978769#comment-14978769
 ] 

Chris A. Mattmann commented on NUTCH-2153:
--

Gotcha, thanks [~ahmadia]

> Nutch REST API (DB) uses POST instead of GET to request
> ---
>
> Key: NUTCH-2153
> URL: https://issues.apache.org/jira/browse/NUTCH-2153
> Project: Nutch
> Issue Type: Bug
> Components: REST_api
> Affects Versions: 1.10
> Reporter: Aron Ahmadia
> Priority: Trivial
> Labels: memex
>






[jira] [Commented] (NUTCH-2153) Nutch REST API (DB) uses POST instead of GET to request

2015-10-28 Thread Aron Ahmadia (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14978764#comment-14978764
 ] 

Aron Ahmadia commented on NUTCH-2153:
-

The API documented at https://wiki.apache.org/nutch/Nutch_1.X_RESTAPI:

POST /db/crawldb with the following body
{ "type":"stats",
  "confId":"default",
  "crawlId":"crawl01",
  "args":{"someParam":"someValue"}
}

uses a POST to request information (stats).

This should be a GET.
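
To make the difference concrete, here is a minimal sketch that builds (but does not send) the documented POST request alongside one possible GET form. The base URL, the helper names, and the query-string encoding of "args" are assumptions made for illustration only; neither the wiki page nor this issue defines what a GET variant would look like.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

# Assumption: address of a locally running Nutch server; adjust as needed.
BASE_URL = "http://localhost:8081"

# Body copied from the wiki example quoted above.
STATS_QUERY = {
    "type": "stats",
    "confId": "default",
    "crawlId": "crawl01",
    "args": {"someParam": "someValue"},
}


def build_post_request(base_url=BASE_URL, query=STATS_QUERY):
    """Build (without sending) the POST /db/crawldb request as documented."""
    body = json.dumps(query).encode("utf-8")
    return Request(
        base_url + "/db/crawldb",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def build_get_request(base_url=BASE_URL, query=STATS_QUERY):
    """Hypothetical GET form: the same parameters moved into the query string.

    The nested "args" object is JSON-encoded into a single parameter. This
    encoding is an illustration, not something the Nutch API defines.
    """
    params = {k: v for k, v in query.items() if k != "args"}
    params["args"] = json.dumps(query["args"], separators=(",", ":"))
    return Request(base_url + "/db/crawldb?" + urlencode(params), method="GET")


if __name__ == "__main__":
    for req in (build_post_request(), build_get_request()):
        print(req.get_method(), req.full_url)
```

A GET carries no request body, so the read-only stats query becomes visible in the URL and can be cached or retried safely, which is the usual argument for preferring GET here.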







[jira] [Commented] (NUTCH-2153) Nutch REST API (DB) uses POST instead of GET to request

2015-10-28 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14978748#comment-14978748
 ] 

Chris A. Mattmann commented on NUTCH-2153:
--

Can you be more specific here, [~ahmadia]?







[jira] [Created] (NUTCH-2153) Nutch REST API (DB) uses POST instead of GET to request

2015-10-28 Thread Aron Ahmadia (JIRA)

Aron Ahmadia created NUTCH-2153:
---

Summary: Nutch REST API (DB) uses POST instead of GET to request
Key: NUTCH-2153
URL: https://issues.apache.org/jira/browse/NUTCH-2153
Project: Nutch
Issue Type: Bug
Components: REST_api
Affects Versions: 1.10
Reporter: Aron Ahmadia
Priority: Trivial





