[jira] [Commented] (NUTCH-1800) Documentation for Nutch 1.X and 2.X REST APIs
[ https://issues.apache.org/jira/browse/NUTCH-1800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14979896#comment-14979896 ]

Chris A. Mattmann commented on NUTCH-1800:
------------------------------------------

awesome Lewis +1 to commit this and move forward. Cheers!

> Documentation for Nutch 1.X and 2.X REST APIs
> ---------------------------------------------
>
>                 Key: NUTCH-1800
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1800
>             Project: Nutch
>          Issue Type: Bug
>          Components: documentation, REST_api
>            Reporter: Lewis John McGibbney
>            Assignee: Lewis John McGibbney
>             Fix For: 1.11, 2.3.1
>
>         Attachments: NUTCH-1800.patch
>
>
> This issue should build on NUTCH-1769 with full Java documentation for all
> classes in the following packages
> org.apache.nutch.api.*
> I am assigning this one to [~fjodor.vershinin] as he is doing an excellent
> job on the REST API. His UML graphic in [0] and commentary show that he has
> a good understanding of the REST API and its functionality.
> Thank you [~fjodor.vershinin] great work.
> [0] https://wiki.apache.org/nutch/NutchRESTAPI#UML_Graphic

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (NUTCH-1800) Documentation for Nutch 1.X and 2.X REST APIs
[ https://issues.apache.org/jira/browse/NUTCH-1800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14979834#comment-14979834 ]

Lewis John McGibbney commented on NUTCH-1800:
---------------------------------------------

Improvements can be seen here http://people.apache.org/~lewismc/miredot/#warnings
[jira] [Commented] (NUTCH-1800) Documentation for Nutch 1.X and 2.X REST APIs
[ https://issues.apache.org/jira/browse/NUTCH-1800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14979831#comment-14979831 ]

Lewis John McGibbney commented on NUTCH-1800:
---------------------------------------------

For those who want to see the docs you can see them here http://people.apache.org/~lewismc/miredot/
Once we publish the docs with the next release I'll remove this.
Thanks
[jira] [Commented] (NUTCH-1800) Documentation for Nutch 1.X and 2.X REST APIs
[ https://issues.apache.org/jira/browse/NUTCH-1800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14979830#comment-14979830 ]

Sujen Shah commented on NUTCH-1800:
-----------------------------------

Thanks Lewis for this, it is going to be really helpful to the community :)
[jira] [Updated] (NUTCH-1800) Documentation for Nutch 1.X and 2.X REST APIs
[ https://issues.apache.org/jira/browse/NUTCH-1800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lewis John McGibbney updated NUTCH-1800:
----------------------------------------
         Flags: Patch
    Patch Info: Patch Available
[jira] [Commented] (NUTCH-1800) Documentation for Nutch 1.X and 2.X REST APIs
[ https://issues.apache.org/jira/browse/NUTCH-1800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14979828#comment-14979828 ]

Lewis John McGibbney commented on NUTCH-1800:
---------------------------------------------

[~sujenshah]
[jira] [Updated] (NUTCH-1800) Documentation for Nutch 1.X and 2.X REST APIs
[ https://issues.apache.org/jira/browse/NUTCH-1800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lewis John McGibbney updated NUTCH-1800:
----------------------------------------
    Attachment: NUTCH-1800.patch

Patch for trunk. It currently uses my own license key (which is OK); we will switch this out for a more stable license key once Yves from Miredot gets back to us.
In order to see the REST API documentation, you need to
* download the Maven Ant Tasks jar (maven-ant-tasks) from Maven Central and put it into $NUTCH_HOME/ivy/
* execute
{code}
ant -lib ivy restdocs
{code}
The reason that this is a bit messy is simply that Miredot only has builds for Maven and Gradle. This patch works around that by utilizing Maven Ant Tasks.
Excellent work to everyone who has been developing and augmenting the REST API... there are a bunch of improvements suggested by Miredot which will make our REST API documentation braw!
[jira] [Assigned] (NUTCH-1800) Documentation for Nutch 1.X and 2.X REST APIs
[ https://issues.apache.org/jira/browse/NUTCH-1800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lewis John McGibbney reassigned NUTCH-1800:
-------------------------------------------

    Assignee: Lewis John McGibbney
[jira] [Updated] (NUTCH-1800) Documentation for Nutch 1.X and 2.X REST APIs
[ https://issues.apache.org/jira/browse/NUTCH-1800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lewis John McGibbney updated NUTCH-1800:
----------------------------------------
    Fix Version/s:     (was: 2.4)
                   2.3.1
                   1.11
Re: MireDot user activation
Good Evening Yves,

I'm contacting you on behalf of the Apache Nutch project management team. Apache Nutch [0] is a top level open source software project at the Apache Software Foundation licensed under the Apache License v2.0.
I am writing to request a license key for our project. I requested one for the Apache Tika project a while back and we now release kick ass REST documentation with each release. It would be my intention to do the same with Apache Nutch if possible.
Thank you in advance for any feedback you have on this one, it is greatly appreciated.
Lewis John McGibbney
(On behalf of the Apache Nutch Project Management Committee)

[0] http://nutch.apache.org

On Wed, Oct 28, 2015 at 9:56 PM, Yves Vandewoude wrote:

> Hi Lewis McGibbney,
>
> A new MireDot account has been created for you.
>
> Click the url below to activate your account and select a password!
>
>

http://people.apache.org/~lewismc || @hectorMcSpector || http://www.linkedin.com/in/lmcgibbney
Apache Gora V.P || Apache Nutch PMC || Apache Any23 V.P || Apache OODT PMC
Apache Open Climate Workbench PMC || Apache Tika PMC || Apache TAC
Apache Usergrid || Apache HTrace (incubating) || Apache CommonsRDF (incubating)
[jira] [Created] (NUTCH-2156) Dump via Services end point
Sujen Shah created NUTCH-2156:
---------------------------------

             Summary: Dump via Services end point
                 Key: NUTCH-2156
                 URL: https://issues.apache.org/jira/browse/NUTCH-2156
             Project: Nutch
          Issue Type: Sub-task
          Components: REST_api
            Reporter: Sujen Shah
            Assignee: Sujen Shah
             Fix For: 1.12

Expose the ./bin/nutch dump command via the REST API.
Please review the documentation of the API design at http://docs.apachenutchrestapi.apiary.io/# and give your feedback.
Thank you all for your input :)
[jira] [Commented] (NUTCH-2155) Create a "crawl completeness" utility
[ https://issues.apache.org/jira/browse/NUTCH-2155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14979335#comment-14979335 ]

ASF GitHub Bot commented on NUTCH-2155:
---------------------------------------

Github user MJJoyce commented on a diff in the pull request:

    https://github.com/apache/nutch/pull/83#discussion_r43325287

    --- Diff: src/java/org/apache/nutch/util/CrawlCompletionStats.java ---
    @@ -0,0 +1,189 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements. See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License. You may obtain a copy of the License at
    + *
    + * http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.nutch.util;
    +
    +import java.io.IOException;
    +import java.net.URL;
    +import java.text.SimpleDateFormat;
    +import org.slf4j.Logger;
    +import org.slf4j.LoggerFactory;
    +import org.apache.hadoop.conf.Configuration;
    +import org.apache.hadoop.conf.Configured;
    +import org.apache.hadoop.fs.Path;
    +import org.apache.hadoop.io.LongWritable;
    +import org.apache.hadoop.io.Text;
    +import org.apache.hadoop.mapreduce.Job;
    +import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    +import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
    +import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    +import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
    +import org.apache.hadoop.mapreduce.Mapper;
    +import org.apache.hadoop.mapreduce.Reducer;
    +import org.apache.hadoop.util.Tool;
    +import org.apache.hadoop.util.ToolRunner;
    +import org.apache.nutch.crawl.CrawlDatum;
    +import org.apache.nutch.util.NutchConfiguration;
    +import org.apache.nutch.util.TimingUtil;
    +import org.apache.nutch.util.URLUtil;
    +
    +/**
    + * Extracts some simple crawl completion stats from the crawldb
    + *
    + * Stats will be sorted by host/domain and will be of the form:
    + * 1 www.spitzer.caltech.edu FETCHED
    + * 50 www.spitzer.caltech.edu UNFETCHED
    + *
    + */
    +public class CrawlCompletionStats extends Configured implements Tool {
    +
    +  private static final Logger LOG = LoggerFactory
    +      .getLogger(CrawlCompletionStats.class);
    +
    +  private static final int MODE_HOST = 1;
    +  private static final int MODE_DOMAIN = 2;
    +
    +  private int mode = 0;
    +
    +  public int run(String[] args) throws Exception {
    +    if (args.length < 2) {
    +      System.out
    +          .println("usage: CrawlCompletionStats inputDirs outDir host|domain [numOfReducer]");
    +      return 1;
    +    }
    +    String inputDir = args[0];
    +    String outputDir = args[1];
    +    int numOfReducers = 1;
    +
    +    if (args.length > 3) {
    +      numOfReducers = Integer.parseInt(args[3]);
    +    }
    +
    +    SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
    +    long start = System.currentTimeMillis();
    +    LOG.info("CrawlCompletionStats: starting at " + sdf.format(start));
    +
    +    int mode = 0;
    +    String jobName = "CrawlCompletionStats";
    +    if (args[2].equals("host")) {
    +      jobName = "Host CrawlCompletionStats";
    +      mode = MODE_HOST;
    +    } else if (args[2].equals("domain")) {
    +      jobName = "Domain CrawlCompletionStats";
    +      mode = MODE_DOMAIN;
    +    }
    +
    +    Configuration conf = getConf();
    +    conf.setInt("domain.statistics.mode", mode);
    +    conf.setBoolean("mapreduce.fileoutputcommitter.marksuccessfuljobs", false);
    +
    +    Job job = Job.getInstance(conf, jobName);
    +    job.setJarByClass(CrawlCompletionStats.class);
    +
    +    String[] inputDirsSpecs = inputDir.split(",");
    +    for (int i = 0; i < inputDirsSpecs.length; i++) {
    +      FileInputFormat.addInputPath(job, new Path(inputDirsSpecs[i]));
    +    }
    +
    +    job.setInputFormatClass(SequenceFileInputFormat.class);
    +    FileOutputFormat.setOutputPath(job, new Path(outputDir));
    +    job.setOutputFormatClass(TextOutputFormat.class);
    +
    +    job.setMapOutputKeyClass(Text.class);
    +    job.setMapOutputValueClass(LongWritable.class);
    +    job.setOutputKeyClass(Text.class);
    +    job.setOutputValueClass
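The class javadoc in the quoted diff describes output of the form "count host STATUS". As a rough illustration of that tally, here is a minimal in-memory sketch in plain Java. It is not the MapReduce job from the pull request: the class name `CompletenessSketch`, the `tally` method, and the URL-to-boolean input map are assumptions made for illustration, standing in for the CrawlDatum records the real utility reads from the crawldb.

```java
import java.net.URI;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical illustration only; not part of Nutch or the pull request. */
class CompletenessSketch {

  /**
   * Tallies URLs per "host STATUS" key. The input maps each URL to whether it
   * has been fetched (an assumed, simplified stand-in for CrawlDatum status).
   * A TreeMap keeps the result sorted by host, then status.
   */
  static Map<String, Long> tally(Map<String, Boolean> fetchedByUrl) {
    Map<String, Long> counts = new TreeMap<>();
    for (Map.Entry<String, Boolean> e : fetchedByUrl.entrySet()) {
      // URI.create throws IllegalArgumentException on malformed input.
      String host = URI.create(e.getKey()).getHost();
      if (host == null) {
        continue; // no host component, nothing to group by
      }
      String status = e.getValue() ? "FETCHED" : "UNFETCHED";
      counts.merge(host + " " + status, 1L, Long::sum);
    }
    return counts;
  }

  public static void main(String[] args) {
    Map<String, Boolean> urls = new LinkedHashMap<>();
    urls.put("http://www.spitzer.caltech.edu/", true);
    urls.put("http://www.spitzer.caltech.edu/images", false);
    urls.put("http://www.spitzer.caltech.edu/video", false);
    // Print in the "count host STATUS" form shown in the javadoc.
    tally(urls).forEach((key, n) -> System.out.println(n + " " + key));
  }
}
```

Grouping by domain instead of host would only change how the key is derived from the URL; the counting step stays the same.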
[GitHub] nutch pull request: NUTCH-2155 - Add crawl completion utility
Github user MJJoyce commented on a diff in the pull request:

    https://github.com/apache/nutch/pull/83#discussion_r43325287

    --- Diff: src/java/org/apache/nutch/util/CrawlCompletionStats.java ---
    @@ -0,0 +1,189 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements. See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License. You may obtain a copy of the License at
    + *
    + * http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.nutch.util;
    +
    +import java.io.IOException;
    +import java.net.URL;
    +import java.text.SimpleDateFormat;
    +import org.slf4j.Logger;
    +import org.slf4j.LoggerFactory;
    +import org.apache.hadoop.conf.Configuration;
    +import org.apache.hadoop.conf.Configured;
    +import org.apache.hadoop.fs.Path;
    +import org.apache.hadoop.io.LongWritable;
    +import org.apache.hadoop.io.Text;
    +import org.apache.hadoop.mapreduce.Job;
    +import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    +import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
    +import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    +import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
    +import org.apache.hadoop.mapreduce.Mapper;
    +import org.apache.hadoop.mapreduce.Reducer;
    +import org.apache.hadoop.util.Tool;
    +import org.apache.hadoop.util.ToolRunner;
    +import org.apache.nutch.crawl.CrawlDatum;
    +import org.apache.nutch.util.NutchConfiguration;
    +import org.apache.nutch.util.TimingUtil;
    +import org.apache.nutch.util.URLUtil;
    +
    +/**
    + * Extracts some simple crawl completion stats from the crawldb
    + *
    + * Stats will be sorted by host/domain and will be of the form:
    + * 1 www.spitzer.caltech.edu FETCHED
    + * 50 www.spitzer.caltech.edu UNFETCHED
    + *
    + */
    +public class CrawlCompletionStats extends Configured implements Tool {
    +
    +  private static final Logger LOG = LoggerFactory
    +      .getLogger(CrawlCompletionStats.class);
    +
    +  private static final int MODE_HOST = 1;
    +  private static final int MODE_DOMAIN = 2;
    +
    +  private int mode = 0;
    +
    +  public int run(String[] args) throws Exception {
    +    if (args.length < 2) {
    +      System.out
    +          .println("usage: CrawlCompletionStats inputDirs outDir host|domain [numOfReducer]");
    +      return 1;
    +    }
    +    String inputDir = args[0];
    +    String outputDir = args[1];
    +    int numOfReducers = 1;
    +
    +    if (args.length > 3) {
    +      numOfReducers = Integer.parseInt(args[3]);
    +    }
    +
    +    SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
    +    long start = System.currentTimeMillis();
    +    LOG.info("CrawlCompletionStats: starting at " + sdf.format(start));
    +
    +    int mode = 0;
    +    String jobName = "CrawlCompletionStats";
    +    if (args[2].equals("host")) {
    +      jobName = "Host CrawlCompletionStats";
    +      mode = MODE_HOST;
    +    } else if (args[2].equals("domain")) {
    +      jobName = "Domain CrawlCompletionStats";
    +      mode = MODE_DOMAIN;
    +    }
    +
    +    Configuration conf = getConf();
    +    conf.setInt("domain.statistics.mode", mode);
    +    conf.setBoolean("mapreduce.fileoutputcommitter.marksuccessfuljobs", false);
    +
    +    Job job = Job.getInstance(conf, jobName);
    +    job.setJarByClass(CrawlCompletionStats.class);
    +
    +    String[] inputDirsSpecs = inputDir.split(",");
    +    for (int i = 0; i < inputDirsSpecs.length; i++) {
    +      FileInputFormat.addInputPath(job, new Path(inputDirsSpecs[i]));
    +    }
    +
    +    job.setInputFormatClass(SequenceFileInputFormat.class);
    +    FileOutputFormat.setOutputPath(job, new Path(outputDir));
    +    job.setOutputFormatClass(TextOutputFormat.class);
    +
    +    job.setMapOutputKeyClass(Text.class);
    +    job.setMapOutputValueClass(LongWritable.class);
    +    job.setOutputKeyClass(Text.class);
    +    job.setOutputValueClass(LongWritable.class);
    +
    +    job.setMapperClass(CrawlCompletionStatsMapper.class);
    +    job.setReducerClass(CrawlCompletionStatsReducer.class);
    +    job.setCombinerClass(CrawlCompletionStatsCombiner.class);
    +    job.setNumReduceTasks(nu
[jira] [Commented] (NUTCH-2155) Create a "crawl completeness" utility
[ https://issues.apache.org/jira/browse/NUTCH-2155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14979327#comment-14979327 ]

ASF GitHub Bot commented on NUTCH-2155:
---------------------------------------

Github user lewismc commented on a diff in the pull request:

    https://github.com/apache/nutch/pull/83#discussion_r43324772

    --- Diff: src/java/org/apache/nutch/util/CrawlCompletionStats.java ---
    @@ -0,0 +1,189 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements. See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License. You may obtain a copy of the License at
    + *
    + * http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.nutch.util;
    +
    +import java.io.IOException;
    +import java.net.URL;
    +import java.text.SimpleDateFormat;
    +import org.slf4j.Logger;
    +import org.slf4j.LoggerFactory;
    +import org.apache.hadoop.conf.Configuration;
    +import org.apache.hadoop.conf.Configured;
    +import org.apache.hadoop.fs.Path;
    +import org.apache.hadoop.io.LongWritable;
    +import org.apache.hadoop.io.Text;
    +import org.apache.hadoop.mapreduce.Job;
    +import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    +import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
    +import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    +import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
    +import org.apache.hadoop.mapreduce.Mapper;
    +import org.apache.hadoop.mapreduce.Reducer;
    +import org.apache.hadoop.util.Tool;
    +import org.apache.hadoop.util.ToolRunner;
    +import org.apache.nutch.crawl.CrawlDatum;
    +import org.apache.nutch.util.NutchConfiguration;
    +import org.apache.nutch.util.TimingUtil;
    +import org.apache.nutch.util.URLUtil;
    +
    +/**
    + * Extracts some simple crawl completion stats from the crawldb
    + *
    + * Stats will be sorted by host/domain and will be of the form:
    + * 1 www.spitzer.caltech.edu FETCHED
    + * 50 www.spitzer.caltech.edu UNFETCHED
    + *
    + */
    +public class CrawlCompletionStats extends Configured implements Tool {
    +
    +  private static final Logger LOG = LoggerFactory
    +      .getLogger(CrawlCompletionStats.class);
    +
    +  private static final int MODE_HOST = 1;
    +  private static final int MODE_DOMAIN = 2;
    +
    +  private int mode = 0;
    +
    +  public int run(String[] args) throws Exception {
    +    if (args.length < 2) {
    +      System.out
    +          .println("usage: CrawlCompletionStats inputDirs outDir host|domain [numOfReducer]");
    +      return 1;
    +    }
    +    String inputDir = args[0];
    +    String outputDir = args[1];
    +    int numOfReducers = 1;
    +
    +    if (args.length > 3) {
    +      numOfReducers = Integer.parseInt(args[3]);
    +    }
    +
    +    SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
    +    long start = System.currentTimeMillis();
    +    LOG.info("CrawlCompletionStats: starting at " + sdf.format(start));
    +
    +    int mode = 0;
    +    String jobName = "CrawlCompletionStats";
    +    if (args[2].equals("host")) {
    +      jobName = "Host CrawlCompletionStats";
    +      mode = MODE_HOST;
    +    } else if (args[2].equals("domain")) {
    +      jobName = "Domain CrawlCompletionStats";
    +      mode = MODE_DOMAIN;
    +    }
    +
    +    Configuration conf = getConf();
    +    conf.setInt("domain.statistics.mode", mode);
    +    conf.setBoolean("mapreduce.fileoutputcommitter.marksuccessfuljobs", false);
    +
    +    Job job = Job.getInstance(conf, jobName);
    +    job.setJarByClass(CrawlCompletionStats.class);
    +
    +    String[] inputDirsSpecs = inputDir.split(",");
    +    for (int i = 0; i < inputDirsSpecs.length; i++) {
    +      FileInputFormat.addInputPath(job, new Path(inputDirsSpecs[i]));
    +    }
    +
    +    job.setInputFormatClass(SequenceFileInputFormat.class);
    +    FileOutputFormat.setOutputPath(job, new Path(outputDir));
    +    job.setOutputFormatClass(TextOutputFormat.class);
    +
    +    job.setMapOutputKeyClass(Text.class);
    +    job.setMapOutputValueClass(LongWritable.class);
    +    job.setOutputKeyClass(Text.class);
    +    job.setOutputValueClass
[GitHub] nutch pull request: NUTCH-2155 - Add crawl completion utility
Github user lewismc commented on a diff in the pull request:

    https://github.com/apache/nutch/pull/83#discussion_r43324772

    --- Diff: src/java/org/apache/nutch/util/CrawlCompletionStats.java ---
    @@ -0,0 +1,189 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements. See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License. You may obtain a copy of the License at
    + *
    + * http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.nutch.util;
    +
    +import java.io.IOException;
    +import java.net.URL;
    +import java.text.SimpleDateFormat;
    +import org.slf4j.Logger;
    +import org.slf4j.LoggerFactory;
    +import org.apache.hadoop.conf.Configuration;
    +import org.apache.hadoop.conf.Configured;
    +import org.apache.hadoop.fs.Path;
    +import org.apache.hadoop.io.LongWritable;
    +import org.apache.hadoop.io.Text;
    +import org.apache.hadoop.mapreduce.Job;
    +import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    +import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
    +import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    +import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
    +import org.apache.hadoop.mapreduce.Mapper;
    +import org.apache.hadoop.mapreduce.Reducer;
    +import org.apache.hadoop.util.Tool;
    +import org.apache.hadoop.util.ToolRunner;
    +import org.apache.nutch.crawl.CrawlDatum;
    +import org.apache.nutch.util.NutchConfiguration;
    +import org.apache.nutch.util.TimingUtil;
    +import org.apache.nutch.util.URLUtil;
    +
    +/**
    + * Extracts some simple crawl completion stats from the crawldb
    + *
    + * Stats will be sorted by host/domain and will be of the form:
    + * 1 www.spitzer.caltech.edu FETCHED
    + * 50 www.spitzer.caltech.edu UNFETCHED
    + *
    + */
    +public class CrawlCompletionStats extends Configured implements Tool {
    +
    +  private static final Logger LOG = LoggerFactory
    +      .getLogger(CrawlCompletionStats.class);
    +
    +  private static final int MODE_HOST = 1;
    +  private static final int MODE_DOMAIN = 2;
    +
    +  private int mode = 0;
    +
    +  public int run(String[] args) throws Exception {
    +    if (args.length < 2) {
    +      System.out
    +          .println("usage: CrawlCompletionStats inputDirs outDir host|domain [numOfReducer]");
    +      return 1;
    +    }
    +    String inputDir = args[0];
    +    String outputDir = args[1];
    +    int numOfReducers = 1;
    +
    +    if (args.length > 3) {
    +      numOfReducers = Integer.parseInt(args[3]);
    +    }
    +
    +    SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
    +    long start = System.currentTimeMillis();
    +    LOG.info("CrawlCompletionStats: starting at " + sdf.format(start));
    +
    +    int mode = 0;
    +    String jobName = "CrawlCompletionStats";
    +    if (args[2].equals("host")) {
    +      jobName = "Host CrawlCompletionStats";
    +      mode = MODE_HOST;
    +    } else if (args[2].equals("domain")) {
    +      jobName = "Domain CrawlCompletionStats";
    +      mode = MODE_DOMAIN;
    +    }
    +
    +    Configuration conf = getConf();
    +    conf.setInt("domain.statistics.mode", mode);
    +    conf.setBoolean("mapreduce.fileoutputcommitter.marksuccessfuljobs", false);
    +
    +    Job job = Job.getInstance(conf, jobName);
    +    job.setJarByClass(CrawlCompletionStats.class);
    +
    +    String[] inputDirsSpecs = inputDir.split(",");
    +    for (int i = 0; i < inputDirsSpecs.length; i++) {
    +      FileInputFormat.addInputPath(job, new Path(inputDirsSpecs[i]));
    +    }
    +
    +    job.setInputFormatClass(SequenceFileInputFormat.class);
    +    FileOutputFormat.setOutputPath(job, new Path(outputDir));
    +    job.setOutputFormatClass(TextOutputFormat.class);
    +
    +    job.setMapOutputKeyClass(Text.class);
    +    job.setMapOutputValueClass(LongWritable.class);
    +    job.setOutputKeyClass(Text.class);
    +    job.setOutputValueClass(LongWritable.class);
    +
    +    job.setMapperClass(CrawlCompletionStatsMapper.class);
    +    job.setReducerClass(CrawlCompletionStatsReducer.class);
    +    job.setCombinerClass(CrawlCompletionStatsCombiner.class);
    +    job.setNumReduceTasks(nu
[jira] [Commented] (NUTCH-2155) Create a "crawl completeness" utility
[ https://issues.apache.org/jira/browse/NUTCH-2155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14979325#comment-14979325 ]

ASF GitHub Bot commented on NUTCH-2155:
---------------------------------------

Github user MJJoyce commented on a diff in the pull request:

    https://github.com/apache/nutch/pull/83#discussion_r43324656

    --- Diff: src/java/org/apache/nutch/util/CrawlCompletionStats.java ---
    @@ -0,0 +1,189 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements. See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License. You may obtain a copy of the License at
    + *
    + * http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.nutch.util;
    +
    +import java.io.IOException;
    +import java.net.URL;
    +import java.text.SimpleDateFormat;
    +import org.slf4j.Logger;
    +import org.slf4j.LoggerFactory;
    +import org.apache.hadoop.conf.Configuration;
    +import org.apache.hadoop.conf.Configured;
    +import org.apache.hadoop.fs.Path;
    +import org.apache.hadoop.io.LongWritable;
    +import org.apache.hadoop.io.Text;
    +import org.apache.hadoop.mapreduce.Job;
    +import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    +import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
    +import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    +import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
    +import org.apache.hadoop.mapreduce.Mapper;
    +import org.apache.hadoop.mapreduce.Reducer;
    +import org.apache.hadoop.util.Tool;
    +import org.apache.hadoop.util.ToolRunner;
    +import org.apache.nutch.crawl.CrawlDatum;
    +import org.apache.nutch.util.NutchConfiguration;
    +import org.apache.nutch.util.TimingUtil;
    +import org.apache.nutch.util.URLUtil;
    +
    +/**
    + * Extracts some simple crawl completion stats from the crawldb
    + *
    + * Stats will be sorted by host/domain and will be of the form:
    + * 1 www.spitzer.caltech.edu FETCHED
    + * 50 www.spitzer.caltech.edu UNFETCHED
    + *
    + */
    +public class CrawlCompletionStats extends Configured implements Tool {
    +
    +  private static final Logger LOG = LoggerFactory
    +      .getLogger(CrawlCompletionStats.class);
    +
    +  private static final int MODE_HOST = 1;
    +  private static final int MODE_DOMAIN = 2;
    +
    +  private int mode = 0;
    +
    +  public int run(String[] args) throws Exception {
    +    if (args.length < 2) {
    --- End diff --

    +1 I agree completely @lewismc. I got a bit lazy and stole some from domainstats (which is also in need of some commons-cli love). I'll try to throw a patch together and address some of these issues when I get some free time.

> Create a "crawl completeness" utility
> -------------------------------------
>
>                 Key: NUTCH-2155
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2155
>             Project: Nutch
>          Issue Type: Improvement
>          Components: util
>    Affects Versions: 1.10
>            Reporter: Michael Joyce
>             Fix For: 1.12
>
>
> I've found it useful to have a tool for dumping some "completeness"
> information from a crawl similar to how domainstats does but including
> fetched and unfetched counts per domain/host. This is especially nice when
> doing vertical crawls over a few domains or just to see how much of a
> host/domain you've covered with your crawl so far.
[GitHub] nutch pull request: NUTCH-2155 - Add crawl completion utility
Github user MJJoyce commented on a diff in the pull request: https://github.com/apache/nutch/pull/83#discussion_r43324656
--- Diff: src/java/org/apache/nutch/util/CrawlCompletionStats.java ---
@@ -0,0 +1,189 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.nutch.util;
+
+import java.io.IOException;
+import java.net.URL;
+import java.text.SimpleDateFormat;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.conf.Configured;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.io.LongWritable;
+import org.apache.hadoop.io.Text;
+import org.apache.hadoop.mapreduce.Job;
+import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
+import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
+import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
+import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
+import org.apache.hadoop.mapreduce.Mapper;
+import org.apache.hadoop.mapreduce.Reducer;
+import org.apache.hadoop.util.Tool;
+import org.apache.hadoop.util.ToolRunner;
+import org.apache.nutch.crawl.CrawlDatum;
+import org.apache.nutch.util.NutchConfiguration;
+import org.apache.nutch.util.TimingUtil;
+import org.apache.nutch.util.URLUtil;
+
+/**
+ * Extracts some simple crawl completion stats from the crawldb
+ *
+ * Stats will be sorted by host/domain and will be of the form:
+ * 1 www.spitzer.caltech.edu FETCHED
+ * 50 www.spitzer.caltech.edu UNFETCHED
+ *
+ */
+public class CrawlCompletionStats extends Configured implements Tool {
+
+  private static final Logger LOG = LoggerFactory
+      .getLogger(CrawlCompletionStats.class);
+
+  private static final int MODE_HOST = 1;
+  private static final int MODE_DOMAIN = 2;
+
+  private int mode = 0;
+
+  public int run(String[] args) throws Exception {
+    if (args.length < 2) {
--- End diff --
+1 I agree completely @lewismc. I got a bit lazy and stole some from domainstats (which is also in need of some commons-cli love). I'll try to throw a patch together and address some of these issues when I get some free time.
---
If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well.
If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] nutch pull request: NUTCH-2155 - Add crawl completion utility
Github user lewismc commented on a diff in the pull request: https://github.com/apache/nutch/pull/83#discussion_r43324357
--- Diff: src/java/org/apache/nutch/util/CrawlCompletionStats.java ---
@@ -0,0 +1,189 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.nutch.util;
+
+import java.io.IOException;
+import java.net.URL;
+import java.text.SimpleDateFormat;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.conf.Configured;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.io.LongWritable;
+import org.apache.hadoop.io.Text;
+import org.apache.hadoop.mapreduce.Job;
+import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
+import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
+import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
+import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
+import org.apache.hadoop.mapreduce.Mapper;
+import org.apache.hadoop.mapreduce.Reducer;
+import org.apache.hadoop.util.Tool;
+import org.apache.hadoop.util.ToolRunner;
+import org.apache.nutch.crawl.CrawlDatum;
+import org.apache.nutch.util.NutchConfiguration;
+import org.apache.nutch.util.TimingUtil;
+import org.apache.nutch.util.URLUtil;
+
+/**
+ * Extracts some simple crawl completion stats from the crawldb
+ *
+ * Stats will be sorted by host/domain and will be of the form:
+ * 1 www.spitzer.caltech.edu FETCHED
+ * 50 www.spitzer.caltech.edu UNFETCHED
+ *
+ */
+public class CrawlCompletionStats extends Configured implements Tool {
+
+  private static final Logger LOG = LoggerFactory
+      .getLogger(CrawlCompletionStats.class);
+
+  private static final int MODE_HOST = 1;
+  private static final int MODE_DOMAIN = 2;
+
+  private int mode = 0;
+
+  public int run(String[] args) throws Exception {
+    if (args.length < 2) {
--- End diff --
Not an absolute requirement, merely a suggestion: it would be GREAT to see commons-cli used here to prevent incorrect CLI usage. It also prints much more user-friendly output when invoked without options or with '-h'.
For excellent examples of where commons-cli is already used please see [here](https://github.com/apache/nutch/blob/trunk/src/java/org/apache/nutch/scoring/webgraph/WebGraph.java#L698-L760) @MJJoyce. I like this job as well, it's a neat and quick way to see domain coverage.
[jira] [Commented] (NUTCH-2155) Create a "crawl completeness" utility
[ https://issues.apache.org/jira/browse/NUTCH-2155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14979319#comment-14979319 ] ASF GitHub Bot commented on NUTCH-2155: --- Github user lewismc commented on a diff in the pull request: https://github.com/apache/nutch/pull/83#discussion_r43324357
--- Diff: src/java/org/apache/nutch/util/CrawlCompletionStats.java ---
@@ -0,0 +1,189 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.nutch.util;
+
+import java.io.IOException;
+import java.net.URL;
+import java.text.SimpleDateFormat;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.conf.Configured;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.io.LongWritable;
+import org.apache.hadoop.io.Text;
+import org.apache.hadoop.mapreduce.Job;
+import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
+import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
+import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
+import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
+import org.apache.hadoop.mapreduce.Mapper;
+import org.apache.hadoop.mapreduce.Reducer;
+import org.apache.hadoop.util.Tool;
+import org.apache.hadoop.util.ToolRunner;
+import org.apache.nutch.crawl.CrawlDatum;
+import org.apache.nutch.util.NutchConfiguration;
+import org.apache.nutch.util.TimingUtil;
+import org.apache.nutch.util.URLUtil;
+
+/**
+ * Extracts some simple crawl completion stats from the crawldb
+ *
+ * Stats will be sorted by host/domain and will be of the form:
+ * 1 www.spitzer.caltech.edu FETCHED
+ * 50 www.spitzer.caltech.edu UNFETCHED
+ *
+ */
+public class CrawlCompletionStats extends Configured implements Tool {
+
+  private static final Logger LOG = LoggerFactory
+      .getLogger(CrawlCompletionStats.class);
+
+  private static final int MODE_HOST = 1;
+  private static final int MODE_DOMAIN = 2;
+
+  private int mode = 0;
+
+  public int run(String[] args) throws Exception {
+    if (args.length < 2) {
--- End diff --
Not an absolute requirement, merely a suggestion: it would be GREAT to see commons-cli used here to prevent incorrect CLI usage. It also prints much more user-friendly output when invoked without options or with '-h'.
For excellent examples of where commons-cli is already used please see [here](https://github.com/apache/nutch/blob/trunk/src/java/org/apache/nutch/scoring/webgraph/WebGraph.java#L698-L760) @MJJoyce. I like this job as well, it's a neat and quick way to see domain coverage.
[GitHub] nutch pull request: NUTCH-2155 - Add crawl completion utility
GitHub user MJJoyce opened a pull request: https://github.com/apache/nutch/pull/83 NUTCH-2155 - Add crawl completion utility - Add simple crawl completion utility that reports counts of fetched and unfetched pages per domain or host. - Update "nutch" helper script with new utility command. You can merge this pull request into a Git repository by running: $ git pull https://github.com/MJJoyce/nutch NUTCH-2155 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/nutch/pull/83.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #83 commit 2534b7a32417e044f5c1c39f4409a6d6826eee69 Author: Michael Joyce Date: 2015-10-28T21:18:16Z NUTCH-2155 - Add crawl completion util - Add simple crawl completion utility that reports counts of fetched and unfetched pages per domain or host. - Update "nutch" helper script with new utility command.
[jira] [Commented] (NUTCH-2155) Create a "crawl completeness" utility
[ https://issues.apache.org/jira/browse/NUTCH-2155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14979258#comment-14979258 ] ASF GitHub Bot commented on NUTCH-2155: --- GitHub user MJJoyce opened a pull request: https://github.com/apache/nutch/pull/83 NUTCH-2155 - Add crawl completion utility - Add simple crawl completion utility that reports counts of fetched and unfetched pages per domain or host. - Update "nutch" helper script with new utility command. You can merge this pull request into a Git repository by running: $ git pull https://github.com/MJJoyce/nutch NUTCH-2155 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/nutch/pull/83.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #83 commit 2534b7a32417e044f5c1c39f4409a6d6826eee69 Author: Michael Joyce Date: 2015-10-28T21:18:16Z NUTCH-2155 - Add crawl completion util - Add simple crawl completion utility that reports counts of fetched and unfetched pages per domain or host. - Update "nutch" helper script with new utility command.
[jira] [Created] (NUTCH-2155) Create a "crawl completeness" utility
Michael Joyce created NUTCH-2155: Summary: Create a "crawl completeness" utility Key: NUTCH-2155 URL: https://issues.apache.org/jira/browse/NUTCH-2155 Project: Nutch Issue Type: Improvement Components: util Affects Versions: 1.10 Reporter: Michael Joyce Fix For: 1.12 I've found it useful to have a tool for dumping some "completeness" information from a crawl similar to how domainstats does but including fetched and unfetched counts per domain/host. This is especially nice when doing vertical crawls over a few domains or just to see how much of a host/domain you've covered with your crawl so far. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
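To make the requested output concrete, here is a minimal in-memory sketch of the per-host fetched/unfetched tally described above. It is not the code from the pull request, which runs as a Hadoop MapReduce job over the crawldb; the class `CrawlCompletionSketch`, the `Page` type and the `summarize` method are hypothetical names used only for illustration, and only host mode (not domain mode) is shown.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Illustrative only: tallies FETCHED/UNFETCHED URLs per host, in memory. */
final class CrawlCompletionSketch {

  /** Hypothetical stand-in for a crawldb entry: a URL plus a fetched flag. */
  static final class Page {
    final String url;
    final boolean fetched;

    Page(String url, boolean fetched) {
      this.url = url;
      this.fetched = fetched;
    }
  }

  /**
   * Returns lines of the form "<count> <host> <STATUS>", sorted by host and
   * then by status. URI.create throws IllegalArgumentException on malformed
   * URLs; a real tool would need to handle that case.
   */
  static List<String> summarize(List<Page> pages) {
    Map<String, Long> counts = new TreeMap<>();
    for (Page p : pages) {
      String host = URI.create(p.url).getHost();
      if (host == null) {
        continue; // skip URLs without a host component
      }
      String key = host + " " + (p.fetched ? "FETCHED" : "UNFETCHED");
      counts.merge(key, 1L, Long::sum);
    }
    List<String> lines = new ArrayList<>();
    for (Map.Entry<String, Long> c : counts.entrySet()) {
      lines.add(c.getValue() + " " + c.getKey());
    }
    return lines;
  }

  public static void main(String[] args) {
    List<Page> pages = Arrays.asList(
        new Page("http://www.spitzer.caltech.edu/a", true),
        new Page("http://www.spitzer.caltech.edu/b", false));
    for (String line : summarize(pages)) {
      System.out.println(line);
    }
  }
}
```

Keying the sorted map on host plus status keeps the FETCHED and UNFETCHED rows for one host adjacent, which matches the report shape quoted in the class Javadoc of the pull request.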
[jira] [Commented] (NUTCH-2155) Create a "crawl completeness" utility
[ https://issues.apache.org/jira/browse/NUTCH-2155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14979196#comment-14979196 ] Michael Joyce commented on NUTCH-2155: -- Should have a first patch up shortly for review folks
[Nutch Wiki] Trivial Update of "NewScoringIndexingExample" by LewisJohnMcgibbney
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification. The "NewScoringIndexingExample" page has been changed by LewisJohnMcgibbney: https://wiki.apache.org/nutch/NewScoringIndexingExample?action=diff&rev1=7&rev2=8 = Class Diagram = + Below is a thumbnail Class Diagram representing the Java Class ecosystem for WebGraph. + You can click on the thumbnail for a much larger, downloadable picture. + + [[attachment:NutchWebGraph.png|{{attachment:NutchWebGraph.png||width=100}}]] +
[Nutch Wiki] New attachment added to page NewScoringIndexingExample
Dear Wiki user, You have subscribed to a wiki page "NewScoringIndexingExample" for change notification. An attachment has been added to that page by LewisJohnMcgibbney. Following detailed information is available: Attachment name: NutchWebGraph.png Attachment size: 859412 Attachment link: https://wiki.apache.org/nutch/NewScoringIndexingExample?action=AttachFile&do=get&target=NutchWebGraph.png Page link: https://wiki.apache.org/nutch/NewScoringIndexingExample
[Nutch Wiki] Trivial Update of "NewScoringIndexingExample" by LewisJohnMcgibbney
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification. The "NewScoringIndexingExample" page has been changed by LewisJohnMcgibbney: https://wiki.apache.org/nutch/NewScoringIndexingExample?action=diff&rev1=6&rev2=7 '''N.B.''' This page and the functionality described within is only applicable and relevant to Nutch 1.X. + <> + + = Introduction = + Below is an example of running the new scoring and indexing systems from start to finish. This was done with a sample of 1000 urls and I ran two different fetch cycles. The first being 1000 urls and the second being the top 2000 urls. The loops job is optional but included for completeness. In production we have actually removed that job. This was done with a clean pull from Nutch trunk as of 2009-03-06 (right before 1.0 is set to be released). If anybody has any problems running these commands or has questions send me an email or send one to the nutch users or dev list and I will reply. Please send it to kubes at the apache address dot org. + = Workflow = {{{ bin/nutch inject crawl/crawldb crawl/urls/ @@ -18, +23 @@ }}} One thing to point out here is that WebGraph is meant to be used on larger web crawls to create web graphs. By default it ignores outlinks to pages in the same domain, including subdomains, and pages with the same hostname. It also limits to one outlink per page to links in the same page or the same domain. All of these options are changeable through the following configuration options: + + = Configuration = {{{ @@ -47, +54 @@ }}} + + = Additional WebGraph Classes = But by default if you are only crawling pages within a domain or within a set of subdomains, all outlinks will be ignored and you will come up with an empty webgraph. This in turn will throw an error while processing through the LinkRank job. 
The flip side is by NOT ignoring links to the same domain/host and by not limiting those links, the webgraph becomes much, much more dense and hence there are a lot more links to process which probably won't affect relevancy as much. @@ -163, +172 @@ bin/nutch org.apache.nutch.indexer.field.FieldIndexer -fields crawl/fields/basicfields/ -fields crawl/fields/anchorfields/ -output crawl/indexes }}} + = Class Diagram = +
[Nutch Wiki] Trivial Update of "NewScoringIndexingExample" by LewisJohnMcgibbney
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification. The "NewScoringIndexingExample" page has been changed by LewisJohnMcgibbney: https://wiki.apache.org/nutch/NewScoringIndexingExample?action=diff&rev1=5&rev2=6 bin/nutch org.apache.nutch.scoring.webgraph.WebGraph -segment crawl/segments/20090306093949/ -segment crawl/segments/20090306100055/ -webgraphdb crawl/webgraphdb }}} - One thing that has been brought up is the -segment flag on webgraph. If you have more than one segment then you would have more than one segment flag as shown above. + One thing that has been brought up is the -segment flag on webgraph. If you have more than one segment then you would use the -segmentDir flag available on the command line interface. {{{ bin/nutch org.apache.nutch.scoring.webgraph.Loops -webgraphdb crawl/webgraphdb/
[jira] [Commented] (NUTCH-2132) Publisher/Subscriber model for Nutch to emit events
[ https://issues.apache.org/jira/browse/NUTCH-2132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14978963#comment-14978963 ] Aron Ahmadia commented on NUTCH-2132: - Thanks for that guidance :) > Publisher/Subscriber model for Nutch to emit events > > > Key: NUTCH-2132 > URL: https://issues.apache.org/jira/browse/NUTCH-2132 > Project: Nutch > Issue Type: New Feature > Components: fetcher, REST_api >Reporter: Sujen Shah > Labels: memex > Fix For: 1.12 > > Attachments: NUTCH-2132.patch, PubSub_routingkey.patch > > > It would be nice to have a Pub/Sub model in Nutch to emit certain events (e.g. > Fetcher events like fetch-start, fetch-end, a fetch report which may contain > data like outlinks of the current fetched url, score, etc). > A consumer of this functionality could use this data to generate real time > visualization and generate statistics of the crawl without having to wait for > the fetch round to finish. > The REST API could contain an endpoint which would respond with a url to > which a client could subscribe to get the fetcher events. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (NUTCH-2132) Publisher/Subscriber model for Nutch to emit events
[ https://issues.apache.org/jira/browse/NUTCH-2132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14978952#comment-14978952 ] Sujen Shah commented on NUTCH-2132: --- Yes this is taken care of in the second patch. And, apply the second patch against trunk and not on top of the first one.
[jira] [Commented] (NUTCH-2132) Publisher/Subscriber model for Nutch to emit events
[ https://issues.apache.org/jira/browse/NUTCH-2132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14978945#comment-14978945 ] Aron Ahmadia commented on NUTCH-2132: - got it.
[jira] [Commented] (NUTCH-2132) Publisher/Subscriber model for Nutch to emit events
[ https://issues.apache.org/jira/browse/NUTCH-2132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14978944#comment-14978944 ] Aron Ahmadia commented on NUTCH-2132: - I think the protection belongs in

public void publish(FetcherThreadEvent event) {
  publisher.publish(event);
}

This call should only dispatch if publisher has been initialized.
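The guard suggested in the comment above could look roughly like the following. This is a sketch under stated assumptions, not the contents of either attached patch: `GuardedPublisherSketch` and its nested `EventPublisher` interface are hypothetical stand-ins for the Nutch classes, and a plain `String` replaces `FetcherThreadEvent` so the example compiles on its own.

```java
/** Illustrative only: dispatches events only when a publisher exists. */
final class GuardedPublisherSketch {

  /** Hypothetical stand-in for the real publisher type. */
  interface EventPublisher {
    void publish(String event);
  }

  // Left null when publishing is disabled or was never initialized.
  private final EventPublisher publisher;

  GuardedPublisherSketch(EventPublisher publisher) {
    this.publisher = publisher;
  }

  /** Publishes the event if possible and reports whether it was dispatched. */
  boolean publish(String event) {
    if (publisher == null) {
      return false; // nothing to dispatch to, so skip instead of throwing
    }
    publisher.publish(event);
    return true;
  }
}
```

With a check like this in the wrapper, a fetcher thread that calls `publish` while publishing is turned off gets a no-op rather than a `NullPointerException`.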
[jira] [Commented] (NUTCH-2132) Publisher/Subscriber model for Nutch to emit events
[ https://issues.apache.org/jira/browse/NUTCH-2132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14978942#comment-14978942 ] Sujen Shah commented on NUTCH-2132: --- Yes the first patch does not have that property, it was incorporated in the second one.
[jira] [Commented] (NUTCH-2132) Publisher/Subscriber model for Nutch to emit events
[ https://issues.apache.org/jira/browse/NUTCH-2132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14978938#comment-14978938 ] Aron Ahmadia commented on NUTCH-2132: - I'm observing crashes when fetcher.publisher is set to false. This is with your first patch applied but not the second.

fetch of http://www.google.com/ failed with: java.lang.NullPointerException
    at org.apache.nutch.fetcher.FetcherThreadPublisher.publish(FetcherThreadPublisher.java:43)
    at org.apache.nutch.fetcher.FetcherThread.run(FetcherThread.java:253)
fetch of http://aron.ahmadia.net/ failed with: java.lang.NullPointerException
    at org.apache.nutch.fetcher.FetcherThreadPublisher.publish(FetcherThreadPublisher.java:43)
    at org.apache.nutch.fetcher.FetcherThread.run(FetcherThread.java:253)
[jira] [Commented] (NUTCH-2132) Publisher/Subscriber model for Nutch to emit events
[ https://issues.apache.org/jira/browse/NUTCH-2132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14978911#comment-14978911 ] Sujen Shah commented on NUTCH-2132: --- [~ahmadia], bq. One issue I'm having is that if I start a Nutch server with this patch and RMQ configured to publish, crawls fail unless a RMQ server is available. - I'll look into this and upload a new patch. Thanks for pointing this out. bq. Barring that, is it possible to reconfigure the server using the config REST endpoint to not publish - Yes, if you change the config parameter "fetcher.publisher" to false, it should stop publishing and vice versa. I have tested the case where it is false by default and Nutch is then configured to start publishing, but I have not tried the reverse.
[jira] [Commented] (NUTCH-2132) Publisher/Subscriber model for Nutch to emit events
[ https://issues.apache.org/jira/browse/NUTCH-2132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14978908#comment-14978908 ] Aron Ahmadia commented on NUTCH-2132: - Also, the vice-versa situation is important as well. Can I start up a Nutch server that isn't configured to publish, then turn publishing on?
[jira] [Commented] (NUTCH-2132) Publisher/Subscriber model for Nutch to emit events
[ https://issues.apache.org/jira/browse/NUTCH-2132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14978902#comment-14978902 ]

Aron Ahmadia commented on NUTCH-2132:

[~sujenshah] - I'm reviewing this again now. One issue I'm having is that if I start a Nutch server with this patch and RMQ configured to publish, crawls fail unless an RMQ server is available. Since the point of a pub/sub model is that these sorts of messages are ephemeral (and the routing exchanges may go up and down), it should not be a fatal error for RMQ to go down.

Barring that, is it possible to reconfigure the server using the config REST endpoint to not publish? Testing now...
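The "not a fatal error" behaviour requested above can be sketched as a best-effort wrapper around whatever call actually sends the event. This is only an illustration in Python, not the code in NUTCH-2132.patch: `BestEffortPublisher`, `broker_down`, and the `send` callable are hypothetical names, and no real RabbitMQ client is used.

```python
import logging

logger = logging.getLogger("publisher_sketch")


class BestEffortPublisher:
    """Wraps a send callable so that broker failures never reach the caller."""

    def __init__(self, send):
        # `send` is any callable taking (routing_key, message). It is a
        # hypothetical stand-in for a real message-broker client call.
        self._send = send
        self.dropped = 0

    def publish(self, routing_key, message):
        """Try to send one event; return True on success, False if dropped."""
        try:
            self._send(routing_key, message)
            return True
        except Exception as exc:
            # Events are ephemeral, so a failed send is logged and counted
            # instead of being allowed to abort the crawl.
            self.dropped += 1
            logger.warning("dropping event %r: %s", routing_key, exc)
            return False


def broker_down(routing_key, message):
    """Simulates an unreachable broker (hypothetical helper for the demo)."""
    raise ConnectionError("broker unavailable")


publisher = BestEffortPublisher(broker_down)
delivered = publisher.publish("fetch.start", "{}")  # False, and no exception
```

The broad `except` is deliberate in this sketch: the fetcher cares only that publishing cannot fail the crawl. A real implementation would probably narrow it to the client's connection errors.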
[jira] [Commented] (NUTCH-2153) Nutch REST API (DB) uses POST instead of GET to request
[ https://issues.apache.org/jira/browse/NUTCH-2153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14978854#comment-14978854 ]

Chris A. Mattmann commented on NUTCH-2153:

Yeah I think we may want to do something async here too and use GET. Let's think about this. It may be a 1.12+ improvement though. At a minimum I think we can update to GET for 1.11.

> Nutch REST API (DB) uses POST instead of GET to request
>
> Key: NUTCH-2153
> URL: https://issues.apache.org/jira/browse/NUTCH-2153
> Project: Nutch
> Issue Type: Bug
> Components: REST_api
> Affects Versions: 1.11
> Reporter: Aron Ahmadia
> Priority: Trivial
> Labels: memex
[jira] [Commented] (NUTCH-2153) Nutch REST API (DB) uses POST instead of GET to request
[ https://issues.apache.org/jira/browse/NUTCH-2153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14978841#comment-14978841 ]

Aron Ahmadia commented on NUTCH-2153:

If it's asynchronous, use a POST and return a crawldb_job identifier that can be used to query if the job is complete. I've got mixed feelings on the synchronous case. I'm happy to follow Mattmann's guidance on this.
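The asynchronous variant suggested above (a POST returns an identifier, which is then used to ask whether the job is complete) might look roughly like the following in-memory sketch. Everything here is an assumption made for illustration: `CrawlDbJobs`, the state names, and the use of a UUID as the crawldb_job identifier are not taken from Nutch.

```python
import threading
import uuid


class CrawlDbJobs:
    """In-memory registry: submit() plays the POST, status() plays the GET."""

    def __init__(self):
        self._lock = threading.Lock()
        self._jobs = {}

    def submit(self, compute):
        """Start `compute` in a background thread and return (job_id, thread)."""
        job_id = str(uuid.uuid4())
        with self._lock:
            # Register the job before the thread starts so that status()
            # can always find it.
            self._jobs[job_id] = {"state": "RUNNING", "result": None}
        worker = threading.Thread(target=self._run, args=(job_id, compute))
        worker.start()
        return job_id, worker

    def _run(self, job_id, compute):
        try:
            result, state = compute(), "FINISHED"
        except Exception as exc:
            result, state = str(exc), "FAILED"
        with self._lock:
            self._jobs[job_id] = {"state": state, "result": result}

    def status(self, job_id):
        """Return a copy of the job record; cheap enough to serve from a GET."""
        with self._lock:
            return dict(self._jobs[job_id])


jobs = CrawlDbJobs()
job_id, worker = jobs.submit(lambda: {"totalUrls": "2"})
worker.join()  # a real client would poll status(job_id) instead of joining
final = jobs.status(job_id)
```

With this split, the POST is the only request that creates anything, and the status query is read-only.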
[jira] [Commented] (NUTCH-2154) Nutch REST API (DB) suffering NullPointerException
[ https://issues.apache.org/jira/browse/NUTCH-2154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14978824#comment-14978824 ]

Aron Ahmadia commented on NUTCH-2154:

Looks like it's assumed that "args" is passed in to the REST query, even though it's for optional arguments.

In [3]: cc.jobClient.stats()
nutch.py: POST Endpoint: /db/crawldb
nutch.py: POST Request data: {'args': {}, 'type': 'stats', 'crawlId': 'crawl_aahmadia_2015-10-28T13_27_31.807902', 'confId': 'default'}
nutch.py: POST Request headers: {'Accept': 'application/json'}
nutch.py: Response headers: {'Date': 'Wed, 28 Oct 2015 17:27:36 GMT', 'Transfer-Encoding': 'chunked', 'Content-Type': 'application/json', 'Server': 'Jetty(8.1.15.v20140411)'}
nutch.py: Response status: 200
nutch.py: Response JSON: {u'status': {u'1': {u'count': u'2', u'statusValue': u'db_unfetched'}}, u'retry 0': u'2', u'avgScore': u'1.0', u'totalUrls': u'2', u'minScore': u'1.0', u'maxScore': u'1.0'}
Out[3]:
{u'avgScore': u'1.0',
 u'maxScore': u'1.0',
 u'minScore': u'1.0',
 u'retry 0': u'2',
 u'status': {u'1': {u'count': u'2', u'statusValue': u'db_unfetched'}},
 u'totalUrls': u'2'}

Perhaps the right thing to do is write some protection around args being undefined and initialize it as an empty hash map?

> Nutch REST API (DB) suffering NullPointerException
>
> Key: NUTCH-2154
> URL: https://issues.apache.org/jira/browse/NUTCH-2154
> Project: Nutch
> Issue Type: Bug
> Components: REST_api
> Affects Versions: 1.11
> Reporter: Aron Ahmadia
> Assignee: Chris A. Mattmann
> Priority: Minor
> Labels: memex
> Fix For: 1.11
>
> Not sure what's causing this. I tried this request both before and after a
> crawl had completed.
> nutch.py: POST Endpoint: /db/crawldb > nutch.py: POST Request data: {'type': 'stats', 'crawlId': > 'crawl_aahmadia_2015-10-28T13_17_15.034351', 'confId': 'default'} > nutch.py: POST Request headers: {'Accept': 'application/json'} > nutch.py: Response headers: {'Date': 'Wed, 28 Oct 2015 17:18:54 GMT', > 'Content-Length': '0', 'Server': 'Jetty(8.1.15.v20140411)'} > nutch.py: Response status: 500 > nutch log: > java.lang.NullPointerException > at org.apache.nutch.crawl.CrawlDbReader.query(CrawlDbReader.java:747) > at > org.apache.nutch.service.resources.DbResource.crawlDbStats(DbResource.java:95) > at > org.apache.nutch.service.resources.DbResource.readdb(DbResource.java:52) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:497) > at > org.apache.cxf.service.invoker.AbstractInvoker.performInvocation(AbstractInvoker.java:181) > at > org.apache.cxf.service.invoker.AbstractInvoker.invoke(AbstractInvoker.java:97) > at org.apache.cxf.jaxrs.JAXRSInvoker.invoke(JAXRSInvoker.java:200) > at org.apache.cxf.jaxrs.JAXRSInvoker.invoke(JAXRSInvoker.java:99) > at > org.apache.cxf.interceptor.ServiceInvokerInterceptor$1.run(ServiceInvokerInterceptor.java:59) > at > org.apache.cxf.interceptor.ServiceInvokerInterceptor.handleMessage(ServiceInvokerInterceptor.java:96) > at > org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:307) > at > org.apache.cxf.transport.ChainInitiationObserver.onMessage(ChainInitiationObserver.java:121) > at > org.apache.cxf.transport.http.AbstractHTTPDestination.invoke(AbstractHTTPDestination.java:251) > at > org.apache.cxf.transport.http_jetty.JettyHTTPDestination.doService(JettyHTTPDestination.java:261) > at > 
org.apache.cxf.transport.http_jetty.JettyHTTPHandler.handle(JettyHTTPHandler.java:70) > at > org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1088) > at > org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1024) > at > org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135) > at > org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255) > at > org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116) > at org.eclipse.jetty.server.Server.handle(Server.java:370) > at > org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:494) > at > org.eclipse.jetty.server.AbstractHttpConnection.content(AbstractHttpConnection.java:982) > at > org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:1043) > at org.eclipse.jetty.htt
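Until the server guards against a missing "args", a client can avoid the 500 by always sending an empty map, as the successful request in the comment above does. The helper below is a minimal client-side sketch in Python. `crawldb_request` is a hypothetical name and is not part of nutch.py; the field names are the ones shown in the logged request data.

```python
import json


def crawldb_request(query_type, crawl_id, conf_id="default", args=None):
    """Build the JSON body for POST /db/crawldb, always including 'args'."""
    return {
        "type": query_type,
        "confId": conf_id,
        "crawlId": crawl_id,
        # Default to an empty map so the key is never absent from the body,
        # and copy the caller's dict so later mutation cannot leak in.
        "args": dict(args) if args else {},
    }


body = json.dumps(crawldb_request("stats", "crawl01"), sort_keys=True)
```

This is a workaround only. The fix proposed in the comment (initializing args to an empty hash map when it is undefined) belongs on the server side.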
[jira] [Commented] (NUTCH-2153) Nutch REST API (DB) uses POST instead of GET to request
[ https://issues.apache.org/jira/browse/NUTCH-2153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14978826#comment-14978826 ]

Sujen Shah commented on NUTCH-2153:

Hi [~ahmadia] and [~chrismattmann],

Currently, when the Nutch REST services are used in local mode, the crawldb job executes quickly. In distributed mode, however, the same job can take a fair amount of time, so issuing a GET request would make the client wait a long time for the response. A POST request was used because the crawldb resource is created when a user issues a request rather than being precomputed (which is usually the case when a GET is used).

The /db endpoint still needs development so that it can spin up threads for computation, as the /job endpoint does, and then provide a GET interface to query results. I have tried to use the same concept in the commoncrawldump service, as that might also take longer as the amount of crawled data increases.

I would like to know your thoughts on how to handle such cases, where issuing a GET requires computing the resource.

Thanks!
[jira] [Commented] (NUTCH-2154) Nutch REST API (DB) suffering NullPointerException
[ https://issues.apache.org/jira/browse/NUTCH-2154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14978811#comment-14978811 ]

Chris A. Mattmann commented on NUTCH-2154:

I have to respin 1.11 anyways, so I'll take a look at this real quick [~ahmadia] thanks!
[jira] [Assigned] (NUTCH-2154) Nutch REST API (DB) suffering NullPointerException
[ https://issues.apache.org/jira/browse/NUTCH-2154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann reassigned NUTCH-2154:

Assignee: Chris A. Mattmann
[jira] [Updated] (NUTCH-2154) Nutch REST API (DB) suffering NullPointerException
[ https://issues.apache.org/jira/browse/NUTCH-2154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann updated NUTCH-2154:

Fix Version/s: 1.11
[jira] [Updated] (NUTCH-2153) Nutch REST API (DB) uses POST instead of GET to request
[ https://issues.apache.org/jira/browse/NUTCH-2153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aron Ahmadia updated NUTCH-2153:

Affects Version/s: 1.11 (was: 1.10)
[jira] [Created] (NUTCH-2154) Nutch REST API (DB) suffering NullPointerException
Aron Ahmadia created NUTCH-2154:

Summary: Nutch REST API (DB) suffering NullPointerException
Key: NUTCH-2154
URL: https://issues.apache.org/jira/browse/NUTCH-2154
Project: Nutch
Issue Type: Bug
Components: REST_api
Affects Versions: 1.11
Reporter: Aron Ahmadia
Priority: Minor

Not sure what's causing this. I tried this request both before and after a crawl had completed.

nutch.py: POST Endpoint: /db/crawldb
nutch.py: POST Request data: {'type': 'stats', 'crawlId': 'crawl_aahmadia_2015-10-28T13_17_15.034351', 'confId': 'default'}
nutch.py: POST Request headers: {'Accept': 'application/json'}
nutch.py: Response headers: {'Date': 'Wed, 28 Oct 2015 17:18:54 GMT', 'Content-Length': '0', 'Server': 'Jetty(8.1.15.v20140411)'}
nutch.py: Response status: 500

nutch log:

java.lang.NullPointerException
    at org.apache.nutch.crawl.CrawlDbReader.query(CrawlDbReader.java:747)
    at org.apache.nutch.service.resources.DbResource.crawlDbStats(DbResource.java:95)
    at org.apache.nutch.service.resources.DbResource.readdb(DbResource.java:52)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:497)
    at org.apache.cxf.service.invoker.AbstractInvoker.performInvocation(AbstractInvoker.java:181)
    at org.apache.cxf.service.invoker.AbstractInvoker.invoke(AbstractInvoker.java:97)
    at org.apache.cxf.jaxrs.JAXRSInvoker.invoke(JAXRSInvoker.java:200)
    at org.apache.cxf.jaxrs.JAXRSInvoker.invoke(JAXRSInvoker.java:99)
    at org.apache.cxf.interceptor.ServiceInvokerInterceptor$1.run(ServiceInvokerInterceptor.java:59)
    at org.apache.cxf.interceptor.ServiceInvokerInterceptor.handleMessage(ServiceInvokerInterceptor.java:96)
    at org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:307)
    at org.apache.cxf.transport.ChainInitiationObserver.onMessage(ChainInitiationObserver.java:121)
    at org.apache.cxf.transport.http.AbstractHTTPDestination.invoke(AbstractHTTPDestination.java:251)
    at org.apache.cxf.transport.http_jetty.JettyHTTPDestination.doService(JettyHTTPDestination.java:261)
    at org.apache.cxf.transport.http_jetty.JettyHTTPHandler.handle(JettyHTTPHandler.java:70)
    at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1088)
    at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1024)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
    at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
    at org.eclipse.jetty.server.Server.handle(Server.java:370)
    at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:494)
    at org.eclipse.jetty.server.AbstractHttpConnection.content(AbstractHttpConnection.java:982)
    at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:1043)
    at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:865)
    at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:240)
    at org.eclipse.jetty.server.AsyncHttpConnection.handle(AsyncHttpConnection.java:82)
    at org.eclipse.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:696)
    at org.eclipse.jetty.io.nio.SelectChannelEndPoint$1.run(SelectChannelEndPoint.java:53)
    at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
    at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
    at java.lang.Thread.run(Thread.java:745)
[jira] [Commented] (NUTCH-2153) Nutch REST API (DB) uses POST instead of GET to request
[ https://issues.apache.org/jira/browse/NUTCH-2153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14978769#comment-14978769 ]

Chris A. Mattmann commented on NUTCH-2153:

Gotcha, thanks [~ahmadia]
[jira] [Commented] (NUTCH-2153) Nutch REST API (DB) uses POST instead of GET to request
[ https://issues.apache.org/jira/browse/NUTCH-2153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14978764#comment-14978764 ]

Aron Ahmadia commented on NUTCH-2153:

The API from https://wiki.apache.org/nutch/Nutch_1.X_RESTAPI:

POST /db/crawldb with following
{
  "type":"stats",
  "confId":"default",
  "crawlId":"crawl01",
  "args":{"someParam":"someValue"}
}

uses a POST to request information (stats). This should be a GET.
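For comparison, here is one way the same stats request could be expressed as a GET, with the body fields moved into the query string. This is purely a sketch of the idea in this issue: the API quoted above only documents POST, so the GET route, the JSON-encoded `args` parameter, the helper name `crawldb_get_url`, and the host and port are all assumptions.

```python
import json
from urllib.parse import parse_qs, urlencode, urlsplit


def crawldb_get_url(base_url, query_type, crawl_id, conf_id="default", args=None):
    """Encode the /db/crawldb request fields as a GET query string."""
    query = urlencode({
        "type": query_type,
        "confId": conf_id,
        "crawlId": crawl_id,
        # Nested arguments do not map onto flat query parameters, so this
        # sketch serializes them as one JSON string (an assumption).
        "args": json.dumps(args or {}, sort_keys=True),
    })
    return f"{base_url}/db/crawldb?{query}"


# Host and port are placeholders for wherever the Nutch server is listening.
url = crawldb_get_url(
    "http://localhost:8081", "stats", "crawl01",
    args={"someParam": "someValue"},
)
params = parse_qs(urlsplit(url).query)
```

A query-string form like this keeps the request free of a body, which is what makes it suitable for a GET. It only fits if the stats can be returned quickly or are already computed.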
[jira] [Commented] (NUTCH-2153) Nutch REST API (DB) uses POST instead of GET to request
[ https://issues.apache.org/jira/browse/NUTCH-2153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14978748#comment-14978748 ]

Chris A. Mattmann commented on NUTCH-2153:

Can you be more specific here, [~ahmadia]?
[jira] [Created] (NUTCH-2153) Nutch REST API (DB) uses POST instead of GET to request
Aron Ahmadia created NUTCH-2153:

Summary: Nutch REST API (DB) uses POST instead of GET to request
Key: NUTCH-2153
URL: https://issues.apache.org/jira/browse/NUTCH-2153
Project: Nutch
Issue Type: Bug
Components: REST_api
Affects Versions: 1.10
Reporter: Aron Ahmadia
Priority: Trivial