[jira] [Created] (NUTCH-2369) Create a new GraphGenerator Tool for writing Nutch Records as Full Web Graphs
Lewis John McGibbney created NUTCH-2369: --- Summary: Create a new GraphGenerator Tool for writing Nutch Records as Full Web Graphs Key: NUTCH-2369 URL: https://issues.apache.org/jira/browse/NUTCH-2369 Project: Nutch Issue Type: Task Components: graphgenerator, crawldb, hostdb, linkdb, segment, storage, tool Reporter: Lewis John McGibbney Assignee: Lewis John McGibbney Fix For: 1.14 I've been thinking for quite some time now that a new Tool which writes Nutch data out as full graph data would be an excellent addition to the codebase. My thoughts involve writing data using Tinkerpop's ScriptInputFormat and ScriptOutputFormat to create Vertex objects representing Nutch Crawl Records. http://tinkerpop.apache.org/javadocs/current/full/index.html?org/apache/tinkerpop/gremlin/hadoop/structure/io/script/ScriptInputFormat.html http://tinkerpop.apache.org/javadocs/current/full/index.html?org/apache/tinkerpop/gremlin/hadoop/structure/io/script/ScriptOutputFormat.html I envisage that each Vertex object would require the CrawlDB, LinkDB, a Segment and possibly the HostDB in order to be fully populated. Graph characteristics, e.g. Edges, would come from those existing data structures as well. It is my intention to propose this as a GSoC project for 2017 and I have already talked offline with a potential student [~omkar20895] about him participating as the student. Essentially, if we were able to create a Graph enabling true traversal, this could be a game changer for how Nutch Crawl data is interpreted. It is my feeling that this issue most likely also involves an entire upgrade of the Hadoop APIs from mapred to mapreduce for the master codebase. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
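The proposal above maps Nutch records onto a property graph: CrawlDb records become vertices, LinkDb records become edges. A real implementation would go through TinkerPop's ScriptInputFormat/ScriptOutputFormat; the following is only a rough pure-JDK illustration of that mapping, with hypothetical record shapes and class names (neither Nutch nor TinkerPop APIs):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Minimal adjacency-list web graph: one vertex per URL (carrying a
// crawl status, as a CrawlDb record would), one directed edge per
// outlink (as LinkDb/segment data would supply).
public class WebGraphSketch {
    private final Map<String, String> status = new LinkedHashMap<>();      // url -> crawl status
    private final Map<String, List<String>> edges = new LinkedHashMap<>(); // url -> outlinks

    public void addVertex(String url, String crawlStatus) {
        status.put(url, crawlStatus);
        edges.computeIfAbsent(url, k -> new ArrayList<>());
    }

    public void addEdge(String from, String to) {
        edges.computeIfAbsent(from, k -> new ArrayList<>()).add(to);
    }

    // A trivial one-hop "traversal"; a Gremlin graph would allow
    // arbitrary-depth traversals over the same structure.
    public List<String> outlinks(String url) {
        return edges.getOrDefault(url, new ArrayList<>());
    }

    public int vertexCount() {
        return status.size();
    }
}
```

In the TinkerPop version the vertex properties would be populated from the CrawlDB, LinkDB, Segment and possibly HostDB, as the description suggests.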
[jira] [Updated] (NUTCH-2369) Create a new GraphGenerator Tool for writing Nutch Records as a Full Web Graph
[ https://issues.apache.org/jira/browse/NUTCH-2369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-2369: Summary: Create a new GraphGenerator Tool for writing Nutch Records as a Full Web Graph (was: Create a new GraphGenerator Tool for writing Nutch Records as Full Web Graphs) > Create a new GraphGenerator Tool for writing Nutch Records as a Full Web Graph > -- > > Key: NUTCH-2369 > URL: https://issues.apache.org/jira/browse/NUTCH-2369 > Project: Nutch > Issue Type: Task > Components: crawldb, graphgenerator, hostdb, linkdb, segment, > storage, tool > Reporter: Lewis John McGibbney > Assignee: Lewis John McGibbney > Labels: gsoc2017 > Fix For: 1.14 -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (NUTCH-2357) Index metadata throw Exception because writable object cannot be cast to Text
[ https://issues.apache.org/jira/browse/NUTCH-2357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15925276#comment-15925276 ] Hudson commented on NUTCH-2357: --- SUCCESS: Integrated in Jenkins build Nutch-trunk #3412 (See [https://builds.apache.org/job/Nutch-trunk/3412/]) NUTCH-2357 Index metadata throw Exception because writable object cannot (snagel: [https://github.com/apache/nutch/commit/439f1153991ec104acdb73420ddc816cd9c665e8]) * (edit) src/plugin/index-metadata/src/java/org/apache/nutch/indexer/metadata/MetadataIndexer.java > Index metadata throw Exception because writable object cannot be cast to Text > - > > Key: NUTCH-2357 > URL: https://issues.apache.org/jira/browse/NUTCH-2357 > Project: Nutch > Issue Type: Bug > Components: indexer > Affects Versions: 1.12 > Environment: It was detected using Linux Mint 18. > Reporter: Eyeris Rodriguez Rueda > Assignee: Chris A. Mattmann > Priority: Minor > Fix For: 1.13 > > > The Index Metadata plugin uses this property (see below) to take keys from the Datum and index them: > > index.db.md > ... > > Using any value from this property, an Exception is thrown. > The problem occurs because the Writable object cannot be cast to Text; see this line: > https://github.com/apache/nutch/blob/master/src/plugin/index-metadata/src/java/org/apache/nutch/indexer/metadata/MetadataIndexer.java#L58 > A little change will fix it. 
> This is the Exception: > ** > 2017-02-06 18:18:29,969 INFO solr.SolrMappingReader - source: digest dest: > digest > 2017-02-06 18:18:29,969 INFO solr.SolrMappingReader - source: tstamp dest: > tstamp > 2017-02-06 18:18:29,969 INFO solr.SolrMappingReader - source: > metatag.description dest: description > 2017-02-06 18:18:29,969 INFO solr.SolrMappingReader - source: > metatag.keywords dest: keywords > 2017-02-06 18:18:30,134 WARN mapred.LocalJobRunner - job_local1516_0001 > java.lang.Exception: java.lang.ClassCastException: > org.apache.hadoop.io.IntWritable cannot be cast to org.apache.hadoop.io.Text > at > org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462) > at > org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:529) > Caused by: java.lang.ClassCastException: org.apache.hadoop.io.IntWritable > cannot be cast to org.apache.hadoop.io.Text > at > org.apache.nutch.indexer.metadata.MetadataIndexer.filter(MetadataIndexer.java:58) > at > org.apache.nutch.indexer.IndexingFilters.filter(IndexingFilters.java:51) > at > org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:330) > at > org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:56) > at > org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:444) > at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:392) > at > org.apache.hadoop.mapred.LocalJobRunner$Job$ReduceTaskRunnable.run(LocalJobRunner.java:319) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > 2017-02-06 18:18:30,777 ERROR indexer.IndexingJob - Indexer: > java.io.IOException: Job failed! 
> at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:836) > at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:145) > at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:228) > at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70) > at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:237) > ** -- This message was sent by Atlassian JIRA (v6.3.15#6346)
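The cast at MetadataIndexer.java:58 assumes every metadata value is a Text, so an IntWritable blows up with the ClassCastException shown above. A type-agnostic conversion avoids the cast entirely. Below is a pure-JDK sketch of that pattern (plain Objects stand in for Hadoop's Writable types, and `toIndexValue` is a hypothetical helper, not the exact fix committed):

```java
// Instead of `(Text) value` — which throws ClassCastException when the
// value is an IntWritable — convert any metadata value through its
// toString(), which every Writable implementation provides.
public class MetadataValueSketch {
    public static String toIndexValue(Object writableValue) {
        // Text, IntWritable, LongWritable, ... all yield a usable
        // string form from toString(); null stays null.
        return writableValue == null ? null : writableValue.toString();
    }
}
```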
[jira] [Commented] (NUTCH-2357) Index metadata throw Exception because writable object cannot be cast to Text
[ https://issues.apache.org/jira/browse/NUTCH-2357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15925236#comment-15925236 ] Chris A. Mattmann commented on NUTCH-2357: -- Thanks [~eyeris] and [~wastl-nagel]! -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Resolved] (NUTCH-2357) Index metadata throw Exception because writable object cannot be cast to Text
[ https://issues.apache.org/jira/browse/NUTCH-2357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann resolved NUTCH-2357. -- Resolution: Fixed Solved by [~wastl-nagel] in https://github.com/apache/nutch/commit/ee559bf204448e9c658da48250e04394adf357e5 -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Assigned] (NUTCH-2357) Index metadata throw Exception because writable object cannot be cast to Text
[ https://issues.apache.org/jira/browse/NUTCH-2357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann reassigned NUTCH-2357: Assignee: Chris A. Mattmann -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[GitHub] nutch pull request #177: NUTCH-2357 Index metadata throw Exception because w...
Github user chrismattmann closed the pull request at: https://github.com/apache/nutch/pull/177 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[jira] [Work started] (NUTCH-2357) Index metadata throw Exception because writable object cannot be cast to Text
[ https://issues.apache.org/jira/browse/NUTCH-2357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on NUTCH-2357 started by Chris A. Mattmann. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (NUTCH-2357) Index metadata throw Exception because writable object cannot be cast to Text
[ https://issues.apache.org/jira/browse/NUTCH-2357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15925232#comment-15925232 ] ASF GitHub Bot commented on NUTCH-2357: --- Github user chrismattmann closed the pull request at: https://github.com/apache/nutch/pull/177 -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Assigned] (NUTCH-2068) Allow subcollection overrides via metadata
[ https://issues.apache.org/jira/browse/NUTCH-2068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma reassigned NUTCH-2068: Assignee: Markus Jelsma > Allow subcollection overrides via metadata > -- > > Key: NUTCH-2068 > URL: https://issues.apache.org/jira/browse/NUTCH-2068 > Project: Nutch > Issue Type: Improvement > Affects Versions: 1.10 > Reporter: Markus Jelsma > Assignee: Markus Jelsma > Attachments: NUTCH-2068.patch > > > Similar to index-metadata, but overrides subcollection. If both subcollection > and index-metadata are active, you will get two values for the field, possibly > causing multivalued field errors. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Updated] (NUTCH-2368) Variable generate.max.count and fetcher.server.delay
[ https://issues.apache.org/jira/browse/NUTCH-2368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2368: - Attachment: NUTCH-2368.patch Now this is odd, had to make this change but had it running with it: - crawlDelay = it.datum.getMetaData().get("_variableFetchDelay_").get(); + crawlDelay = ((LongWritable)(it.datum.getMetaData().get("_variableFetchDelay_"))).get(); Anyway, updated patch! -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Updated] (NUTCH-2368) Variable generate.max.count and fetcher.server.delay
[ https://issues.apache.org/jira/browse/NUTCH-2368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2368: - Attachment: NUTCH-2368.patch New patch. Removed System.out. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (NUTCH-2368) Variable generate.max.count and fetcher.server.delay
Markus Jelsma created NUTCH-2368: Summary: Variable generate.max.count and fetcher.server.delay Key: NUTCH-2368 URL: https://issues.apache.org/jira/browse/NUTCH-2368 Project: Nutch Issue Type: Improvement Components: generator Affects Versions: 1.12 Reporter: Markus Jelsma Assignee: Markus Jelsma Fix For: 1.13 Attachments: NUTCH-2368.patch In some cases we need to use host-specific characteristics in determining crawl speed and bulk sizes, because with our (Openindex) settings we can just recrawl hosts with up to 800k URLs. This patch solves the problem by introducing the HostDB to the Generator and providing powerful Jexl expressions. Check these two expressions added to the Generator: {code} -Dgenerate.max.count.expr=' if (unfetched + fetched > 80) { return (conf.getInt("fetcher.timelimit.mins", 12) * 60) / ((pct95._rs_ + 500) / 1000) * conf.getInt("fetcher.threads.per.queue", 1) } else { return conf.getDouble("generate.max.count", 300); }' -Dgenerate.fetch.delay.expr=' if (unfetched + fetched > 80) { return (pct95._rs_ + 500) * 1000; } else { return conf.getDouble("fetcher.server.delay", 1000) }' {code} For each large host: select as many records as can be fetched, based on the number of threads, the 95th-percentile response time, and the fetch time limit. Or: queueMaxCount = (timelimit / responsetime) * numThreads. The second expression follows up on that, setting the crawlDelay of the fetch queue. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
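The first expression boils down to queueMaxCount = (timelimit / responsetime) * numThreads with the same constants as above. A small sketch of that arithmetic (the class and method names are illustrative, not from the patch; a divide-by-zero guard is added that the raw Jexl expression does not have):

```java
// queueMaxCount = (timelimit / responsetime) * numThreads, with the
// constants from the Jexl expression: timelimit in minutes, the 95th
// percentile response time in ms padded by 500 ms, integer division.
public class QueueSizeSketch {
    public static long queueMaxCount(long timelimitMins, long pct95ResponseMs, int threadsPerQueue) {
        long timelimitSecs = timelimitMins * 60;
        // Seconds per fetch; floor to at least 1 so sub-second response
        // times do not divide by zero (guard added for this sketch).
        long perFetchSecs = Math.max(1, (pct95ResponseMs + 500) / 1000);
        return timelimitSecs / perFetchSecs * threadsPerQueue;
    }
}
```

With the defaults of a 12-minute time limit and one thread per queue, a host answering at ~1.5 s per request yields 360 selectable records per cycle.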
[jira] [Updated] (NUTCH-2368) Variable generate.max.count and fetcher.server.delay
[ https://issues.apache.org/jira/browse/NUTCH-2368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-2368:
---------------------------------
    Attachment: NUTCH-2368.patch

Patch for trunk!

> Variable generate.max.count and fetcher.server.delay
> ----------------------------------------------------
>
>                 Key: NUTCH-2368
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2368
>             Project: Nutch
>          Issue Type: Improvement
>          Components: generator
>    Affects Versions: 1.12
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>             Fix For: 1.13
>
>         Attachments: NUTCH-2368.patch
>
> In some cases we need to use host-specific characteristics in determining
> crawl speed and bulk sizes, because with our (Openindex) settings we can just
> recrawl hosts with up to 800k URLs.
> This patch solves the problem by introducing the HostDB to the Generator and
> providing powerful Jexl expressions. Check these two expressions added to the
> Generator:
> {code}
> -Dgenerate.max.count.expr='
> if (unfetched + fetched > 80) {
>   return (conf.getInt("fetcher.timelimit.mins", 12) * 60) / ((pct95._rs_ + 500) / 1000) * conf.getInt("fetcher.threads.per.queue", 1)
> } else {
>   return conf.getDouble("generate.max.count", 300);
> }'
> -Dgenerate.fetch.delay.expr='
> if (unfetched + fetched > 80) {
>   return (pct95._rs_ + 500) * 1000;
> } else {
>   return conf.getDouble("fetcher.server.delay", 1000)
> }'
> {code}
> For each large host: select as many records as can actually be fetched, based
> on the number of threads and the 95th-percentile response time within the
> fetch time limit. Or: queueMaxCount = (timelimit / responsetime) * numThreads.
> The second expression follows on from that, setting the crawlDelay of the
> fetch queue.
[jira] [Commented] (NUTCH-2363) Fetcher support for reading and setting cookies
[ https://issues.apache.org/jira/browse/NUTCH-2363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15924523#comment-15924523 ]

Markus Jelsma commented on NUTCH-2363:
--------------------------------------

Thanks Sebastian - I will address your remarks later.

> Fetcher support for reading and setting cookies
> -----------------------------------------------
>
>                 Key: NUTCH-2363
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2363
>             Project: Nutch
>          Issue Type: New Feature
>          Components: fetcher
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>             Fix For: 1.13
>
>         Attachments: NUTCH-2363.patch
>
> Patch adds basic support for cookies in the fetcher, and a scoring plugin
> that passes cookies to its outlinks within the same domain. Sub-domain- or
> path-based scoping is not supported.
> This is useful if you want to maintain sessions or need to get around a
> cookie wall.
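The domain-scoped propagation described in the quoted issue can be sketched roughly as follows. This is not the patch's actual scoring-plugin code; every name here is hypothetical, and the domain extraction is deliberately naive (the real plugin would rely on Nutch's own URL/domain utilities):

```python
from urllib.parse import urlparse

def registered_domain(url):
    # Crude last-two-labels domain extraction, for illustration only;
    # it mishandles multi-part public suffixes like .co.uk.
    host = urlparse(url).hostname or ""
    parts = host.split(".")
    return ".".join(parts[-2:]) if len(parts) >= 2 else host

def propagate_cookie(page_url, outlinks, cookie):
    # Pass the cookie only to outlinks within the same domain; sub-domain-
    # or path-based scoping is not supported, matching the description above.
    domain = registered_domain(page_url)
    return {link: cookie for link in outlinks
            if registered_domain(link) == domain}
```

For example, a cookie seen on `http://www.example.com/a` would follow an outlink to `http://example.com/b` but not one to `http://other.org/c`.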
[jira] [Created] (NUTCH-2367) Get single record from HostDB
Markus Jelsma created NUTCH-2367:
---------------------------------

             Summary: Get single record from HostDB
                 Key: NUTCH-2367
                 URL: https://issues.apache.org/jira/browse/NUTCH-2367
             Project: Nutch
          Issue Type: Improvement
          Components: hostdb
    Affects Versions: 1.12
            Reporter: Markus Jelsma
            Assignee: Markus Jelsma
            Priority: Minor
             Fix For: 1.13
         Attachments: NUTCH-2367.patch

Introduces:

{code}
bin/nutch readhostdb crawl/hostdb/ -get www.apache.org
{code}
[jira] [Updated] (NUTCH-2367) Get single record from HostDB
[ https://issues.apache.org/jira/browse/NUTCH-2367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-2367:
---------------------------------
    Attachment: NUTCH-2367.patch

Patch for trunk!

> Get single record from HostDB
> -----------------------------
>
>                 Key: NUTCH-2367
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2367
>             Project: Nutch
>          Issue Type: Improvement
>          Components: hostdb
>    Affects Versions: 1.12
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>            Priority: Minor
>             Fix For: 1.13
>
>         Attachments: NUTCH-2367.patch
>
> Introduces:
> {code}
> bin/nutch readhostdb crawl/hostdb/ -get www.apache.org
> {code}