[jira] [Created] (NUTCH-2369) Create a new GraphGenerator Tool for writing Nutch Records as Full Web Graphs

2017-03-14 Thread Lewis John McGibbney (JIRA)
Lewis John McGibbney created NUTCH-2369:
---

 Summary: Create a new GraphGenerator Tool for writing Nutch 
Records as Full Web Graphs
 Key: NUTCH-2369
 URL: https://issues.apache.org/jira/browse/NUTCH-2369
 Project: Nutch
  Issue Type: Task
  Components: graphgenerator, crawldb, hostdb, linkdb, segment, 
storage, tool
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
 Fix For: 1.14


I've been thinking for quite some time now that a new Tool which writes Nutch 
data out as full graph data would be an excellent addition to the codebase.

My thoughts involve writing data using TinkerPop's ScriptInputFormat and 
ScriptOutputFormat to create Vertex objects representing Nutch crawl records. 

http://tinkerpop.apache.org/javadocs/current/full/index.html?org/apache/tinkerpop/gremlin/hadoop/structure/io/script/ScriptInputFormat.html
http://tinkerpop.apache.org/javadocs/current/full/index.html?org/apache/tinkerpop/gremlin/hadoop/structure/io/script/ScriptOutputFormat.html

I envisage that each Vertex object would require the CrawlDB, LinkDB, a Segment 
and possibly the HostDB in order to be fully populated. Graph characteristics, 
e.g. Edges, would come from those existing data structures as well.

It is my intention to propose this as a GSoC project for 2017, and I have 
already talked offline with a potential student, [~omkar20895], about 
participating.

Essentially, if we were able to create a Graph enabling true traversal, this 
could be a game changer for how Nutch crawl data is interpreted. It is my 
feeling that this issue most likely also involves an entire upgrade of the 
Hadoop APIs from mapred to mapreduce for the master codebase.
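
As an illustrative sketch only (assuming TinkerPop 3.x and an in-memory 
TinkerGraph; the labels, property names and record mapping below are 
hypothetical, not a settled design), the vertex/edge model might look like this:

{code}
// Hypothetical sketch: one vertex per URL, properties drawn from CrawlDB/HostDB
// fields, edges drawn from LinkDB / segment outlinks. Not part of any patch.
import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.GraphTraversalSource;
import org.apache.tinkerpop.gremlin.structure.Graph;
import org.apache.tinkerpop.gremlin.structure.T;
import org.apache.tinkerpop.gremlin.structure.Vertex;
import org.apache.tinkerpop.gremlin.tinkergraph.structure.TinkerGraph;

public class NutchGraphSketch {
  public static void main(String[] args) {
    Graph graph = TinkerGraph.open();
    Vertex page = graph.addVertex(T.label, "page",
        "url", "http://example.org/", "status", "db_fetched");
    Vertex target = graph.addVertex(T.label, "page",
        "url", "http://example.org/about", "status", "db_unfetched");
    page.addEdge("links_to", target);
    // "True traversal" over crawl data: URLs linked from fetched pages.
    GraphTraversalSource g = graph.traversal();
    g.V().has("status", "db_fetched").out("links_to")
        .values("url").forEachRemaining(System.out::println);
  }
}
{code}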



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (NUTCH-2369) Create a new GraphGenerator Tool for writing Nutch Records as a Full Web Graph

2017-03-14 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-2369:

Summary: Create a new GraphGenerator Tool for writing Nutch Records as a 
Full Web Graph  (was: Create a new GraphGenerator Tool for writing Nutch 
Records as Full Web Graphs)

> Create a new GraphGenerator Tool for writing Nutch Records as a Full Web Graph
> --
>
> Key: NUTCH-2369
> URL: https://issues.apache.org/jira/browse/NUTCH-2369
> Project: Nutch
>  Issue Type: Task
>  Components: crawldb, graphgenerator, hostdb, linkdb, segment, 
> storage, tool
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>  Labels: gsoc2017
> Fix For: 1.14
>



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (NUTCH-2357) Index metadata throw Exception because writable object cannot be cast to Text

2017-03-14 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15925276#comment-15925276
 ] 

Hudson commented on NUTCH-2357:
---

SUCCESS: Integrated in Jenkins build Nutch-trunk #3412 (See 
[https://builds.apache.org/job/Nutch-trunk/3412/])
NUTCH-2357 Index metadata throw Exception because writable object cannot 
(snagel: 
[https://github.com/apache/nutch/commit/439f1153991ec104acdb73420ddc816cd9c665e8])
* (edit) 
src/plugin/index-metadata/src/java/org/apache/nutch/indexer/metadata/MetadataIndexer.java


> Index metadata throw Exception because writable object cannot be cast to Text
> -
>
> Key: NUTCH-2357
> URL: https://issues.apache.org/jira/browse/NUTCH-2357
> Project: Nutch
>  Issue Type: Bug
>  Components: indexer
>Affects Versions: 1.12
> Environment: It was detected using Linux Mint 18.
>Reporter: Eyeris Rodriguez Rueda
>Assignee: Chris A. Mattmann
>Priority: Minor
> Fix For: 1.13
>
>
> The index-metadata plugin uses this property (see below) to take keys from the 
> Datum and index them.
> {code}
> <property>
>   <name>index.db.md</name>
>   <value>
> ...
>   </value>
> </property>
> {code}
> Using any value from this property, an Exception is thrown.
> The problem occurs because the Writable object cannot be cast to Text; see this 
> line:
> https://github.com/apache/nutch/blob/master/src/plugin/index-metadata/src/java/org/apache/nutch/indexer/metadata/MetadataIndexer.java#L58
> A little change will fix it.
> This is the Exception:
> {code}
> 2017-02-06 18:18:29,969 INFO  solr.SolrMappingReader - source: digest dest: 
> digest
> 2017-02-06 18:18:29,969 INFO  solr.SolrMappingReader - source: tstamp dest: 
> tstamp
> 2017-02-06 18:18:29,969 INFO  solr.SolrMappingReader - source: 
> metatag.description dest: description
> 2017-02-06 18:18:29,969 INFO  solr.SolrMappingReader - source: 
> metatag.keywords dest: keywords
> 2017-02-06 18:18:30,134 WARN  mapred.LocalJobRunner - job_local1516_0001
> java.lang.Exception: java.lang.ClassCastException: 
> org.apache.hadoop.io.IntWritable cannot be cast to org.apache.hadoop.io.Text
>   at 
> org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
>   at 
> org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:529)
> Caused by: java.lang.ClassCastException: org.apache.hadoop.io.IntWritable 
> cannot be cast to org.apache.hadoop.io.Text
>   at 
> org.apache.nutch.indexer.metadata.MetadataIndexer.filter(MetadataIndexer.java:58)
>   at 
> org.apache.nutch.indexer.IndexingFilters.filter(IndexingFilters.java:51)
>   at 
> org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:330)
>   at 
> org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:56)
>   at 
> org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:444)
>   at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:392)
>   at 
> org.apache.hadoop.mapred.LocalJobRunner$Job$ReduceTaskRunnable.run(LocalJobRunner.java:319)
>   at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> 2017-02-06 18:18:30,777 ERROR indexer.IndexingJob - Indexer: 
> java.io.IOException: Job failed!
>   at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:836)
>   at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:145)
>   at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:228)
>   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
>   at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:237)
> {code}
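
As a minimal sketch of the fix (simplified; the variable names are illustrative 
and this is not the literal MetadataIndexer code), the point is that every 
Hadoop Writable can render itself as a String via toString(), so no cast to 
Text is needed:

{code}
// Sketch of the failing cast and the fix; the pattern only, not Nutch code.
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Writable;

public class CastSketch {
  public static void main(String[] args) {
    Writable value = new IntWritable(42);  // metadata values need not be Text
    // Before: String s = ((Text) value).toString();  -> ClassCastException
    // After: let the Writable render itself.
    String s = value.toString();
    System.out.println(s);                 // prints "42"
  }
}
{code}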



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (NUTCH-2357) Index metadata throw Exception because writable object cannot be cast to Text

2017-03-14 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15925236#comment-15925236
 ] 

Chris A. Mattmann commented on NUTCH-2357:
--

Thanks [~eyeris] and [~wastl-nagel]!

> Index metadata throw Exception because writable object cannot be cast to Text
> -
>
> Key: NUTCH-2357
> URL: https://issues.apache.org/jira/browse/NUTCH-2357
> Project: Nutch
>  Issue Type: Bug
>  Components: indexer
>Affects Versions: 1.12
> Environment: It was detected using Linux mint 18.
>Reporter: Eyeris Rodriguez Rueda
>Assignee: Chris A. Mattmann
>Priority: Minor
> Fix For: 1.13
>



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Resolved] (NUTCH-2357) Index metadata throw Exception because writable object cannot be cast to Text

2017-03-14 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann resolved NUTCH-2357.
--
Resolution: Fixed

Solved by [~wastl-nagel] in 
https://github.com/apache/nutch/commit/ee559bf204448e9c658da48250e04394adf357e5

> Index metadata throw Exception because writable object cannot be cast to Text
> -
>
> Key: NUTCH-2357
> URL: https://issues.apache.org/jira/browse/NUTCH-2357
> Project: Nutch
>  Issue Type: Bug
>  Components: indexer
>Affects Versions: 1.12
> Environment: It was detected using Linux mint 18.
>Reporter: Eyeris Rodriguez Rueda
>Assignee: Chris A. Mattmann
>Priority: Minor
> Fix For: 1.13
>



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Assigned] (NUTCH-2357) Index metadata throw Exception because writable object cannot be cast to Text

2017-03-14 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann reassigned NUTCH-2357:


Assignee: Chris A. Mattmann

> Index metadata throw Exception because writable object cannot be cast to Text
> -
>
> Key: NUTCH-2357
> URL: https://issues.apache.org/jira/browse/NUTCH-2357
> Project: Nutch
>  Issue Type: Bug
>  Components: indexer
>Affects Versions: 1.12
> Environment: It was detected using Linux mint 18.
>Reporter: Eyeris Rodriguez Rueda
>Assignee: Chris A. Mattmann
>Priority: Minor
> Fix For: 1.13
>



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[GitHub] nutch pull request #177: NUTCH-2357 Index metadata throw Exception because w...

2017-03-14 Thread chrismattmann
Github user chrismattmann closed the pull request at:

https://github.com/apache/nutch/pull/177


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[jira] [Work started] (NUTCH-2357) Index metadata throw Exception because writable object cannot be cast to Text

2017-03-14 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on NUTCH-2357 started by Chris A. Mattmann.

> Index metadata throw Exception because writable object cannot be cast to Text
> -
>
> Key: NUTCH-2357
> URL: https://issues.apache.org/jira/browse/NUTCH-2357
> Project: Nutch
>  Issue Type: Bug
>  Components: indexer
>Affects Versions: 1.12
> Environment: It was detected using Linux mint 18.
>Reporter: Eyeris Rodriguez Rueda
>Assignee: Chris A. Mattmann
>Priority: Minor
> Fix For: 1.13
>



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (NUTCH-2357) Index metadata throw Exception because writable object cannot be cast to Text

2017-03-14 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15925232#comment-15925232
 ] 

ASF GitHub Bot commented on NUTCH-2357:
---

Github user chrismattmann closed the pull request at:

https://github.com/apache/nutch/pull/177


> Index metadata throw Exception because writable object cannot be cast to Text
> -
>
> Key: NUTCH-2357
> URL: https://issues.apache.org/jira/browse/NUTCH-2357
> Project: Nutch
>  Issue Type: Bug
>  Components: indexer
>Affects Versions: 1.12
> Environment: It was detected using Linux mint 18.
>Reporter: Eyeris Rodriguez Rueda
>Priority: Minor
> Fix For: 1.13
>



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Assigned] (NUTCH-2068) Allow subcollection overrides via metadata

2017-03-14 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma reassigned NUTCH-2068:


Assignee: Markus Jelsma

> Allow subcollection overrides via metadata
> --
>
> Key: NUTCH-2068
> URL: https://issues.apache.org/jira/browse/NUTCH-2068
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.10
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Attachments: NUTCH-2068.patch
>
>
> Similar to index-metadata, but overrides subcollection. If both subcollection 
> and index-metadata are active, you will get two values for the field, possibly 
> causing multivalued-field errors.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (NUTCH-2368) Variable generate.max.count and fetcher.server.delay

2017-03-14 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-2368:
-
Attachment: NUTCH-2368.patch

Now this is odd: I had to make the change below, even though it was running 
before without it:

{code}
- crawlDelay = it.datum.getMetaData().get("_variableFetchDelay_").get();
+ crawlDelay = ((LongWritable)(it.datum.getMetaData().get("_variableFetchDelay_"))).get();
{code}

Anyway, updated patch!
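
For context, a small self-contained sketch (assuming the metadata map is a 
Hadoop MapWritable; the key name is taken from the patch) of why the explicit 
cast is needed: MapWritable.get() returns the Writable interface, which has no 
numeric get() accessor.

{code}
// Illustrative only: retrieving a LongWritable from Writable-typed metadata.
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.MapWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;

public class MetaDataCastSketch {
  public static void main(String[] args) {
    MapWritable meta = new MapWritable();
    meta.put(new Text("_variableFetchDelay_"), new LongWritable(3000L));
    Writable raw = meta.get(new Text("_variableFetchDelay_")); // static type: Writable
    long crawlDelay = ((LongWritable) raw).get();              // cast recovers the value
    System.out.println(crawlDelay);                            // 3000
  }
}
{code}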

> Variable generate.max.count and fetcher.server.delay
> 
>
> Key: NUTCH-2368
> URL: https://issues.apache.org/jira/browse/NUTCH-2368
> Project: Nutch
>  Issue Type: Improvement
>  Components: generator
>Affects Versions: 1.12
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Fix For: 1.13
>
> Attachments: NUTCH-2368.patch, NUTCH-2368.patch, NUTCH-2368.patch
>



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (NUTCH-2368) Variable generate.max.count and fetcher.server.delay

2017-03-14 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-2368:
-
Attachment: NUTCH-2368.patch

New patch. Removed System.out.

> Variable generate.max.count and fetcher.server.delay
> 
>
> Key: NUTCH-2368
> URL: https://issues.apache.org/jira/browse/NUTCH-2368
> Project: Nutch
>  Issue Type: Improvement
>  Components: generator
>Affects Versions: 1.12
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Fix For: 1.13
>
> Attachments: NUTCH-2368.patch, NUTCH-2368.patch
>



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (NUTCH-2368) Variable generate.max.count and fetcher.server.delay

2017-03-14 Thread Markus Jelsma (JIRA)
Markus Jelsma created NUTCH-2368:


 Summary: Variable generate.max.count and fetcher.server.delay
 Key: NUTCH-2368
 URL: https://issues.apache.org/jira/browse/NUTCH-2368
 Project: Nutch
  Issue Type: Improvement
  Components: generator
Affects Versions: 1.12
Reporter: Markus Jelsma
Assignee: Markus Jelsma
 Fix For: 1.13
 Attachments: NUTCH-2368.patch

In some cases we need to use host-specific characteristics when determining 
crawl speed and bulk sizes, because with our (Openindex) settings we can just 
recrawl hosts with up to 800k URLs.

This patch solves the problem by introducing the HostDB to the Generator and 
providing powerful Jexl expressions. Check these two expressions added to the 
Generator:

{code}
-Dgenerate.max.count.expr='
if (unfetched + fetched > 80) {
  return (conf.getInt("fetcher.timelimit.mins", 12) * 60) / ((pct95._rs_ + 500) 
/ 1000) * conf.getInt("fetcher.threads.per.queue", 1)
} else {
  return conf.getDouble("generate.max.count", 300);
}'

-Dgenerate.fetch.delay.expr='
if (unfetched + fetched > 80) {
  return (pct95._rs_ + 500) * 1000;
} else {
  return conf.getDouble("fetcher.server.delay", 1000)
}'
{code}

For each large host, select as many records as can actually be fetched, based 
on the number of threads, the 95th percentile response time, and the fetch time 
limit. Or: queueMaxCount = (timelimit / responsetime) * numThreads.

The second expression simply follows up on that, setting the crawlDelay of the 
fetch queue.
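
As a worked example of the queueMaxCount formula (the concrete numbers below 
are assumed: a 12-minute time limit, a 500 ms 95th-percentile response time, 
and one thread per queue):

{code}
// Worked example of queueMaxCount = (timelimit / responsetime) * numThreads.
public class QueueMaxCountExample {
  public static void main(String[] args) {
    int timelimitMins = 12;      // fetcher.timelimit.mins (assumed default)
    long pct95Ms = 500;          // 95th percentile response time in ms (assumed)
    int threadsPerQueue = 1;     // fetcher.threads.per.queue
    // Seconds available, divided by seconds per fetch (response time + 500 ms margin).
    long queueMaxCount = (timelimitMins * 60L) / ((pct95Ms + 500) / 1000) * threadsPerQueue;
    System.out.println(queueMaxCount); // 720 records for this host's queue
  }
}
{code}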



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (NUTCH-2368) Variable generate.max.count and fetcher.server.delay

2017-03-14 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-2368:
-
Attachment: NUTCH-2368.patch

Patch for trunk!

> Variable generate.max.count and fetcher.server.delay
> 
>
> Key: NUTCH-2368
> URL: https://issues.apache.org/jira/browse/NUTCH-2368
> Project: Nutch
>  Issue Type: Improvement
>  Components: generator
>Affects Versions: 1.12
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Fix For: 1.13
>
> Attachments: NUTCH-2368.patch
>



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (NUTCH-2363) Fetcher support for reading and setting cookies

2017-03-14 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15924523#comment-15924523
 ] 

Markus Jelsma commented on NUTCH-2363:
--

Thanks Sebastian - I will address your remarks later.

> Fetcher support for reading and setting cookies
> ---
>
> Key: NUTCH-2363
> URL: https://issues.apache.org/jira/browse/NUTCH-2363
> Project: Nutch
>  Issue Type: New Feature
>  Components: fetcher
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Fix For: 1.13
>
> Attachments: NUTCH-2363.patch
>
>
> The patch adds basic support for cookies in the fetcher, and a scoring plugin 
> that passes cookies on to a page's outlinks within the same domain. Sub-domain 
> or path-based scoping is not supported.
> This is useful if you want to maintain sessions or need to get around a 
> cookie wall.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (NUTCH-2367) Get single record from HostDB

2017-03-14 Thread Markus Jelsma (JIRA)
Markus Jelsma created NUTCH-2367:


 Summary: Get single record from HostDB
 Key: NUTCH-2367
 URL: https://issues.apache.org/jira/browse/NUTCH-2367
 Project: Nutch
  Issue Type: Improvement
  Components: hostdb
Affects Versions: 1.12
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Priority: Minor
 Fix For: 1.13
 Attachments: NUTCH-2367.patch

Introduces:

{code}
bin/nutch readhostdb crawl/hostdb/ -get www.apache.org
{code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (NUTCH-2367) Get single record from HostDB

2017-03-14 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-2367:
-
Attachment: NUTCH-2367.patch

Patch for trunk!

> Get single record from HostDB
> -
>
> Key: NUTCH-2367
> URL: https://issues.apache.org/jira/browse/NUTCH-2367
> Project: Nutch
>  Issue Type: Improvement
>  Components: hostdb
>Affects Versions: 1.12
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Minor
> Fix For: 1.13
>
> Attachments: NUTCH-2367.patch
>



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)