[jira] [Commented] (NUTCH-2357) Index metadata throw Exception because writable object cannot be cast to Text

2017-03-14 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15925236#comment-15925236
 ] 

Chris A. Mattmann commented on NUTCH-2357:
--

Thanks [~eyeris] and [~wastl-nagel]!

> Index metadata throw Exception because writable object cannot be cast to Text
> -
>
> Key: NUTCH-2357
> URL: https://issues.apache.org/jira/browse/NUTCH-2357
> Project: Nutch
>  Issue Type: Bug
>  Components: indexer
>Affects Versions: 1.12
> Environment: It was detected using Linux mint 18.
>Reporter: Eyeris Rodriguez Rueda
>Assignee: Chris A. Mattmann
>Priority: Minor
> Fix For: 1.13
>
>
> The Index Metadata plugin uses this property (see below) to take keys from 
> the Datum and index them.
> <property>
>   <name>index.db.md</name>
>   <value></value>
>   <description>
> ...
>   </description>
> </property>
> Setting any value for this property causes an exception to be thrown.
> The problem occurs because a Writable object cannot be cast to Text; see this 
> line:
> https://github.com/apache/nutch/blob/master/src/plugin/index-metadata/src/java/org/apache/nutch/indexer/metadata/MetadataIndexer.java#L58
> A small change will fix it.
> This is the exception:
> **
> 2017-02-06 18:18:29,969 INFO  solr.SolrMappingReader - source: digest dest: digest
> 2017-02-06 18:18:29,969 INFO  solr.SolrMappingReader - source: tstamp dest: tstamp
> 2017-02-06 18:18:29,969 INFO  solr.SolrMappingReader - source: metatag.description dest: description
> 2017-02-06 18:18:29,969 INFO  solr.SolrMappingReader - source: metatag.keywords dest: keywords
> 2017-02-06 18:18:30,134 WARN  mapred.LocalJobRunner - job_local1516_0001
> java.lang.Exception: java.lang.ClassCastException: org.apache.hadoop.io.IntWritable cannot be cast to org.apache.hadoop.io.Text
>   at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
>   at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:529)
> Caused by: java.lang.ClassCastException: org.apache.hadoop.io.IntWritable cannot be cast to org.apache.hadoop.io.Text
>   at org.apache.nutch.indexer.metadata.MetadataIndexer.filter(MetadataIndexer.java:58)
>   at org.apache.nutch.indexer.IndexingFilters.filter(IndexingFilters.java:51)
>   at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:330)
>   at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:56)
>   at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:444)
>   at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:392)
>   at org.apache.hadoop.mapred.LocalJobRunner$Job$ReduceTaskRunnable.run(LocalJobRunner.java:319)
>   at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> 2017-02-06 18:18:30,777 ERROR indexer.IndexingJob - Indexer: java.io.IOException: Job failed!
>   at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:836)
>   at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:145)
>   at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:228)
>   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
>   at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:237)
> **
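
The failure mode in the trace above can be reproduced in miniature. The sketch below is only an illustration under stated assumptions: `Writable`, `Text` and `IntWritable` are small stand-in classes defined in the example (not the Hadoop classes), and `readWithCast` / `readWithToString` are hypothetical helper names. It shows why a blind cast to Text fails for an IntWritable value and how calling toString() on the value avoids the cast. Whether the committed fix takes exactly this approach is not stated in this thread.

```java
import java.util.HashMap;
import java.util.Map;

public class WritableCastSketch {
    // Stand-ins for the Hadoop types; simplified and hypothetical.
    interface Writable {}

    static final class Text implements Writable {
        private final String value;
        Text(String value) { this.value = value; }
        @Override public String toString() { return value; }
    }

    static final class IntWritable implements Writable {
        private final int value;
        IntWritable(int value) { this.value = value; }
        @Override public String toString() { return Integer.toString(value); }
    }

    // Unsafe: assumes every metadata value is a Text.
    static String readWithCast(Map<String, Writable> metadata, String key) {
        Text text = (Text) metadata.get(key); // ClassCastException for IntWritable
        return text == null ? null : text.toString();
    }

    // Safer: rely on toString(), which every value type here provides.
    static String readWithToString(Map<String, Writable> metadata, String key) {
        Writable value = metadata.get(key);
        return value == null ? null : value.toString();
    }

    public static void main(String[] args) {
        Map<String, Writable> metadata = new HashMap<>();
        metadata.put("retries", new IntWritable(3));
        metadata.put("lang", new Text("en"));

        System.out.println(readWithToString(metadata, "retries"));
        System.out.println(readWithToString(metadata, "lang"));
        try {
            readWithCast(metadata, "retries");
        } catch (ClassCastException e) {
            System.out.println("cast failed: " + e.getClass().getSimpleName());
        }
    }
}
```

The toString() variant trades type strictness for robustness: any value type stored in the metadata map can be indexed as a string.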



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Resolved] (NUTCH-2357) Index metadata throw Exception because writable object cannot be cast to Text

2017-03-14 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann resolved NUTCH-2357.
--
Resolution: Fixed

Solved by [~wastl-nagel] in 
https://github.com/apache/nutch/commit/ee559bf204448e9c658da48250e04394adf357e5



[jira] [Assigned] (NUTCH-2357) Index metadata throw Exception because writable object cannot be cast to Text

2017-03-14 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann reassigned NUTCH-2357:


Assignee: Chris A. Mattmann



[jira] [Work started] (NUTCH-2357) Index metadata throw Exception because writable object cannot be cast to Text

2017-03-14 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on NUTCH-2357 started by Chris A. Mattmann.



[jira] [Commented] (NUTCH-2364) http.agent.rotate: IllegalArgumentException / last element of agent names ignored

2017-03-06 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15898161#comment-15898161
 ] 

Chris A. Mattmann commented on NUTCH-2364:
--

Thanks Seb, appreciate it.

> http.agent.rotate: IllegalArgumentException / last element of agent names 
> ignored
> -
>
> Key: NUTCH-2364
> URL: https://issues.apache.org/jira/browse/NUTCH-2364
> Project: Nutch
>  Issue Type: Bug
>  Components: protocol
>Affects Versions: 1.10, 1.11, 2.3.1, 1.12
>Reporter: Sebastian Nagel
>Priority: Minor
> Fix For: 2.4, 1.13
>
>
> With http.agent.rotate == true and a one-element agent name list, the 
> following exception is thrown:
> {noformat}
> % cat .../conf/agents.txt
> my-test-crawler/Nutch-1.13
> % .../bin/nutch parsechecker -Dhttp.agent.rotate=true http://nutch.apache.org/
> ...
> Fetch failed with protocol status: exception(16), lastModified=0: 
> java.lang.IllegalArgumentException: bound must be positive
> % cat .../logs/hadoop.log
> ...
> 2017-03-03 11:17:19,750 ERROR http.Http - Failed to get protocol output
> java.lang.IllegalArgumentException: bound must be positive
> at java.util.concurrent.ThreadLocalRandom.nextInt(ThreadLocalRandom.java:352)
> at org.apache.nutch.protocol.http.api.HttpBase.getUserAgent(HttpBase.java:379)
> at org.apache.nutch.protocol.http.HttpResponse.<init>(HttpResponse.java:180)
> ...
> {noformat}
> Caused by
> {code}
> userAgentNames.get(ThreadLocalRandom.current().nextInt(userAgentNames.size()-1));
> {code}
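> but nextInt(...) is defined as: "Returns a pseudorandom int value between 
> zero (inclusive) and the specified bound (exclusive)." With a one-element 
> list the bound is therefore 0, which is rejected, and with longer lists the 
> last agent name is never selected.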
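
A minimal sketch of the off-by-one, assuming a plain `List<String>` of agent names. The method names `pickBuggy` and `pickFixed` are hypothetical and only mirror the expression quoted in the report; this is not the Nutch code itself. Because the bound passed to nextInt is exclusive, passing `size()` already covers every valid index.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

public class AgentRotationSketch {
    // Selection as quoted in the report: the bound is size() - 1.
    static String pickBuggy(List<String> names) {
        return names.get(ThreadLocalRandom.current().nextInt(names.size() - 1));
    }

    // The bound is exclusive, so size() covers indices 0 .. size() - 1.
    static String pickFixed(List<String> names) {
        return names.get(ThreadLocalRandom.current().nextInt(names.size()));
    }

    public static void main(String[] args) {
        List<String> one = Arrays.asList("my-test-crawler/Nutch-1.13");
        System.out.println(pickFixed(one));
        try {
            pickBuggy(one); // bound is 0 here
        } catch (IllegalArgumentException e) {
            System.out.println("buggy variant failed: " + e.getMessage());
        }
    }
}
```

With a two-element list the buggy variant does not throw, but it always returns the first element, which matches the "last element of agent names ignored" part of the issue title.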





[jira] [Resolved] (NUTCH-2171) Upgrade Nutch Trunk to Java 1.8

2017-02-22 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann resolved NUTCH-2171.
--
   Resolution: Fixed
Fix Version/s: 1.13

Thanks @kamaci, merged into master in 3e2d3d4.

> Upgrade Nutch Trunk to Java 1.8
> ---
>
> Key: NUTCH-2171
> URL: https://issues.apache.org/jira/browse/NUTCH-2171
> Project: Nutch
>  Issue Type: Task
>Reporter: Lewis John McGibbney
>Assignee: Chris A. Mattmann
> Fix For: 1.13
>
>
> Lambda expressions are fantastic. I tried a small exercise to estimate how 
> many we could introduce, but this was a fruitless effort. A patch is going 
> to be a better approach. This task involves upgrading various properties in 
> default.properties as well as a systematic source code analysis with the 
> aim of implementing Java 8 goodies throughout.





[jira] [Assigned] (NUTCH-2171) Upgrade Nutch Trunk to Java 1.8

2017-02-22 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann reassigned NUTCH-2171:


Assignee: Chris A. Mattmann

> Upgrade Nutch Trunk to Java 1.8
> ---
>
> Key: NUTCH-2171
> URL: https://issues.apache.org/jira/browse/NUTCH-2171
> Project: Nutch
>  Issue Type: Task
>Reporter: Lewis John McGibbney
>Assignee: Chris A. Mattmann
>
> Lambda expressions are fantastic. I tried a small exercise to estimate how 
> many we could introduce, but this was a fruitless effort. A patch is going 
> to be a better approach. This task involves upgrading various properties in 
> default.properties as well as a systematic source code analysis with the 
> aim of implementing Java 8 goodies throughout.





[jira] [Commented] (NUTCH-2333) Indexer for RabbitMQ

2016-11-03 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15635112#comment-15635112
 ] 

Chris A. Mattmann commented on NUTCH-2333:
--

Even more so, I would recommend that [~roannel]'s proposed work be 
integrated directly as a plugin to the existing, already committed NUTCH-2132. 
cc [~sujenshah]

> Indexer for RabbitMQ
> 
>
> Key: NUTCH-2333
> URL: https://issues.apache.org/jira/browse/NUTCH-2333
> Project: Nutch
>  Issue Type: New Feature
>  Components: indexer
>Affects Versions: 1.12
>Reporter: Roannel Fernández Hernández
>Priority: Minor
> Fix For: 1.13
>
>
> A plugin to send the documents to a RabbitMQ server.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (NUTCH-2132) Publisher/Subscriber model for Nutch to emit events

2016-08-02 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15404363#comment-15404363
 ] 

Chris A. Mattmann commented on NUTCH-2132:
--

Sujen, what comes back: is success printed, or an exception?

> Publisher/Subscriber model for Nutch to emit events 
> 
>
> Key: NUTCH-2132
> URL: https://issues.apache.org/jira/browse/NUTCH-2132
> Project: Nutch
>  Issue Type: New Feature
>  Components: fetcher, REST_api
>Reporter: Sujen Shah
>Assignee: Chris A. Mattmann
>  Labels: memex
> Fix For: 1.13
>
> Attachments: NUTCH-2132.patch, NUTCH-2132.v2.patch, 
> PubSub_routingkey.patch
>
>
> It would be nice to have a Pub/Sub model in Nutch to emit certain events (e.g. 
> fetcher events like fetch-start, fetch-end, and a fetch report which may contain 
> data like the outlinks of the currently fetched URL, score, etc.).
> A consumer of this functionality could use this data to generate real-time 
> visualizations and statistics of the crawl without having to wait for 
> the fetch round to finish.
> The REST API could contain an endpoint which would respond with a URL to 
> which a client could subscribe to get the fetcher events.
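
The publisher/subscriber idea described above can be sketched in a few lines. This is an in-memory illustration only, not what the attached patches implement: the class `FetcherEventBusSketch`, the `FetcherEvent` type and its fields are all hypothetical, and a real deployment would presumably sit behind a message broker or the REST endpoint mentioned in the description.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

public class FetcherEventBusSketch {
    // Hypothetical event type; field names are illustrative only.
    static final class FetcherEvent {
        final String type; // e.g. "fetch-start", "fetch-end"
        final String url;
        FetcherEvent(String type, String url) {
            this.type = type;
            this.url = url;
        }
    }

    // Handlers registered per event type.
    private final Map<String, List<Consumer<FetcherEvent>>> subscribers = new HashMap<>();

    void subscribe(String type, Consumer<FetcherEvent> handler) {
        subscribers.computeIfAbsent(type, k -> new ArrayList<>()).add(handler);
    }

    // Deliver the event to every handler registered for its type.
    void publish(FetcherEvent event) {
        List<Consumer<FetcherEvent>> handlers =
            subscribers.getOrDefault(event.type, Collections.emptyList());
        for (Consumer<FetcherEvent> handler : handlers) {
            handler.accept(event);
        }
    }

    public static void main(String[] args) {
        FetcherEventBusSketch bus = new FetcherEventBusSketch();
        bus.subscribe("fetch-end", e -> System.out.println("fetched " + e.url));
        bus.publish(new FetcherEvent("fetch-start", "http://nutch.apache.org/"));
        bus.publish(new FetcherEvent("fetch-end", "http://nutch.apache.org/"));
    }
}
```

Subscribers see events as they are published, which is the property the description asks for: consumers do not have to wait for the fetch round to finish.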





[jira] [Commented] (NUTCH-1371) Replace Ivy with Maven Ant tasks

2016-07-01 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15359700#comment-15359700
 ] 

Chris A. Mattmann commented on NUTCH-1371:
--

Sounds fantastic! CC [~ndouba]

> Replace Ivy with Maven Ant tasks
> 
>
> Key: NUTCH-1371
> URL: https://issues.apache.org/jira/browse/NUTCH-1371
> Project: Nutch
>  Issue Type: Improvement
>  Components: build
>Affects Versions: 1.7, 2.2.1
>Reporter: Julien Nioche
>Assignee: Lewis John McGibbney
> Attachments: NUTCH-1371-2x.patch, NUTCH-1371-plugins.trunk.patch, 
> NUTCH-1371-pom.patch, NUTCH-1371-r1461140.patch, NUTCH-1371.patch
>
>
> We might move to Maven altogether but a good intermediate step could be to 
> rely on the maven ant tasks for managing the dependencies. Ivy does a good 
> job but we need to have a pom file anyway for publishing the artefacts which 
> means keeping the pom.xml and ivy.xml contents in sync. Most devs are also 
> more familiar with Maven, and it is well integrated in IDEs. Going the 
> ANT+MVN way also means that we don't have to rewrite the whole build 
> process and can rely on our existing script.





[jira] [Resolved] (NUTCH-2248) CSS parser plugin

2016-05-16 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann resolved NUTCH-2248.
--
Resolution: Fixed

Thanks [~naegelejd] and [~lewismc] for the work!

{noformat}
LMC-053601:nutch1.12 mattmann$ git push -u origin master
Counting objects: 35, done.
Delta compression using up to 8 threads.
Compressing objects: 100% (23/23), done.
Writing objects: 100% (35/35), 7.69 KiB | 0 bytes/s, done.
Total 35 (delta 15), reused 0 (delta 0)
remote: nutch git commit: Update CHANGES.txt for NUTCH-2248 CSS Parser plugin 
contributed by Joseph Naegele.
remote: nutch git commit: Merge branch 'NUTCH-2248' of 
https://github.com/naegelejd/nutch
remote: nutch git commit: NUTCH-2248 CSS Parser plugin parse-css
To https://git-wip-us.apache.org/repos/asf/nutch.git
   dce7a28..6b8586a  master -> master
Branch master set up to track remote branch master from origin.
LMC-053601:nutch1.12 mattmann$ 
{noformat}


> CSS parser plugin
> -
>
> Key: NUTCH-2248
> URL: https://issues.apache.org/jira/browse/NUTCH-2248
> Project: Nutch
>  Issue Type: New Feature
>  Components: parser, plugin
>Reporter: Joseph Naegele
>Assignee: Chris A. Mattmann
> Fix For: 1.12
>
>
> This plugin allows for collecting {{uri}} links from CSS (stylesheets). This 
> is useful for collecting parent stylesheets, fonts, and images needed to 
> display web pages as intended.
> Parsed Outlinks do not have associated anchors, and no additional 
> text/content is parsed from the stylesheet.
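
As a rough illustration of collecting links from a stylesheet, the sketch below extracts the targets of CSS `url(...)` references with a regular expression. It assumes that the links meant in the description are `url(...)` references; the class `CssUrlSketch` and method `extractUrls` are hypothetical names, and this is not the parse-css plugin's implementation. It also does not handle `@import "file.css"` written without `url(...)`, comments, or escaped characters.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CssUrlSketch {
    // url( ... ) with optional single or double quotes around the target.
    private static final Pattern CSS_URL = Pattern.compile(
        "url\\(\\s*(['\"]?)(.*?)\\1\\s*\\)", Pattern.CASE_INSENSITIVE);

    // Return the link targets in the order they appear in the stylesheet.
    static List<String> extractUrls(String css) {
        List<String> links = new ArrayList<>();
        Matcher m = CSS_URL.matcher(css);
        while (m.find()) {
            links.add(m.group(2));
        }
        return links;
    }

    public static void main(String[] args) {
        String css = "@import url(\"base.css\");\n"
            + "body { background: url(img/bg.png); }\n"
            + "@font-face { src: url('fonts/a.woff'); }";
        for (String link : extractUrls(css)) {
            System.out.println(link);
        }
    }
}
```

The extracted strings are relative references; turning them into outlinks would still require resolving them against the stylesheet's own URL.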





[jira] [Updated] (NUTCH-2248) CSS parser plugin

2016-05-16 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-2248:
-
Fix Version/s: 1.12



[jira] [Updated] (NUTCH-2248) CSS parser plugin

2016-05-16 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-2248:
-
Affects Version/s: (was: 1.12)

> CSS parser plugin
> -
>
> Key: NUTCH-2248
> URL: https://issues.apache.org/jira/browse/NUTCH-2248
> Project: Nutch
>  Issue Type: New Feature
>  Components: parser, plugin
>Reporter: Joseph Naegele
>Assignee: Chris A. Mattmann
>
> This plugin allows for collecting {{uri}} links from CSS (stylesheets). This 
> is useful for collecting parent stylesheets, fonts, and images needed to 
> display web pages as intended.
> Parsed Outlinks do not have associated anchors, and no additional 
> text/content is parsed from the stylesheet.





[jira] [Work started] (NUTCH-2248) CSS parser plugin

2016-05-16 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on NUTCH-2248 started by Chris A. Mattmann.



[jira] [Assigned] (NUTCH-2248) CSS parser plugin

2016-05-16 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann reassigned NUTCH-2248:


Assignee: Chris A. Mattmann



[jira] [Work started] (NUTCH-2252) Allow phantomjs as a browser for selenium options

2016-05-07 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on NUTCH-2252 started by Chris A. Mattmann.

> Allow phantomjs as a browser for selenium options
> -
>
> Key: NUTCH-2252
> URL: https://issues.apache.org/jira/browse/NUTCH-2252
> Project: Nutch
>  Issue Type: Improvement
>  Components: protocol
>Affects Versions: 1.12
>Reporter: Kim Whitehall
>Assignee: Chris A. Mattmann
>Priority: Trivial
> Fix For: 1.12
>
>
> Add the phantomjs libraries to lib-selenium so that phantomjs can be chosen 
> as a browser with the selenium option.





[jira] [Assigned] (NUTCH-2252) Allow phantomjs as a browser for selenium options

2016-05-07 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann reassigned NUTCH-2252:


Assignee: Chris A. Mattmann  (was: Lewis John McGibbney)



[jira] [Updated] (NUTCH-2250) CommonCrawlDumper : Invalid format + skipped parts

2016-04-17 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-2250:
-
Affects Version/s: (was: 1.12)
   1.10

>  CommonCrawlDumper : Invalid format + skipped parts
> ---
>
> Key: NUTCH-2250
> URL: https://issues.apache.org/jira/browse/NUTCH-2250
> Project: Nutch
>  Issue Type: Sub-task
>  Components: commoncrawl
>Affects Versions: 1.10
> Environment: Linux x64
> Java 7
> Nutch 1.12
>Reporter: Thamme Gowda N
>Assignee: Chris A. Mattmann
> Fix For: 1.12
>
>
> The following issues were found with CommonCrawlDumper:
> 1. Documents get duplicated in dump files.
> How to reproduce:
> {code}
> bin/nutch commoncrawldump  -segment .../segments -outputDir testdump 
> -SimpleDateFormat -epochFilename -jsonArray -reverseKey
> {code}
> The first file written contains one document, 
> the second file contains two documents, and 
> the third file contains the first three documents; this grows linearly.
> 2. If a segment has many parts (part-0, part-1, ...), only the first 
> part (part-0) is dumped.
> How to reproduce:
> Create a segment with two parts (part-0 and part-1).





[jira] [Resolved] (NUTCH-2250) CommonCrawlDumper : Invalid format + skipped parts

2016-04-17 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann resolved NUTCH-2250.
--
   Resolution: Fixed
Fix Version/s: (was: 1.10)
   1.12

- Merged this into master, thanks [~thammegowda] and [~lewismc]!
{noformat}
LMC-053601:nutch1.12 mattmann$ git push -u origin master
Counting objects: 13, done.
Delta compression using up to 8 threads.
Compressing objects: 100% (11/11), done.
Writing objects: 100% (13/13), 1.84 KiB | 0 bytes/s, done.
Total 13 (delta 8), reused 0 (delta 0)
remote: nutch git commit: Record changes for NUTCH-2250.
remote: nutch git commit: NUTCH-2250 : CommonCrawlDumper : Invalid format and 
skipped parts
To https://git-wip-us.apache.org/repos/asf/nutch.git
   b62f43f..d6bcefd  master -> master
Branch master set up to track remote branch master from origin.
LMC-053601:nutch1.12 mattmann$ 
{noformat}


>  CommonCrawlDumper : Invalid format + skipped parts
> ---
>
> Key: NUTCH-2250
> URL: https://issues.apache.org/jira/browse/NUTCH-2250
> Project: Nutch
>  Issue Type: Sub-task
>  Components: commoncrawl
>Affects Versions: 1.10
> Environment: Linux x64
> Java 7
> Nutch 1.12
>Reporter: Thamme Gowda N
>Assignee: Chris A. Mattmann
> Fix For: 1.12
>
>
> The following issues are found with CommonCrawlDumper:
> 1. Documents get duplicated in dump files
> How to reproduce:
> {code}
> bin/nutch commoncrawldump -segment .../segments -outputDir testdump -SimpleDateFormat -epochFilename -jsonArray -reverseKey
> {code}
> The first file written contains one document, the second file contains
> the first two documents, the third file contains the first three
> documents, and so on, growing linearly.
> 2. If a segment has many parts (part-0, part-1, ...), only the first
> part (part-0) is dumped.
> How to reproduce:
> Create segment with two parts (part-0 and part-1)
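
The two symptoms reported above can be illustrated with a small sketch. This is not the CommonCrawlDataDumper code: the function names and the in-memory "segment" layout are hypothetical. It assumes one plausible cause for the growing files (a buffer that is never cleared between writes), which the report itself does not confirm, and it shows reading every part of a segment rather than only part-0.

```python
def dump_buggy(docs):
    """Reproduce the symptom: file N contains the first N documents."""
    buffer, files = [], []
    for doc in docs:
        buffer.append(doc)          # buffer is never cleared (assumed cause)
        files.append(list(buffer))  # each "file" is a snapshot of the buffer
    return files


def dump_fixed(docs):
    """Write exactly one document per file."""
    return [[doc] for doc in docs]


def read_all_parts(segment):
    """Read every part of a segment in name order, not just the first."""
    docs = []
    for part_name in sorted(segment):
        docs.extend(segment[part_name])
    return docs


# Hypothetical segment with two parts, as in the reproduction step.
segment = {"part-0": ["a", "b"], "part-1": ["c"]}
docs = read_all_parts(segment)
sizes_buggy = [len(f) for f in dump_buggy(docs)]
sizes_fixed = [len(f) for f in dump_fixed(docs)]
print(sizes_buggy, sizes_fixed)
```

With the buggy variant the file sizes grow as 1, 2, 3, matching the report; with a fresh buffer per write every file holds one document.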





[jira] [Work started] (NUTCH-2250) CommonCrawlDumper : Invalid format + skipped parts

2016-04-14 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on NUTCH-2250 started by Chris A. Mattmann.

>  CommonCrawlDumper : Invalid format + skipped parts
> ---
>
> Key: NUTCH-2250
> URL: https://issues.apache.org/jira/browse/NUTCH-2250
> Project: Nutch
>  Issue Type: Sub-task
>  Components: commoncrawl
>Affects Versions: 1.12
> Environment: Linux x64
> Java 7
> Nutch 1.12
>Reporter: Thamme Gowda N
>Assignee: Chris A. Mattmann
> Fix For: 1.10
>
>
> The following issues are found with CommonCrawlDumper:
> 1. Documents get duplicated in dump files
> How to reproduce:
> {code}
> bin/nutch commoncrawldump -segment .../segments -outputDir testdump -SimpleDateFormat -epochFilename -jsonArray -reverseKey
> {code}
> The first file written contains one document, the second file contains
> the first two documents, the third file contains the first three
> documents, and so on, growing linearly.
> 2. If a segment has many parts (part-0, part-1, ...), only the first
> part (part-0) is dumped.
> How to reproduce:
> Create segment with two parts (part-0 and part-1)





[jira] [Updated] (NUTCH-2191) Add protocol-htmlunit

2016-04-08 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-2191:
-
Labels: memex  (was: )

> Add protocol-htmlunit
> -
>
> Key: NUTCH-2191
> URL: https://issues.apache.org/jira/browse/NUTCH-2191
> Project: Nutch
>  Issue Type: New Feature
>  Components: protocol
>Affects Versions: 1.11
>Reporter: Markus Jelsma
>Assignee: Chris A. Mattmann
>  Labels: memex
> Fix For: 1.12
>
> Attachments: NUTCH-2191.patch, NUTCH-2191.patch, NUTCH-2191.patch, 
> NUTCH-2191.patch
>
>
> HtmlUnit is, as opposed to other JavaScript-enabled headless browsers, a 
> portable library and should therefore be better suited for very large scale 
> crawls. This issue is an attempt to implement protocol-htmlunit.





[jira] [Commented] (NUTCH-2191) Add protocol-htmlunit

2016-03-25 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15212425#comment-15212425
 ] 

Chris A. Mattmann commented on NUTCH-2191:
--

approved, please update the PR [~karanjeets] and I will work to commit. thanks 
for addressing [~markus17]'s comments.

> Add protocol-htmlunit
> -
>
> Key: NUTCH-2191
> URL: https://issues.apache.org/jira/browse/NUTCH-2191
> Project: Nutch
>  Issue Type: New Feature
>  Components: protocol
>Affects Versions: 1.11
>Reporter: Markus Jelsma
>Assignee: Chris A. Mattmann
> Fix For: 1.12
>
> Attachments: NUTCH-2191.patch, NUTCH-2191.patch, NUTCH-2191.patch
>
>
> HtmlUnit is, as opposed to other JavaScript-enabled headless browsers, a 
> portable library and should therefore be better suited for very large scale 
> crawls. This issue is an attempt to implement protocol-htmlunit.





[jira] [Resolved] (NUTCH-2241) Unstable Selenium plugin in Nutch. Fixed bugs and enhanced configuration

2016-03-19 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann resolved NUTCH-2241.
--
Resolution: Fixed

Merged, thanks [~karanjeets]!

{noformat}
[chipotle:~/tmp/nutch1.12] mattmann% git pull https://github.com/karanjeets/nutch/ NUTCH-2241
remote: Counting objects: 18, done.
remote: Compressing objects: 100% (11/11), done.
remote: Total 18 (delta 1), reused 0 (delta 0), pack-reused 0
Unpacking objects: 100% (18/18), done.
From https://github.com/karanjeets/nutch
 * branch            NUTCH-2241 -> FETCH_HEAD
Updating a3e7420..a9b2491
Fast-forward
 CHANGES.txt | 2 ++
 conf/nutch-default.xml | 50 ++
 src/plugin/lib-selenium/src/java/org/apache/nutch/protocol/selenium/HttpWebClient.java | 52
 3 files changed, 88 insertions(+), 16 deletions(-)
[chipotle:~/tmp/nutch1.12] mattmann% git branch
  2.x
  NUTCH-2213
* master
  merge-branch
[chipotle:~/tmp/nutch1.12] mattmann% git push -u origin master
Counting objects: 96, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (11/11), done.
Writing objects: 100% (18/18), 2.53 KiB | 0 bytes/s, done.
Total 18 (delta 9), reused 0 (delta 0)
remote: nutch git commit: fix for NUTCH-2241 contributed by karanjeets
remote: nutch git commit: fix for NUTCH-2241 contributed by karanjeets
To https://git-wip-us.apache.org/repos/asf/nutch.git
   a3e7420..a9b2491  master -> master
Branch master set up to track remote branch master from origin.
[chipotle:~/tmp/nutch1.12] mattmann% 
{noformat}


> Unstable Selenium plugin in Nutch. Fixed bugs and enhanced configuration
> 
>
> Key: NUTCH-2241
> URL: https://issues.apache.org/jira/browse/NUTCH-2241
> Project: Nutch
>  Issue Type: Bug
>  Components: plugin, protocol
>Affects Versions: 1.12
> Environment: Fixed for Firefox browser with version 25 and above.
>Reporter: Karanjeet Singh
>Assignee: Chris A. Mattmann
>  Labels: firefox, interactiveselenium, lib-selenium, memex, 
> nutch, nutch-default.xml, plugin, protocol, selenium
> Fix For: 1.12
>
>
> Issues:
> (a) Firefox browser doesn't close gracefully.
> (b) The property libselenium.page.load.delay is not working. No matter how 
> much delay you give, the driver is not waiting for the page to load.
> (c) There is no timeout configured for the firefox binary.
> (d) A lot of selenium configuration is hard-coded which can be exposed 
> through nutch-default.xml or nutch-site.xml
> All these issues are part of "lib-selenium" plugin which is being used by two 
> other protocols "protocol-selenium" and "protocol-interactiveselenium".
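
Items (b) through (d) are about configuration that should be read from nutch-default.xml or nutch-site.xml instead of being hard-coded. The sketch below shows that lookup pattern under stated assumptions: only the property name libselenium.page.load.delay comes from the report; the XML snippets, the values 3 and 10, and the helper names are illustrative, and this is not the lib-selenium implementation.

```python
import xml.etree.ElementTree as ET

# Illustrative stand-ins for nutch-default.xml and nutch-site.xml.
DEFAULT_XML = """<configuration>
  <property><name>libselenium.page.load.delay</name><value>3</value></property>
</configuration>"""

SITE_XML = """<configuration>
  <property><name>libselenium.page.load.delay</name><value>10</value></property>
</configuration>"""


def load_properties(xml_text):
    """Parse a Hadoop-style configuration document into a name -> value dict."""
    root = ET.fromstring(xml_text)
    return {p.findtext("name"): p.findtext("value") for p in root.iter("property")}


def get_int(props, name, fallback):
    """Return the property as an int, or the fallback if it is not set."""
    value = props.get(name)
    return int(value) if value is not None else fallback


props = load_properties(DEFAULT_XML)
props.update(load_properties(SITE_XML))  # site values override defaults
delay = get_int(props, "libselenium.page.load.delay", 3)
print(delay)
```

Keeping the fallback in one helper means a missing property degrades to a known value rather than to whatever constant happens to be compiled into the plugin.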





[jira] [Work started] (NUTCH-2241) Unstable Selenium plugin in Nutch. Fixed bugs and enhanced configuration

2016-03-19 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on NUTCH-2241 started by Chris A. Mattmann.

> Unstable Selenium plugin in Nutch. Fixed bugs and enhanced configuration
> 
>
> Key: NUTCH-2241
> URL: https://issues.apache.org/jira/browse/NUTCH-2241
> Project: Nutch
>  Issue Type: Bug
>  Components: plugin, protocol
>Affects Versions: 1.12
> Environment: Fixed for Firefox browser with version 25 and above.
>Reporter: Karanjeet Singh
>Assignee: Chris A. Mattmann
>  Labels: firefox, interactiveselenium, lib-selenium, memex, 
> nutch, nutch-default.xml, plugin, protocol, selenium
> Fix For: 1.12
>
>
> Issues:
> (a) Firefox browser doesn't close gracefully.
> (b) The property libselenium.page.load.delay is not working. No matter how 
> much delay you give, the driver is not waiting for the page to load.
> (c) There is no timeout configured for the firefox binary.
> (d) A lot of selenium configuration is hard-coded which can be exposed 
> through nutch-default.xml or nutch-site.xml
> All these issues are part of "lib-selenium" plugin which is being used by two 
> other protocols "protocol-selenium" and "protocol-interactiveselenium".





[jira] [Updated] (NUTCH-2241) Unstable Selenium plugin in Nutch. Fixed bugs and enhanced configuration

2016-03-19 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-2241:
-
Labels: firefox interactiveselenium lib-selenium memex nutch 
nutch-default.xml plugin protocol selenium  (was: firefox interactiveselenium 
lib-selenium nutch nutch-default.xml plugin protocol selenium)

> Unstable Selenium plugin in Nutch. Fixed bugs and enhanced configuration
> 
>
> Key: NUTCH-2241
> URL: https://issues.apache.org/jira/browse/NUTCH-2241
> Project: Nutch
>  Issue Type: Bug
>  Components: plugin, protocol
>Affects Versions: 1.12
> Environment: Fixed for Firefox browser with version 25 and above.
>Reporter: Karanjeet Singh
>  Labels: firefox, interactiveselenium, lib-selenium, memex, 
> nutch, nutch-default.xml, plugin, protocol, selenium
> Fix For: 1.12
>
>
> Issues:
> (a) Firefox browser doesn't close gracefully.
> (b) The property libselenium.page.load.delay is not working. No matter how 
> much delay you give, the driver is not waiting for the page to load.
> (c) There is no timeout configured for the firefox binary.
> (d) A lot of selenium configuration is hard-coded which can be exposed 
> through nutch-default.xml or nutch-site.xml
> All these issues are part of "lib-selenium" plugin which is being used by two 
> other protocols "protocol-selenium" and "protocol-interactiveselenium".





[jira] [Assigned] (NUTCH-2241) Unstable Selenium plugin in Nutch. Fixed bugs and enhanced configuration

2016-03-19 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann reassigned NUTCH-2241:


Assignee: Chris A. Mattmann

> Unstable Selenium plugin in Nutch. Fixed bugs and enhanced configuration
> 
>
> Key: NUTCH-2241
> URL: https://issues.apache.org/jira/browse/NUTCH-2241
> Project: Nutch
>  Issue Type: Bug
>  Components: plugin, protocol
>Affects Versions: 1.12
> Environment: Fixed for Firefox browser with version 25 and above.
>Reporter: Karanjeet Singh
>Assignee: Chris A. Mattmann
>  Labels: firefox, interactiveselenium, lib-selenium, memex, 
> nutch, nutch-default.xml, plugin, protocol, selenium
> Fix For: 1.12
>
>
> Issues:
> (a) Firefox browser doesn't close gracefully.
> (b) The property libselenium.page.load.delay is not working. No matter how 
> much delay you give, the driver is not waiting for the page to load.
> (c) There is no timeout configured for the firefox binary.
> (d) A lot of selenium configuration is hard-coded which can be exposed 
> through nutch-default.xml or nutch-site.xml
> All these issues are part of "lib-selenium" plugin which is being used by two 
> other protocols "protocol-selenium" and "protocol-interactiveselenium".





[jira] [Commented] (NUTCH-2191) Add protocol-htmlunit

2016-03-15 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15196088#comment-15196088
 ] 

Chris A. Mattmann commented on NUTCH-2191:
--

thanks [~karanjeets] and [~markus17]

> Add protocol-htmlunit
> -
>
> Key: NUTCH-2191
> URL: https://issues.apache.org/jira/browse/NUTCH-2191
> Project: Nutch
>  Issue Type: New Feature
>  Components: protocol
>Affects Versions: 1.11
>Reporter: Markus Jelsma
>Assignee: Chris A. Mattmann
> Fix For: 1.12
>
> Attachments: NUTCH-2191.patch, NUTCH-2191.patch, NUTCH-2191.patch
>
>
> HtmlUnit is, as opposed to other JavaScript-enabled headless browsers, a 
> portable library and should therefore be better suited for very large scale 
> crawls. This issue is an attempt to implement protocol-htmlunit.





[jira] [Work started] (NUTCH-2191) Add protocol-htmlunit

2016-03-14 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on NUTCH-2191 started by Chris A. Mattmann.

> Add protocol-htmlunit
> -
>
> Key: NUTCH-2191
> URL: https://issues.apache.org/jira/browse/NUTCH-2191
> Project: Nutch
>  Issue Type: New Feature
>  Components: protocol
>Affects Versions: 1.11
>Reporter: Markus Jelsma
>Assignee: Chris A. Mattmann
> Fix For: 1.12
>
> Attachments: NUTCH-2191.patch, NUTCH-2191.patch, NUTCH-2191.patch
>
>
> HtmlUnit is, as opposed to other JavaScript-enabled headless browsers, a 
> portable library and should therefore be better suited for very large scale 
> crawls. This issue is an attempt to implement protocol-htmlunit.





[jira] [Commented] (NUTCH-2191) Add protocol-htmlunit

2016-03-14 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15192834#comment-15192834
 ] 

Chris A. Mattmann commented on NUTCH-2191:
--

thanks [~karanjeets] I'll take a look tomorrow. I think we're close and we can 
push this in.

> Add protocol-htmlunit
> -
>
> Key: NUTCH-2191
> URL: https://issues.apache.org/jira/browse/NUTCH-2191
> Project: Nutch
>  Issue Type: New Feature
>  Components: protocol
>Affects Versions: 1.11
>Reporter: Markus Jelsma
>Assignee: Chris A. Mattmann
> Fix For: 1.12
>
> Attachments: NUTCH-2191.patch, NUTCH-2191.patch, NUTCH-2191.patch
>
>
> HtmlUnit is, as opposed to other JavaScript-enabled headless browsers, a 
> portable library and should therefore be better suited for very large scale 
> crawls. This issue is an attempt to implement protocol-htmlunit.





[jira] [Assigned] (NUTCH-2191) Add protocol-htmlunit

2016-03-14 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann reassigned NUTCH-2191:


Assignee: Chris A. Mattmann  (was: Markus Jelsma)

> Add protocol-htmlunit
> -
>
> Key: NUTCH-2191
> URL: https://issues.apache.org/jira/browse/NUTCH-2191
> Project: Nutch
>  Issue Type: New Feature
>  Components: protocol
>Affects Versions: 1.11
>Reporter: Markus Jelsma
>Assignee: Chris A. Mattmann
> Fix For: 1.12
>
> Attachments: NUTCH-2191.patch, NUTCH-2191.patch, NUTCH-2191.patch
>
>
> HtmlUnit is, as opposed to other JavaScript-enabled headless browsers, a 
> portable library and should therefore be better suited for very large scale 
> crawls. This issue is an attempt to implement protocol-htmlunit.





[jira] [Work started] (NUTCH-2239) Selenium Handlers for Ajax Patterns from Student submissions

2016-03-14 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on NUTCH-2239 started by Chris A. Mattmann.

> Selenium Handlers for Ajax Patterns from Student submissions
> 
>
> Key: NUTCH-2239
> URL: https://issues.apache.org/jira/browse/NUTCH-2239
> Project: Nutch
>  Issue Type: Improvement
>  Components: fetcher, protocol
>Reporter: Raghav Bharadwaj Jayasimha Rao
>Assignee: Chris A. Mattmann
>  Labels: memex
> Fix For: 1.12
>
>
> - Refactor student submissions from USC class of CSCI 572 to obtain a 
> comprehensive set of selenium handlers for various Ajax Patterns





[jira] [Updated] (NUTCH-2239) Selenium Handlers for Ajax Patterns from Student submissions

2016-03-14 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-2239:
-
Labels: memex  (was: )

> Selenium Handlers for Ajax Patterns from Student submissions
> 
>
> Key: NUTCH-2239
> URL: https://issues.apache.org/jira/browse/NUTCH-2239
> Project: Nutch
>  Issue Type: Improvement
>  Components: fetcher, protocol
>Reporter: Raghav Bharadwaj Jayasimha Rao
>Assignee: Chris A. Mattmann
>  Labels: memex
> Fix For: 1.12
>
>
> - Refactor student submissions from USC class of CSCI 572 to obtain a 
> comprehensive set of selenium handlers for various Ajax Patterns





[jira] [Updated] (NUTCH-2239) Selenium Handlers for Ajax Patterns from Student submissions

2016-03-14 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-2239:
-
Component/s: protocol
 fetcher

> Selenium Handlers for Ajax Patterns from Student submissions
> 
>
> Key: NUTCH-2239
> URL: https://issues.apache.org/jira/browse/NUTCH-2239
> Project: Nutch
>  Issue Type: Improvement
>  Components: fetcher, protocol
>Reporter: Raghav Bharadwaj Jayasimha Rao
>Assignee: Chris A. Mattmann
>  Labels: memex
> Fix For: 1.12
>
>
> - Refactor student submissions from USC class of CSCI 572 to obtain a 
> comprehensive set of selenium handlers for various Ajax Patterns





[jira] [Assigned] (NUTCH-2239) Selenium Handlers for Ajax Patterns from Student submissions

2016-03-14 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann reassigned NUTCH-2239:


Assignee: Chris A. Mattmann

> Selenium Handlers for Ajax Patterns from Student submissions
> 
>
> Key: NUTCH-2239
> URL: https://issues.apache.org/jira/browse/NUTCH-2239
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Raghav Bharadwaj Jayasimha Rao
>Assignee: Chris A. Mattmann
> Fix For: 1.12
>
>
> - Refactor student submissions from USC class of CSCI 572 to obtain a 
> comprehensive set of selenium handlers for various Ajax Patterns





[jira] [Updated] (NUTCH-2239) Selenium Handlers for Ajax Patterns from Student submissions

2016-03-14 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-2239:
-
Fix Version/s: 1.12

> Selenium Handlers for Ajax Patterns from Student submissions
> 
>
> Key: NUTCH-2239
> URL: https://issues.apache.org/jira/browse/NUTCH-2239
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Raghav Bharadwaj Jayasimha Rao
>Assignee: Chris A. Mattmann
> Fix For: 1.12
>
>
> - Refactor student submissions from USC class of CSCI 572 to obtain a 
> comprehensive set of selenium handlers for various Ajax Patterns





[jira] [Commented] (NUTCH-2132) Publisher/Subscriber model for Nutch to emit events

2016-03-13 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15192488#comment-15192488
 ] 

Chris A. Mattmann commented on NUTCH-2132:
--

Example of this in action in MEMEX-Explorer: 
https://github.com/memex-explorer/nutch-python/pull/15
Another example in MEMEX-Explorer: 
https://github.com/memex-explorer/memex-explorer/pull/720#issuecomment-150004911

> Publisher/Subscriber model for Nutch to emit events 
> 
>
> Key: NUTCH-2132
> URL: https://issues.apache.org/jira/browse/NUTCH-2132
> Project: Nutch
>  Issue Type: New Feature
>  Components: fetcher, REST_api
>Reporter: Sujen Shah
>Assignee: Chris A. Mattmann
>  Labels: memex
> Fix For: 1.12
>
> Attachments: NUTCH-2132.patch, NUTCH-2132.v2.patch, 
> PubSub_routingkey.patch
>
>
> It would be nice to have a Pub/Sub model in Nutch to emit certain events (ex- 
> Fetcher events like fetch-start, fetch-end, a fetch report which may contain 
> data like outlinks of the current fetched url, score, etc). 
> A consumer of this functionality could use this data to generate real time 
> visualization and generate statistics of the crawl without having to wait for 
> the fetch round to finish. 
> The REST API could contain an endpoint which would respond with a url to 
> which a client could subscribe to get the fetcher events. 
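
The Pub/Sub idea described above can be sketched in a few lines. This is an in-process illustration only, not the approach taken in the attached patches: the EventBus class, its methods, and the payload fields are hypothetical, while the event names fetch-start and fetch-end come from the description.

```python
from collections import defaultdict


class EventBus:
    """Minimal in-process publisher/subscriber registry (hypothetical)."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, event_type, callback):
        """Register a callback for one event type."""
        self._subscribers[event_type].append(callback)

    def publish(self, event_type, payload):
        """Deliver the payload to every callback registered for the type."""
        for callback in self._subscribers[event_type]:
            callback(payload)


bus = EventBus()
seen = []
bus.subscribe("fetch-start", lambda e: seen.append(("start", e["url"])))
bus.subscribe("fetch-end", lambda e: seen.append(("end", e["url"], e["outlinks"])))

# A fetcher would publish as it works, so consumers need not wait
# for the fetch round to finish.
bus.publish("fetch-start", {"url": "http://example.org/"})
bus.publish("fetch-end", {"url": "http://example.org/", "outlinks": 2})
print(seen)
```

A real deployment would place a message broker or a network endpoint between publisher and subscribers, which is what the REST endpoint in the description is for.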





[jira] [Commented] (NUTCH-2132) Publisher/Subscriber model for Nutch to emit events

2016-03-08 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15186086#comment-15186086
 ] 

Chris A. Mattmann commented on NUTCH-2132:
--

agreed - I will try and generalize it and then update for review.

> Publisher/Subscriber model for Nutch to emit events 
> 
>
> Key: NUTCH-2132
> URL: https://issues.apache.org/jira/browse/NUTCH-2132
> Project: Nutch
>  Issue Type: New Feature
>  Components: fetcher, REST_api
>Reporter: Sujen Shah
>Assignee: Chris A. Mattmann
>  Labels: memex
> Fix For: 1.12
>
> Attachments: NUTCH-2132.patch, NUTCH-2132.v2.patch, 
> PubSub_routingkey.patch
>
>
> It would be nice to have a Pub/Sub model in Nutch to emit certain events (ex- 
> Fetcher events like fetch-start, fetch-end, a fetch report which may contain 
> data like outlinks of the current fetched url, score, etc). 
> A consumer of this functionality could use this data to generate real time 
> visualization and generate statistics of the crawl without having to wait for 
> the fetch round to finish. 
> The REST API could contain an endpoint which would respond with a url to 
> which a client could subscribe to get the fetcher events. 





[jira] [Commented] (NUTCH-2132) Publisher/Subscriber model for Nutch to emit events

2016-03-07 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15184364#comment-15184364
 ] 

Chris A. Mattmann commented on NUTCH-2132:
--

[~sujenshah] can we get this committed? This is a significant improvement and 
allows us to build awesome real time UIs for Nutch. If I don't hear any more 
comments I will commit the latest version of this PR in 24 hours.

> Publisher/Subscriber model for Nutch to emit events 
> 
>
> Key: NUTCH-2132
> URL: https://issues.apache.org/jira/browse/NUTCH-2132
> Project: Nutch
>  Issue Type: New Feature
>  Components: fetcher, REST_api
>Reporter: Sujen Shah
>Assignee: Chris A. Mattmann
>  Labels: memex
> Fix For: 1.12
>
> Attachments: NUTCH-2132.patch, NUTCH-2132.v2.patch, 
> PubSub_routingkey.patch
>
>
> It would be nice to have a Pub/Sub model in Nutch to emit certain events (ex- 
> Fetcher events like fetch-start, fetch-end, a fetch report which may contain 
> data like outlinks of the current fetched url, score, etc). 
> A consumer of this functionality could use this data to generate real time 
> visualization and generate statistics of the crawl without having to wait for 
> the fetch round to finish. 
> The REST API could contain an endpoint which would respond with a url to 
> which a client could subscribe to get the fetcher events. 





[jira] [Work started] (NUTCH-2132) Publisher/Subscriber model for Nutch to emit events

2016-03-07 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on NUTCH-2132 started by Chris A. Mattmann.

> Publisher/Subscriber model for Nutch to emit events 
> 
>
> Key: NUTCH-2132
> URL: https://issues.apache.org/jira/browse/NUTCH-2132
> Project: Nutch
>  Issue Type: New Feature
>  Components: fetcher, REST_api
>Reporter: Sujen Shah
>Assignee: Chris A. Mattmann
>  Labels: memex
> Fix For: 1.12
>
> Attachments: NUTCH-2132.patch, NUTCH-2132.v2.patch, 
> PubSub_routingkey.patch
>
>
> It would be nice to have a Pub/Sub model in Nutch to emit certain events (ex- 
> Fetcher events like fetch-start, fetch-end, a fetch report which may contain 
> data like outlinks of the current fetched url, score, etc). 
> A consumer of this functionality could use this data to generate real time 
> visualization and generate statistics of the crawl without having to wait for 
> the fetch round to finish. 
> The REST API could contain an endpoint which would respond with a url to 
> which a client could subscribe to get the fetcher events. 





[jira] [Assigned] (NUTCH-2132) Publisher/Subscriber model for Nutch to emit events

2016-03-07 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann reassigned NUTCH-2132:


Assignee: Chris A. Mattmann

> Publisher/Subscriber model for Nutch to emit events 
> 
>
> Key: NUTCH-2132
> URL: https://issues.apache.org/jira/browse/NUTCH-2132
> Project: Nutch
>  Issue Type: New Feature
>  Components: fetcher, REST_api
>Reporter: Sujen Shah
>Assignee: Chris A. Mattmann
>  Labels: memex
> Fix For: 1.12
>
> Attachments: NUTCH-2132.patch, NUTCH-2132.v2.patch, 
> PubSub_routingkey.patch
>
>
> It would be nice to have a Pub/Sub model in Nutch to emit certain events (ex- 
> Fetcher events like fetch-start, fetch-end, a fetch report which may contain 
> data like outlinks of the current fetched url, score, etc). 
> A consumer of this functionality could use this data to generate real time 
> visualization and generate statistics of the crawl without having to wait for 
> the fetch round to finish. 
> The REST API could contain an endpoint which would respond with a url to 
> which a client could subscribe to get the fetcher events. 





[jira] [Resolved] (NUTCH-2213) CommonCrawlDataDumper saves gzipped body in extracted form

2016-02-29 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann resolved NUTCH-2213.
--
   Resolution: Fixed
Fix Version/s: 1.12

> CommonCrawlDataDumper saves gzipped body in extracted form
> --
>
> Key: NUTCH-2213
> URL: https://issues.apache.org/jira/browse/NUTCH-2213
> Project: Nutch
>  Issue Type: Bug
>  Components: commoncrawl, dumpers
>Reporter: Joris Rau
>Assignee: Chris A. Mattmann
>Priority: Critical
>  Labels: easyfix
> Fix For: 1.12
>
>
> I have downloaded [a WARC 
> file|https://aws-publicdatasets.s3.amazonaws.com/common-crawl/crawl-data/CC-MAIN-2015-40/segments/1443738099622.98/warc/CC-MAIN-20151001222139-00240-ip-10-137-6-227.ec2.internal.warc.gz]
>  from the common crawl data. This file contains several gzipped responses 
> which are stored plaintext (without the gzip encoding).
> I used [warctools|https://github.com/internetarchive/warctools] from Internet 
> Archive to extract the responses out of the WARC file. However this tool 
> expects the Content-Length field to match the actual length of the body in 
> the WARC ([See the issue on 
> github|https://github.com/internetarchive/warctools/pull/14#issuecomment-182048962]).
>  warctools uses a more up to date version of hanzo warctools which is 
> recommended on the [Common Crawl 
> website|https://commoncrawl.org/the-data/get-started/] under "Processing the 
> file format".
> I have not been using Nutch and can therefore not say which versions are 
> affected by this.
> After reading [the official WARC 
> draft|http://archive-access.sourceforge.net/warc/warc_file_format-0.9.html] I 
> could not find out how gzipped content is supposed to be stored. However 
> probably multiple WARC file parsers will have an issue with this.
> It would be nice to know whether you consider this a bug and plan on fixing 
> this and whether this is a major issue which concerns most WARC files of the 
> Common Crawl data or only a small part.
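
The mismatch described above can be reproduced without any WARC tooling. The sketch below uses only Python's gzip module and a made-up response body; the variable names are illustrative, and it does not show how the issue was actually fixed in Nutch, which this thread does not state.

```python
import gzip

# Made-up response body: 6 + 600 + 7 = 613 bytes.
body = b"<html>" + b"hello " * 100 + b"</html>"
compressed = gzip.compress(body)

# Content-Length as sent by a server that used Content-Encoding: gzip.
declared_length = len(compressed)

# The record stores the body in extracted form but keeps the old header.
stored_body = gzip.decompress(compressed)
mismatch = declared_length != len(stored_body)

# One consistent alternative: make the length describe what is stored.
corrected_length = len(stored_body)
print(declared_length, len(stored_body), mismatch)
```

A parser that trusts the declared length reads too few bytes of the stored body, which is the kind of failure warctools reports.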





[jira] [Commented] (NUTCH-2213) CommonCrawlDataDumper saves gzipped body in extracted form

2016-02-29 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15173188#comment-15173188
 ] 

Chris A. Mattmann commented on NUTCH-2213:
--

Fixed thanks [~jnioche]!

{noformat}
[chipotle:~/tmp/nutch1.12] mattmann% git push -u origin master
Counting objects: 132, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (15/15), done.
Writing objects: 100% (20/20), 1.98 KiB | 0 bytes/s, done.
Total 20 (delta 10), reused 0 (delta 0)
To https://git-wip-us.apache.org/repos/asf/nutch.git
   15c583e..a3e7420  master -> master
Branch master set up to track remote branch master from origin.
[chipotle:~/tmp/nutch1.12] mattmann% 
{noformat}


> CommonCrawlDataDumper saves gzipped body in extracted form
> --
>
> Key: NUTCH-2213
> URL: https://issues.apache.org/jira/browse/NUTCH-2213
> Project: Nutch
>  Issue Type: Bug
>  Components: commoncrawl, dumpers
>Reporter: Joris Rau
>Assignee: Chris A. Mattmann
>Priority: Critical
>  Labels: easyfix
> Fix For: 1.12
>
>
> I have downloaded [a WARC 
> file|https://aws-publicdatasets.s3.amazonaws.com/common-crawl/crawl-data/CC-MAIN-2015-40/segments/1443738099622.98/warc/CC-MAIN-20151001222139-00240-ip-10-137-6-227.ec2.internal.warc.gz]
>  from the common crawl data. This file contains several gzipped responses 
> which are stored plaintext (without the gzip encoding).
> I used [warctools|https://github.com/internetarchive/warctools] from Internet 
> Archive to extract the responses out of the WARC file. However this tool 
> expects the Content-Length field to match the actual length of the body in 
> the WARC ([See the issue on 
> github|https://github.com/internetarchive/warctools/pull/14#issuecomment-182048962]).
>  warctools uses a more up to date version of hanzo warctools which is 
> recommended on the [Common Crawl 
> website|https://commoncrawl.org/the-data/get-started/] under "Processing the 
> file format".
> I have not been using Nutch and can therefore not say which versions are 
> affected by this.
> After reading [the official WARC 
> draft|http://archive-access.sourceforge.net/warc/warc_file_format-0.9.html] I 
> could not find out how gzipped content is supposed to be stored. However 
> probably multiple WARC file parsers will have an issue with this.
> It would be nice to know whether you consider this a bug and plan on fixing 
> this and whether this is a major issue which concerns most WARC files of the 
> Common Crawl data or only a small part.





[jira] [Work started] (NUTCH-2213) CommonCrawlDataDumper saves gzipped body in extracted form

2016-02-29 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on NUTCH-2213 started by Chris A. Mattmann.

> CommonCrawlDataDumper saves gzipped body in extracted form
> --
>
> Key: NUTCH-2213
> URL: https://issues.apache.org/jira/browse/NUTCH-2213
> Project: Nutch
>  Issue Type: Bug
>  Components: commoncrawl, dumpers
>Reporter: Joris Rau
>Assignee: Chris A. Mattmann
>Priority: Critical
>  Labels: easyfix
>
> I have downloaded [a WARC 
> file|https://aws-publicdatasets.s3.amazonaws.com/common-crawl/crawl-data/CC-MAIN-2015-40/segments/1443738099622.98/warc/CC-MAIN-20151001222139-00240-ip-10-137-6-227.ec2.internal.warc.gz]
>  from the common crawl data. This file contains several gzipped responses 
> which are stored plaintext (without the gzip encoding).
> I used [warctools|https://github.com/internetarchive/warctools] from Internet 
> Archive to extract the responses out of the WARC file. However this tool 
> expects the Content-Length field to match the actual length of the body in 
> the WARC ([See the issue on 
> github|https://github.com/internetarchive/warctools/pull/14#issuecomment-182048962]).
>  warctools uses a more up to date version of hanzo warctools which is 
> recommended on the [Common Crawl 
> website|https://commoncrawl.org/the-data/get-started/] under "Processing the 
> file format".
> I have not been using Nutch and can therefore not say which versions are 
> affected by this.
> After reading [the official WARC 
> draft|http://archive-access.sourceforge.net/warc/warc_file_format-0.9.html] I 
> could not find out how gzipped content is supposed to be stored. However 
> probably multiple WARC file parsers will have an issue with this.
> It would be nice to know whether you consider this a bug and plan on fixing 
> this and whether this is a major issue which concerns most WARC files of the 
> Common Crawl data or only a small part.
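
The mismatch described above can be reproduced in a few lines. This is only a sketch, assuming a response whose Content-Length header was computed over the gzipped bytes while the archived record holds the extracted body; the class and method names (ContentLengthCheck, lengthMatches) are hypothetical and are not part of Nutch or warctools.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

// Hypothetical illustration; not code from Nutch or warctools.
final class ContentLengthCheck {

  /** Gzip-compresses a body, as a server answering with Content-Encoding: gzip would. */
  public static byte[] gzip(byte[] plain) {
    ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    try (GZIPOutputStream out = new GZIPOutputStream(buffer)) {
      out.write(plain);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return buffer.toByteArray();
  }

  /** True when the declared Content-Length equals the number of body bytes actually stored. */
  public static boolean lengthMatches(long declaredContentLength, byte[] storedBody) {
    return declaredContentLength == storedBody.length;
  }

  public static void main(String[] args) {
    // Build a repetitive body so the gzipped form is clearly shorter than the plain form.
    StringBuilder html = new StringBuilder();
    for (int i = 0; i < 1000; i++) {
      html.append("<p>hello</p>");
    }
    byte[] plain = html.toString().getBytes(StandardCharsets.UTF_8);
    byte[] onTheWire = gzip(plain);
    long declared = onTheWire.length; // what the server put in Content-Length

    System.out.println("record keeps gzipped body:   " + lengthMatches(declared, onTheWire));
    System.out.println("record keeps extracted body: " + lengthMatches(declared, plain));
  }
}
```

A parser that trusts Content-Length will read too few bytes in the second case, which is consistent with the behaviour reported for warctools.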





[jira] [Resolved] (NUTCH-2144) Plugin to override db.ignore.external to exempt interesting external domain URLs

2016-02-28 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann resolved NUTCH-2144.
--
Resolution: Fixed

OK all fixed thanks [~thammegowda]!

{noformat}
[chipotle:~/tmp/nutch1.12] mattmann% git push -u origin master
Counting objects: 224, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (40/40), done.
Writing objects: 100% (51/51), 10.10 KiB | 0 bytes/s, done.
Total 51 (delta 25), reused 0 (delta 0)
To https://git-wip-us.apache.org/repos/asf/nutch.git
   f5e430e..15c583e  master -> master
Branch master set up to track remote branch master from origin.
[chipotle:~/tmp/nutch1.12] mattmann% 
{noformat}


> Plugin to override db.ignore.external to exempt interesting external domain 
> URLs
> 
>
> Key: NUTCH-2144
> URL: https://issues.apache.org/jira/browse/NUTCH-2144
> Project: Nutch
>  Issue Type: New Feature
>  Components: crawldb, fetcher
>Reporter: Thamme Gowda N
>Assignee: Chris A. Mattmann
>Priority: Minor
> Fix For: 1.12
>
> Attachments: ignore-exempt.patch, ignore-exempt.patch
>
>
> Create a rule based urlfilter plugin that allows focused crawler 
> (db.ignore.external.links=true) to fetch static resources from external 
> domains.
> The generalized version of this: This plugin should permit interesting URLs 
> from external domains (by overriding db.ignore.external). The interesting 
> urls are decided from a combination of regex and mime-type rules.
> Concrete use case:
>   When using Nutch to crawl images from a set of domains, the crawler needs 
> to fetch all images which may be linked from CDNs and other domains. In this 
> scenario, allowing all external links and then writing hundreds of regular 
> expressions is not feasible for large number of domains.





[jira] [Commented] (NUTCH-2191) Add protocol-htmlunit

2016-02-18 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15152420#comment-15152420
 ] 

Chris A. Mattmann commented on NUTCH-2191:
--

Markus, we don't need to fix the broader plugin dependency issue here. We should 
just focus on NUTCH-2191, and on that I agree with Karanjeet's solution for 
part #1: create a new lib-htmlunit library and change the dependency to it. For 
#2, please try again by rebuilding; it should work.

> Add protocol-htmlunit
> -
>
> Key: NUTCH-2191
> URL: https://issues.apache.org/jira/browse/NUTCH-2191
> Project: Nutch
>  Issue Type: New Feature
>  Components: protocol
>Affects Versions: 1.11
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Fix For: 1.12
>
> Attachments: NUTCH-2191.patch, NUTCH-2191.patch
>
>
> HtmlUnit is, as opposed to other JavaScript-enabled headless browsers, a 
> portable library and should therefore be better suited for very large scale 
> crawls. This issue is an attempt to implement protocol-htmlunit.





[jira] [Commented] (NUTCH-2046) The crawl script should be able to skip an initial injection.

2016-02-11 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15142858#comment-15142858
 ] 

Chris A. Mattmann commented on NUTCH-2046:
--

+1 from me

> The crawl script should be able to skip an initial injection.
> -
>
> Key: NUTCH-2046
> URL: https://issues.apache.org/jira/browse/NUTCH-2046
> Project: Nutch
>  Issue Type: Improvement
>  Components: crawldb, injector
>Affects Versions: 1.10
>Reporter: Luis Lopez
>Assignee: Lewis John McGibbney
>  Labels: crawl, injection
> Fix For: 1.12
>
> Attachments: crawl.patch
>
>
> When our crawl gets really big, a new injection takes considerable time as it 
> updates crawldb. The crawl script should be able to skip the injection and go 
> directly to the generate call.





[jira] [Commented] (NUTCH-2213) CommonCrawlDataDumper saves gzipped body in extracted form

2016-02-10 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15140950#comment-15140950
 ] 

Chris A. Mattmann commented on NUTCH-2213:
--

Hi [~jrsr] thanks for the issue request. You may also want to have a look at:

http://wiki.apache.org/nutch/CommonCrawlDataDumper

It's probably worth noting here too that the tool "scaling" is probably in the 
eyes of the beholder. We regularly use the tool a ton in my team at NASA JPL to 
dump loads of data (terabytes) from Nutch crawls, etc. It takes up a bunch of 
memory and isn't necessarily as fast as it could be (as [~jnioche] noted). The 
main reason is that it does not yet use Map Reduce: right now the tool 
does everything on the head node. If you're interested in the tool or if you 
find it useful, we would be happy to work with you and/or anyone to port it to 
Map Reduce, which would be trivial.

Cheers,
Chris


> CommonCrawlDataDumper saves gzipped body in extracted form
> --
>
> Key: NUTCH-2213
> URL: https://issues.apache.org/jira/browse/NUTCH-2213
> Project: Nutch
>  Issue Type: Bug
>  Components: commoncrawl, dumpers
>Reporter: Joris Rau
>Priority: Critical
>  Labels: easyfix
>
> I have downloaded [a WARC 
> file|https://aws-publicdatasets.s3.amazonaws.com/common-crawl/crawl-data/CC-MAIN-2015-40/segments/1443738099622.98/warc/CC-MAIN-20151001222139-00240-ip-10-137-6-227.ec2.internal.warc.gz]
>  from the common crawl data. This file contains several gzipped responses 
> which are stored plaintext (without the gzip encoding).
> I used [warctools|https://github.com/internetarchive/warctools] from Internet 
> Archive to extract the responses out of the WARC file. However this tool 
> expects the Content-Length field to match the actual length of the body in 
> the WARC ([See the issue on 
> github|https://github.com/internetarchive/warctools/pull/14#issuecomment-182048962]).
>  warctools uses a more up to date version of hanzo warctools which is 
> recommended on the [Common Crawl 
> website|https://commoncrawl.org/the-data/get-started/] under "Processing the 
> file format".
> I have not been using Nutch and can therefore not say which versions are 
> affected by this.
> After reading [the official WARC 
> draft|http://archive-access.sourceforge.net/warc/warc_file_format-0.9.html] I 
> could not find out how gzipped content is supposed to be stored. However 
> probably multiple WARC file parsers will have an issue with this.
> It would be nice to know whether you consider this a bug and plan on fixing 
> this and whether this is a major issue which concerns most WARC files of the 
> Common Crawl data or only a small part.





[jira] [Assigned] (NUTCH-2144) Plugin to override db.ignore.external to exempt interesting external domain URLs

2016-02-10 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann reassigned NUTCH-2144:


Assignee: Chris A. Mattmann

> Plugin to override db.ignore.external to exempt interesting external domain 
> URLs
> 
>
> Key: NUTCH-2144
> URL: https://issues.apache.org/jira/browse/NUTCH-2144
> Project: Nutch
>  Issue Type: New Feature
>  Components: crawldb, fetcher
>Reporter: Thamme Gowda N
>Assignee: Chris A. Mattmann
>Priority: Minor
> Attachments: ignore-exempt.patch, ignore-exempt.patch
>
>
> Create a rule based urlfilter plugin that allows focused crawler 
> (db.ignore.external.links=true) to fetch static resources from external 
> domains.
> The generalized version of this: This plugin should permit interesting URLs 
> from external domains (by overriding db.ignore.external). The interesting 
> urls are decided from a combination of regex and mime-type rules.
> Concrete use case:
>   When using Nutch to crawl images from a set of domains, the crawler needs 
> to fetch all images which may be linked from CDNs and other domains. In this 
> scenario, allowing all external links and then writing hundreds of regular 
> expressions is not feasible for large number of domains.





[jira] [Work started] (NUTCH-2144) Plugin to override db.ignore.external to exempt interesting external domain URLs

2016-02-10 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on NUTCH-2144 started by Chris A. Mattmann.

> Plugin to override db.ignore.external to exempt interesting external domain 
> URLs
> 
>
> Key: NUTCH-2144
> URL: https://issues.apache.org/jira/browse/NUTCH-2144
> Project: Nutch
>  Issue Type: New Feature
>  Components: crawldb, fetcher
>Reporter: Thamme Gowda N
>Assignee: Chris A. Mattmann
>Priority: Minor
> Attachments: ignore-exempt.patch, ignore-exempt.patch
>
>
> Create a rule based urlfilter plugin that allows focused crawler 
> (db.ignore.external.links=true) to fetch static resources from external 
> domains.
> The generalized version of this: This plugin should permit interesting URLs 
> from external domains (by overriding db.ignore.external). The interesting 
> urls are decided from a combination of regex and mime-type rules.
> Concrete use case:
>   When using Nutch to crawl images from a set of domains, the crawler needs 
> to fetch all images which may be linked from CDNs and other domains. In this 
> scenario, allowing all external links and then writing hundreds of regular 
> expressions is not feasible for large number of domains.





[jira] [Commented] (NUTCH-2144) Plugin to override db.ignore.external to exempt interesting external domain URLs

2016-02-10 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15141172#comment-15141172
 ] 

Chris A. Mattmann commented on NUTCH-2144:
--

I am +1 for this patch, as long as it is enabled only by the user (and not by 
default). This is a critical patch for us in MEMEX and I think it adds a lot of 
value here to the community.

[~lewismc] and I will work to get this committed in the next 48 hours. Thank 
you [~thammegowda]!

> Plugin to override db.ignore.external to exempt interesting external domain 
> URLs
> 
>
> Key: NUTCH-2144
> URL: https://issues.apache.org/jira/browse/NUTCH-2144
> Project: Nutch
>  Issue Type: New Feature
>  Components: crawldb, fetcher
>Reporter: Thamme Gowda N
>Priority: Minor
> Attachments: ignore-exempt.patch, ignore-exempt.patch
>
>
> Create a rule based urlfilter plugin that allows focused crawler 
> (db.ignore.external.links=true) to fetch static resources from external 
> domains.
> The generalized version of this: This plugin should permit interesting URLs 
> from external domains (by overriding db.ignore.external). The interesting 
> urls are decided from a combination of regex and mime-type rules.
> Concrete use case:
>   When using Nutch to crawl images from a set of domains, the crawler needs 
> to fetch all images which may be linked from CDNs and other domains. In this 
> scenario, allowing all external links and then writing hundreds of regular 
> expressions is not feasible for large number of domains.





[jira] [Commented] (NUTCH-2144) Plugin to override db.ignore.external to exempt interesting external domain URLs

2016-02-10 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15141293#comment-15141293
 ] 

Chris A. Mattmann commented on NUTCH-2144:
--

Agreed and agreed. Thamme, can you submit a new version of the patch/pull 
request using just suffix checking, as Lewis suggests?

Thamme - if you turn off MIME magic in the Tika config, then it will default to 
glob pattern and URL regex matching. However, I wouldn't even bother with it in 
this case; just doing a simple URL/regex check in Nutch will deliver the 
speed gains.

Looking forward to the new version of the PR.
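
For illustration, a suffix check of the kind suggested above might look roughly like the following. This is a sketch only: the class name SuffixExemption, the method isExempt and the suffix list are hypothetical and are not taken from the attached patch.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration of suffix-based exemption; not the plugin's actual code.
final class SuffixExemption {

  /** Assumed example suffixes; a real deployment would load these from configuration. */
  static final List<String> EXEMPT_SUFFIXES = Arrays.asList(".jpg", ".jpeg", ".png", ".gif");

  /** Returns true when the URL path ends with one of the exempt suffixes. */
  public static boolean isExempt(String url) {
    String path;
    try {
      // getPath() drops the query string and fragment, so "logo.png?v=2" still matches.
      path = URI.create(url).getPath();
    } catch (IllegalArgumentException e) {
      return false; // malformed URL: do not exempt
    }
    if (path == null) {
      return false; // opaque URIs such as mailto: have no path
    }
    String lower = path.toLowerCase(Locale.ROOT);
    for (String suffix : EXEMPT_SUFFIXES) {
      if (lower.endsWith(suffix)) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    System.out.println(isExempt("http://cdn.example.com/img/logo.PNG?v=2"));
    System.out.println(isExempt("http://other.example.org/page.html"));
  }
}
```

A check like this needs no content fetch and no MIME detection, which is where the speed gain over MIME-type rules would come from.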

> Plugin to override db.ignore.external to exempt interesting external domain 
> URLs
> 
>
> Key: NUTCH-2144
> URL: https://issues.apache.org/jira/browse/NUTCH-2144
> Project: Nutch
>  Issue Type: New Feature
>  Components: crawldb, fetcher
>Reporter: Thamme Gowda N
>Assignee: Chris A. Mattmann
>Priority: Minor
> Fix For: 1.12
>
> Attachments: ignore-exempt.patch, ignore-exempt.patch
>
>
> Create a rule based urlfilter plugin that allows focused crawler 
> (db.ignore.external.links=true) to fetch static resources from external 
> domains.
> The generalized version of this: This plugin should permit interesting URLs 
> from external domains (by overriding db.ignore.external). The interesting 
> urls are decided from a combination of regex and mime-type rules.
> Concrete use case:
>   When using Nutch to crawl images from a set of domains, the crawler needs 
> to fetch all images which may be linked from CDNs and other domains. In this 
> scenario, allowing all external links and then writing hundreds of regular 
> expressions is not feasible for large number of domains.





[jira] [Commented] (NUTCH-1314) Impose a limit on the length of outlink target urls

2016-02-02 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15128368#comment-15128368
 ] 

Chris A. Mattmann commented on NUTCH-1314:
--

Otis, your patches are always welcome! :)

> Impose a limit on the length of outlink target urls
> ---
>
> Key: NUTCH-1314
> URL: https://issues.apache.org/jira/browse/NUTCH-1314
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Ferdy Galema
> Fix For: 2.4
>
> Attachments: NUTCH-1314-trunk.patch, NUTCH-1314-v2.patch, 
> NUTCH-1314-v3.patch, NUTCH-1314.patch
>
>
> In the past we have encountered situations where crawling specific broken 
> sites resulted in ridiculously long urls that caused the stalling of tasks. 
> The regex plugins (normalizing/filtering) processed single urls for hours, if 
> not indefinitely hanging.
> My suggestion is to limit the outlink url target length as soon as possible. It 
> is a configurable limit, the default is 3000. This should be reasonably long 
> enough for most uses, but sufficiently strict to make sure regex 
> plugins do not choke on urls that are too long. Please see attached patch for 
> the Nutchgora implementation.
> I'd like to hear what you think about this.
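
The idea can be sketched as a filter applied before any regex normalizer or filter sees the URL. The sketch below assumes only the default of 3000 quoted in the description; the class and method names are hypothetical and do not come from the attached patches.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration; not the code from the attached patches.
final class OutlinkLengthLimit {

  /** Default limit quoted in the issue description. */
  public static final int DEFAULT_MAX_LENGTH = 3000;

  /** Keeps only the outlink targets whose length does not exceed maxLength. */
  public static List<String> dropOverlong(List<String> outlinkUrls, int maxLength) {
    List<String> kept = new ArrayList<>();
    for (String url : outlinkUrls) {
      if (url.length() <= maxLength) {
        kept.add(url);
      }
    }
    return kept;
  }

  public static void main(String[] args) {
    // Build an artificially long URL of the kind a broken site might produce.
    StringBuilder broken = new StringBuilder("http://example.com/");
    while (broken.length() <= DEFAULT_MAX_LENGTH) {
      broken.append("a/");
    }
    List<String> outlinks = Arrays.asList("http://example.com/ok", broken.toString());
    System.out.println(dropOverlong(outlinks, DEFAULT_MAX_LENGTH));
  }
}
```

Because the length test is a constant-time comparison per URL, it is cheap to run ahead of the regex plugins that would otherwise stall.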





[jira] [Commented] (NUTCH-2206) Provide example scoring.similarity.stopword.file

2016-01-26 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15118689#comment-15118689
 ] 

Chris A. Mattmann commented on NUTCH-2206:
--

+1 please commit

> Provide example scoring.similarity.stopword.file
> 
>
> Key: NUTCH-2206
> URL: https://issues.apache.org/jira/browse/NUTCH-2206
> Project: Nutch
>  Issue Type: Bug
>  Components: plugin, scoring
>Affects Versions: 1.11
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
> Fix For: 1.12
>
> Attachments: NUTCH-2206.patch, NUTCH-2206.patch
>
>
> The scoring-similarity plugin does not provide an example file for the 
> property scoring.similarity.stopword.file.
> This is an issue for a number of reasons, namely 
>  * A user does not know what it is meant to look like, and
>  * We always check for this file and will [throw an exception if it is not 
> found|https://github.com/apache/nutch/blob/trunk/src/plugin/scoring-similarity/src/java/org/apache/nutch/scoring/similarity/cosine/DocumentVector.java#L79-L80],
>  this may not be picked up by the user until much later.
> I suggest a simple fix here: simply include the [standard English stop words 
> taken from Lucene's 
> StopAnalyzer|https://github.com/apache/lucene-solr/blob/3f38aba02ce37c6422875d8824ee034d42d635b9/solr/contrib/morphlines-core/src/test-files/solr/collection1/conf/lang/stopwords_en.txt].
>  The comments will help people to easily customize the list to whatever they 
> require. 
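
To make the expected file shape concrete, a stop word file of this kind is typically one word per line with optional comment lines. The reader below is a sketch under that assumption; treating "#" as a comment marker follows the Lucene/Solr stopwords_en.txt convention and is not confirmed to be how the scoring-similarity plugin parses its file. StopwordLoader is a hypothetical name.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration; not the plugin's own parsing code.
final class StopwordLoader {

  /** Reads one stop word per line, skipping blank lines and lines starting with '#'. */
  public static Set<String> load(Reader source) {
    Set<String> words = new LinkedHashSet<>();
    try (BufferedReader reader = new BufferedReader(source)) {
      String line;
      while ((line = reader.readLine()) != null) {
        String word = line.trim();
        if (word.isEmpty() || word.startsWith("#")) {
          continue; // assumed comment convention
        }
        words.add(word.toLowerCase(Locale.ROOT));
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return words;
  }

  public static void main(String[] args) {
    String sample = "# a comment line\n\na\nan\nThe\n";
    System.out.println(load(new StringReader(sample)));
  }
}
```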





[jira] [Commented] (NUTCH-2191) Add protocol-htmlunit

2016-01-08 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15089545#comment-15089545
 ] 

Chris A. Mattmann commented on NUTCH-2191:
--

Markus thanks! Check out:
https://github.com/apache/nutch/tree/trunk/src/plugin/protocol-interactiveselenium
 

and the handlers there

> Add protocol-htmlunit
> -
>
> Key: NUTCH-2191
> URL: https://issues.apache.org/jira/browse/NUTCH-2191
> Project: Nutch
>  Issue Type: New Feature
>  Components: protocol
>Affects Versions: 1.11
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Fix For: 1.12
>
> Attachments: NUTCH-2191.patch
>
>
> HtmlUnit is, as opposed to other JavaScript-enabled headless browsers, a 
> portable library and should therefore be better suited for very large scale 
> crawls. This issue is an attempt to implement protocol-htmlunit.





[jira] [Commented] (NUTCH-2191) Add protocol-htmlunit

2016-01-05 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15083214#comment-15083214
 ] 

Chris A. Mattmann commented on NUTCH-2191:
--

Very nice, Markus! Beat me to implementing this one.

My suggestions for future work here are to implement 10-15 Ajax pattern 
handlers like we started to do with selenium.

> Add protocol-htmlunit
> -
>
> Key: NUTCH-2191
> URL: https://issues.apache.org/jira/browse/NUTCH-2191
> Project: Nutch
>  Issue Type: New Feature
>  Components: protocol
>Affects Versions: 1.11
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Fix For: 1.12
>
> Attachments: NUTCH-2191.patch
>
>
> HtmlUnit is, as opposed to other JavaScript-enabled headless browsers, a 
> portable library and should therefore be better suited for very large scale 
> crawls. This issue is an attempt to implement protocol-htmlunit.





[jira] [Commented] (NUTCH-2184) Enable IndexingJob to function with no crawldb

2015-12-15 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15059016#comment-15059016
 ] 

Chris A. Mattmann commented on NUTCH-2184:
--

+1

> Enable IndexingJob to function with no crawldb
> --
>
> Key: NUTCH-2184
> URL: https://issues.apache.org/jira/browse/NUTCH-2184
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
> Fix For: 1.12
>
> Attachments: NUTCH-2184.patch
>
>
> Sometimes when working with distributed team(s), we have found that we can 
> 'lose' data structures which are currently considered as critical e.g. 
> crawldb, linkdb and/or segments.
> In my current scenario I have a requirement to index segment data with no 
> accompanying crawldb or linkdb. 
> Absence of the latter is OK as linkdb is optional however currently in 
> [IndexerMapReduce|https://github.com/apache/nutch/blob/trunk/src/java/org/apache/nutch/indexer/IndexerMapReduce.java]
>  crawldb is mandatory. 
> This ticket should enhance the IndexerMapReduce code to support the use case 
> where you ONLY have segments and want to force an index for every record 
> present.





[jira] [Commented] (NUTCH-2184) Enable IndexingJob to function with no crawldb

2015-12-13 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15055417#comment-15055417
 ] 

Chris A. Mattmann commented on NUTCH-2184:
--

Nice, bruh

> Enable IndexingJob to function with no crawldb
> --
>
> Key: NUTCH-2184
> URL: https://issues.apache.org/jira/browse/NUTCH-2184
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
> Fix For: 1.12
>
>
> Sometimes when working with distributed team(s), we have found that we can 
> 'lose' data structures which are currently considered as critical e.g. 
> crawldb, linkdb and/or segments.
> In my current scenario I have a requirement to index segment data with no 
> accompanying crawldb or linkdb. 
> Absence of the latter is OK as linkdb is optional however currently in 
> [IndexerMapReduce|https://github.com/apache/nutch/blob/trunk/src/java/org/apache/nutch/indexer/IndexerMapReduce.java]
>  crawldb is mandatory. 
> This ticket should enhance the IndexerMapReduce code to support the use case 
> where you ONLY have segments and want to force an index for every record 
> present.





[jira] [Commented] (NUTCH-2172) Parsing whitespace not just tabs in contenttype-mapping.txt

2015-12-02 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15035902#comment-15035902
 ] 

Chris A. Mattmann commented on NUTCH-2172:
--

bq. This could be an improvement if we assume that MIME types do not contain 
white space

This is not a safe assumption on the Internet. We see all the time in crawls 
that web servers return MIME types with white space.

> Parsing whitespace not just tabs in contenttype-mapping.txt
> ---
>
> Key: NUTCH-2172
> URL: https://issues.apache.org/jira/browse/NUTCH-2172
> Project: Nutch
>  Issue Type: Bug
>  Components: metadata
>Affects Versions: 1.10
> Environment: Macosx, Java 8
>Reporter: Nicola Tonellotto
>Priority: Minor
>  Labels: easyfix, newbie
> Attachments: NUTCH-2172-1.patch
>
>   Original Estimate: 0.5h
>  Remaining Estimate: 0.5h
>
> The index-more plugin uses the conf/contenttype-mapping.txt file to build up 
> the mimeMap hash table (in the readConfiguration() method).
> The line splitting is performed around "\t", so it silently skips lines 
> separated by simple spaces or more than one tab (see line 325).
> Changing the single-char string "\t" with the regex "\\s+" should do the 
> magic.
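
The trade-off between the two delimiters can be seen with plain String.split. The sample lines below are made up for illustration and are not taken from conf/contenttype-mapping.txt; MappingLineSplit is a hypothetical name.

```java
import java.util.Arrays;

// Hypothetical illustration of the two delimiters discussed in this issue.
final class MappingLineSplit {

  /** Current behaviour described in the issue: split on a single tab character. */
  public static String[] splitOnTab(String line) {
    return line.split("\t");
  }

  /** Proposed behaviour: split on any run of whitespace. */
  public static String[] splitOnWhitespace(String line) {
    return line.split("\\s+");
  }

  public static void main(String[] args) {
    String spaces = "text/html  web";                    // fields separated by two spaces
    String twoTabs = "text/html\t\tweb";                 // fields separated by two tabs
    String spacedType = "text/html; charset=UTF-8\tweb"; // MIME type containing a space

    // Splitting on "\t" leaves the space-separated line as one field and
    // produces an empty middle field for the double tab.
    System.out.println(Arrays.toString(splitOnTab(spaces)));
    System.out.println(Arrays.toString(splitOnTab(twoTabs)));

    // Splitting on "\\s+" handles both, but breaks a MIME type that contains a space.
    System.out.println(Arrays.toString(splitOnWhitespace(spaces)));
    System.out.println(Arrays.toString(splitOnWhitespace(twoTabs)));
    System.out.println(Arrays.toString(splitOnWhitespace(spacedType)));
  }
}
```

The last case is the concern raised in the comment above: with "\\s+" a type string that itself contains white space is split into extra fields, so the regex fixes one class of silently skipped lines while introducing another.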





[jira] [Commented] (NUTCH-2162) Nutch Webapp Crawl fails as it tries to index

2015-11-06 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14994029#comment-14994029
 ] 

Chris A. Mattmann commented on NUTCH-2162:
--

So I tried this out. It actually works fine as long as you keep everything at 
the defaults, e.g., you install Solr on 8983, install the Nutch schema in 
that Solr, and install it into collection 1, the default. I have it fully 
working with that config. It's brittle but doesn't require a code update and it 
works. 

One other thing to note - you can't change properties (yet) from the Nutch 
config, so you *must* update http.agent.name to something in your 
runtime/*/conf/nutch-{site|default}.xml file before starting the web services 
REST layer and using the Wicket App.

One other thing we should think about - Maven - and then Maven WAR overlays 
here once we get a version of Nutch working with Maven.

> Nutch Webapp Crawl fails as it tries to index
> -
>
> Key: NUTCH-2162
> URL: https://issues.apache.org/jira/browse/NUTCH-2162
> Project: Nutch
>  Issue Type: Bug
>  Components: web gui
>Affects Versions: 1.11
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
> Fix For: 1.11
>
> Attachments: nutch_webapp.log
>
>
> Right now a crawl task fails on the trunk version of the WebApp due to it 
> attempting to index. No indexer is defined by default so this is a major bug.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (NUTCH-2158) Upgrade to Tika 1.11

2015-11-03 Thread Chris A. Mattmann (JIRA)
Chris A. Mattmann created NUTCH-2158:


 Summary: Upgrade to Tika 1.11
 Key: NUTCH-2158
 URL: https://issues.apache.org/jira/browse/NUTCH-2158
 Project: Nutch
  Issue Type: Task
  Components: parser
Reporter: Chris A. Mattmann
Assignee: Chris A. Mattmann
 Fix For: 1.11


Upgrade parse-tika to 1.11 release for Tika.





[jira] [Commented] (NUTCH-2155) Create a "crawl completeness" utility

2015-11-01 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14984432#comment-14984432
 ] 

Chris A. Mattmann commented on NUTCH-2155:
--

Seb, shall we update it not to require current and then move forward? Thoughts? 
[~mjoyce]?

> Create a "crawl completeness" utility
> -
>
> Key: NUTCH-2155
> URL: https://issues.apache.org/jira/browse/NUTCH-2155
> Project: Nutch
>  Issue Type: Improvement
>  Components: util
>Affects Versions: 1.10
>Reporter: Michael Joyce
>Assignee: Chris A. Mattmann
>  Labels: memex
> Fix For: 1.11
>
>
> I've found it useful to have a tool for dumping some "completeness" 
> information from a crawl similar to how domainstats does but including 
> fetched and unfetched counts per domain/host. This is especially nice when 
> doing vertical crawls over a few domains or just to see how much of a 
> host/domain you've covered with your crawl so far.
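
The per-host tally described above can be sketched as follows. This is not the proposed utility itself: the input here is a plain map from URL to a fetched flag, standing in for crawldb records, and the names CrawlCompleteness and countByHost are hypothetical.

```java
import java.net.URI;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration; the real utility would read CrawlDatum records from the crawldb.
final class CrawlCompleteness {

  /** Returns host -> {fetched count, unfetched count}, sorted by host name. */
  public static Map<String, int[]> countByHost(Map<String, Boolean> fetchedByUrl) {
    Map<String, int[]> counts = new TreeMap<>();
    for (Map.Entry<String, Boolean> entry : fetchedByUrl.entrySet()) {
      String host = URI.create(entry.getKey()).getHost();
      if (host == null) {
        continue; // URL without a host component: skip it
      }
      int[] tally = counts.computeIfAbsent(host, h -> new int[2]);
      tally[entry.getValue() ? 0 : 1]++;
    }
    return counts;
  }

  public static void main(String[] args) {
    Map<String, Boolean> sample = new LinkedHashMap<>();
    sample.put("http://a.example.com/1", true);
    sample.put("http://a.example.com/2", false);
    sample.put("http://b.example.org/", true);
    for (Map.Entry<String, int[]> e : countByHost(sample).entrySet()) {
      System.out.println(e.getKey() + " fetched=" + e.getValue()[0] + " unfetched=" + e.getValue()[1]);
    }
  }
}
```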





[jira] [Commented] (NUTCH-2150) Add ProtocolStatus Utility

2015-11-01 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14984433#comment-14984433
 ] 

Chris A. Mattmann commented on NUTCH-2150:
--

Again - the solution here is to remove the need for current?

> Add ProtocolStatus Utility
> --
>
> Key: NUTCH-2150
> URL: https://issues.apache.org/jira/browse/NUTCH-2150
> Project: Nutch
>  Issue Type: Improvement
>  Components: util
>Affects Versions: 1.10
>Reporter: Michael Joyce
>Assignee: Chris A. Mattmann
> Fix For: 1.11
>
>
> It would be nice to have a utility for dumping protocol status code 
> information for a crawl database. This will be a utility for getting a dump 
> of the protocol status codes that builds off of NUTCH-2129





[jira] [Commented] (NUTCH-2154) Nutch REST API (DB) suffering NullPointerException

2015-10-30 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14983442#comment-14983442
 ] 

Chris A. Mattmann commented on NUTCH-2154:
--

[~sujenshah] I'd like to spin 1.11 RC #2 today. Can you look at this?

> Nutch REST API (DB) suffering NullPointerException
> --
>
> Key: NUTCH-2154
> URL: https://issues.apache.org/jira/browse/NUTCH-2154
> Project: Nutch
>  Issue Type: Bug
>  Components: REST_api
>Affects Versions: 1.11
>Reporter: Aron Ahmadia
>Assignee: Chris A. Mattmann
>Priority: Minor
>  Labels: memex
> Fix For: 1.11
>
>
> Not sure what's causing this.  I tried this request both before and after a 
> crawl had completed.
> nutch.py: POST Endpoint: /db/crawldb
> nutch.py: POST Request data: {'type': 'stats', 'crawlId': 
> 'crawl_aahmadia_2015-10-28T13_17_15.034351', 'confId': 'default'}
> nutch.py: POST Request headers: {'Accept': 'application/json'}
> nutch.py: Response headers: {'Date': 'Wed, 28 Oct 2015 17:18:54 GMT', 
> 'Content-Length': '0', 'Server': 'Jetty(8.1.15.v20140411)'}
> nutch.py: Response status: 500
> nutch log:
> java.lang.NullPointerException
>   at org.apache.nutch.crawl.CrawlDbReader.query(CrawlDbReader.java:747)
>   at 
> org.apache.nutch.service.resources.DbResource.crawlDbStats(DbResource.java:95)
>   at 
> org.apache.nutch.service.resources.DbResource.readdb(DbResource.java:52)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:497)
>   at 
> org.apache.cxf.service.invoker.AbstractInvoker.performInvocation(AbstractInvoker.java:181)
>   at 
> org.apache.cxf.service.invoker.AbstractInvoker.invoke(AbstractInvoker.java:97)
>   at org.apache.cxf.jaxrs.JAXRSInvoker.invoke(JAXRSInvoker.java:200)
>   at org.apache.cxf.jaxrs.JAXRSInvoker.invoke(JAXRSInvoker.java:99)
>   at 
> org.apache.cxf.interceptor.ServiceInvokerInterceptor$1.run(ServiceInvokerInterceptor.java:59)
>   at 
> org.apache.cxf.interceptor.ServiceInvokerInterceptor.handleMessage(ServiceInvokerInterceptor.java:96)
>   at 
> org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:307)
>   at 
> org.apache.cxf.transport.ChainInitiationObserver.onMessage(ChainInitiationObserver.java:121)
>   at 
> org.apache.cxf.transport.http.AbstractHTTPDestination.invoke(AbstractHTTPDestination.java:251)
>   at 
> org.apache.cxf.transport.http_jetty.JettyHTTPDestination.doService(JettyHTTPDestination.java:261)
>   at 
> org.apache.cxf.transport.http_jetty.JettyHTTPHandler.handle(JettyHTTPHandler.java:70)
>   at 
> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1088)
>   at 
> org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1024)
>   at 
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
>   at 
> org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
>   at 
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
>   at org.eclipse.jetty.server.Server.handle(Server.java:370)
>   at 
> org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:494)
>   at 
> org.eclipse.jetty.server.AbstractHttpConnection.content(AbstractHttpConnection.java:982)
>   at 
> org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:1043)
>   at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:865)
>   at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:240)
>   at 
> org.eclipse.jetty.server.AsyncHttpConnection.handle(AsyncHttpConnection.java:82)
>   at 
> org.eclipse.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:696)
>   at 
> org.eclipse.jetty.io.nio.SelectChannelEndPoint$1.run(SelectChannelEndPoint.java:53)
>   at 
> org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
>   at 
> org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
>   at java.lang.Thread.run(Thread.java:745)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (NUTCH-2150) Add ProtocolStatus Utility

2015-10-30 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann resolved NUTCH-2150.
--
Resolution: Fixed

thanks Mike!

{noformat}
[chipotle:~/tmp/nutch1.11] mattmann% svn commit -m "Fix for NUTCH-2150 Add 
ProtocolStatus Utility contributed by Michael Joyce  this 
closes #82."
Sending        CHANGES.txt
Sending        src/bin/crawl
Sending        src/bin/nutch
Adding         src/java/org/apache/nutch/util/ProtocolStatusStatistics.java
Transmitting file data ...
Committed revision 1711562.
[chipotle:~/tmp/nutch1.11] mattmann% 
{noformat}


> Add ProtocolStatus Utility
> --
>
> Key: NUTCH-2150
> URL: https://issues.apache.org/jira/browse/NUTCH-2150
> Project: Nutch
>  Issue Type: Improvement
>  Components: util
>Affects Versions: 1.10
>Reporter: Michael Joyce
>Assignee: Chris A. Mattmann
> Fix For: 1.11
>
>
> It would be nice to have a utility for dumping protocol status code 
> information for a crawl database. This will be a utility for getting a dump 
> of the protocol status codes that builds off of NUTCH-2129





[jira] [Updated] (NUTCH-2146) hashCode on the Outlink class

2015-10-30 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-2146:
-
Fix Version/s: 1.11

> hashCode on the Outlink class
> -
>
> Key: NUTCH-2146
> URL: https://issues.apache.org/jira/browse/NUTCH-2146
> Project: Nutch
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.10, 1.11
>Reporter: Jorge Luis Betancourt Gonzalez
>Assignee: Chris A. Mattmann
>Priority: Minor
> Fix For: 1.11
>
>
> The {{Outlink}} class doesn't have a {{hashCode}} method. This doesn't cause 
> any trouble with the already implemented plugins but if a developer tries to 
> use a {{HashSet}} of outlinks in a custom plugin the {{Outlink}} instances 
> with same data (toUrl, anchor) gets added several times. In contrast the 
> {{Inlink}} class does have a {{hashCode}} method:
> https://github.com/apache/nutch/blob/trunk/src/java/org/apache/nutch/crawl/Inlink.java#L75-L77.
>  
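
The behaviour described above can be reproduced with a minimal sketch. The two classes below are hypothetical stand-ins written for illustration, not the Nutch Outlink or Inlink classes; they assume equality is based on the toUrl and anchor fields named in the report.

```java
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

// Hypothetical stand-in: overrides equals() but inherits Object.hashCode().
class LinkNoHash {
  final String toUrl;
  final String anchor;

  LinkNoHash(String toUrl, String anchor) {
    this.toUrl = toUrl;
    this.anchor = anchor;
  }

  @Override
  public boolean equals(Object o) {
    if (!(o instanceof LinkNoHash)) return false;
    LinkNoHash other = (LinkNoHash) o;
    return toUrl.equals(other.toUrl) && anchor.equals(other.anchor);
  }
  // No hashCode(): equal instances get identity-based hash codes,
  // so a HashSet will almost always keep both of them.
}

// Hypothetical stand-in: hashCode() is derived from the same fields as equals().
class LinkWithHash {
  final String toUrl;
  final String anchor;

  LinkWithHash(String toUrl, String anchor) {
    this.toUrl = toUrl;
    this.anchor = anchor;
  }

  @Override
  public boolean equals(Object o) {
    if (!(o instanceof LinkWithHash)) return false;
    LinkWithHash other = (LinkWithHash) o;
    return toUrl.equals(other.toUrl) && anchor.equals(other.anchor);
  }

  @Override
  public int hashCode() {
    return Objects.hash(toUrl, anchor);
  }
}

class OutlinkHashDemo {
  // Adds two equal links to a HashSet and reports how many entries remain.
  static int distinctWithoutHash() {
    Set<LinkNoHash> set = new HashSet<>();
    set.add(new LinkNoHash("http://example.com/", "home"));
    set.add(new LinkNoHash("http://example.com/", "home"));
    return set.size();
  }

  static int distinctWithHash() {
    Set<LinkWithHash> set = new HashSet<>();
    set.add(new LinkWithHash("http://example.com/", "home"));
    set.add(new LinkWithHash("http://example.com/", "home"));
    return set.size();
  }

  public static void main(String[] args) {
    System.out.println("without hashCode: " + distinctWithoutHash());
    System.out.println("with hashCode:    " + distinctWithHash());
  }
}
```

With a field-based hashCode the set keeps a single entry. Without it the duplicate is normally retained, because the two instances only collapse if their identity hash codes happen to collide.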





[jira] [Resolved] (NUTCH-2155) Create a "crawl completeness" utility

2015-10-30 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann resolved NUTCH-2155.
--
Resolution: Fixed

Thanks Mike!

{noformat}
[chipotle:~/tmp/nutch1.11] mattmann% svn commit -m "Fix for NUTCH-2155 Create a 
crawl completeness utility contributed by Michael Joyce  
this closes #83"
Sending        CHANGES.txt
Sending        src/bin/nutch
Adding         src/java/org/apache/nutch/util/CrawlCompletionStats.java
Transmitting file data ...
Committed revision 1711560.
{noformat}


> Create a "crawl completeness" utility
> -
>
> Key: NUTCH-2155
> URL: https://issues.apache.org/jira/browse/NUTCH-2155
> Project: Nutch
>  Issue Type: Improvement
>  Components: util
>Affects Versions: 1.10
>Reporter: Michael Joyce
>Assignee: Chris A. Mattmann
>  Labels: memex
> Fix For: 1.11
>
>
> I've found it useful to have a tool for dumping some "completeness" 
> information from a crawl similar to how domainstats does but including 
> fetched and unfetched counts per domain/host. This is especially nice when 
> doing vertical crawls over a few domains or just to see how much of a 
> host/domain you've covered with your crawl so far.
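
As a rough illustration of the per-host fetched/unfetched tally described above, here is a minimal in-memory sketch. It is not the committed CrawlCompletionStats code; the UrlStatus type, the sample URLs, and the boolean "fetched" flag are assumptions made for the example.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class CrawlCompletenessSketch {

  // Hypothetical record of one crawl entry: a URL and whether it was fetched.
  static final class UrlStatus {
    final String url;
    final boolean fetched;

    UrlStatus(String url, boolean fetched) {
      this.url = url;
      this.fetched = fetched;
    }
  }

  // Returns host -> {fetched count, unfetched count}, sorted by host name.
  static Map<String, int[]> countByHost(List<UrlStatus> entries) {
    Map<String, int[]> counts = new TreeMap<>();
    for (UrlStatus e : entries) {
      String host = URI.create(e.url).getHost();
      if (host == null) {
        continue; // skip URLs without a host component
      }
      int[] c = counts.computeIfAbsent(host, h -> new int[2]);
      c[e.fetched ? 0 : 1]++;
    }
    return counts;
  }

  // Made-up sample data for the demonstration.
  static List<UrlStatus> sample() {
    return Arrays.asList(
        new UrlStatus("http://a.example/page1", true),
        new UrlStatus("http://a.example/page2", false),
        new UrlStatus("http://a.example/page3", true),
        new UrlStatus("http://b.example/index", false));
  }

  public static void main(String[] args) {
    for (Map.Entry<String, int[]> e : countByHost(sample()).entrySet()) {
      System.out.println(e.getKey() + " fetched=" + e.getValue()[0]
          + " unfetched=" + e.getValue()[1]);
    }
  }
}
```

A real utility would read these statuses from the crawl database rather than from a list; the sketch only shows the grouping step.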





[jira] [Assigned] (NUTCH-2146) hashCode on the Outlink class

2015-10-30 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann reassigned NUTCH-2146:


Assignee: Chris A. Mattmann

> hashCode on the Outlink class
> -
>
> Key: NUTCH-2146
> URL: https://issues.apache.org/jira/browse/NUTCH-2146
> Project: Nutch
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.10, 1.11
>Reporter: Jorge Luis Betancourt Gonzalez
>Assignee: Chris A. Mattmann
>Priority: Minor
> Fix For: 1.11
>
>
> The {{Outlink}} class doesn't have a {{hashCode}} method. This doesn't cause 
> any trouble with the already implemented plugins but if a developer tries to 
> use a {{HashSet}} of outlinks in a custom plugin the {{Outlink}} instances 
> with same data (toUrl, anchor) gets added several times. In contrast the 
> {{Inlink}} class does have a {{hashCode}} method:
> https://github.com/apache/nutch/blob/trunk/src/java/org/apache/nutch/crawl/Inlink.java#L75-L77.
>  





[jira] [Work started] (NUTCH-2146) hashCode on the Outlink class

2015-10-30 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on NUTCH-2146 started by Chris A. Mattmann.

> hashCode on the Outlink class
> -
>
> Key: NUTCH-2146
> URL: https://issues.apache.org/jira/browse/NUTCH-2146
> Project: Nutch
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.10, 1.11
>Reporter: Jorge Luis Betancourt Gonzalez
>Assignee: Chris A. Mattmann
>Priority: Minor
> Fix For: 1.11
>
>
> The {{Outlink}} class doesn't have a {{hashCode}} method. This doesn't cause 
> any trouble with the already implemented plugins but if a developer tries to 
> use a {{HashSet}} of outlinks in a custom plugin the {{Outlink}} instances 
> with same data (toUrl, anchor) gets added several times. In contrast the 
> {{Inlink}} class does have a {{hashCode}} method:
> https://github.com/apache/nutch/blob/trunk/src/java/org/apache/nutch/crawl/Inlink.java#L75-L77.
>  





[jira] [Commented] (NUTCH-2146) hashCode on the Outlink class

2015-10-30 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14983402#comment-14983402
 ] 

Chris A. Mattmann commented on NUTCH-2146:
--

Going to commit this shortly.

> hashCode on the Outlink class
> -
>
> Key: NUTCH-2146
> URL: https://issues.apache.org/jira/browse/NUTCH-2146
> Project: Nutch
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.10, 1.11
>Reporter: Jorge Luis Betancourt Gonzalez
>Assignee: Chris A. Mattmann
>Priority: Minor
> Fix For: 1.11
>
>
> The {{Outlink}} class doesn't have a {{hashCode}} method. This doesn't cause 
> any trouble with the already implemented plugins but if a developer tries to 
> use a {{HashSet}} of outlinks in a custom plugin the {{Outlink}} instances 
> with same data (toUrl, anchor) gets added several times. In contrast the 
> {{Inlink}} class does have a {{hashCode}} method:
> https://github.com/apache/nutch/blob/trunk/src/java/org/apache/nutch/crawl/Inlink.java#L75-L77.
>  





[jira] [Work started] (NUTCH-2150) Add ProtocolStatus Utility

2015-10-30 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on NUTCH-2150 started by Chris A. Mattmann.

> Add ProtocolStatus Utility
> --
>
> Key: NUTCH-2150
> URL: https://issues.apache.org/jira/browse/NUTCH-2150
> Project: Nutch
>  Issue Type: Improvement
>  Components: util
>Affects Versions: 1.10
>Reporter: Michael Joyce
>Assignee: Chris A. Mattmann
> Fix For: 1.11
>
>
> It would be nice to have a utility for dumping protocol status code 
> information for a crawl database. This will be a utility for getting a dump 
> of the protocol status codes that builds off of NUTCH-2129





[jira] [Assigned] (NUTCH-2150) Add ProtocolStatus Utility

2015-10-30 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann reassigned NUTCH-2150:


Assignee: Chris A. Mattmann

> Add ProtocolStatus Utility
> --
>
> Key: NUTCH-2150
> URL: https://issues.apache.org/jira/browse/NUTCH-2150
> Project: Nutch
>  Issue Type: Improvement
>  Components: util
>Affects Versions: 1.10
>Reporter: Michael Joyce
>Assignee: Chris A. Mattmann
> Fix For: 1.11
>
>
> It would be nice to have a utility for dumping protocol status code 
> information for a crawl database. This will be a utility for getting a dump 
> of the protocol status codes that builds off of NUTCH-2129





[jira] [Updated] (NUTCH-2150) Add ProtocolStatus Utility

2015-10-30 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-2150:
-
Fix Version/s: (was: 1.12)
   1.11

> Add ProtocolStatus Utility
> --
>
> Key: NUTCH-2150
> URL: https://issues.apache.org/jira/browse/NUTCH-2150
> Project: Nutch
>  Issue Type: Improvement
>  Components: util
>Affects Versions: 1.10
>Reporter: Michael Joyce
>Assignee: Chris A. Mattmann
> Fix For: 1.11
>
>
> It would be nice to have a utility for dumping protocol status code 
> information for a crawl database. This will be a utility for getting a dump 
> of the protocol status codes that builds off of NUTCH-2129





[jira] [Work started] (NUTCH-2155) Create a "crawl completeness" utility

2015-10-30 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on NUTCH-2155 started by Chris A. Mattmann.

> Create a "crawl completeness" utility
> -
>
> Key: NUTCH-2155
> URL: https://issues.apache.org/jira/browse/NUTCH-2155
> Project: Nutch
>  Issue Type: Improvement
>  Components: util
>Affects Versions: 1.10
>Reporter: Michael Joyce
>Assignee: Chris A. Mattmann
>  Labels: memex
> Fix For: 1.11
>
>
> I've found it useful to have a tool for dumping some "completeness" 
> information from a crawl similar to how domainstats does but including 
> fetched and unfetched counts per domain/host. This is especially nice when 
> doing vertical crawls over a few domains or just to see how much of a 
> host/domain you've covered with your crawl so far.





[jira] [Updated] (NUTCH-2155) Create a "crawl completeness" utility

2015-10-30 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-2155:
-
Fix Version/s: (was: 1.12)
   1.11

> Create a "crawl completeness" utility
> -
>
> Key: NUTCH-2155
> URL: https://issues.apache.org/jira/browse/NUTCH-2155
> Project: Nutch
>  Issue Type: Improvement
>  Components: util
>Affects Versions: 1.10
>Reporter: Michael Joyce
>Assignee: Chris A. Mattmann
>  Labels: memex
> Fix For: 1.11
>
>
> I've found it useful to have a tool for dumping some "completeness" 
> information from a crawl similar to how domainstats does but including 
> fetched and unfetched counts per domain/host. This is especially nice when 
> doing vertical crawls over a few domains or just to see how much of a 
> host/domain you've covered with your crawl so far.





[jira] [Assigned] (NUTCH-2155) Create a "crawl completeness" utility

2015-10-30 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann reassigned NUTCH-2155:


Assignee: Chris A. Mattmann

> Create a "crawl completeness" utility
> -
>
> Key: NUTCH-2155
> URL: https://issues.apache.org/jira/browse/NUTCH-2155
> Project: Nutch
>  Issue Type: Improvement
>  Components: util
>Affects Versions: 1.10
>Reporter: Michael Joyce
>Assignee: Chris A. Mattmann
>  Labels: memex
> Fix For: 1.11
>
>
> I've found it useful to have a tool for dumping some "completeness" 
> information from a crawl similar to how domainstats does but including 
> fetched and unfetched counts per domain/host. This is especially nice when 
> doing vertical crawls over a few domains or just to see how much of a 
> host/domain you've covered with your crawl so far.





[jira] [Updated] (NUTCH-2155) Create a "crawl completeness" utility

2015-10-30 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-2155:
-
Labels: memex  (was: )

> Create a "crawl completeness" utility
> -
>
> Key: NUTCH-2155
> URL: https://issues.apache.org/jira/browse/NUTCH-2155
> Project: Nutch
>  Issue Type: Improvement
>  Components: util
>Affects Versions: 1.10
>Reporter: Michael Joyce
>Assignee: Chris A. Mattmann
>  Labels: memex
> Fix For: 1.11
>
>
> I've found it useful to have a tool for dumping some "completeness" 
> information from a crawl similar to how domainstats does but including 
> fetched and unfetched counts per domain/host. This is especially nice when 
> doing vertical crawls over a few domains or just to see how much of a 
> host/domain you've covered with your crawl so far.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (NUTCH-2154) Nutch REST API (DB) suffering NullPointerException

2015-10-30 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann resolved NUTCH-2154.
--
Resolution: Fixed

Thanks Sujen! Thanks Aron!

{noformat}
[chipotle:~/tmp/nutch1.11] mattmann% svn commit -m "Fix for NUTCH-2154 Nutch 
REST API (DB) suffering NullPointerException contributed by Sujen Shah."
Sending        CHANGES.txt
Sending        src/java/org/apache/nutch/service/model/request/DbQuery.java
Sending        src/java/org/apache/nutch/service/resources/DbResource.java
Transmitting file data ...
Committed revision 1711565.
[chipotle:~/tmp/nutch1.11] mattmann% 
{noformat}


> Nutch REST API (DB) suffering NullPointerException
> --
>
> Key: NUTCH-2154
> URL: https://issues.apache.org/jira/browse/NUTCH-2154
> Project: Nutch
>  Issue Type: Bug
>  Components: REST_api
>Affects Versions: 1.11
>Reporter: Aron Ahmadia
>Assignee: Chris A. Mattmann
>Priority: Minor
>  Labels: memex
> Fix For: 1.11
>
> Attachments: NUTCH-2154.patch
>
>
> Not sure what's causing this.  I tried this request both before and after a 
> crawl had completed.
> nutch.py: POST Endpoint: /db/crawldb
> nutch.py: POST Request data: {'type': 'stats', 'crawlId': 
> 'crawl_aahmadia_2015-10-28T13_17_15.034351', 'confId': 'default'}
> nutch.py: POST Request headers: {'Accept': 'application/json'}
> nutch.py: Response headers: {'Date': 'Wed, 28 Oct 2015 17:18:54 GMT', 
> 'Content-Length': '0', 'Server': 'Jetty(8.1.15.v20140411)'}
> nutch.py: Response status: 500
> nutch log:
> java.lang.NullPointerException
>   at org.apache.nutch.crawl.CrawlDbReader.query(CrawlDbReader.java:747)
>   at 
> org.apache.nutch.service.resources.DbResource.crawlDbStats(DbResource.java:95)
>   at 
> org.apache.nutch.service.resources.DbResource.readdb(DbResource.java:52)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:497)
>   at 
> org.apache.cxf.service.invoker.AbstractInvoker.performInvocation(AbstractInvoker.java:181)
>   at 
> org.apache.cxf.service.invoker.AbstractInvoker.invoke(AbstractInvoker.java:97)
>   at org.apache.cxf.jaxrs.JAXRSInvoker.invoke(JAXRSInvoker.java:200)
>   at org.apache.cxf.jaxrs.JAXRSInvoker.invoke(JAXRSInvoker.java:99)
>   at 
> org.apache.cxf.interceptor.ServiceInvokerInterceptor$1.run(ServiceInvokerInterceptor.java:59)
>   at 
> org.apache.cxf.interceptor.ServiceInvokerInterceptor.handleMessage(ServiceInvokerInterceptor.java:96)
>   at 
> org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:307)
>   at 
> org.apache.cxf.transport.ChainInitiationObserver.onMessage(ChainInitiationObserver.java:121)
>   at 
> org.apache.cxf.transport.http.AbstractHTTPDestination.invoke(AbstractHTTPDestination.java:251)
>   at 
> org.apache.cxf.transport.http_jetty.JettyHTTPDestination.doService(JettyHTTPDestination.java:261)
>   at 
> org.apache.cxf.transport.http_jetty.JettyHTTPHandler.handle(JettyHTTPHandler.java:70)
>   at 
> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1088)
>   at 
> org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1024)
>   at 
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
>   at 
> org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
>   at 
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
>   at org.eclipse.jetty.server.Server.handle(Server.java:370)
>   at 
> org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:494)
>   at 
> org.eclipse.jetty.server.AbstractHttpConnection.content(AbstractHttpConnection.java:982)
>   at 
> org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:1043)
>   at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:865)
>   at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:240)
>   at 
> org.eclipse.jetty.server.AsyncHttpConnection.handle(AsyncHttpConnection.java:82)
>   at 
> org.eclipse.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:696)
>   at 
> org.eclipse.jetty.io.nio.SelectChannelEndPoint$1.run(SelectChannelEndPoint.java:53)
>   at 
> org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
>   at 
> 

[jira] [Resolved] (NUTCH-1800) Documentation for Nutch 1.X and 2.X REST APIs

2015-10-30 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann resolved NUTCH-1800.
--
Resolution: Fixed

this is done, thanks Lewis

> Documentation for Nutch 1.X and 2.X REST APIs
> -
>
> Key: NUTCH-1800
> URL: https://issues.apache.org/jira/browse/NUTCH-1800
> Project: Nutch
>  Issue Type: New Feature
>  Components: documentation, REST_api
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
> Fix For: 1.11, 2.3.1
>
> Attachments: NUTCH-1800.patch
>
>
> This issue should build on NUTCH-1769 with full Java documentation for all 
> classes in the following packages
> org.apache.nutch.api.*
> I am assigning this one to [~fjodor.vershinin] as he is doing an excellent 
> job on the REST API. His UML graphic in [0] and commentary show that he has 
> a good understanding of the REST API and its functionality.
> Thank you [~fjodor.vershinin] great work.
> [0] https://wiki.apache.org/nutch/NutchRESTAPI#UML_Graphic





[jira] [Resolved] (NUTCH-2146) hashCode on the Outlink class

2015-10-30 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann resolved NUTCH-2146.
--
Resolution: Fixed

Thanks [~jorgelbg]!

{noformat}
[chipotle:~/tmp/nutch1.11] mattmann% svn commit -m "Fix for NUTCH-2146 hashCode 
on the Outlink class contributed by Jorge Luis Betancourt 
 this closes #79."
Sending        CHANGES.txt
Sending        src/java/org/apache/nutch/parse/Outlink.java
Adding         src/plugin/index-links/src/test/org/apache/nutch/parse
Adding         src/plugin/index-links/src/test/org/apache/nutch/parse/TestOutlinks.java
Transmitting file data ...
Committed revision 1711561.
[chipotle:~/tmp/nutch1.11] mattmann% 
{noformat}


> hashCode on the Outlink class
> -
>
> Key: NUTCH-2146
> URL: https://issues.apache.org/jira/browse/NUTCH-2146
> Project: Nutch
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.10, 1.11
>Reporter: Jorge Luis Betancourt Gonzalez
>Assignee: Chris A. Mattmann
>Priority: Minor
> Fix For: 1.11
>
>
> The {{Outlink}} class doesn't have a {{hashCode}} method. This doesn't cause 
> any trouble with the already implemented plugins but if a developer tries to 
> use a {{HashSet}} of outlinks in a custom plugin the {{Outlink}} instances 
> with same data (toUrl, anchor) gets added several times. In contrast the 
> {{Inlink}} class does have a {{hashCode}} method:
> https://github.com/apache/nutch/blob/trunk/src/java/org/apache/nutch/crawl/Inlink.java#L75-L77.
>  





[jira] [Commented] (NUTCH-2153) Nutch REST API (DB) uses POST instead of GET to request

2015-10-28 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14978748#comment-14978748
 ] 

Chris A. Mattmann commented on NUTCH-2153:
--

can you be more specific here, [~ahmadia]?

> Nutch REST API (DB) uses POST instead of GET to request
> ---
>
> Key: NUTCH-2153
> URL: https://issues.apache.org/jira/browse/NUTCH-2153
> Project: Nutch
>  Issue Type: Bug
>  Components: REST_api
>Affects Versions: 1.10
>Reporter: Aron Ahmadia
>Priority: Trivial
>  Labels: memex
>






[jira] [Commented] (NUTCH-2153) Nutch REST API (DB) uses POST instead of GET to request

2015-10-28 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14978854#comment-14978854
 ] 

Chris A. Mattmann commented on NUTCH-2153:
--

Yeah I think we may want to do something async here too and use GET. Let's 
think about this. It may be a 1.12+ improvement though. At a minimum I think we 
can update to GET for 1.11.

> Nutch REST API (DB) uses POST instead of GET to request
> ---
>
> Key: NUTCH-2153
> URL: https://issues.apache.org/jira/browse/NUTCH-2153
> Project: Nutch
>  Issue Type: Bug
>  Components: REST_api
>Affects Versions: 1.11
>Reporter: Aron Ahmadia
>Priority: Trivial
>  Labels: memex
>






[jira] [Updated] (NUTCH-2154) Nutch REST API (DB) suffering NullPointerException

2015-10-28 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-2154:
-
Fix Version/s: 1.11

> Nutch REST API (DB) suffering NullPointerException
> --
>
> Key: NUTCH-2154
> URL: https://issues.apache.org/jira/browse/NUTCH-2154
> Project: Nutch
>  Issue Type: Bug
>  Components: REST_api
>Affects Versions: 1.11
>Reporter: Aron Ahmadia
>Assignee: Chris A. Mattmann
>Priority: Minor
>  Labels: memex
> Fix For: 1.11
>
>
> Not sure what's causing this.  I tried this request both before and after a 
> crawl had completed.
> nutch.py: POST Endpoint: /db/crawldb
> nutch.py: POST Request data: {'type': 'stats', 'crawlId': 
> 'crawl_aahmadia_2015-10-28T13_17_15.034351', 'confId': 'default'}
> nutch.py: POST Request headers: {'Accept': 'application/json'}
> nutch.py: Response headers: {'Date': 'Wed, 28 Oct 2015 17:18:54 GMT', 
> 'Content-Length': '0', 'Server': 'Jetty(8.1.15.v20140411)'}
> nutch.py: Response status: 500
> nutch log:
> java.lang.NullPointerException
>   at org.apache.nutch.crawl.CrawlDbReader.query(CrawlDbReader.java:747)
>   at 
> org.apache.nutch.service.resources.DbResource.crawlDbStats(DbResource.java:95)
>   at 
> org.apache.nutch.service.resources.DbResource.readdb(DbResource.java:52)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:497)
>   at 
> org.apache.cxf.service.invoker.AbstractInvoker.performInvocation(AbstractInvoker.java:181)
>   at 
> org.apache.cxf.service.invoker.AbstractInvoker.invoke(AbstractInvoker.java:97)
>   at org.apache.cxf.jaxrs.JAXRSInvoker.invoke(JAXRSInvoker.java:200)
>   at org.apache.cxf.jaxrs.JAXRSInvoker.invoke(JAXRSInvoker.java:99)
>   at 
> org.apache.cxf.interceptor.ServiceInvokerInterceptor$1.run(ServiceInvokerInterceptor.java:59)
>   at 
> org.apache.cxf.interceptor.ServiceInvokerInterceptor.handleMessage(ServiceInvokerInterceptor.java:96)
>   at 
> org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:307)
>   at 
> org.apache.cxf.transport.ChainInitiationObserver.onMessage(ChainInitiationObserver.java:121)
>   at 
> org.apache.cxf.transport.http.AbstractHTTPDestination.invoke(AbstractHTTPDestination.java:251)
>   at 
> org.apache.cxf.transport.http_jetty.JettyHTTPDestination.doService(JettyHTTPDestination.java:261)
>   at 
> org.apache.cxf.transport.http_jetty.JettyHTTPHandler.handle(JettyHTTPHandler.java:70)
>   at 
> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1088)
>   at 
> org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1024)
>   at 
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
>   at 
> org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
>   at 
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
>   at org.eclipse.jetty.server.Server.handle(Server.java:370)
>   at 
> org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:494)
>   at 
> org.eclipse.jetty.server.AbstractHttpConnection.content(AbstractHttpConnection.java:982)
>   at 
> org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:1043)
>   at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:865)
>   at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:240)
>   at 
> org.eclipse.jetty.server.AsyncHttpConnection.handle(AsyncHttpConnection.java:82)
>   at 
> org.eclipse.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:696)
>   at 
> org.eclipse.jetty.io.nio.SelectChannelEndPoint$1.run(SelectChannelEndPoint.java:53)
>   at 
> org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
>   at 
> org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
>   at java.lang.Thread.run(Thread.java:745)
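
For anyone trying to reproduce the request shown in the report without nutch.py, the sketch below builds an equivalent POST with the JDK's java.net.http API (Java 11 or later). The endpoint path, body fields, and Accept header are taken from the log above. The base URL http://localhost:8081 and the Content-Type header are assumptions, and the request is only constructed here, not sent.

```java
import java.net.URI;
import java.net.http.HttpRequest;

class CrawlDbStatsRequestSketch {

  // Builds the JSON body from the report's fields. Plain string
  // concatenation is used, so the values must not need JSON escaping.
  static String statsBody(String crawlId, String confId) {
    return "{\"type\":\"stats\",\"crawlId\":\"" + crawlId
        + "\",\"confId\":\"" + confId + "\"}";
  }

  // baseUrl is an assumption, e.g. "http://localhost:8081".
  static HttpRequest build(String baseUrl, String crawlId, String confId) {
    return HttpRequest.newBuilder(URI.create(baseUrl + "/db/crawldb"))
        .header("Accept", "application/json")
        // Assumed header; the log in the report lists only Accept.
        .header("Content-Type", "application/json")
        .POST(HttpRequest.BodyPublishers.ofString(statsBody(crawlId, confId)))
        .build();
  }

  public static void main(String[] args) {
    HttpRequest request = build("http://localhost:8081",
        "crawl_aahmadia_2015-10-28T13_17_15.034351", "default");
    System.out.println(request.method() + " " + request.uri());
    System.out.println(statsBody("crawl_aahmadia_2015-10-28T13_17_15.034351", "default"));
  }
}
```

Sending the request would additionally need an HttpClient and a running Nutch server; per the report, the affected build answered this call with status 500.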





[jira] [Assigned] (NUTCH-2154) Nutch REST API (DB) suffering NullPointerException

2015-10-28 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann reassigned NUTCH-2154:


Assignee: Chris A. Mattmann

> Nutch REST API (DB) suffering NullPointerException
> --
>
> Key: NUTCH-2154
> URL: https://issues.apache.org/jira/browse/NUTCH-2154
> Project: Nutch
>  Issue Type: Bug
>  Components: REST_api
>Affects Versions: 1.11
>Reporter: Aron Ahmadia
>Assignee: Chris A. Mattmann
>Priority: Minor
>  Labels: memex
> Fix For: 1.11
>
>
> Not sure what's causing this.  I tried this request both before and after a 
> crawl had completed.
> nutch.py: POST Endpoint: /db/crawldb
> nutch.py: POST Request data: {'type': 'stats', 'crawlId': 
> 'crawl_aahmadia_2015-10-28T13_17_15.034351', 'confId': 'default'}
> nutch.py: POST Request headers: {'Accept': 'application/json'}
> nutch.py: Response headers: {'Date': 'Wed, 28 Oct 2015 17:18:54 GMT', 
> 'Content-Length': '0', 'Server': 'Jetty(8.1.15.v20140411)'}
> nutch.py: Response status: 500
> nutch log:
> java.lang.NullPointerException
>   at org.apache.nutch.crawl.CrawlDbReader.query(CrawlDbReader.java:747)
>   at 
> org.apache.nutch.service.resources.DbResource.crawlDbStats(DbResource.java:95)
>   at 
> org.apache.nutch.service.resources.DbResource.readdb(DbResource.java:52)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:497)
>   at 
> org.apache.cxf.service.invoker.AbstractInvoker.performInvocation(AbstractInvoker.java:181)
>   at 
> org.apache.cxf.service.invoker.AbstractInvoker.invoke(AbstractInvoker.java:97)
>   at org.apache.cxf.jaxrs.JAXRSInvoker.invoke(JAXRSInvoker.java:200)
>   at org.apache.cxf.jaxrs.JAXRSInvoker.invoke(JAXRSInvoker.java:99)
>   at 
> org.apache.cxf.interceptor.ServiceInvokerInterceptor$1.run(ServiceInvokerInterceptor.java:59)
>   at 
> org.apache.cxf.interceptor.ServiceInvokerInterceptor.handleMessage(ServiceInvokerInterceptor.java:96)
>   at 
> org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:307)
>   at 
> org.apache.cxf.transport.ChainInitiationObserver.onMessage(ChainInitiationObserver.java:121)
>   at 
> org.apache.cxf.transport.http.AbstractHTTPDestination.invoke(AbstractHTTPDestination.java:251)
>   at 
> org.apache.cxf.transport.http_jetty.JettyHTTPDestination.doService(JettyHTTPDestination.java:261)
>   at 
> org.apache.cxf.transport.http_jetty.JettyHTTPHandler.handle(JettyHTTPHandler.java:70)
>   at 
> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1088)
>   at 
> org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1024)
>   at 
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
>   at 
> org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
>   at 
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
>   at org.eclipse.jetty.server.Server.handle(Server.java:370)
>   at 
> org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:494)
>   at 
> org.eclipse.jetty.server.AbstractHttpConnection.content(AbstractHttpConnection.java:982)
>   at 
> org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:1043)
>   at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:865)
>   at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:240)
>   at 
> org.eclipse.jetty.server.AsyncHttpConnection.handle(AsyncHttpConnection.java:82)
>   at 
> org.eclipse.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:696)
>   at 
> org.eclipse.jetty.io.nio.SelectChannelEndPoint$1.run(SelectChannelEndPoint.java:53)
>   at 
> org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
>   at 
> org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
>   at java.lang.Thread.run(Thread.java:745)





[jira] [Commented] (NUTCH-2153) Nutch REST API (DB) uses POST instead of GET to request

2015-10-28 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14978769#comment-14978769
 ] 

Chris A. Mattmann commented on NUTCH-2153:
--

Gotcha, thanks [~ahmadia]

> Nutch REST API (DB) uses POST instead of GET to request
> ---
>
> Key: NUTCH-2153
> URL: https://issues.apache.org/jira/browse/NUTCH-2153
> Project: Nutch
>  Issue Type: Bug
>  Components: REST_api
>Affects Versions: 1.10
>Reporter: Aron Ahmadia
>Priority: Trivial
>  Labels: memex
>






[jira] [Commented] (NUTCH-2154) Nutch REST API (DB) suffering NullPointerException

2015-10-28 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14978811#comment-14978811
 ] 

Chris A. Mattmann commented on NUTCH-2154:
--

I have to respin 1.11 anyways, so I'll take a look at this real quick 
[~ahmadia] thanks!

> Nutch REST API (DB) suffering NullPointerException
> --
>
> Key: NUTCH-2154
> URL: https://issues.apache.org/jira/browse/NUTCH-2154
> Project: Nutch
>  Issue Type: Bug
>  Components: REST_api
>Affects Versions: 1.11
>Reporter: Aron Ahmadia
>Assignee: Chris A. Mattmann
>Priority: Minor
>  Labels: memex
> Fix For: 1.11
>
>
> Not sure what's causing this.  I tried this request both before and after a 
> crawl had completed.
> nutch.py: POST Endpoint: /db/crawldb
> nutch.py: POST Request data: {'type': 'stats', 'crawlId': 
> 'crawl_aahmadia_2015-10-28T13_17_15.034351', 'confId': 'default'}
> nutch.py: POST Request headers: {'Accept': 'application/json'}
> nutch.py: Response headers: {'Date': 'Wed, 28 Oct 2015 17:18:54 GMT', 
> 'Content-Length': '0', 'Server': 'Jetty(8.1.15.v20140411)'}
> nutch.py: Response status: 500
> nutch log:
> java.lang.NullPointerException
>   at org.apache.nutch.crawl.CrawlDbReader.query(CrawlDbReader.java:747)
>   at 
> org.apache.nutch.service.resources.DbResource.crawlDbStats(DbResource.java:95)
>   at 
> org.apache.nutch.service.resources.DbResource.readdb(DbResource.java:52)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:497)
>   at 
> org.apache.cxf.service.invoker.AbstractInvoker.performInvocation(AbstractInvoker.java:181)
>   at 
> org.apache.cxf.service.invoker.AbstractInvoker.invoke(AbstractInvoker.java:97)
>   at org.apache.cxf.jaxrs.JAXRSInvoker.invoke(JAXRSInvoker.java:200)
>   at org.apache.cxf.jaxrs.JAXRSInvoker.invoke(JAXRSInvoker.java:99)
>   at 
> org.apache.cxf.interceptor.ServiceInvokerInterceptor$1.run(ServiceInvokerInterceptor.java:59)
>   at 
> org.apache.cxf.interceptor.ServiceInvokerInterceptor.handleMessage(ServiceInvokerInterceptor.java:96)
>   at 
> org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:307)
>   at 
> org.apache.cxf.transport.ChainInitiationObserver.onMessage(ChainInitiationObserver.java:121)
>   at 
> org.apache.cxf.transport.http.AbstractHTTPDestination.invoke(AbstractHTTPDestination.java:251)
>   at 
> org.apache.cxf.transport.http_jetty.JettyHTTPDestination.doService(JettyHTTPDestination.java:261)
>   at 
> org.apache.cxf.transport.http_jetty.JettyHTTPHandler.handle(JettyHTTPHandler.java:70)
>   at 
> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1088)
>   at 
> org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1024)
>   at 
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
>   at 
> org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
>   at 
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
>   at org.eclipse.jetty.server.Server.handle(Server.java:370)
>   at 
> org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:494)
>   at 
> org.eclipse.jetty.server.AbstractHttpConnection.content(AbstractHttpConnection.java:982)
>   at 
> org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:1043)
>   at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:865)
>   at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:240)
>   at 
> org.eclipse.jetty.server.AsyncHttpConnection.handle(AsyncHttpConnection.java:82)
>   at 
> org.eclipse.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:696)
>   at 
> org.eclipse.jetty.io.nio.SelectChannelEndPoint$1.run(SelectChannelEndPoint.java:53)
>   at 
> org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
>   at 
> org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
>   at java.lang.Thread.run(Thread.java:745)





[jira] [Updated] (NUTCH-2147) LanguagePreferenceScoringFilter for Nutch

2015-10-25 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-2147:
-
Fix Version/s: (was: 1.11)
   1.12

> LanguagePreferenceScoringFilter for Nutch
> -
>
> Key: NUTCH-2147
> URL: https://issues.apache.org/jira/browse/NUTCH-2147
> Project: Nutch
>  Issue Type: New Feature
>  Components: plugin, scoring
>Affects Versions: 1.10
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
> Fix For: 1.12
>
>
> Based on the implementation of a LanguagePreferenceScoringFilter Nutch could 
> easily be made into a directed crawler based on crawl administrator ranking 
> preferences of languages we wish to crawl. 
> Right now this is not possible.
> We already detect and index language within the language-identifier plugin as 
> well as within parse-tika IIRC; however, currently the presence of a language 
> does not affect scoring of pages.
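
A rough sketch of the idea described above, assuming a simple per-language 
weight that multiplies the page score. The class name, the weight map and the 
adjustScore method are hypothetical; this is not the Nutch ScoringFilter 
interface and not the implementation proposed in the issue.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical sketch only; not the Nutch ScoringFilter API. */
public class LanguagePreferenceSketch {
    // Administrator-defined weight per language code (assumed configuration).
    private final Map<String, Float> weights = new HashMap<>();
    // Weight applied when the language is unknown or has no entry.
    private final float defaultWeight;

    public LanguagePreferenceSketch(Map<String, Float> weights, float defaultWeight) {
        this.weights.putAll(weights);
        this.defaultWeight = defaultWeight;
    }

    /** Scales a page score by the weight of its detected language. */
    public float adjustScore(float score, String detectedLang) {
        if (detectedLang == null) {
            return score * defaultWeight;
        }
        Float weight = weights.get(detectedLang.toLowerCase(Locale.ROOT));
        return score * (weight != null ? weight : defaultWeight);
    }
}
```

With weights such as en=2.0 and de=0.5, pages detected as English would be 
boosted and German pages demoted, which is one way a crawl could be steered 
toward preferred languages.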





[jira] [Commented] (NUTCH-2149) REST endpoint to read Nutch sequence files

2015-10-25 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14973368#comment-14973368
 ] 

Chris A. Mattmann commented on NUTCH-2149:
--

In the future, [~sujenshah], reference the GitHub issue in your commit message 
(i.e. say "this closes #80") and the asfgit user will close the issue on 
GitHub for you.

> REST endpoint to read Nutch sequence files
> --
>
> Key: NUTCH-2149
> URL: https://issues.apache.org/jira/browse/NUTCH-2149
> Project: Nutch
>  Issue Type: New Feature
>  Components: REST_api
>Reporter: Sujen Shah
>Assignee: Sujen Shah
>  Labels: memex
> Fix For: 1.12
>
>
> This endpoint enables reading of the webgraph data like nodes, links and any 
> other sequence file in the Nutch ecosystem via a RESTful interface. 
> The current API documentation for this Reader endpoint is available at - 
> http://docs.nutchpytonutchrestapi.apiary.io/
> Thanks to https://github.com/ContinuumIO/nutchpy for the initial work. 





[jira] [Updated] (NUTCH-2133) Transfer Selenium Documentation to Wiki

2015-10-18 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-2133:
-
Fix Version/s: (was: 1.11)
   (was: 2.4)
   1.12

> Transfer Selenium Documentation to Wiki
> ---
>
> Key: NUTCH-2133
> URL: https://issues.apache.org/jira/browse/NUTCH-2133
> Project: Nutch
>  Issue Type: Improvement
>  Components: documentation
>Affects Versions: 2.3, 1.10
>Reporter: Michael Joyce
> Fix For: 1.12
>
>
> There's a decent chunk of Selenium related documentation stuck in READMEs for 
> various plugins. It would be nice to get this stuff pushed to the wiki.
> E.G.: 
> https://github.com/apache/nutch/blob/trunk/src/plugin/protocol-selenium/README.md





[jira] [Updated] (NUTCH-2030) ParseZip plugin is not able to extract language from zip document,this could solve that problem.

2015-10-18 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-2030:
-
Fix Version/s: (was: 1.11)
   1.12

> ParseZip plugin is not able to extract language from zip document,this could 
> solve that problem.
> 
>
> Key: NUTCH-2030
> URL: https://issues.apache.org/jira/browse/NUTCH-2030
> Project: Nutch
>  Issue Type: Improvement
>  Components: parser, plugin
> Environment: Linux Mint 17 qiana, 4 GB Ram,Core I3.
>Reporter: Eyeris Rodriguez Rueda
>Priority: Minor
> Fix For: 1.12
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> Currently the parse-zip plugin doesn't extract the language from a zip 
> document, so the lang field is empty in Solr or Elasticsearch. If the package 
> (.zip) contains several documents, the lang field could be multivalued to 
> hold the languages of all of them. A simple change to the parse-zip plugin 
> could fix this problem. I will use the Language Identifier class from Tika 
> and analyze each document inside.





[jira] [Updated] (NUTCH-2086) Nutch 1.X Webui

2015-10-18 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-2086:
-
Fix Version/s: (was: 1.11)
   1.12

> Nutch 1.X Webui 
> 
>
> Key: NUTCH-2086
> URL: https://issues.apache.org/jira/browse/NUTCH-2086
> Project: Nutch
>  Issue Type: New Feature
>  Components: REST_api, web gui
>Reporter: Sujen Shah
>Assignee: Chris A. Mattmann
>  Labels: memex
> Fix For: 1.12
>
> Attachments: NUTCH-2086.patch
>
>
> To port the Apache Wicket based webui in Nutch 2.X to 1.X





[jira] [Updated] (NUTCH-2135) Ant Eclipse build does not include protocol-interactiveselenium

2015-10-18 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-2135:
-
Fix Version/s: (was: 1.11)
   1.12

> Ant Eclipse build does not include protocol-interactiveselenium
> ---
>
> Key: NUTCH-2135
> URL: https://issues.apache.org/jira/browse/NUTCH-2135
> Project: Nutch
>  Issue Type: Improvement
>  Components: protocol
>Reporter: Sujen Shah
>Priority: Minor
>  Labels: memex
> Fix For: 1.12
>
>
> The eclipse target in the build.xml file does not include 
> protocol-interactiveselenium, so when importing the project into Eclipse, 
> that folder is not added.  
> On adding it to the build file, I found that Eclipse throws errors because 
> the package naming in classes belonging to 
> org.apache.nutch.protocol.interactiveselenium.handlers is incomplete. 
> I have made both of those changes in this PR. 





[jira] [Updated] (NUTCH-2132) Publisher/Subscriber model for Nutch to emit events

2015-10-18 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-2132:
-
Fix Version/s: (was: 1.11)
   1.12

> Publisher/Subscriber model for Nutch to emit events 
> 
>
> Key: NUTCH-2132
> URL: https://issues.apache.org/jira/browse/NUTCH-2132
> Project: Nutch
>  Issue Type: New Feature
>  Components: fetcher, REST_api
>Reporter: Sujen Shah
>  Labels: memex
> Fix For: 1.12
>
> Attachments: NUTCH-2132.patch
>
>
> It would be nice to have a Pub/Sub model in Nutch to emit certain events (e.g. 
> Fetcher events like fetch-start, fetch-end, a fetch report which may contain 
> data like outlinks of the currently fetched URL, score, etc). 
> A consumer of this functionality could use this data to generate real-time 
> visualizations and statistics of the crawl without having to wait for 
> the fetch round to finish. 
> The REST API could contain an endpoint which would respond with a URL to 
> which a client could subscribe to get the fetcher events. 





[jira] [Updated] (NUTCH-2140) Atomic update and optimistic concurrency update using Solr

2015-10-18 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-2140:
-
Fix Version/s: (was: 1.11)
   1.12

> Atomic update and optimistic concurrency update using Solr
> --
>
> Key: NUTCH-2140
> URL: https://issues.apache.org/jira/browse/NUTCH-2140
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer, plugin
>Affects Versions: 1.9
>Reporter: Roannel Fernández Hernández
> Fix For: 1.12
>
>
> The SOLRIndexWriter plugin indexes documents into a Solr server. 
> The plugin replaces documents that are already indexed in Solr. 
> Sometimes it is useful to replace only one field, or add new fields, while 
> keeping the other values of the indexed documents.
> Solr supports two approaches for this task: atomic update and optimistic 
> concurrency update. However, the SOLRIndexWriter plugin doesn't support those 
> approaches.
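
To illustrate what such an update looks like, here is a minimal sketch that 
builds the document structure as plain Java maps. It relies on two Solr 
conventions: a field whose value is a map with a modifier such as "set" is 
treated as an atomic update, and a _version_ field greater than 1 makes Solr 
reject the update if the stored version differs. The class and method names 
are hypothetical, "id" as the unique key is an assumption, and none of this 
is SOLRIndexWriter code.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch; not part of the SOLRIndexWriter plugin. */
public class AtomicUpdateSketch {
    /**
     * Builds an update document that replaces a single field and leaves the
     * other stored fields untouched. Pass null as expectedVersion to skip
     * the optimistic concurrency check.
     */
    public static Map<String, Object> setField(String id, String field,
                                               Object value, Long expectedVersion) {
        Map<String, Object> doc = new LinkedHashMap<>();
        // Assumption: the schema's unique key field is named "id".
        doc.put("id", id);
        // "set" is the atomic-update modifier that replaces the field value.
        doc.put(field, Collections.singletonMap("set", value));
        if (expectedVersion != null) {
            // Optimistic concurrency: the update only succeeds if the
            // document's current _version_ matches this value.
            doc.put("_version_", expectedVersion);
        }
        return doc;
    }
}
```

Serialized as JSON, the result has the shape 
{"id": ..., "title": {"set": ...}, "_version_": ...}.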





[jira] [Updated] (NUTCH-2120) Remove MapWritable from trunk codebase

2015-10-18 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-2120:
-
Fix Version/s: (was: 1.11)
   1.12

> Remove MapWritable from trunk codebase
> --
>
> Key: NUTCH-2120
> URL: https://issues.apache.org/jira/browse/NUTCH-2120
> Project: Nutch
>  Issue Type: Bug
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Trivial
> Fix For: 1.12
>
>
> [MapWritable|http://nutch.apache.org/apidocs/apidocs-1.10/index.html?org/apache/nutch/crawl/MapWritable.htm]
>  has been deprecated for a good while.
> We should remove it from the codebase and make sure we are not using it 
> anywhere (I don't think we are).





[jira] [Work started] (NUTCH-2141) Change the InteractiveSelenium plugin handler Interface to return page content

2015-10-18 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on NUTCH-2141 started by Chris A. Mattmann.

> Change the InteractiveSelenium plugin handler Interface to return page content
> --
>
> Key: NUTCH-2141
> URL: https://issues.apache.org/jira/browse/NUTCH-2141
> Project: Nutch
>  Issue Type: Improvement
>  Components: plugin
>Reporter: Balaji Gurumurthy
>Assignee: Chris A. Mattmann
>  Labels: selenium
>
> The handler interface in the protocol-interactiveselenium plugin currently 
> provides methods to manipulate the page content and the HTTPResponse class 
> reads the page content from the driver. This limits the amount of HTML 
> content that could be returned to Nutch.
> The processDriver method could return a String object instead. This is 
> particularly helpful  in cases such as handling pagination when multiple 
> pages' content can be appended and returned from the handler. 
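
The pagination case mentioned above could look roughly like the sketch below. 
PageSource is a hypothetical stand-in for the browser driver (it is not 
Selenium's WebDriver), and ContentHandler is an assumed shape for a handler 
whose processDriver returns a String. The actual plugin interface may differ.

```java
import java.util.Iterator;
import java.util.List;

/** Hypothetical sketch; not the protocol-interactiveselenium code. */
public class PaginationHandlerSketch {
    /** Stand-in for a browser driver that can step through result pages. */
    public interface PageSource {
        String currentPage();
        /** Moves to the next page; returns false when there is none. */
        boolean nextPage();
    }

    /** Assumed handler contract: return the content rather than void. */
    public interface ContentHandler {
        String processDriver(PageSource driver);
    }

    /** Appends the content of every page and returns it as one string. */
    public static class AppendAllPages implements ContentHandler {
        @Override
        public String processDriver(PageSource driver) {
            StringBuilder content = new StringBuilder();
            do {
                content.append(driver.currentPage());
            } while (driver.nextPage());
            return content.toString();
        }
    }

    /** In-memory PageSource for demonstration; pages must be non-empty. */
    public static PageSource fixedPages(List<String> pages) {
        final Iterator<String> it = pages.iterator();
        return new PageSource() {
            private String current = it.next();
            @Override
            public String currentPage() { return current; }
            @Override
            public boolean nextPage() {
                if (!it.hasNext()) {
                    return false;
                }
                current = it.next();
                return true;
            }
        };
    }
}
```

Because the handler returns the accumulated content, the caller is no longer 
limited to whatever single page the driver happens to be showing when the 
handler finishes.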





[jira] [Updated] (NUTCH-2128) Refactor configuration end point

2015-10-18 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-2128:
-
Fix Version/s: (was: 1.11)
   1.12

> Refactor configuration end point
> 
>
> Key: NUTCH-2128
> URL: https://issues.apache.org/jira/browse/NUTCH-2128
> Project: Nutch
>  Issue Type: Sub-task
>  Components: REST_api
>Reporter: Sujen Shah
>Assignee: Sujen Shah
>Priority: Minor
> Fix For: 1.12
>
>
> To better define the endpoint to create a new configuration and add a new 
> endpoint to update a particular property value of a configuration. 





[jira] [Updated] (NUTCH-1943) Form authentication should not be global and ignore

2015-10-18 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-1943:
-
Fix Version/s: (was: 1.11)
   1.12

> Form authentication should not be global and ignore 
> ---
>
> Key: NUTCH-1943
> URL: https://issues.apache.org/jira/browse/NUTCH-1943
> Project: Nutch
>  Issue Type: Improvement
>  Components: plugin, protocol
>Affects Versions: 1.10
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>  Labels: memex
> Fix For: 1.12
>
>
> Taken from [~wastl-nagel]'s comments on NUTCH-827
> bq. the form authentication is global and ignores . So you have to 
> restrict your crawl to the form authentication pages only. Ideally, also form 
> authentication should be bound to a scope (one host, one URL prefix, etc.) 
> same as HTTP authentication.





[jira] [Updated] (NUTCH-2064) URLNormalizer basic to properly encode non-ASCII characters

2015-10-18 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-2064:
-
Fix Version/s: (was: 1.11)
   1.12

> URLNormalizer basic to properly encode non-ASCII characters
> ---
>
> Key: NUTCH-2064
> URL: https://issues.apache.org/jira/browse/NUTCH-2064
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 1.10
>Reporter: Markus Jelsma
> Fix For: 1.12
>
> Attachments: NUTCH-1098.patch, NUTCH-1098.patch, NUTCH-2064-v3.patch, 
> NUTCH-2064.patch
>
>
> NUTCH-1098 rewritten to work on trunk. Unit test is identical to 1098.





[jira] [Resolved] (NUTCH-2141) Change the InteractiveSelenium plugin handler Interface to return page content

2015-10-18 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann resolved NUTCH-2141.
--
   Resolution: Fixed
Fix Version/s: 1.11

Thanks [~BalaJira] [~jo...@apache.org], plenty to improve on but a great start!

{noformat}
[chipotle:~/tmp/nutch1.11] mattmann% svn commit -m "Fix for NUTCH-2141: Change 
the InteractiveSelenium plugin handler Interface to return page content 
contributed by Balaji  this closes #77 #75"
Sending        CHANGES.txt
Sending        src/plugin/protocol-interactiveselenium/src/java/org/apache/nutch/protocol/interactiveselenium/HttpResponse.java
Sending        src/plugin/protocol-interactiveselenium/src/java/org/apache/nutch/protocol/interactiveselenium/handlers/DefalultMultiInteractionHandler.java
Sending        src/plugin/protocol-interactiveselenium/src/java/org/apache/nutch/protocol/interactiveselenium/handlers/DefaultClickAllAjaxLinksHandler.java
Sending        src/plugin/protocol-interactiveselenium/src/java/org/apache/nutch/protocol/interactiveselenium/handlers/DefaultHandler.java
Sending        src/plugin/protocol-interactiveselenium/src/java/org/apache/nutch/protocol/interactiveselenium/handlers/InteractiveSeleniumHandler.java
Transmitting file data ..
Committed revision 1709307.
[chipotle:~/tmp/nutch1.11] mattmann% 
{noformat}


> Change the InteractiveSelenium plugin handler Interface to return page content
> --
>
> Key: NUTCH-2141
> URL: https://issues.apache.org/jira/browse/NUTCH-2141
> Project: Nutch
>  Issue Type: Improvement
>  Components: plugin
>Reporter: Balaji Gurumurthy
>Assignee: Chris A. Mattmann
>  Labels: selenium
> Fix For: 1.11
>
>
> The handler interface in the protocol-interactiveselenium plugin currently 
> provides methods to manipulate the page content and the HTTPResponse class 
> reads the page content from the driver. This limits the amount of HTML 
> content that could be returned to Nutch.
> The processDriver method could return a String object instead. This is 
> particularly helpful  in cases such as handling pagination when multiple 
> pages' content can be appended and returned from the handler. 





[jira] [Assigned] (NUTCH-2142) Nutch File Dump - FileNotFoundException (Invalid Argument) Error

2015-10-18 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann reassigned NUTCH-2142:


Assignee: Chris A. Mattmann

> Nutch File Dump - FileNotFoundException (Invalid Argument) Error
> 
>
> Key: NUTCH-2142
> URL: https://issues.apache.org/jira/browse/NUTCH-2142
> Project: Nutch
>  Issue Type: Bug
>  Components: tool, util
>Affects Versions: 1.10, 1.11
> Environment: Operating System - Linux (RHEL 6.2)
>Reporter: Karanjeet Singh
>Assignee: Chris A. Mattmann
>  Labels: dump, nutch
> Fix For: 1.11
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> Got *FileNotFoundException* while running nutch dump.
> *Cause*: Character '?' in file name/extension producing the below error.
> *Error Details*
> java.io.FileNotFoundException: 
> /media/PATRO/Karan/nutch_12Oct/other_gun_urls/img/99/fb/97d3980f9954b597f372d092b97eff22_27tlt_recon_1_black_g_10_handle_.jpeg?
>  (Invalid argument)
> at java.io.FileOutputStream.open(Native Method)
> at java.io.FileOutputStream.<init>(FileOutputStream.java:221)
> at java.io.FileOutputStream.<init>(FileOutputStream.java:171)
> at org.apache.nutch.tools.FileDumper.dump(FileDumper.java:222)
> at org.apache.nutch.tools.FileDumper.main(FileDumper.java:325)
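
One possible workaround, sketched here under the assumption that the target 
file system rejects characters such as '?', is to replace them before 
opening the output file. The class and method names are hypothetical and this 
is not necessarily the fix applied to FileDumper.

```java
/** Hypothetical sketch; not the FileDumper implementation. */
public class DumpFileNameSketch {
    /**
     * Replaces characters that are commonly rejected in file names
     * (backslash, slash, colon, asterisk, question mark, double quote,
     * angle brackets, pipe) with an underscore.
     */
    public static String sanitize(String fileName) {
        return fileName.replaceAll("[\\\\/:*?\"<>|]", "_");
    }
}
```

Applied to a name ending in ".jpeg?", as in the trace above, this would yield 
a name ending in ".jpeg_", which FileOutputStream should be able to open on 
such a file system.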




