[jira] [Commented] (NUTCH-2573) Suspend crawling if robots.txt fails to fetch with 5xx status

2018-04-26 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16454287#comment-16454287
 ] 

Markus Jelsma commented on NUTCH-2573:
--

Sounds like a good idea!

> Suspend crawling if robots.txt fails to fetch with 5xx status
> -
>
> Key: NUTCH-2573
> URL: https://issues.apache.org/jira/browse/NUTCH-2573
> Project: Nutch
>  Issue Type: Improvement
>  Components: fetcher
>Affects Versions: 1.14
>Reporter: Sebastian Nagel
>Priority: Major
> Fix For: 1.15
>
>
> Fetcher should optionally (enabled by default) suspend crawling for a configurable 
> interval when fetching the robots.txt fails with a server error (HTTP status 
> code 5xx, esp. 503), following [Google's 
> spec|https://developers.google.com/search/reference/robots_txt#handling-http-result-codes]:
> ??5xx (server error)??
> ??Server errors are seen as temporary errors that result in a "full disallow" 
> of crawling. The request is retried until a non-server-error HTTP result code 
> is obtained. A 503 (Service Unavailable) error will result in fairly frequent 
> retrying. To temporarily suspend crawling, it is recommended to serve a 503 
> HTTP result code. Handling of a permanent server error is undefined.??
> Crawler-commons robots rules already provide 
> [isDeferVisits|http://crawler-commons.github.io/crawler-commons/0.9/crawlercommons/robots/BaseRobotRules.html#isDeferVisits--]
>  to store this information (must be set from RobotRulesParser).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (NUTCH-2573) Suspend crawling if robots.txt fails to fetch with 5xx status

2018-04-26 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2573:
---
Description: 
Fetcher should optionally (enabled by default) suspend crawling for a configurable 
interval when fetching the robots.txt fails with a server error (HTTP status 
code 5xx, esp. 503), following [Google's 
spec|https://developers.google.com/search/reference/robots_txt#handling-http-result-codes]:
??5xx (server error)??
??Server errors are seen as temporary errors that result in a "full disallow" 
of crawling. The request is retried until a non-server-error HTTP result code 
is obtained. A 503 (Service Unavailable) error will result in fairly frequent 
retrying. To temporarily suspend crawling, it is recommended to serve a 503 
HTTP result code. Handling of a permanent server error is undefined.??

Crawler-commons robots rules already provide 
[isDeferVisits|http://crawler-commons.github.io/crawler-commons/0.9/crawlercommons/robots/BaseRobotRules.html#isDeferVisits--]
 to store this information (must be set from RobotRulesParser).

  was:
Fetcher should optionally (enabled by default) suspend crawling for a configurable 
interval when fetching the robots.txt fails with a server error (HTTP status 
code 5xx, esp. 503), following [Google's 
spec|https://developers.google.com/search/reference/robots_txt#handling-http-result-codes]:
??5xx (server error)??
??Server errors are seen as temporary errors that result in a "full disallow" 
of crawling. The request is retried until a non-server-error HTTP result code 
is obtained. A 503 (Service Unavailable) error will result in fairly frequent 
retrying. To temporarily suspend crawling, it is recommended to serve a 503 
HTTP result code. Handling of a permanent server error is undefined.??

Crawler-commons robots rules already provide 
[isDeferVisits|http://crawler-commons.github.io/crawler-commons/0.9/crawlercommons/robots/BaseRobotRules.html#isDeferVisits--]
 to store this information (set from RobotRulesParser).


> Suspend crawling if robots.txt fails to fetch with 5xx status
> -
>
> Key: NUTCH-2573
> URL: https://issues.apache.org/jira/browse/NUTCH-2573
> Project: Nutch
>  Issue Type: Improvement
>  Components: fetcher
>Affects Versions: 1.14
>Reporter: Sebastian Nagel
>Priority: Major
> Fix For: 1.15
>
>
> Fetcher should optionally (enabled by default) suspend crawling for a configurable 
> interval when fetching the robots.txt fails with a server error (HTTP status 
> code 5xx, esp. 503), following [Google's 
> spec|https://developers.google.com/search/reference/robots_txt#handling-http-result-codes]:
> ??5xx (server error)??
> ??Server errors are seen as temporary errors that result in a "full disallow" 
> of crawling. The request is retried until a non-server-error HTTP result code 
> is obtained. A 503 (Service Unavailable) error will result in fairly frequent 
> retrying. To temporarily suspend crawling, it is recommended to serve a 503 
> HTTP result code. Handling of a permanent server error is undefined.??
> Crawler-commons robots rules already provide 
> [isDeferVisits|http://crawler-commons.github.io/crawler-commons/0.9/crawlercommons/robots/BaseRobotRules.html#isDeferVisits--]
>  to store this information (must be set from RobotRulesParser).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (NUTCH-2573) Suspend crawling if robots.txt fails to fetch with 5xx status

2018-04-26 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16454277#comment-16454277
 ] 

Sebastian Nagel commented on NUTCH-2573:


Note: the [current 
implementation|https://github.com/apache/nutch/blob/620b85df36d0c802f333a56ca1ef7021a7935360/src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpRobotRulesParser.java#L161]
 handles 5xx errors as "allow all" (no robots.txt present).
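
For illustration only, a hedged sketch (not the committed change) of how the robots.txt fetch status could be mapped to rules, assuming crawler-commons' {{SimpleRobotRules}}, {{RobotRulesMode}} and {{BaseRobotRules.setDeferVisits()}}:

{code:java}
// Hedged sketch, not the committed patch: map the HTTP status of the robots.txt
// fetch to robot rules, treating 5xx as a temporary "full disallow" and flagging
// deferred visits via crawler-commons BaseRobotRules.setDeferVisits().
import crawlercommons.robots.BaseRobotRules;
import crawlercommons.robots.SimpleRobotRules;
import crawlercommons.robots.SimpleRobotRules.RobotRulesMode;

public class RobotsTxt5xxSketch {

  public static BaseRobotRules rulesForStatus(int code) {
    if (code >= 500) {
      // temporary server error: "full disallow" plus a flag telling the fetcher to defer visits
      SimpleRobotRules rules = new SimpleRobotRules(RobotRulesMode.ALLOW_NONE);
      rules.setDeferVisits(true);
      return rules;
    }
    if (code == 403) {
      // forbidden robots.txt: conservative "disallow all"
      return new SimpleRobotRules(RobotRulesMode.ALLOW_NONE);
    }
    // 200 would go through the robots.txt parser (omitted here); 404 and others: allow all
    return new SimpleRobotRules(RobotRulesMode.ALLOW_ALL);
  }
}
{code}

The fetcher would then check {{isDeferVisits()}} and postpone the queue instead of proceeding with "allow all".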

> Suspend crawling if robots.txt fails to fetch with 5xx status
> -
>
> Key: NUTCH-2573
> URL: https://issues.apache.org/jira/browse/NUTCH-2573
> Project: Nutch
>  Issue Type: Improvement
>  Components: fetcher
>Affects Versions: 1.14
>Reporter: Sebastian Nagel
>Priority: Major
> Fix For: 1.15
>
>
> Fetcher should optionally (enabled by default) suspend crawling for a configurable 
> interval when fetching the robots.txt fails with a server error (HTTP status 
> code 5xx, esp. 503), following [Google's 
> spec|https://developers.google.com/search/reference/robots_txt#handling-http-result-codes]:
> ??5xx (server error)??
> ??Server errors are seen as temporary errors that result in a "full disallow" 
> of crawling. The request is retried until a non-server-error HTTP result code 
> is obtained. A 503 (Service Unavailable) error will result in fairly frequent 
> retrying. To temporarily suspend crawling, it is recommended to serve a 503 
> HTTP result code. Handling of a permanent server error is undefined.??
> Crawler-commons robots rules already provide 
> [isDeferVisits|http://crawler-commons.github.io/crawler-commons/0.9/crawlercommons/robots/BaseRobotRules.html#isDeferVisits--]
>  to store this information (set from RobotRulesParser).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (NUTCH-2573) Suspend crawling if robots.txt fails to fetch with 5xx status

2018-04-26 Thread Sebastian Nagel (JIRA)
Sebastian Nagel created NUTCH-2573:
--

 Summary: Suspend crawling if robots.txt fails to fetch with 5xx 
status
 Key: NUTCH-2573
 URL: https://issues.apache.org/jira/browse/NUTCH-2573
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
Affects Versions: 1.14
Reporter: Sebastian Nagel
 Fix For: 1.15


Fetcher should optionally (enabled by default) suspend crawling for a configurable 
interval when fetching the robots.txt fails with a server error (HTTP status 
code 5xx, esp. 503), following [Google's 
spec|https://developers.google.com/search/reference/robots_txt#handling-http-result-codes]:
??5xx (server error)??
??Server errors are seen as temporary errors that result in a "full disallow" 
of crawling. The request is retried until a non-server-error HTTP result code 
is obtained. A 503 (Service Unavailable) error will result in fairly frequent 
retrying. To temporarily suspend crawling, it is recommended to serve a 503 
HTTP result code. Handling of a permanent server error is undefined.??

Crawler-commons robots rules already provide 
[isDeferVisits|http://crawler-commons.github.io/crawler-commons/0.9/crawlercommons/robots/BaseRobotRules.html#isDeferVisits--]
 to store this information (set from RobotRulesParser).
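
A hedged sketch of the fetcher side follows; the property name {{fetcher.robotstxt.defer.visits.delay}} and the suspend bookkeeping are illustrative assumptions, not existing Nutch API:

{code:java}
// Hedged sketch of the fetcher side: suspend a fetch queue for a configurable
// interval whenever the robot rules signal deferred visits. The property name
// and the bookkeeping below are illustrative only.
import crawlercommons.robots.BaseRobotRules;
import org.apache.hadoop.conf.Configuration;

public class DeferVisitsSketch {

  private long suspendedUntil = 0L;

  /** Returns true if fetching from this queue should be skipped for now. */
  public boolean shouldSuspend(BaseRobotRules rules, Configuration conf) {
    long now = System.currentTimeMillis();
    if (rules != null && rules.isDeferVisits()) {
      // hypothetical property name, 5 minutes as an illustrative default
      long delay = conf.getLong("fetcher.robotstxt.defer.visits.delay", 5 * 60 * 1000L);
      suspendedUntil = Math.max(suspendedUntil, now + delay);
    }
    return now < suspendedUntil;
  }
}
{code}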



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (NUTCH-2572) HostDb: updatehostdb does not set values

2018-04-26 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16453909#comment-16453909
 ] 

Hudson commented on NUTCH-2572:
---

SUCCESS: Integrated in Jenkins build Nutch-trunk #3522 (See 
[https://builds.apache.org/job/Nutch-trunk/3522/])
NUTCH-2572 HostDb: updatehostdb does not set values - unwrap (snagel: 
[https://github.com/apache/nutch/commit/e94ee4803481995270a066e28cf88f80ff8bf468])
* (edit) src/java/org/apache/nutch/hostdb/UpdateHostDbReducer.java


> HostDb: updatehostdb does not set values
> 
>
> Key: NUTCH-2572
> URL: https://issues.apache.org/jira/browse/NUTCH-2572
> Project: Nutch
>  Issue Type: Bug
>  Components: hostdb
>Affects Versions: 1.15
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Major
> Fix For: 1.15
>
>
> {noformat}
> % bin/nutch readdb crawl/crawldb -stats -sort
> ...
> status 1 (db_unfetched):3
>nutch.apache.org :   3
> status 2 (db_fetched):  2
>nutch.apache.org :   2
> status 6 (db_notmodified):  34
>nutch.apache.org :   34
> CrawlDb statistics: done
> % bin/nutch updatehostdb -hostdb  crawl/hostdb -crawldb crawl/crawldb
> UpdateHostDb: hostdb: crawl/hostdb
> UpdateHostDb: crawldb: crawl/crawldb
> UpdateHostDb: starting at 2018-04-23 13:50:33
> UpdateHostDb: finished at 2018-04-23 13:50:35, elapsed: 00:00:01
> % bin/nutch readhostdb crawl/hostdb -get nutch.apache.org
> ReadHostDb: get: nutch.apache.org
> 0   0   0   0   0   0   0   0   0   0 
>   0.0 1970-01-01 01:00:00
> {noformat}
> Although a HostDb record is added for "nutch.apache.org", all expected values 
> (number of fetched/unfetched/... pages, fetch time 
> min/max/average/percentiles, etc.) are empty or zero.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (NUTCH-2544) Nutch 1.15 no longer compatible with AWS EMR and S3

2018-04-26 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16453903#comment-16453903
 ] 

Hudson commented on NUTCH-2544:
---

SUCCESS: Integrated in Jenkins build Nutch-trunk #3522 (See 
[https://builds.apache.org/job/Nutch-trunk/3522/])
NUTCH-2544 Nutch 1.15 no longer compatible with AWS EMR and S3 - use (snagel: 
[https://github.com/apache/nutch/commit/ac02d82af823b623c02167e543571e2c16a289be])
* (edit) src/java/org/apache/nutch/fetcher/FetcherOutputFormat.java
* (edit) src/java/org/apache/nutch/util/SitemapProcessor.java


> Nutch 1.15 no longer compatible with AWS EMR and S3
> ---
>
> Key: NUTCH-2544
> URL: https://issues.apache.org/jira/browse/NUTCH-2544
> Project: Nutch
>  Issue Type: Bug
>  Components: fetcher, generator
>Affects Versions: 1.15
>Reporter: Steven W
>Assignee: Sebastian Nagel
>Priority: Critical
> Fix For: 1.15
>
>
> Nutch 1.14 is working OK with AWS EMR and S3 storage, but NUTCH-2375 appears 
> to have broken this.
> Generator partitioning fails with Error: java.lang.NullPointerException at 
> org.apache.nutch.crawl.URLPartitioner.getPartition(URLPartitioner.java:75)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (NUTCH-2526) NPE in scoring-opic when indexing document without CrawlDb datum

2018-04-26 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16453907#comment-16453907
 ] 

Hudson commented on NUTCH-2526:
---

SUCCESS: Integrated in Jenkins build Nutch-trunk #3522 (See 
[https://builds.apache.org/job/Nutch-trunk/3522/])
NUTCH-2526 NPE in scoring-opic when indexing document without CrawlDb (snagel: 
[https://github.com/apache/nutch/commit/90ae2d1f9159c3d30d5a937252f2bbb00e2110e4])
* (edit) 
src/plugin/scoring-link/src/java/org/apache/nutch/scoring/link/LinkAnalysisScoringFilter.java
* (edit) src/java/org/apache/nutch/scoring/ScoringFilter.java
* (edit) 
src/plugin/scoring-opic/src/java/org/apache/nutch/scoring/opic/OPICScoringFilter.java
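
For illustration, a hedged sketch (not the committed change) of the kind of null guard needed in {{indexerScore()}} when a document has no CrawlDb datum:

{code:java}
// Hedged sketch, not the committed patch: a null-tolerant score lookup.
// Documents without a CrawlDb datum (e.g. sub-documents emitted by a parse
// filter) must not cause an NPE; fall back to the initial score instead.
import org.apache.nutch.crawl.CrawlDatum;

public class NullSafeScoreSketch {

  public static float indexerScore(CrawlDatum dbDatum, float initScore) {
    if (dbDatum == null) {
      // no CrawlDb entry for this document
      return initScore;
    }
    return dbDatum.getScore();
  }
}
{code}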


> NPE in scoring-opic when indexing document without CrawlDb datum
> 
>
> Key: NUTCH-2526
> URL: https://issues.apache.org/jira/browse/NUTCH-2526
> Project: Nutch
>  Issue Type: Improvement
>  Components: parser, scoring
>Affects Versions: 1.14
>Reporter: Yash Thenuan
>Assignee: Sebastian Nagel
>Priority: Major
> Fix For: 1.15
>
>
> I was trying to write a parse filter plugin whose job is to parse internal 
> links as separate documents. What I did, basically, is break the page into 
> multiple ParseResults, each ParseResult having a ParseText and ParseData 
> corresponding to an internal link. I was able to parse them separately, but 
> at scoring time an error occurred.
> I am attaching the logs from indexing.
>  
>  2018-03-07 15:41:52,327 INFO  indexer.IndexerMapReduce - IndexerMapReduce: 
> crawldb: crawl/crawldb
> 2018-03-07 15:41:52,327 INFO  indexer.IndexerMapReduce - IndexerMapReduce: 
> linkdb: crawl/linkdb
> 2018-03-07 15:41:52,327 INFO  indexer.IndexerMapReduce - IndexerMapReduces: 
> adding segment: crawl/segments/20180307130959
> 2018-03-07 15:41:53,677 INFO  anchor.AnchorIndexingFilter - Anchor 
> deduplication is: off
> 2018-03-07 15:41:54,861 INFO  indexer.IndexWriters - Adding 
> org.apache.nutch.indexwriter.elasticrest.ElasticRestIndexWriter
> 2018-03-07 15:41:55,168 INFO  client.AbstractJestClient - Setting server pool 
> to a list of 1 servers: [http://localhost:9200]
> 2018-03-07 15:41:55,170 INFO  client.JestClientFactory - Using multi 
> thread/connection supporting pooling connection manager
> 2018-03-07 15:41:55,238 INFO  client.JestClientFactory - Using default GSON 
> instance
> 2018-03-07 15:41:55,238 INFO  client.JestClientFactory - Node Discovery 
> disabled...
> 2018-03-07 15:41:55,238 INFO  client.JestClientFactory - Idle connection 
> reaping disabled...
> 2018-03-07 15:41:55,282 INFO  elasticrest.ElasticRestIndexWriter - Processing 
> remaining requests [docs = 1, length = 210402, total docs = 1]
> 2018-03-07 15:41:55,361 INFO  elasticrest.ElasticRestIndexWriter - Processing 
> to finalize last execute
> 2018-03-07 15:41:55,458 INFO  elasticrest.ElasticRestIndexWriter - Previous 
> took in ms 175, including wait 97
> 2018-03-07 15:41:55,468 WARN  mapred.LocalJobRunner - job_local1561152089_0001
> java.lang.Exception: java.lang.NullPointerException
>   at 
> org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
>   at 
> org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:529)
> Caused by: java.lang.NullPointerException
>   at 
> org.apache.nutch.scoring.opic.OPICScoringFilter.indexerScore(OPICScoringFilter.java:171)
>   at 
> org.apache.nutch.scoring.ScoringFilters.indexerScore(ScoringFilters.java:120)
>   at 
> org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:296)
>   at 
> org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:57)
>   at 
> org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:444)
>   at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:392)
>   at 
> org.apache.hadoop.mapred.LocalJobRunner$Job$ReduceTaskRunnable.run(LocalJobRunner.java:319)
>   at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> 2018-03-07 15:41:55,510 ERROR indexer.IndexingJob - Indexer: 
> java.io.IOException: Job failed!
>   at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:873)
>   at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:147)
>   at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:230)
>   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
>   at 

[jira] [Commented] (NUTCH-2570) Deduplication job fails to install deduplicated CrawlDb

2018-04-26 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16453906#comment-16453906
 ] 

Hudson commented on NUTCH-2570:
---

SUCCESS: Integrated in Jenkins build Nutch-trunk #3522 (See 
[https://builds.apache.org/job/Nutch-trunk/3522/])
NUTCH-2570 Deduplication job fails to install deduplicated CrawlDb - run 
(snagel: 
[https://github.com/apache/nutch/commit/447587881a14dd85317f33c4ccba1db5321ac0da])
* (edit) src/java/org/apache/nutch/crawl/DeduplicationJob.java


> Deduplication job fails to install deduplicated CrawlDb
> ---
>
> Key: NUTCH-2570
> URL: https://issues.apache.org/jira/browse/NUTCH-2570
> Project: Nutch
>  Issue Type: Bug
>  Components: crawldb
>Affects Versions: 1.15
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Critical
> Fix For: 1.15
>
>
> The DeduplicationJob ("nutch dedup") fails to install the deduplicated 
> CrawlDb and leaves only the "old" crawldb (if "db.preserve.backup" is true):
> {noformat}
> % tree crawldb
> crawldb
> ├── current
> │   └── part-r-0
> │   ├── data
> │   └── index
> └── old
> └── part-r-0
> ├── data
> └── index
> % bin/nutch dedup crawldb
> DeduplicationJob: starting at 2018-04-22 21:48:08
> Deduplication: 6 documents marked as duplicates
> Deduplication: Updating status of duplicate urls into crawl db.
> Exception in thread "main" java.io.FileNotFoundException: File 
> file:/tmp/crawldb/1742327020 does not exist
> at 
> org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:611)
> at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:824)
> at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:601)
> at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:337)
> at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:289)
> at org.apache.hadoop.fs.RawLocalFileSystem.rename(RawLocalFileSystem.java:374)
> at org.apache.hadoop.fs.ChecksumFileSystem.rename(ChecksumFileSystem.java:613)
> at org.apache.nutch.util.FSUtils.replace(FSUtils.java:58)
> at org.apache.nutch.crawl.CrawlDb.install(CrawlDb.java:212)
> at org.apache.nutch.crawl.CrawlDb.install(CrawlDb.java:225)
> at org.apache.nutch.crawl.DeduplicationJob.run(DeduplicationJob.java:366)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
> at org.apache.nutch.crawl.DeduplicationJob.main(DeduplicationJob.java:379)
> % tree crawldb
> crawldb
> └── old
> └── part-r-0
> ├── data
> └── index
> {noformat}
> In pseudo-distributed mode it's even worse: only the "old" CrawlDb is left 
> without any error.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (NUTCH-2569) ClassNotFoundException when running in (pseudo-)distributed mode

2018-04-26 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16453905#comment-16453905
 ] 

Hudson commented on NUTCH-2569:
---

SUCCESS: Integrated in Jenkins build Nutch-trunk #3522 (See 
[https://builds.apache.org/job/Nutch-trunk/3522/])
NUTCH-2569 ClassNotFoundException when running in (pseudo-)distributed (snagel: 
[https://github.com/apache/nutch/commit/a18d4b6672801cea2141c0469eca8a1b93c90fbd])
* (edit) src/java/org/apache/nutch/crawl/CrawlDbReader.java
* (edit) src/java/org/apache/nutch/crawl/LinkDbReader.java
* (edit) src/java/org/apache/nutch/segment/SegmentReader.java
* (edit) src/java/org/apache/nutch/crawl/CrawlDbMerger.java
* (edit) src/java/org/apache/nutch/indexer/CleaningJob.java
* (edit) src/java/org/apache/nutch/crawl/DeduplicationJob.java
* (edit) src/java/org/apache/nutch/crawl/CrawlDb.java
NUTCH-2569 ClassNotFoundException when running in (pseudo-)distributed (snagel: 
[https://github.com/apache/nutch/commit/d50cd12795cb39fab9c7bdab5040b35d67e917c2])
* (edit) src/java/org/apache/nutch/crawl/CrawlDbReader.java
NUTCH-2569 ClassNotFoundException when running in (pseudo-)distributed (snagel: 
[https://github.com/apache/nutch/commit/fb47207c3b11cc1f3937fd68ea37b4cb2f2e6d4b])
* (edit) src/java/org/apache/nutch/scoring/webgraph/LinkRank.java
* (edit) src/java/org/apache/nutch/scoring/webgraph/LinkDumper.java
* (edit) src/test/org/apache/nutch/crawl/TestCrawlDbFilter.java
* (edit) src/java/org/apache/nutch/crawl/Generator.java
* (edit) src/java/org/apache/nutch/scoring/webgraph/WebGraph.java
* (edit) src/java/org/apache/nutch/hostdb/ReadHostDb.java
* (edit) src/java/org/apache/nutch/scoring/webgraph/NodeDumper.java
* (edit) src/java/org/apache/nutch/scoring/webgraph/ScoreUpdater.java


> ClassNotFoundException when running in (pseudo-)distributed mode
> 
>
> Key: NUTCH-2569
> URL: https://issues.apache.org/jira/browse/NUTCH-2569
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 1.15
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Blocker
> Fix For: 1.15
>
>
> The CrawlDb / updatedb job fails in pseudo-distributed mode with a 
> ClassNotFoundException:
> {noformat}
> 18/04/22 19:24:49 INFO mapreduce.Job: Task Id : 
> attempt_1524395182329_0018_m_00_0, Status : FAILED
> Error: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class 
> org.apache.nutch.crawl.CrawlDbFilter not found
> at 
> org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2369)
> at 
> org.apache.hadoop.mapreduce.task.JobContextImpl.getMapperClass(JobContextImpl.java:186)
> at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:745)
> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
> at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:175)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1836)
> at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:169)
> Caused by: java.lang.ClassNotFoundException: Class 
> org.apache.nutch.crawl.CrawlDbFilter not found
> at 
> org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2273)
> at 
> org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2367)
> {noformat}
> The job jar must be defined by calling {{job.setJarByClass(...)}}. This also 
> affects other jobs.
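
A minimal sketch of the fix pattern (the actual changes are in the commits above); {{Job.setJarByClass()}} tells Hadoop which jar to ship to the task JVMs:

{code:java}
// Minimal sketch of the fix pattern, not the exact diff: every job must declare
// which jar contains its mapper/reducer classes, otherwise YARN task JVMs cannot
// load them and fail with ClassNotFoundException.
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class JarByClassSketch {

  public static Job createJob(Configuration conf) throws IOException {
    Job job = Job.getInstance(conf, "crawldb update (sketch)");
    // ship the jar containing this class to the task JVMs
    job.setJarByClass(JarByClassSketch.class);
    // job.setMapperClass(...), job.setReducerClass(...), input/output paths, etc.
    return job;
  }
}
{code}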



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (NUTCH-2571) SegmentReader -list fails to read segment

2018-04-26 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16453908#comment-16453908
 ] 

Hudson commented on NUTCH-2571:
---

SUCCESS: Integrated in Jenkins build Nutch-trunk #3522 (See 
[https://builds.apache.org/job/Nutch-trunk/3522/])
NUTCH-2571 SegmentReader -list fails to read segment - fix type of value 
(snagel: 
[https://github.com/apache/nutch/commit/717d1e9f0c18dff97f21b8f626097de099fbfe11])
* (edit) src/java/org/apache/nutch/segment/SegmentReader.java


> SegmentReader -list fails to read segment
> -
>
> Key: NUTCH-2571
> URL: https://issues.apache.org/jira/browse/NUTCH-2571
> Project: Nutch
>  Issue Type: Bug
>  Components: segment
>Affects Versions: 1.15
> Environment: local + pseudo-distributed mode
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Minor
> Fix For: 1.15
>
>
> The -list command of SegmentReader fails to read data from segments:
> {noformat}
> % bin/nutch readseg -list crawl/segments/20180409100315/ 
> Exception in thread "main" java.io.IOException: wrong value class:  is not 
> class org.apache.nutch.crawl.CrawlDatum
> at 
> org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:2379)
> at 
> org.apache.nutch.segment.SegmentReader.getStats(SegmentReader.java:524)
> at org.apache.nutch.segment.SegmentReader.list(SegmentReader.java:482)
> at org.apache.nutch.segment.SegmentReader.run(SegmentReader.java:670)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
> at org.apache.nutch.segment.SegmentReader.main(SegmentReader.java:736)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (NUTCH-2517) mergesegs corrupts segment data

2018-04-26 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16453904#comment-16453904
 ] 

Hudson commented on NUTCH-2517:
---

SUCCESS: Integrated in Jenkins build Nutch-trunk #3522 (See 
[https://builds.apache.org/job/Nutch-trunk/3522/])
NUTCH-2517 mergesegs corrupts segment data - fix name of output (snagel: 
[https://github.com/apache/nutch/commit/2f50e801005493d0217160b7239eb2db82ca89f4])
* (edit) src/java/org/apache/nutch/segment/SegmentMerger.java
* (edit) src/java/org/apache/nutch/segment/SegmentReader.java
* (edit) src/java/org/apache/nutch/crawl/CrawlDbReader.java
* (edit) src/java/org/apache/nutch/indexer/IndexerOutputFormat.java


> mergesegs corrupts segment data
> ---
>
> Key: NUTCH-2517
> URL: https://issues.apache.org/jira/browse/NUTCH-2517
> Project: Nutch
>  Issue Type: Bug
>  Components: segment
>Affects Versions: 1.15
> Environment: xubuntu 17.10, docker container of apache/nutch LATEST
>Reporter: Marco Ebbinghaus
>Assignee: Lewis John McGibbney
>Priority: Blocker
>  Labels: mapreduce, mergesegs
> Fix For: 1.15
>
> Attachments: Screenshot_2018-03-03_18-09-28.png, 
> Screenshot_2018-03-07_07-50-05.png
>
>
> The problem probably occurs since commit 
> [https://github.com/apache/nutch/commit/54510e503f7da7301a59f5f0e5bf4509b37d35b4]
> How to reproduce:
>  * create container from apache/nutch image (latest)
>  * open terminal in that container
>  * set http.agent.name
>  * create crawldir and urls file
>  * run bin/nutch inject (bin/nutch inject mycrawl/crawldb urls/urls)
>  * run bin/nutch generate (bin/nutch generate mycrawl/crawldb 
> mycrawl/segments 1)
>  ** this results in a segment (e.g. 20180304134215)
>  * run bin/nutch fetch (bin/nutch fetch mycrawl/segments/20180304134215 
> -threads 2)
>  * run bin/nutch parse (bin/nutch parse mycrawl/segments/20180304134215 
> -threads 2)
>  ** ls in the segment folder -> existing folders: content, crawl_fetch, 
> crawl_generate, crawl_parse, parse_data, parse_text
>  * run bin/nutch updatedb (bin/nutch updatedb mycrawl/crawldb 
> mycrawl/segments/20180304134215)
>  * run bin/nutch mergesegs (bin/nutch mergesegs mycrawl/MERGEDsegments 
> mycrawl/segments/* -filter)
>  ** console output: `SegmentMerger: using segment data from: content 
> crawl_generate crawl_fetch crawl_parse parse_data parse_text`
>  ** resulting segment: 20180304134535
>  * ls in mycrawl/MERGEDsegments/segment/20180304134535 -> only existing 
> folder: crawl_generate
>  * run bin/nutch invertlinks (bin/nutch invertlinks mycrawl/linkdb -dir 
> mycrawl/MERGEDsegments) which results in a consequential error
>  ** console output: `LinkDb: adding segment: 
> [file:/root/nutch_source/runtime/local/mycrawl/MERGEDsegments/20180304134535|file:///root/nutch_source/runtime/local/mycrawl/MERGEDsegments/20180304134535]
>  LinkDb: org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input 
> path does not exist: 
> [file:/root/nutch_source/runtime/local/mycrawl/MERGEDsegments/20180304134535/parse_data|file:///root/nutch_source/runtime/local/mycrawl/MERGEDsegments/20180304134535/parse_data]
>      at 
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:323)
>      at 
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:265)
>      at 
> org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:59)
>      at 
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:387)
>      at 
> org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:301)
>      at 
> org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:318)
>      at 
> org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:196)
>      at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1290)
>      at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1287)
>      at java.security.AccessController.doPrivileged(Native Method)
>      at javax.security.auth.Subject.doAs(Subject.java:422)
>      at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1746)
>      at org.apache.hadoop.mapreduce.Job.submit(Job.java:1287)
>      at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1308)
>      at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:224)
>      at org.apache.nutch.crawl.LinkDb.run(LinkDb.java:353)
>      at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
>      at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:313)`
> So it seems the mapreduce job corrupts the segment folder during the mergesegs command.
>  
> Note that this issue is not related to trying to merge a 
> single segment like 

[jira] [Commented] (NUTCH-2527) URL filter: provide rules to exclude localhost and private address spaces

2018-04-26 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16453899#comment-16453899
 ] 

Hudson commented on NUTCH-2527:
---

SUCCESS: Integrated in Jenkins build Nutch-nutchgora #1607 (See 
[https://builds.apache.org/job/Nutch-nutchgora/1607/])
NUTCH-2527 URL filter: provide rules to exclude localhost and private (snagel: 
[https://github.com/apache/nutch/commit/d62ece00469fd6b2012418062602246f090e10c5])
* (edit) conf/regex-urlfilter.txt.template


> URL filter: provide rules to exclude localhost and private address spaces
> -
>
> Key: NUTCH-2527
> URL: https://issues.apache.org/jira/browse/NUTCH-2527
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 2.3.1, 1.14
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Minor
> Fix For: 2.4, 1.15
>
>
> While checking the log files of a large web crawl, I've found hundreds of 
> (luckily failed) requests for local or private content:
> {noformat}
> 2018-02-18 04:48:34,022 INFO [FetcherThread] 
> org.apache.nutch.fetcher.Fetcher: fetching http://127.0.0.42/ ...
> 2018-02-18 04:48:34,022 INFO [FetcherThread] org.apache.nutch.fetcher.Fetcher: 
> fetch of http://127.0.0.42/ failed with: java.net.ConnectException: 
> Connection refused (Connection refused)
> {noformat}
> For a wider web crawl, where links are not controlled, URLs pointing to 
> localhost, loop-back addresses, and private address spaces should be blocked, 
> to avoid information being leaked via links or redirects pointing to the web 
> interfaces of services running on the crawling machines (e.g., HDFS, Hadoop 
> YARN).
> Of course, this must be optional. For testing it's quite common to crawl your 
> local machine.
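
For illustration, a hedged sketch of what optional rules in {{regex-urlfilter.txt}} might look like (illustrative patterns, not the exact rules added by the commit):

{noformat}
# Illustrative only: block localhost, loop-back and RFC 1918 / link-local addresses.
# Keep these rules commented out (or removed) when crawling your local machine.
-^https?://(?:localhost|127\.|0\.0\.0\.0|10\.|192\.168\.|169\.254\.|172\.(?:1[6-9]|2\d|3[01])\.)
{noformat}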



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (NUTCH-1228) Change mapred.task.timeout to mapreduce.task.timeout in fetcher

2018-04-26 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16453875#comment-16453875
 ] 

Sebastian Nagel commented on NUTCH-1228:


Yes, but it's finally done. With the new MapReduce API, (at least some of) the 
old/deprecated properties are ignored.
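
For illustration, a hedged sketch (not the committed patch) of reading the timeout under its new name; 600000 ms (10 minutes) is Hadoop's default for {{mapreduce.task.timeout}}:

{code:java}
// Hedged illustration, not the committed patch: with the new MapReduce API the
// deprecated "mapred.*" keys may be ignored, so the fetcher should read the
// timeout under its "mapreduce.*" name.
import org.apache.hadoop.conf.Configuration;

public class TaskTimeoutSketch {

  public static int taskTimeoutMs(Configuration conf) {
    // 600000 ms (10 min) is the Hadoop default for mapreduce.task.timeout
    return conf.getInt("mapreduce.task.timeout", 10 * 60 * 1000);
  }
}
{code}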

> Change mapred.task.timeout to mapreduce.task.timeout in fetcher
> ---
>
> Key: NUTCH-1228
> URL: https://issues.apache.org/jira/browse/NUTCH-1228
> Project: Nutch
>  Issue Type: Task
>  Components: fetcher
>Affects Versions: 2.3.1, 1.14
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Trivial
> Fix For: 2.4, 1.15
>
> Attachments: NUTCH-1228-2.1.patch
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (NUTCH-2572) HostDb: updatehostdb does not set values

2018-04-26 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-2572.

Resolution: Fixed

Thanks for the review, [~markus17]!

> HostDb: updatehostdb does not set values
> 
>
> Key: NUTCH-2572
> URL: https://issues.apache.org/jira/browse/NUTCH-2572
> Project: Nutch
>  Issue Type: Bug
>  Components: hostdb
>Affects Versions: 1.15
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Major
> Fix For: 1.15
>
>
> {noformat}
> % bin/nutch readdb crawl/crawldb -stats -sort
> ...
> status 1 (db_unfetched):3
>nutch.apache.org :   3
> status 2 (db_fetched):  2
>nutch.apache.org :   2
> status 6 (db_notmodified):  34
>nutch.apache.org :   34
> CrawlDb statistics: done
> % bin/nutch updatehostdb -hostdb  crawl/hostdb -crawldb crawl/crawldb
> UpdateHostDb: hostdb: crawl/hostdb
> UpdateHostDb: crawldb: crawl/crawldb
> UpdateHostDb: starting at 2018-04-23 13:50:33
> UpdateHostDb: finished at 2018-04-23 13:50:35, elapsed: 00:00:01
> % bin/nutch readhostdb crawl/hostdb -get nutch.apache.org
> ReadHostDb: get: nutch.apache.org
> 0   0   0   0   0   0   0   0   0   0 
>   0.0 1970-01-01 01:00:00
> {noformat}
> Although a HostDb record is added for "nutch.apache.org", all expected values 
> (number of fetched/unfetched/... pages, fetch time 
> min/max/average/percentiles, etc.) are empty or zero.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (NUTCH-2572) HostDb: updatehostdb does not set values

2018-04-26 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16453870#comment-16453870
 ] 

ASF GitHub Bot commented on NUTCH-2572:
---

sebastian-nagel closed pull request #326: NUTCH-2572 HostDb: updatehostdb does 
not set values
URL: https://github.com/apache/nutch/pull/326
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/src/java/org/apache/nutch/hostdb/UpdateHostDbReducer.java 
b/src/java/org/apache/nutch/hostdb/UpdateHostDbReducer.java
index 34a51037e..21c847db8 100644
--- a/src/java/org/apache/nutch/hostdb/UpdateHostDbReducer.java
+++ b/src/java/org/apache/nutch/hostdb/UpdateHostDbReducer.java
@@ -134,7 +134,8 @@ public void reduce(Text key, Iterable values,
 
 // Loop through all values until we find a non-empty HostDatum or use
 // an empty if this is a new host for the host db
-for (Writable value : values) {
+for (NutchWritable val : values) {
+  final Writable value = val.get(); // unwrap
   
   // Count crawl datum status's and collect metadata from fields
   if (value instanceof CrawlDatum) {
@@ -260,7 +261,7 @@ public void reduce(Text key, Iterable values,
   }
   
   // 
-  if (value instanceof HostDatum) {
+  else if (value instanceof HostDatum) {
 HostDatum buffer = (HostDatum)value;
 
 // Check homepage URL
@@ -295,9 +296,11 @@ public void reduce(Text key, Iterable 
values,
   }
 
   // Check for the score
-  if (value instanceof FloatWritable) {
+  else if (value instanceof FloatWritable) {
 FloatWritable buffer = (FloatWritable)value;
 score = buffer.get();
+  } else {
+LOG.error("Class {} not handled", value.getClass());
   }
 }
 


 


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> HostDb: updatehostdb does not set values
> 
>
> Key: NUTCH-2572
> URL: https://issues.apache.org/jira/browse/NUTCH-2572
> Project: Nutch
>  Issue Type: Bug
>  Components: hostdb
>Affects Versions: 1.15
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Major
> Fix For: 1.15
>
>
> {noformat}
> % bin/nutch readdb crawl/crawldb -stats -sort
> ...
> status 1 (db_unfetched):3
>nutch.apache.org :   3
> status 2 (db_fetched):  2
>nutch.apache.org :   2
> status 6 (db_notmodified):  34
>nutch.apache.org :   34
> CrawlDb statistics: done
> % bin/nutch updatehostdb -hostdb  crawl/hostdb -crawldb crawl/crawldb
> UpdateHostDb: hostdb: crawl/hostdb
> UpdateHostDb: crawldb: crawl/crawldb
> UpdateHostDb: starting at 2018-04-23 13:50:33
> UpdateHostDb: finished at 2018-04-23 13:50:35, elapsed: 00:00:01
> % bin/nutch readhostdb crawl/hostdb -get nutch.apache.org
> ReadHostDb: get: nutch.apache.org
> 0   0   0   0   0   0   0   0   0   0 
>   0.0 1970-01-01 01:00:00
> {noformat}
> Although a HostDb record is added for "nutch.apache.org", all expected values 
> (number of fetched/unfetched/... pages, fetch time 
> min/max/average/percentiles, etc.) are empty or zero.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (NUTCH-2571) SegmentReader -list fails to read segment

2018-04-26 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-2571.

Resolution: Fixed

Thanks, [~omkar20895] for the review!

> SegmentReader -list fails to read segment
> -
>
> Key: NUTCH-2571
> URL: https://issues.apache.org/jira/browse/NUTCH-2571
> Project: Nutch
>  Issue Type: Bug
>  Components: segment
>Affects Versions: 1.15
> Environment: local + pseudo-distributed mode
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Minor
> Fix For: 1.15
>
>
> The -list command of SegmentReader fails to read data from segments:
> {noformat}
> % bin/nutch readseg -list crawl/segments/20180409100315/ 
> Exception in thread "main" java.io.IOException: wrong value class:  is not 
> class org.apache.nutch.crawl.CrawlDatum
> at 
> org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:2379)
> at 
> org.apache.nutch.segment.SegmentReader.getStats(SegmentReader.java:524)
> at org.apache.nutch.segment.SegmentReader.list(SegmentReader.java:482)
> at org.apache.nutch.segment.SegmentReader.run(SegmentReader.java:670)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
> at org.apache.nutch.segment.SegmentReader.main(SegmentReader.java:736)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (NUTCH-2570) Deduplication job fails to install deduplicated CrawlDb

2018-04-26 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-2570.

Resolution: Fixed

> Deduplication job fails to install deduplicated CrawlDb
> ---
>
> Key: NUTCH-2570
> URL: https://issues.apache.org/jira/browse/NUTCH-2570
> Project: Nutch
>  Issue Type: Bug
>  Components: crawldb
>Affects Versions: 1.15
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Critical
> Fix For: 1.15
>
>
> The DeduplicationJob ("nutch dedup") fails to install the deduplicated 
> CrawlDb and leaves only the "old" crawldb (if "db.preserve.backup" is true):
> {noformat}
> % tree crawldb
> crawldb
> ├── current
> │   └── part-r-0
> │   ├── data
> │   └── index
> └── old
> └── part-r-0
> ├── data
> └── index
> % bin/nutch dedup crawldb
> DeduplicationJob: starting at 2018-04-22 21:48:08
> Deduplication: 6 documents marked as duplicates
> Deduplication: Updating status of duplicate urls into crawl db.
> Exception in thread "main" java.io.FileNotFoundException: File 
> file:/tmp/crawldb/1742327020 does not exist
> at 
> org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:611)
> at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:824)
> at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:601)
> at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:337)
> at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:289)
> at org.apache.hadoop.fs.RawLocalFileSystem.rename(RawLocalFileSystem.java:374)
> at org.apache.hadoop.fs.ChecksumFileSystem.rename(ChecksumFileSystem.java:613)
> at org.apache.nutch.util.FSUtils.replace(FSUtils.java:58)
> at org.apache.nutch.crawl.CrawlDb.install(CrawlDb.java:212)
> at org.apache.nutch.crawl.CrawlDb.install(CrawlDb.java:225)
> at org.apache.nutch.crawl.DeduplicationJob.run(DeduplicationJob.java:366)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
> at org.apache.nutch.crawl.DeduplicationJob.main(DeduplicationJob.java:379)
> % tree crawldb
> crawldb
> └── old
> └── part-r-0
> ├── data
> └── index
> {noformat}
> In pseudo-distributed mode it's even worse: only the "old" CrawlDb is left 
> without any error.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (NUTCH-2570) Deduplication job fails to install deduplicated CrawlDb

2018-04-26 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16453862#comment-16453862
 ] 

ASF GitHub Bot commented on NUTCH-2570:
---

sebastian-nagel closed pull request #323: NUTCH-2570 Deduplication job fails to 
install deduplicated CrawlDb
URL: https://github.com/apache/nutch/pull/323
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/src/java/org/apache/nutch/crawl/DeduplicationJob.java 
b/src/java/org/apache/nutch/crawl/DeduplicationJob.java
index 555f9e2eb..12ebd3c8b 100644
--- a/src/java/org/apache/nutch/crawl/DeduplicationJob.java
+++ b/src/java/org/apache/nutch/crawl/DeduplicationJob.java
@@ -265,7 +265,7 @@ public int run(String[] args) throws IOException {
 }
 
 String group = "none";
-String crawldb = args[0];
+Path crawlDb = new Path(args[0]);
 String compareOrder = "score,fetchTime,urlLength";
 
 for (int i = 1; i < args.length; i++) {
@@ -287,17 +287,16 @@ public int run(String[] args) throws IOException {
 long start = System.currentTimeMillis();
 LOG.info("DeduplicationJob: starting at " + sdf.format(start));
 
-Path tempDir = new Path(getConf().get("mapreduce.cluster.temp.dir", ".")
-+ "/dedup-temp-"
+Path tempDir = new Path(crawlDb, "dedup-temp-"
 + Integer.toString(new Random().nextInt(Integer.MAX_VALUE)));
 
 Job job = NutchJob.getInstance(getConf());
 Configuration conf = job.getConfiguration();
-job.setJobName("Deduplication on " + crawldb);
+job.setJobName("Deduplication on " + crawlDb);
 conf.set(DEDUPLICATION_GROUP_MODE, group);
 conf.set(DEDUPLICATION_COMPARE_ORDER, compareOrder);
 
-FileInputFormat.addInputPath(job, new Path(crawldb, CrawlDb.CURRENT_NAME));
+FileInputFormat.addInputPath(job, new Path(crawlDb, CrawlDb.CURRENT_NAME));
 job.setInputFormatClass(SequenceFileInputFormat.class);
 
 FileOutputFormat.setOutputPath(job, tempDir);
@@ -341,28 +340,33 @@ public int run(String[] args) throws IOException {
   LOG.info("Deduplication: Updating status of duplicate urls into crawl 
db.");
 }
 
-Path dbPath = new Path(crawldb);
-Job mergeJob = CrawlDb.createJob(getConf(), dbPath);
+Job mergeJob = CrawlDb.createJob(getConf(), crawlDb);
 FileInputFormat.addInputPath(mergeJob, tempDir);
 mergeJob.setReducerClass(StatusUpdateReducer.class);
+mergeJob.setJarByClass(DeduplicationJob.class);
 
+fs = crawlDb.getFileSystem(getConf());
+Path outPath = FileOutputFormat.getOutputPath(job);
+Path lock = CrawlDb.lock(getConf(), crawlDb, false);
 try {
-  boolean success = job.waitForCompletion(true);
+  boolean success = mergeJob.waitForCompletion(true);
   if (!success) {
 String message = "Crawl job did not succeed, job status:"
-+ job.getStatus().getState() + ", reason: "
-+ job.getStatus().getFailureInfo();
++ mergeJob.getStatus().getState() + ", reason: "
++ mergeJob.getStatus().getFailureInfo();
 LOG.error(message);
 fs.delete(tempDir, true);
+NutchJob.cleanupAfterFailure(outPath, lock, fs);
 throw new RuntimeException(message);
   }
 } catch (IOException | InterruptedException | ClassNotFoundException e) {
   LOG.error("DeduplicationMergeJob: " + StringUtils.stringifyException(e));
   fs.delete(tempDir, true);
+  NutchJob.cleanupAfterFailure(outPath, lock, fs);
   return -1;
 }
 
-CrawlDb.install(mergeJob, dbPath);
+CrawlDb.install(mergeJob, crawlDb);
 
 // clean up
 fs.delete(tempDir, true);


 


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Deduplication job fails to install deduplicated CrawlDb
> ---
>
> Key: NUTCH-2570
> URL: https://issues.apache.org/jira/browse/NUTCH-2570
> Project: Nutch
>  Issue Type: Bug
>  Components: crawldb
>Affects Versions: 1.15
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Critical
> Fix For: 1.15
>
>
> The DeduplicationJob ("nutch dedup") fails to install the deduplicated 
> CrawlDb and leaves only the "old" crawldb (if "db.preserve.backup" is true):
> {noformat}
> % tree crawldb
> crawldb
> ├── current
> │   └── part-r-0
> │   ├── data
> │   └── index
> └── old
> └── part-r-0
> 

[jira] [Resolved] (NUTCH-2569) ClassNotFoundException when running in (pseudo-)distributed mode

2018-04-26 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-2569.

Resolution: Fixed

> ClassNotFoundException when running in (pseudo-)distributed mode
> 
>
> Key: NUTCH-2569
> URL: https://issues.apache.org/jira/browse/NUTCH-2569
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 1.15
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Blocker
> Fix For: 1.15
>
>
> The CrawlDb / updatedb job fails in pseudo-distributed mode with a 
> ClassNotFoundException:
> {noformat}
> 18/04/22 19:24:49 INFO mapreduce.Job: Task Id : 
> attempt_1524395182329_0018_m_00_0, Status : FAILED
> Error: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class 
> org.apache.nutch.crawl.CrawlDbFilter not found
> at 
> org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2369)
> at 
> org.apache.hadoop.mapreduce.task.JobContextImpl.getMapperClass(JobContextImpl.java:186)
> at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:745)
> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
> at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:175)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1836)
> at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:169)
> Caused by: java.lang.ClassNotFoundException: Class 
> org.apache.nutch.crawl.CrawlDbFilter not found
> at 
> org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2273)
> at 
> org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2367)
> {noformat}
> The job jar must be defined by calling {{job.setJarByClass(...)}}. This also 
> affects other jobs.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (NUTCH-2569) ClassNotFoundException when running in (pseudo-)distributed mode

2018-04-26 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16453860#comment-16453860
 ] 

ASF GitHub Bot commented on NUTCH-2569:
---

sebastian-nagel closed pull request #322: NUTCH-2569 ClassNotFoundException 
when running in (pseudo-)distributed mode
URL: https://github.com/apache/nutch/pull/322
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/src/java/org/apache/nutch/crawl/CrawlDb.java 
b/src/java/org/apache/nutch/crawl/CrawlDb.java
index 7af3b6b19..333a7b68b 100644
--- a/src/java/org/apache/nutch/crawl/CrawlDb.java
+++ b/src/java/org/apache/nutch/crawl/CrawlDb.java
@@ -181,6 +181,7 @@ public static Job createJob(Configuration config, Path 
crawlDb)
 
 job.setMapperClass(CrawlDbFilter.class);
 job.setReducerClass(CrawlDbReducer.class);
+job.setJarByClass(CrawlDb.class);
 
 FileOutputFormat.setOutputPath(job, newCrawlDb);
 job.setOutputFormatClass(MapFileOutputFormat.class);
diff --git a/src/java/org/apache/nutch/crawl/CrawlDbMerger.java 
b/src/java/org/apache/nutch/crawl/CrawlDbMerger.java
index 3a83e416e..4d9ce0d90 100644
--- a/src/java/org/apache/nutch/crawl/CrawlDbMerger.java
+++ b/src/java/org/apache/nutch/crawl/CrawlDbMerger.java
@@ -77,7 +77,8 @@
 public void close() throws IOException {
 }
 
-public void setup(Reducer.Context context) {
+public void setup(
+Reducer.Context context) {
   Configuration conf = context.getConfiguration();
   schedule = FetchScheduleFactory.getFetchSchedule(conf);
 }
@@ -179,6 +180,7 @@ public static Job createMergeJob(Configuration conf, Path 
output,
 
 job.setInputFormatClass(SequenceFileInputFormat.class);
 
+job.setJarByClass(CrawlDbMerger.class);
 job.setMapperClass(CrawlDbFilter.class);
 conf.setBoolean(CrawlDbFilter.URL_FILTERING, filter);
 conf.setBoolean(CrawlDbFilter.URL_NORMALIZING, normalize);
diff --git a/src/java/org/apache/nutch/crawl/CrawlDbReader.java 
b/src/java/org/apache/nutch/crawl/CrawlDbReader.java
index 87bf58525..7e00ece5b 100644
--- a/src/java/org/apache/nutch/crawl/CrawlDbReader.java
+++ b/src/java/org/apache/nutch/crawl/CrawlDbReader.java
@@ -390,6 +390,7 @@ public void close() {
  FileInputFormat.addInputPath(job, new Path(crawlDb, 
CrawlDb.CURRENT_NAME));
  job.setInputFormatClass(SequenceFileInputFormat.class);
 
+ job.setJarByClass(CrawlDbReader.class);
  job.setMapperClass(CrawlDbStatMapper.class);
  job.setCombinerClass(CrawlDbStatReducer.class);
  job.setReducerClass(CrawlDbStatReducer.class);
@@ -690,6 +691,7 @@ public void processDumpJob(String crawlDb, String output,
 job.setMapperClass(CrawlDbDumpMapper.class);
 job.setOutputKeyClass(Text.class);
 job.setOutputValueClass(CrawlDatum.class);
+job.setJarByClass(CrawlDbReader.class);
 
 try {
   boolean success = job.waitForCompletion(true);
@@ -794,6 +796,8 @@ public void processTopNJob(String crawlDb, long topN, float 
min,
 job.setJobName("topN prepare " + crawlDb);
 FileInputFormat.addInputPath(job, new Path(crawlDb, CrawlDb.CURRENT_NAME));
 job.setInputFormatClass(SequenceFileInputFormat.class);
+
+job.setJarByClass(CrawlDbReader.class);
 job.setMapperClass(CrawlDbTopNMapper.class);
 job.setReducerClass(Reducer.class);
 
@@ -832,6 +836,7 @@ public void processTopNJob(String crawlDb, long topN, float 
min,
 job.setInputFormatClass(SequenceFileInputFormat.class);
 job.setMapperClass(Mapper.class);
 job.setReducerClass(CrawlDbTopNReducer.class);
+job.setJarByClass(CrawlDbReader.class);
 
 FileOutputFormat.setOutputPath(job, outFolder);
 job.setOutputFormatClass(TextOutputFormat.class);
diff --git a/src/java/org/apache/nutch/crawl/DeduplicationJob.java 
b/src/java/org/apache/nutch/crawl/DeduplicationJob.java
index 555f9e2eb..eaeb83581 100644
--- a/src/java/org/apache/nutch/crawl/DeduplicationJob.java
+++ b/src/java/org/apache/nutch/crawl/DeduplicationJob.java
@@ -296,6 +296,7 @@ public int run(String[] args) throws IOException {
 job.setJobName("Deduplication on " + crawldb);
 conf.set(DEDUPLICATION_GROUP_MODE, group);
 conf.set(DEDUPLICATION_COMPARE_ORDER, compareOrder);
+job.setJarByClass(DeduplicationJob.class);
 
 FileInputFormat.addInputPath(job, new Path(crawldb, CrawlDb.CURRENT_NAME));
 job.setInputFormatClass(SequenceFileInputFormat.class);
diff --git a/src/java/org/apache/nutch/crawl/Generator.java 
b/src/java/org/apache/nutch/crawl/Generator.java
index a3ef91c89..9c22ee228 100644
--- a/src/java/org/apache/nutch/crawl/Generator.java
+++ 

[jira] [Resolved] (NUTCH-2517) mergesegs corrupts segment data

2018-04-26 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-2517.

Resolution: Fixed

Thanks, [~mebbinghaus]! Thanks, [~lewismc]! With [PR 
#321|https://github.com/apache/nutch/pull/321] the directory structure of a 
merged segment looks correct:
{noformat}
.../mergedsegs/20180426130537
|-- content
|   `-- part-r-0
|   |-- data
|   `-- index
|-- crawl_fetch
|   `-- part-r-0
|   |-- data
|   `-- index
|-- crawl_generate
|   `-- part-r-0
|-- crawl_parse
|   `-- part-r-0
|-- parse_data
|   `-- part-r-0
|   |-- data
|   `-- index
`-- parse_text
`-- part-r-0
|-- data
`-- index
{noformat}
Tested in local and pseudo-distributed mode. I've also verified that the merged 
segment can be read, see [test 
script|https://github.com/sebastian-nagel/nutch-test-single-node-cluster/blob/master/test_nutch_tools.sh#L67].
 I plan to test on a real Hadoop cluster next week.

> mergesegs corrupts segment data
> ---
>
> Key: NUTCH-2517
> URL: https://issues.apache.org/jira/browse/NUTCH-2517
> Project: Nutch
>  Issue Type: Bug
>  Components: segment
>Affects Versions: 1.15
> Environment: xubuntu 17.10, docker container of apache/nutch LATEST
>Reporter: Marco Ebbinghaus
>Assignee: Lewis John McGibbney
>Priority: Blocker
>  Labels: mapreduce, mergesegs
> Fix For: 1.15
>
> Attachments: Screenshot_2018-03-03_18-09-28.png, 
> Screenshot_2018-03-07_07-50-05.png
>
>
> The problem probably occurs since commit 
> [https://github.com/apache/nutch/commit/54510e503f7da7301a59f5f0e5bf4509b37d35b4]
> How to reproduce:
>  * create container from apache/nutch image (latest)
>  * open terminal in that container
>  * set http.agent.name
>  * create crawldir and urls file
>  * run bin/nutch inject (bin/nutch inject mycrawl/crawldb urls/urls)
>  * run bin/nutch generate (bin/nutch generate mycrawl/crawldb 
> mycrawl/segments 1)
>  ** this results in a segment (e.g. 20180304134215)
>  * run bin/nutch fetch (bin/nutch fetch mycrawl/segments/20180304134215 
> -threads 2)
>  * run bin/nutch parse (bin/nutch parse mycrawl/segments/20180304134215 
> -threads 2)
>  ** ls in the segment folder -> existing folders: content, crawl_fetch, 
> crawl_generate, crawl_parse, parse_data, parse_text
>  * run bin/nutch updatedb (bin/nutch updatedb mycrawl/crawldb 
> mycrawl/segments/20180304134215)
>  * run bin/nutch mergesegs (bin/nutch mergesegs mycrawl/MERGEDsegments 
> mycrawl/segments/* -filter)
>  ** console output: `SegmentMerger: using segment data from: content 
> crawl_generate crawl_fetch crawl_parse parse_data parse_text`
>  ** resulting segment: 20180304134535
>  * ls in mycrawl/MERGEDsegments/segment/20180304134535 -> only existing 
> folder: crawl_generate
>  * run bin/nutch invertlinks (bin/nutch invertlinks mycrawl/linkdb -dir 
> mycrawl/MERGEDsegments) which results in a consequential error
>  ** console output: `LinkDb: adding segment: 
> [file:/root/nutch_source/runtime/local/mycrawl/MERGEDsegments/20180304134535|file:///root/nutch_source/runtime/local/mycrawl/MERGEDsegments/20180304134535]
>  LinkDb: org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input 
> path does not exist: 
> [file:/root/nutch_source/runtime/local/mycrawl/MERGEDsegments/20180304134535/parse_data|file:///root/nutch_source/runtime/local/mycrawl/MERGEDsegments/20180304134535/parse_data]
>      at 
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:323)
>      at 
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:265)
>      at 
> org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:59)
>      at 
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:387)
>      at 
> org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:301)
>      at 
> org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:318)
>      at 
> org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:196)
>      at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1290)
>      at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1287)
>      at java.security.AccessController.doPrivileged(Native Method)
>      at javax.security.auth.Subject.doAs(Subject.java:422)
>      at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1746)
>      at org.apache.hadoop.mapreduce.Job.submit(Job.java:1287)
>      at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1308)
>      at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:224)
>      at 

[jira] [Commented] (NUTCH-2517) mergesegs corrupts segment data

2018-04-26 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16453849#comment-16453849
 ] 

ASF GitHub Bot commented on NUTCH-2517:
---

sebastian-nagel closed pull request #321: NUTCH-2517 mergesegs corrupts segment 
data
URL: https://github.com/apache/nutch/pull/321
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/src/java/org/apache/nutch/crawl/CrawlDbReader.java 
b/src/java/org/apache/nutch/crawl/CrawlDbReader.java
index 87bf58525..424db3d5a 100644
--- a/src/java/org/apache/nutch/crawl/CrawlDbReader.java
+++ b/src/java/org/apache/nutch/crawl/CrawlDbReader.java
@@ -99,7 +99,6 @@ private void openReaders(String crawlDb, Configuration config)
 if (readers != null)
   return;
 Path crawlDbPath = new Path(crawlDb, CrawlDb.CURRENT_NAME);
-FileSystem fs = crawlDbPath.getFileSystem(config);
 readers = MapFileOutputFormat.getReaders(crawlDbPath, config);
   }
 
@@ -180,7 +179,7 @@ public synchronized void close(TaskAttemptContext context) 
throws IOException {
 
 public RecordWriter getRecordWriter(TaskAttemptContext
 context) throws IOException {
-  String name = context.getTaskAttemptID().toString();
+  String name = getUniqueFile(context, "part", "");
   Path dir = FileOutputFormat.getOutputPath(context);
   FileSystem fs = dir.getFileSystem(context.getConfiguration());
   DataOutputStream fileOut = fs.create(new Path(dir, name), context);
diff --git a/src/java/org/apache/nutch/indexer/IndexerOutputFormat.java 
b/src/java/org/apache/nutch/indexer/IndexerOutputFormat.java
index 54b98dfce..359f9d1fc 100644
--- a/src/java/org/apache/nutch/indexer/IndexerOutputFormat.java
+++ b/src/java/org/apache/nutch/indexer/IndexerOutputFormat.java
@@ -34,7 +34,7 @@
 Configuration conf = context.getConfiguration();
 final IndexWriters writers = new IndexWriters(conf);
 
-String name = context.getTaskAttemptID().toString();
+String name = getUniqueFile(context, "part", "");
 writers.open(conf, name);
 
 return new RecordWriter() {
diff --git a/src/java/org/apache/nutch/segment/SegmentMerger.java 
b/src/java/org/apache/nutch/segment/SegmentMerger.java
index b1f1d8948..f4adf52b4 100644
--- a/src/java/org/apache/nutch/segment/SegmentMerger.java
+++ b/src/java/org/apache/nutch/segment/SegmentMerger.java
@@ -139,7 +139,6 @@
 throws IOException {
 
   context.setStatus(split.toString());
-  Configuration conf = context.getConfiguration();
 
   // find part name
   SegmentPart segmentPart;
@@ -213,7 +212,7 @@ public synchronized void close() throws IOException {
 public RecordWriter getRecordWriter(TaskAttemptContext 
context)
 throws IOException {
   Configuration conf = context.getConfiguration();
-  String name = context.getTaskAttemptID().toString();
+  String name = getUniqueFile(context, "part", "");
   Path dir = FileOutputFormat.getOutputPath(context);
   FileSystem fs = dir.getFileSystem(context.getConfiguration());
 
diff --git a/src/java/org/apache/nutch/segment/SegmentReader.java 
b/src/java/org/apache/nutch/segment/SegmentReader.java
index 0b65a2b81..7193c58f7 100644
--- a/src/java/org/apache/nutch/segment/SegmentReader.java
+++ b/src/java/org/apache/nutch/segment/SegmentReader.java
@@ -106,8 +106,7 @@ public void map(WritableComparable key, Writable value,
   FileOutputFormat {
 public RecordWriter getRecordWriter(
 TaskAttemptContext context) throws IOException, InterruptedException {
-  Configuration conf = context.getConfiguration();
-  String name = context.getTaskAttemptID().toString();
+  String name = getUniqueFile(context, "part", "");
   Path dir = FileOutputFormat.getOutputPath(context);
   FileSystem fs = dir.getFileSystem(context.getConfiguration());
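For context: the recurring change in this patch replaces the raw task attempt ID with 
FileOutputFormat.getUniqueFile(), so output files get the conventional part-r-NNNNN 
names that MapFileOutputFormat and the follow-up jobs expect. A minimal sketch of the 
pattern in isolation (a toy output format, not the committed code, assuming the new 
MapReduce API):
{noformat}
import java.io.DataOutputStream;
import java.io.IOException;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class PartNamedOutputFormat extends FileOutputFormat<Text, Writable> {
  @Override
  public RecordWriter<Text, Writable> getRecordWriter(TaskAttemptContext context)
      throws IOException, InterruptedException {
    // "part-r-00000" etc. instead of "attempt_..._r_000000_0"
    String name = getUniqueFile(context, "part", "");
    Path dir = FileOutputFormat.getOutputPath(context);
    FileSystem fs = dir.getFileSystem(context.getConfiguration());
    DataOutputStream out = fs.create(new Path(dir, name), context);
    return new RecordWriter<Text, Writable>() {
      @Override
      public void write(Text key, Writable value) throws IOException {
        out.writeBytes(key + "\t" + value + "\n");
      }
      @Override
      public void close(TaskAttemptContext ctx) throws IOException {
        out.close();
      }
    };
  }
}
{noformat}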
 


 


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> mergesegs corrupts segment data
> ---
>
> Key: NUTCH-2517
> URL: https://issues.apache.org/jira/browse/NUTCH-2517
> Project: Nutch
>  Issue Type: Bug
>  Components: segment
>Affects Versions: 1.15
> Environment: xubuntu 17.10, docker container of apache/nutch LATEST
>Reporter: Marco Ebbinghaus
> 

[jira] [Resolved] (NUTCH-2526) NPE in scoring-opic when indexing document without CrawlDb datum

2018-04-26 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-2526.

Resolution: Fixed

Thanks, [~yash21]!

> NPE in scoring-opic when indexing document without CrawlDb datum
> 
>
> Key: NUTCH-2526
> URL: https://issues.apache.org/jira/browse/NUTCH-2526
> Project: Nutch
>  Issue Type: Improvement
>  Components: parser, scoring
>Affects Versions: 1.14
>Reporter: Yash Thenuan
>Assignee: Sebastian Nagel
>Priority: Major
> Fix For: 1.15
>
>
> I was trying to write a parse filter plugin whose job is to parse internal 
> links as separate documents. What I did, basically, is break the page into 
> multiple ParseResults, each ParseResult having the ParseText and ParseData 
> corresponding to an internal link. I was able to parse them separately, but 
> at scoring time an error occurred.
> I am attaching the indexing logs.
>  
>  2018-03-07 15:41:52,327 INFO  indexer.IndexerMapReduce - IndexerMapReduce: 
> crawldb: crawl/crawldb
> 2018-03-07 15:41:52,327 INFO  indexer.IndexerMapReduce - IndexerMapReduce: 
> linkdb: crawl/linkdb
> 2018-03-07 15:41:52,327 INFO  indexer.IndexerMapReduce - IndexerMapReduces: 
> adding segment: crawl/segments/20180307130959
> 2018-03-07 15:41:53,677 INFO  anchor.AnchorIndexingFilter - Anchor 
> deduplication is: off
> 2018-03-07 15:41:54,861 INFO  indexer.IndexWriters - Adding 
> org.apache.nutch.indexwriter.elasticrest.ElasticRestIndexWriter
> 2018-03-07 15:41:55,168 INFO  client.AbstractJestClient - Setting server pool 
> to a list of 1 servers: [http://localhost:9200]
> 2018-03-07 15:41:55,170 INFO  client.JestClientFactory - Using multi 
> thread/connection supporting pooling connection manager
> 2018-03-07 15:41:55,238 INFO  client.JestClientFactory - Using default GSON 
> instance
> 2018-03-07 15:41:55,238 INFO  client.JestClientFactory - Node Discovery 
> disabled...
> 2018-03-07 15:41:55,238 INFO  client.JestClientFactory - Idle connection 
> reaping disabled...
> 2018-03-07 15:41:55,282 INFO  elasticrest.ElasticRestIndexWriter - Processing 
> remaining requests [docs = 1, length = 210402, total docs = 1]
> 2018-03-07 15:41:55,361 INFO  elasticrest.ElasticRestIndexWriter - Processing 
> to finalize last execute
> 2018-03-07 15:41:55,458 INFO  elasticrest.ElasticRestIndexWriter - Previous 
> took in ms 175, including wait 97
> 2018-03-07 15:41:55,468 WARN  mapred.LocalJobRunner - job_local1561152089_0001
> java.lang.Exception: java.lang.NullPointerException
>   at 
> org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
>   at 
> org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:529)
> Caused by: java.lang.NullPointerException
>   at 
> org.apache.nutch.scoring.opic.OPICScoringFilter.indexerScore(OPICScoringFilter.java:171)
>   at 
> org.apache.nutch.scoring.ScoringFilters.indexerScore(ScoringFilters.java:120)
>   at 
> org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:296)
>   at 
> org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:57)
>   at 
> org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:444)
>   at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:392)
>   at 
> org.apache.hadoop.mapred.LocalJobRunner$Job$ReduceTaskRunnable.run(LocalJobRunner.java:319)
>   at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> 2018-03-07 15:41:55,510 ERROR indexer.IndexingJob - Indexer: 
> java.io.IOException: Job failed!
>   at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:873)
>   at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:147)
>   at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:230)
>   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
>   at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:239)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (NUTCH-2526) NPE in scoring-opic when indexing document without CrawlDb datum

2018-04-26 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16453844#comment-16453844
 ] 

ASF GitHub Bot commented on NUTCH-2526:
---

sebastian-nagel closed pull request #324: NUTCH-2526 NPE in scoring-opic when 
indexing document without CrawlDb datum
URL: https://github.com/apache/nutch/pull/324
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/src/java/org/apache/nutch/scoring/ScoringFilter.java 
b/src/java/org/apache/nutch/scoring/ScoringFilter.java
index c1acc482f..2941980f2 100644
--- a/src/java/org/apache/nutch/scoring/ScoringFilter.java
+++ b/src/java/org/apache/nutch/scoring/ScoringFilter.java
@@ -193,17 +193,22 @@ public default void orphanedScore(Text url, CrawlDatum 
datum)
   }
 
   /**
-   * This method calculates a Lucene document boost.
+   * This method calculates an indexed document score/boost.
* 
* @param url
*  url of the page
* @param doc
-   *  Lucene document. NOTE: this already contains all information
+   *  indexed document. NOTE: this already contains all information
*  collected by indexing filters. Implementations may modify this
*  instance, in order to store/remove some information.
* @param dbDatum
-   *  current page from CrawlDb. NOTE: changes made to this instance 
are
-   *  not persisted.
+   *  current page from CrawlDb. NOTE:
+   *  
+   *  changes made to this instance are not persisted
+   *  may be null if indexing is done without CrawlDb or if the
+   *  segment is generated not from the CrawlDb (via
+   *  FreeGenerator).
+   *  
* @param fetchDatum
*  datum from FetcherOutput (containing among others the fetching
*  status)
@@ -214,10 +219,10 @@ public default void orphanedScore(Text url, CrawlDatum 
datum)
*  current inlinks from LinkDb. NOTE: changes made to this instance
*  are not persisted.
* @param initScore
-   *  initial boost value for the Lucene document.
-   * @return boost value for the Lucene document. This value is passed as an
+   *  initial boost value for the indexed document.
+   * @return boost value for the indexed document. This value is passed as an
* argument to the next scoring filter in chain. NOTE: 
implementations
-   * may also express other scoring strategies by modifying Lucene
+   * may also express other scoring strategies by modifying the indexed
* document directly.
* @throws ScoringFilterException
*/
diff --git 
a/src/plugin/scoring-link/src/java/org/apache/nutch/scoring/link/LinkAnalysisScoringFilter.java
 
b/src/plugin/scoring-link/src/java/org/apache/nutch/scoring/link/LinkAnalysisScoringFilter.java
index a143f46a9..c98ccce44 100644
--- 
a/src/plugin/scoring-link/src/java/org/apache/nutch/scoring/link/LinkAnalysisScoringFilter.java
+++ 
b/src/plugin/scoring-link/src/java/org/apache/nutch/scoring/link/LinkAnalysisScoringFilter.java
@@ -36,6 +36,7 @@
 
   private Configuration conf;
   private float normalizedScore = 1.00f;
+  private float initialScore = 0.0f;
 
   public LinkAnalysisScoringFilter() {
 
@@ -64,12 +65,15 @@ public float generatorSortValue(Text url, CrawlDatum datum, 
float initSort)
   public float indexerScore(Text url, NutchDocument doc, CrawlDatum dbDatum,
   CrawlDatum fetchDatum, Parse parse, Inlinks inlinks, float initScore)
   throws ScoringFilterException {
+if (dbDatum == null) {
+  return initScore;
+}
 return (normalizedScore * dbDatum.getScore());
   }
 
   public void initialScore(Text url, CrawlDatum datum)
   throws ScoringFilterException {
-datum.setScore(0.0f);
+datum.setScore(initialScore);
   }
 
   public void injectedScore(Text url, CrawlDatum datum)
diff --git 
a/src/plugin/scoring-opic/src/java/org/apache/nutch/scoring/opic/OPICScoringFilter.java
 
b/src/plugin/scoring-opic/src/java/org/apache/nutch/scoring/opic/OPICScoringFilter.java
index 530f267f1..5a080bed2 100644
--- 
a/src/plugin/scoring-opic/src/java/org/apache/nutch/scoring/opic/OPICScoringFilter.java
+++ 
b/src/plugin/scoring-opic/src/java/org/apache/nutch/scoring/opic/OPICScoringFilter.java
@@ -167,6 +167,9 @@ public CrawlDatum distributeScoreToOutlinks(Text fromUrl,
   public float indexerScore(Text url, NutchDocument doc, CrawlDatum dbDatum,
   CrawlDatum fetchDatum, Parse parse, Inlinks inlinks, float initScore)
   throws ScoringFilterException {
+if (dbDatum == null) {
+  return initScore;
+}
 return (float) Math.pow(dbDatum.getScore(), scorePower) * initScore;
   }
 }
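The same guard is worth copying into any custom scoring filter: dbDatum may be null 
when a document has no CrawlDb entry, e.g. extra documents emitted by a parse filter 
(as in this report) or segments produced via FreeGenerator. A minimal sketch, assuming 
the AbstractScoringFilter convenience base class from Nutch 1.x (not part of this patch):
{noformat}
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.Inlinks;
import org.apache.nutch.indexer.NutchDocument;
import org.apache.nutch.parse.Parse;
import org.apache.nutch.scoring.AbstractScoringFilter;
import org.apache.nutch.scoring.ScoringFilterException;

public class NullSafeScoringFilter extends AbstractScoringFilter {
  @Override
  public float indexerScore(Text url, NutchDocument doc, CrawlDatum dbDatum,
      CrawlDatum fetchDatum, Parse parse, Inlinks inlinks, float initScore)
      throws ScoringFilterException {
    if (dbDatum == null) {
      // no CrawlDb datum available: fall back to the incoming score
      return initScore;
    }
    return dbDatum.getScore() * initScore;
  }
}
{noformat}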



[jira] [Resolved] (NUTCH-2544) Nutch 1.15 no longer compatible with AWS EMR and S3

2018-04-26 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-2544.

Resolution: Fixed

Thanks, [~sjwoodard]! I haven't yet had time to test the solution on EMR but hope to 
do so next week, as soon as all known issues related to the upgrade of the MapReduce 
API are fixed.

> Nutch 1.15 no longer compatible with AWS EMR and S3
> ---
>
> Key: NUTCH-2544
> URL: https://issues.apache.org/jira/browse/NUTCH-2544
> Project: Nutch
>  Issue Type: Bug
>  Components: fetcher, generator
>Affects Versions: 1.15
>Reporter: Steven W
>Assignee: Sebastian Nagel
>Priority: Critical
> Fix For: 1.15
>
>
> Nutch 1.14 is working OK with AWS EMR and S3 storage, but NUTCH-2375 appears 
> to have broken this.
> Generator partitioning fails with Error: java.lang.NullPointerException at 
> org.apache.nutch.crawl.URLPartitioner.getPartition(URLPartitioner.java:75)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (NUTCH-2527) URL filter: provide rules to exclude localhost and private address spaces

2018-04-26 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16453835#comment-16453835
 ] 

Hudson commented on NUTCH-2527:
---

SUCCESS: Integrated in Jenkins build Nutch-trunk #3521 (See 
[https://builds.apache.org/job/Nutch-trunk/3521/])
NUTCH-2527 URL filter: provide rules to exclude localhost and private (snagel: 
[https://github.com/apache/nutch/commit/b9e18d900ee574e7040eccffdb183373c3e99ba1])
* (edit) conf/regex-urlfilter.txt.template


> URL filter: provide rules to exclude localhost and private address spaces
> -
>
> Key: NUTCH-2527
> URL: https://issues.apache.org/jira/browse/NUTCH-2527
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 2.3.1, 1.14
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Minor
> Fix For: 2.4, 1.15
>
>
> While checking the log files of a large web crawl, I've found hundreds of 
> (luckily failed) requests for local or private content:
> {noformat}
> 2018-02-18 04:48:34,022 INFO [FetcherThread] 
> org.apache.nutch.fetcher.Fetcher: fetching http://127.0.0.42/ ...
> 018-02-18 04:48:34,022 INFO [FetcherThread] org.apache.nutch.fetcher.Fetcher: 
> fetch of http://127.0.0.42/ failed with: java.net.ConnectException: 
> Connection refused (Connection refused)
> {noformat}
> URLs pointing to localhost, loop-back addresses, or private address spaces 
> should be blocked in a wider web crawl where links are not controlled, to 
> avoid information being leaked via links or redirects pointing to web 
> interfaces of services running on the crawling machines (e.g., HDFS, Hadoop 
> YARN).
> Of course, this must be optional. For testing it's quite common to crawl your 
> local machine.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (NUTCH-2544) Nutch 1.15 no longer compatible with AWS EMR and S3

2018-04-26 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16453830#comment-16453830
 ] 

ASF GitHub Bot commented on NUTCH-2544:
---

sebastian-nagel closed pull request #320: NUTCH-2544 Nutch 1.15 no longer 
compatible with AWS EMR and S3
URL: https://github.com/apache/nutch/pull/320
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/src/java/org/apache/nutch/fetcher/FetcherOutputFormat.java 
b/src/java/org/apache/nutch/fetcher/FetcherOutputFormat.java
index 56b24e4a7..9feb7458d 100644
--- a/src/java/org/apache/nutch/fetcher/FetcherOutputFormat.java
+++ b/src/java/org/apache/nutch/fetcher/FetcherOutputFormat.java
@@ -47,15 +47,11 @@
   @Override
   public void checkOutputSpecs(JobContext job) throws IOException {
 Configuration conf = job.getConfiguration();
-FileSystem fs = FileSystem.get(conf);
 Path out = FileOutputFormat.getOutputPath(job);
 if ((out == null) && (job.getNumReduceTasks() != 0)) {
   throw new InvalidJobConfException("Output directory not set in conf.");
 }
-
-if (fs == null) {
-  fs = out.getFileSystem(conf);
-}
+FileSystem fs = out.getFileSystem(conf);
 if (fs.exists(new Path(out, CrawlDatum.FETCH_DIR_NAME))) {
   throw new IOException("Segment already fetched!");
 }
diff --git a/src/java/org/apache/nutch/util/SitemapProcessor.java 
b/src/java/org/apache/nutch/util/SitemapProcessor.java
index ea28550bd..0762ae4a7 100644
--- a/src/java/org/apache/nutch/util/SitemapProcessor.java
+++ b/src/java/org/apache/nutch/util/SitemapProcessor.java
@@ -336,7 +336,7 @@ public void sitemap(Path crawldb, Path hostdb, Path 
sitemapUrlDir, boolean stric
   LOG.info("SitemapProcessor: Starting at {}", sdf.format(start));
 }
 
-FileSystem fs = FileSystem.get(getConf());
+FileSystem fs = crawldb.getFileSystem(getConf());
 Path old = new Path(crawldb, "old");
 Path current = new Path(crawldb, "current");
 Path tempCrawlDb = new Path(crawldb, "crawldb-" + Integer.toString(new 
Random().nextInt(Integer.MAX_VALUE)));
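The underlying pattern here: obtain the FileSystem from the Path being accessed 
instead of FileSystem.get(conf), so that paths living on S3 (s3a://) resolve to the 
right implementation rather than the cluster's default file system. A small 
stand-alone sketch (hypothetical path, not Nutch code):
{noformat}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PathFsCheck {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path out = new Path(args[0]);            // e.g. s3a://my-bucket/crawl/segments
    FileSystem fs = out.getFileSystem(conf); // not FileSystem.get(conf)
    System.out.println(fs.getUri() + " exists=" + fs.exists(out));
  }
}
{noformat}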


 


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Nutch 1.15 no longer compatible with AWS EMR and S3
> ---
>
> Key: NUTCH-2544
> URL: https://issues.apache.org/jira/browse/NUTCH-2544
> Project: Nutch
>  Issue Type: Bug
>  Components: fetcher, generator
>Affects Versions: 1.15
>Reporter: Steven W
>Assignee: Sebastian Nagel
>Priority: Critical
> Fix For: 1.15
>
>
> Nutch 1.14 is working OK with AWS EMR and S3 storage, but NUTCH-2375 appears 
> to have broken this.
> Generator partitioning fails with Error: java.lang.NullPointerException at 
> org.apache.nutch.crawl.URLPartitioner.getPartition(URLPartitioner.java:75)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (NUTCH-1228) Change mapred.task.timeout to mapreduce.task.timeout in fetcher

2018-04-26 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16453826#comment-16453826
 ] 

Markus Jelsma commented on NUTCH-1228:
--

Wow, this is ancient! Thanks!

> Change mapred.task.timeout to mapreduce.task.timeout in fetcher
> ---
>
> Key: NUTCH-1228
> URL: https://issues.apache.org/jira/browse/NUTCH-1228
> Project: Nutch
>  Issue Type: Task
>  Components: fetcher
>Affects Versions: 2.3.1, 1.14
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Trivial
> Fix For: 2.4, 1.15
>
> Attachments: NUTCH-1228-2.1.patch
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (NUTCH-1228) Change mapred.task.timeout to mapreduce.task.timeout in fetcher

2018-04-26 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16453808#comment-16453808
 ] 

Sebastian Nagel edited comment on NUTCH-1228 at 4/26/18 10:44 AM:
--

Fixed for 1.x (together with NUTCH-2552) and 2.x ([PR 
#319|https://github.com/apache/nutch/pull/319]). Thanks!


was (Author: wastl-nagel):
Fixed for 1.x (together with NUTCH-2552) and 2.x ([PR 
#319](https://github.com/apache/nutch/pull/319)). Thanks!

> Change mapred.task.timeout to mapreduce.task.timeout in fetcher
> ---
>
> Key: NUTCH-1228
> URL: https://issues.apache.org/jira/browse/NUTCH-1228
> Project: Nutch
>  Issue Type: Task
>  Components: fetcher
>Affects Versions: 2.3.1, 1.14
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Trivial
> Fix For: 2.4, 1.15
>
> Attachments: NUTCH-1228-2.1.patch
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (NUTCH-2527) URL filter: provide rules to exclude localhost and private address spaces

2018-04-26 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-2527.

Resolution: Implemented

Committed to 1.x 
([1475fa3|https://github.com/apache/nutch/commit/1475fa3320897493124ab4339ee4728ac9a876ea])
 and 2.x 
([d62ece0|https://github.com/apache/nutch/commit/d62ece00469fd6b2012418062602246f090e10c5]).

> URL filter: provide rules to exclude localhost and private address spaces
> -
>
> Key: NUTCH-2527
> URL: https://issues.apache.org/jira/browse/NUTCH-2527
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 2.3.1, 1.14
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Minor
> Fix For: 2.4, 1.15
>
>
> While checking the log files of a large web crawl, I've found hundreds of 
> (luckily failed) requests for local or private content:
> {noformat}
> 2018-02-18 04:48:34,022 INFO [FetcherThread] 
> org.apache.nutch.fetcher.Fetcher: fetching http://127.0.0.42/ ...
> 018-02-18 04:48:34,022 INFO [FetcherThread] org.apache.nutch.fetcher.Fetcher: 
> fetch of http://127.0.0.42/ failed with: java.net.ConnectException: 
> Connection refused (Connection refused)
> {noformat}
> URLs pointing to localhost, loop-back addresses, or private address spaces 
> should be blocked in a wider web crawl where links are not controlled, to 
> avoid information being leaked via links or redirects pointing to web 
> interfaces of services running on the crawling machines (e.g., HDFS, Hadoop 
> YARN).
> Of course, this must be optional. For testing it's quite common to crawl your 
> local machine.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (NUTCH-1228) Change mapred.task.timeout to mapreduce.task.timeout in fetcher

2018-04-26 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16453822#comment-16453822
 ] 

Hudson commented on NUTCH-1228:
---

SUCCESS: Integrated in Jenkins build Nutch-nutchgora #1606 (See 
[https://builds.apache.org/job/Nutch-nutchgora/1606/])
NUTCH-1228 Change mapred.task.timeout to mapreduce.task.timeout in (snagel: 
[https://github.com/apache/nutch/commit/418f93a4609658cffe8b02841a9db0c0025de865])
* (edit) src/java/org/apache/nutch/indexer/CleaningJob.java
* (edit) src/java/org/apache/nutch/crawl/WebTableReader.java
* (edit) src/java/org/apache/nutch/indexer/IndexingJob.java
* (edit) src/bin/crawl
* (edit) src/java/org/apache/nutch/fetcher/FetcherReducer.java
* (edit) src/java/org/apache/nutch/fetcher/FetcherJob.java


> Change mapred.task.timeout to mapreduce.task.timeout in fetcher
> ---
>
> Key: NUTCH-1228
> URL: https://issues.apache.org/jira/browse/NUTCH-1228
> Project: Nutch
>  Issue Type: Task
>  Components: fetcher
>Affects Versions: 2.3.1, 1.14
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Trivial
> Fix For: 2.4, 1.15
>
> Attachments: NUTCH-1228-2.1.patch
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (NUTCH-2527) URL filter: provide rules to exclude localhost and private address spaces

2018-04-26 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel reassigned NUTCH-2527:
--

Assignee: Sebastian Nagel

> URL filter: provide rules to exclude localhost and private address spaces
> -
>
> Key: NUTCH-2527
> URL: https://issues.apache.org/jira/browse/NUTCH-2527
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 2.3.1, 1.14
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Minor
> Fix For: 2.4, 1.15
>
>
> While checking the log files of a large web crawl, I've found hundreds of 
> (luckily failed) requests for local or private content:
> {noformat}
> 2018-02-18 04:48:34,022 INFO [FetcherThread] 
> org.apache.nutch.fetcher.Fetcher: fetching http://127.0.0.42/ ...
> 018-02-18 04:48:34,022 INFO [FetcherThread] org.apache.nutch.fetcher.Fetcher: 
> fetch of http://127.0.0.42/ failed with: java.net.ConnectException: 
> Connection refused (Connection refused)
> {noformat}
> URLs pointing to localhost, loop-back addresses, or private address spaces 
> should be blocked in a wider web crawl where links are not controlled, to 
> avoid information being leaked via links or redirects pointing to web 
> interfaces of services running on the crawling machines (e.g., HDFS, Hadoop 
> YARN).
> Of course, this must be optional. For testing it's quite common to crawl your 
> local machine.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (NUTCH-2527) URL filter: provide rules to exclude localhost and private address spaces

2018-04-26 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16453811#comment-16453811
 ] 

ASF GitHub Bot commented on NUTCH-2527:
---

sebastian-nagel closed pull request #292: NUTCH-2527 URL filter: provide rules 
to exclude localhost and private address spaces
URL: https://github.com/apache/nutch/pull/292
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/conf/regex-urlfilter.txt.template 
b/conf/regex-urlfilter.txt.template
index bcf9c87d7..b060cbb7b 100644
--- a/conf/regex-urlfilter.txt.template
+++ b/conf/regex-urlfilter.txt.template
@@ -16,6 +16,7 @@
 
 # The default url filter.
 # Better for whole-internet crawling.
+# Please comment/uncomment rules to your needs.
 
 # Each non-comment, non-blank line contains a regular expression
 # prefixed by '+' or '-'.  The first matching pattern in the file
@@ -35,5 +36,26 @@
 # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
 -.*(/[^/]+)/[^/]+\1/[^/]+\1/
 
+# For safe web crawling if crawled content is exposed in a public search 
interface:
+# - exclude private network addresses to avoid that information
+#   can be leaked by placing links pointing to web interfaces of services
+#   running on the crawling machines (e.g., HDFS, Hadoop YARN)
+# - in addition, file:// URLs should be either excluded by a URL filter rule
+#   or ignored by not enabling protocol-file
+#
+# - exclude localhost and loop-back addresses
+# http://localhost:8080
+# http://127.0.0.1/ .. http://127.255.255.255/
+# http://[::1]/
+#-^https?://(?:localhost|127(?:\.(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))){3}|\[::1\])(?::\d+)?(?:/|$)
+#
+# - exclude private IP address spaces
+# 10.0.0.0/8
+#-^https?://(?:10(?:\.(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))){3})(?::\d+)?(?:/|$)
+# 192.168.0.0/16
+#-^https?://(?:192\.168(?:\.(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))){2})(?::\d+)?(?:/|$)
+# 172.16.0.0/12
+#-^https?://(?:172\.(?:1[6789]|2[0-9]|3[01])(?:\.(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))){2})(?::\d+)?(?:/|$)
+
 # accept anything else
 +.
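A quick way to sanity-check one of the commented-out rules before enabling it, using 
plain java.util.regex (a sketch; as far as I can tell RegexURLFilter applies rules the 
same way, via find()):
{noformat}
import java.util.regex.Pattern;

public class LoopbackRuleCheck {
  public static void main(String[] args) {
    // the localhost/loop-back rule from regex-urlfilter.txt.template above
    Pattern p = Pattern.compile(
        "^https?://(?:localhost|127(?:\\.(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))){3}|\\[::1\\])(?::\\d+)?(?:/|$)");
    String[] urls = { "http://localhost:8080", "http://127.0.0.42/",
        "http://[::1]/", "https://example.org/" };
    for (String url : urls) {
      System.out.println(url + " -> excluded=" + p.matcher(url).find());
    }
  }
}
{noformat}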


 


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> URL filter: provide rules to exclude localhost and private address spaces
> -
>
> Key: NUTCH-2527
> URL: https://issues.apache.org/jira/browse/NUTCH-2527
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 2.3.1, 1.14
>Reporter: Sebastian Nagel
>Priority: Minor
> Fix For: 2.4, 1.15
>
>
> While checking the log files of a large web crawl, I've found hundreds of 
> (luckily failed) requests for local or private content:
> {noformat}
> 2018-02-18 04:48:34,022 INFO [FetcherThread] 
> org.apache.nutch.fetcher.Fetcher: fetching http://127.0.0.42/ ...
> 018-02-18 04:48:34,022 INFO [FetcherThread] org.apache.nutch.fetcher.Fetcher: 
> fetch of http://127.0.0.42/ failed with: java.net.ConnectException: 
> Connection refused (Connection refused)
> {noformat}
> URLs pointing to localhost, loop-back addresses, or private address spaces 
> should be blocked in a wider web crawl where links are not controlled, to 
> avoid information being leaked via links or redirects pointing to web 
> interfaces of services running on the crawling machines (e.g., HDFS, Hadoop 
> YARN).
> Of course, this must be optional. For testing it's quite common to crawl your 
> local machine.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (NUTCH-1228) Change mapred.task.timeout to mapreduce.task.timeout in fetcher

2018-04-26 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-1228.

   Resolution: Fixed
Fix Version/s: 1.15

Fixed for 1.x (together with NUTCH-2552) and 2.x ([PR 
#319](https://github.com/apache/nutch/pull/319)). Thanks!
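For anyone tuning this per job: the renamed keys (see the crawl script hunks in the PR 
diff further below) are ordinary Hadoop configuration properties, so they can be set 
in nutch-site.xml, passed with -D on the command line, or set programmatically. A tiny 
sketch (assuming Hadoop 2.x; the value is just an example):
{noformat}
import org.apache.hadoop.conf.Configuration;

public class TimeoutKeyExample {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // the name Nutch now uses; Hadoop's deprecation table still maps the old
    // mapred.task.timeout spelling onto it
    conf.setLong("mapreduce.task.timeout", 30 * 60 * 1000L); // 30 minutes
    System.out.println(conf.get("mapreduce.task.timeout"));
  }
}
{noformat}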

> Change mapred.task.timeout to mapreduce.task.timeout in fetcher
> ---
>
> Key: NUTCH-1228
> URL: https://issues.apache.org/jira/browse/NUTCH-1228
> Project: Nutch
>  Issue Type: Task
>  Components: fetcher
>Affects Versions: 2.3.1, 1.14
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Trivial
> Fix For: 2.4, 1.15
>
> Attachments: NUTCH-1228-2.1.patch
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (NUTCH-1228) Change mapred.task.timeout to mapreduce.task.timeout in fetcher

2018-04-26 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-1228:
---
Affects Version/s: 2.3.1
   1.14

> Change mapred.task.timeout to mapreduce.task.timeout in fetcher
> ---
>
> Key: NUTCH-1228
> URL: https://issues.apache.org/jira/browse/NUTCH-1228
> Project: Nutch
>  Issue Type: Task
>  Components: fetcher
>Affects Versions: 2.3.1, 1.14
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Trivial
> Fix For: 2.4
>
> Attachments: NUTCH-1228-2.1.patch
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (NUTCH-1228) Change mapred.task.timeout to mapreduce.task.timeout in fetcher

2018-04-26 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16453805#comment-16453805
 ] 

ASF GitHub Bot commented on NUTCH-1228:
---

sebastian-nagel closed pull request #319: NUTCH-1228 Change mapred.task.timeout 
to mapreduce.task.timeout in fetcher
URL: https://github.com/apache/nutch/pull/319
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/src/bin/crawl b/src/bin/crawl
index 1a31d7d0a..27db6de6c 100644
--- a/src/bin/crawl
+++ b/src/bin/crawl
@@ -61,7 +61,7 @@ fi
 numSlaves=1
 
 # and the total number of available tasks
-# sets Hadoop parameter "mapred.reduce.tasks"
+# sets Hadoop parameter "mapreduce.job.reduces"
 numTasks=`expr $numSlaves \* 2`
 
 # number of urls to fetch in one iteration
@@ -88,7 +88,7 @@ fi
 
 # note that some of the options listed here could be set in the 
 # corresponding hadoop site xml param file 
-commonOptions="-D mapred.reduce.tasks=$numTasks -D 
mapred.child.java.opts=-Xmx1000m -D 
mapred.reduce.tasks.speculative.execution=false -D 
mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true"
+commonOptions="-D mapreduce.job.reduces=$numTasks -D 
mapred.child.java.opts=-Xmx1000m -D mapreduce.reduce.speculative=false -D 
mapreduce.map.speculative=false -D mapreduce.map.output.compress=true"
 
  # check that hadoop can be found on the path 
 if [ $mode = "distributed" ]; then
@@ -161,7 +161,7 @@ do
   echo "Parsing : "
   # enable the skipping of records for the parsing so that a dodgy document 
   # so that it does not fail the full task
-  skipRecordsOptions="-D mapred.skip.attempts.to.start.skipping=2 -D 
mapred.skip.map.max.skip.records=1"
+  skipRecordsOptions="-D mapreduce.task.skip.start.attempts=2 -D 
mapreduce.map.skip.maxrecords=1"
   __bin_nutch parse $commonOptions $skipRecordsOptions $batchId -crawlId 
"$CRAWL_ID"
 
   # updatedb with this batch
diff --git a/src/java/org/apache/nutch/crawl/WebTableReader.java 
b/src/java/org/apache/nutch/crawl/WebTableReader.java
index 5985dd6cf..941ae9ac4 100644
--- a/src/java/org/apache/nutch/crawl/WebTableReader.java
+++ b/src/java/org/apache/nutch/crawl/WebTableReader.java
@@ -539,7 +539,7 @@ public int run(String[] args) throws Exception {
   // for now handles only -stat
   @Override
   public Map run(Map args) throws Exception {
-Path tmpFolder = new Path(getConf().get("mapred.temp.dir", ".")
+Path tmpFolder = new Path(getConf().get("mapreduce.cluster.temp.dir", ".")
 + "stat_tmp" + System.currentTimeMillis());
 
 numJobs = 1;
diff --git a/src/java/org/apache/nutch/fetcher/FetcherJob.java 
b/src/java/org/apache/nutch/fetcher/FetcherJob.java
index bd06121b2..82e7a126c 100644
--- a/src/java/org/apache/nutch/fetcher/FetcherJob.java
+++ b/src/java/org/apache/nutch/fetcher/FetcherJob.java
@@ -214,7 +214,7 @@ public FetcherJob(Configuration conf) {
 StorageUtils.initReducerJob(currentJob, FetcherReducer.class);
 if (numTasks == null || numTasks < 1) {
   currentJob.setNumReduceTasks(currentJob.getConfiguration().getInt(
-  "mapred.map.tasks", currentJob.getNumReduceTasks()));
+  "mapreduce.job.maps", currentJob.getNumReduceTasks()));
 } else {
   currentJob.setNumReduceTasks(numTasks);
 }
@@ -247,7 +247,7 @@ public FetcherJob(Configuration conf) {
* @param shouldResume
* @param numTasks
*  number of fetching tasks (reducers). If set to  1 then use 
the
-   *  default, which is mapred.map.tasks.
+   *  default, which is mapreduce.job.maps.
* @return 0 on success
* @throws Exception
*/
@@ -267,7 +267,7 @@ public int fetch(String batchId, int threads, boolean 
shouldResume,
* @param shouldResume
* @param numTasks
*  number of fetching tasks (reducers). If set to  1 then use 
the
-   *  default, which is mapred.map.tasks.
+   *  default, which is mapreduce.job.maps.
* @param stmDetect
*  If set true, sitemap detection is run.
* @param sitemap
@@ -326,7 +326,7 @@ public int run(String[] args) throws Exception {
 + "-crawlId  - the id to prefix the schemas to operate on, \n 
\t \t(default: storage.crawl.id)\n"
 + "-threads N- number of fetching threads per task\n"
 + "-resume   - resume interrupted job\n"
-+ "-numTasks N   - if N > 0 then use this many reduce tasks for 
fetching \n \t \t(default: mapred.map.tasks)"
++ "-numTasks N   - if N > 0 then use this many reduce tasks for 
fetching \n \t \t(default: mapreduce.job.maps)"
 + "-sitemap  - only sitemap files are fetched, defaults to