[jira] [Commented] (NUTCH-2940) Develop Gradle Core Build for Apache Nutch

2022-06-15 Thread Lewis John McGibbney (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17554866#comment-17554866
 ] 

Lewis John McGibbney commented on NUTCH-2940:
-

WIP PR available at https://github.com/apache/nutch/pull/735

> Develop Gradle Core Build for Apache Nutch
> --
>
> Key: NUTCH-2940
> URL: https://issues.apache.org/jira/browse/NUTCH-2940
> Project: Nutch
>  Issue Type: Sub-task
>  Components: build
>Reporter: James Simmons
>Assignee: Lewis John McGibbney
>Priority: Major
>
> This issue will focus on the build lifecycle management for the core build of 
> Apache Nutch as seen here: 
> [https://github.com/apache/nutch/tree/master/src/java|https://github.com/apache/nutch/tree/master/src/java]





[jira] [Commented] (NUTCH-2490) Sitemap processing: Sitemap index files not working

2022-06-15 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17554865#comment-17554865
 ] 

ASF GitHub Bot commented on NUTCH-2490:
---

lewismc commented on PR #735:
URL: https://github.com/apache/nutch/pull/735#issuecomment-1157159803

   I'll squash and merge the commits into one when we are ready to merge into the `master` branch.




> Sitemap processing: Sitemap index files not working
> ---
>
> Key: NUTCH-2490
> URL: https://issues.apache.org/jira/browse/NUTCH-2490
> Project: Nutch
>  Issue Type: Bug
>Reporter: Moreno Feltscher
>Assignee: Moreno Feltscher
>Priority: Major
> Fix For: 1.15
>
>
> The [sitemap processing feature|https://wiki.apache.org/nutch/SitemapFeature] 
> does not properly handle sitemap index files due to a unnecessary conditional.





[GitHub] [nutch] lewismc commented on pull request #735: NUTCH-2490 Develop Gradle Core Build for Apache Nutch

2022-06-15 Thread GitBox


lewismc commented on PR #735:
URL: https://github.com/apache/nutch/pull/735#issuecomment-1157159803

   I'll squash and merge the commits into one when we are ready to merge into the `master` branch.





[GitHub] [nutch] lewismc opened a new pull request, #735: Nutch 2940

2022-06-15 Thread GitBox


lewismc opened a new pull request, #735:
URL: https://github.com/apache/nutch/pull/735

   This is a WIP for https://issues.apache.org/jira/browse/NUTCH-2940. The work 
was conducted by @AzureTriple @imanzanganeh @jbsimmon @LilyPerr and 
@Lirongxuan1 from the 2022 USC Senior CS Capstone Program.
   
   Most of the core build is in place. No plugin sub-projects have been 
implemented yet.
   
   I intend to continue work on the core build until it is completed. I will then move on to the plugin sub-projects.





[jira] [Assigned] (NUTCH-2940) Develop Gradle Core Build for Apache Nutch

2022-06-15 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney reassigned NUTCH-2940:
---

Assignee: Lewis John McGibbney

> Develop Gradle Core Build for Apache Nutch
> --
>
> Key: NUTCH-2940
> URL: https://issues.apache.org/jira/browse/NUTCH-2940
> Project: Nutch
>  Issue Type: Sub-task
>  Components: build
>Reporter: James Simmons
>Assignee: Lewis John McGibbney
>Priority: Major
>
> This issue will focus on the build lifecycle management for the core build of 
> Apache Nutch as seen here: 
> [https://github.com/apache/nutch/tree/master/src/java|https://github.com/apache/nutch/tree/master/src/java]





[jira] [Commented] (NUTCH-2936) Early registration of URL stream handlers provided by plugins may fail Hadoop jobs running in distributed mode if protocol-okhttp is used

2022-06-15 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17554859#comment-17554859
 ] 

ASF GitHub Bot commented on NUTCH-2936:
---

lewismc commented on PR #733:
URL: https://github.com/apache/nutch/pull/733#issuecomment-1157149313

   This is exciting!!! Excellent debugging 👍 ... you got further than me.
   I can't get around to testing it until next week at the earliest.
   Thinking back, I did observe revisits (recursive access) to URLStreamHandlerFactory but didn't pursue that line of inquiry at the time.
   To get a bit more context I reviewed [HADOOP-14598-005.patch](https://issues.apache.org/jira/secure/attachment/12880380/HADOOP-14598-005.patch) and the current class it affects. Reading the code, it makes more sense, but admittedly I won't have the full context until I debug this.
   I also took a look at [hadoop-hdfs TestUrlStreamHandler.java](https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/fs/TestUrlStreamHandler.java), which I really like. To build out some more confidence in this aspect of the codebase, we could create some tests for the [nutch URLStreamHandlerFactory.java](https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/plugin/URLStreamHandlerFactory.java).




> Early registration of URL stream handlers provided by plugins may fail Hadoop 
> jobs running in distributed mode if protocol-okhttp is used
> -
>
> Key: NUTCH-2936
> URL: https://issues.apache.org/jira/browse/NUTCH-2936
> Project: Nutch
>  Issue Type: Bug
>  Components: plugin, protocol
>Affects Versions: 1.19
>Reporter: Sebastian Nagel
>Assignee: Lewis John McGibbney
>Priority: Blocker
> Fix For: 1.19
>
>
> After merging NUTCH-2429 I've observed that Nutch jobs running in distributed 
> mode may fail early with the following dubious error:
> {noformat}
> 2022-01-14 13:11:45,751 ERROR crawl.DedupRedirectsJob: DeduplicationJob: 
> java.io.IOException: Error generating shuffle secret key
> at 
> org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:182)
> at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1565)
> at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1562)
> at java.base/java.security.AccessController.doPrivileged(Native 
> Method)
> at java.base/javax.security.auth.Subject.doAs(Subject.java:423)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1762)
> at org.apache.hadoop.mapreduce.Job.submit(Job.java:1562)
> at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1583)
> at 
> org.apache.nutch.crawl.DedupRedirectsJob.run(DedupRedirectsJob.java:301)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
> at 
> org.apache.nutch.crawl.DedupRedirectsJob.main(DedupRedirectsJob.java:379)
> at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.base/java.lang.reflect.Method.invoke(Method.java:566)
> at org.apache.hadoop.util.RunJar.run(RunJar.java:323)
> at org.apache.hadoop.util.RunJar.main(RunJar.java:236)
> Caused by: java.security.NoSuchAlgorithmException: HmacSHA1 KeyGenerator not 
> available
> at java.base/javax.crypto.KeyGenerator.<init>(KeyGenerator.java:177)
> at 
> java.base/javax.crypto.KeyGenerator.getInstance(KeyGenerator.java:244)
> at 
> org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:179)
> ... 16 more
> {noformat}
> After removing the early registration of URL stream handlers (see NUTCH-2429) 
> in NutchJob and NutchTool, the job starts without errors.
> Notes:
> - the job in which this error was observed is a [custom de-duplication job|https://github.com/commoncrawl/nutch/blob/cc/src/java/org/apache/nutch/crawl/DedupRedirectsJob.java] to flag redirects pointing to the same target URL. But I'll try to reproduce it with a standard Nutch job and in pseudo-distributed mode.
> - should also verify whether registering URL stream handlers works at all in 
> distributed mode. Tasks are launched differently, not as NutchJob or 
> NutchTool.





[GitHub] [nutch] lewismc commented on pull request #733: NUTCH-2936 / NUTCH-2949 URLStreamHandler may fail jobs in distributed mode

2022-06-15 Thread GitBox


lewismc commented on PR #733:
URL: https://github.com/apache/nutch/pull/733#issuecomment-1157149313

   This is exciting!!! Excellent debugging 👍 ... you got further than me.
   I can't get around to testing it until next week at the earliest.
   Thinking back, I did observe revisits (recursive access) to URLStreamHandlerFactory but didn't pursue that line of inquiry at the time.
   To get a bit more context I reviewed [HADOOP-14598-005.patch](https://issues.apache.org/jira/secure/attachment/12880380/HADOOP-14598-005.patch) and the current class it affects. Reading the code, it makes more sense, but admittedly I won't have the full context until I debug this.
   I also took a look at [hadoop-hdfs TestUrlStreamHandler.java](https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/fs/TestUrlStreamHandler.java), which I really like. To build out some more confidence in this aspect of the codebase, we could create some tests for the [nutch URLStreamHandlerFactory.java](https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/plugin/URLStreamHandlerFactory.java).
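
   For reference, a minimal sketch of what such a test might look like (JUnit 4; the class name and URLs are made up, and it assumes the Nutch URLStreamHandlerFactory has already been registered with the JVM by the surrounding test setup):

```java
import static org.junit.Assert.assertEquals;

import java.net.MalformedURLException;
import java.net.URL;

import org.junit.Test;

/**
 * Sketch of a test for URL stream handler resolution (hypothetical name).
 * Assumes the Nutch URLStreamHandlerFactory has already been registered
 * with the JVM by the surrounding test setup.
 */
public class TestUrlStreamHandlerResolution {

  @Test
  public void testStandardProtocolsResolve() throws MalformedURLException {
    // URLs for the standard protocols should be constructible, i.e. a
    // handler must be found for each of them.
    assertEquals("http", new URL("http://example.com/").getProtocol());
    assertEquals("https", new URL("https://example.com/").getProtocol());
    assertEquals("file", new URL("file:///tmp/test.txt").getProtocol());
    assertEquals("jar",
        new URL("jar:file:///tmp/test.jar!/entry").getProtocol());
  }

  @Test
  public void testHttpUrlIsParsedCorrectly() throws MalformedURLException {
    // Constructing and parsing an http URL should behave exactly as with
    // the JVM's built-in handler.
    URL u = new URL("http://example.com:8080/path?q=1");
    assertEquals("example.com", u.getHost());
    assertEquals(8080, u.getPort());
    assertEquals("/path", u.getPath());
  }
}
```

   This only exercises java.net.URL construction and parsing for the standard protocols; asserting that plugin-provided protocols resolve as well would additionally need the plugin repository set up in the test.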





[jira] [Commented] (NUTCH-2936) Early registration of URL stream handlers provided by plugins may fail Hadoop jobs running in distributed mode if protocol-okhttp is used

2022-06-15 Thread Sebastian Nagel (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17554702#comment-17554702
 ] 

Sebastian Nagel commented on NUTCH-2936:


Update: the issue is also reproducible in local mode after upgrading to a more recent Hadoop (3.3.3, see [PR #734|https://github.com/apache/nutch/pull/734]) and using protocol-okhttp.

> Early registration of URL stream handlers provided by plugins may fail Hadoop 
> jobs running in distributed mode if protocol-okhttp is used
> -
>
> Key: NUTCH-2936
> URL: https://issues.apache.org/jira/browse/NUTCH-2936
> Project: Nutch
>  Issue Type: Bug
>  Components: plugin, protocol
>Affects Versions: 1.19
>Reporter: Sebastian Nagel
>Assignee: Lewis John McGibbney
>Priority: Blocker
> Fix For: 1.19
>
>
> After merging NUTCH-2429 I've observed that Nutch jobs running in distributed 
> mode may fail early with the following dubious error:
> {noformat}
> 2022-01-14 13:11:45,751 ERROR crawl.DedupRedirectsJob: DeduplicationJob: 
> java.io.IOException: Error generating shuffle secret key
> at 
> org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:182)
> at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1565)
> at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1562)
> at java.base/java.security.AccessController.doPrivileged(Native 
> Method)
> at java.base/javax.security.auth.Subject.doAs(Subject.java:423)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1762)
> at org.apache.hadoop.mapreduce.Job.submit(Job.java:1562)
> at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1583)
> at 
> org.apache.nutch.crawl.DedupRedirectsJob.run(DedupRedirectsJob.java:301)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
> at 
> org.apache.nutch.crawl.DedupRedirectsJob.main(DedupRedirectsJob.java:379)
> at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.base/java.lang.reflect.Method.invoke(Method.java:566)
> at org.apache.hadoop.util.RunJar.run(RunJar.java:323)
> at org.apache.hadoop.util.RunJar.main(RunJar.java:236)
> Caused by: java.security.NoSuchAlgorithmException: HmacSHA1 KeyGenerator not 
> available
> at java.base/javax.crypto.KeyGenerator.<init>(KeyGenerator.java:177)
> at 
> java.base/javax.crypto.KeyGenerator.getInstance(KeyGenerator.java:244)
> at 
> org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:179)
> ... 16 more
> {noformat}
> After removing the early registration of URL stream handlers (see NUTCH-2429) 
> in NutchJob and NutchTool, the job starts without errors.
> Notes:
> - the job in which this error was observed is a [custom de-duplication job|https://github.com/commoncrawl/nutch/blob/cc/src/java/org/apache/nutch/crawl/DedupRedirectsJob.java] to flag redirects pointing to the same target URL. But I'll try to reproduce it with a standard Nutch job and in pseudo-distributed mode.
> - should also verify whether registering URL stream handlers works at all in 
> distributed mode. Tasks are launched differently, not as NutchJob or 
> NutchTool.





[jira] [Commented] (NUTCH-2952) Upgrade core dependencies (Hadoop 3.3.3, log4j 2.17.2)

2022-06-15 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17554701#comment-17554701
 ] 

ASF GitHub Bot commented on NUTCH-2952:
---

sebastian-nagel commented on PR #734:
URL: https://github.com/apache/nutch/pull/734#issuecomment-1156706013

   Update: the failing unit test (TestCrawlDbDeduplication) on my development system stems from a modified nutch-site.xml requesting protocol-okhttp - obviously, it's the combination of using protocol-okhttp and a more recent Hadoop version that triggers NUTCH-2936: the issue (NoSuchAlgorithmException) is then also reproducible in local mode.




> Upgrade core dependencies (Hadoop 3.3.3, log4j 2.17.2)
> --
>
> Key: NUTCH-2952
> URL: https://issues.apache.org/jira/browse/NUTCH-2952
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.18
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Major
> Fix For: 1.19
>
>
> Upgrade the core dependencies to Hadoop 3.3.3 and log4j 2.17.2 - and some 
> more.
> - [Hadoop 3.3.3|https://hadoop.apache.org/docs/r3.3.3/index.html] announces 
> full support for Java 11 and ARM architectures





[GitHub] [nutch] sebastian-nagel commented on pull request #734: NUTCH-2952 Upgrade core dependencies

2022-06-15 Thread GitBox


sebastian-nagel commented on PR #734:
URL: https://github.com/apache/nutch/pull/734#issuecomment-1156706013

   Update: the failing unit test (TestCrawlDbDeduplication) on my development system stems from a modified nutch-site.xml requesting protocol-okhttp - obviously, it's the combination of using protocol-okhttp and a more recent Hadoop version that triggers NUTCH-2936: the issue (NoSuchAlgorithmException) is then also reproducible in local mode.





[jira] [Commented] (NUTCH-2952) Upgrade core dependencies (Hadoop 3.3.3, log4j 2.17.2)

2022-06-15 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17554663#comment-17554663
 ] 

ASF GitHub Bot commented on NUTCH-2952:
---

sebastian-nagel opened a new pull request, #734:
URL: https://github.com/apache/nutch/pull/734

   Upgrade of core dependencies
   - Hadoop 3.1.3 -> 3.3.3
   - log4j 2.17.0 -> 2.17.2
   - and some more
   
   Note: I've observed that some unit tests are failing with the same or similar errors as observed in NUTCH-2936. Eventually, this PR needs to be rebased on top of #733.




> Upgrade core dependencies (Hadoop 3.3.3, log4j 2.17.2)
> --
>
> Key: NUTCH-2952
> URL: https://issues.apache.org/jira/browse/NUTCH-2952
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.18
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Major
> Fix For: 1.19
>
>
> Upgrade the core dependencies to Hadoop 3.3.3 and log4j 2.17.2 - and some 
> more.
> - [Hadoop 3.3.3|https://hadoop.apache.org/docs/r3.3.3/index.html] announces 
> full support for Java 11 and ARM architectures





[GitHub] [nutch] sebastian-nagel opened a new pull request, #734: NUTCH-2952 Upgrade core dependencies

2022-06-15 Thread GitBox


sebastian-nagel opened a new pull request, #734:
URL: https://github.com/apache/nutch/pull/734

   Upgrade of core dependencies
   - Hadoop 3.1.3 -> 3.3.3
   - log4j 2.17.0 -> 2.17.2
   - and some more
   
   Note: I've observed that some unit tests are failing with the same or similar errors as observed in NUTCH-2936. Eventually, this PR needs to be rebased on top of #733.





[jira] [Assigned] (NUTCH-2952) Upgrade core dependencies (Hadoop 3.3.3, log4j 2.17.2)

2022-06-15 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel reassigned NUTCH-2952:
--

Assignee: Sebastian Nagel

> Upgrade core dependencies (Hadoop 3.3.3, log4j 2.17.2)
> --
>
> Key: NUTCH-2952
> URL: https://issues.apache.org/jira/browse/NUTCH-2952
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.18
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Major
> Fix For: 1.19
>
>
> Upgrade the core dependencies to Hadoop 3.3.3 and log4j 2.17.2 - and some 
> more.
> - [Hadoop 3.3.3|https://hadoop.apache.org/docs/r3.3.3/index.html] announces 
> full support for Java 11 and ARM architectures





[jira] [Created] (NUTCH-2952) Upgrade core dependencies (Hadoop 3.3.3, log4j 2.17.2)

2022-06-15 Thread Sebastian Nagel (Jira)
Sebastian Nagel created NUTCH-2952:
--

 Summary: Upgrade core dependencies (Hadoop 3.3.3, log4j 2.17.2)
 Key: NUTCH-2952
 URL: https://issues.apache.org/jira/browse/NUTCH-2952
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.18
Reporter: Sebastian Nagel
 Fix For: 1.19


Upgrade the core dependencies to Hadoop 3.3.3 and log4j 2.17.2 - and some more.

- [Hadoop 3.3.3|https://hadoop.apache.org/docs/r3.3.3/index.html] announces 
full support for Java 11 and ARM architectures






[jira] [Commented] (NUTCH-2949) Tasks of a multi-threaded map runner may fail because of slow creation of URL stream handlers

2022-06-15 Thread Sebastian Nagel (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17554564#comment-17554564
 ] 

Sebastian Nagel commented on NUTCH-2949:


This is addressed in [PR#733|https://github.com/apache/nutch/pull/733].

> Tasks of a multi-threaded map runner may fail because of slow creation of URL 
> stream handlers
> -
>
> Key: NUTCH-2949
> URL: https://issues.apache.org/jira/browse/NUTCH-2949
> Project: Nutch
>  Issue Type: Bug
>  Components: net, plugin, protocol
>Affects Versions: 1.19
>Reporter: Sebastian Nagel
>Priority: Blocker
> Fix For: 1.19
>
>
> While running a custom Nutch job ([code here|https://github.com/commoncrawl/nutch/blob/cc/src/java/org/apache/nutch/crawl/SitemapInjector.java]), many but not all tasks failed by exceeding the Hadoop task time-out (`mapreduce.task.timeout`) without generating any "heartbeat" (output, counter increments, log messages). Hadoop logs the stacks of all threads of the timed-out task; that's the basis for the excerpts below.
> The job runs a MultithreadedMapper - most of the mapper threads (48 in total) are waiting for the URLStreamHandler in order to construct a java.net.URL object:
> {noformat}
> "Thread-11" #27 prio=5 os_prio=0 cpu=243.78ms elapsed=647.25s 
> tid=0x7f3eb5b0f800 nid=0x8e651 waiting for monitor entry  
> [0x7f3e84ef9000]
>    java.lang.Thread.State: BLOCKED (on object monitor)
>         at 
> org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:162)
>         - waiting to lock <0x0006a1bc0630> (a java.lang.String)
>         at 
> org.apache.nutch.plugin.PluginRepository.createURLStreamHandler(PluginRepository.java:597)
>         at 
> org.apache.nutch.plugin.URLStreamHandlerFactory.createURLStreamHandler(URLStreamHandlerFactory.java:95)
>         at java.net.URL.getURLStreamHandler(java.base@11.0.15/URL.java:1432)
>         at java.net.URL.<init>(java.base@11.0.15/URL.java:651)
>         at java.net.URL.<init>(java.base@11.0.15/URL.java:541)
>         at java.net.URL.<init>(java.base@11.0.15/URL.java:488)
>         at 
> org.apache.nutch.net.urlnormalizer.basic.BasicURLNormalizer.normalize(BasicURLNormalizer.java:179)
>         at 
> org.apache.nutch.net.URLNormalizers.normalize(URLNormalizers.java:318)
>         at 
> org.apache.nutch.crawl.Injector$InjectMapper.filterNormalize(Injector.java:157)
>         at 
> org.apache.nutch.crawl.SitemapInjector$SitemapInjectMapper$SitemapProcessor.getContent(SitemapInjector.java:670)
>         at 
> org.apache.nutch.crawl.SitemapInjector$SitemapInjectMapper$SitemapProcessor.process(SitemapInjector.java:439)
>         at 
> org.apache.nutch.crawl.SitemapInjector$SitemapInjectMapper.map(SitemapInjector.java:325)
>         at 
> org.apache.nutch.crawl.SitemapInjector$SitemapInjectMapper.map(SitemapInjector.java:145)
>         at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:146)
>         at 
> org.apache.hadoop.mapreduce.lib.map.MultithreadedMapper$MapRunner.run(MultithreadedMapper.java:274)
> {noformat}
> Only a single mapper thread is active:
> {noformat}
> "Thread-23" #39 prio=5 os_prio=0 cpu=5830.17ms elapsed=647.09s 
> tid=0x7f3eb5b42800 nid=0x8e661 in Object.wait()  [0x7f3e842ec000]
>java.lang.Thread.State: RUNNABLE
> at 
> jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(java.base@11.0.15/Native
>  Method)
> at 
> jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(java.base@11.0.15/NativeConstructorAccessorImpl.java:62)
> at 
> jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(java.base@11.0.15/DelegatingConstructorAccessorImpl.java:45)
> at 
> java.lang.reflect.Constructor.newInstance(java.base@11.0.15/Constructor.java:490)
> at 
> org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:170)
> - locked <0x0006a1bc0630> (a java.lang.String)
> at 
> org.apache.nutch.plugin.PluginRepository.createURLStreamHandler(PluginRepository.java:597)
> at 
> org.apache.nutch.plugin.URLStreamHandlerFactory.createURLStreamHandler(URLStreamHandlerFactory.java:95)
> at java.net.URL.getURLStreamHandler(java.base@11.0.15/URL.java:1432)
> at java.net.URL.<init>(java.base@11.0.15/URL.java:651)
> at java.net.URL.<init>(java.base@11.0.15/URL.java:541)
> at java.net.URL.<init>(java.base@11.0.15/URL.java:488)
> at 
> org.apache.nutch.net.urlnormalizer.basic.BasicURLNormalizer.normalize(BasicURLNormalizer.java:179)
> at 
> org.apache.nutch.net.URLNormalizers.normalize(URLNormalizers.java:318)
> at 
> org.apache.nutch.crawl.Injector$InjectMapper.filterNormalize(Injector.java:157)
> at 
> org.apache.nutch.crawl

[jira] [Commented] (NUTCH-2936) Early registration of URL stream handlers provided by plugins may fail Hadoop jobs running in distributed mode if protocol-okhttp is used

2022-06-15 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17554563#comment-17554563
 ] 

ASF GitHub Bot commented on NUTCH-2936:
---

sebastian-nagel opened a new pull request, #733:
URL: https://github.com/apache/nutch/pull/733

   Fixes to address the failure of Nutch jobs in (pseudo-)distributed mode. Implements:
   - caching of URLStreamHandlers per protocol so that handlers are not created anew
   - enforced routing of standard protocols (http, https, file, jar) to the handlers implemented by the JVM
     - utilizes the URLStreamHandler cache
     - fixes NUTCH-2936 (verified in pseudo-distributed mode)
   
   Also:
   - code improvements in classes of the package "org.apache.nutch.plugin"
     - use `Class` and remove warning suppressions
     - javadocs: fix typos
     - remove superfluous white space
     - autoformat using the code style template
   - protocol-okhttp: initialize the SSLContext outside of a static code block (the SSLContext is used to ignore SSL/TLS certificate verification). This was the initial fix for the needless warning shown by parsechecker even in local mode. It also seems to be fixed by the enforced routing of standard URLStreamHandlers, but I left it in to avoid having to rerun all the testing in pseudo-distributed mode.
   
   Next week I will test the fixes in real distributed mode.




> Early registration of URL stream handlers provided by plugins may fail Hadoop 
> jobs running in distributed mode if protocol-okhttp is used
> -
>
> Key: NUTCH-2936
> URL: https://issues.apache.org/jira/browse/NUTCH-2936
> Project: Nutch
>  Issue Type: Bug
>  Components: plugin, protocol
>Affects Versions: 1.19
>Reporter: Sebastian Nagel
>Assignee: Lewis John McGibbney
>Priority: Blocker
> Fix For: 1.19
>
>
> After merging NUTCH-2429 I've observed that Nutch jobs running in distributed 
> mode may fail early with the following dubious error:
> {noformat}
> 2022-01-14 13:11:45,751 ERROR crawl.DedupRedirectsJob: DeduplicationJob: 
> java.io.IOException: Error generating shuffle secret key
> at 
> org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:182)
> at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1565)
> at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1562)
> at java.base/java.security.AccessController.doPrivileged(Native 
> Method)
> at java.base/javax.security.auth.Subject.doAs(Subject.java:423)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1762)
> at org.apache.hadoop.mapreduce.Job.submit(Job.java:1562)
> at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1583)
> at 
> org.apache.nutch.crawl.DedupRedirectsJob.run(DedupRedirectsJob.java:301)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
> at 
> org.apache.nutch.crawl.DedupRedirectsJob.main(DedupRedirectsJob.java:379)
> at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.base/java.lang.reflect.Method.invoke(Method.java:566)
> at org.apache.hadoop.util.RunJar.run(RunJar.java:323)
> at org.apache.hadoop.util.RunJar.main(RunJar.java:236)
> Caused by: java.security.NoSuchAlgorithmException: HmacSHA1 KeyGenerator not 
> available
> at java.base/javax.crypto.KeyGenerator.<init>(KeyGenerator.java:177)
> at 
> java.base/javax.crypto.KeyGenerator.getInstance(KeyGenerator.java:244)
> at 
> org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:179)
> ... 16 more
> {noformat}
> After removing the early registration of URL stream handlers (see NUTCH-2429) 
> in NutchJob and NutchTool, the job starts without errors.
> Notes:
> - the job in which this error was observed is a [custom de-duplication job|https://github.com/commoncrawl/nutch/blob/cc/src/java/org/apache/nutch/crawl/DedupRedirectsJob.java] to flag redirects pointing to the same target URL. But I'll try to reproduce it with a standard Nutch job and in pseudo-distributed mode.
> - should also verify whether registering URL stream handlers works at all in 
> distributed mode. Tasks are launched differently, not as NutchJob or 
> NutchTool.





[GitHub] [nutch] sebastian-nagel opened a new pull request, #733: NUTCH-2936 / NUTCH-2949 URLStreamHandler may fail jobs in distributed mode

2022-06-15 Thread GitBox


sebastian-nagel opened a new pull request, #733:
URL: https://github.com/apache/nutch/pull/733

   Fixes to address the failure of Nutch jobs in (pseudo-)distributed mode. Implements:
   - caching of URLStreamHandlers per protocol so that handlers are not created anew (see the sketch below)
   - enforced routing of standard protocols (http, https, file, jar) to the handlers implemented by the JVM
     - utilizes the URLStreamHandler cache
     - fixes NUTCH-2936 (verified in pseudo-distributed mode)
   
   Also:
   - code improvements in classes of the package "org.apache.nutch.plugin"
     - use `Class` and remove warning suppressions
     - javadocs: fix typos
     - remove superfluous white space
     - autoformat using the code style template
   - protocol-okhttp: initialize the SSLContext outside of a static code block (the SSLContext is used to ignore SSL/TLS certificate verification). This was the initial fix for the needless warning shown by parsechecker even in local mode. It also seems to be fixed by the enforced routing of standard URLStreamHandlers, but I left it in to avoid having to rerun all the testing in pseudo-distributed mode.
   
   Next week I will test the fixes in real distributed mode.
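
   For illustration, a minimal sketch of the per-protocol caching idea from the first bullet (not the PR's actual code; `pluginLookup` is a hypothetical stand-in for the call into the plugin repository):

```java
import java.net.URLStreamHandler;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/**
 * Minimal sketch of per-protocol handler caching (illustration only).
 * The expensive plugin-repository lookup runs only while no handler is
 * cached; afterwards all threads get the cached instance without entering
 * the plugin system again.
 */
class CachingHandlerResolver {

  private final ConcurrentHashMap<String, URLStreamHandler> cache =
      new ConcurrentHashMap<>();

  /** Hypothetical stand-in for the plugin-repository lookup. */
  private final Function<String, URLStreamHandler> pluginLookup;

  CachingHandlerResolver(Function<String, URLStreamHandler> pluginLookup) {
    this.pluginLookup = pluginLookup;
  }

  URLStreamHandler resolve(String protocol) {
    URLStreamHandler handler = cache.get(protocol);
    if (handler == null) {
      // The lookup may run more than once under a race, but it is
      // idempotent; putIfAbsent keeps the first instance. A null result
      // means "no plugin handler, use the JVM's built-in one" and is
      // intentionally not cached.
      handler = pluginLookup.apply(protocol);
      if (handler != null) {
        URLStreamHandler previous = cache.putIfAbsent(protocol, handler);
        if (previous != null) {
          handler = previous;
        }
      }
    }
    return handler;
  }
}
```

   A plain get/putIfAbsent is used instead of computeIfAbsent so that a lookup which itself constructs URLs (and thereby re-enters the resolver for another protocol such as "jar") cannot run into ConcurrentHashMap's restriction on recursive updates.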





[jira] [Commented] (NUTCH-2936) Early registration of URL stream handlers provided by plugins may fail Hadoop jobs running in distributed mode if protocol-okhttp is used

2022-06-15 Thread Sebastian Nagel (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17554531#comment-17554531
 ] 

Sebastian Nagel commented on NUTCH-2936:


After debugging this: the Hadoop MR job's call to initialize the KeyGenerator recurses twice into Nutch's URLStreamHandlerFactory - first for the "http" protocol to create a [NULL_URL (http://null.oracle.com/)|https://github.com/openjdk/jdk/blob/0530f4e517be5d5b3ff10be8a0764e564f068c06/src/java.base/share/classes/javax/crypto/JceSecurity.java.template#L246], second for the "jar" protocol to load protocol-okhttp.jar. See the debug log output and the stack trace (some lines stripped):

{noformat}
2022-06-14 16:56:59,176 DEBUG plugin.URLStreamHandlerFactory: Registered URLStreamHandlerFactory with the JVM.
2022-06-14 16:56:59,994 DEBUG plugin.URLStreamHandlerFactory: Creating URLStreamHandler for protocol: http
2022-06-14 16:56:59,994 DEBUG plugin.PluginRepository: Creating URLStreamHandler for protocol: http
2022-06-14 16:56:59,995 DEBUG plugin.PluginRepository: Suitable protocolName attribute located: http
2022-06-14 16:57:00,007 DEBUG plugin.URLStreamHandlerFactory: Creating URLStreamHandler for protocol: jar
2022-06-14 16:57:00,007 DEBUG plugin.PluginRepository: Creating URLStreamHandler for protocol: jar
2022-06-14 16:57:00,008 DEBUG plugin.PluginRepository: No suitable protocol extensions registered for protocol: jar
2022-06-14 16:57:00,320 DEBUG plugin.PluginRepository: Located extension instance class: org.apache.nutch.protocol.okhttp.OkHttp
2022-06-14 16:57:00,320 DEBUG plugin.PluginRepository: Suitable protocol extension found that did not declare a handler
{noformat}

{noformat}
    at org.apache.nutch.plugin.PluginRepository.createURLStreamHandler(PluginRepository.java:583)
    at org.apache.nutch.plugin.URLStreamHandlerFactory.createURLStreamHandler(URLStreamHandlerFactory.java:95)
    at java.base/java.net.URL.getURLStreamHandler(URL.java:1432)
    at java.base/java.net.URL.<init>(URL.java:451)
    at java.base/jdk.internal.loader.URLClassPath$JarLoader.<init>(URLClassPath.java:720)
    at java.base/jdk.internal.loader.URLClassPath$3.run(URLClassPath.java:494)
    at java.base/jdk.internal.loader.URLClassPath$3.run(URLClassPath.java:477)
    at java.base/java.security.AccessController.doPrivileged(Native Method)
    at java.base/jdk.internal.loader.URLClassPath.getLoader(URLClassPath.java:476)
    at java.base/jdk.internal.loader.URLClassPath.getLoader(URLClassPath.java:445)
    at java.base/jdk.internal.loader.URLClassPath.getResource(URLClassPath.java:314)
    at java.base/java.net.URLClassLoader$1.run(URLClassLoader.java:455)
    at java.base/java.net.URLClassLoader$1.run(URLClassLoader.java:452)
    at java.base/java.security.AccessController.doPrivileged(Native Method)
    at java.base/java.net.URLClassLoader.findClass(URLClassLoader.java:451)
    at org.apache.nutch.plugin.PluginClassLoader.loadClass(PluginClassLoader.java:71)
    at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:522)
    at org.apache.nutch.plugin.PluginRepository.getCachedClass(PluginRepository.java:349)
    at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:165)
    at org.apache.nutch.plugin.PluginRepository.createURLStreamHandler(PluginRepository.java:601)
    at org.apache.nutch.plugin.URLStreamHandlerFactory.createURLStreamHandler(URLStreamHandlerFactory.java:95)
    at java.base/java.net.URL.getURLStreamHandler(URL.java:1432)
    at java.base/java.net.URL.<init>(URL.java:651)
    at java.base/java.net.URL.<init>(URL.java:541)
    at java.base/java.net.URL.<init>(URL.java:488)
    at java.base/javax.crypto.JceSecurity.<clinit>(JceSecurity.java:239)
    at java.base/javax.crypto.KeyGenerator.nextSpi(KeyGenerator.java:363)
    at java.base/javax.crypto.KeyGenerator.<init>(KeyGenerator.java:176)
    at java.base/javax.crypto.KeyGenerator.getInstance(KeyGenerator.java:244)
    at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:179)
    ...
    at org.apache.hadoop.mapreduce.Job.submit(Job.java:1568)
    at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1589)
    at org.apache.nutch.crawl.Injector.inject(Injector.java:436)
    at org.apache.nutch.crawl.Injector.run(Injector.java:569)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:81)
    at org.apache.nutch.crawl.Injector.main(Injector.java:533)
    ...
{noformat}

I do not understand why the initialization of the KeyGenerator fails only in this combination (distributed mode and using protocol-okhttp). Nevertheless, we should never delegate standard protocols, whose URLStreamHandlers are implemented by the JVM, to handlers that require the Nutch plugin system with all its complexity and its plugin-specific classloaders.
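
A minimal sketch of that rule (illustration only, not the actual Nutch implementation; the class name is made up): by contract, java.net.URL falls back to the JVM's built-in handlers whenever a registered URLStreamHandlerFactory returns null for a protocol, so short-circuiting the standard protocols keeps them out of the plugin system entirely.

{noformat}
import java.net.URLStreamHandler;
import java.net.URLStreamHandlerFactory;
import java.util.Set;

/**
 * Sketch of the "never delegate standard protocols" rule. Returning null
 * from createURLStreamHandler makes java.net.URL use its built-in handler,
 * so http, https, file and jar URLs never reach the plugin repository and
 * cannot recurse back into this factory while protocol-okhttp.jar is being
 * class-loaded.
 */
class StandardProtocolGuardFactory implements URLStreamHandlerFactory {

  private static final Set<String> JVM_PROTOCOLS =
      Set.of("http", "https", "file", "jar");

  /** Delegate that resolves non-standard protocols via the plugin system. */
  private final URLStreamHandlerFactory pluginFactory;

  StandardProtocolGuardFactory(URLStreamHandlerFactory pluginFactory) {
    this.pluginFactory = pluginFactory;
  }

  @Override
  public URLStreamHandler createURLStreamHandler(String protocol) {
    if (JVM_PROTOCOLS.contains(protocol)) {
      // null = "use the JVM's built-in handler"; the plugin repository is
      // never consulted for standard protocols.
      return null;
    }
    return pluginFactory.createURLStreamHandler(protocol);
  }
}
{noformat}

With such a guard in front of the plugin-backed lookup, the KeyGenerator initialization path (an "http" URL followed by a "jar" URL) stays entirely on the JVM's built-in handlers.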