[jira] [Commented] (NUTCH-2940) Develop Gradle Core Build for Apache Nutch
[ https://issues.apache.org/jira/browse/NUTCH-2940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17554866#comment-17554866 ]

Lewis John McGibbney commented on NUTCH-2940:
---------------------------------------------

WIP PR available at https://github.com/apache/nutch/pull/735

> Develop Gradle Core Build for Apache Nutch
> ------------------------------------------
>
>            Key: NUTCH-2940
>            URL: https://issues.apache.org/jira/browse/NUTCH-2940
>        Project: Nutch
>     Issue Type: Sub-task
>     Components: build
>       Reporter: James Simmons
>       Assignee: Lewis John McGibbney
>       Priority: Major
>
> This issue will focus on the build lifecycle management for the core build of
> Apache Nutch as seen here:
> https://github.com/apache/nutch/tree/master/src/java

--
This message was sent by Atlassian Jira
(v8.20.7#820007)
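[Editorial note for context: the PR above is WIP and its build script is not quoted in this thread. The following is only a hypothetical sketch of what a minimal Gradle core build for a Java codebase laid out like Nutch's (`src/java`, `src/test`, `conf`) might start from; every path, version, and dependency here is an assumption, not the content of PR #735.]

```groovy
// build.gradle - hypothetical starting point, not the contents of PR #735
plugins {
    id 'java'
}

group = 'org.apache.nutch'
version = '1.19-SNAPSHOT'

java {
    // Assumption: Java 11, in line with the Hadoop 3.3.x upgrade discussed below
    toolchain {
        languageVersion = JavaLanguageVersion.of(11)
    }
}

sourceSets {
    main {
        // Map Gradle's conventions onto Nutch's existing source layout
        java.srcDirs = ['src/java']
        resources.srcDirs = ['conf']
    }
    test {
        java.srcDirs = ['src/test']
    }
}

repositories {
    mavenCentral()
}

dependencies {
    implementation 'org.apache.hadoop:hadoop-common:3.3.3'
    testImplementation 'junit:junit:4.13.2'
}
```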
[jira] [Commented] (NUTCH-2490) Sitemap processing: Sitemap index files not working
[ https://issues.apache.org/jira/browse/NUTCH-2490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17554865#comment-17554865 ]

ASF GitHub Bot commented on NUTCH-2490:
---------------------------------------

lewismc commented on PR #735:
URL: https://github.com/apache/nutch/pull/735#issuecomment-1157159803

I'll squash and merge the commits into one when we are ready to merge into the `master` branch.

> Sitemap processing: Sitemap index files not working
> ---------------------------------------------------
>
>            Key: NUTCH-2490
>            URL: https://issues.apache.org/jira/browse/NUTCH-2490
>        Project: Nutch
>     Issue Type: Bug
>       Reporter: Moreno Feltscher
>       Assignee: Moreno Feltscher
>       Priority: Major
>        Fix For: 1.15
>
> The [sitemap processing feature|https://wiki.apache.org/nutch/SitemapFeature]
> does not properly handle sitemap index files due to an unnecessary conditional.
[GitHub] [nutch] lewismc commented on pull request #735: NUTCH-2490 Develop Gradle Core Build for Apache Nutch
lewismc commented on PR #735:
URL: https://github.com/apache/nutch/pull/735#issuecomment-1157159803

I'll squash and merge the commits into one when we are ready to merge into the `master` branch.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@nutch.apache.org
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
[GitHub] [nutch] lewismc opened a new pull request, #735: Nutch 2940
lewismc opened a new pull request, #735:
URL: https://github.com/apache/nutch/pull/735

This is a WIP for https://issues.apache.org/jira/browse/NUTCH-2940. The work was conducted by @AzureTriple @imanzanganeh @jbsimmon @LilyPerr and @Lirongxuan1 from the 2022 USC Senior CS Capstone Program.

Most of the core build is in place. No plugin sub-projects have been implemented yet. I intend to continue work on the core build until it is completed. I will then move on to the plugin sub-projects.
[jira] [Assigned] (NUTCH-2940) Develop Gradle Core Build for Apache Nutch
[ https://issues.apache.org/jira/browse/NUTCH-2940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lewis John McGibbney reassigned NUTCH-2940:
-------------------------------------------

    Assignee: Lewis John McGibbney

> Develop Gradle Core Build for Apache Nutch
> ------------------------------------------
[jira] [Commented] (NUTCH-2936) Early registration of URL stream handlers provided by plugins may fail Hadoop jobs running in distributed mode if protocol-okhttp is used
[ https://issues.apache.org/jira/browse/NUTCH-2936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17554859#comment-17554859 ]

ASF GitHub Bot commented on NUTCH-2936:
---------------------------------------

lewismc commented on PR #733:
URL: https://github.com/apache/nutch/pull/733#issuecomment-1157149313

This is exciting!!! Excellent debugging 👍 ... you got further than me. I can't get around to testing it until next week at the earliest.

Thinking back, I did observe revisits (recursive access) to URLStreamHandlerFactory but didn't pursue that line of inquiry at that point in time. To get a bit more context I did review [HADOOP-14598-005.patch](https://issues.apache.org/jira/secure/attachment/12880380/HADOOP-14598-005.patch) and the current class it affects. Reading the code it makes more sense, but admittedly until I debug this I still don't have the full context.

I took a look at [hadoop-hdfs TestUrlStreamHandler.java](https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/fs/TestUrlStreamHandler.java) as well, which I really like the look of. To build out some more confidence in this aspect of the codebase, we could create some tests for the [nutch URLStreamHandlerFactory.java](https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/plugin/URLStreamHandlerFactory.java).
> Early registration of URL stream handlers provided by plugins may fail Hadoop
> jobs running in distributed mode if protocol-okhttp is used
> -----------------------------------------------------------------------------
>
>            Key: NUTCH-2936
>            URL: https://issues.apache.org/jira/browse/NUTCH-2936
>        Project: Nutch
>     Issue Type: Bug
>     Components: plugin, protocol
> Affects Versions: 1.19
>       Reporter: Sebastian Nagel
>       Assignee: Lewis John McGibbney
>       Priority: Blocker
>        Fix For: 1.19
>
> After merging NUTCH-2429 I've observed that Nutch jobs running in distributed
> mode may fail early with the following dubious error:
> {noformat}
> 2022-01-14 13:11:45,751 ERROR crawl.DedupRedirectsJob: DeduplicationJob:
> java.io.IOException: Error generating shuffle secret key
>         at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:182)
>         at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1565)
>         at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1562)
>         at java.base/java.security.AccessController.doPrivileged(Native Method)
>         at java.base/javax.security.auth.Subject.doAs(Subject.java:423)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1762)
>         at org.apache.hadoop.mapreduce.Job.submit(Job.java:1562)
>         at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1583)
>         at org.apache.nutch.crawl.DedupRedirectsJob.run(DedupRedirectsJob.java:301)
>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
>         at org.apache.nutch.crawl.DedupRedirectsJob.main(DedupRedirectsJob.java:379)
>         at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>         at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at java.base/java.lang.reflect.Method.invoke(Method.java:566)
>         at org.apache.hadoop.util.RunJar.run(RunJar.java:323)
>         at org.apache.hadoop.util.RunJar.main(RunJar.java:236)
> Caused by: java.security.NoSuchAlgorithmException: HmacSHA1 KeyGenerator not available
>         at java.base/javax.crypto.KeyGenerator.<init>(KeyGenerator.java:177)
>         at java.base/javax.crypto.KeyGenerator.getInstance(KeyGenerator.java:244)
>         at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:179)
>         ... 16 more
> {noformat}
> After removing the early registration of URL stream handlers (see NUTCH-2429)
> in NutchJob and NutchTool, the job starts without errors.
> Notes:
> - the job in which this error was observed is a [custom de-duplication job|https://github.com/commoncrawl/nutch/blob/cc/src/java/org/apache/nutch/crawl/DedupRedirectsJob.java]
>   to flag redirects pointing to the same target URL. But I'll try to reproduce
>   it with a standard Nutch job and in pseudo-distributed mode.
> - should also verify whether registering URL stream handlers works at all in
>   distributed mode. Tasks are launched differently, not as NutchJob or NutchTool.
[GitHub] [nutch] lewismc commented on pull request #733: NUTCH-2936 / NUTCH-2949 URLStreamHandler may fail jobs in distributed mode
lewismc commented on PR #733:
URL: https://github.com/apache/nutch/pull/733#issuecomment-1157149313
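[Editorial note: a first test along the lines suggested in the comment could exercise per-protocol handler caching in isolation. This is only a self-contained sketch under assumed semantics (at most one handler instance created per protocol); all class names are hypothetical, and it does not touch Nutch's actual URLStreamHandlerFactory.]

```java
import java.net.URL;
import java.net.URLConnection;
import java.net.URLStreamHandler;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class UrlStreamHandlerCacheTest {

  // Minimal stand-in for a plugin-provided handler (hypothetical).
  static class DummyHandler extends URLStreamHandler {
    @Override
    protected URLConnection openConnection(URL u) {
      return null; // never exercised by this test
    }
  }

  // Cache under test: one handler instance per protocol name.
  static final Map<String, URLStreamHandler> CACHE = new ConcurrentHashMap<>();

  static URLStreamHandler handlerFor(String protocol) {
    // computeIfAbsent creates the handler at most once per protocol
    return CACHE.computeIfAbsent(protocol, p -> new DummyHandler());
  }

  public static void main(String[] args) {
    URLStreamHandler first = handlerFor("foo");
    URLStreamHandler second = handlerFor("foo");
    if (first != second) throw new AssertionError("handler not cached");
    if (handlerFor("bar") == first) throw new AssertionError("protocols share a handler");
    System.out.println("ok");
  }
}
```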
[jira] [Commented] (NUTCH-2936) Early registration of URL stream handlers provided by plugins may fail Hadoop jobs running in distributed mode if protocol-okhttp is used
[ https://issues.apache.org/jira/browse/NUTCH-2936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17554702#comment-17554702 ]

Sebastian Nagel commented on NUTCH-2936:
----------------------------------------

Update: the issue is reproducible also in local mode after upgrading to a more recent Hadoop (3.3.3, see [PR #734|https://github.com/apache/nutch/pull/734]) and using protocol-okhttp.

> Early registration of URL stream handlers provided by plugins may fail Hadoop
> jobs running in distributed mode if protocol-okhttp is used
> -----------------------------------------------------------------------------
[jira] [Commented] (NUTCH-2952) Upgrade core dependencies (Hadoop 3.3.3, log4j 2.17.2)
[ https://issues.apache.org/jira/browse/NUTCH-2952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17554701#comment-17554701 ]

ASF GitHub Bot commented on NUTCH-2952:
---------------------------------------

sebastian-nagel commented on PR #734:
URL: https://github.com/apache/nutch/pull/734#issuecomment-1156706013

Update: the failing unit test (TestCrawlDbDeduplication) on my development system stems from a modified nutch-site.xml requesting protocol-okhttp. Obviously, it's the combination of using protocol-okhttp and a more recent Hadoop version which triggers the appearance of NUTCH-2936: the issue (NoSuchAlgorithmException) is then reproducible also in local mode.

> Upgrade core dependencies (Hadoop 3.3.3, log4j 2.17.2)
> ------------------------------------------------------
>
>            Key: NUTCH-2952
>            URL: https://issues.apache.org/jira/browse/NUTCH-2952
>        Project: Nutch
>     Issue Type: Improvement
> Affects Versions: 1.18
>       Reporter: Sebastian Nagel
>       Assignee: Sebastian Nagel
>       Priority: Major
>        Fix For: 1.19
>
> Upgrade the core dependencies to Hadoop 3.3.3 and log4j 2.17.2 - and some more.
> - [Hadoop 3.3.3|https://hadoop.apache.org/docs/r3.3.3/index.html] announces
>   full support for Java 11 and ARM architectures
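[Editorial note: Nutch's Ant/Ivy build declares dependency versions in ivy/ivy.xml. The version bump described in NUTCH-2952 would look roughly like the fragment below; this is illustrative only - the artifact names and `conf` mappings are assumptions, not the actual PR #734 diff.]

```xml
<!-- ivy/ivy.xml (hypothetical fragment): upgraded core dependencies -->
<dependencies>
  <!-- Hadoop 3.1.3 -> 3.3.3 -->
  <dependency org="org.apache.hadoop" name="hadoop-common" rev="3.3.3" conf="*->default"/>
  <dependency org="org.apache.hadoop" name="hadoop-hdfs" rev="3.3.3" conf="*->default"/>
  <dependency org="org.apache.hadoop" name="hadoop-mapreduce-client-core" rev="3.3.3" conf="*->default"/>

  <!-- log4j 2.17.0 -> 2.17.2 -->
  <dependency org="org.apache.logging.log4j" name="log4j-api" rev="2.17.2" conf="*->default"/>
  <dependency org="org.apache.logging.log4j" name="log4j-core" rev="2.17.2" conf="*->default"/>
</dependencies>
```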
[GitHub] [nutch] sebastian-nagel commented on pull request #734: NUTCH-2952 Upgrade core dependencies
sebastian-nagel commented on PR #734:
URL: https://github.com/apache/nutch/pull/734#issuecomment-1156706013
[jira] [Commented] (NUTCH-2952) Upgrade core dependencies (Hadoop 3.3.3, log4j 2.17.2)
[ https://issues.apache.org/jira/browse/NUTCH-2952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17554663#comment-17554663 ]

ASF GitHub Bot commented on NUTCH-2952:
---------------------------------------

sebastian-nagel opened a new pull request, #734:
URL: https://github.com/apache/nutch/pull/734

Upgrade of core dependencies:
- Hadoop 3.1.3 -> 3.3.3
- log4j 2.17.0 -> 2.17.2
- and some more

Note: I've observed that some unit tests are failing with the same or similar errors as those observed in NUTCH-2936. Eventually, this PR needs to be rebased on top of #733.

> Upgrade core dependencies (Hadoop 3.3.3, log4j 2.17.2)
> ------------------------------------------------------
[GitHub] [nutch] sebastian-nagel opened a new pull request, #734: NUTCH-2952 Upgrade core dependencies
sebastian-nagel opened a new pull request, #734:
URL: https://github.com/apache/nutch/pull/734
[jira] [Assigned] (NUTCH-2952) Upgrade core dependencies (Hadoop 3.3.3, log4j 2.17.2)
[ https://issues.apache.org/jira/browse/NUTCH-2952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Nagel reassigned NUTCH-2952:
--------------------------------------

    Assignee: Sebastian Nagel

> Upgrade core dependencies (Hadoop 3.3.3, log4j 2.17.2)
> ------------------------------------------------------
[jira] [Created] (NUTCH-2952) Upgrade core dependencies (Hadoop 3.3.3, log4j 2.17.2)
Sebastian Nagel created NUTCH-2952:
-----------------------------------

         Summary: Upgrade core dependencies (Hadoop 3.3.3, log4j 2.17.2)
             Key: NUTCH-2952
             URL: https://issues.apache.org/jira/browse/NUTCH-2952
         Project: Nutch
      Issue Type: Improvement
Affects Versions: 1.18
        Reporter: Sebastian Nagel
         Fix For: 1.19

Upgrade the core dependencies to Hadoop 3.3.3 and log4j 2.17.2 - and some more.

- [Hadoop 3.3.3|https://hadoop.apache.org/docs/r3.3.3/index.html] announces full support for Java 11 and ARM architectures
[jira] [Commented] (NUTCH-2949) Tasks of a multi-threaded map runner may fail because of slow creation of URL stream handlers
[ https://issues.apache.org/jira/browse/NUTCH-2949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17554564#comment-17554564 ]

Sebastian Nagel commented on NUTCH-2949:
----------------------------------------

This is addressed in [PR #733|https://github.com/apache/nutch/pull/733].

> Tasks of a multi-threaded map runner may fail because of slow creation of URL
> stream handlers
> -----------------------------------------------------------------------------
>
>            Key: NUTCH-2949
>            URL: https://issues.apache.org/jira/browse/NUTCH-2949
>        Project: Nutch
>     Issue Type: Bug
>     Components: net, plugin, protocol
> Affects Versions: 1.19
>       Reporter: Sebastian Nagel
>       Priority: Blocker
>        Fix For: 1.19
>
> While running a custom Nutch job ([code here|https://github.com/commoncrawl/nutch/blob/cc/src/java/org/apache/nutch/crawl/SitemapInjector.java]),
> many but not all tasks failed, exceeding the Hadoop task time-out
> (`mapreduce.task.timeout`) without generating any "heartbeat" (output,
> counter increments, log messages). Hadoop logs the stacks of all threads of
> the timed-out task. That's the basis for the excerpts below.
> The job runs a MultithreadedMapper - most of the mapper threads (48 in total)
> are waiting for the URLStreamHandler in order to construct a java.net.URL object:
> {noformat}
> "Thread-11" #27 prio=5 os_prio=0 cpu=243.78ms elapsed=647.25s tid=0x7f3eb5b0f800 nid=0x8e651 waiting for monitor entry [0x7f3e84ef9000]
>    java.lang.Thread.State: BLOCKED (on object monitor)
>         at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:162)
>         - waiting to lock <0x0006a1bc0630> (a java.lang.String)
>         at org.apache.nutch.plugin.PluginRepository.createURLStreamHandler(PluginRepository.java:597)
>         at org.apache.nutch.plugin.URLStreamHandlerFactory.createURLStreamHandler(URLStreamHandlerFactory.java:95)
>         at java.net.URL.getURLStreamHandler(java.base@11.0.15/URL.java:1432)
>         at java.net.URL.<init>(java.base@11.0.15/URL.java:651)
>         at java.net.URL.<init>(java.base@11.0.15/URL.java:541)
>         at java.net.URL.<init>(java.base@11.0.15/URL.java:488)
>         at org.apache.nutch.net.urlnormalizer.basic.BasicURLNormalizer.normalize(BasicURLNormalizer.java:179)
>         at org.apache.nutch.net.URLNormalizers.normalize(URLNormalizers.java:318)
>         at org.apache.nutch.crawl.Injector$InjectMapper.filterNormalize(Injector.java:157)
>         at org.apache.nutch.crawl.SitemapInjector$SitemapInjectMapper$SitemapProcessor.getContent(SitemapInjector.java:670)
>         at org.apache.nutch.crawl.SitemapInjector$SitemapInjectMapper$SitemapProcessor.process(SitemapInjector.java:439)
>         at org.apache.nutch.crawl.SitemapInjector$SitemapInjectMapper.map(SitemapInjector.java:325)
>         at org.apache.nutch.crawl.SitemapInjector$SitemapInjectMapper.map(SitemapInjector.java:145)
>         at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:146)
>         at org.apache.hadoop.mapreduce.lib.map.MultithreadedMapper$MapRunner.run(MultithreadedMapper.java:274)
> {noformat}
> Only a single mapper thread is active:
> {noformat}
> "Thread-23" #39 prio=5 os_prio=0 cpu=5830.17ms elapsed=647.09s tid=0x7f3eb5b42800 nid=0x8e661 in Object.wait() [0x7f3e842ec000]
>    java.lang.Thread.State: RUNNABLE
>         at jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(java.base@11.0.15/Native Method)
>         at jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(java.base@11.0.15/NativeConstructorAccessorImpl.java:62)
>         at jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(java.base@11.0.15/DelegatingConstructorAccessorImpl.java:45)
>         at java.lang.reflect.Constructor.newInstance(java.base@11.0.15/Constructor.java:490)
>         at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:170)
>         - locked <0x0006a1bc0630> (a java.lang.String)
>         at org.apache.nutch.plugin.PluginRepository.createURLStreamHandler(PluginRepository.java:597)
>         at org.apache.nutch.plugin.URLStreamHandlerFactory.createURLStreamHandler(URLStreamHandlerFactory.java:95)
>         at java.net.URL.getURLStreamHandler(java.base@11.0.15/URL.java:1432)
>         at java.net.URL.<init>(java.base@11.0.15/URL.java:651)
>         at java.net.URL.<init>(java.base@11.0.15/URL.java:541)
>         at java.net.URL.<init>(java.base@11.0.15/URL.java:488)
>         at org.apache.nutch.net.urlnormalizer.basic.BasicURLNormalizer.normalize(BasicURLNormalizer.java:179)
>         at org.apache.nutch.net.URLNormalizers.normalize(URLNormalizers.java:318)
>         at org.apache.nutch.crawl.Injector$InjectMapper.filterNormalize(Injector.java:157)
>         at org.apache.nutch.crawl
[jira] [Commented] (NUTCH-2936) Early registration of URL stream handlers provided by plugins may fail Hadoop jobs running in distributed mode if protocol-okhttp is used
[ https://issues.apache.org/jira/browse/NUTCH-2936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17554563#comment-17554563 ]

ASF GitHub Bot commented on NUTCH-2936:
---------------------------------------

sebastian-nagel opened a new pull request, #733:
URL: https://github.com/apache/nutch/pull/733

Fixes to address the failing of Nutch jobs in (pseudo-)distributed mode.

Implements:
- caching of URLStreamHandlers per protocol to avoid that handlers are created anew
- enforced routing of standard protocols (http, https, file, jar) to handlers implemented by the JVM
  - utilizes the URLStreamHandler cache
  - fixes NUTCH-2936 (verified in pseudo-distributed mode)

Also:
- code improvements in classes of the package "org.apache.nutch.plugin"
  - use `Class<?>` and remove suppressions of warnings
  - javadocs: fix typos
  - remove superfluous white space
  - autoformat using code style template
- protocol-okhttp: initialize SSLContext not in a static code block (SSLContext is used to ignore SSL/TLS certificate verification): this was the initial fix for the needless warning in parsechecker even in local mode. This seems also fixed by the enforced routing of standard URLStreamHandlers, but I left it in, to avoid that all the testing in pseudo-distributed mode needs to be run again.

Next week I will test the fixes in real distributed mode.
> Early registration of URL stream handlers provided by plugins may fail Hadoop
> jobs running in distributed mode if protocol-okhttp is used
> -----------------------------------------------------------------------------
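[Editorial note: the caching plus standard-protocol routing listed in the PR description above can be sketched as follows. This is a simplified, hypothetical illustration of the strategy, not the code in PR #733; all class and method names are invented, and the plugin lookup is replaced by a stub.]

```java
import java.net.URL;
import java.net.URLConnection;
import java.net.URLStreamHandler;
import java.net.URLStreamHandlerFactory;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Sketch of the PR #733 strategy: route standard protocols to the JVM's
 * built-in handlers (by returning null) and cache plugin-created handlers
 * so each protocol's handler is built at most once.
 */
public class CachingHandlerFactory implements URLStreamHandlerFactory {

  // Protocols the JVM must handle itself; returning null delegates to it.
  private static final Set<String> JVM_PROTOCOLS = Set.of("http", "https", "file", "jar");

  // One handler instance per protocol, created lazily.
  private final Map<String, URLStreamHandler> cache = new ConcurrentHashMap<>();

  @Override
  public URLStreamHandler createURLStreamHandler(String protocol) {
    if (JVM_PROTOCOLS.contains(protocol)) {
      return null; // null => java.net.URL falls back to the built-in handler
    }
    // computeIfAbsent runs the (potentially slow) plugin lookup at most
    // once per protocol, even under many mapper threads (cf. NUTCH-2949).
    return cache.computeIfAbsent(protocol, this::createPluginHandler);
  }

  // Stand-in for the plugin-repository lookup (hypothetical).
  private URLStreamHandler createPluginHandler(String protocol) {
    return new URLStreamHandler() {
      @Override
      protected URLConnection openConnection(URL u) {
        return null; // no real connection in this sketch
      }
    };
  }
}
```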
[GitHub] [nutch] sebastian-nagel opened a new pull request, #733: NUTCH-2936 / NUTCH-2949 URLStreamHandler may fail jobs in distributed mode
sebastian-nagel opened a new pull request, #733:
URL: https://github.com/apache/nutch/pull/733
[jira] [Commented] (NUTCH-2936) Early registration of URL stream handlers provided by plugins may fail Hadoop jobs running in distributed mode if protocol-okhttp is used
[ https://issues.apache.org/jira/browse/NUTCH-2936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17554531#comment-17554531 ]

Sebastian Nagel commented on NUTCH-2936:
----------------------------------------

After debugging this: the call by the Hadoop MR Job to initialize the KeyGenerator leads twice recursively into Nutch's URLStreamHandlerFactory - first for the "http" protocol to create a [NULL_URL (http://null.oracle.com/)|https://github.com/openjdk/jdk/blob/0530f4e517be5d5b3ff10be8a0764e564f068c06/src/java.base/share/classes/javax/crypto/JceSecurity.java.template#L246], second for the "jar" protocol to load the protocol-okhttp.jar. See the debug log output and the stack trace (some lines stripped):

{noformat}
2022-06-14 16:56:59,176 DEBUG plugin.URLStreamHandlerFactory: Registered URLStreamHandlerFactory with the JVM.
2022-06-14 16:56:59,994 DEBUG plugin.URLStreamHandlerFactory: Creating URLStreamHandler for protocol: http
2022-06-14 16:56:59,994 DEBUG plugin.PluginRepository: Creating URLStreamHandler for protocol: http
2022-06-14 16:56:59,995 DEBUG plugin.PluginRepository: Suitable protocolName attribute located: http
2022-06-14 16:57:00,007 DEBUG plugin.URLStreamHandlerFactory: Creating URLStreamHandler for protocol: jar
2022-06-14 16:57:00,007 DEBUG plugin.PluginRepository: Creating URLStreamHandler for protocol: jar
2022-06-14 16:57:00,008 DEBUG plugin.PluginRepository: No suitable protocol extensions registered for protocol: jar
2022-06-14 16:57:00,320 DEBUG plugin.PluginRepository: Located extension instance class: org.apache.nutch.protocol.okhttp.OkHttp
2022-06-14 16:57:00,320 DEBUG plugin.PluginRepository: Suitable protocol extension found that did not declare a handler
{noformat}

{noformat}
        at org.apache.nutch.plugin.PluginRepository.createURLStreamHandler(PluginRepository.java:583)
        at org.apache.nutch.plugin.URLStreamHandlerFactory.createURLStreamHandler(URLStreamHandlerFactory.java:95)
        at java.base/java.net.URL.getURLStreamHandler(URL.java:1432)
        at java.base/java.net.URL.<init>(URL.java:451)
        at java.base/jdk.internal.loader.URLClassPath$JarLoader.<init>(URLClassPath.java:720)
        at java.base/jdk.internal.loader.URLClassPath$3.run(URLClassPath.java:494)
        at java.base/jdk.internal.loader.URLClassPath$3.run(URLClassPath.java:477)
        at java.base/java.security.AccessController.doPrivileged(Native Method)
        at java.base/jdk.internal.loader.URLClassPath.getLoader(URLClassPath.java:476)
        at java.base/jdk.internal.loader.URLClassPath.getLoader(URLClassPath.java:445)
        at java.base/jdk.internal.loader.URLClassPath.getResource(URLClassPath.java:314)
        at java.base/java.net.URLClassLoader$1.run(URLClassLoader.java:455)
        at java.base/java.net.URLClassLoader$1.run(URLClassLoader.java:452)
        at java.base/java.security.AccessController.doPrivileged(Native Method)
        at java.base/java.net.URLClassLoader.findClass(URLClassLoader.java:451)
        at org.apache.nutch.plugin.PluginClassLoader.loadClass(PluginClassLoader.java:71)
        at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:522)
        at org.apache.nutch.plugin.PluginRepository.getCachedClass(PluginRepository.java:349)
        at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:165)
        at org.apache.nutch.plugin.PluginRepository.createURLStreamHandler(PluginRepository.java:601)
        at org.apache.nutch.plugin.URLStreamHandlerFactory.createURLStreamHandler(URLStreamHandlerFactory.java:95)
        at java.base/java.net.URL.getURLStreamHandler(URL.java:1432)
        at java.base/java.net.URL.<init>(URL.java:651)
        at java.base/java.net.URL.<init>(URL.java:541)
        at java.base/java.net.URL.<init>(URL.java:488)
        at java.base/javax.crypto.JceSecurity.<clinit>(JceSecurity.java:239)
        at java.base/javax.crypto.KeyGenerator.nextSpi(KeyGenerator.java:363)
        at java.base/javax.crypto.KeyGenerator.<init>(KeyGenerator.java:176)
        at java.base/javax.crypto.KeyGenerator.getInstance(KeyGenerator.java:244)
        at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:179)
        ...
        at org.apache.hadoop.mapreduce.Job.submit(Job.java:1568)
        at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1589)
        at org.apache.nutch.crawl.Injector.inject(Injector.java:436)
        at org.apache.nutch.crawl.Injector.run(Injector.java:569)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:81)
        at org.apache.nutch.crawl.Injector.main(Injector.java:533)
        ...
{noformat}

I do not understand why the initialization of the KeyGenerator fails only in this combination (distributed mode and using protocol-okhttp). Nevertheless, we should never delegate standard URLStreamHandlers implemented by the JVM to handlers requiring the Nutch plugin system with its complexity and the plugin-specific clas
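[Editorial note: the fix direction stated in the comment above - never delegate standard protocols to the plugin system - can be illustrated with a small, hypothetical demo. A registered factory that returns null for standard protocols makes java.net.URL fall back to the JVM's built-in handlers, so internal URL construction such as JceSecurity's never re-enters plugin code. Names are illustrative, not Nutch's actual classes.]

```java
import java.net.URL;
import java.net.URLStreamHandler;
import java.net.URLStreamHandlerFactory;
import java.util.Set;

public class JvmFallbackDemo {
  // Standard protocols we refuse to handle, so the JVM's own
  // built-in handlers are used and no plugin code is entered.
  static final Set<String> STANDARD = Set.of("http", "https", "file", "jar");

  public static void main(String[] args) throws Exception {
    URLStreamHandlerFactory factory = protocol ->
        STANDARD.contains(protocol) ? null              // null => JVM built-in handler
                                    : lookupPluginHandler(protocol);
    URL.setURLStreamHandlerFactory(factory); // may be set only once per JVM

    // Constructing a standard-protocol URL now never re-enters plugin
    // code, avoiding the recursive factory calls described above.
    URL url = new URL("https://example.org/index.html"); // no network I/O at construction
    System.out.println(url.getProtocol());
  }

  // Hypothetical stand-in for the Nutch plugin lookup.
  static URLStreamHandler lookupPluginHandler(String protocol) {
    return null; // no plugin handlers in this sketch
  }
}
```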