[jira] [Commented] (NUTCH-2812) Methods returning array may expose internal representation
[ https://issues.apache.org/jira/browse/NUTCH-2812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17784299#comment-17784299 ] ASF GitHub Bot commented on NUTCH-2812: --- GabeHaegele opened a new pull request, #798: URL: https://github.com/apache/nutch/pull/798 Thanks for your contribution to [Apache Nutch](https://nutch.apache.org/)! Your help is appreciated! Before opening the pull request, please verify that * there is an open issue on the [Nutch issue tracker](https://issues.apache.org/jira/projects/NUTCH) which describes the problem or the improvement. We cannot accept pull requests without an issue because the change wouldn't be listed in the release notes. * the issue ID (`NUTCH-`) - is referenced in the title of the pull request - and placed in front of your commit messages surrounded by square brackets (`[NUTCH-] Issue or pull request title`) * commits are squashed into a single one (or few commits for larger changes) * Java source code follows [Nutch Eclipse Code Formatting rules](https://github.com/apache/nutch/blob/master/eclipse-codeformat.xml) * Nutch is successfully built and unit tests pass by running `ant clean runtime test` * there should be no conflicts when merging the pull request branch into the *recent* master branch. If there are conflicts, please try to rebase the pull request branch on top of a freshly pulled master branch. * if new dependencies are added, - are these dependencies licensed in a way that is compatible for inclusion under [ASF 2.0](https://www.apache.org/legal/resolved.html#category-a)? - are `LICENSE-binary` and `NOTICE-binary` updated accordingly? We will be able to faster integrate your pull request if these conditions are met. If you have any questions how to fix your problem or about using Nutch in general, please sign up for the [Nutch mailing list](https://nutch.apache.org/mailing_lists.html). Thanks! 
> Methods returning array may expose internal representation > -- > > Key: NUTCH-2812 > URL: https://issues.apache.org/jira/browse/NUTCH-2812 > Project: Nutch > Issue Type: Sub-task > Affects Versions: 1.17 > Reporter: Lewis John McGibbney > Assignee: Lewis John McGibbney > Priority: Major > Fix For: 1.20 > > > Returning a reference to a mutable object value stored in one of the object's > fields exposes the internal representation of the object. If instances are > accessed by untrusted code, and unchecked changes to the mutable object would > compromise security or other important properties, you will need to do > something different. Returning a new copy of the object is a better approach in > many situations. > For example, org.apache.nutch.fetcher.FetchNode.getOutlinks() may expose > internal representation by returning FetchNode.outlinks. > There are 11 such occurrences of this bug in the codebase. -- This message was sent by Atlassian Jira (v8.20.10#820010)
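The problem described in the issue can be illustrated with a minimal sketch. The class `NodeSketch` and its `String[]` field are hypothetical stand-ins chosen for illustration; this is not the Nutch `FetchNode` class and not the change proposed in PR #798.

```java
import java.util.Arrays;

// Hypothetical stand-in for a class that stores an array in a field.
// Not the actual org.apache.nutch.fetcher.FetchNode.
class NodeSketch {
  private final String[] outlinks;

  NodeSketch(String[] outlinks) {
    // Copy on the way in, so later changes by the caller do not reach this object.
    this.outlinks = outlinks == null
        ? new String[0]
        : Arrays.copyOf(outlinks, outlinks.length);
  }

  // Exposes the internal representation: the caller receives the field itself.
  String[] getOutlinksUnsafe() {
    return outlinks;
  }

  // Returns a new copy: writes to the returned array do not affect this object.
  String[] getOutlinks() {
    return Arrays.copyOf(outlinks, outlinks.length);
  }
}
```

`Arrays.copyOf` makes a shallow copy. That is sufficient here because `String` is immutable; for arrays of mutable elements the elements themselves would still be shared.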
[PR] fix for NUTCH-2812 contributed by GabeHaegele [nutch]
GabeHaegele opened a new pull request, #798: URL: https://github.com/apache/nutch/pull/798 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
To unsubscribe, e-mail: dev-unsubscr...@nutch.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Commented] (NUTCH-3025) urlfilter-fast to filter based on the length of the URL
[ https://issues.apache.org/jira/browse/NUTCH-3025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17784204#comment-17784204 ] Hudson commented on NUTCH-3025: --- SUCCESS: Integrated in Jenkins build Nutch » Nutch-trunk #142 (See [https://ci-builds.apache.org/job/Nutch/job/Nutch-trunk/142/]) [NUTCH-3025] urlfilter-fast to filter based on the length of the URL (julien: [https://github.com/apache/nutch/commit/d8e66ce87328ce4bb14b0da9516faf8a9f63f818]) * (edit) src/plugin/urlfilter-fast/src/java/org/apache/nutch/urlfilter/fast/FastURLFilter.java * (edit) src/plugin/urlfilter-fast/README.md * (edit) src/plugin/urlfilter-fast/src/test/org/apache/nutch/urlfilter/fast/TestFastURLFilter.java > urlfilter-fast to filter based on the length of the URL > --- > > Key: NUTCH-3025 > URL: https://issues.apache.org/jira/browse/NUTCH-3025 > Project: Nutch > Issue Type: Improvement > Components: plugin, urlfilter > Affects Versions: 1.19 > Reporter: Julien Nioche > Priority: Major > Fix For: 1.20 > > > There is currently no filter implementation to remove URLs based on their > length or the length of their path / query. > Doing so with the regex filter would be inefficient; instead, we could > implement it in _urlfilter-fast_.
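The idea in the issue description, rejecting URLs by total length or by the length of the path or query without using a regular expression, can be sketched as follows. This is only an illustration under stated assumptions: the class name `LengthUrlFilterSketch`, its three limit parameters, and the use of a negative value to disable a limit are all hypothetical, and the code is not taken from `FastURLFilter`. Returning `null` for a rejected URL is an assumed convention for this sketch.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical length-based URL filter; not the urlfilter-fast implementation.
class LengthUrlFilterSketch {
  private final int maxLength;      // limit on the whole URL, negative disables
  private final int maxPathLength;  // limit on the path, negative disables
  private final int maxQueryLength; // limit on the query, negative disables

  LengthUrlFilterSketch(int maxLength, int maxPathLength, int maxQueryLength) {
    this.maxLength = maxLength;
    this.maxPathLength = maxPathLength;
    this.maxQueryLength = maxQueryLength;
  }

  // Returns the URL unchanged if it passes, or null if it is rejected.
  String filter(String url) {
    if (exceeds(url.length(), maxLength)) {
      return null;
    }
    if (maxPathLength < 0 && maxQueryLength < 0) {
      return url; // nothing else to check, so skip parsing
    }
    try {
      URI uri = new URI(url);
      String path = uri.getRawPath();
      String query = uri.getRawQuery();
      if (path != null && exceeds(path.length(), maxPathLength)) {
        return null;
      }
      if (query != null && exceeds(query.length(), maxQueryLength)) {
        return null;
      }
    } catch (URISyntaxException e) {
      return null; // sketch only: unparsable URLs are rejected
    }
    return url;
  }

  private static boolean exceeds(int length, int limit) {
    return limit >= 0 && length > limit;
  }
}
```

The checks are plain integer comparisons on string lengths, which is the reason such a filter can be cheaper than matching every URL against a regular expression.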
[jira] [Resolved] (NUTCH-3025) urlfilter-fast to filter based on the length of the URL
[ https://issues.apache.org/jira/browse/NUTCH-3025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-3025. Resolution: Implemented
[jira] [Updated] (NUTCH-3025) urlfilter-fast to filter based on the length of the URL
[ https://issues.apache.org/jira/browse/NUTCH-3025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-3025: --- Component/s: plugin urlfilter
[jira] [Commented] (NUTCH-3025) urlfilter-fast to filter based on the length of the URL
[ https://issues.apache.org/jira/browse/NUTCH-3025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17784186#comment-17784186 ] ASF GitHub Bot commented on NUTCH-3025: --- sebastian-nagel merged PR #796: URL: https://github.com/apache/nutch/pull/796
[jira] [Commented] (NUTCH-3025) urlfilter-fast to filter based on the length of the URL
[ https://issues.apache.org/jira/browse/NUTCH-3025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17784187#comment-17784187 ] ASF GitHub Bot commented on NUTCH-3025: --- sebastian-nagel commented on PR #796: URL: https://github.com/apache/nutch/pull/796#issuecomment-1802531264 Thanks, @jnioche!
Re: [PR] [NUTCH-3025] urlfilter-fast to filter based on the length of the URL [nutch]
sebastian-nagel commented on PR #796: URL: https://github.com/apache/nutch/pull/796#issuecomment-1802531264 Thanks, @jnioche!
Re: [PR] [NUTCH-3025] urlfilter-fast to filter based on the length of the URL [nutch]
sebastian-nagel merged PR #796: URL: https://github.com/apache/nutch/pull/796
[jira] [Commented] (NUTCH-3025) urlfilter-fast to filter based on the length of the URL
[ https://issues.apache.org/jira/browse/NUTCH-3025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17784057#comment-17784057 ] ASF GitHub Bot commented on NUTCH-3025: --- jnioche commented on PR #796: URL: https://github.com/apache/nutch/pull/796#issuecomment-1801938355 @sebastian-nagel merged the changes from master and made a few improvements
Re: [PR] [NUTCH-3025] urlfilter-fast to filter based on the length of the URL [nutch]
jnioche commented on PR #796: URL: https://github.com/apache/nutch/pull/796#issuecomment-1801938355 @sebastian-nagel merged the changes from master and made a few improvements
[jira] [Commented] (NUTCH-3017) Allow fast-urlfilter to load from HDFS/S3 and support gzipped input
[ https://issues.apache.org/jira/browse/NUTCH-3017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17784047#comment-17784047 ] Hudson commented on NUTCH-3017: --- SUCCESS: Integrated in Jenkins build Nutch » Nutch-trunk #141 (See [https://ci-builds.apache.org/job/Nutch/job/Nutch-trunk/141/]) [NUTCH-3017] Allow fast-urlfilter to load from HDFS/S3 and support gzipped input (julien: [https://github.com/apache/nutch/commit/d1025fd634e79f2f384131ca2776f346aa446902]) * (edit) src/plugin/urlfilter-fast/src/java/org/apache/nutch/urlfilter/fast/FastURLFilter.java [NUTCH-3017] Allow fast-urlfilter to load from HDFS/S3 and support gzipped input (snagel: [https://github.com/apache/nutch/commit/ac383fc5125b6c114a23ef996558ead57e873970]) * (edit) src/plugin/urlfilter-fast/src/java/org/apache/nutch/urlfilter/fast/FastURLFilter.java * (edit) conf/nutch-default.xml > Allow fast-urlfilter to load from HDFS/S3 and support gzipped input > --- > > Key: NUTCH-3017 > URL: https://issues.apache.org/jira/browse/NUTCH-3017 > Project: Nutch > Issue Type: Improvement > Components: plugin, urlfilter > Affects Versions: 1.19 > Reporter: Julien Nioche > Priority: Minor > Fix For: 1.20 > > > This provides an easier way to refresh the resources since no rebuild of the > jar will be needed. The path can point to either HDFS or S3. Additionally, > .gz files should be handled automatically.
[jira] [Commented] (NUTCH-3017) Allow fast-urlfilter to load from HDFS/S3 and support gzipped input
[ https://issues.apache.org/jira/browse/NUTCH-3017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17784030#comment-17784030 ] Sebastian Nagel commented on NUTCH-3017: Thanks, [~jnioche]
[jira] [Resolved] (NUTCH-3017) Allow fast-urlfilter to load from HDFS/S3 and support gzipped input
[ https://issues.apache.org/jira/browse/NUTCH-3017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-3017. Resolution: Implemented
[jira] [Commented] (NUTCH-3017) Allow fast-urlfilter to load from HDFS/S3 and support gzipped input
[ https://issues.apache.org/jira/browse/NUTCH-3017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17784029#comment-17784029 ] ASF GitHub Bot commented on NUTCH-3017: --- sebastian-nagel commented on PR #793: URL: https://github.com/apache/nutch/pull/793#issuecomment-1801814549 Thanks, @jnioche! Merged into master, adding the lines to make use of Hadoop-provided compression codecs. Successfully tested in local and pseudo-distributed mode with various codecs (gzip / .gz, bzip2, ZStandard / .zst). One final note: if the fast-urlfilter is not found, the Nutch job (local mode) or the tasks (distributed mode) fail with an exception. I didn't change this behavior.
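Choosing a decompressor from the file name can be sketched with the JDK alone. The comment above says the merged change relies on Hadoop-provided compression codecs; the sketch below deliberately swaps in `java.util.zip.GZIPInputStream` and local files so that it runs without Hadoop, and it handles only the `.gz` suffix. The class `RulesLoaderSketch` and the method `readLines` are hypothetical names, not part of Nutch.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;

// Hypothetical loader: reads a text resource, decompressing it when the
// file name ends in ".gz". Local files only; no HDFS/S3, no Hadoop codecs.
class RulesLoaderSketch {
  static List<String> readLines(Path file) throws IOException {
    boolean gzipped = file.getFileName().toString().endsWith(".gz");
    try (InputStream raw = Files.newInputStream(file);
         InputStream in = gzipped ? new GZIPInputStream(raw) : raw;
         BufferedReader reader = new BufferedReader(
             new InputStreamReader(in, StandardCharsets.UTF_8))) {
      List<String> lines = new ArrayList<>();
      String line;
      while ((line = reader.readLine()) != null) {
        lines.add(line);
      }
      return lines;
    }
  }
}
```

In this sketch a missing file surfaces as an `IOException` from `Files.newInputStream`, so the caller decides whether that is fatal.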
Re: [PR] [NUTCH-3017] Allow fast-urlfilter to load from HDFS/S3 [nutch]
sebastian-nagel commented on PR #793: URL: https://github.com/apache/nutch/pull/793#issuecomment-1801814549 Thanks, @jnioche! Merged into master, adding the lines to make use of Hadoop-provided compression codecs. Successfully tested in local and pseudo-distributed mode with various codecs (gzip / .gz, bzip2, ZStandard / .zst). One final note: if the fast-urlfilter is not found, the Nutch job (local mode) or the tasks (distributed mode) fail with an exception. I didn't change this behavior.
[jira] [Commented] (NUTCH-3017) Allow fast-urlfilter to load from HDFS/S3 and support gzipped input
[ https://issues.apache.org/jira/browse/NUTCH-3017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17784024#comment-17784024 ] ASF GitHub Bot commented on NUTCH-3017: --- sebastian-nagel closed pull request #793: [NUTCH-3017] Allow fast-urlfilter to load from HDFS/S3 URL: https://github.com/apache/nutch/pull/793
Re: [PR] [NUTCH-3017] Allow fast-urlfilter to load from HDFS/S3 [nutch]
sebastian-nagel closed pull request #793: [NUTCH-3017] Allow fast-urlfilter to load from HDFS/S3 URL: https://github.com/apache/nutch/pull/793