[jira] [Commented] (NUTCH-2812) Methods returning array may expose internal representation

2023-11-08 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17784299#comment-17784299
 ] 

ASF GitHub Bot commented on NUTCH-2812:
---

GabeHaegele opened a new pull request, #798:
URL: https://github.com/apache/nutch/pull/798

   Thanks for your contribution to [Apache Nutch](https://nutch.apache.org/)! 
Your help is appreciated!
   
   Before opening the pull request, please verify that
   * there is an open issue on the [Nutch issue 
tracker](https://issues.apache.org/jira/projects/NUTCH) which describes the 
problem or the improvement. We cannot accept pull requests without an issue 
because the change wouldn't be listed in the release notes.
   * the issue ID (`NUTCH-`)
     - is referenced in the title of the pull request
     - and placed in front of your commit messages surrounded by square 
brackets (`[NUTCH-] Issue or pull request title`)
   * commits are squashed into a single one (or few commits for larger changes)
   * Java source code follows [Nutch Eclipse Code Formatting 
rules](https://github.com/apache/nutch/blob/master/eclipse-codeformat.xml)
   * Nutch is successfully built and unit tests pass by running `ant clean 
runtime test`
   * there should be no conflicts when merging the pull request branch into the 
*recent* master branch. If there are conflicts, please try to rebase the pull 
request branch on top of a freshly pulled master branch.
   * if new dependencies are added,
     - are these dependencies licensed in a way that is compatible for 
inclusion under [ASF 
2.0](https://www.apache.org/legal/resolved.html#category-a)?
     - are `LICENSE-binary` and `NOTICE-binary` updated accordingly?
   
   We will be able to integrate your pull request faster if these conditions 
are met. If you have any questions about how to fix your problem or about using 
Nutch in general, please sign up for the [Nutch mailing 
list](https://nutch.apache.org/mailing_lists.html). Thanks!




> Methods returning array may expose internal representation
> --
>
> Key: NUTCH-2812
> URL: https://issues.apache.org/jira/browse/NUTCH-2812
> Project: Nutch
> Issue Type: Sub-task
> Affects Versions: 1.17
> Reporter: Lewis John McGibbney
> Assignee: Lewis John McGibbney
> Priority: Major
> Fix For: 1.20
>
>
> Returning a reference to a mutable object value stored in one of the object's 
> fields exposes the internal representation of the object. If instances are 
> accessed by untrusted code, and unchecked changes to the mutable object would 
> compromise security or other important properties, you will need to do 
> something different. Returning a new copy of the object is a better approach 
> in many situations.
> For example, org.apache.nutch.fetcher.FetchNode.getOutlinks() may expose 
> its internal representation by returning FetchNode.outlinks.
> There are 11 such occurrences of this bug in the codebase.
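
The pattern described in the issue can be illustrated with a minimal sketch. The class `LinkHolder` and its `String[]` field are hypothetical stand-ins chosen for illustration; they are not the Nutch `FetchNode` source, and the actual fix in PR #798 may look different. The idea is simply to copy the array when it enters and when it leaves the object.

```java
import java.util.Arrays;

/** Hypothetical stand-in for a class that stores an array; not Nutch code. */
final class LinkHolder {
  private final String[] outlinks;

  LinkHolder(String[] outlinks) {
    // Copy on the way in, so the caller cannot keep a handle on our state.
    this.outlinks = outlinks == null ? null
        : Arrays.copyOf(outlinks, outlinks.length);
  }

  /** Returns a copy, so callers cannot modify the internal array. */
  String[] getOutlinks() {
    return outlinks == null ? null : outlinks.clone();
  }

  public static void main(String[] args) {
    LinkHolder holder = new LinkHolder(new String[] {"https://example.org/a"});
    String[] copy = holder.getOutlinks();
    copy[0] = "tampered"; // only the copy changes
    System.out.println(holder.getOutlinks()[0]); // prints https://example.org/a
  }
}
```

Note that `clone()` on an array is a shallow copy: it protects the array itself, but if the elements were mutable objects they would still be shared with the caller.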



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[PR] fix for NUTCH-2812 contributed by GabeHaegele [nutch]

2023-11-08 Thread via GitHub


GabeHaegele opened a new pull request, #798:
URL: https://github.com/apache/nutch/pull/798



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@nutch.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Commented] (NUTCH-3025) urlfilter-fast to filter based on the length of the URL

2023-11-08 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17784204#comment-17784204
 ] 

Hudson commented on NUTCH-3025:
---

SUCCESS: Integrated in Jenkins build Nutch » Nutch-trunk #142 (See 
[https://ci-builds.apache.org/job/Nutch/job/Nutch-trunk/142/])
[NUTCH-3025] urlfilter-fast to filter based on the length of the URL (julien: 
[https://github.com/apache/nutch/commit/d8e66ce87328ce4bb14b0da9516faf8a9f63f818])
* (edit) 
src/plugin/urlfilter-fast/src/java/org/apache/nutch/urlfilter/fast/FastURLFilter.java
* (edit) src/plugin/urlfilter-fast/README.md
* (edit) 
src/plugin/urlfilter-fast/src/test/org/apache/nutch/urlfilter/fast/TestFastURLFilter.java


> urlfilter-fast to filter based on the length of the URL
> ---
>
> Key: NUTCH-3025
> URL: https://issues.apache.org/jira/browse/NUTCH-3025
> Project: Nutch
> Issue Type: Improvement
> Components: plugin, urlfilter
> Affects Versions: 1.19
> Reporter: Julien Nioche
> Priority: Major
> Fix For: 1.20
>
>
> There currently is no filter implementation to remove URLs based on their 
> length or the length of their path / query.
> Doing so with the regex filter would be inefficient; instead, we could 
> implement it in _urlfilter-fast_.
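
As a rough illustration of the idea in the issue, the sketch below rejects URLs by overall length and by the length of their path or query. The class name `LengthUrlFilter`, its constructor parameters, and the limits are all hypothetical; this is not the code merged into `FastURLFilter`, and the real rule syntax is described in the plugin's README. Returning `null` for a rejected URL follows the convention of Nutch URL filters.

```java
import java.net.URI;
import java.net.URISyntaxException;

/** Hypothetical length-based URL filter; names and limits are illustrative. */
final class LengthUrlFilter {
  private final int maxUrlLength;
  private final int maxPathLength;
  private final int maxQueryLength;

  /** A negative limit disables the corresponding check. */
  LengthUrlFilter(int maxUrlLength, int maxPathLength, int maxQueryLength) {
    this.maxUrlLength = maxUrlLength;
    this.maxPathLength = maxPathLength;
    this.maxQueryLength = maxQueryLength;
  }

  /** Returns the URL if it passes all checks, or null if it is rejected. */
  String filter(String url) {
    if (url == null) {
      return null;
    }
    if (maxUrlLength >= 0 && url.length() > maxUrlLength) {
      return null;
    }
    URI uri;
    try {
      uri = new URI(url);
    } catch (URISyntaxException e) {
      return null; // this sketch simply drops URLs it cannot parse
    }
    String path = uri.getRawPath();
    if (maxPathLength >= 0 && path != null && path.length() > maxPathLength) {
      return null;
    }
    String query = uri.getRawQuery();
    if (maxQueryLength >= 0 && query != null
        && query.length() > maxQueryLength) {
      return null;
    }
    return url;
  }
}
```

Plain integer comparisons like these avoid running a regular expression over every URL, which is the inefficiency the issue description refers to.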





[jira] [Resolved] (NUTCH-3025) urlfilter-fast to filter based on the length of the URL

2023-11-08 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-3025.

Resolution: Implemented



[jira] [Updated] (NUTCH-3025) urlfilter-fast to filter based on the length of the URL

2023-11-08 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-3025:
---
Component/s: plugin
 urlfilter



[jira] [Commented] (NUTCH-3025) urlfilter-fast to filter based on the length of the URL

2023-11-08 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17784186#comment-17784186
 ] 

ASF GitHub Bot commented on NUTCH-3025:
---

sebastian-nagel merged PR #796:
URL: https://github.com/apache/nutch/pull/796






[jira] [Commented] (NUTCH-3025) urlfilter-fast to filter based on the length of the URL

2023-11-08 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17784187#comment-17784187
 ] 

ASF GitHub Bot commented on NUTCH-3025:
---

sebastian-nagel commented on PR #796:
URL: https://github.com/apache/nutch/pull/796#issuecomment-1802531264

   Thanks, @jnioche!






Re: [PR] [NUTCH-3025] urlfilter-fast to filter based on the length of the URL [nutch]

2023-11-08 Thread via GitHub


sebastian-nagel commented on PR #796:
URL: https://github.com/apache/nutch/pull/796#issuecomment-1802531264

   Thanks, @jnioche!





Re: [PR] [NUTCH-3025] urlfilter-fast to filter based on the length of the URL [nutch]

2023-11-08 Thread via GitHub


sebastian-nagel merged PR #796:
URL: https://github.com/apache/nutch/pull/796





[jira] [Commented] (NUTCH-3025) urlfilter-fast to filter based on the length of the URL

2023-11-08 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17784057#comment-17784057
 ] 

ASF GitHub Bot commented on NUTCH-3025:
---

jnioche commented on PR #796:
URL: https://github.com/apache/nutch/pull/796#issuecomment-1801938355

   @sebastian-nagel merged the changes from master and made a few improvements






Re: [PR] [NUTCH-3025] urlfilter-fast to filter based on the length of the URL [nutch]

2023-11-08 Thread via GitHub


jnioche commented on PR #796:
URL: https://github.com/apache/nutch/pull/796#issuecomment-1801938355

   @sebastian-nagel merged the changes from master and made a few improvements





[jira] [Commented] (NUTCH-3017) Allow fast-urlfilter to load from HDFS/S3 and support gzipped input

2023-11-08 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17784047#comment-17784047
 ] 

Hudson commented on NUTCH-3017:
---

SUCCESS: Integrated in Jenkins build Nutch » Nutch-trunk #141 (See 
[https://ci-builds.apache.org/job/Nutch/job/Nutch-trunk/141/])
[NUTCH-3017] Allow fast-urlfilter to load from HDFS/S3 and support gzipped 
input (julien: 
[https://github.com/apache/nutch/commit/d1025fd634e79f2f384131ca2776f346aa446902])
* (edit) 
src/plugin/urlfilter-fast/src/java/org/apache/nutch/urlfilter/fast/FastURLFilter.java
[NUTCH-3017] Allow fast-urlfilter to load from HDFS/S3 and support gzipped 
input (snagel: 
[https://github.com/apache/nutch/commit/ac383fc5125b6c114a23ef996558ead57e873970])
* (edit) 
src/plugin/urlfilter-fast/src/java/org/apache/nutch/urlfilter/fast/FastURLFilter.java
* (edit) conf/nutch-default.xml


> Allow fast-urlfilter to load from HDFS/S3 and support gzipped input
> ---
>
> Key: NUTCH-3017
> URL: https://issues.apache.org/jira/browse/NUTCH-3017
> Project: Nutch
> Issue Type: Improvement
> Components: plugin, urlfilter
> Affects Versions: 1.19
> Reporter: Julien Nioche
> Priority: Minor
> Fix For: 1.20
>
>
> This provides an easier way to refresh the resources since no rebuild of the 
> jar will be needed. The path can point to either HDFS or S3. Additionally, 
> .gz files should be handled automatically.
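
The "handle .gz automatically" part of the issue can be sketched with the JDK alone. The example below reads rule lines from a local file and transparently decompresses it when the name ends in `.gz`. The class `RuleFileLoader` is hypothetical, skipping blank lines and `#` comments is an assumption made for the example, and it reads from the local file system only; it does not reproduce how the plugin itself loads rules from HDFS or S3.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;

/** Hypothetical loader for plain or gzipped rule files; not Nutch code. */
final class RuleFileLoader {

  /** Opens the file, adding gzip decompression when the name ends in .gz. */
  static BufferedReader open(Path file) throws IOException {
    InputStream in = Files.newInputStream(file);
    try {
      if (file.getFileName().toString().endsWith(".gz")) {
        in = new GZIPInputStream(in);
      }
    } catch (IOException e) {
      in.close(); // still the raw stream here: the wrapping failed
      throw e;
    }
    return new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
  }

  /** Reads non-empty lines, skipping '#' comments (an assumption). */
  static List<String> readRules(Path file) throws IOException {
    List<String> rules = new ArrayList<>();
    try (BufferedReader reader = open(file)) {
      String line;
      while ((line = reader.readLine()) != null) {
        line = line.trim();
        if (!line.isEmpty() && !line.startsWith("#")) {
          rules.add(line);
        }
      }
    }
    return rules;
  }
}
```

Because the rules are read from a file path at run time rather than from a resource bundled in the jar, they can be refreshed without a rebuild, which is the motivation given in the issue.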





[jira] [Commented] (NUTCH-3017) Allow fast-urlfilter to load from HDFS/S3 and support gzipped input

2023-11-08 Thread Sebastian Nagel (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17784030#comment-17784030
 ] 

Sebastian Nagel commented on NUTCH-3017:


Thanks, [~jnioche]



[jira] [Resolved] (NUTCH-3017) Allow fast-urlfilter to load from HDFS/S3 and support gzipped input

2023-11-08 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-3017.

Resolution: Implemented



[jira] [Commented] (NUTCH-3017) Allow fast-urlfilter to load from HDFS/S3 and support gzipped input

2023-11-08 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17784029#comment-17784029
 ] 

ASF GitHub Bot commented on NUTCH-3017:
---

sebastian-nagel commented on PR #793:
URL: https://github.com/apache/nutch/pull/793#issuecomment-1801814549

   Thanks, @jnioche!
   
   Merged into master, adding the lines to make use of Hadoop-provided 
compression codecs.
   
   Successfully tested in local and pseudo-distributed mode with various codecs 
(gzip / .gz, bzip2, ZStandard / .zst).
   
   One final note: if the fast-urlfilter is not found, the Nutch job (local 
mode) or the tasks (distributed mode) fail with an exception. I didn't change 
this behavior.






Re: [PR] [NUTCH-3017] Allow fast-urlfilter to load from HDFS/S3 [nutch]

2023-11-08 Thread via GitHub


sebastian-nagel commented on PR #793:
URL: https://github.com/apache/nutch/pull/793#issuecomment-1801814549

   Thanks, @jnioche!
   
   Merged into master, adding the lines to make use of Hadoop-provided 
compression codecs.
   
   Successfully tested in local and pseudo-distributed mode with various codecs 
(gzip / .gz, bzip2, ZStandard / .zst).
   
   One final note: if the fast-urlfilter is not found, the Nutch job (local 
mode) or the tasks (distributed mode) fail with an exception. I didn't change 
this behavior.





[jira] [Commented] (NUTCH-3017) Allow fast-urlfilter to load from HDFS/S3 and support gzipped input

2023-11-08 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17784024#comment-17784024
 ] 

ASF GitHub Bot commented on NUTCH-3017:
---

sebastian-nagel closed pull request #793: [NUTCH-3017] Allow fast-urlfilter to 
load from HDFS/S3 
URL: https://github.com/apache/nutch/pull/793






Re: [PR] [NUTCH-3017] Allow fast-urlfilter to load from HDFS/S3 [nutch]

2023-11-08 Thread via GitHub


sebastian-nagel closed pull request #793: [NUTCH-3017] Allow fast-urlfilter to 
load from HDFS/S3 
URL: https://github.com/apache/nutch/pull/793

