[jira] [Commented] (NUTCH-3043) Generator: count URLs rejected by URL filters

2024-04-25 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17840892#comment-17840892
 ] 

ASF GitHub Bot commented on NUTCH-3043:
---

lewismc commented on code in PR #814:
URL: https://github.com/apache/nutch/pull/814#discussion_r1579883313


##
src/java/org/apache/nutch/crawl/Generator.java:
##
@@ -253,10 +256,7 @@ public void map(Text key, CrawlDatum value, Context context)
       try {
         sort = scfilters.generatorSortValue(key, crawlDatum, sort);
       } catch (ScoringFilterException sfe) {
-        if (LOG.isWarnEnabled()) {
-          LOG.warn(
-              "Couldn't filter generatorSortValue for " + key + ": " + sfe);
-        }
+        LOG.warn("Couldn't filter generatorSortValue for " + key + ": " + sfe);

Review Comment:
   Please use parameterized logging.
   ```
   LOG.warn("Couldn't filter generatorSortValue for {}: {}", key, sfe);
   ```
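
For context, a minimal sketch of what the suggestion buys: with `{}` placeholders the message is only assembled when the level is enabled, which is why the explicit `LOG.isWarnEnabled()` guard removed in this hunk is not needed. `TinyLogger` and `format` below are hypothetical stand-ins written for illustration; they are not the SLF4J API and not Nutch code, and they only mimic simple `{}` substitution.

```java
public class ParameterizedLoggingSketch {

    /** Hypothetical stand-in for a logger; mimics "{}" substitution only. */
    static final class TinyLogger {
        private final boolean warnEnabled;
        private final StringBuilder sink = new StringBuilder();

        TinyLogger(boolean warnEnabled) {
            this.warnEnabled = warnEnabled;
        }

        /** Formats and records the message only when WARN is enabled. */
        void warn(String pattern, Object... args) {
            if (!warnEnabled) {
                return; // arguments are never converted to strings
            }
            sink.append(format(pattern, args)).append('\n');
        }

        String output() {
            return sink.toString();
        }
    }

    /** Replaces each "{}" in order with the matching argument. */
    static String format(String pattern, Object... args) {
        StringBuilder sb = new StringBuilder();
        int from = 0;
        for (Object arg : args) {
            int at = pattern.indexOf("{}", from);
            if (at < 0) {
                break; // more arguments than placeholders: ignore the rest
            }
            sb.append(pattern, from, at).append(arg);
            from = at + 2;
        }
        return sb.append(pattern.substring(from)).toString();
    }

    public static void main(String[] args) {
        TinyLogger log = new TinyLogger(true);
        Exception sfe = new IllegalStateException("score missing");
        log.warn("Couldn't filter generatorSortValue for {}: {}",
            "http://example.org/", sfe);
        System.out.print(log.output());
    }
}
```

In SLF4J itself, a Throwable passed as the last argument without a placeholder of its own is logged with its stack trace, so the choice between `{}: {}` and a single `{}` changes what ends up in the log.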





> Generator: count URLs rejected by URL filters
> -
>
> Key: NUTCH-3043
> URL: https://issues.apache.org/jira/browse/NUTCH-3043
> Project: Nutch
>  Issue Type: Improvement
>  Components: generator
>Affects Versions: 1.20
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Minor
> Fix For: 1.21
>
>
> Generator already counts URLs rejected by the (re)fetch scheduler, by fetch 
> interval or status. It should also count the number of URLs rejected by URL 
> filters.
> See also [Generator 
> metrics|https://cwiki.apache.org/confluence/display/NUTCH/Metrics#Metrics-Generator].



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [PR] NUTCH-3043 Generator: count URLs rejected by URL filters [nutch]

2024-04-25 Thread via GitHub


lewismc commented on code in PR #814:
URL: https://github.com/apache/nutch/pull/814#discussion_r1579883313


##
src/java/org/apache/nutch/crawl/Generator.java:
##
@@ -253,10 +256,7 @@ public void map(Text key, CrawlDatum value, Context context)
       try {
         sort = scfilters.generatorSortValue(key, crawlDatum, sort);
       } catch (ScoringFilterException sfe) {
-        if (LOG.isWarnEnabled()) {
-          LOG.warn(
-              "Couldn't filter generatorSortValue for " + key + ": " + sfe);
-        }
+        LOG.warn("Couldn't filter generatorSortValue for " + key + ": " + sfe);

Review Comment:
   Please use parameterized logging.
   ```
   LOG.warn("Couldn't filter generatorSortValue for {}: {}", key, sfe);
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@nutch.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Commented] (NUTCH-3044) Generator: NPE when extracting the host part of a URL fails

2024-04-25 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17840854#comment-17840854
 ] 

ASF GitHub Bot commented on NUTCH-3044:
---

sebastian-nagel opened a new pull request, #815:
URL: https://github.com/apache/nutch/pull/815

   (no comment)




> Generator: NPE when extracting the host part of a URL fails
> ---
>
> Key: NUTCH-3044
> URL: https://issues.apache.org/jira/browse/NUTCH-3044
> Project: Nutch
>  Issue Type: Bug
>  Components: generator
>Affects Versions: 1.20
>Reporter: Sebastian Nagel
>Priority: Minor
> Fix For: 1.21
>
>
> When extracting the host part of a URL fails, the Generator job fails because 
> of an NPE in the SelectorReducer. This issue is reproducible if the CrawlDb 
> contains a malformed URL, for example, a URL with an unsupported scheme 
> (smb://).
> {noformat}
> Caused by: java.lang.NullPointerException
>   at 
> org.apache.nutch.crawl.Generator$SelectorReducer.reduce(Generator.java:439)
>   at 
> org.apache.nutch.crawl.Generator$SelectorReducer.reduce(Generator.java:300)
> {noformat}





[jira] [Created] (NUTCH-3044) Generator: NPE when extracting the host part of a URL fails

2024-04-25 Thread Sebastian Nagel (Jira)
Sebastian Nagel created NUTCH-3044:
--

 Summary: Generator: NPE when extracting the host part of a URL 
fails
 Key: NUTCH-3044
 URL: https://issues.apache.org/jira/browse/NUTCH-3044
 Project: Nutch
  Issue Type: Bug
  Components: generator
Affects Versions: 1.20
Reporter: Sebastian Nagel
 Fix For: 1.21


When extracting the host part of a URL fails, the Generator job fails because 
of an NPE in the SelectorReducer. This issue is reproducible if the CrawlDb 
contains a malformed URL, for example, a URL with an unsupported scheme 
(smb://).

{noformat}
Caused by: java.lang.NullPointerException
  at org.apache.nutch.crawl.Generator$SelectorReducer.reduce(Generator.java:439)
  at org.apache.nutch.crawl.Generator$SelectorReducer.reduce(Generator.java:300)
{noformat}
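
To illustrate the failure mode, here is a minimal sketch, not the Generator code: on a stock JDK, turning an `smb://` URI into a `java.net.URL` throws `MalformedURLException` because no protocol handler is registered for that scheme. Code that falls back to `null` for the host therefore has to check for it before use. `hostOrNull` and `countByHost` are hypothetical helpers introduced only for this example.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class HostExtractionSketch {

    /** Hypothetical helper: returns the host, or null if it cannot be extracted. */
    static String hostOrNull(String url) {
        try {
            String host = new URI(url).toURL().getHost();
            return host.isEmpty() ? null : host;
        } catch (URISyntaxException | MalformedURLException
                 | IllegalArgumentException e) {
            return null; // e.g. unknown protocol "smb" on a stock JDK
        }
    }

    /** Counts URLs per host and skips entries without a usable host. */
    static Map<String, Integer> countByHost(List<String> urls) {
        Map<String, Integer> counts = new HashMap<>();
        for (String url : urls) {
            String host = hostOrNull(url);
            if (host == null) {
                continue; // guard: host.toLowerCase() below would throw an NPE
            }
            counts.merge(host.toLowerCase(), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(countByHost(List.of(
            "https://nutch.apache.org/", "smb://fileserver/share")));
    }
}
```

Skipping the entry is only one possible policy; whether such URLs should be dropped, counted, or grouped under a placeholder host is a design decision for the actual fix.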





[jira] [Commented] (NUTCH-3043) Generator: count URLs rejected by URL filters

2024-04-25 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17840845#comment-17840845
 ] 

ASF GitHub Bot commented on NUTCH-3043:
---

sebastian-nagel opened a new pull request, #814:
URL: https://github.com/apache/nutch/pull/814

   - add counters URL_FILTERS_REJECTED and URL_FILTER_EXCEPTION
   - simplify logging statement
   - remove unnecessary cast
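
A rough sketch of what such counting could look like, assuming a filter that returns `null` to reject a URL and may throw on failure. The counter names are taken from the PR description above; the `Counters` class and `filterAndCount` are hypothetical stand-ins for the Hadoop job counters and the Generator's mapper, neither of which is shown in this thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.UnaryOperator;

public class FilterCounterSketch {

    /** Hypothetical stand-in for job counters: counter name -> count. */
    static final class Counters {
        private final Map<String, Long> values = new TreeMap<>();

        void increment(String name) {
            values.merge(name, 1L, Long::sum);
        }

        long get(String name) {
            return values.getOrDefault(name, 0L);
        }
    }

    /** Applies the filter to each URL and records why URLs were dropped. */
    static List<String> filterAndCount(List<String> urls,
            UnaryOperator<String> filter, Counters counters) {
        List<String> accepted = new ArrayList<>();
        for (String url : urls) {
            String filtered;
            try {
                filtered = filter.apply(url);
            } catch (RuntimeException e) {
                // the filter itself failed: count it separately from rejections
                counters.increment("URL_FILTER_EXCEPTION");
                continue;
            }
            if (filtered == null) {
                counters.increment("URL_FILTERS_REJECTED"); // assumed: null means rejected
            } else {
                accepted.add(filtered);
            }
        }
        return accepted;
    }

    public static void main(String[] args) {
        Counters counters = new Counters();
        List<String> accepted = filterAndCount(
            List.of("https://nutch.apache.org/", "smb://fileserver/share"),
            url -> url.startsWith("https://") ? url : null,
            counters);
        System.out.println(accepted.size() + " accepted, "
            + counters.get("URL_FILTERS_REJECTED") + " rejected");
    }
}
```

Keeping rejections and filter exceptions in separate counters makes it possible to tell a restrictive filter configuration apart from a misbehaving filter when reading the job metrics.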









[jira] [Created] (NUTCH-3043) Generator: count URLs rejected by URL filters

2024-04-25 Thread Sebastian Nagel (Jira)
Sebastian Nagel created NUTCH-3043:
--

 Summary: Generator: count URLs rejected by URL filters
 Key: NUTCH-3043
 URL: https://issues.apache.org/jira/browse/NUTCH-3043
 Project: Nutch
  Issue Type: Improvement
  Components: generator
Affects Versions: 1.20
Reporter: Sebastian Nagel
Assignee: Sebastian Nagel
 Fix For: 1.21


Generator already counts URLs rejected by the (re)fetch scheduler, by fetch 
interval or status. It should also count the number of URLs rejected by URL 
filters.

See also [Generator 
metrics|https://cwiki.apache.org/confluence/display/NUTCH/Metrics#Metrics-Generator].





Re: [DISCUSS] Consolidating Nutch Continuous Integration

2024-04-25 Thread Lewis John McGibbney
A better reference for the GitHub Actions can be found at 
https://github.com/apache/nutch/actions

lewismc

On 2024/04/25 14:40:35 lewis john mcgibbney wrote:
> Hi dev@,
> 
> We currently maintain a combination of Jenkins [0] and GitHub Actions [1]
> for CI.
> 
> For the longest time, we relied solely on Jenkins. This was really useful
> particularly when committers were pulling build artifacts from Jenkins
> nightly and relied on SVN trunk being stable. The Jenkins job used to be
> run nightly but no longer is. It is not clear exactly when nightly SNAPSHOT
> builds were turned off.
> 
> In 2020 we accepted a pull request [2] which established GitHub Actions and
> since then have gradually added small but important updates to the GitHub
> Actions workflow [3].
> 
> I can elaborate on the details of what each CI workflow does (it is not
> overly complex) but before I do that, is there any preference on choosing
> one (Jenkins Vs GitHub Actions) over the other?
> 
> Thanks
> 
> lewismc
> 
> [0] https://ci-builds.apache.org/job/Nutch/
> [1]
> https://github.com/apache/nutch/blob/master/.github/workflows/master-build.yml
> [2]
> https://github.com/apache/nutch/commit/e33aaa14739c7c02f4121ac1d8d0e7860f329e06
> [3]
> https://github.com/apache/nutch/commits/master/.github/workflows/master-build.yml
> 
> -- 
> http://home.apache.org/~lewismc/
> http://people.apache.org/keys/committer/lewismc
> 


[DISCUSS] Consolidating Nutch Continuous Integration

2024-04-25 Thread lewis john mcgibbney
Hi dev@,

We currently maintain a combination of Jenkins [0] and GitHub Actions [1]
for CI.

For the longest time, we relied solely on Jenkins. This was really useful
particularly when committers were pulling build artifacts from Jenkins
nightly and relied on SVN trunk being stable. The Jenkins job used to be
run nightly but no longer is. It is not clear exactly when nightly SNAPSHOT
builds were turned off.

In 2020 we accepted a pull request [2] which established GitHub Actions and
since then have gradually added small but important updates to the GitHub
Actions workflow [3].

I can elaborate on the details of what each CI workflow does (it is not
overly complex) but before I do that, is there any preference on choosing
one (Jenkins Vs GitHub Actions) over the other?

Thanks

lewismc

[0] https://ci-builds.apache.org/job/Nutch/
[1]
https://github.com/apache/nutch/blob/master/.github/workflows/master-build.yml
[2]
https://github.com/apache/nutch/commit/e33aaa14739c7c02f4121ac1d8d0e7860f329e06
[3]
https://github.com/apache/nutch/commits/master/.github/workflows/master-build.yml

-- 
http://home.apache.org/~lewismc/
http://people.apache.org/keys/committer/lewismc