[jira] [Commented] (NUTCH-3043) Generator: count URLs rejected by URL filters
[ https://issues.apache.org/jira/browse/NUTCH-3043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17840892#comment-17840892 ]

ASF GitHub Bot commented on NUTCH-3043:
---

lewismc commented on code in PR #814:
URL: https://github.com/apache/nutch/pull/814#discussion_r1579883313

## src/java/org/apache/nutch/crawl/Generator.java: ##

@@ -253,10 +256,7 @@ public void map(Text key, CrawlDatum value, Context context)
       try {
         sort = scfilters.generatorSortValue(key, crawlDatum, sort);
       } catch (ScoringFilterException sfe) {
-        if (LOG.isWarnEnabled()) {
-          LOG.warn(
-              "Couldn't filter generatorSortValue for " + key + ": " + sfe);
-        }
+        LOG.warn("Couldn't filter generatorSortValue for " + key + ": " + sfe);

Review Comment:
Please use parameterized logging.
```
LOG.warn("Couldn't filter generatorSortValue for {}: {}", key, sfe);
```

> Generator: count URLs rejected by URL filters
> ---------------------------------------------
>
> Key: NUTCH-3043
> URL: https://issues.apache.org/jira/browse/NUTCH-3043
> Project: Nutch
> Issue Type: Improvement
> Components: generator
> Affects Versions: 1.20
> Reporter: Sebastian Nagel
> Assignee: Sebastian Nagel
> Priority: Minor
> Fix For: 1.21
>
> Generator already counts URLs rejected by the (re)fetch scheduler, by fetch
> interval or status. It should also count the number of URLs rejected by URL
> filters.
> See also [Generator metrics|https://cwiki.apache.org/confluence/display/NUTCH/Metrics#Metrics-Generator].

--
This message was sent by Atlassian Jira (v8.20.10#820010)
Re: [PR] NUTCH-3043 Generator: count URLs rejected by URL filters [nutch]
lewismc commented on code in PR #814:
URL: https://github.com/apache/nutch/pull/814#discussion_r1579883313

## src/java/org/apache/nutch/crawl/Generator.java: ##

@@ -253,10 +256,7 @@ public void map(Text key, CrawlDatum value, Context context)
       try {
         sort = scfilters.generatorSortValue(key, crawlDatum, sort);
       } catch (ScoringFilterException sfe) {
-        if (LOG.isWarnEnabled()) {
-          LOG.warn(
-              "Couldn't filter generatorSortValue for " + key + ": " + sfe);
-        }
+        LOG.warn("Couldn't filter generatorSortValue for " + key + ": " + sfe);

Review Comment:
Please use parameterized logging.
```
LOG.warn("Couldn't filter generatorSortValue for {}: {}", key, sfe);
```

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
To unsubscribe, e-mail: dev-unsubscr...@nutch.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Commented] (NUTCH-3044) Generator: NPE when extracting the host part of a URL fails
[ https://issues.apache.org/jira/browse/NUTCH-3044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17840854#comment-17840854 ]

ASF GitHub Bot commented on NUTCH-3044:
---

sebastian-nagel opened a new pull request, #815:
URL: https://github.com/apache/nutch/pull/815

(no comment)

> Generator: NPE when extracting the host part of a URL fails
> -----------------------------------------------------------
>
> Key: NUTCH-3044
> URL: https://issues.apache.org/jira/browse/NUTCH-3044
> Project: Nutch
> Issue Type: Bug
> Components: generator
> Affects Versions: 1.20
> Reporter: Sebastian Nagel
> Priority: Minor
> Fix For: 1.21
>
> When extracting the host part of a URL fails, the Generator job fails because
> of an NPE in the SelectorReducer. This issue is reproducible if the CrawlDb
> contains a malformed URL, for example, a URL with an unsupported scheme
> (smb://).
> {noformat}
> Caused by: java.lang.NullPointerException
>     at org.apache.nutch.crawl.Generator$SelectorReducer.reduce(Generator.java:439)
>     at org.apache.nutch.crawl.Generator$SelectorReducer.reduce(Generator.java:300)
> {noformat}
[jira] [Created] (NUTCH-3044) Generator: NPE when extracting the host part of a URL fails
Sebastian Nagel created NUTCH-3044:
--

Summary: Generator: NPE when extracting the host part of a URL fails
Key: NUTCH-3044
URL: https://issues.apache.org/jira/browse/NUTCH-3044
Project: Nutch
Issue Type: Bug
Components: generator
Affects Versions: 1.20
Reporter: Sebastian Nagel
Fix For: 1.21

When extracting the host part of a URL fails, the Generator job fails because of an NPE in the SelectorReducer. This issue is reproducible if the CrawlDb contains a malformed URL, for example, a URL with an unsupported scheme (smb://).
{noformat}
Caused by: java.lang.NullPointerException
    at org.apache.nutch.crawl.Generator$SelectorReducer.reduce(Generator.java:439)
    at org.apache.nutch.crawl.Generator$SelectorReducer.reduce(Generator.java:300)
{noformat}
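The failure mode reported in NUTCH-3044 can be illustrated with a minimal sketch. It is not the Nutch code and not the fix proposed in the pull request: `HostExtractor` and `hostOf` are hypothetical names, and the sketch assumes a stock JDK with no protocol handler registered for `smb`. It only shows that host extraction can yield no value for such a URL, so the caller has to check for null before using the result.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;

/** Hypothetical helper (not part of Nutch): returns null when no host can be extracted. */
final class HostExtractor {

  static String hostOf(String url) {
    try {
      // toURL() throws MalformedURLException when the JDK has no handler
      // for the scheme, which is the case for smb:// on a stock JDK.
      String host = new URI(url).toURL().getHost();
      return host.isEmpty() ? null : host;
    } catch (URISyntaxException | MalformedURLException | IllegalArgumentException e) {
      // Syntax error, unsupported scheme, or a non-absolute URI.
      return null;
    }
  }

  public static void main(String[] args) {
    String[] urls = {"https://nutch.apache.org/", "smb://fileserver/share"};
    for (String url : urls) {
      String host = hostOf(url);
      if (host == null) {
        // Guard: without this check, a later host.length() or map lookup
        // on the missing host would raise a NullPointerException.
        System.out.println("skipping, no host: " + url);
        continue;
      }
      System.out.println(host + " <- " + url);
    }
  }
}
```

Whether the actual change skips such URLs, counts them, or groups them under a fallback key is not stated in the issue; the sketch simply skips them.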
[jira] [Commented] (NUTCH-3043) Generator: count URLs rejected by URL filters
[ https://issues.apache.org/jira/browse/NUTCH-3043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17840845#comment-17840845 ]

ASF GitHub Bot commented on NUTCH-3043:
---

sebastian-nagel opened a new pull request, #814:
URL: https://github.com/apache/nutch/pull/814

- add counters URL_FILTERS_REJECTED and URL_FILTER_EXCEPTION
- simplify logging statement
- remove unnecessary cast

> Generator: count URLs rejected by URL filters
> ---------------------------------------------
>
> Key: NUTCH-3043
> URL: https://issues.apache.org/jira/browse/NUTCH-3043
> Project: Nutch
> Issue Type: Improvement
> Components: generator
> Affects Versions: 1.20
> Reporter: Sebastian Nagel
> Assignee: Sebastian Nagel
> Priority: Minor
> Fix For: 1.21
>
> Generator already counts URLs rejected by the (re)fetch scheduler, by fetch
> interval or status. It should also count the number of URLs rejected by URL
> filters.
> See also [Generator metrics|https://cwiki.apache.org/confluence/display/NUTCH/Metrics#Metrics-Generator].
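The two counters named in the pull request can be pictured with a small, self-contained sketch. It does not use the Hadoop counter API and is not the code in PR #814: the `FilterCounting` class, the `EnumMap` standing in for job counters, and the use of `RuntimeException` in place of the filter's own exception type are all assumptions made for illustration. The only detail taken from Nutch is the convention that a URL filter returns null for a rejected URL.

```java
import java.util.EnumMap;
import java.util.Map;
import java.util.function.UnaryOperator;

/** Hypothetical stand-in for the Generator's filter step; not the Nutch implementation. */
final class FilterCounting {

  /** Counter names as listed in the pull request description. */
  enum Counter { URL_FILTERS_REJECTED, URL_FILTER_EXCEPTION }

  // Plain map standing in for job counters (assumption for this sketch).
  private final Map<Counter, Long> counters = new EnumMap<>(Counter.class);
  private final UnaryOperator<String> filter;

  FilterCounting(UnaryOperator<String> filter) {
    this.filter = filter;
  }

  /** Returns true if the URL passes the filter; otherwise counts the reason and returns false. */
  boolean accept(String url) {
    try {
      if (filter.apply(url) == null) {
        // A null result means the filter rejected the URL.
        counters.merge(Counter.URL_FILTERS_REJECTED, 1L, Long::sum);
        return false;
      }
      return true;
    } catch (RuntimeException e) {
      // Assumption: a failing filter causes the URL to be skipped and counted separately.
      counters.merge(Counter.URL_FILTER_EXCEPTION, 1L, Long::sum);
      return false;
    }
  }

  long count(Counter c) {
    return counters.getOrDefault(c, 0L);
  }

  public static void main(String[] args) {
    FilterCounting fc = new FilterCounting(u -> u.startsWith("http") ? u : null);
    fc.accept("https://nutch.apache.org/");
    fc.accept("smb://fileserver/share");
    System.out.println("rejected: " + fc.count(Counter.URL_FILTERS_REJECTED));
  }
}
```

Keeping rejections and exceptions in separate counters makes it possible to tell a filter that is doing its job from one that is failing.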
[jira] [Created] (NUTCH-3043) Generator: count URLs rejected by URL filters
Sebastian Nagel created NUTCH-3043:
--

Summary: Generator: count URLs rejected by URL filters
Key: NUTCH-3043
URL: https://issues.apache.org/jira/browse/NUTCH-3043
Project: Nutch
Issue Type: Improvement
Components: generator
Affects Versions: 1.20
Reporter: Sebastian Nagel
Assignee: Sebastian Nagel
Fix For: 1.21

Generator already counts URLs rejected by the (re)fetch scheduler, by fetch interval or status. It should also count the number of URLs rejected by URL filters.
See also [Generator metrics|https://cwiki.apache.org/confluence/display/NUTCH/Metrics#Metrics-Generator].
Re: [DISCUSS] Consolidating Nutch Continuous Integration
A better reference for the GitHub Actions can be found at https://github.com/apache/nutch/actions

lewismc

On 2024/04/25 14:40:35 lewis john mcgibbney wrote:
> Hi dev@,
>
> We currently maintain a combination of Jenkins [0] and GitHub Actions [1]
> for CI.
>
> For the longest time, we relied solely on Jenkins. This was particularly
> useful when committers were pulling build artifacts from Jenkins nightly
> and relied on SVN trunk being stable. The Jenkins job used to run nightly
> but no longer does. It is not clear exactly when nightly SNAPSHOT builds
> were turned off.
>
> In 2020 we accepted a pull request [2] which established GitHub Actions,
> and since then we have gradually added small but important updates to the
> GitHub Actions workflow [3].
>
> I can elaborate on the details of what each CI workflow does (it is not
> overly complex), but before I do that, is there any preference for one
> (Jenkins vs. GitHub Actions) over the other?
>
> Thanks
>
> lewismc
>
> [0] https://ci-builds.apache.org/job/Nutch/
> [1] https://github.com/apache/nutch/blob/master/.github/workflows/master-build.yml
> [2] https://github.com/apache/nutch/commit/e33aaa14739c7c02f4121ac1d8d0e7860f329e06
> [3] https://github.com/apache/nutch/commits/master/.github/workflows/master-build.yml
>
> --
> http://home.apache.org/~lewismc/
> http://people.apache.org/keys/committer/lewismc
[DISCUSS] Consolidating Nutch Continuous Integration
Hi dev@,

We currently maintain a combination of Jenkins [0] and GitHub Actions [1] for CI.

For the longest time, we relied solely on Jenkins. This was particularly useful when committers were pulling build artifacts from Jenkins nightly and relied on SVN trunk being stable. The Jenkins job used to run nightly but no longer does. It is not clear exactly when nightly SNAPSHOT builds were turned off.

In 2020 we accepted a pull request [2] which established GitHub Actions, and since then we have gradually added small but important updates to the GitHub Actions workflow [3].

I can elaborate on the details of what each CI workflow does (it is not overly complex), but before I do that, is there any preference for one (Jenkins vs. GitHub Actions) over the other?

Thanks

lewismc

[0] https://ci-builds.apache.org/job/Nutch/
[1] https://github.com/apache/nutch/blob/master/.github/workflows/master-build.yml
[2] https://github.com/apache/nutch/commit/e33aaa14739c7c02f4121ac1d8d0e7860f329e06
[3] https://github.com/apache/nutch/commits/master/.github/workflows/master-build.yml

--
http://home.apache.org/~lewismc/
http://people.apache.org/keys/committer/lewismc