[jira] [Created] (NUTCH-3046) Use compact strings

2024-04-28 Thread Lewis John McGibbney (Jira)
Lewis John McGibbney created NUTCH-3046:
---

 Summary: Use compact strings
 Key: NUTCH-3046
 URL: https://issues.apache.org/jira/browse/NUTCH-3046
 Project: Nutch
  Issue Type: Sub-task
Reporter: Lewis John McGibbney


Follow the guidance at 
[https://www.baeldung.com/java-migrate-8-to-17#1-compact-string]

It looks like there are [9 instances where we use 
char[]|https://github.com/search?q=repo%3Aapache%2Fnutch%20char%5B%5D&type=code].
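As a rough illustration of why this matters: with compact strings (enabled by default since Java 9), a String whose characters all fit in Latin-1 is backed by one byte per character, while a char[] always takes two. The sketch below estimates the payload difference. CompactStringEstimate and its methods are hypothetical helpers, not Nutch code, and the numbers ignore object headers and alignment.

```java
// Sketch only: CompactStringEstimate is a hypothetical helper, not part of Nutch.
final class CompactStringEstimate {

  /** True if every char fits in one byte (Latin-1), the case the JVM can store compactly. */
  static boolean isLatin1(CharSequence text) {
    for (int i = 0; i < text.length(); i++) {
      if (text.charAt(i) > 0xFF) {
        return false;
      }
    }
    return true;
  }

  /** Approximate payload bytes of a String's backing array (object headers ignored). */
  static long stringPayloadBytes(String text) {
    return isLatin1(text) ? text.length() : 2L * text.length();
  }

  /** Payload bytes of the same text held as a char[]: always two bytes per char. */
  static long charArrayPayloadBytes(String text) {
    return 2L * text.toCharArray().length;
  }
}
```

For a typical ASCII URL the String payload is about half that of the equivalent char[]. For text outside Latin-1 the two are the same size.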



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (NUTCH-3045) Upgrade from Java 11 to 17

2024-04-28 Thread Lewis John McGibbney (Jira)
Lewis John McGibbney created NUTCH-3045:
---

 Summary: Upgrade from Java 11 to 17
 Key: NUTCH-3045
 URL: https://issues.apache.org/jira/browse/NUTCH-3045
 Project: Nutch
  Issue Type: Task
  Components: build, ci/cd
Reporter: Lewis John McGibbney
 Fix For: 1.21


This parent issue will track and organize work pertaining to upgrading Nutch to 
JDK 17.

Premier support for Oracle JDK 11 ended 7 months ago (30 Sep 2023).





[ANNOUNCE] Apache Nutch 1.20 Release

2024-04-28 Thread lewis john mcgibbney
The Apache Nutch Project has released Apache Nutch 1.20, available at
https://nutch.apache.org/download/

Please verify signatures using the KEYS file
https://raw.githubusercontent.com/apache/nutch/master/KEYS when downloading
the release.

This release includes more than 60 bug fixes and improvements. The full
list of changes can be seen in the Jira release report:
https://s.apache.org/ovjf3

Thanks to everyone who contributed to this release!

-- 
http://home.apache.org/~lewismc/
http://people.apache.org/keys/committer/lewismc


[jira] [Commented] (NUTCH-3044) Generator: NPE when extracting the host part of a URL fails

2024-04-28 Thread ASF GitHub Bot (Jira)


[ https://issues.apache.org/jira/browse/NUTCH-3044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17841682#comment-17841682 ]

ASF GitHub Bot commented on NUTCH-3044:
---

lewismc commented on PR #815:
URL: https://github.com/apache/nutch/pull/815#issuecomment-2081564107

   Excellent @sebastian-nagel +1




> Generator: NPE when extracting the host part of a URL fails
> ---
>
> Key: NUTCH-3044
> URL: https://issues.apache.org/jira/browse/NUTCH-3044
> Project: Nutch
> Issue Type: Bug
> Components: generator
> Affects Versions: 1.20
> Reporter: Sebastian Nagel
> Priority: Minor
> Fix For: 1.21
>
>
> When extracting the host part of a URL fails, the Generator job fails because 
> of an NPE in the SelectorReducer. This issue is reproducible if the CrawlDb 
> contains a malformed URL, for example, a URL with an unsupported scheme 
> (smb://).
> {noformat}
> Caused by: java.lang.NullPointerException
>   at org.apache.nutch.crawl.Generator$SelectorReducer.reduce(Generator.java:439)
>   at org.apache.nutch.crawl.Generator$SelectorReducer.reduce(Generator.java:300)
> {noformat}
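
The failure mode described above can be illustrated with a minimal sketch. It is not the change made in PR #815. HostGuard, hostOrNull and hostsSkippingFailures are hypothetical names, and the sketch assumes that host extraction signals failure by returning null, which the caller must check before using the value.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Sketch only: HostGuard is a hypothetical helper, not the Nutch Generator code.
final class HostGuard {

  /** Returns the host of the URL, or null if the URL cannot be parsed. */
  static String hostOrNull(String url) {
    try {
      // toURL() throws MalformedURLException for schemes without a
      // registered protocol handler, such as smb:// on a stock JVM.
      return new URI(url).toURL().getHost();
    } catch (URISyntaxException | MalformedURLException | IllegalArgumentException e) {
      return null;
    }
  }

  /** Collects lower-cased hosts, skipping URLs whose host cannot be extracted. */
  static List<String> hostsSkippingFailures(List<String> urls) {
    List<String> hosts = new ArrayList<>();
    for (String url : urls) {
      String host = hostOrNull(url);
      if (host == null) {
        continue; // without this guard, host.toLowerCase(...) would throw an NPE
      }
      hosts.add(host.toLowerCase(Locale.ROOT));
    }
    return hosts;
  }
}
```

Skipping the record is only one option. A real job might instead log the URL or increment a counter so that malformed entries in the CrawlDb remain visible.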





Re: [PR] NUTCH-3044 Generator: NPE when extracting the host part of a URL fails [nutch]

2024-04-28 Thread via GitHub


lewismc commented on PR #815:
URL: https://github.com/apache/nutch/pull/815#issuecomment-2081564107

   Excellent @sebastian-nagel +1


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@nutch.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Commented] (NUTCH-3043) Generator: count URLs rejected by URL filters

2024-04-28 Thread ASF GitHub Bot (Jira)


[ https://issues.apache.org/jira/browse/NUTCH-3043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17841681#comment-17841681 ]

ASF GitHub Bot commented on NUTCH-3043:
---

lewismc commented on PR #814:
URL: https://github.com/apache/nutch/pull/814#issuecomment-2081563229

   Excellent @sebastian-nagel 




> Generator: count URLs rejected by URL filters
> -
>
> Key: NUTCH-3043
> URL: https://issues.apache.org/jira/browse/NUTCH-3043
> Project: Nutch
> Issue Type: Improvement
> Components: generator
> Affects Versions: 1.20
> Reporter: Sebastian Nagel
> Assignee: Sebastian Nagel
> Priority: Minor
> Fix For: 1.21
>
>
> Generator already counts URLs rejected by the (re)fetch scheduler, by fetch 
> interval or status. It should also count the number of URLs rejected by URL 
> filters.
> See also [Generator 
> metrics|https://cwiki.apache.org/confluence/display/NUTCH/Metrics#Metrics-Generator].
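
The idea of counting filter rejections can be sketched as follows. This is not the code from PR #814. FilterCounting, filterAndCount and the counter name are hypothetical, a plain Map stands in for the job counters, and the sketch assumes a filter that returns null for a rejected URL.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.UnaryOperator;

// Sketch only: FilterCounting is a hypothetical helper, not the Nutch Generator code.
final class FilterCounting {

  /** Hypothetical counter name, chosen for illustration. */
  static final String REJECTED = "URL_FILTERS_REJECTED";

  /**
   * Applies the filter to every URL, keeps the accepted ones and counts
   * the rejected ones (filter result null) in the given counter map.
   */
  static List<String> filterAndCount(List<String> urls,
      UnaryOperator<String> filter, Map<String, Long> counters) {
    List<String> accepted = new ArrayList<>();
    for (String url : urls) {
      String filtered = filter.apply(url);
      if (filtered == null) {
        counters.merge(REJECTED, 1L, Long::sum); // one more rejected URL
      } else {
        accepted.add(filtered);
      }
    }
    return accepted;
  }
}
```

With a filter that accepts only URLs starting with "http", a list containing one https:// URL and two others yields one accepted URL and a counter value of 2.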





Re: [PR] NUTCH-3043 Generator: count URLs rejected by URL filters [nutch]

2024-04-28 Thread via GitHub


lewismc commented on PR #814:
URL: https://github.com/apache/nutch/pull/814#issuecomment-2081563229

   Excellent @sebastian-nagel 





Re: [DISCUSS] Consolidating Nutch Continuous Integration

2024-04-28 Thread Sebastian Nagel

Hi Lewis,

> The Jenkins job used to be run nightly but
> no longer is.

It pulls nightly from git:
  https://ci-builds.apache.org/job/Nutch/job/Nutch-trunk/scmPollLog/
but a build is only run if there are new commits. The latest one:
  https://lists.apache.org/thread/ywtlmdmckhd21c6y9c77z01q17h42jww

Of course, we could add nightly builds on GitHub, in addition to the
builds run when pull requests are opened.

> is there any preference on choosing one (Jenkins
> Vs GitHub Actions) over the other?

From my side: no. It would do no harm to have both.

Best,
Sebastian

On 4/25/24 16:40, lewis john mcgibbney wrote:

Hi dev@,

We currently maintain a combination of Jenkins [0] and GitHub Actions [1] for 
CI.

For the longest time, we relied solely on Jenkins. This was particularly useful 
when committers were pulling build artifacts from Jenkins nightly 
and relied on SVN trunk being stable. The Jenkins job used to be run nightly but 
no longer is. It is not clear exactly when nightly SNAPSHOT builds were turned off.


In 2020 we accepted a pull request [2] which established GitHub Actions and 
since then have gradually added small but important updates to the GitHub 
Actions workflow [3].


I can elaborate on the details of what each CI workflow does (it is not overly 
complex) but before I do that, is there any preference on choosing one (Jenkins 
Vs GitHub Actions) over the other?


Thanks

lewismc

[0] https://ci-builds.apache.org/job/Nutch/
[1] https://github.com/apache/nutch/blob/master/.github/workflows/master-build.yml
[2] https://github.com/apache/nutch/commit/e33aaa14739c7c02f4121ac1d8d0e7860f329e06
[3] https://github.com/apache/nutch/commits/master/.github/workflows/master-build.yml

