[jira] [Commented] (NUTCH-3054) Address deprecation of Node16 for all GitHub Actions

2024-04-29 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17842209#comment-17842209
 ] 

ASF GitHub Bot commented on NUTCH-3054:
---

lewismc opened a new pull request, #817:
URL: https://github.com/apache/nutch/pull/817

   Addresses https://issues.apache.org/jira/browse/NUTCH-3054




> Address deprecation of Node16 for all GitHub Actions
> 
>
> Key: NUTCH-3054
> URL: https://issues.apache.org/jira/browse/NUTCH-3054
> Project: Nutch
>  Issue Type: Task
>  Components: ci/cd
>Affects Versions: 1.20
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Minor
> Fix For: 1.21
>
>
> See 
> [https://github.blog/changelog/2023-09-22-github-actions-transitioning-from-node-16-to-node-20/]
> We need to upgrade the setup-java action in  
> [https://github.com/apache/nutch/blob/master/.github/workflows/master-build.yml]
>  
> Patch coming up



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (NUTCH-3054) Address deprecation of Node16 for all GitHub Actions

2024-04-29 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-3054:

Affects Version/s: 1.20

> Address deprecation of Node16 for all GitHub Actions
> 
>
> Key: NUTCH-3054
> URL: https://issues.apache.org/jira/browse/NUTCH-3054
> Project: Nutch
>  Issue Type: Task
>  Components: ci/cd
>Affects Versions: 1.20
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Minor
> Fix For: 1.21
>
>
> See 
> [https://github.blog/changelog/2023-09-22-github-actions-transitioning-from-node-16-to-node-20/]
> We need to upgrade the setup-java action in  
> [https://github.com/apache/nutch/blob/master/.github/workflows/master-build.yml]
>  
> Patch coming up



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (NUTCH-3054) Address deprecation of Node16 for all GitHub Actions

2024-04-29 Thread Lewis John McGibbney (Jira)
Lewis John McGibbney created NUTCH-3054:
---

 Summary: Address deprecation of Node16 for all GitHub Actions
 Key: NUTCH-3054
 URL: https://issues.apache.org/jira/browse/NUTCH-3054
 Project: Nutch
  Issue Type: Task
  Components: ci/cd
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
 Fix For: 1.21


See 
[https://github.blog/changelog/2023-09-22-github-actions-transitioning-from-node-16-to-node-20/]

We need to upgrade the setup-java action in  
[https://github.com/apache/nutch/blob/master/.github/workflows/master-build.yml]
 

Patch coming up



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Work started] (NUTCH-3054) Address deprecation of Node16 for all GitHub Actions

2024-04-29 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on NUTCH-3054 started by Lewis John McGibbney.
---
> Address deprecation of Node16 for all GitHub Actions
> 
>
> Key: NUTCH-3054
> URL: https://issues.apache.org/jira/browse/NUTCH-3054
> Project: Nutch
>  Issue Type: Task
>  Components: ci/cd
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Minor
> Fix For: 1.21
>
>
> See 
> [https://github.blog/changelog/2023-09-22-github-actions-transitioning-from-node-16-to-node-20/]
> We need to upgrade the setup-java action in  
> [https://github.com/apache/nutch/blob/master/.github/workflows/master-build.yml]
>  
> Patch coming up



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (NUTCH-3049) Investigate using Records

2024-04-29 Thread Lewis John McGibbney (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17842208#comment-17842208
 ] 

Lewis John McGibbney commented on NUTCH-3049:
-

I think that each of the Writable classes mentioned in NutchWritable may be 
fair game

{{        org.apache.nutch.crawl.CrawlDatum.class,}}
{{        org.apache.nutch.crawl.Inlink.class,}}
{{        org.apache.nutch.crawl.Inlinks.class,}}
{{        org.apache.nutch.indexer.NutchIndexAction.class,}}
{{        org.apache.nutch.metadata.Metadata.class,}}
{{        org.apache.nutch.parse.Outlink.class,}}
{{        org.apache.nutch.parse.ParseText.class,}}
{{        org.apache.nutch.parse.ParseData.class,}}
{{        org.apache.nutch.parse.ParseImpl.class,}}
{{        org.apache.nutch.parse.ParseStatus.class,}}
{{        org.apache.nutch.protocol.Content.class,}}
{{        org.apache.nutch.protocol.ProtocolStatus.class,}}
{{        org.apache.nutch.scoring.webgraph.LinkDatum.class,}}
{{        org.apache.nutch.hostdb.HostDatum.class}}

> Investigate using Records
> -
>
> Key: NUTCH-3049
> URL: https://issues.apache.org/jira/browse/NUTCH-3049
> Project: Nutch
>  Issue Type: Sub-task
>Reporter: Lewis John McGibbney
>Priority: Major
>
> Guidance at [https://www.baeldung.com/java-migrate-8-to-17#records]
> i think there are multiple areas where we could use Records. This ticket will 
> document the opportunities and structure that work.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [DISCUSS] Consolidating Nutch Continuous Integration

2024-04-29 Thread Lewis John McGibbney
Hi Sebastian,
Understood. If it ain’t broke don’t fix it.
Thanks for the input.

On 2024/04/28 12:08:27 Sebastian Nagel wrote:
> 
>  From my side: no. It may not harm to have both.
> 
> Best,
> Sebastian


[jira] [Created] (NUTCH-3053) Upgrade build and CI to JDK17

2024-04-29 Thread Lewis John McGibbney (Jira)
Lewis John McGibbney created NUTCH-3053:
---

 Summary: Upgrade build and CI to JDK17
 Key: NUTCH-3053
 URL: https://issues.apache.org/jira/browse/NUTCH-3053
 Project: Nutch
  Issue Type: Sub-task
  Components: build, ci/cd
Reporter: Lewis John McGibbney


This will involves changes to
 * 
[https://github.com/apache/nutch/blob/master/.github/workflows/master-build.yml]
 * [https://ci-builds.apache.org/job/Nutch/job/Nutch-trunk/]
 * [https://github.com/apache/nutch/blob/master/default.properties#L46]
 * [https://github.com/apache/nutch/blob/master/default.properties#L57]
 * We should also investigate any deprecation notices in the build output
 * [https://github.com/apache/nutch/blob/master/ivy/mvn.template#L128-L129]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (NUTCH-3052) Investigate using sealed classes

2024-04-29 Thread Lewis John McGibbney (Jira)
Lewis John McGibbney created NUTCH-3052:
---

 Summary: Investigate using sealed classes
 Key: NUTCH-3052
 URL: https://issues.apache.org/jira/browse/NUTCH-3052
 Project: Nutch
  Issue Type: Sub-task
Reporter: Lewis John McGibbney


Guidance available at 
[https://www.baeldung.com/java-migrate-8-to-17#sealed-classes]

First document if and where sealed classes would add value.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (NUTCH-3051) Investigate using new pattern matching syntax in switch expressions

2024-04-29 Thread Lewis John McGibbney (Jira)
Lewis John McGibbney created NUTCH-3051:
---

 Summary: Investigate using new pattern matching syntax in switch 
expressions
 Key: NUTCH-3051
 URL: https://issues.apache.org/jira/browse/NUTCH-3051
 Project: Nutch
  Issue Type: Sub-task
Reporter: Lewis John McGibbney


Guidance available at 
[https://www.baeldung.com/java-migrate-8-to-17#2-switch-expressions]

Apparently we use switch in 35 files

[https://github.com/search?q=repo%3Aapache%2Fnutch+switch+language%3AJava=code=Java]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (NUTCH-3050) Investigate use of the enhanced instanceof operator

2024-04-29 Thread Lewis John McGibbney (Jira)
Lewis John McGibbney created NUTCH-3050:
---

 Summary: Investigate use of the enhanced instanceof operator
 Key: NUTCH-3050
 URL: https://issues.apache.org/jira/browse/NUTCH-3050
 Project: Nutch
  Issue Type: Sub-task
Reporter: Lewis John McGibbney


Guidance at 
[https://www.baeldung.com/java-migrate-8-to-17#1-enhanced-instanceof-operator]

Apparently we use instanceof operator in 50 files

[https://github.com/search?q=repo%3Aapache%2Fnutch%20instanceof=code]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (NUTCH-3049) Investigate using Records

2024-04-29 Thread Lewis John McGibbney (Jira)
Lewis John McGibbney created NUTCH-3049:
---

 Summary: Investigate using Records
 Key: NUTCH-3049
 URL: https://issues.apache.org/jira/browse/NUTCH-3049
 Project: Nutch
  Issue Type: Sub-task
Reporter: Lewis John McGibbney


Guidance at [https://www.baeldung.com/java-migrate-8-to-17#records]

i think there are multiple areas where we could use Records. This ticket will 
document the opportunities and structure that work.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (NUTCH-3048) Investigate where/if new string utility methods could be used

2024-04-29 Thread Lewis John McGibbney (Jira)
Lewis John McGibbney created NUTCH-3048:
---

 Summary: Investigate where/if new string utility methods could be 
used
 Key: NUTCH-3048
 URL: https://issues.apache.org/jira/browse/NUTCH-3048
 Project: Nutch
  Issue Type: Sub-task
  Components: util
Reporter: Lewis John McGibbney


Guidance at [https://www.baeldung.com/java-migrate-8-to-17#3-new-string-methods]

We may be able to also revisit our usage of common-* libraries with tje goal of 
using native methods from JDK.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (NUTCH-3047) Use multi-line text blocks

2024-04-29 Thread Lewis John McGibbney (Jira)
Lewis John McGibbney created NUTCH-3047:
---

 Summary: Use multi-line text blocks
 Key: NUTCH-3047
 URL: https://issues.apache.org/jira/browse/NUTCH-3047
 Project: Nutch
  Issue Type: Sub-task
  Components: CLI
Reporter: Lewis John McGibbney


Guidance available at 
[https://www.baeldung.com/java-migrate-8-to-17#2-text-block]

This will help to cleanup our CLI *usage()* messages at a bare minimum.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (NUTCH-3046) Use compact strings

2024-04-29 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-3046:

Description: 
Follow the guidance at 
[https://www.baeldung.com/java-migrate-8-to-17#1-compact-string]

It looks like there are 9 instances where we use _*char []*_

|[https://github.com/search?q=repo%3Aapache%2Fnutch%20char%5B%5D=code]].

  was:
Follow the guidance at 
[https://www.baeldung.com/java-migrate-8-to-17#1-compact-string]

It looks like there are [9 instances where we use 
char[]|[https://github.com/search?q=repo%3Aapache%2Fnutch%20char%5B%5D=code]].


> Use compact strings
> ---
>
> Key: NUTCH-3046
> URL: https://issues.apache.org/jira/browse/NUTCH-3046
> Project: Nutch
>  Issue Type: Sub-task
>Reporter: Lewis John McGibbney
>Priority: Major
>
> Follow the guidance at 
> [https://www.baeldung.com/java-migrate-8-to-17#1-compact-string]
> It looks like there are 9 instances where we use _*char []*_
> |[https://github.com/search?q=repo%3Aapache%2Fnutch%20char%5B%5D=code]].



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (NUTCH-1806) Delegate processing of URL domains to crawler commons

2024-04-29 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-1806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17841995#comment-17841995
 ] 

ASF GitHub Bot commented on NUTCH-1806:
---

sebastian-nagel opened a new pull request, #816:
URL: https://github.com/apache/nutch/pull/816

   and NUTCH-1942 Remove TopLevelDomain
   
   - use methods from crawler-commons' EffectiveTldFinder in URLUtil  replacing 
classed and methods from the "org.apache.nutch.util.domain" package
   
   - adapt and extend unit tests
 - add tests for URLUtil.getTopLevelDomainName(url)
 - reflect changes to the public suffix list since 2014 ("xyz" is now a 
public suffix / ICANN suffix)
 - adapt to minor API changes
- URLUtil.getDomainName(url) returns the host name in case no valid 
public suffix is found
- for Unicode suffixes and TLDs the methods 
URLUtil.getDomainSuffix(url) resp.  URLUtil.getTopLevelDomainName(url) now 
return the ASCII representation
  - add unit tests for host names with trailing dot ("www.apache.org.")
  - add add unit test for URLs without host/domain (cf. NUTCH-2450)unit 
test for URLs without host/domain (cf. NUTCH-2450)
   
   - update and complete Javadoc
   
   - update DomainStatistics, TLDIndexingFilter and domain URL filters to use 
the updated methods in URLUtil
   - remove the class TLDScoringFilter. The configuration is bound to the 
domain-suffixes.xml which wasn't maintained anymore and is now removed
   - remove package org.apache.nutch.util.domain
   - move DomainStatistics to org.apache.nutch.util
   - remove configuration files of domain utils




> Delegate processing of URL domains to crawler commons
> -
>
> Key: NUTCH-1806
> URL: https://issues.apache.org/jira/browse/NUTCH-1806
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.8
>Reporter: Julien Nioche
>Priority: Major
>  Labels: crawler-commons
> Fix For: 1.21
>
>
> We have code in src/java/org/apache/nutch/util/domain and a resource file 
> conf/domain-suffixes.xml to handle URL domains. This is used mostly from 
> URLUtil.getDomainName.
> The resource file is not necessarily up to date and since crawler commons has 
> a similar functionality we should use it instead of having to maintain our 
> own resources.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[PR] NUTCH-1806 Delegate processing of URL domains to crawler-common [nutch]

2024-04-29 Thread via GitHub


sebastian-nagel opened a new pull request, #816:
URL: https://github.com/apache/nutch/pull/816

   and NUTCH-1942 Remove TopLevelDomain
   
   - use methods from crawler-commons' EffectiveTldFinder in URLUtil  replacing 
classed and methods from the "org.apache.nutch.util.domain" package
   
   - adapt and extend unit tests
 - add tests for URLUtil.getTopLevelDomainName(url)
 - reflect changes to the public suffix list since 2014 ("xyz" is now a 
public suffix / ICANN suffix)
 - adapt to minor API changes
- URLUtil.getDomainName(url) returns the host name in case no valid 
public suffix is found
- for Unicode suffixes and TLDs the methods 
URLUtil.getDomainSuffix(url) resp.  URLUtil.getTopLevelDomainName(url) now 
return the ASCII representation
  - add unit tests for host names with trailing dot ("www.apache.org.")
  - add add unit test for URLs without host/domain (cf. NUTCH-2450)unit 
test for URLs without host/domain (cf. NUTCH-2450)
   
   - update and complete Javadoc
   
   - update DomainStatistics, TLDIndexingFilter and domain URL filters to use 
the updated methods in URLUtil
   - remove the class TLDScoringFilter. The configuration is bound to the 
domain-suffixes.xml which wasn't maintained anymore and is now removed
   - remove package org.apache.nutch.util.domain
   - move DomainStatistics to org.apache.nutch.util
   - remove configuration files of domain utils


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@nutch.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org