[jira] [Closed] (NUTCH-3041) Address confusing logging in o.a.n.net.URLExemptionFilters

2024-05-15 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney closed NUTCH-3041.
---

> Address confusing logging in o.a.n.net.URLExemptionFilters 
> ---
>
> Key: NUTCH-3041
> URL: https://issues.apache.org/jira/browse/NUTCH-3041
> Project: Nutch
>  Issue Type: Task
>  Components: net
>Affects Versions: 1.19, 1.20
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Minor
> Fix For: 1.21
>
>
> URLExemptionFilter implementations are used to allow exemptions to external 
> domain resources by overriding the {{db.ignore.external.links}} configuration 
> setting. This is useful when the crawl is focused on a domain but resources 
> like images are hosted on a CDN.
> Currently 
> [URLExemptionFilters|https://github.com/apache/nutch/blob/271f92e11c39b7a3583cfcd8d664262cfac59674/src/java/org/apache/nutch/net/URLExemptionFilters.java#L47-L48]
> provides the following logging
> {quote}INFO o.a.n.n.URLExemptionFilters [LocalJobRunner Map Task Executor 
> #0] Found 0 extensions at point:'org.apache.nutch.net.URLExemptionFilter'
> {quote}
> I find this confusing. It would be better to log *only* if an 
> URLExemptionFilter implementation is actually configured to be used at 
> runtime.
> I will provide a patch for this.
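
The behaviour requested above (stay silent when nothing is configured) could look roughly like the following. This is only a sketch under assumptions: the class name, the method names and the use of java.util.logging are all hypothetical stand-ins, and it is not the patch that was committed for NUTCH-3041.

```java
import java.util.List;
import java.util.Optional;
import java.util.logging.Logger;

/** Hypothetical stand-in for the filter loader; not the actual Nutch class. */
final class ExemptionFilterLoggingSketch {

  private static final Logger LOG =
      Logger.getLogger(ExemptionFilterLoggingSketch.class.getName());

  /** Builds the log message, or returns empty when no filter is configured. */
  static Optional<String> describeFilters(List<String> filterClassNames) {
    if (filterClassNames == null || filterClassNames.isEmpty()) {
      return Optional.empty(); // nothing configured, so nothing worth logging
    }
    return Optional.of("Found " + filterClassNames.size()
        + " URLExemptionFilter implementation(s): " + filterClassNames);
  }

  /** Logs at INFO level only when at least one filter is active. */
  static void logConfiguredFilters(List<String> filterClassNames) {
    describeFilters(filterClassNames).ifPresent(message -> LOG.info(message));
  }

  public static void main(String[] args) {
    logConfiguredFilters(List.of()); // stays silent
    logConfiguredFilters(List.of("org.example.HypotheticalExemptionFilter"));
  }
}
```

Keeping the message construction separate from the logging call makes the "no filters, no log line" rule easy to unit test.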



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Work stopped] (NUTCH-3041) Address confusing logging in o.a.n.net.URLExemptionFilters

2024-05-15 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on NUTCH-3041 stopped by Lewis John McGibbney.
---
> Address confusing logging in o.a.n.net.URLExemptionFilters 
> ---
>
> Key: NUTCH-3041
> URL: https://issues.apache.org/jira/browse/NUTCH-3041
> Project: Nutch
>  Issue Type: Task
>  Components: net
>Affects Versions: 1.19, 1.20
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Minor
> Fix For: 1.21
>
>
> URLExemptionFilter implementations are used to allow exemptions to external 
> domain resources by overriding the {{db.ignore.external.links}} configuration 
> setting. This is useful when the crawl is focused on a domain but resources 
> like images are hosted on a CDN.
> Currently 
> [URLExemptionFilters|https://github.com/apache/nutch/blob/271f92e11c39b7a3583cfcd8d664262cfac59674/src/java/org/apache/nutch/net/URLExemptionFilters.java#L47-L48]
> provides the following logging
> {quote}INFO o.a.n.n.URLExemptionFilters [LocalJobRunner Map Task Executor 
> #0] Found 0 extensions at point:'org.apache.nutch.net.URLExemptionFilter'
> {quote}
> I find this confusing. It would be better to log *only* if an 
> URLExemptionFilter implementation is actually configured to be used at 
> runtime.
> I will provide a patch for this.





[jira] [Resolved] (NUTCH-3041) Address confusing logging in o.a.n.net.URLExemptionFilters

2024-05-15 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney resolved NUTCH-3041.
-
Resolution: Fixed

> Address confusing logging in o.a.n.net.URLExemptionFilters 
> ---
>
> Key: NUTCH-3041
> URL: https://issues.apache.org/jira/browse/NUTCH-3041
> Project: Nutch
>  Issue Type: Task
>  Components: net
>Affects Versions: 1.19, 1.20
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Minor
> Fix For: 1.21
>
>
> URLExemptionFilter implementations are used to allow exemptions to external 
> domain resources by overriding the {{db.ignore.external.links}} configuration 
> setting. This is useful when the crawl is focused on a domain but resources 
> like images are hosted on a CDN.
> Currently 
> [URLExemptionFilters|https://github.com/apache/nutch/blob/271f92e11c39b7a3583cfcd8d664262cfac59674/src/java/org/apache/nutch/net/URLExemptionFilters.java#L47-L48]
> provides the following logging
> {quote}INFO o.a.n.n.URLExemptionFilters [LocalJobRunner Map Task Executor 
> #0] Found 0 extensions at point:'org.apache.nutch.net.URLExemptionFilter'
> {quote}
> I find this confusing. It would be better to log *only* if an 
> URLExemptionFilter implementation is actually configured to be used at 
> runtime.
> I will provide a patch for this.





[jira] [Updated] (TIKA-4232) Create and execute unit tests for tika-helm

2024-05-10 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated TIKA-4232:
---
Fix Version/s: 2.9.3

> Create and execute unit tests for tika-helm
> ---
>
> Key: TIKA-4232
> URL: https://issues.apache.org/jira/browse/TIKA-4232
> Project: Tika
>  Issue Type: Improvement
>  Components: tika-helm
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Major
> Fix For: 2.9.3
>
>
> The goal is to execute chart unit tests against each tika-helm pull request.
> I found the [Helm Unit 
> Tests|[https://github.com/marketplace/actions/helm-unit-tests]] GitHub Action 
> which uses [https://github.com/helm-unittest/helm-unittest] as a Helm plugin.
> The PR will consist of one or more unit tests automated via the GitHub action.





[jira] [Resolved] (TIKA-4232) Create and execute unit tests for tika-helm

2024-05-10 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney resolved TIKA-4232.

Resolution: Fixed

> Create and execute unit tests for tika-helm
> ---
>
> Key: TIKA-4232
> URL: https://issues.apache.org/jira/browse/TIKA-4232
> Project: Tika
>  Issue Type: Improvement
>  Components: tika-helm
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Major
>
> The goal is to execute chart unit tests against each tika-helm pull request.
> I found the [Helm Unit 
> Tests|[https://github.com/marketplace/actions/helm-unit-tests]] GitHub Action 
> which uses [https://github.com/helm-unittest/helm-unittest] as a Helm plugin.
> The PR will consist of one or more unit tests automated via the GitHub action.





[jira] [Closed] (TIKA-4232) Create and execute unit tests for tika-helm

2024-05-10 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney closed TIKA-4232.
--

> Create and execute unit tests for tika-helm
> ---
>
> Key: TIKA-4232
> URL: https://issues.apache.org/jira/browse/TIKA-4232
> Project: Tika
>  Issue Type: Improvement
>  Components: tika-helm
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Major
> Fix For: 2.9.3
>
>
> The goal is to execute chart unit tests against each tika-helm pull request.
> I found the [Helm Unit 
> Tests|[https://github.com/marketplace/actions/helm-unit-tests]] GitHub Action 
> which uses [https://github.com/helm-unittest/helm-unittest] as a Helm plugin.
> The PR will consist of one or more unit tests automated via the GitHub action.





[jira] [Closed] (TIKA-4233) Check tika-helm for deprecated k8s APIs

2024-05-09 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney closed TIKA-4233.
--

> Check tika-helm for deprecated k8s APIs
> ---
>
> Key: TIKA-4233
> URL: https://issues.apache.org/jira/browse/TIKA-4233
> Project: Tika
>  Issue Type: New Feature
>  Components: tika-helm
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Major
> Fix For: 2.9.3
>
>
> It is useful to know when a Helm Chart uses deprecated k8s APIs. A check for 
> this would be ideal. The “Check deprecated k8s APIs” GitHub action 
> accomplishes this.
> [https://github.com/marketplace/actions/check-deprecated-k8s-apis]





[jira] [Resolved] (TIKA-4233) Check tika-helm for deprecated k8s APIs

2024-05-09 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney resolved TIKA-4233.

Resolution: Fixed

This PR broke one of the GitHub Action workflows. I have written to INFRA about 
it

https://issues.apache.org/jira/browse/INFRA-25775

> Check tika-helm for deprecated k8s APIs
> ---
>
> Key: TIKA-4233
> URL: https://issues.apache.org/jira/browse/TIKA-4233
> Project: Tika
>  Issue Type: New Feature
>  Components: tika-helm
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Major
> Fix For: 2.9.3
>
>
> It is useful to know when a Helm Chart uses deprecated k8s APIs. A check for 
> this would be ideal. The “Check deprecated k8s APIs” GitHub action 
> accomplishes this.
> [https://github.com/marketplace/actions/check-deprecated-k8s-apis]





[jira] [Updated] (TIKA-4233) Check tika-helm for deprecated k8s APIs

2024-05-09 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated TIKA-4233:
---
Fix Version/s: 2.9.3

> Check tika-helm for deprecated k8s APIs
> ---
>
> Key: TIKA-4233
> URL: https://issues.apache.org/jira/browse/TIKA-4233
> Project: Tika
>  Issue Type: New Feature
>  Components: tika-helm
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Major
> Fix For: 2.9.3
>
>
> It is useful to know when a Helm Chart uses deprecated k8s APIs. A check for 
> this would be ideal. The “Check deprecated k8s APIs” GitHub action 
> accomplishes this.
> [https://github.com/marketplace/actions/check-deprecated-k8s-apis]





[jira] [Closed] (NUTCH-3054) Address deprecation of Node16 for all GitHub Actions

2024-04-30 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney closed NUTCH-3054.
---

> Address deprecation of Node16 for all GitHub Actions
> 
>
> Key: NUTCH-3054
> URL: https://issues.apache.org/jira/browse/NUTCH-3054
> Project: Nutch
>  Issue Type: Task
>  Components: ci/cd
>Affects Versions: 1.20
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Minor
> Fix For: 1.21
>
>
> See 
> [https://github.blog/changelog/2023-09-22-github-actions-transitioning-from-node-16-to-node-20/]
> We need to upgrade the setup-java action in  
> [https://github.com/apache/nutch/blob/master/.github/workflows/master-build.yml]
>  
> Patch coming up





[jira] [Resolved] (NUTCH-3054) Address deprecation of Node16 for all GitHub Actions

2024-04-30 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney resolved NUTCH-3054.
-
Resolution: Fixed

> Address deprecation of Node16 for all GitHub Actions
> 
>
> Key: NUTCH-3054
> URL: https://issues.apache.org/jira/browse/NUTCH-3054
> Project: Nutch
>  Issue Type: Task
>  Components: ci/cd
>Affects Versions: 1.20
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Minor
> Fix For: 1.21
>
>
> See 
> [https://github.blog/changelog/2023-09-22-github-actions-transitioning-from-node-16-to-node-20/]
> We need to upgrade the setup-java action in  
> [https://github.com/apache/nutch/blob/master/.github/workflows/master-build.yml]
>  
> Patch coming up





[jira] [Updated] (NUTCH-3054) Address deprecation of Node16 for all GitHub Actions

2024-04-29 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-3054:

Affects Version/s: 1.20

> Address deprecation of Node16 for all GitHub Actions
> 
>
> Key: NUTCH-3054
> URL: https://issues.apache.org/jira/browse/NUTCH-3054
> Project: Nutch
>  Issue Type: Task
>  Components: ci/cd
>Affects Versions: 1.20
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Minor
> Fix For: 1.21
>
>
> See 
> [https://github.blog/changelog/2023-09-22-github-actions-transitioning-from-node-16-to-node-20/]
> We need to upgrade the setup-java action in  
> [https://github.com/apache/nutch/blob/master/.github/workflows/master-build.yml]
>  
> Patch coming up





[jira] [Created] (NUTCH-3054) Address deprecation of Node16 for all GitHub Actions

2024-04-29 Thread Lewis John McGibbney (Jira)
Lewis John McGibbney created NUTCH-3054:
---

 Summary: Address deprecation of Node16 for all GitHub Actions
 Key: NUTCH-3054
 URL: https://issues.apache.org/jira/browse/NUTCH-3054
 Project: Nutch
  Issue Type: Task
  Components: ci/cd
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
 Fix For: 1.21


See 
[https://github.blog/changelog/2023-09-22-github-actions-transitioning-from-node-16-to-node-20/]

We need to upgrade the setup-java action in  
[https://github.com/apache/nutch/blob/master/.github/workflows/master-build.yml]
 

Patch coming up





[jira] [Work started] (NUTCH-3054) Address deprecation of Node16 for all GitHub Actions

2024-04-29 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on NUTCH-3054 started by Lewis John McGibbney.
---
> Address deprecation of Node16 for all GitHub Actions
> 
>
> Key: NUTCH-3054
> URL: https://issues.apache.org/jira/browse/NUTCH-3054
> Project: Nutch
>  Issue Type: Task
>  Components: ci/cd
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Minor
> Fix For: 1.21
>
>
> See 
> [https://github.blog/changelog/2023-09-22-github-actions-transitioning-from-node-16-to-node-20/]
> We need to upgrade the setup-java action in  
> [https://github.com/apache/nutch/blob/master/.github/workflows/master-build.yml]
>  
> Patch coming up





[jira] [Commented] (NUTCH-3049) Investigate using Records

2024-04-29 Thread Lewis John McGibbney (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17842208#comment-17842208
 ] 

Lewis John McGibbney commented on NUTCH-3049:
-

I think that each of the Writable classes mentioned in NutchWritable may be 
fair game

{{        org.apache.nutch.crawl.CrawlDatum.class,}}
{{        org.apache.nutch.crawl.Inlink.class,}}
{{        org.apache.nutch.crawl.Inlinks.class,}}
{{        org.apache.nutch.indexer.NutchIndexAction.class,}}
{{        org.apache.nutch.metadata.Metadata.class,}}
{{        org.apache.nutch.parse.Outlink.class,}}
{{        org.apache.nutch.parse.ParseText.class,}}
{{        org.apache.nutch.parse.ParseData.class,}}
{{        org.apache.nutch.parse.ParseImpl.class,}}
{{        org.apache.nutch.parse.ParseStatus.class,}}
{{        org.apache.nutch.protocol.Content.class,}}
{{        org.apache.nutch.protocol.ProtocolStatus.class,}}
{{        org.apache.nutch.scoring.webgraph.LinkDatum.class,}}
{{        org.apache.nutch.hostdb.HostDatum.class}}

> Investigate using Records
> -
>
> Key: NUTCH-3049
> URL: https://issues.apache.org/jira/browse/NUTCH-3049
> Project: Nutch
>  Issue Type: Sub-task
>Reporter: Lewis John McGibbney
>Priority: Major
>
> Guidance at [https://www.baeldung.com/java-migrate-8-to-17#records]
> I think there are multiple areas where we could use Records. This ticket will 
> document the opportunities and structure that work.





[jira] [Created] (NUTCH-3053) Upgrade build and CI to JDK17

2024-04-29 Thread Lewis John McGibbney (Jira)
Lewis John McGibbney created NUTCH-3053:
---

 Summary: Upgrade build and CI to JDK17
 Key: NUTCH-3053
 URL: https://issues.apache.org/jira/browse/NUTCH-3053
 Project: Nutch
  Issue Type: Sub-task
  Components: build, ci/cd
Reporter: Lewis John McGibbney


This will involve changes to
 * [https://github.com/apache/nutch/blob/master/.github/workflows/master-build.yml]
 * [https://ci-builds.apache.org/job/Nutch/job/Nutch-trunk/]
 * [https://github.com/apache/nutch/blob/master/default.properties#L46]
 * [https://github.com/apache/nutch/blob/master/default.properties#L57]
 * We should also investigate any deprecation notices in the build output
 * [https://github.com/apache/nutch/blob/master/ivy/mvn.template#L128-L129]





[jira] [Created] (NUTCH-3052) Investigate using sealed classes

2024-04-29 Thread Lewis John McGibbney (Jira)
Lewis John McGibbney created NUTCH-3052:
---

 Summary: Investigate using sealed classes
 Key: NUTCH-3052
 URL: https://issues.apache.org/jira/browse/NUTCH-3052
 Project: Nutch
  Issue Type: Sub-task
Reporter: Lewis John McGibbney


Guidance available at 
[https://www.baeldung.com/java-migrate-8-to-17#sealed-classes]

First document if and where sealed classes would add value.
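
To make the evaluation more concrete, here is a minimal sketch of what a sealed hierarchy looks like on JDK 17. The FetchOutcomeSketch type and its three variants are invented for illustration and do not correspond to existing Nutch classes.

```java
/** Hypothetical closed set of fetch results; only the listed types may implement it. */
sealed interface FetchOutcomeSketch
    permits FetchOutcomeSketch.Success, FetchOutcomeSketch.Redirect, FetchOutcomeSketch.Failure {

  record Success(int httpStatus) implements FetchOutcomeSketch {}

  record Redirect(String target) implements FetchOutcomeSketch {}

  record Failure(String reason) implements FetchOutcomeSketch {}

  /** Handles every permitted variant using instanceof patterns (final since Java 16). */
  static String describe(FetchOutcomeSketch outcome) {
    if (outcome instanceof Success s) {
      return "fetched with status " + s.httpStatus();
    }
    if (outcome instanceof Redirect r) {
      return "redirected to " + r.target();
    }
    if (outcome instanceof Failure f) {
      return "failed: " + f.reason();
    }
    // Not reachable for non-null input, because the hierarchy is sealed.
    throw new IllegalStateException("unexpected outcome: " + outcome);
  }

  public static void main(String[] args) {
    System.out.println(describe(new Redirect("https://example.org/")));
  }
}
```

The value would mostly come from places where a fixed set of subtypes is already assumed implicitly; plugin extension points, which must stay open, are probably not candidates.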





[jira] [Created] (NUTCH-3051) Investigate using new pattern matching syntax in switch expressions

2024-04-29 Thread Lewis John McGibbney (Jira)
Lewis John McGibbney created NUTCH-3051:
---

 Summary: Investigate using new pattern matching syntax in switch 
expressions
 Key: NUTCH-3051
 URL: https://issues.apache.org/jira/browse/NUTCH-3051
 Project: Nutch
  Issue Type: Sub-task
Reporter: Lewis John McGibbney


Guidance available at 
[https://www.baeldung.com/java-migrate-8-to-17#2-switch-expressions]

Apparently we use switch in 35 files

[https://github.com/search?q=repo%3Aapache%2Fnutch+switch+language%3AJava=code=Java]
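
For reference, a small sketch of the arrow-style switch expression that is final since Java 14. Note that type patterns inside switch are still a preview feature on JDK 17, so the sketch avoids them. The Status enum and its labels are hypothetical and not taken from the Nutch code base.

```java
/** Illustrative only: the enum and the labels are assumptions. */
final class SwitchExpressionSketch {

  enum Status { FETCHED, REDIRECTED, GONE, RETRY }

  /** A switch expression yields a value and needs no break statements. */
  static String label(Status status) {
    return switch (status) {
      case FETCHED -> "ok";
      case REDIRECTED -> "moved";
      case GONE, RETRY -> "unavailable"; // several constants can share one arm
    };
  }

  public static void main(String[] args) {
    for (Status status : Status.values()) {
      System.out.println(status + " -> " + label(status));
    }
  }
}
```

Because the switch covers every enum constant, no default branch is needed, and the compiler flags the expression if a new constant is added later.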





[jira] [Created] (NUTCH-3050) Investigate use of the enhanced instanceof operator

2024-04-29 Thread Lewis John McGibbney (Jira)
Lewis John McGibbney created NUTCH-3050:
---

 Summary: Investigate use of the enhanced instanceof operator
 Key: NUTCH-3050
 URL: https://issues.apache.org/jira/browse/NUTCH-3050
 Project: Nutch
  Issue Type: Sub-task
Reporter: Lewis John McGibbney


Guidance at 
[https://www.baeldung.com/java-migrate-8-to-17#1-enhanced-instanceof-operator]

Apparently we use instanceof operator in 50 files

[https://github.com/search?q=repo%3Aapache%2Fnutch%20instanceof=code]
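
As a reminder of what the change looks like in practice, here is a minimal sketch of the pattern form of instanceof (final since Java 16). The helper below is hypothetical and only illustrates the syntax.

```java
/** Illustrative only: a hypothetical helper, not code from Nutch. */
final class InstanceofSketch {

  /** Returns the length of text or byte content, or -1 for other types. */
  static int length(Object value) {
    // Before: if (value instanceof CharSequence) { CharSequence text = (CharSequence) value; ... }
    if (value instanceof CharSequence text) {
      return text.length(); // 'text' is already cast and in scope here
    }
    if (value instanceof byte[] bytes) {
      return bytes.length;
    }
    return -1;
  }

  public static void main(String[] args) {
    System.out.println(length("nutch"));
    System.out.println(length(new byte[4]));
  }
}
```

The migration is mechanical in most cases: the explicit cast on the line after the check disappears, and the binding variable takes its place.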





[jira] [Created] (NUTCH-3049) Investigate using Records

2024-04-29 Thread Lewis John McGibbney (Jira)
Lewis John McGibbney created NUTCH-3049:
---

 Summary: Investigate using Records
 Key: NUTCH-3049
 URL: https://issues.apache.org/jira/browse/NUTCH-3049
 Project: Nutch
  Issue Type: Sub-task
Reporter: Lewis John McGibbney


Guidance at [https://www.baeldung.com/java-migrate-8-to-17#records]

I think there are multiple areas where we could use Records. This ticket will 
document the opportunities and structure that work.
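
As a starting point for that documentation, here is a rough sketch of a record used as a small value type. LinkRecordSketch and its fields are hypothetical. One point to weigh: record fields are final, so a record cannot be populated in place the way Hadoop's Writable.readFields(DataInput) expects; the sketch uses a static read factory instead, which is an assumption about how such a type might be serialized rather than a proposal for the existing Writable classes.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Objects;

/** Hypothetical immutable link value; equals, hashCode and toString are generated. */
record LinkRecordSketch(String fromUrl, String anchor) {

  /** Compact constructor: validates and normalizes the components. */
  LinkRecordSketch {
    Objects.requireNonNull(fromUrl, "fromUrl");
    anchor = (anchor == null) ? "" : anchor;
  }

  void write(DataOutput out) throws IOException {
    out.writeUTF(fromUrl);
    out.writeUTF(anchor);
  }

  /** Static factory, since a record cannot be refilled in place. */
  static LinkRecordSketch read(DataInput in) throws IOException {
    return new LinkRecordSketch(in.readUTF(), in.readUTF());
  }

  /** Serializes and deserializes a value in memory. */
  static LinkRecordSketch roundTrip(LinkRecordSketch value) {
    try {
      ByteArrayOutputStream buffer = new ByteArrayOutputStream();
      value.write(new DataOutputStream(buffer));
      return read(new DataInputStream(new ByteArrayInputStream(buffer.toByteArray())));
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    LinkRecordSketch link = new LinkRecordSketch("https://example.org/", null);
    System.out.println(roundTrip(link).equals(link));
  }
}
```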





[jira] [Created] (NUTCH-3048) Investigate where/if new string utility methods could be used

2024-04-29 Thread Lewis John McGibbney (Jira)
Lewis John McGibbney created NUTCH-3048:
---

 Summary: Investigate where/if new string utility methods could be 
used
 Key: NUTCH-3048
 URL: https://issues.apache.org/jira/browse/NUTCH-3048
 Project: Nutch
  Issue Type: Sub-task
  Components: util
Reporter: Lewis John McGibbney


Guidance at [https://www.baeldung.com/java-migrate-8-to-17#3-new-string-methods]

We may also be able to revisit our usage of commons-* libraries with the goal of 
using native methods from the JDK.
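
A short sketch of the kind of replacement this could enable, using String methods available since Java 11 (lines, strip, isBlank, repeat). The helper names are hypothetical. One caveat to keep in mind when swapping out a utility such as Commons Lang StringUtils.isBlank: the JDK methods are instance methods, so they are not null-safe.

```java
import java.util.List;
import java.util.stream.Collectors;

/** Illustrative only: hypothetical helpers built on JDK 11 String methods. */
final class StringMethodsSketch {

  /** Splits text into lines, trims each one and drops the blank ones. */
  static List<String> nonBlankLines(String text) {
    return text.lines()
        .map(String::strip)              // Unicode-aware trimming
        .filter(line -> !line.isBlank()) // true for empty or whitespace-only strings
        .collect(Collectors.toList());
  }

  /** Indents a string by two spaces per level. */
  static String indent(String value, int depth) {
    return "  ".repeat(depth) + value;
  }

  public static void main(String[] args) {
    System.out.println(nonBlankLines(" a \n\n  b\n"));
    System.out.println(indent("x", 2));
  }
}
```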





[jira] [Created] (NUTCH-3047) Use multi-line text blocks

2024-04-29 Thread Lewis John McGibbney (Jira)
Lewis John McGibbney created NUTCH-3047:
---

 Summary: Use multi-line text blocks
 Key: NUTCH-3047
 URL: https://issues.apache.org/jira/browse/NUTCH-3047
 Project: Nutch
  Issue Type: Sub-task
  Components: CLI
Reporter: Lewis John McGibbney


Guidance available at 
[https://www.baeldung.com/java-migrate-8-to-17#2-text-block]

This will help to clean up our CLI *usage()* messages at a bare minimum.
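
For illustration, a usage message written as a text block (final since Java 15) might look like the sketch below. The tool name and options are invented and do not belong to any existing Nutch command.

```java
/** Illustrative only: the tool name and its options are hypothetical. */
final class UsageTextBlockSketch {

  /** Returns a multi-line usage message without string concatenation or \n escapes. */
  static String usage() {
    return """
        Usage: HypotheticalTool <input_dir> [-force]
          <input_dir>  directory to read from
          -force       overwrite existing output
        """;
  }

  public static void main(String[] args) {
    System.err.print(usage());
  }
}
```

The common leading indentation is stripped by the compiler, relative to the least-indented line (here the closing delimiter), so the message can be indented along with the surrounding code.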





[jira] [Updated] (NUTCH-3046) Use compact strings

2024-04-29 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-3046:

Description: 
Follow the guidance at 
[https://www.baeldung.com/java-migrate-8-to-17#1-compact-string]

It looks like there are 9 instances where we use _*char[]*_:

[https://github.com/search?q=repo%3Aapache%2Fnutch%20char%5B%5D=code]

  was:
Follow the guidance at 
[https://www.baeldung.com/java-migrate-8-to-17#1-compact-string]

It looks like there are [9 instances where we use 
char[]|[https://github.com/search?q=repo%3Aapache%2Fnutch%20char%5B%5D=code]].


> Use compact strings
> ---
>
> Key: NUTCH-3046
> URL: https://issues.apache.org/jira/browse/NUTCH-3046
> Project: Nutch
>  Issue Type: Sub-task
>Reporter: Lewis John McGibbney
>Priority: Major
>
> Follow the guidance at 
> [https://www.baeldung.com/java-migrate-8-to-17#1-compact-string]
> It looks like there are 9 instances where we use _*char[]*_:
> [https://github.com/search?q=repo%3Aapache%2Fnutch%20char%5B%5D=code]





[jira] [Created] (NUTCH-3046) Use compact strings

2024-04-28 Thread Lewis John McGibbney (Jira)
Lewis John McGibbney created NUTCH-3046:
---

 Summary: Use compact strings
 Key: NUTCH-3046
 URL: https://issues.apache.org/jira/browse/NUTCH-3046
 Project: Nutch
  Issue Type: Sub-task
Reporter: Lewis John McGibbney


Follow the guidance at 
[https://www.baeldung.com/java-migrate-8-to-17#1-compact-string]

It looks like there are [9 instances where we use 
char[]|[https://github.com/search?q=repo%3Aapache%2Fnutch%20char%5B%5D=code]].





[jira] [Created] (NUTCH-3045) Upgrade from Java 11 to 17

2024-04-28 Thread Lewis John McGibbney (Jira)
Lewis John McGibbney created NUTCH-3045:
---

 Summary: Upgrade from Java 11 to 17
 Key: NUTCH-3045
 URL: https://issues.apache.org/jira/browse/NUTCH-3045
 Project: Nutch
  Issue Type: Task
  Components: build, ci/cd
Reporter: Lewis John McGibbney
 Fix For: 1.21


This parent issue will track and organize work pertaining to upgrading Nutch to 
JDK 17.

Premier support for Oracle JDK 11 ended 7 months ago (30 Sep 2023).





[jira] [Updated] (NUTCH-3042) Use GitHub cache action to improve CI execution time

2024-04-19 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-3042:

Description: 
With the Ant+Ivy build architecture, the current GitHub actions workflow can 
and regularly does take over 20 minutes to complete. Dependency retrieval takes 
a significant amount of time.

I think we can address the above issue and dramatically reduce the CI runtime 
by utilizing the official [GitHub cache 
action|[https://github.com/actions/cache]].

It appears however that the action does not support the Apache Ivy cache. Both 
Maven and Gradle are supported. I [created a 
discussion|[https://github.com/actions/cache/discussions/1381]] to get 
confirmation.

In the case that we cannot implement a cache for the Ivy build system then we 
will need to come back to this issue once we migrate to Gradle.

  was:
With the Ant+Ivy build architecture, the current GitHub actions workflow can 
and regularly does take over 20 minutes to complete. Dependency retrieval takes 
a significant amount of time.

I think we can address the above issue and dramatically reduce the CI runtime 
by utilizing the official [GitHub cache 
action|[https://github.com/actions/cache]].

It appears however that the action does not support the Apache Ivy cache. Both 
Maven and Gradle are supported. I created a discussion to get confirmation that 
this is the case.

In the case that we cannot implement a cache for the Ivy build system then we 
will need to come back to this issue once we migrate to Gradle.


> Use GitHub cache action to improve CI execution time
> 
>
> Key: NUTCH-3042
> URL: https://issues.apache.org/jira/browse/NUTCH-3042
> Project: Nutch
>  Issue Type: Task
>  Components: ci/cd
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Major
> Fix For: 1.21
>
>
> With the Ant+Ivy build architecture, the current GitHub actions workflow can 
> and regularly does take over 20 minutes to complete. Dependency retrieval 
> takes a significant amount of time.
> I think we can address the above issue and dramatically reduce the CI runtime 
> by utilizing the official [GitHub cache 
> action|[https://github.com/actions/cache]].
> It appears however that the action does not support the Apache Ivy cache. 
> Both Maven and Gradle are supported. I [created a 
> discussion|[https://github.com/actions/cache/discussions/1381]] to get 
> confirmation.
> In the case that we cannot implement a cache for the Ivy build system then we 
> will need to come back to this issue once we migrate to Gradle.





[jira] [Created] (NUTCH-3042) Use GitHub cache action to improve CI execution time

2024-04-19 Thread Lewis John McGibbney (Jira)
Lewis John McGibbney created NUTCH-3042:
---

 Summary: Use GitHub cache action to improve CI execution time
 Key: NUTCH-3042
 URL: https://issues.apache.org/jira/browse/NUTCH-3042
 Project: Nutch
  Issue Type: Task
  Components: ci/cd
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
 Fix For: 1.21


With the Ant+Ivy build architecture, the current GitHub actions workflow can 
and regularly does take over 20 minutes to complete. Dependency retrieval takes 
a significant amount of time.

I think we can address the above issue and dramatically reduce the CI runtime 
by utilizing the official [GitHub cache 
action|[https://github.com/actions/cache]].

It appears however that the action does not support the Apache Ivy cache. Both 
Maven and Gradle are supported. I created a discussion to get confirmation that 
this is the case.

In the case that we cannot implement a cache for the Ivy build system then we 
will need to come back to this issue once we migrate to Gradle.





[jira] [Work started] (NUTCH-3041) Address confusing logging in o.a.n.net.URLExemptionFilters

2024-04-19 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on NUTCH-3041 started by Lewis John McGibbney.
---
> Address confusing logging in o.a.n.net.URLExemptionFilters 
> ---
>
> Key: NUTCH-3041
> URL: https://issues.apache.org/jira/browse/NUTCH-3041
> Project: Nutch
>  Issue Type: Task
>  Components: net
>Affects Versions: 1.19, 1.20
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Minor
> Fix For: 1.21
>
>
> URLExemptionFilter implementations are used to allow exemptions to external 
> domain resources by overriding the {{db.ignore.external.links}} configuration 
> setting. This is useful when the crawl is focused on a domain but resources 
> like images are hosted on a CDN.
> Currently 
> [URLExemptionFilters|https://github.com/apache/nutch/blob/271f92e11c39b7a3583cfcd8d664262cfac59674/src/java/org/apache/nutch/net/URLExemptionFilters.java#L47-L48]
> provides the following logging
> {quote}INFO o.a.n.n.URLExemptionFilters [LocalJobRunner Map Task Executor 
> #0] Found 0 extensions at point:'org.apache.nutch.net.URLExemptionFilter'
> {quote}
> I find this confusing. It would be better to log *only* if an 
> URLExemptionFilter implementation is actually configured to be used at 
> runtime.
> I will provide a patch for this.





[jira] [Updated] (NUTCH-3041) Address confusing logging in o.a.n.net.URLExemptionFilters

2024-04-19 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-3041:

Description: 
URLExemptionFilter implementations are used to allow exemptions to external 
domain resources by overriding the {{db.ignore.external.links}} configuration 
setting. This is useful when the crawl is focused on a domain but resources 
like images are hosted on a CDN.

Currently 
[URLExemptionFilters|https://github.com/apache/nutch/blob/271f92e11c39b7a3583cfcd8d664262cfac59674/src/java/org/apache/nutch/net/URLExemptionFilters.java#L47-L48]
provides the following logging
{quote}INFO o.a.n.n.URLExemptionFilters [LocalJobRunner Map Task Executor 
#0] Found 0 extensions at point:'org.apache.nutch.net.URLExemptionFilter'
{quote}
I find this confusing. It would be better to log *only* if an 
URLExemptionFilter implementation is actually configured to be used at runtime.

I will provide a patch for this.

  was:
URLExemptionFilter implementations are used to allow exemptions to external 
domain resources by overriding the {{db.ignore.external.links}} configuration 
setting. This is useful when the crawl is focused on a domain but resources 
like images are hosted on a CDN.

Currently 
[URLExemptionFilters|https://github.com/apache/nutch/blob/271f92e11c39b7a3583cfcd8d664262cfac59674/src/java/org/apache/nutch/net/URLExemptionFilters.java#L47-L48]
provides the following logging
{quote}INFO o.a.n.n.URLExemptionFilters [LocalJobRunner Map Task Executor 
#0] Found 0 extensions at point:'org.apache.nutch.net.URLExemptionFilter'
{quote}
I find this confusing. It would be better to log *only* if an 
URLExemptionFilter implementation actually exists for a given URL.

I will provide a patch for this.


> Address confusing logging in o.a.n.net.URLExemptionFilters 
> ---
>
> Key: NUTCH-3041
> URL: https://issues.apache.org/jira/browse/NUTCH-3041
> Project: Nutch
>  Issue Type: Task
>  Components: net
>Affects Versions: 1.19, 1.20
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Minor
> Fix For: 1.21
>
>
> URLExemptionFilter implementations are used to allow exemptions to external 
> domain resources by overriding the {{db.ignore.external.links}} configuration 
> setting. This is useful when the crawl is focused on a domain but resources 
> like images are hosted on a CDN.
> Currently 
> [URLExemptionFilters|https://github.com/apache/nutch/blob/271f92e11c39b7a3583cfcd8d664262cfac59674/src/java/org/apache/nutch/net/URLExemptionFilters.java#L47-L48]
> provides the following logging
> {quote}INFO o.a.n.n.URLExemptionFilters [LocalJobRunner Map Task Executor 
> #0] Found 0 extensions at point:'org.apache.nutch.net.URLExemptionFilter'
> {quote}
> I find this confusing. It would be better to log *only* if an 
> URLExemptionFilter implementation is actually configured to be used at 
> runtime.
> I will provide a patch for this.





[jira] [Updated] (NUTCH-3041) Address confusing logging in o.a.n.net.URLExemptionFilters

2024-04-19 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-3041:

Description: 
URLExemptionFilter implementations are used to allow exemptions to external 
domain resources by overriding the {{db.ignore.external.links}} configuration 
setting. This is useful when the crawl is focused on a domain but resources 
like images are hosted on a CDN.

Currently 
[URLExemptionFilters|https://github.com/apache/nutch/blob/271f92e11c39b7a3583cfcd8d664262cfac59674/src/java/org/apache/nutch/net/URLExemptionFilters.java#L47-L48]
provides the following logging
{quote}INFO o.a.n.n.URLExemptionFilters [LocalJobRunner Map Task Executor 
#0] Found 0 extensions at point:'org.apache.nutch.net.URLExemptionFilter'
{quote}
I find this confusing. It would be better to log *only* if an 
URLExemptionFilter implementation actually exists for a given URL.

I will provide a patch for this.

  was:
URLExemptionFilter implementations are used to allow exemptions to external 
domain resources by overriding the {{db.ignore.external.links}} configuration 
setting. This is useful when the crawl is focused on a domain but resources 
like images are hosted on a CDN.

Currently 
[URLExemptionFilters|[https://github.com/apache/nutch/blob/271f92e11c39b7a3583cfcd8d664262cfac59674/src/java/org/apache/nutch/net/URLExemptionFilters.java#L47-L48]]
 provides some confusing INFO-level logging
{quote}INFO o.a.n.n.URLExemptionFilters [LocalJobRunner Map Task Executor #0] 
Found 0 extensions at point:'org.apache.nutch.net.URLExemptionFilter'
{quote}
I find this confusing. It would be better to log *only* if a 
URLExemptionFilter implementation actually exists for a given URL.

I will provide a patch for this.


> Address confusing logging in o.a.n.net.URLExemptionFilters 
> ---
>
> Key: NUTCH-3041
> URL: https://issues.apache.org/jira/browse/NUTCH-3041
> Project: Nutch
>  Issue Type: Task
>  Components: net
>Affects Versions: 1.19, 1.20
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Minor
> Fix For: 1.21
>
>
> URLExemptionFilter implementations are used to allow exemptions to external 
> domain resources by overriding the {{db.ignore.external.links}} configuration 
> setting. This is useful when the crawl is focused on a domain but resources 
> like images are hosted on a CDN.
> Currently [URLExemptionFilters|#L47-L48]] provides the following logging
> {quote}INFO o.a.n.n.URLExemptionFilters [LocalJobRunner Map Task Executor #0] 
> Found 0 extensions at point:'org.apache.nutch.net.URLExemptionFilter'
> {quote}
> I find this confusing. It would be better to log *only* if a 
> URLExemptionFilter implementation actually exists for a given URL.
> I will provide a patch for this.





[jira] [Created] (NUTCH-3041) Address confusing logging in o.a.n.net.URLExemptionFilters

2024-04-19 Thread Lewis John McGibbney (Jira)
Lewis John McGibbney created NUTCH-3041:
---

 Summary: Address confusing logging in 
o.a.n.net.URLExemptionFilters 
 Key: NUTCH-3041
 URL: https://issues.apache.org/jira/browse/NUTCH-3041
 Project: Nutch
  Issue Type: Task
  Components: net
Affects Versions: 1.19, 1.20
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
 Fix For: 1.21


URLExemptionFilter implementations are used to allow exemptions to external 
domain resources by overriding the {{db.ignore.external.links}} configuration 
setting. This is useful when the crawl is focused on a domain but resources 
like images are hosted on a CDN.

Currently 
[URLExemptionFilters|[https://github.com/apache/nutch/blob/271f92e11c39b7a3583cfcd8d664262cfac59674/src/java/org/apache/nutch/net/URLExemptionFilters.java#L47-L48]]
 provides some confusing INFO-level logging
{quote}INFO o.a.n.n.URLExemptionFilters [LocalJobRunner Map Task Executor #0] 
Found 0 extensions at point:'org.apache.nutch.net.URLExemptionFilter'
{quote}
I find this confusing. It would be better to log *only* if a 
URLExemptionFilter implementation actually exists for a given URL.

I will provide a patch for this.
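
A minimal sketch of the intended behaviour, not the patch itself. The class 
ExemptionFilterLogSketch, the method activationMessage and the filter name 
ExampleExemptionFilter are all hypothetical, and the way Nutch loads filters 
through its plugin system is not modelled here. The idea is simply to build a 
log message only when at least one filter is active, and to stay silent 
otherwise.

```java
import java.util.List;
import java.util.Optional;

/**
 * Illustrative stand-in only; this is NOT the Nutch URLExemptionFilters class.
 * Filters are represented by plain names to keep the sketch self-contained.
 */
final class ExemptionFilterLogSketch {

  /**
   * Builds the message to log, or returns empty when no filter is active so
   * that the caller logs nothing at all.
   */
  static Optional<String> activationMessage(List<String> activeFilterNames) {
    if (activeFilterNames.isEmpty()) {
      return Optional.empty(); // nothing configured: stay silent
    }
    return Optional.of("Found " + activeFilterNames.size()
        + " URLExemptionFilter implementation(s): "
        + String.join(", ", activeFilterNames));
  }

  public static void main(String[] args) {
    // No filters active: nothing is printed.
    activationMessage(List.of()).ifPresent(System.out::println);
    // One (hypothetical) filter active: a single line is printed.
    activationMessage(List.of("ExampleExemptionFilter"))
        .ifPresent(System.out::println);
  }
}
```

Returning an Optional keeps the decision of whether to log separate from the 
logging call, which makes the "log nothing when nothing is configured" rule 
easy to check in a unit test.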





[jira] [Closed] (COMDEV-544) Improve comdev website navigation to GSoC mentor resources

2024-04-18 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/COMDEV-544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney closed COMDEV-544.
---

> Improve comdev website navigation to GSoC mentor resources
> --
>
> Key: COMDEV-544
> URL: https://issues.apache.org/jira/browse/COMDEV-544
> Project: Community Development
>  Issue Type: Task
>  Components: Website
>Reporter: Lewis John McGibbney
>Priority: Minor
>
> h1. Purpose
> Improve comdev website navigation to Google Summer of Code (GSoC) mentor 
> resources.
> h1. Context
> Having been ‘away’ for a few years, this year I decided to make an attempt to 
> re-engage with the GSoC program.
> I quickly realized that I was totally out of touch having absolutely no idea 
> where the mentor community conversations were happening (they happen on 
> ment...@community.apache.org) and being hopelessly unable to locate GSoC 
> mentoring documentation via the comdev website. 
> Thankfully [~sanyam] [pointed me at the 
> documentation|[https://lists.apache.org/thread/dqmrwzjogl3sdb2v8s36v8mxf5o1yqsj]]
>  and I was able to get back up to speed. Thank you Sanyam :)
> h1. Challenges
> Looking at [https://community.apache.org/gsoc/], as of writing, although 
> loads of content exists for students (which is great) no navigation exists to 
> mentor resources. 
> In my case, this meant that I couldn’t find and entirely missed the excellent 
> content available at 
> [https://community.apache.org/gsoc/guide-to-being-a-mentor.html].
> h1. Proposal
> I think that a “{*}Mentors: read this{*}” Section should be added to 
> [https://community.apache.org/gsoc/] which simply hyperlinks to the relevant 
> content from above. 




-
To unsubscribe, e-mail: dev-unsubscr...@community.apache.org
For additional commands, e-mail: dev-h...@community.apache.org



[jira] [Resolved] (COMDEV-544) Improve comdev website navigation to GSoC mentor resources

2024-04-18 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/COMDEV-544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney resolved COMDEV-544.
-
Resolution: Fixed

Thanks [~rbowen] for merging.

> Improve comdev website navigation to GSoC mentor resources
> --
>
> Key: COMDEV-544
> URL: https://issues.apache.org/jira/browse/COMDEV-544
> Project: Community Development
>  Issue Type: Task
>  Components: Website
>Reporter: Lewis John McGibbney
>Priority: Minor
>
> h1. Purpose
> Improve comdev website navigation to Google Summer of Code (GSoC) mentor 
> resources.
> h1. Context
> Having been ‘away’ for a few years, this year I decided to make an attempt to 
> re-engage with the GSoC program.
> I quickly realized that I was totally out of touch having absolutely no idea 
> where the mentor community conversations were happening (they happen on 
> ment...@community.apache.org) and being hopelessly unable to locate GSoC 
> mentoring documentation via the comdev website. 
> Thankfully [~sanyam] [pointed me at the 
> documentation|[https://lists.apache.org/thread/dqmrwzjogl3sdb2v8s36v8mxf5o1yqsj]]
>  and I was able to get back up to speed. Thank you Sanyam :)
> h1. Challenges
> Looking at [https://community.apache.org/gsoc/], as of writing, although 
> loads of content exists for students (which is great) no navigation exists to 
> mentor resources. 
> In my case, this meant that I couldn’t find and entirely missed the excellent 
> content available at 
> [https://community.apache.org/gsoc/guide-to-being-a-mentor.html].
> h1. Proposal
> I think that a “{*}Mentors: read this{*}” Section should be added to 
> [https://community.apache.org/gsoc/] which simply hyperlinks to the relevant 
> content from above. 






[jira] [Commented] (COMDEV-544) Improve comdev website navigation to GSoC mentor resources

2024-04-18 Thread Lewis John McGibbney (Jira)


[ 
https://issues.apache.org/jira/browse/COMDEV-544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17838694#comment-17838694
 ] 

Lewis John McGibbney commented on COMDEV-544:
-

[~sebb] thank you, I was on a mobile device and actually missed the top 
navigation.

> Improve comdev website navigation to GSoC mentor resources
> --
>
> Key: COMDEV-544
> URL: https://issues.apache.org/jira/browse/COMDEV-544
> Project: Community Development
>  Issue Type: Task
>  Components: Website
>Reporter: Lewis John McGibbney
>Priority: Minor
>
> h1. Purpose
> Improve comdev website navigation to Google Summer of Code (GSoC) mentor 
> resources.
> h1. Context
> Having been ‘away’ for a few years, this year I decided to make an attempt to 
> re-engage with the GSoC program.
> I quickly realized that I was totally out of touch having absolutely no idea 
> where the mentor community conversations were happening (they happen on 
> ment...@community.apache.org) and being hopelessly unable to locate GSoC 
> mentoring documentation via the comdev website. 
> Thankfully [~sanyam] [pointed me at the 
> documentation|[https://lists.apache.org/thread/dqmrwzjogl3sdb2v8s36v8mxf5o1yqsj]]
>  and I was able to get back up to speed. Thank you Sanyam :)
> h1. Challenges
> Looking at [https://community.apache.org/gsoc/], as of writing, although 
> loads of content exists for students (which is great) no navigation exists to 
> mentor resources. 
> In my case, this meant that I couldn’t find and entirely missed the excellent 
> content available at 
> [https://community.apache.org/gsoc/guide-to-being-a-mentor.html].
> h1. Proposal
> I think that a “{*}Mentors: read this{*}” Section should be added to 
> [https://community.apache.org/gsoc/] which simply hyperlinks to the relevant 
> content from above. 






[jira] [Updated] (COMDEV-544) Improve comdev website navigation to GSoC mentor resources

2024-04-18 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/COMDEV-544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated COMDEV-544:

Description: 
h1. Purpose

Improve comdev website navigation to Google Summer of Code (GSoC) mentor 
resources.
h1. Context

Having been ‘away’ for a few years, this year I decided to make an attempt to 
re-engage with the GSoC program.

I quickly realized that I was totally out of touch having absolutely no idea 
where the mentor community conversations were happening (they happen on 
ment...@community.apache.org) and being hopelessly unable to locate GSoC 
mentoring documentation via the comdev website. 

Thankfully [~sanyam] [pointed me at the 
documentation|[https://lists.apache.org/thread/dqmrwzjogl3sdb2v8s36v8mxf5o1yqsj]]
 and I was able to get back up to speed. Thank you Sanyam :)
h1. Challenges

Looking at [https://community.apache.org/gsoc/], as of writing, although loads 
of content exists for students (which is great) no navigation exists to mentor 
resources. 

In my case, this meant that I couldn’t find and entirely missed the excellent 
content available at 
[https://community.apache.org/gsoc/guide-to-being-a-mentor.html].
h1. Proposal

I think that a “{*}Mentors: read this{*}” Section should be added to 
[https://community.apache.org/gsoc/] which simply hyperlinks to the relevant 
content from above. 

  was:
h1. Purpose

Improve comdev website navigation to Google Summer of Code (GSoC) mentor 
resources.
h1. Context

Having been ‘away’ for a few years, this year I decided to make an attempt to 
re-engage with the GSoC program.

I quickly realized that I was totally out of touch having absolutely no idea 
where the mentor community conversations were happening (they happen on 
ment...@community.apache.org) and being hopelessly unable to locate GSoC 
mentoring documentation via the comdev website. 

Thankfully [~sanyam] [pointed me at the 
documentation|[https://lists.apache.org/thread/dqmrwzjogl3sdb2v8s36v8mxf5o1yqsj]]
 and I was able to get back up to speed. Thank you Sanyam :)
h1. Challenges

Looking at [https://community.apache.org/gsoc/], as of writing, although loads 
of content exists for students (which is great) no navigation exists to mentor 
resources. 

In my case, this meant that I couldn’t find and entirely missed the excellent 
content available at [https://community.apache.org/mentoring/].
h1. Proposal

I think that a “{*}Mentors: read this{*}” Section should be added to 
[https://community.apache.org/gsoc/] which simply hyperlinks to the relevant 
content from above. 


> Improve comdev website navigation to GSoC mentor resources
> --
>
> Key: COMDEV-544
> URL: https://issues.apache.org/jira/browse/COMDEV-544
> Project: Community Development
>  Issue Type: Task
>  Components: Website
>Reporter: Lewis John McGibbney
>Priority: Minor
>
> h1. Purpose
> Improve comdev website navigation to Google Summer of Code (GSoC) mentor 
> resources.
> h1. Context
> Having been ‘away’ for a few years, this year I decided to make an attempt to 
> re-engage with the GSoC program.
> I quickly realized that I was totally out of touch having absolutely no idea 
> where the mentor community conversations were happening (they happen on 
> ment...@community.apache.org) and being hopelessly unable to locate GSoC 
> mentoring documentation via the comdev website. 
> Thankfully [~sanyam] [pointed me at the 
> documentation|[https://lists.apache.org/thread/dqmrwzjogl3sdb2v8s36v8mxf5o1yqsj]]
>  and I was able to get back up to speed. Thank you Sanyam :)
> h1. Challenges
> Looking at [https://community.apache.org/gsoc/], as of writing, although 
> loads of content exists for students (which is great) no navigation exists to 
> mentor resources. 
> In my case, this meant that I couldn’t find and entirely missed the excellent 
> content available at 
> [https://community.apache.org/gsoc/guide-to-being-a-mentor.html].
> h1. Proposal
> I think that a “{*}Mentors: read this{*}” Section should be added to 
> [https://community.apache.org/gsoc/] which simply hyperlinks to the relevant 
> content from above. 






[jira] [Commented] (COMDEV-544) Improve comdev website navigation to GSoC mentor resources

2024-04-18 Thread Lewis John McGibbney (Jira)


[ 
https://issues.apache.org/jira/browse/COMDEV-544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17838692#comment-17838692
 ] 

Lewis John McGibbney commented on COMDEV-544:
-

Thank you both.

> Improve comdev website navigation to GSoC mentor resources
> --
>
> Key: COMDEV-544
> URL: https://issues.apache.org/jira/browse/COMDEV-544
> Project: Community Development
>  Issue Type: Task
>  Components: Website
>Reporter: Lewis John McGibbney
>Priority: Minor
>
> h1. Purpose
> Improve comdev website navigation to Google Summer of Code (GSoC) mentor 
> resources.
> h1. Context
> Having been ‘away’ for a few years, this year I decided to make an attempt to 
> re-engage with the GSoC program.
> I quickly realized that I was totally out of touch having absolutely no idea 
> where the mentor community conversations were happening (they happen on 
> ment...@community.apache.org) and being hopelessly unable to locate GSoC 
> mentoring documentation via the comdev website. 
> Thankfully [~sanyam] [pointed me at the 
> documentation|[https://lists.apache.org/thread/dqmrwzjogl3sdb2v8s36v8mxf5o1yqsj]]
>  and I was able to get back up to speed. Thank you Sanyam :)
> h1. Challenges
> Looking at [https://community.apache.org/gsoc/], as of writing, although 
> loads of content exists for students (which is great) no navigation exists to 
> mentor resources. 
> In my case, this meant that I couldn’t find and entirely missed the excellent 
> content available at [https://community.apache.org/mentoring/].
> h1. Proposal
> I think that a “{*}Mentors: read this{*}” Section should be added to 
> [https://community.apache.org/gsoc/] which simply hyperlinks to the relevant 
> content from above. 






[jira] [Created] (COMDEV-544) Improve comdev website navigation to GSoC mentor resources

2024-04-18 Thread Lewis John McGibbney (Jira)
Lewis John McGibbney created COMDEV-544:
---

 Summary: Improve comdev website navigation to GSoC mentor resources
 Key: COMDEV-544
 URL: https://issues.apache.org/jira/browse/COMDEV-544
 Project: Community Development
  Issue Type: Task
  Components: Website
Reporter: Lewis John McGibbney


h1. Purpose

Improve comdev website navigation to Google Summer of Code (GSoC) mentor 
resources.
h1. Context

Having been ‘away’ for a few years, this year I decided to make an attempt to 
re-engage with the GSoC program.

I quickly realized that I was totally out of touch having absolutely no idea 
where the mentor community conversations were happening (they happen on 
ment...@community.apache.org) and being hopelessly unable to locate GSoC 
mentoring documentation via the comdev website. 

Thankfully [~sanyam] [pointed me at the 
documentation|[https://lists.apache.org/thread/dqmrwzjogl3sdb2v8s36v8mxf5o1yqsj]]
 and I was able to get back up to speed. Thank you Sanyam :)
h1. Challenges

Looking at [https://community.apache.org/gsoc/], as of writing, although loads 
of content exists for students (which is great) no navigation exists to mentor 
resources. 

In my case, this meant that I couldn’t find and entirely missed the excellent 
content available at [https://community.apache.org/mentoring/].
h1. Proposal

I think that a “{*}Mentors: read this{*}” Section should be added to 
[https://community.apache.org/gsoc/] which simply hyperlinks to the relevant 
content from above. 






[jira] [Resolved] (NUTCH-3038) Address issues discovered during 1.20 release management dryrun

2024-04-08 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney resolved NUTCH-3038.
-
Resolution: Fixed

> Address issues discovered during 1.20 release management dryrun
> ---
>
> Key: NUTCH-3038
> URL: https://issues.apache.org/jira/browse/NUTCH-3038
> Project: Nutch
>  Issue Type: Task
>  Components: build, docker
>Affects Versions: 1.20
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Blocker
> Fix For: 1.20
>
>
> During the 1.20 release management dryrun I discovered the following issues 
> which I think should be addressed in order to be satisfied with the release 
> candidate
>  # Update docker/README to remove broken badge
>  # Upgrade alpine base image in docker/Dockerfile
>  # Migrate CHANGES.txt to CHANGES.md
>  # Upgrade apache parent pom version from 23 to 31
>  # Upgrade maven-gpg-plugin dependency from 1.6 to 3.2.2 in build.xml
>  # Upgrade maven-compiler-plugin version from 3.8.1 to 3.13.0 in 
> ivy/mvn.template
>  # Remove miredot plugin usage from ivy/mvn.template





[jira] [Closed] (NUTCH-3038) Address issues discovered during 1.20 release management dryrun

2024-04-08 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney closed NUTCH-3038.
---

Thanks [~snagel] 

> Address issues discovered during 1.20 release management dryrun
> ---
>
> Key: NUTCH-3038
> URL: https://issues.apache.org/jira/browse/NUTCH-3038
> Project: Nutch
>  Issue Type: Task
>  Components: build, docker
>Affects Versions: 1.20
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Blocker
> Fix For: 1.20
>
>
> During the 1.20 release management dryrun I discovered the following issues 
> which I think should be addressed in order to be satisfied with the release 
> candidate
>  # Update docker/README to remove broken badge
>  # Upgrade alpine base image in docker/Dockerfile
>  # Migrate CHANGES.txt to CHANGES.md
>  # Upgrade apache parent pom version from 23 to 31
>  # Upgrade maven-gpg-plugin dependency from 1.6 to 3.2.2 in build.xml
>  # Upgrade maven-compiler-plugin version from 3.8.1 to 3.13.0 in 
> ivy/mvn.template
>  # Remove miredot plugin usage from ivy/mvn.template





[jira] [Work stopped] (NUTCH-3038) Address issues discovered during 1.20 release management dryrun

2024-04-08 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on NUTCH-3038 stopped by Lewis John McGibbney.
---
> Address issues discovered during 1.20 release management dryrun
> ---
>
> Key: NUTCH-3038
> URL: https://issues.apache.org/jira/browse/NUTCH-3038
> Project: Nutch
>  Issue Type: Task
>  Components: build, docker
>Affects Versions: 1.20
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Blocker
> Fix For: 1.20
>
>
> During the 1.20 release management dryrun I discovered the following issues 
> which I think should be addressed in order to be satisfied with the release 
> candidate
>  # Update docker/README to remove broken badge
>  # Upgrade alpine base image in docker/Dockerfile
>  # Migrate CHANGES.txt to CHANGES.md
>  # Upgrade apache parent pom version from 23 to 31
>  # Upgrade maven-gpg-plugin dependency from 1.6 to 3.2.2 in build.xml
>  # Upgrade maven-compiler-plugin version from 3.8.1 to 3.13.0 in 
> ivy/mvn.template
>  # Remove miredot plugin usage from ivy/mvn.template





[jira] [Commented] (TIKA-4232) Create and execute unit tests for tika-helm

2024-04-08 Thread Lewis John McGibbney (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17835077#comment-17835077
 ] 

Lewis John McGibbney commented on TIKA-4232:


It turns out that the original GitHub Action I wanted to use will not be 
approved for use. 

I’m therefore investigating using 
[https://github.com/marketplace/actions/docker-run-action] to run the tests in 
the {{helmunittest/helm-unittest}} Docker image and generate the JUnit report, 
and then using [https://github.com/marketplace/actions/junit-report-action] 
to report the test results on the PR. 

I’ll investigate further and follow up here. 

> Create and execute unit tests for tika-helm
> ---
>
> Key: TIKA-4232
> URL: https://issues.apache.org/jira/browse/TIKA-4232
> Project: Tika
>  Issue Type: Improvement
>  Components: tika-helm
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Major
>
> The goal is to execute chart unit tests against each tika-helm pull request.
> I found the [Helm Unit 
> Tests|[https://github.com/marketplace/actions/helm-unit-tests]] GitHub Action 
> which uses [https://github.com/helm-unittest/helm-unittest] as a Helm plugin.
> The PR will consist of one or more unit tests automated via the GitHub action.





[jira] [Work started] (NUTCH-3038) Address issues discovered during 1.20 release management dryrun

2024-04-05 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on NUTCH-3038 started by Lewis John McGibbney.
---
> Address issues discovered during 1.20 release management dryrun
> ---
>
> Key: NUTCH-3038
> URL: https://issues.apache.org/jira/browse/NUTCH-3038
> Project: Nutch
>  Issue Type: Task
>  Components: build, docker
>Affects Versions: 1.20
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Blocker
> Fix For: 1.20
>
>
> During the 1.20 release management dryrun I discovered the following issues 
> which I think should be addressed in order to be satisfied with the release 
> candidate
>  # Update docker/README to remove broken badge
>  # Upgrade alpine base image in docker/Dockerfile
>  # Migrate CHANGES.txt to CHANGES.md
>  # Upgrade apache parent pom version from 23 to 31
>  # Upgrade maven-gpg-plugin dependency from 1.6 to 3.2.2 in build.xml
>  # Upgrade maven-compiler-plugin version from 3.8.1 to 3.13.0 in 
> ivy/mvn.template
>  # Remove miredot plugin usage from ivy/mvn.template





[jira] [Updated] (NUTCH-3038) Address issues discovered during 1.20 release management dryrun

2024-04-05 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-3038:

Description: 
During the 1.20 release management dryrun I discovered the following issues 
which I think should be addressed in order to be satisfied with the release 
candidate
 # Update docker/README to remove broken badge
 # Upgrade alpine base image in docker/Dockerfile
 # Migrate CHANGES.txt to CHANGES.md
 # Upgrade apache parent pom version from 23 to 31
 # Upgrade maven-gpg-plugin dependency from 1.6 to 3.2.2 in build.xml
 # Upgrade maven-compiler-plugin version from 3.8.1 to 3.13.0 in 
ivy/mvn.template
 # Remove miredot plugin usage from ivy/mvn.template

  was:
During the 1.20 release management dryrun I discovered the following issues 
which I think should be addressed in order to be satisfied with the release 
candidate
 # Update docker/README to remove broken badge
 # Upgrade alpine base image in docker/Dockerfile
 # Migrate CHANGES.txt to CHANGES.md
 # Upgrade maven-gpg-plugin dependency from 1.6 to 3.2.2 in build.xml
 # Upgrade maven-compiler-plugin version from 3.8.1 to 3.13.0 in 
ivy/mvn.template
 # Remove miredot plugin usage from ivy/mvn.template


> Address issues discovered during 1.20 release management dryrun
> ---
>
> Key: NUTCH-3038
> URL: https://issues.apache.org/jira/browse/NUTCH-3038
> Project: Nutch
>  Issue Type: Task
>  Components: build, docker
>Affects Versions: 1.20
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Blocker
> Fix For: 1.20
>
>
> During the 1.20 release management dryrun I discovered the following issues 
> which I think should be addressed in order to be satisfied with the release 
> candidate
>  # Update docker/README to remove broken badge
>  # Upgrade alpine base image in docker/Dockerfile
>  # Migrate CHANGES.txt to CHANGES.md
>  # Upgrade apache parent pom version from 23 to 31
>  # Upgrade maven-gpg-plugin dependency from 1.6 to 3.2.2 in build.xml
>  # Upgrade maven-compiler-plugin version from 3.8.1 to 3.13.0 in 
> ivy/mvn.template
>  # Remove miredot plugin usage from ivy/mvn.template





[jira] [Created] (NUTCH-3038) Address issues discovered during 1.20 release management dryrun

2024-04-05 Thread Lewis John McGibbney (Jira)
Lewis John McGibbney created NUTCH-3038:
---

 Summary: Address issues discovered during 1.20 release management 
dryrun
 Key: NUTCH-3038
 URL: https://issues.apache.org/jira/browse/NUTCH-3038
 Project: Nutch
  Issue Type: Task
  Components: build, docker
Affects Versions: 1.20
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
 Fix For: 1.20


During the 1.20 release management dryrun I discovered the following issues 
which I think should be addressed in order to be satisfied with the release 
candidate
 # Update docker/README to remove broken badge
 # Upgrade alpine base image in docker/Dockerfile
 # Migrate CHANGES.txt to CHANGES.md
 # Upgrade maven-gpg-plugin dependency from 1.6 to 3.2.2 in build.xml
 # Upgrade maven-compiler-plugin version from 3.8.1 to 3.13.0 in 
ivy/mvn.template
 # Remove miredot plugin usage from ivy/mvn.template





[jira] [Closed] (NUTCH-3032) Indexing plugin as an adapter for end user's own POJO instances

2024-04-04 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney closed NUTCH-3032.
---

Thanks [~jglvary] and congratulations on your first contribution to Apache 
Nutch :)

> Indexing plugin as an adapter for end user's own POJO instances
> ---
>
> Key: NUTCH-3032
> URL: https://issues.apache.org/jira/browse/NUTCH-3032
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer
>Reporter: Joe Gilvary
>Assignee: Joe Gilvary
>Priority: Major
>  Labels: indexing
> Fix For: 1.20
>
> Attachments: NUTCH-3032.patch
>
>
> It could be helpful to let end users manipulate information at indexing time 
> with their own code without the need for writing their own indexing plugin. I 
> mentioned this on the dev mailing list 
> (https://www.mail-archive.com/dev@nutch.apache.org/msg31190.html) with some 
> description of my work in progress.
> One potential use is to address some of the same concerns that NUTCH-585 
> discusses regarding an alternative approach to picking and choosing which 
> content to index, but this approach would allow making index time decisions, 
> rather than setting the configuration for all content at the start of the 
> indexing run.
>  





[jira] [Created] (TIKA-4233) Check tika-helm for deprecated k8s APIs

2024-03-30 Thread Lewis John McGibbney (Jira)
Lewis John McGibbney created TIKA-4233:
--

 Summary: Check tika-helm for deprecated k8s APIs
 Key: TIKA-4233
 URL: https://issues.apache.org/jira/browse/TIKA-4233
 Project: Tika
  Issue Type: New Feature
  Components: tika-helm
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
 Fix For: 2.9.2


It is useful to know when a Helm Chart uses deprecated k8s APIs. A check for 
this would be ideal. The “Check deprecated k8s APIs” GitHub action accomplishes 
this.

[https://github.com/marketplace/actions/check-deprecated-k8s-apis]





[jira] [Created] (TIKA-4232) Create and execute unit tests for tika-helm

2024-03-30 Thread Lewis John McGibbney (Jira)
Lewis John McGibbney created TIKA-4232:
--

 Summary: Create and execute unit tests for tika-helm
 Key: TIKA-4232
 URL: https://issues.apache.org/jira/browse/TIKA-4232
 Project: Tika
  Issue Type: Improvement
  Components: tika-helm
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
 Fix For: 2.9.2


The goal is to execute chart unit tests against each tika-helm pull request.

I found the [Helm Unit 
Tests|[https://github.com/marketplace/actions/helm-unit-tests]] GitHub Action 
which uses [https://github.com/helm-unittest/helm-unittest] as a Helm plugin.

The PR will consist of one or more unit tests automated via the GitHub action.





[jira] [Commented] (TIKA-4227) Register tika-helm Chart in artifacthub.io

2024-03-30 Thread Lewis John McGibbney (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17832505#comment-17832505
 ] 

Lewis John McGibbney commented on TIKA-4227:


Available at [https://artifacthub.io/packages/helm/apache-tika/tika]

> Register tika-helm Chart in artifacthub.io
> --
>
> Key: TIKA-4227
> URL: https://issues.apache.org/jira/browse/TIKA-4227
> Project: Tika
>  Issue Type: Task
>  Components: tika-helm
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Minor
> Fix For: 2.9.2
>
>
> [https://artifacthub.io/] represents the most popular search interface for 
> (amongst lots of other artifacts) Helm Charts.
> This task will register the tika-helm Chart with [https://artifacthub.io/].





[jira] [Resolved] (TIKA-4227) Register tika-helm Chart in artifacthub.io

2024-03-30 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney resolved TIKA-4227.

Resolution: Fixed

> Register tika-helm Chart in artifacthub.io
> --
>
> Key: TIKA-4227
> URL: https://issues.apache.org/jira/browse/TIKA-4227
> Project: Tika
>  Issue Type: Task
>  Components: tika-helm
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Minor
> Fix For: 2.9.2
>
>
> [https://artifacthub.io/] represents the most popular search interface for 
> (amongst lots of other artifacts) Helm Charts.
> This task will register the tika-helm Chart with [https://artifacthub.io/].





[jira] [Closed] (TIKA-4227) Register tika-helm Chart in artifacthub.io

2024-03-30 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney closed TIKA-4227.
--

> Register tika-helm Chart in artifacthub.io
> --
>
> Key: TIKA-4227
> URL: https://issues.apache.org/jira/browse/TIKA-4227
> Project: Tika
>  Issue Type: Task
>  Components: tika-helm
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Minor
> Fix For: 2.9.2
>
>
> [https://artifacthub.io/] represents the most popular search interface for 
> (amongst lots of other artifacts) Helm Charts.
> This task will register the tika-helm Chart with [https://artifacthub.io/].





[jira] [Updated] (NUTCH-3032) Indexing plugin as an adapter for end user's own POJO instances

2024-03-30 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-3032:

Fix Version/s: 1.20

> Indexing plugin as an adapter for end user's own POJO instances
> ---
>
> Key: NUTCH-3032
> URL: https://issues.apache.org/jira/browse/NUTCH-3032
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer
>Reporter: Joe Gilvary
>Assignee: Joe Gilvary
>Priority: Major
>  Labels: indexing
> Fix For: 1.20
>
> Attachments: NUTCH-3032.patch
>
>
> It could be helpful to let end users manipulate information at indexing time 
> with their own code without the need for writing their own indexing plugin. I 
> mentioned this on the dev mailing list 
> (https://www.mail-archive.com/dev@nutch.apache.org/msg31190.html) with some 
> description of my work in progress.
> One potential use is to address some of the same concerns that NUTCH-585 
> discusses regarding an alternative approach to picking and choosing which 
> content to index, but this approach would allow making index time decisions, 
> rather than setting the configuration for all content at the start of the 
> indexing run.
>  





[jira] [Assigned] (NUTCH-3032) Indexing plugin as an adapter for end user's own POJO instances

2024-03-30 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney reassigned NUTCH-3032:
---

Assignee: Joe Gilvary

> Indexing plugin as an adapter for end user's own POJO instances
> ---
>
> Key: NUTCH-3032
> URL: https://issues.apache.org/jira/browse/NUTCH-3032
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer
>Reporter: Joe Gilvary
>Assignee: Joe Gilvary
>Priority: Major
>  Labels: indexing
> Attachments: NUTCH-3032.patch
>
>
> It could be helpful to let end users manipulate information at indexing time 
> with their own code without the need for writing their own indexing plugin. I 
> mentioned this on the dev mailing list 
> (https://www.mail-archive.com/dev@nutch.apache.org/msg31190.html) with some 
> description of my work in progress.
> One potential use is to address some of the same concerns that NUTCH-585 
> discusses regarding an alternative approach to picking and choosing which 
> content to index, but this approach would allow making index time decisions, 
> rather than setting the configuration for all content at the start of the 
> indexing run.
>  





[jira] [Work stopped] (NUTCH-2856) Implement a protocol-smb plugin based on hierynomus/smbj

2024-03-30 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on NUTCH-2856 stopped by Lewis John McGibbney.
---
> Implement a protocol-smb plugin based on hierynomus/smbj
> 
>
> Key: NUTCH-2856
> URL: https://issues.apache.org/jira/browse/NUTCH-2856
> Project: Nutch
>  Issue Type: New Feature
>  Components: external, plugin, protocol
>Reporter: Hiran Chaudhuri
>Assignee: Lewis John McGibbney
>Priority: Major
> Fix For: 1.20
>
>
> The plugin protocol-smb advertised on 
> [https://cwiki.apache.org/confluence/display/NUTCH/PluginCentral] actually 
> refers to the JCIFS library. According to this library's homepage 
> [https://www.jcifs.org/]:
> _If you're looking for the latest and greatest open source Java SMB library, 
> this is not it. JCIFS has been in maintenance-mode-only for several years and 
> although what it does support works fine (SMB1, NTLMv2, midlc, MSRPC and 
> various utility classes), jCIFS does not support the newer SMB2/3 variants of 
> the SMB protocol which is slowly becoming required (Windows 10 requires 
> SMB2/3). JCIFS only supports SMB1 but Microsoft has deprecated SMB1 in their 
> products. *So if SMB1 is disabled on your network, JCIFS' file related 
> operations will NOT work.*_
> Looking at 
> [https://en.wikipedia.org/wiki/Server_Message_Block#SMB_/_CIFS_/_SMB1]:
> _Microsoft added SMB1 to the Windows Server 2012 R2 deprecation list in June 
> 2013. Windows Server 2016 and some versions of Windows 10 Fall Creators 
> Update do not have SMB1 installed by default._
> In conclusion, the chances that the SMB1 protocol is installed and/or 
> configured are shrinking rapidly. Therefore some migration towards 
> SMB2/3 is required. Luckily, the JCIFS homepage lists alternatives:
>  * [jcifs-codelibs|https://github.com/codelibs/jcifs]
>  * [jcifs-ng|https://github.com/AgNO3/jcifs-ng]
>  * [smbj|https://github.com/hierynomus/smbj]





[jira] [Work stopped] (NUTCH-2887) Migrate to JUnit 5 Jupiter

2024-03-30 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on NUTCH-2887 stopped by Lewis John McGibbney.
---
> Migrate to JUnit 5 Jupiter
> --
>
> Key: NUTCH-2887
> URL: https://issues.apache.org/jira/browse/NUTCH-2887
> Project: Nutch
>  Issue Type: Improvement
>  Components: test
> Environment: Migrate 
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Major
> Fix For: 1.20
>
>
> This effort is a bit of a beast. See the [JUnit migration 
> tips|https://junit.org/junit5/docs/current/user-guide/#migrating-from-junit4-tips]
>  for general guidance. A general grep for junit in src produces the following
> {code:bash}
> ./test/nutch-site.xml
> ./test/org/apache/nutch/tools/TestCommonCrawlDataDumper.java
> ./test/org/apache/nutch/net/TestURLNormalizers.java
> ./test/org/apache/nutch/net/protocols/TestHttpDateFormat.java
> ./test/org/apache/nutch/net/TestURLFilters.java
> ./test/org/apache/nutch/util/TestStringUtil.java
> ./test/org/apache/nutch/util/TestSuffixStringMatcher.java
> ./test/org/apache/nutch/util/TestEncodingDetector.java
> ./test/org/apache/nutch/util/TestMimeUtil.java
> ./test/org/apache/nutch/util/TestPrefixStringMatcher.java
> ./test/org/apache/nutch/util/DumpFileUtilTest.java
> ./test/org/apache/nutch/util/TestNodeWalker.java
> ./test/org/apache/nutch/util/WritableTestUtils.java
> ./test/org/apache/nutch/util/TestTableUtil.java
> ./test/org/apache/nutch/util/TestURLUtil.java
> ./test/org/apache/nutch/util/TestGZIPUtils.java
> ./test/org/apache/nutch/parse/TestParseText.java
> ./test/org/apache/nutch/parse/TestOutlinks.java
> ./test/org/apache/nutch/parse/TestParseData.java
> ./test/org/apache/nutch/parse/TestOutlinkExtractor.java
> ./test/org/apache/nutch/parse/TestParserFactory.java
> ./test/org/apache/nutch/segment/TestSegmentMerger.java
> ./test/org/apache/nutch/segment/TestSegmentMergerCrawlDatums.java
> ./test/org/apache/nutch/plugin/TestPluginSystem.java
> ./test/org/apache/nutch/fetcher/TestFetcher.java
> ./test/org/apache/nutch/protocol/TestProtocolFactory.java
> ./test/org/apache/nutch/protocol/TestContent.java
> ./test/org/apache/nutch/protocol/AbstractHttpProtocolPluginTest.java
> ./test/org/apache/nutch/crawl/TestCrawlDbFilter.java
> ./test/org/apache/nutch/crawl/TestTextProfileSignature.java
> ./test/org/apache/nutch/crawl/TestCrawlDbStates.java
> ./test/org/apache/nutch/crawl/TestGenerator.java
> ./test/org/apache/nutch/crawl/TestAdaptiveFetchSchedule.java
> ./test/org/apache/nutch/crawl/TODOTestCrawlDbStates.java
> ./test/org/apache/nutch/crawl/TestSignatureFactory.java
> ./test/org/apache/nutch/crawl/ContinuousCrawlTestUtil.java
> ./test/org/apache/nutch/crawl/TestInjector.java
> ./test/org/apache/nutch/crawl/TestLinkDbMerger.java
> ./test/org/apache/nutch/crawl/TestCrawlDbMerger.java
> ./test/org/apache/nutch/service/TestNutchServer.java
> ./test/org/apache/nutch/metadata/TestMetadata.java
> ./test/org/apache/nutch/metadata/TestSpellCheckedMetadata.java
> ./test/org/apache/nutch/indexer/TestIndexingFilters.java
> ./test/org/apache/nutch/indexer/TestIndexerMapReduce.java
> ./bin/nutch
> ./plugin/scoring-orphan/src/test/org/apache/nutch/scoring/orphan/TestOrphanScoringFilter.java
> ./plugin/index-basic/src/test/org/apache/nutch/indexer/basic/TestBasicIndexingFilter.java
> ./plugin/urlfilter-domaindenylist/build.xml
> ./plugin/urlfilter-domaindenylist/src/test/org/apache/nutch/urlfilter/domaindenylist/TestDomainDenylistURLFilter.java
> ./plugin/protocol-imaps/plugin.xml
> ./plugin/protocol-imaps/ivy.xml
> ./plugin/protocol-imaps/lib/junit-4.13.jar
> ./plugin/protocol-imaps/lib/greenmail-junit4-1.6.0.jar
> ./plugin/protocol-imaps/lib/greenmail-1.6.0.jar
> ./plugin/protocol-imaps/src/test/org/apache/nutch/protocol/imaps/TestImaps.java
> ./plugin/protocol-file/build.xml
> ./plugin/protocol-file/src/test/org/apache/nutch/protocol/file/TestProtocolFile.java
> ./plugin/urlnormalizer-regex/build.xml
> ./plugin/urlnormalizer-regex/src/test/org/apache/nutch/net/urlnormalizer/regex/TestRegexURLNormalizer.java
> ./plugin/build-plugin.xml
> ./plugin/creativecommons/src/test/org/creativecommons/nutch/TestCCParseFilter.java
> ./plugin/urlnormalizer-basic/src/test/org/apache/nutch/net/urlnormalizer/basic/TestBasicURLNormalizer.java
> ./plugin/urlnormalizer-protocol/build.xml
> ./plugin/urlnormalizer-protocol/src/test/org/apache/nutch/net/urlnormalizer/protocol/TestProtocolURLNormalizer.java
> ./plugin/urlfilter-prefix/src/test/org/apache/nutch/urlfilter/prefix/TestPrefixURLFilter.java
> ./plugin/urlfilter-suffix/src/test/org/apache/nutch/urlfilter/suffix/TestSuffixURLFilter.java
> ./plugin/index-more/src/test/org/apache/nutch/indexer/more/TestMoreIndexingFilter.java
> 

[jira] [Closed] (NUTCH-2832) Create tutorial on sending Nutch logs to Elasticsearch

2024-03-30 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney closed NUTCH-2832.
---

> Create tutorial on sending Nutch logs to Elasticsearch
> --
>
> Key: NUTCH-2832
> URL: https://issues.apache.org/jira/browse/NUTCH-2832
> Project: Nutch
>  Issue Type: New Feature
>  Components: configuration, deployment
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Major
> Fix For: 1.20
>
>
> A while back I used to use [Chukwa|https://chukwa.apache.org/] for log 
> aggregation and analysis. Chukwa is now retired. 
> I did a bit of research into directly logging Log4j2 into Elasticsearch and came 
> across 
> [log4j2-elasticsearch|https://github.com/rfoltyns/log4j2-elasticsearch] which 
> looks pretty simple.
> I'm going to have a crack at implementing this functionality as a 
> configuration option. 





[jira] [Resolved] (NUTCH-2832) Create tutorial on sending Nutch logs to Elasticsearch

2024-03-30 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney resolved NUTCH-2832.
-
Resolution: Won't Fix

Given the license changes to the backend concerned, I no longer have any interest 
in implementing this. 

> Create tutorial on sending Nutch logs to Elasticsearch
> --
>
> Key: NUTCH-2832
> URL: https://issues.apache.org/jira/browse/NUTCH-2832
> Project: Nutch
>  Issue Type: New Feature
>  Components: configuration, deployment
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Major
> Fix For: 1.20
>
>
> A while back I used to use [Chukwa|https://chukwa.apache.org/] for log 
> aggregation and analysis. Chukwa is now retired. 
> I did a bit of research into directly logging Log4j2 into Elasticsearch and came 
> across 
> [log4j2-elasticsearch|https://github.com/rfoltyns/log4j2-elasticsearch] which 
> looks pretty simple.
> I'm going to have a crack at implementing this functionality as a 
> configuration option. 





[jira] [Resolved] (NUTCH-3036) Upgrade org.seleniumhq.selenium:selenium-java dependency in lib-selenium

2024-03-30 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney resolved NUTCH-3036.
-
Resolution: Fixed

> Upgrade org.seleniumhq.selenium:selenium-java dependency in lib-selenium
> 
>
> Key: NUTCH-3036
> URL: https://issues.apache.org/jira/browse/NUTCH-3036
> Project: Nutch
>  Issue Type: Improvement
>  Components: plugin, selenium
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Major
> Fix For: 1.20
>
>
> lib-selenium currently packages org.seleniumhq.selenium:selenium-java 
> *v4.7.2* but *v4.18.1* is available on Maven Central.
> This ticket will upgrade the java dependency and validate that both 
> protocol-selenium and protocol-interactiveselenium work as expected in local 
> mode and via selenium grid.





[jira] [Closed] (NUTCH-3036) Upgrade org.seleniumhq.selenium:selenium-java dependency in lib-selenium

2024-03-30 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney closed NUTCH-3036.
---

> Upgrade org.seleniumhq.selenium:selenium-java dependency in lib-selenium
> 
>
> Key: NUTCH-3036
> URL: https://issues.apache.org/jira/browse/NUTCH-3036
> Project: Nutch
>  Issue Type: Improvement
>  Components: plugin, selenium
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Major
> Fix For: 1.20
>
>
> lib-selenium currently packages org.seleniumhq.selenium:selenium-java 
> *v4.7.2* but *v4.18.1* is available on Maven Central.
> This ticket will upgrade the java dependency and validate that both 
> protocol-selenium and protocol-interactiveselenium work as expected in local 
> mode and via selenium grid.





[jira] [Closed] (NUTCH-3035) Update license and notice file for release of 1.20

2024-03-30 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney closed NUTCH-3035.
---

> Update license and notice file for release of 1.20 
> ---
>
> Key: NUTCH-3035
> URL: https://issues.apache.org/jira/browse/NUTCH-3035
> Project: Nutch
>  Issue Type: Bug
>  Components: documentation
>Affects Versions: 1.20
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Major
> Fix For: 1.20
>
>
> Close to the release of 1.20 the license and notice files should be updated 
> to contain all (third-party) licenses of all dependencies. Cf. NUTCH-2290 and 
> NUTCH-2981.





[jira] [Resolved] (NUTCH-3035) Update license and notice file for release of 1.20

2024-03-30 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney resolved NUTCH-3035.
-
Resolution: Fixed

> Update license and notice file for release of 1.20 
> ---
>
> Key: NUTCH-3035
> URL: https://issues.apache.org/jira/browse/NUTCH-3035
> Project: Nutch
>  Issue Type: Bug
>  Components: documentation
>Affects Versions: 1.20
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Major
> Fix For: 1.20
>
>
> Close to the release of 1.20 the license and notice files should be updated 
> to contain all (third-party) licenses of all dependencies. Cf. NUTCH-2290 and 
> NUTCH-2981.





[jira] [Resolved] (NUTCH-3037) Upgrade org.apache.kafka:kafka_2.12: to v3.7.0

2024-03-30 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney resolved NUTCH-3037.
-
Resolution: Fixed

> Upgrade org.apache.kafka:kafka_2.12: to v3.7.0
> --
>
> Key: NUTCH-3037
> URL: https://issues.apache.org/jira/browse/NUTCH-3037
> Project: Nutch
>  Issue Type: Task
>  Components: indexer-kafka
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Major
> Fix For: 1.20
>
>
> We depend on v1.1.0, which is quite a bit behind the current v3.7.0 artifact; 
> I therefore propose to upgrade.
> I will also state that a _*kafka_2.13*_ artifact exists. This would demand 
> that the underlying Scala version be also upgraded... but I think this should 
> be addressed in a separate ticket.





[jira] [Closed] (NUTCH-3037) Upgrade org.apache.kafka:kafka_2.12: to v3.7.0

2024-03-30 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney closed NUTCH-3037.
---

> Upgrade org.apache.kafka:kafka_2.12: to v3.7.0
> --
>
> Key: NUTCH-3037
> URL: https://issues.apache.org/jira/browse/NUTCH-3037
> Project: Nutch
>  Issue Type: Task
>  Components: indexer-kafka
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Major
> Fix For: 1.20
>
>
> We depend on v1.1.0, which is quite a bit behind the current v3.7.0 artifact; 
> I therefore propose to upgrade.
> I will also state that a _*kafka_2.13*_ artifact exists. This would demand 
> that the underlying Scala version be also upgraded... but I think this should 
> be addressed in a separate ticket.





[jira] [Created] (TIKA-4227) Register tika-helm Chart in artifacthub.io

2024-03-26 Thread Lewis John McGibbney (Jira)
Lewis John McGibbney created TIKA-4227:
--

 Summary: Register tika-helm Chart in artifacthub.io
 Key: TIKA-4227
 URL: https://issues.apache.org/jira/browse/TIKA-4227
 Project: Tika
  Issue Type: Task
  Components: tika-helm
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
 Fix For: 2.9.2


[https://artifacthub.io/] represents the most popular search interface for 
(amongst lots of other artifacts) Helm Charts.

This task will register the tika-helm Chart with [https://artifacthub.io/].





[jira] [Work stopped] (NUTCH-3037) Upgrade org.apache.kafka:kafka_2.12: to v3.7.0

2024-03-21 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on NUTCH-3037 stopped by Lewis John McGibbney.
---
> Upgrade org.apache.kafka:kafka_2.12: to v3.7.0
> --
>
> Key: NUTCH-3037
> URL: https://issues.apache.org/jira/browse/NUTCH-3037
> Project: Nutch
>  Issue Type: Task
>  Components: indexer-kafka
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Major
> Fix For: 1.20
>
>
> We depend on v1.1.0, which is quite a bit behind the current v3.7.0 artifact; 
> I therefore propose to upgrade.
> I will also state that a _*kafka_2.13*_ artifact exists. This would demand 
> that the underlying Scala version be also upgraded... but I think this should 
> be addressed in a separate ticket.





[jira] [Updated] (NUTCH-3037) Upgrade org.apache.kafka:kafka_2.12: to v3.7.0

2024-03-21 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-3037:

Flags: Patch

> Upgrade org.apache.kafka:kafka_2.12: to v3.7.0
> --
>
> Key: NUTCH-3037
> URL: https://issues.apache.org/jira/browse/NUTCH-3037
> Project: Nutch
>  Issue Type: Task
>  Components: indexer-kafka
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Major
> Fix For: 1.20
>
>
> We depend on v1.1.0, which is quite a bit behind the current v3.7.0 artifact; 
> I therefore propose to upgrade.
> I will also state that a _*kafka_2.13*_ artifact exists. This would demand 
> that the underlying Scala version be also upgraded... but I think this should 
> be addressed in a separate ticket.





[jira] [Work started] (NUTCH-3037) Upgrade org.apache.kafka:kafka_2.12: to v3.7.0

2024-03-21 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on NUTCH-3037 started by Lewis John McGibbney.
---
> Upgrade org.apache.kafka:kafka_2.12: to v3.7.0
> --
>
> Key: NUTCH-3037
> URL: https://issues.apache.org/jira/browse/NUTCH-3037
> Project: Nutch
>  Issue Type: Task
>  Components: indexer-kafka
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Major
> Fix For: 1.20
>
>
> We depend on v1.1.0, which is quite a bit behind the current v3.7.0 artifact; 
> I therefore propose to upgrade.
> I will also state that a _*kafka_2.13*_ artifact exists. This would demand 
> that the underlying Scala version be also upgraded... but I think this should 
> be addressed in a separate ticket.





[jira] [Created] (NUTCH-3037) Upgrade org.apache.kafka:kafka_2.12: to v3.7.0

2024-03-21 Thread Lewis John McGibbney (Jira)
Lewis John McGibbney created NUTCH-3037:
---

 Summary: Upgrade org.apache.kafka:kafka_2.12: to v3.7.0
 Key: NUTCH-3037
 URL: https://issues.apache.org/jira/browse/NUTCH-3037
 Project: Nutch
  Issue Type: Task
  Components: indexer-kafka
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
 Fix For: 1.20


We depend on v1.1.0, which is quite a bit behind the current v3.7.0 artifact; I 
therefore propose to upgrade.

I will also state that a _*kafka_2.13*_ artifact exists. This would demand that 
the underlying Scala version be also upgraded... but I think this should be 
addressed in a separate ticket.





[jira] [Work stopped] (NUTCH-3036) Upgrade org.seleniumhq.selenium:selenium-java dependency in lib-selenium

2024-03-14 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on NUTCH-3036 stopped by Lewis John McGibbney.
---
> Upgrade org.seleniumhq.selenium:selenium-java dependency in lib-selenium
> 
>
> Key: NUTCH-3036
> URL: https://issues.apache.org/jira/browse/NUTCH-3036
> Project: Nutch
>  Issue Type: Improvement
>  Components: plugin, selenium
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Major
> Fix For: 1.20
>
>
> lib-selenium currently packages org.seleniumhq.selenium:selenium-java 
> *v4.7.2* but *v4.18.1* is available on Maven Central.
> This ticket will upgrade the java dependency and validate that both 
> protocol-selenium and protocol-interactiveselenium work as expected in local 
> mode and via selenium grid.





[jira] [Work started] (NUTCH-3036) Upgrade org.seleniumhq.selenium:selenium-java dependency in lib-selenium

2024-03-14 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on NUTCH-3036 started by Lewis John McGibbney.
---
> Upgrade org.seleniumhq.selenium:selenium-java dependency in lib-selenium
> 
>
> Key: NUTCH-3036
> URL: https://issues.apache.org/jira/browse/NUTCH-3036
> Project: Nutch
>  Issue Type: Improvement
>  Components: plugin, selenium
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Major
> Fix For: 1.20
>
>
> lib-selenium currently packages org.seleniumhq.selenium:selenium-java 
> *v4.7.2* but *v4.18.1* is available on Maven Central.
> This ticket will upgrade the java dependency and validate that both 
> protocol-selenium and protocol-interactiveselenium work as expected in local 
> mode and via selenium grid.





[jira] [Created] (NUTCH-3036) Upgrade org.seleniumhq.selenium:selenium-java dependency in lib-selenium

2024-03-14 Thread Lewis John McGibbney (Jira)
Lewis John McGibbney created NUTCH-3036:
---

 Summary: Upgrade org.seleniumhq.selenium:selenium-java dependency 
in lib-selenium
 Key: NUTCH-3036
 URL: https://issues.apache.org/jira/browse/NUTCH-3036
 Project: Nutch
  Issue Type: Improvement
  Components: selenium, plugin
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
 Fix For: 1.20


lib-selenium currently packages org.seleniumhq.selenium:selenium-java *v4.7.2* 
but *v4.18.1* is available on Maven Central.

This ticket will upgrade the java dependency and validate that both 
protocol-selenium and protocol-interactiveselenium work as expected in local 
mode and via selenium grid.





[jira] [Commented] (IVY-1651) Augment 'Child elements’ section of 'File System Resolver' documentation

2024-03-13 Thread Lewis John McGibbney (Jira)


[ 
https://issues.apache.org/jira/browse/IVY-1651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17826781#comment-17826781
 ] 

Lewis John McGibbney commented on IVY-1651:
---

PR available at [https://github.com/apache/ant-ivy/pull/101]

> Augment 'Child elements’ section of 'File System Resolver' documentation
> 
>
> Key: IVY-1651
> URL: https://issues.apache.org/jira/browse/IVY-1651
> Project: Ivy
>  Issue Type: Improvement
>  Components: Documentation, Maven Compatibility
>Reporter: Lewis John McGibbney
>Priority: Trivial
> Fix For: 2.5.3
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> I [recently encountered some 
> confusion|https://lists.apache.org/thread/tzvtw4j2d9pcxhqjxyb2dwnsk50t47b5] 
> when upgrading from Ivy 2.5.0 to 2.5.2.
> I think the documentation at 
> [https://ant.apache.org/ivy/history/2.5.2/resolver/filesystem.html#_child_elements]
>  could be augmented to at least link back to the [Maven 
> documentation|https://maven.apache.org/pom.html#dependencies], which 
> explicitly references acceptable constituent values for the resolver pattern.





[jira] [Created] (IVY-1651) Augment 'Child elements’ section of 'File System Resolver' documentation

2024-03-13 Thread Lewis John McGibbney (Jira)
Lewis John McGibbney created IVY-1651:
-

 Summary: Augment 'Child elements’ section of 'File System 
Resolver' documentation
 Key: IVY-1651
 URL: https://issues.apache.org/jira/browse/IVY-1651
 Project: Ivy
  Issue Type: Improvement
  Components: Documentation, Maven Compatibility
Reporter: Lewis John McGibbney
 Fix For: 2.5.3


I [recently encountered some 
confusion|https://lists.apache.org/thread/tzvtw4j2d9pcxhqjxyb2dwnsk50t47b5] 
when upgrading from Ivy 2.5.0 to 2.5.2.

I think the documentation at 
[https://ant.apache.org/ivy/history/2.5.2/resolver/filesystem.html#_child_elements]
 could be augmented to at least link back to the [Maven 
documentation|https://maven.apache.org/pom.html#dependencies], which 
explicitly references acceptable constituent values for the resolver pattern.





[jira] [Commented] (NUTCH-3029) Host specific max. and min. intervals in adaptive scheduler

2024-03-13 Thread Lewis John McGibbney (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17826776#comment-17826776
 ] 

Lewis John McGibbney commented on NUTCH-3029:
-

Hi [~martin.dj] [~markus17] it looks like we are missing some Javadoc

 
{quote}
[javadoc] Standard Doclet version 11.0.22
[javadoc] Building tree for all the packages and classes...
[javadoc] /home/runner/work/nutch/nutch/src/java/org/apache/nutch/crawl/AdaptiveFetchSchedule.java:193: warning: no @param for url
[javadoc] public static String getHostName(String url) throws URISyntaxException {
[javadoc] ^
[javadoc] /home/runner/work/nutch/nutch/src/java/org/apache/nutch/crawl/AdaptiveFetchSchedule.java:193: warning: no @return
[javadoc] public static String getHostName(String url) throws URISyntaxException {
[javadoc] ^
[javadoc] /home/runner/work/nutch/nutch/src/java/org/apache/nutch/crawl/AdaptiveFetchSchedule.java:193: warning: no @throws for java.net.URISyntaxException
[javadoc] public static String getHostName(String url) throws URISyntaxException {
[javadoc] ^
[javadoc] /home/runner/work/nutch/nutch/src/java/org/apache/nutch/crawl/AdaptiveFetchSchedule.java:205: warning: no @return
[javadoc] public float getMaxInterval(Text url, float defaultMaxInterval){
[javadoc] ^
[javadoc] /home/runner/work/nutch/nutch/src/java/org/apache/nutch/crawl/AdaptiveFetchSchedule.java:227: warning: no @return
[javadoc] public float getMinInterval(Text url, float defaultMinInterval){
[javadoc] ^
{quote}
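
For reference, a minimal sketch of the kind of Javadoc that would likely silence the three warnings reported for getHostName. Only the method signature is taken from the log above. The class name HostNameJavadocSketch and the method body (a plain java.net.URI lookup) are assumptions for illustration and may not match what AdaptiveFetchSchedule actually does.

```java
import java.net.URI;
import java.net.URISyntaxException;

/**
 * Hypothetical stand-in used only to illustrate the missing Javadoc tags;
 * this is not the actual AdaptiveFetchSchedule code.
 */
public class HostNameJavadocSketch {

  /**
   * Extracts the host name from a URL string.
   *
   * @param url the URL to parse, for example "https://nutch.apache.org/index.html"
   * @return the host component of the URL, or null if the URL has no host
   * @throws URISyntaxException if the given string cannot be parsed as a URI
   */
  public static String getHostName(String url) throws URISyntaxException {
    // Assumed body: delegate to java.net.URI; the real implementation may differ.
    return new URI(url).getHost();
  }

  public static void main(String[] args) throws URISyntaxException {
    // Prints the host part of the sample URL.
    System.out.println(getHostName("https://nutch.apache.org/index.html"));
  }
}
```

The same pattern (an @return tag describing the resolved interval) should cover the two remaining warnings on getMaxInterval and getMinInterval.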
 

> Host specific max. and min. intervals in adaptive scheduler
> ---
>
> Key: NUTCH-3029
> URL: https://issues.apache.org/jira/browse/NUTCH-3029
> Project: Nutch
>  Issue Type: New Feature
>Affects Versions: 1.19, 1.20
>Reporter: Martin Djukanovic
>Assignee: Markus Jelsma
>Priority: Minor
> Attachments: adaptive-host-specific-intervals.txt.template, 
> new_adaptive_fetch_schedule-1.patch
>
>
> This patch implements custom max. and min. refetching intervals for specific 
> hosts, in the AdaptiveFetchSchedule class. The intervals are set up in a .txt 
> configuration file (template also attached).





[jira] [Closed] (NUTCH-3033) Upgrade Ivy to v2.5.2

2024-03-13 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney closed NUTCH-3033.
---

> Upgrade Ivy to v2.5.2
> -
>
> Key: NUTCH-3033
> URL: https://issues.apache.org/jira/browse/NUTCH-3033
> Project: Nutch
> Issue Type: Task
> Components: ivy
> Reporter: Lewis John McGibbney
> Assignee: Lewis John McGibbney
> Priority: Major
> Fix For: 1.20
>
>
> Ivy v2.5.2 was released August 20th 2023. Let’s upgrade.
> [https://ant.apache.org/ivy/history/2.5.2/release-notes.html]





[jira] [Resolved] (NUTCH-3033) Upgrade Ivy to v2.5.2

2024-03-13 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney resolved NUTCH-3033.
-
Resolution: Fixed

> Upgrade Ivy to v2.5.2
> -
>
> Key: NUTCH-3033
> URL: https://issues.apache.org/jira/browse/NUTCH-3033
> Project: Nutch
> Issue Type: Task
> Components: ivy
> Reporter: Lewis John McGibbney
> Assignee: Lewis John McGibbney
> Priority: Major
> Fix For: 1.20
>
>
> Ivy v2.5.2 was released August 20th 2023. Let’s upgrade.
> [https://ant.apache.org/ivy/history/2.5.2/release-notes.html]





[jira] [Updated] (NUTCH-3033) Upgrade Ivy to v2.5.2

2024-03-12 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-3033:

Due Date: 12/Mar/24  (was: 11/Mar/24)

> Upgrade Ivy to v2.5.2
> -
>
> Key: NUTCH-3033
> URL: https://issues.apache.org/jira/browse/NUTCH-3033
> Project: Nutch
> Issue Type: Task
> Components: ivy
> Reporter: Lewis John McGibbney
> Assignee: Lewis John McGibbney
> Priority: Major
> Fix For: 1.20
>
>
> Ivy v2.5.2 was released August 20th 2023. Let’s upgrade.
> [https://ant.apache.org/ivy/history/2.5.2/release-notes.html]





[jira] [Work stopped] (NUTCH-3033) Upgrade Ivy to v2.5.2

2024-03-12 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on NUTCH-3033 stopped by Lewis John McGibbney.
---
> Upgrade Ivy to v2.5.2
> -
>
> Key: NUTCH-3033
> URL: https://issues.apache.org/jira/browse/NUTCH-3033
> Project: Nutch
> Issue Type: Task
> Components: ivy
> Reporter: Lewis John McGibbney
> Assignee: Lewis John McGibbney
> Priority: Major
> Fix For: 1.20
>
>
> Ivy v2.5.2 was released August 20th 2023. Let’s upgrade.
> [https://ant.apache.org/ivy/history/2.5.2/release-notes.html]





[jira] [Updated] (NUTCH-3034) Overhaul the legacy Nutch plugin framework and replace it with PF4J

2024-03-12 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-3034:

Description: 
h1. Motivation

Plugins provide a large part of the functionality of Nutch. Although the legacy 
plugin framework continues to offer lots of value, for example
 # some aspects, e.g. examples, are [fairly well 
documented|https://cwiki.apache.org/confluence/display/NUTCH/PluginCentral],
 # it is generally stable, and
 # it offers reasonable test coverage (on a plugin-by-plugin basis)
 # … and probably more positives which I am overlooking...

… there are also several aspects which could be improved:
 # the [core framework is sparsely 
documented|https://cwiki.apache.org/confluence/display/NUTCH/WhichTechnicalConceptsAreBehindTheNutchPluginSystem];
 this extends to very important aspects like the {*}plugin lifecycle{*}, 
{*}classloading{*}, {*}packaging{*}, {*}thread safety{*}, and lots of other 
topics which are of intrinsic value to developers and maintainers. 
 # the core framework is somewhat [sparsely 
tested|https://github.com/apache/nutch/blob/master/src/test/org/apache/nutch/plugin/TestPluginSystem.java],
 with 7 tests at the time of writing. Traditionally, developers have focused on 
providing unit tests at the plugin level as opposed to the legacy plugin 
framework.
 # it sees very little maintenance/attention. My gut feeling (and I may be 
totally wrong here) is that not many people know much about the core 
legacy plugin framework.
 # writing plugins is clunky. This largely has to do with the legacy Ant + Ivy 
build and dependency management system, but it is clunky nonetheless.
 # generally speaking, any reduction of code in the Nutch codebase through 
careful selection of, and dependence on, well-maintained, well-tested 3rd party 
libraries would be a good thing for the Nutch codebase.

*This issue therefore proposes to overhaul the* *legacy* *Nutch plugin 
framework and replace it with Plugin Framework for Java (PF4J).*
h1. Task Breakdown

The following is a proposed breakdown of this overall initiative into Epics. 
These Epics should likely be decomposed further, but that will be left to 
the implementer(s).
 # {*}Document the legacy Nutch plugin lifecycle{*}: taking inspiration from 
[PF4J’s plugin lifecycle 
documentation|https://pf4j.org/doc/plugin-lifecycle.html], provide both 
documentation and a diagram which clearly outline how the legacy plugin 
lifecycle works. It might also be a good idea to make a contribution to PF4J and 
provide them with a diagram to accompany their documentation :). Generally 
speaking, familiarize oneself with the legacy plugin framework and 
understand where the gaps are.
 # {*}Study the PF4J framework and perform a feasibility study{*}: this will 
provide an opportunity to identify gaps between what the legacy plugin 
framework does (and what Nutch needs) vs. what PF4J provides. Touch base with 
the PF4J community and describe the intention to replace the legacy Nutch plugin 
framework with PF4J. Obtain guidance on how to proceed. Document this all in 
the Nutch wiki. Create a mapping of [legacy 
classes|https://github.com/apache/nutch/tree/master/src/java/org/apache/nutch/plugin]
 to [PF4J 
equivalents|https://github.com/pf4j/pf4j/tree/master/pf4j/src/main/java/org/pf4j].
 # {*}Restructure the legacy Nutch plugin package{*}: 
[https://github.com/apache/nutch/tree/master/src/java/org/apache/nutch/plugin]
 # {*}Restructure each plugin in the plugins directory{*}: 
[https://github.com/apache/nutch/tree/master/src/plugin]
 # {*}Update the Nutch plugin documentation{*}
 # {*}Create/propose plugin utility tooling{*}: #4 in the motivation section 
states that developing plugins is clunky. A utility tool which streamlines the 
creation of new plugins would be ideal. For example, this could take the form 
of a [new bash script|https://github.com/apache/nutch/tree/master/src/bin] 
which prompts the developer for input and then generates the plugin skeleton. 
{*}This is a nice to have{*}.
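
To make the skeleton-generation idea a little more concrete, here is a rough sketch of what such a tool might do. It is written in Java rather than bash purely for illustration. The class name PluginSkeletonSketch, the generate method, the id pattern and the contents of the stub descriptor are all assumptions; a real tool would also prompt for input and emit the build files a plugin needs.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical generator; names and layout are assumptions for illustration. */
final class PluginSkeletonSketch {

  /**
   * Creates pluginsDir/pluginId with a stub plugin.xml and an empty
   * src/java directory, and returns the new plugin root.
   */
  static Path generate(Path pluginsDir, String pluginId) {
    // Assumed naming rule: lowercase letters, digits and dashes only.
    if (!pluginId.matches("[a-z][a-z0-9-]*")) {
      throw new IllegalArgumentException("Unsupported plugin id: " + pluginId);
    }
    Path root = pluginsDir.resolve(pluginId);
    try {
      Files.createDirectories(root.resolve("src").resolve("java"));
      // Stub descriptor; the real elements are left as a TODO on purpose.
      String descriptor = String.join("\n",
          "<?xml version=\"1.0\" encoding=\"UTF-8\"?>",
          "<plugin id=\"" + pluginId + "\" name=\"" + pluginId + "\" version=\"1.0.0\">",
          "  <!-- TODO: runtime, requires and extension elements -->",
          "</plugin>",
          "");
      Files.writeString(root.resolve("plugin.xml"), descriptor);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return root;
  }
}
```

A call such as PluginSkeletonSketch.generate(Path.of("src/plugin"), "index-example") would create the directory tree and the stub descriptor; everything beyond that is left to the implementer(s).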

h1. Google Summer of Code Details

This initiative is being proposed as a GSoC 2024 project. 

{*}Proposed Mentor{*}: [~lewismc] 

{*}Proposed Co-Mentor{*}:

 


[jira] [Updated] (NUTCH-3034) Overhaul the legacy Nutch plugin framework and replace it with PF4J

2024-03-12 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-3034:

Description: 
h1. Motivation

Plugins provide a large part of the functionality of Nutch. Although the legacy 
plugin framework continues to offer lots of value, for example
 # some aspects, e.g. examples, are [fairly well 
documented|https://cwiki.apache.org/confluence/display/NUTCH/PluginCentral],
 # it is generally stable, and
 # it offers reasonable test coverage (on a plugin-by-plugin basis)
 # … and probably more positives which I am overlooking...

… there are also several aspects which could be improved:
 # the [core framework is sparsely 
documented|https://cwiki.apache.org/confluence/display/NUTCH/WhichTechnicalConceptsAreBehindTheNutchPluginSystem];
 this extends to very important aspects like the {*}plugin lifecycle{*}, 
{*}classloading{*}, {*}packaging{*}, {*}thread safety{*}, and lots of other 
topics which are of intrinsic value to developers and maintainers. 
 # the core framework is somewhat [sparsely 
tested|https://github.com/apache/nutch/blob/master/src/test/org/apache/nutch/plugin/TestPluginSystem.java],
 with 7 tests at the time of writing. Traditionally, developers have focused on 
providing unit tests at the plugin level as opposed to the legacy plugin 
framework.
 # it sees very little maintenance/attention. My gut feeling (and I may be 
totally wrong here) is that not many people know much about the core 
legacy plugin framework.
 # writing plugins is clunky. This largely has to do with the legacy Ant + Ivy 
build and dependency management system, but it is clunky nonetheless.
 # generally speaking, any reduction of code in the Nutch codebase through 
careful selection of, and dependence on, well-maintained, well-tested 3rd party 
libraries would be a good thing for the Nutch codebase.

*This issue therefore proposes to overhaul the* *legacy* *Nutch plugin 
framework and replace it with Plugin Framework for Java (PF4J).*
h1. Task Breakdown

The following is a proposed breakdown of this overall initiative into Epics. 
These Epics should likely be decomposed further, but that will be left to 
the implementer(s).
 # {*}Document the legacy Nutch plugin lifecycle{*}: taking inspiration from 
[PF4J’s plugin lifecycle 
documentation|https://pf4j.org/doc/plugin-lifecycle.html], provide both 
documentation and a diagram which clearly outline how the legacy plugin 
lifecycle works. It might also be a good idea to make a contribution to PF4J and 
provide them with a diagram to accompany their documentation :). Generally 
speaking, familiarize oneself with the legacy plugin framework and 
understand where the gaps are.
 # {*}Study the PF4J framework and perform a feasibility study{*}: this will 
provide an opportunity to identify gaps between what the legacy plugin 
framework does (and what Nutch needs) vs. what PF4J provides. Touch base with 
the PF4J community and describe the intention to replace the legacy Nutch plugin 
framework with PF4J. Obtain guidance on how to proceed. Document this all in 
the Nutch wiki. Create a mapping of [legacy 
classes|https://github.com/apache/nutch/tree/master/src/java/org/apache/nutch/plugin]
 to [PF4J 
equivalents|https://github.com/pf4j/pf4j/tree/master/pf4j/src/main/java/org/pf4j].
 # {*}Restructure the legacy Nutch plugin package{*}: 
[https://github.com/apache/nutch/tree/master/src/java/org/apache/nutch/plugin]
 # {*}Restructure each plugin in the plugins directory{*}: 
[https://github.com/apache/nutch/tree/master/src/plugin]

 
h1. Google Summer of Code Details

This initiative is being proposed as a GSoC 2024 project. 

{*}Proposed Mentor{*}: [~lewismc] 

{*}Proposed Co-Mentor{*}:

 


[jira] [Updated] (NUTCH-3034) Overhaul the legacy Nutch plugin framework and replace it with PF4J

2024-03-12 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-3034:

Description: 
h1. Motivation

Plugins provide a large part of the functionality of Nutch. Although the legacy 
plugin framework continues to offer lots of value, for example
 # some aspects, e.g. examples, are [fairly well 
documented|https://cwiki.apache.org/confluence/display/NUTCH/PluginCentral],
 # it is generally stable, and
 # it offers reasonable test coverage (on a plugin-by-plugin basis)
 # … and probably more positives which I am overlooking...

… there are also several aspects which could be improved:
 # the [core framework is sparsely 
documented|https://cwiki.apache.org/confluence/display/NUTCH/WhichTechnicalConceptsAreBehindTheNutchPluginSystem];
 this extends to very important aspects like the {*}plugin lifecycle{*}, 
{*}classloading{*}, {*}packaging{*}, {*}thread safety{*}, and lots of other 
topics which are of intrinsic value to developers and maintainers. 
 # the core framework is somewhat [sparsely 
tested|https://github.com/apache/nutch/blob/master/src/test/org/apache/nutch/plugin/TestPluginSystem.java],
 with 7 tests at the time of writing. Traditionally, developers have focused on 
providing unit tests at the plugin level as opposed to the legacy plugin 
framework.
 # it sees very little maintenance/attention. My gut feeling (and I may be 
totally wrong here) is that not many people know much about the core 
legacy plugin framework.
 # writing plugins is clunky. This largely has to do with the legacy Ant + Ivy 
build and dependency management system, but it is clunky nonetheless.
 # generally speaking, any reduction of code in the Nutch codebase through 
careful selection of, and dependence on, well-maintained, well-tested 3rd party 
libraries would be a good thing for the Nutch codebase.

*This issue therefore proposes to overhaul the* *legacy* *Nutch plugin 
framework and replace it with Plugin Framework for Java (PF4J).*
h1. Task Breakdown

The following is a proposed breakdown of this overall initiative into Epics. 
These Epics should likely be decomposed further, but that will be left to 
the implementer(s).
 # {*}Document the legacy Nutch plugin lifecycle{*}: taking inspiration from 
[PF4J’s plugin lifecycle 
documentation|https://pf4j.org/doc/plugin-lifecycle.html], provide both 
documentation and a diagram which clearly outline how the legacy plugin 
lifecycle works. It might also be a good idea to make a contribution to PF4J and 
provide them with a diagram to accompany their documentation :). Generally 
speaking, familiarize oneself with the legacy plugin framework and 
understand where the gaps are.
 # {*}Study the PF4J framework and perform a feasibility study{*}: this will 
provide an opportunity to identify gaps between what the legacy plugin 
framework does (and what Nutch needs) vs. what PF4J provides. Touch base with 
the PF4J community and describe the intention to replace the legacy Nutch plugin 
framework with PF4J. Obtain guidance on how to proceed. Document this all in 
the Nutch wiki. Create a mapping of [legacy 
classes|https://github.com/apache/nutch/tree/master/src/java/org/apache/nutch/plugin]
 to [PF4J 
equivalents|https://github.com/pf4j/pf4j/tree/master/pf4j/src/main/java/org/pf4j].
 # {*}Restructure the legacy Nutch plugin package{*}: 
[https://github.com/apache/nutch/tree/master/src/java/org/apache/nutch/plugin]
 # {*}Restructure each plugin in the plugins directory{*}: 
[https://github.com/apache/nutch/tree/master/src/plugin]

 


[jira] [Updated] (NUTCH-3034) Overhaul the legacy Nutch plugin framework and replace it with PF4J

2024-03-12 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-3034:

Description: 
h1. Motivation

Plugins provide a large part of the functionality of Nutch. Although the legacy 
plugin framework continues to offer lots of value, for example
 # some aspects, e.g. examples, are [fairly well 
documented|https://cwiki.apache.org/confluence/display/NUTCH/PluginCentral],
 # it is generally stable, and
 # it offers reasonable test coverage (on a plugin-by-plugin basis)
 # … and probably more positives which I am overlooking...

… there are also several aspects which could be improved:
 # the [core framework is sparsely 
documented|https://cwiki.apache.org/confluence/display/NUTCH/WhichTechnicalConceptsAreBehindTheNutchPluginSystem];
 this extends to very important aspects like the {*}plugin lifecycle{*}, 
{*}classloading{*}, {*}packaging{*}, {*}thread safety{*}, and lots of other 
topics which are of intrinsic value to developers and maintainers. 
 # the core framework is somewhat [sparsely 
tested|https://github.com/apache/nutch/blob/master/src/test/org/apache/nutch/plugin/TestPluginSystem.java],
 with only 7 tests. Traditionally, developers have focused on providing unit tests 
at the plugin level as opposed to the legacy plugin framework.
 # it sees very little maintenance/attention. My gut feeling (and I may be 
totally wrong here) is that not many people know much about the core 
legacy plugin framework.
 # writing plugins is clunky. This largely has to do with the legacy Ant + Ivy 
build and dependency management system, but it is clunky nonetheless.

*This issue therefore proposes to overhaul the* *legacy* *Nutch plugin 
framework and replace it with Plugin Framework for Java (PF4J).*
h1. Task Breakdown

The following is a proposed breakdown of this overall initiative into Epics. 
These Epics should likely be decomposed further, but that will be left to 
the implementer(s).
 * {*}Document the legacy Nutch plugin lifecycle{*}: taking inspiration from 
[PF4J’s plugin lifecycle 
documentation|https://pf4j.org/doc/plugin-lifecycle.html], provide both 
documentation and a diagram which clearly outline how the legacy plugin 
lifecycle works. It might also be a good idea to make a contribution to PF4J and 
provide them with a diagram to accompany their documentation :).
 * {*}Study the PF4J framework and perform a feasibility study{*}: this will 
provide an opportunity to identify gaps between what the legacy plugin 
framework does (and what Nutch needs) vs. what PF4J provides. Touch base with 
the PF4J community and describe the intention to replace the legacy Nutch plugin 
framework with PF4J. Obtain guidance on how to proceed. Document this all in 
the Nutch wiki.

 


[jira] [Updated] (NUTCH-3034) Overhaul the legacy Nutch plugin framework and replace it with PF4J

2024-03-12 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-3034:

Description: 
h1. Motivation

Plugins provide a large part of the functionality of Nutch. Although the legacy 
plugin framework continues to offer lots of value, for example
 # some aspects, e.g. examples, are [fairly well 
documented|https://cwiki.apache.org/confluence/display/NUTCH/PluginCentral],
 # it is generally stable, and
 # it offers reasonable test coverage (on a plugin-by-plugin basis)
 # … and probably more positives which I am overlooking...

… there are also several aspects which could be improved:
 # the [core framework is sparsely 
documented|https://cwiki.apache.org/confluence/display/NUTCH/WhichTechnicalConceptsAreBehindTheNutchPluginSystem];
 this extends to very important aspects like the {*}plugin lifecycle{*}, 
{*}classloading{*}, {*}packaging{*}, {*}thread safety{*}, and lots of other 
topics which are of intrinsic value to developers and maintainers. 
 # the core framework is somewhat [sparsely 
tested|https://github.com/apache/nutch/blob/master/src/test/org/apache/nutch/plugin/TestPluginSystem.java],
 with only 7 tests. Traditionally, developers have focused on providing unit tests 
at the plugin level as opposed to the legacy plugin framework.
 # it sees very little maintenance/attention. My gut feeling (and I may be 
totally wrong here) is that not many people know much about the core 
legacy plugin framework.
 # writing plugins is clunky. This largely has to do with the legacy Ant + Ivy 
build and dependency management system, but it is clunky nonetheless.

*This issue therefore proposes to overhaul the* *legacy* *Nutch plugin 
framework and replace it with Plugin Framework for Java (PF4J).*
h1. Task Breakdown

The following is a proposed breakdown of this overall initiative into Epics. 
These Epics should likely be decomposed further, but that will be left to 
the implementer(s).
 * {*}Perform a feasibility study{*}: touch base with the PF4J community and 
describe the intention to replace the legacy Nutch plugin framework with PF4J. 
Obtain guidance on how to proceed. Document this all in the Nutch wiki.
 * {*}Document the legacy Nutch plugin lifecycle{*}: taking inspiration from 
[PF4J’s plugin lifecycle 
documentation|https://pf4j.org/doc/plugin-lifecycle.html], provide both 
documentation and a diagram which clearly outline how the legacy plugin 
lifecycle works. It might also be a good idea to make a contribution to PF4J and 
provide them with a diagram to accompany their documentation :)

 


[jira] [Updated] (NUTCH-3034) Overhaul the legacy Nutch plugin framework and replace it with PF4J

2024-03-12 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-3034:

Description: 
h1. Motivation

Plugins provide a large part of the functionality of Nutch. Although the legacy 
plugin framework continues to offer lots of value, for example
 # some aspects, e.g. examples, are [fairly well 
documented|https://cwiki.apache.org/confluence/display/NUTCH/PluginCentral],
 # it is generally stable, and
 # it offers reasonable test coverage (on a plugin-by-plugin basis)
 # … and probably more positives which I am overlooking...

… there are also several aspects which could be improved:
 # the [core framework is sparsely 
documented|https://cwiki.apache.org/confluence/display/NUTCH/WhichTechnicalConceptsAreBehindTheNutchPluginSystem];
 this extends to very important aspects like the {*}plugin lifecycle{*}, 
{*}classloading{*}, {*}packaging{*}, {*}thread safety{*}, and lots of other 
topics which are of intrinsic value to developers and maintainers. 
 # the core framework is somewhat [sparsely 
tested|https://github.com/apache/nutch/blob/master/src/test/org/apache/nutch/plugin/TestPluginSystem.java],
 with only 7 tests. Traditionally, developers have focused on providing unit tests 
at the plugin level as opposed to the legacy plugin framework.
 # it sees very little maintenance/attention. My gut feeling (and I may be 
totally wrong here) is that not many people know much about the core 
legacy plugin framework.
 # writing plugins is clunky. This largely has to do with the legacy Ant + Ivy 
build and dependency management system, but it is clunky nonetheless.

*This issue therefore proposes to overhaul the* *legacy* *Nutch plugin 
framework and replace it with Plugin Framework for Java (PF4J).*
h1. Task Breakdown

The following is a proposed breakdown of this overall initiative into Epics. 
These Epics should likely be decomposed further, but that will be left to 
the implementer(s).
 * {*}Perform a feasibility study{*}: touch base with the PF4J community and 
describe the intention to replace the legacy Nutch plugin framework with PF4J. 
Obtain guidance on how to proceed. Document this all in the Nutch wiki.
 * {*}Document the legacy Nutch plugin lifecycle{*}: taking inspiration from 
[PF4J’s plugin lifecycle 
documentation|https://pf4j.org/doc/plugin-lifecycle.html], provide both 
documentation and a diagram which clearly outline how the legacy plugin 
lifecycle works. It might also be a good idea to make a contribution to PF4J and 
provide them with a diagram to accompany their documentation :)

 

  was:
h1. Motivation

Plugins provide a large part of the functionality of Nutch. Although the legacy 
plugin framework continues to offer lots of value i.e.,
 # [some aspects e.g. examples, are fairly well 
documented|h[ttps://cwiki.apache.org/confluence/display/NUTCH/PluginCentral|https://cwiki.apache.org/confluence/display/NUTCH/PluginCentral]]
 # it is generally stable, and
 # offers reasonable test coverage (on a plugin-by-plugin basis)
 # … probably loads more positives which I am overlooking...

… there are also several aspects which could be improved
 # the [core framework is sparsely 
documented|https://cwiki.apache.org/confluence/display/NUTCH/WhichTechnicalConceptsAreBehindTheNutchPluginSystem],
 this extends to very important aspects like the {*}plugin lifecycle{*}, 
{*}classloading{*}, {*}packaging{*}, {*}thread safety{*}, and lots of other 
topics which are of intrinsic value to developers and maintainers. 
 # the core framework is somewhat [sparsely 
tested|https://github.com/apache/nutch/blob/master/src/test/org/apache/nutch/plugin/TestPluginSystem.java]…
 only 7 tests. Traditionally, developers have focused on providing unit tests 
on the plugin-level as opposed to the legacy plugin framework.
 # the core framework sees very little maintenance/attention. It is my gut feeling (and I may be 
totally wrong here) but I _think_ that not many people know much about the core 
legacy plugin framework.
 # writing plugins is clunky. This largely has to do with the legacy Ant + Ivy 
build and dependency management system, but that being said, it is clunky 
nonetheless.

*This issue therefore proposes to overhaul the* *legacy* *Nutch plugin 
framework and replace it with Plugin Framework for Java (PF4J).*
h1. Task Breakdown

The following is a proposed breakdown of this overall initiative into Epics. 
These Epics should likely be decomposed further but that will be left down to 
the implementer(s).
 * {*}perform feasibility study{*}; touch base with the PF4J community, 
describe the intention to replace the legacy Nutch plugin framework with PF4J. 
Obtain guidance on how to proceed. Document this all in the Nutch wiki.
 * {*}document the legacy Nutch plugin lifecycle{*}; taking inspiration from 

[jira] [Created] (NUTCH-3034) Overhaul the legacy Nutch plugin framework and replace it with PF4J

2024-03-12 Thread Lewis John McGibbney (Jira)
Lewis John McGibbney created NUTCH-3034:
---

 Summary: Overhaul the legacy Nutch plugin framework and replace it 
with PF4J
 Key: NUTCH-3034
 URL: https://issues.apache.org/jira/browse/NUTCH-3034
 Project: Nutch
  Issue Type: Improvement
  Components: pf4j, plugin
Reporter: Lewis John McGibbney


h1. Motivation

Plugins provide a large part of the functionality of Nutch. Although the legacy 
plugin framework continues to offer lots of value i.e.,
 # [some aspects e.g. examples, are fairly well 
documented|https://cwiki.apache.org/confluence/display/NUTCH/PluginCentral]
 # it is generally stable, and
 # offers reasonable test coverage (on a plugin-by-plugin basis)
 # … probably loads more positives which I am overlooking...

… there are also several aspects which could be improved
 # the [core framework is sparsely 
documented|https://cwiki.apache.org/confluence/display/NUTCH/WhichTechnicalConceptsAreBehindTheNutchPluginSystem],
 this extends to very important aspects like the {*}plugin lifecycle{*}, 
{*}classloading{*}, {*}packaging{*}, {*}thread safety{*}, and lots of other 
topics which are of intrinsic value to developers and maintainers. 
 # the core framework is somewhat [sparsely 
tested|https://github.com/apache/nutch/blob/master/src/test/org/apache/nutch/plugin/TestPluginSystem.java]…
 only 7 tests. Traditionally, developers have focused on providing unit tests 
on the plugin-level as opposed to the legacy plugin framework.
 # the core framework sees very little maintenance/attention. It is my gut feeling (and I may be 
totally wrong here) but I _think_ that not many people know much about the core 
legacy plugin framework.
 # writing plugins is clunky. This largely has to do with the legacy Ant + Ivy 
build and dependency management system, but that being said, it is clunky 
nonetheless.

*This issue therefore proposes to overhaul the* *legacy* *Nutch plugin 
framework and replace it with Plugin Framework for Java (PF4J).*
h1. Task Breakdown

The following is a proposed breakdown of this overall initiative into Epics. 
These Epics should likely be decomposed further but that will be left down to 
the implementer(s).
 * {*}perform feasibility study{*}; touch base with the PF4J community, 
describe the intention to replace the legacy Nutch plugin framework with PF4J. 
Obtain guidance on how to proceed. Document this all in the Nutch wiki.
 * {*}document the legacy Nutch plugin lifecycle{*}; taking inspiration from 
[PF4J’s plugin lifecycle 
documentation|https://pf4j.org/doc/plugin-lifecycle.html], provide both 
documentation and a diagram which clearly outline how the legacy plugin 
lifecycle works. It might also be a good idea to make a contribution to PF4J and 
provide them with a diagram to accompany their documentation :)
 *  

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (NUTCH-3033) Upgrade Ivy to v2.5.2

2024-03-11 Thread Lewis John McGibbney (Jira)
Lewis John McGibbney created NUTCH-3033:
---

 Summary: Upgrade Ivy to v2.5.2
 Key: NUTCH-3033
 URL: https://issues.apache.org/jira/browse/NUTCH-3033
 Project: Nutch
  Issue Type: Task
  Components: ivy
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
 Fix For: 1.20


Ivy v2.5.2 was released August 20th 2023. Let’s upgrade.

[https://ant.apache.org/ivy/history/2.5.2/release-notes.html]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Work started] (NUTCH-3033) Upgrade Ivy to v2.5.2

2024-03-11 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on NUTCH-3033 started by Lewis John McGibbney.
---
> Upgrade Ivy to v2.5.2
> -
>
> Key: NUTCH-3033
> URL: https://issues.apache.org/jira/browse/NUTCH-3033
> Project: Nutch
>  Issue Type: Task
>  Components: ivy
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Major
> Fix For: 1.20
>
>
> Ivy v2.5.2 was released August 20th 2023. Let’s upgrade.
> [https://ant.apache.org/ivy/history/2.5.2/release-notes.html]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (NUTCH-3024) Remove flaky 'dependency check' target

2023-11-24 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney closed NUTCH-3024.
---

> Remove flaky 'dependency check' target
> --
>
> Key: NUTCH-3024
> URL: https://issues.apache.org/jira/browse/NUTCH-3024
> Project: Nutch
>  Issue Type: Task
>  Components: build
>Affects Versions: 1.19
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Minor
> Fix For: 1.20
>
>
> I [started a 
> thread|https://lists.apache.org/thread/ol3ssjphdqqxwsxhc65qoqg1dj1kjbxb] 
> covering my observations running the ant _*dependency-check*_ target. It 
> fails unpredictably in both GitHub actions and our trusty Jenkins builds on 
> ci-builds.apache.org.
> I propose to simply remove this target (and associated configuration) in a 
> bid to clean up some flaky legacy build code.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (NUTCH-3024) Remove flaky 'dependency check' target

2023-11-24 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney resolved NUTCH-3024.
-
Resolution: Fixed

> Remove flaky 'dependency check' target
> --
>
> Key: NUTCH-3024
> URL: https://issues.apache.org/jira/browse/NUTCH-3024
> Project: Nutch
>  Issue Type: Task
>  Components: build
>Affects Versions: 1.19
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Minor
> Fix For: 1.20
>
>
> I [started a 
> thread|https://lists.apache.org/thread/ol3ssjphdqqxwsxhc65qoqg1dj1kjbxb] 
> covering my observations running the ant _*dependency-check*_ target. It 
> fails unpredictably in both GitHub actions and our trusty Jenkins builds on 
> ci-builds.apache.org.
> I propose to simply remove this target (and associated configuration) in a 
> bid to clean up some flaky legacy build code.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (TIKA-4169) Create a parser for Functional Mockup Unit (FMU) media type with .fmu extension

2023-11-13 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated TIKA-4169:
---
Description: 
A Functional Mockup Unit (FMU) is a software component used for exchanging and 
simulating dynamic system models. It is designed to enable simulations of 
system models regardless of the simulation tool, programming language, or 
hardware platform. This is made possible through a standard interface that 
allows FMUs to be exported and imported across different simulation 
environments.

The FMU media type ships with the .fmu file suffix

I think the MIT licensed [NTNU-IHB/FMI4j|https://github.com/NTNU-IHB/FMI4j] can 
be used as the underlying parser implementation.

I will go on the hunt for some sample files we can use in unit tests. I think 
we can make some available via 
[https://github.com/Open-MBEE/perseverance-modelica]

  was:
A Functional Mockup Unit (FMU) is a software component used for exchanging and 
simulating dynamic system models. It is designed to enable simulations of 
system models regardless of the simulation tool, programming language, or 
hardware platform. This is made possible through a standard interface that 
allows FMUs to be exported and imported across different simulation 
environments.

The FMU media type ships with the .fmu file suffix

I think the MIT licensed [NTNU-IHB/FMI4j|https://github.com/NTNU-IHB/FMI4j] can 
be used as the underlying parser implementation.


> Create a parser for Functional Mockup Unit (FMU) media type with .fmu 
> extension
> ---
>
> Key: TIKA-4169
> URL: https://issues.apache.org/jira/browse/TIKA-4169
> Project: Tika
>  Issue Type: New Feature
>  Components: parser
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Minor
>
> A Functional Mockup Unit (FMU) is a software component used for exchanging 
> and simulating dynamic system models. It is designed to enable simulations of 
> system models regardless of the simulation tool, programming language, or 
> hardware platform. This is made possible through a standard interface that 
> allows FMUs to be exported and imported across different simulation 
> environments.
> The FMU media type ships with the .fmu file suffix
> I think the MIT licensed [NTNU-IHB/FMI4j|https://github.com/NTNU-IHB/FMI4j] 
> can be used as the underlying parser implementation.
> I will go on the hunt for some sample files we can use in unit tests. I think 
> we can make some available via 
> [https://github.com/Open-MBEE/perseverance-modelica]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (TIKA-4169) Create a parser for Functional Mockup Unit (FMU) media type with .fmu extension

2023-11-13 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated TIKA-4169:
---
Description: 
A Functional Mockup Unit (FMU) is a software component used for exchanging and 
simulating dynamic system models. It is designed to enable simulations of 
system models regardless of the simulation tool, programming language, or 
hardware platform. This is made possible through a standard interface that 
allows FMUs to be exported and imported across different simulation 
environments.

The FMU media type ships with the .fmu file suffix

I think the MIT licensed [NTNU-IHB/FMI4j|https://github.com/NTNU-IHB/FMI4j] can 
be used as the underlying parser implementation.

  was:
A Functional Mockup Unit (FMU) is a software component used for exchanging and 
simulating dynamic system models. It is designed to enable simulations of 
system models regardless of the simulation tool, programming language, or 
hardware platform. This is made possible through a standard interface that 
allows FMUs to be exported and imported across different simulation 
environments.

 

The FMU media type ships with the .fmu file suffix 

 

I think the MIT licensed [NTNU-IHB/FMI4j|https://github.com/NTNU-IHB/FMI4j] 
can be used as the underlying parser implementation.


> Create a parser for Functional Mockup Unit (FMU) media type with .fmu 
> extension
> ---
>
> Key: TIKA-4169
> URL: https://issues.apache.org/jira/browse/TIKA-4169
> Project: Tika
>  Issue Type: New Feature
>  Components: parser
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Minor
>
> A Functional Mockup Unit (FMU) is a software component used for exchanging 
> and simulating dynamic system models. It is designed to enable simulations of 
> system models regardless of the simulation tool, programming language, or 
> hardware platform. This is made possible through a standard interface that 
> allows FMUs to be exported and imported across different simulation 
> environments.
> The FMU media type ships with the .fmu file suffix
> I think the MIT licensed [NTNU-IHB/FMI4j|https://github.com/NTNU-IHB/FMI4j] 
> can be used as the underlying parser implementation.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (TIKA-4169) Create a parser for Functional Mockup Unit (FMU) media type with .fmu extension

2023-11-13 Thread Lewis John McGibbney (Jira)
Lewis John McGibbney created TIKA-4169:
--

 Summary: Create a parser for Functional Mockup Unit (FMU) media 
type with .fmu extension
 Key: TIKA-4169
 URL: https://issues.apache.org/jira/browse/TIKA-4169
 Project: Tika
  Issue Type: New Feature
  Components: parser
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney


A Functional Mockup Unit (FMU) is a software component used for exchanging and 
simulating dynamic system models. It is designed to enable simulations of 
system models regardless of the simulation tool, programming language, or 
hardware platform. This is made possible through a standard interface that 
allows FMUs to be exported and imported across different simulation 
environments.

 

The FMU media type ships with the .fmu file suffix 

 

I think the MIT licensed [NTNU-IHB/FMI4j|https://github.com/NTNU-IHB/FMI4j] 
can be used as the underlying parser implementation.
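
As a rough illustration of what a first detection step for such a parser might look like, here is a minimal sketch. It relies on one fact that is not stated in the issue: per the FMI standard, an FMU is a ZIP archive with a {{modelDescription.xml}} entry at its root. The class and method names ({{FmuDetectorSketch}}, {{looksLikeFmu}}) are hypothetical, and the sketch uses neither the Tika API nor FMI4j.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper, not part of Tika or FMI4j.
class FmuDetectorSketch {

  // Assumption: an FMU is a ZIP archive whose root contains modelDescription.xml.
  static final String MODEL_DESCRIPTION = "modelDescription.xml";

  // Returns true if the stream is a ZIP archive with a root-level
  // modelDescription.xml entry. Non-ZIP input simply yields false.
  static boolean looksLikeFmu(InputStream in) throws IOException {
    try (ZipInputStream zip = new ZipInputStream(in)) {
      ZipEntry entry;
      while ((entry = zip.getNextEntry()) != null) {
        if (MODEL_DESCRIPTION.equals(entry.getName())) {
          return true;
        }
      }
    }
    return false;
  }
}
```

A real parser would go on to read the model description and extract metadata from it; that part is left out here because the issue does not specify it.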



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (NUTCH-3007) Fix impossible casts

2023-11-10 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney closed NUTCH-3007.
---

> Fix impossible casts
> 
>
> Key: NUTCH-3007
> URL: https://issues.apache.org/jira/browse/NUTCH-3007
> Project: Nutch
>  Issue Type: Sub-task
>Affects Versions: 1.19
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Major
> Fix For: 1.20
>
>
> Spotbugs reports two occurrences of
>   Impossible cast from java.util.ArrayList to String[] in 
> org.apache.nutch.fetcher.Fetcher.run(Map, String)
> Both were introduced later into the {{run(Map args, String 
> crawlId)}} method and obviously never used (would throw a 
> ClassCastException). The code blocks should be removed.
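
For readers unfamiliar with this Spotbugs finding, the following minimal sketch shows why such a cast can never succeed. It is not the Nutch code: the class and method names are hypothetical, and the issue itself proposes removing the affected blocks rather than converting the value as the second method does.

```java
import java.util.List;

// Hypothetical illustration, not taken from org.apache.nutch.fetcher.Fetcher.
class ImpossibleCastSketch {

  // Broken pattern: the static type Object hides the problem from the
  // compiler, but an ArrayList is never a String[] at runtime.
  static String[] brokenCast(Object value) {
    return (String[]) value; // throws ClassCastException for an ArrayList
  }

  // One possible alternative: copy the list elements into a new array.
  static String[] toStringArray(Object value) {
    if (value instanceof String[]) {
      return (String[]) value;
    }
    if (value instanceof List) {
      List<?> list = (List<?>) value;
      String[] result = new String[list.size()];
      for (int i = 0; i < result.length; i++) {
        result[i] = String.valueOf(list.get(i));
      }
      return result;
    }
    throw new IllegalArgumentException("Unsupported type: " + value);
  }
}
```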



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (NUTCH-2846) Fix various bugs spotted by NUTCH-2815

2023-11-10 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney closed NUTCH-2846.
---

> Fix various bugs spotted by NUTCH-2815
> --
>
> Key: NUTCH-2846
> URL: https://issues.apache.org/jira/browse/NUTCH-2846
> Project: Nutch
>  Issue Type: Sub-task
>Affects Versions: 1.18
>Reporter: Sebastian Nagel
>Priority: Major
> Fix For: 1.19
>
>
> This issue addresses various bugs spotted by Spotbugs (NUTCH-2815):
> - use static method Integer.parseInt(...)
> - use integer arithmetic instead of floating point with rounding floats 
> afterwards
> - erroneous declaration of constructor in BasicURLNormalizer
> - fix bracketing when calculating hash code of CrawlDatum
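
The kinds of fixes listed above can be sketched roughly as follows. This is a generic illustration under stated assumptions, not the actual Nutch patch: the class name, method names and the fields combined in the hash code are all hypothetical, and the exact expressions changed in Nutch are not given in the issue.

```java
// Hypothetical illustration of the bug classes named in the issue.
class SpotbugsFixesSketch {

  // Use the static parser directly instead of allocating a boxed Integer
  // first (e.g. new Integer(text).intValue()).
  static int parse(String text) {
    return Integer.parseInt(text.trim());
  }

  // Ceiling division with integer arithmetic only, instead of dividing as
  // floating point and rounding afterwards.
  // Assumes total >= 0, parts > 0 and no overflow of total + parts.
  static int ceilDiv(int total, int parts) {
    return (total + parts - 1) / parts;
  }

  // Explicit brackets and one step per field keep the order in which the
  // fields are combined unambiguous.
  static int hash(int status, long fetchTime, float score) {
    int result = status;
    result = 31 * result + (int) (fetchTime ^ (fetchTime >>> 32));
    result = 31 * result + Float.floatToIntBits(score);
    return result;
  }
}
```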



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (NUTCH-2852) Method invokes System.exit(...) 9 bugs

2023-11-10 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney closed NUTCH-2852.
---

> Method invokes System.exit(...) 9 bugs
> --
>
> Key: NUTCH-2852
> URL: https://issues.apache.org/jira/browse/NUTCH-2852
> Project: Nutch
>  Issue Type: Sub-task
>Affects Versions: 1.18
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Major
> Fix For: 1.20
>
>
> org.apache.nutch.indexer.IndexingFiltersChecker since first historized release
> In class org.apache.nutch.indexer.IndexingFiltersChecker
> In method org.apache.nutch.indexer.IndexingFiltersChecker.run(String[])
> At IndexingFiltersChecker.java:[line 96]
> Another occurrence at IndexingFiltersChecker.java:[line 129]
> org.apache.nutch.indexer.IndexingFiltersChecker.run(String[]) invokes 
> System.exit(...), which shuts down the entire virtual machine
> Invoking System.exit shuts down the entire Java virtual machine. This should 
> only be done when it is appropriate. Such calls make it hard or impossible 
> for your code to be invoked by other code. Consider throwing a 
> RuntimeException instead.
> Also occurs in
>org.apache.nutch.net.URLFilterChecker since first historized release
>org.apache.nutch.net.URLNormalizerChecker since first historized release
>org.apache.nutch.parse.ParseSegment since first historized release
>org.apache.nutch.parse.ParserChecker since first historized release
>org.apache.nutch.service.NutchServer since first historized release
>org.apache.nutch.tools.CommonCrawlDataDumper since first historized release
>org.apache.nutch.tools.DmozParser since first historized release
>org.apache.nutch.util.AbstractChecker since first historized release 
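
One common way to address this finding is to let the working method report failure through an exception or a return code, and leave the decision to terminate the JVM to the outermost {{main}} method. The sketch below illustrates that idea only; {{CheckerSketch}} and its methods are hypothetical and do not reflect how the Nutch classes listed above were actually changed.

```java
// Hypothetical illustration, not one of the Nutch checker classes.
class CheckerSketch {

  // Callable from other code: signals bad input with an exception
  // instead of shutting down the virtual machine.
  static String check(String[] args) {
    if (args.length != 1) {
      throw new IllegalArgumentException("Usage: CheckerSketch <url>");
    }
    return "checked " + args[0];
  }

  // Translates the outcome into an exit code. Only a command line
  // launcher (a main method) would pass this value to System.exit.
  static int toExitCode(String[] args) {
    try {
      System.out.println(check(args));
      return 0;
    } catch (IllegalArgumentException e) {
      System.err.println(e.getMessage());
      return 1;
    }
  }
}
```

With this split, a test or another tool can call {{check}} or {{toExitCode}} directly without the JVM being terminated underneath it.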



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (NUTCH-2819) Move spotbugs "installation" directory to avoid that spotbugs is shipped in Nutch runtime

2023-11-10 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney closed NUTCH-2819.
---

> Move spotbugs "installation" directory to avoid that spotbugs is shipped in 
> Nutch runtime
> -
>
> Key: NUTCH-2819
> URL: https://issues.apache.org/jira/browse/NUTCH-2819
> Project: Nutch
>  Issue Type: Sub-task
>Affects Versions: 1.18
>Reporter: Sebastian Nagel
>Assignee: Shashanka Balakuntala Srinivasa
>Priority: Minor
> Fix For: 1.19
>
>
> With NUTCH-2816 the Spotbugs tool is "installed" in lib/. However, files in 
> lib/ are copied to build/ and runtime/. To avoid that the spotbugs jars are 
> shipped in runtime and eventually also releases, the spotbugs installation 
> folder should be moved into a different directory.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (NUTCH-2851) Random object created and used only once

2023-11-10 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney closed NUTCH-2851.
---

> Random object created and used only once
> 
>
> Key: NUTCH-2851
> URL: https://issues.apache.org/jira/browse/NUTCH-2851
> Project: Nutch
>  Issue Type: Sub-task
>  Components: dmoz, generator, indexer, segment
>Affects Versions: 1.18
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Major
> Fix For: 1.19
>
>
> In class org.apache.nutch.crawl.Generator
> In method org.apache.nutch.crawl.Generator.partitionSegment(Path, Path, int)
> Called method java.util.Random.nextInt()
> At Generator.java:[line 1016]
> Random object created and used only once in 
> org.apache.nutch.crawl.Generator.partitionSegment(Path, Path, int)
> This code creates a java.util.Random object, uses it to generate one random 
> number, and then discards the Random object. This produces mediocre quality 
> random numbers and is inefficient. If possible, rewrite the code so that the 
> Random object is created once and saved, and each time a new random number is 
> required invoke a method on the existing Random object to obtain it.
> If it is important that the generated Random numbers not be guessable, you 
> must not create a new Random for each random number; the values are too 
> easily guessable. You should strongly consider using a 
> java.security.SecureRandom instead (and avoid allocating a new SecureRandom 
> for each random number needed).
> This bad practice also affects the following
> org.apache.nutch.indexer.IndexingJob since first historized release
> org.apache.nutch.segment.SegmentReader since first historized release
> org.apache.nutch.tools.DmozParser$RDFProcessor since first historized release 
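
The remedy described in the report can be sketched as follows: create the generator once and reuse it. The class and method names are hypothetical and this is not the Nutch {{Generator}} code. Where unpredictability matters, {{java.security.SecureRandom}} could be substituted for {{java.util.Random}} in the same pattern.

```java
import java.util.Random;

// Hypothetical illustration of reusing a single Random instance.
class SharedRandomSketch {

  // Created once and shared, instead of calling new Random().nextInt()
  // each time a value is needed.
  private static final Random RANDOM = new Random();

  // Any int value, drawn from the shared generator.
  static int nextSeed() {
    return RANDOM.nextInt();
  }

  // A value in the range [0, bound), drawn from the shared generator.
  static int nextBounded(int bound) {
    return RANDOM.nextInt(bound);
  }
}
```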



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (NUTCH-2850) Method ignores exceptional return value

2023-11-10 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney closed NUTCH-2850.
---

> Method ignores exceptional return value
> ---
>
> Key: NUTCH-2850
> URL: https://issues.apache.org/jira/browse/NUTCH-2850
> Project: Nutch
>  Issue Type: Sub-task
>  Components: dumpers
>Affects Versions: 1.18
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Major
> Fix For: 1.19
>
>
> In class org.apache.nutch.tools.FileDumper
> In method org.apache.nutch.tools.FileDumper.dump(File, File, String[], 
> boolean, boolean, boolean)
> Called method java.io.File.mkdirs()
> At FileDumper.java:[line 237]
> Exceptional return value of java.io.File.mkdirs() ignored in 
> org.apache.nutch.tools.FileDumper.dump(File, File, String[], boolean, 
> boolean, boolean)
> This method returns a value that is not checked. The return value should be 
> checked since it can indicate an unusual or unexpected function execution. 
> For example, the File.delete() method returns false if the file could not be 
> successfully deleted (rather than throwing an Exception). If you don't check 
> the result, you won't notice if the method invocation signals unexpected 
> behavior by returning an atypical return value. 
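
A minimal sketch of checking the return value of {{File.mkdirs()}} is shown below. The helper name is hypothetical and this is not the {{FileDumper}} code. Note that {{mkdirs()}} also returns false when the directory already exists, so the sketch checks {{isDirectory()}} before treating false as a failure.

```java
import java.io.File;
import java.io.IOException;

// Hypothetical helper, not taken from org.apache.nutch.tools.FileDumper.
class MkdirsSketch {

  // Creates the directory (and any missing parents) or throws if that is
  // not possible. An already existing directory is not an error.
  static void ensureDirectory(File dir) throws IOException {
    if (!dir.mkdirs() && !dir.isDirectory()) {
      throw new IOException("Could not create directory: " + dir);
    }
  }
}
```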



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (NUTCH-3024) Remove flaky 'dependency check' target

2023-11-03 Thread Lewis John McGibbney (Jira)
Lewis John McGibbney created NUTCH-3024:
---

 Summary: Remove flaky 'dependency check' target
 Key: NUTCH-3024
 URL: https://issues.apache.org/jira/browse/NUTCH-3024
 Project: Nutch
  Issue Type: Task
  Components: build
Affects Versions: 1.19
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
 Fix For: 1.20


I [started a 
thread|https://lists.apache.org/thread/ol3ssjphdqqxwsxhc65qoqg1dj1kjbxb] 
covering my observations running the ant _*dependency-check*_ target. It fails 
unpredictably in both GitHub actions and our trusty Jenkins builds on 
ci-builds.apache.org.

I propose to simply remove this target (and associated configuration) in a bid 
to clean up some flaky legacy build code.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (NUTCH-3023) Use mikepenz/action-junit-report to improve interpretation of failed tests during CI

2023-11-02 Thread Lewis John McGibbney (Jira)
Lewis John McGibbney created NUTCH-3023:
---

 Summary: Use mikepenz/action-junit-report to improve 
interpretation of failed tests during CI
 Key: NUTCH-3023
 URL: https://issues.apache.org/jira/browse/NUTCH-3023
 Project: Nutch
  Issue Type: Task
  Components: build, test
Affects Versions: 1.19
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
 Fix For: 1.20


The following GitHub action could help improve the interpretation of unit test 
anomalies during a CI run.

[https://github.com/mikepenz/action-junit-report]

Rather than having to grep through the GitHub Action log, one could save time 
by interpreting the comments posted to the PR conversation thread.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (NUTCH-3014) Standardize Job names

2023-11-02 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney closed NUTCH-3014.
---

Thanks [~snagel] for the review

> Standardize Job names
> -
>
> Key: NUTCH-3014
> URL: https://issues.apache.org/jira/browse/NUTCH-3014
> Project: Nutch
>  Issue Type: Improvement
>  Components: configuration, runtime
>Affects Versions: 1.19
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Minor
> Fix For: 1.20
>
>
> There is a large degree of variability when we set the job name:
>  
> {{Job job = NutchJob.getInstance(getConf());}}
> {{job.setJobName("read " + segment);}}
>  
> Some examples mention the job name, others don't. Some use upper case, others 
> don't, etc.
> I think we can standardize the NutchJob job names. This would help when 
> filtering jobs in YARN ResourceManager UI as well.
> I propose we implement the following convention
>  * *Nutch* (mandatory) - static value which prepends the job name, assists 
> with distinguishing the Job as a NutchJob and making it easily findable.
>  * *${ClassName}* (mandatory) - literally the name of the Class the job is 
> encoded in
>  * *${additional info}* (optional) - value could further distinguish the type 
> of job (LinkRank Counter, LinkRank Initializer, LinkRank Inverter, etc.)
> _{*}Nutch ${ClassName}{*}: *${additional info}*_
> _Examples:_
>  * _Nutch LinkRank: Inverter_
>  * _Nutch CrawlDb: + $crawldb_
>  * _Nutch LinkDbReader: + $linkdb_
> Thanks for any suggestions/comments.
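
The proposed convention could be captured in a small helper along the following lines. This is only a sketch: {{JobNameSketch}} and {{jobName}} are hypothetical names, and the issue does not say whether such a helper was added to Nutch.

```java
// Hypothetical helper illustrating the proposed naming convention.
class JobNameSketch {

  // Builds "Nutch <ClassName>" or "Nutch <ClassName>: <additional info>".
  static String jobName(Class<?> jobClass, String additionalInfo) {
    String name = "Nutch " + jobClass.getSimpleName();
    if (additionalInfo != null && !additionalInfo.trim().isEmpty()) {
      name += ": " + additionalInfo.trim();
    }
    return name;
  }
}
```

Under this sketch, a call such as {{job.setJobName(JobNameSketch.jobName(LinkRank.class, "Inverter"))}} would produce the name _Nutch LinkRank: Inverter_ from the examples above.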



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

