from:"Lewis John McGibbney"

Re: Self Introduction - Xuanwo

2024-03-13 Thread Lewis John McGibbney

A warm welcome, Xuanwo, and thanks for introducing yourself.
lewismc

On 2024/03/10 05:20:20 Xuanwo wrote:
> Hello, everyone
> 
> I'm Xuanwo, and I'm following the "Contribute" guide in 
> comdev-working-groups[1] to introduce myself and kickstart my contributions :)
> 
> My personal vision is "Empowering freely data access from ANY storage service 
> in ANY method". Open source is definitely an important part of achieving my 
> vision.
> 
> - I'm the PMC Chair for Apache OpenDAL [2], a project that graduated in 
> January 2024, aimed at enabling free data access.
> - I work at Databend Labs [3], focusing on cost-effective data analysis.
> - I'm also contributing to Apache Iceberg [4] to simplify reading SQL tables.
> 
> My current interest lies in open source sustainability. I want to learn how 
> to ensure a project's sustainability and foster community growth. I'm here to 
> explore how I can contribute to expanding the ASF community.
> 
> Pleased to meet you here; I'm looking forward to working together with you.
> 
> [1]: https://github.com/apache/comdev-working-groups
> [2]: https://github.com/apache/opendal
> [3]: https://github.com/datafuselabs/databend/
> [4]: https://github.com/apache/iceberg-rust
> 
> Xuanwo
> 
> -
> To unsubscribe, e-mail: dev-unsubscr...@community.apache.org
> For additional commands, e-mail: dev-h...@community.apache.org
> 
> 


Re: [QUESTION] What should community do in GSoC timeline?

2024-03-13 Thread Lewis John McGibbney

Hi Xuanwo,
It’s been a few years since I participated in GSoC as a mentor… but this year I 
intend to. Let me see if I can provide answers to some of your questions.

On 2024/03/11 03:07:29 Xuanwo wrote:
> 
> 2024-02-22: Potential GSoC contributors discuss application ideas with 
> mentoring organizations
> 
> Q: Should those ideas/proposals be posted to the mailing list, or just 
> discussed with mentors?

The mailing list is great, but I don’t think there are any hard rules. This 
period is really just for attracting interest in the initiative (if it was 
created by the PMC/Committership) or convincing a PMC to take on your 
initiative (if it was created by a potential GSoC student).

> Q: Should student-submitted ideas/proposals be added to Jira?

Yes absolutely. Make sure the JIRA issue is labeled with “gsoc2024” as well. 
That way it will show up in the filter at 
https://issues.apache.org/jira/issues/?jql=labels+%3D+gsoc2024.
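
For anyone who wants to script against this filter, here is a minimal Python sketch of how the link above can be built from the JQL expression `labels = gsoc2024`. The helper name `jira_filter_url` is hypothetical and only for illustration; the base URL and the label are the ones given in this message.

```python
from urllib.parse import urlencode

# Base URL of the ASF Jira issue search, taken from the link above.
JIRA_SEARCH = "https://issues.apache.org/jira/issues/"


def jira_filter_url(label: str) -> str:
    """Return the issue-search URL for all issues carrying `label`.

    Hypothetical helper: it only URL-encodes the JQL expression
    `labels = <label>` into the `jql` query parameter.
    """
    return JIRA_SEARCH + "?" + urlencode({"jql": f"labels = {label}"})


print(jira_filter_url("gsoc2024"))
# prints https://issues.apache.org/jira/issues/?jql=labels+%3D+gsoc2024
```

The same helper should work for any other label, since the only thing that changes is the JQL expression being encoded.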

> 
> 2024-04-15: Proposals to ASF projects must be reviewed roughly and have a 
> potential mentor so that we know how many slots to request.
> 
> Q: Who will review/rank/score those proposals? The corresponding community's 
> PMC?
> 

In short, yes, but really it is down to the mentor(s). It is always good to have 
a backup mentor as well, in case the mentor is unable to see the project through.

HTH
lewismc


[jira] [Closed] (NUTCH-3033) Upgrade Ivy to v2.5.2

2024-03-13 Thread Lewis John McGibbney (Jira)



 [ 
https://issues.apache.org/jira/browse/NUTCH-3033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney closed NUTCH-3033.
---

> Upgrade Ivy to v2.5.2
> -
>
> Key: NUTCH-3033
> URL: https://issues.apache.org/jira/browse/NUTCH-3033
> Project: Nutch
> Issue Type: Task
> Components: ivy
> Reporter: Lewis John McGibbney
> Assignee: Lewis John McGibbney
> Priority: Major
> Fix For: 1.20
>
>
> Ivy v2.5.2 was released August 20th 2023. Let’s upgrade.
> [https://ant.apache.org/ivy/history/2.5.2/release-notes.html]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Resolved] (NUTCH-3033) Upgrade Ivy to v2.5.2

2024-03-13 Thread Lewis John McGibbney (Jira)



 [ 
https://issues.apache.org/jira/browse/NUTCH-3033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney resolved NUTCH-3033.
-
Resolution: Fixed


Re: [DISCUSS] Release Nutch 1.20

2024-03-12 Thread Lewis John McGibbney

I submitted a patch for the Ivy 2.5.2 upgrade. If folks could have a look at 
that it would be ideal.
https://github.com/apache/nutch/pull/803
I am free to roll a release candidate towards the end of this week.
lewismc

On 2024/03/10 15:08:36 Lewis John McGibbney wrote:
> Nice.
> I see that we are a couple of releases behind on Ivy as well, so I’ll submit a 
> patch for that.
> I can push this release this time. It’s been a while since I exercised the 
> workflow and it would be good to blow away the cobwebs.
> lewismc
> 
> On 2024/03/10 11:55:20 Markus Jelsma wrote:
> > Good idea! I'll finish work on three open issues the next week.
> > 
> > Op za 9 mrt 2024 om 13:02 schreef Sebastian Nagel <
> > wastl.na...@googlemail.com>:
> > 
> > > Hi Lewis,
> > >
> > > yes, of course!
> > >
> > > Some points we should do before the release:
> > >
> > > - address the ES licensing issue,
> > >the easiest way is to downgrade, see NUTCH-3008
> > >If done update the license-related files.
> > >
> > > - there are three short PRs open
> > >
> > > I'll try to have a look at these points the next days.
> > >
> > > Best,
> > > Sebastian
> > >
> > >
> > > On 3/8/24 01:43, lewis john mcgibbney wrote:
> > > > Hi dev@,
> > > > As of today, 51 issues have been addressed in the 1.20 development 
> > > > drive.
> > > > https://issues.apache.org/jira/projects/NUTCH/versions/12352190
> > > > I would like to push a release soon and ship it to the user community.
> > > > Any objections?
> > > > Thank you
> > > > lewismc
> > > >
> > >
> > 
>

[jira] [Updated] (NUTCH-3033) Upgrade Ivy to v2.5.2

2024-03-12 Thread Lewis John McGibbney (Jira)



 [ 
https://issues.apache.org/jira/browse/NUTCH-3033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-3033:

Due Date: 12/Mar/24  (was: 11/Mar/24)


[jira] [Work stopped] (NUTCH-3033) Upgrade Ivy to v2.5.2

2024-03-12 Thread Lewis John McGibbney (Jira)



 [ 
https://issues.apache.org/jira/browse/NUTCH-3033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on NUTCH-3033 stopped by Lewis John McGibbney.
---

Re: Differences in retrieve pattern between Ivy 2.5.0/2.5.1 & 2.5.2?

2024-03-12 Thread Lewis John McGibbney

Thanks for this guidance Stefan :) 
I was able to get a patch together at https://github.com/apache/nutch/pull/803
Hopefully this helps others who may be confused as I was.
Thank you
lewismc

On 2024/03/12 18:57:51 Stefan Bodewig wrote:
> On 2024-03-11, lewis john mcgibbney wrote:
> 
> > I am working on upgrading Ivy to latest over in the Apache Nutch project.
> > The build works just fine with 2.5.0 and 2.5.1 but with 2.5.2 the CI
> > fails with the following complaint
> 
> > /home/runner/work/nutch/nutch/src/plugin/build-plugin.xml:234:
> > impossible to ivy retrieve: java.lang.RuntimeException: problem during
> > retrieve of org.apache.nutch#lib-htmlunit: java.lang.RuntimeException:
> > Multiple artifacts of the module
> > io.netty#netty-transport-native-kqueue;4.1.84.Final are retrieved to
> > the same file! Update the retrieve pattern to fix this error.
> 
> Ivy 2.5.2 fixes a bug[1] when dealing with dependencies that have
> multiple Maven artifacts with different Maven classifiers. Prior to
> 2.5.2 Ivy would think they'd all be the same and just pick one.
> 
> io.netty#netty-transport-native-kqueue has several artifacts, at least
> this is what the repo looks like. I completely fail to understand the
> POM :-)
> 
> Your pattern probably needs a [classifier] to make sure two artifacts
> that differ by Maven classifier also target different file names.
> 
> Something like
> 
> pattern="${local-maven2-dir}/[organisation]/[module]/[revision]/[module]-[revision](-[classifier]).[ext]"
> 
> Stefan
> 
> [1] https://issues.apache.org/jira/browse/IVY-1642
>
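
To make Stefan's point concrete, here is a small Python sketch of why two artifacts collide under a pattern without `[classifier]`, and why the optional `(-[classifier])` group avoids it. This is a simplified model written for illustration, not Ivy's implementation: it assumes only that `[token]` is replaced by its value and that a parenthesised group is dropped when a token inside it is empty. The two classifier values are illustrative assumptions; the thread does not list the actual artifacts.

```python
import re

TOKEN = re.compile(r"\[(\w+)\]")
OPTIONAL = re.compile(r"\(([^()]*)\)")


def expand(pattern, tokens):
    """Expand a retrieve pattern using a simplified, Ivy-like model."""

    def keep_or_drop(match):
        # Keep the optional group only if every token in it has a value.
        inner = match.group(1)
        names = TOKEN.findall(inner)
        return inner if all(tokens.get(n) for n in names) else ""

    without_groups = OPTIONAL.sub(keep_or_drop, pattern)
    return TOKEN.sub(lambda m: tokens.get(m.group(1)) or "", without_groups)


# Two artifacts of the same module and revision that differ only by
# Maven classifier (classifier names are assumed for illustration).
ARTIFACTS = [
    {"module": "netty-transport-native-kqueue", "revision": "4.1.84.Final",
     "classifier": "osx-x86_64", "ext": "jar"},
    {"module": "netty-transport-native-kqueue", "revision": "4.1.84.Final",
     "classifier": "osx-aarch_64", "ext": "jar"},
]

OLD = "[module]-[revision].[ext]"
NEW = "[module]-[revision](-[classifier]).[ext]"

# One file name for two artifacts: the "retrieved to the same file" case.
print(sorted({expand(OLD, a) for a in ARTIFACTS}))
# Two distinct file names once the classifier is part of the pattern.
print(sorted({expand(NEW, a) for a in ARTIFACTS}))
```

Because the classifier group is optional, an artifact without a classifier still expands to the plain `[module]-[revision].[ext]` name under the new pattern.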

[GSoC 2024 PROPOSAL] Overhaul the legacy Nutch plugin framework and replace it with PF4J

2024-03-12 Thread lewis john mcgibbney

Hi user@ & dev@,

I decided to write up a GSoC’24 proposal and encourage interested
applicants to register their interest in the JIRA issue or else reach
out to the Nutch PMC over on dev@nutch.apache.org (please CC
lewi...@apache.org).

Title: Overhaul the legacy Nutch plugin framework and replace it with PF4J
JIRA: https://issues.apache.org/jira/browse/NUTCH-3034

Thanks in advance, and good luck to prospective GSoC applicants.

lewismc

-- 
http://home.apache.org/~lewismc/
http://people.apache.org/keys/committer/lewismc


[jira] [Updated] (NUTCH-3034) Overhaul the legacy Nutch plugin framework and replace it with PF4J

2024-03-12 Thread Lewis John McGibbney (Jira)



 [ 
https://issues.apache.org/jira/browse/NUTCH-3034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-3034:

Description: 
h1. Motivation

Plugins provide a large part of the functionality of Nutch. Although the legacy 
plugin framework continues to offer lots of value i.e.,
 # some aspects, e.g. examples, are [fairly well 
documented|[https://cwiki.apache.org/confluence/display/NUTCH/PluginCentral]]
 # it is generally stable, and
 # offers reasonable test coverage (on a plugin-by-plugin basis)
 # … probably loads more positives which I am overlooking...

… there are also several aspects which could be improved
 # the [core framework is sparsely 
documented|[https://cwiki.apache.org/confluence/display/NUTCH/WhichTechnicalConceptsAreBehindTheNutchPluginSystem]],
 this extends to very important aspects like the {*}plugin lifecycle{*}, 
{*}classloading{*}, {*}packaging{*}, {*}thread safety{*}, and lots of other 
topics which are of intrinsic value to developers and maintainers. 
 # the core framework is somewhat [sparsely 
tested|[https://github.com/apache/nutch/blob/master/src/test/org/apache/nutch/plugin/TestPluginSystem.java]]…
 currently 7 tests as of writing. Traditionally, developers have focused on 
providing unit tests on the plugin-level as opposed to the legacy plugin 
framework.
 # sees very low maintenance/attention. It is my gut feeling (and I may be 
totally wrong here) but I _think_ that not many people know much about the core 
legacy plugin framework.
 # writing plugins is clunky. This largely has to do with the legacy Ant + Ivy 
build and dependency management system, but that being said, it is clunky 
nonetheless.
 # generally speaking, any reduction of code in the Nutch codebase through 
careful selection of, and dependence on, well maintained, well tested 3rd party 
libraries would be a good thing for the Nutch codebase.

*This issue therefore proposes to overhaul the* *legacy* *Nutch plugin 
framework and replace it with Plugin Framework for Java (PF4J).*
h1. Task Breakdown

The following is a proposed breakdown of this overall initiative into Epics. 
These Epics should likely be decomposed further but that will be left down to 
the implementer(s).
 # {*}document the legacy Nutch plugin lifecycle{*}; taking inspiration from 
[PF4J’s plugin lifecycle 
documentation|[https://pf4j.org/doc/plugin-lifecycle.html]], provide both 
documentation and a diagram which clearly outline how the legacy plugin 
lifecycle works. Might also be a good idea to make a contribution to PF4J and 
provide them with a diagram to accompany their documentation :). Generally 
speaking, just familiarize oneself with the legacy plugin framework and 
understand where the gaps are.
 # {*}study PF4J framework and perform feasibility study{*}; this will 
provide an opportunity to identify gaps between what the legacy plugin 
framework does (and what Nutch needs) vs. what PF4J provides. Touch base with 
the PF4J community, describe the intention to replace the legacy Nutch plugin 
framework with PF4J. Obtain guidance on how to proceed. Document this all in 
the Nutch wiki. Create mapping of [legacy 
Classes|[https://github.com/apache/nutch/tree/master/src/java/org/apache/nutch/plugin]]
 to [PF4J 
equivalents|[https://github.com/pf4j/pf4j/tree/master/pf4j/src/main/java/org/pf4j]].
 # {*}Restructure the legacy Nutch plugin package{*}: 
[https://github.com/apache/nutch/tree/master/src/java/org/apache/nutch/plugin]
 # {*}Restructure each plugin in the plugins directory{*}: 
[https://github.com/apache/nutch/tree/master/src/plugin]
 # *Update Nutch plugin documentation* 
 # {*}Create/propose plugin utility tooling{*}: #4 in the motivation section 
states that developing plugins is clunky. A utility tool which streamlines the 
creation of new plugins would be ideal. For example, this could take the form 
of a [new bash script|[https://github.com/apache/nutch/tree/master/src/bin]] 
which prompts the developer for input and then generates the plugin skeleton. 
{*}This is a nice to have{*}.
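
As a rough illustration of the skeleton-generator idea in the last item, here is a minimal sketch. It is written in Python rather than the bash script suggested above, and it takes the plugin id as an argument instead of prompting for it. The function name `create_plugin_skeleton`, the file list, and the placeholder file contents are all assumptions made for this example; the real layout would be decided by the implementer(s) and may well change with a move to PF4J.

```python
import re
import tempfile
from pathlib import Path

# Assumed set of per-plugin files, for illustration only.
PLACEHOLDER_FILES = ("plugin.xml", "build.xml", "ivy.xml")


def create_plugin_skeleton(plugins_dir: Path, plugin_id: str) -> list:
    """Create an empty plugin directory tree and return the created paths."""
    # Restrict ids to lowercase letters, digits and dashes (an assumed rule).
    if not re.fullmatch(r"[a-z][a-z0-9-]*", plugin_id):
        raise ValueError(f"invalid plugin id: {plugin_id!r}")

    plugin_dir = plugins_dir / plugin_id
    plugin_dir.mkdir(parents=True)  # raises FileExistsError if it exists

    created = []
    for name in PLACEHOLDER_FILES:
        path = plugin_dir / name
        # Placeholder content only; real descriptors are left to the developer.
        path.write_text(f"<!-- TODO: fill in {name} for {plugin_id} -->\n")
        created.append(path)

    source_dir = plugin_dir / "src" / "java"
    source_dir.mkdir(parents=True)
    created.append(source_dir)
    return created


if __name__ == "__main__":
    # Demonstrate in a temporary directory so nothing is left behind.
    with tempfile.TemporaryDirectory() as tmp:
        for path in create_plugin_skeleton(Path(tmp), "index-example"):
            print(path.relative_to(tmp))
```

Refusing to overwrite an existing directory is a deliberate choice in this sketch, so that running the tool twice cannot clobber a plugin that is already under development.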

h1. Google Summer of Code Details

This initiative is being proposed as a GSoC 2024 project. 

{*}Proposed Mentor{*}: [~lewismc] 

{*}Proposed Co-Mentor{*}:

 


[jira] [Updated] (NUTCH-3034) Overhaul the legacy Nutch plugin framework and replace it with PF4J

2024-03-12 Thread Lewis John McGibbney (Jira)



 [ 
https://issues.apache.org/jira/browse/NUTCH-3034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-3034:

Description: 
h1. Motivation

Plugins provide a large part of the functionality of Nutch. Although the legacy 
plugin framework continues to offer lots of value i.e.,
 # some aspects, e.g. examples, are [fairly well 
documented|[https://cwiki.apache.org/confluence/display/NUTCH/PluginCentral]]
 # it is generally stable, and
 # offers reasonable test coverage (on a plugin-by-plugin basis)
 # … probably loads more positives which I am overlooking...

… there are also several aspects which could be improved
 # the [core framework is sparsely 
documented|[https://cwiki.apache.org/confluence/display/NUTCH/WhichTechnicalConceptsAreBehindTheNutchPluginSystem]],
 this extends to very important aspects like the {*}plugin lifecycle{*}, 
{*}classloading{*}, {*}packaging{*}, {*}thread safety{*}, and lots of other 
topics which are of intrinsic value to developers and maintainers. 
 # the core framework is somewhat [sparsely 
tested|[https://github.com/apache/nutch/blob/master/src/test/org/apache/nutch/plugin/TestPluginSystem.java]]…
 currently 7 tests as of writing. Traditionally, developers have focused on 
providing unit tests on the plugin-level as opposed to the legacy plugin 
framework.
 # sees very low maintenance/attention. It is my gut feeling (and I may be 
totally wrong here) but I _think_ that not many people know much about the core 
legacy plugin framework.
 # writing plugins is clunky. This largely has to do with the legacy Ant + Ivy 
build and dependency management system, but that being said, it is clunky 
nonetheless.
 # generally speaking, any reduction of code in the Nutch codebase through 
careful selection of, and dependence on, well maintained, well tested 3rd party 
libraries would be a good thing for the Nutch codebase.

*This issue therefore proposes to overhaul the* *legacy* *Nutch plugin 
framework and replace it with Plugin Framework for Java (PF4J).*
h1. Task Breakdown

The following is a proposed breakdown of this overall initiative into Epics. 
These Epics should likely be decomposed further but that will be left down to 
the implementer(s).
 # {*}document the legacy Nutch plugin lifecycle{*}; taking inspiration from 
[PF4J’s plugin lifecycle 
documentation|[https://pf4j.org/doc/plugin-lifecycle.html]], provide both 
documentation and a diagram which clearly outline how the legacy plugin 
lifecycle works. Might also be a good idea to make a contribution to PF4J and 
provide them with a diagram to accompany their documentation :). Generally 
speaking, just familiarize oneself with the legacy plugin framework and 
understand where the gaps are.
 # {*}study PF4J framework and perform feasibility study{*}; this will 
provide an opportunity to identify gaps between what the legacy plugin 
framework does (and what Nutch needs) vs. what PF4J provides. Touch base with 
the PF4J community, describe the intention to replace the legacy Nutch plugin 
framework with PF4J. Obtain guidance on how to proceed. Document this all in 
the Nutch wiki. Create mapping of [legacy 
Classes|[https://github.com/apache/nutch/tree/master/src/java/org/apache/nutch/plugin]]
 to [PF4J 
equivalents|[https://github.com/pf4j/pf4j/tree/master/pf4j/src/main/java/org/pf4j]].
 # {*}Restructure the legacy Nutch plugin package{*}: 
[https://github.com/apache/nutch/tree/master/src/java/org/apache/nutch/plugin]
 # {*}Restructure each plugin in the plugins directory{*}: 
[https://github.com/apache/nutch/tree/master/src/plugin]

 
h1. Google Summer of Code Details

This initiative is being proposed as a GSoC 2024 project. 

{*}Proposed Mentor{*}: [~lewismc] 

{*}Proposed Co-Mentor{*}:

 


[jira] [Updated] (NUTCH-3034) Overhaul the legacy Nutch plugin framework and replace it with PF4J

2024-03-12 Thread Lewis John McGibbney (Jira)



 [ 
https://issues.apache.org/jira/browse/NUTCH-3034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-3034:

Description: 
h1. Motivation

Plugins provide a large part of the functionality of Nutch. Although the legacy 
plugin framework continues to offer lots of value i.e.,
 # some aspects, e.g. examples, are [fairly well 
documented|[https://cwiki.apache.org/confluence/display/NUTCH/PluginCentral]]
 # it is generally stable, and
 # offers reasonable test coverage (on a plugin-by-plugin basis)
 # … probably loads more positives which I am overlooking...

… there are also several aspects which could be improved
 # the [core framework is sparsely 
documented|[https://cwiki.apache.org/confluence/display/NUTCH/WhichTechnicalConceptsAreBehindTheNutchPluginSystem]],
 this extends to very important aspects like the {*}plugin lifecycle{*}, 
{*}classloading{*}, {*}packaging{*}, {*}thread safety{*}, and lots of other 
topics which are of intrinsic value to developers and maintainers. 
 # the core framework is somewhat [sparsely 
tested|[https://github.com/apache/nutch/blob/master/src/test/org/apache/nutch/plugin/TestPluginSystem.java]]…
 currently 7 tests as of writing. Traditionally, developers have focused on 
providing unit tests on the plugin-level as opposed to the legacy plugin 
framework.
 # sees very low maintenance/attention. It is my gut feeling (and I may be 
totally wrong here) but I _think_ that not many people know much about the core 
legacy plugin framework.
 # writing plugins is clunky. This largely has to do with the legacy Ant + Ivy 
build and dependency management system, but that being said, it is clunky 
nonetheless.
 # generally speaking, any reduction of code in the Nutch codebase through 
careful selection of, and dependence on, well maintained, well tested 3rd party 
libraries would be a good thing for the Nutch codebase.

*This issue therefore proposes to overhaul the* *legacy* *Nutch plugin 
framework and replace it with Plugin Framework for Java (PF4J).*
h1. Task Breakdown

The following is a proposed breakdown of this overall initiative into Epics. 
These Epics should likely be decomposed further but that will be left down to 
the implementer(s).
 # {*}document the legacy Nutch plugin lifecycle{*}; taking inspiration from 
[PF4J’s plugin lifecycle 
documentation|[https://pf4j.org/doc/plugin-lifecycle.html]], provide both 
documentation and a diagram which clearly outline how the legacy plugin 
lifecycle works. Might also be a good idea to make a contribution to PF4J and 
provide them with a diagram to accompany their documentation :). Generally 
speaking, just familiarize oneself with the legacy plugin framework and 
understand where the gaps are.
 # {*}study PF4J framework and perform feasibility study{*}; this will 
provide an opportunity to identify gaps between what the legacy plugin 
framework does (and what Nutch needs) vs. what PF4J provides. Touch base with 
the PF4J community, describe the intention to replace the legacy Nutch plugin 
framework with PF4J. Obtain guidance on how to proceed. Document this all in 
the Nutch wiki. Create mapping of [legacy 
Classes|[https://github.com/apache/nutch/tree/master/src/java/org/apache/nutch/plugin]]
 to [PF4J 
equivalents|[https://github.com/pf4j/pf4j/tree/master/pf4j/src/main/java/org/pf4j]].
 # {*}Restructure the legacy Nutch plugin package{*}: 
[https://github.com/apache/nutch/tree/master/src/java/org/apache/nutch/plugin]
 # {*}Restructure each plugin in the plugins directory{*}: 
[https://github.com/apache/nutch/tree/master/src/plugin]
 #  

 


[jira] [Updated] (NUTCH-3034) Overhaul the legacy Nutch plugin framework and replace it with PF4J

2024-03-12 Thread Lewis John McGibbney (Jira)



 [ 
https://issues.apache.org/jira/browse/NUTCH-3034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-3034:

Description: 
h1. Motivation

Plugins provide a large part of the functionality of Nutch. Although the legacy 
plugin framework continues to offer lots of value i.e.,
 # some aspects, e.g. examples, are [fairly well 
documented|[https://cwiki.apache.org/confluence/display/NUTCH/PluginCentral]]
 # it is generally stable, and
 # offers reasonable test coverage (on a plugin-by-plugin basis)
 # … probably loads more positives which I am overlooking...

… there are also several aspects which could be improved
 # the [core framework is sparsely 
documented|[https://cwiki.apache.org/confluence/display/NUTCH/WhichTechnicalConceptsAreBehindTheNutchPluginSystem]],
 this extends to very important aspects like the {*}plugin lifecycle{*}, 
{*}classloading{*}, {*}packaging{*}, {*}thread safety{*}, and lots of other 
topics which are of intrinsic value to developers and maintainers. 
 # the core framework is somewhat [sparsely 
tested|[https://github.com/apache/nutch/blob/master/src/test/org/apache/nutch/plugin/TestPluginSystem.java]]…
 only 7 tests. Traditionally, developers have focused on providing unit tests 
on the plugin-level as opposed to the legacy plugin framework.
 # sees very low maintenance/attention. It is my gut feeling (and I may be 
totally wrong here) but I _think_ that not many people know much about the core 
legacy plugin framework.
 # writing plugins is clunky. This largely has to do with the legacy Ant + Ivy 
build and dependency management system, but that being said, it is clunky 
nonetheless.

*This issue therefore proposes to overhaul the* *legacy* *Nutch plugin 
framework and replace it with Plugin Framework for Java (PF4J).*
h1. Task Breakdown

The following is a proposed breakdown of this overall initiative into Epics. 
These Epics should likely be decomposed further but that will be left down to 
the implementer(s).
 * {*}document the legacy Nutch plugin lifecycle{*}; taking inspiration from 
[PF4J’s plugin lifecycle 
documentation|[https://pf4j.org/doc/plugin-lifecycle.html]], provide both 
documentation and a diagram which clearly outline how the legacy plugin 
lifecycle works. Might also be a good idea to make a contribution to PF4J and 
provide them with a diagram to accompany their documentation :).
 * {*}study PF4J framework and perform feasibility study{*}; this will 
provide an opportunity to identify gaps between what the legacy plugin 
framework does (and what Nutch needs) vs. what PF4J provides. Touch base with 
the PF4J community, describe the intention to replace the legacy Nutch plugin 
framework with PF4J. Obtain guidance on how to proceed. Document this all in 
the Nutch wiki.
 *  

 


[jira] [Updated] (NUTCH-3034) Overhaul the legacy Nutch plugin framework and replace it with PF4J

2024-03-12 Thread Lewis John McGibbney (Jira)



 [ 
https://issues.apache.org/jira/browse/NUTCH-3034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-3034:

Description: 
h1. Motivation

Plugins provide a large part of the functionality of Nutch. Although the legacy 
plugin framework continues to offer lots of value i.e.,
 # some aspects, e.g. examples, are [fairly well 
documented|[https://cwiki.apache.org/confluence/display/NUTCH/PluginCentral]]
 # it is generally stable, and
 # offers reasonable test coverage (on a plugin-by-plugin basis)
 # … probably loads more positives which I am overlooking...

… there are also several aspects which could be improved
 # the [core framework is sparsely 
documented|[https://cwiki.apache.org/confluence/display/NUTCH/WhichTechnicalConceptsAreBehindTheNutchPluginSystem]],
 this extends to very important aspects like the {*}plugin lifecycle{*}, 
{*}classloading{*}, {*}packaging{*}, {*}thread safety{*}, and lots of other 
topics which are of intrinsic value to developers and maintainers. 
 # the core framework is somewhat [sparsely 
tested|[https://github.com/apache/nutch/blob/master/src/test/org/apache/nutch/plugin/TestPluginSystem.java]]…
 only 7 tests. Traditionally, developers have focused on providing unit tests 
on the plugin-level as opposed to the legacy plugin framework.
 # sees very low maintenance/attention. It is my gut feeling (and I may be 
totally wrong here) but I _think_ that not many people know much about the core 
legacy plugin framework.
 # writing plugins is clunky. This largely has to do with the legacy Ant + Ivy 
build and dependency management system, but that being said, it is clunky 
nonetheless.

*This issue therefore proposes to overhaul the* *legacy* *Nutch plugin 
framework and replace it with Plugin Framework for Java (PF4J).*
h1. Task Breakdown

The following is a proposed breakdown of this overall initiative into Epics. 
These Epics should likely be decomposed further but that will be left down to 
the implementer(s).
 * {*}perform feasibility study{*}; touch base with the PF4J community, 
describe the intention to replace the legacy Nutch plugin framework with PF4J. 
Obtain guidance on how to proceed. Document this all in the Nutch wiki.
 * {*}document the legacy Nutch plugin lifecycle{*}; taking inspiration from 
[PF4J’s plugin lifecycle 
documentation|[https://pf4j.org/doc/plugin-lifecycle.html]] provide both 
documentation and a diagram which clearly outline how the legacy plugin 
lifecycle works. Might also be a good idea to make a contribution to PF4J and 
provide them with a diagram to accompany their documentation :)
 *  

 

  was:
h1. Motivation

Plugins provide a large part of the functionality of Nutch. Although the legacy 
plugin framework continues to offer lots of value i.e.,
 # [some aspects e.g. examples, are fairly well 
documented|[https://cwiki.apache.org/confluence/display/NUTCH/PluginCentral]]
 # it is generally stable, and
 # offers reasonable test coverage (on a plugin-by-plugin basis)
 # … probably loads more positives which I am overlooking...

… there are also several aspects which could be improved
 # the [core framework is sparsely 
documented|[https://cwiki.apache.org/confluence/display/NUTCH/WhichTechnicalConceptsAreBehindTheNutchPluginSystem]],
 this extends to very important aspects like the {*}plugin lifecycle{*}, 
{*}classloading{*}, {*}packaging{*}, {*}thread safety{*}, and lots of other 
topics which are of intrinsic value to developers and maintainers. 
 # the core framework is somewhat [sparsely 
tested|[https://github.com/apache/nutch/blob/master/src/test/org/apache/nutch/plugin/TestPluginSystem.java]]…
 only 7 tests. Traditionally, developers have focused on providing unit tests 
on the plugin-level as opposed to the legacy plugin framework.
 # sees very low maintenance/attention. It is my gut feeling (and I may be 
totally wrong here) but I _think_ that not many people know much about the core 
legacy plugin framework.
 # writing plugins is clunky. This largely has to do with the legacy Ant + Ivy 
build and dependency management system, but that being said, it is clunky 
nonetheless.

*This issue therefore proposes to overhaul the* *legacy* *Nutch plugin 
framework and replace it with Plugin Framework for Java (PF4J).*
h1. Task Breakdown

The following is a proposed breakdown of this overall initiative into Epics. 
These Epics should likely be decomposed further but that will be left down to 
the implementer(s).
 * {*}perform feasibility study{*}; touch base with the PF4J community, 
describe the intention to replace the legacy Nutch plugin framework with PF4J. 
Obtain guidance on how to proceed. Document this all in the Nutch wiki.
 * {*}document the legacy Nutch plugin lifecycle{*}; taking inspiration from 
[PF4J’s plugin lifecycle 
documentation|[https://pf4j.org/doc

[jira] [Updated] (NUTCH-3034) Overhaul the legacy Nutch plugin framework and replace it with PF4J

2024-03-12 Thread Lewis John McGibbney (Jira)



 [ 
https://issues.apache.org/jira/browse/NUTCH-3034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-3034:

Description: 
h1. Motivation

Plugins provide a large part of the functionality of Nutch. Although the legacy 
plugin framework continues to offer lots of value i.e.,
 # [some aspects e.g. examples, are fairly well 
documented|[https://cwiki.apache.org/confluence/display/NUTCH/PluginCentral]]
 # it is generally stable, and
 # offers reasonable test coverage (on a plugin-by-plugin basis)
 # … probably loads more positives which I am overlooking...

… there are also several aspects which could be improved
 # the [core framework is sparsely 
documented|[https://cwiki.apache.org/confluence/display/NUTCH/WhichTechnicalConceptsAreBehindTheNutchPluginSystem]],
 this extends to very important aspects like the {*}plugin lifecycle{*}, 
{*}classloading{*}, {*}packaging{*}, {*}thread safety{*}, and lots of other 
topics which are of intrinsic value to developers and maintainers. 
 # the core framework is somewhat [sparsely 
tested|[https://github.com/apache/nutch/blob/master/src/test/org/apache/nutch/plugin/TestPluginSystem.java]]…
 only 7 tests. Traditionally, developers have focused on providing unit tests 
on the plugin-level as opposed to the legacy plugin framework.
 # sees very low maintenance/attention. It is my gut feeling (and I may be 
totally wrong here) but I _think_ that not many people know much about the core 
legacy plugin framework.
 # writing plugins is clunky. This largely has to do with the legacy Ant + Ivy 
build and dependency management system, but that being said, it is clunky 
nonetheless.

*This issue therefore proposes to overhaul the* *legacy* *Nutch plugin 
framework and replace it with Plugin Framework for Java (PF4J).*
h1. Task Breakdown

The following is a proposed breakdown of this overall initiative into Epics. 
These Epics should likely be decomposed further but that will be left down to 
the implementer(s).
 * {*}perform feasibility study{*}; touch base with the PF4J community, 
describe the intention to replace the legacy Nutch plugin framework with PF4J. 
Obtain guidance on how to proceed. Document this all in the Nutch wiki.
 * {*}document the legacy Nutch plugin lifecycle{*}; taking inspiration from 
[PF4J’s plugin lifecycle 
documentation|[https://pf4j.org/doc/plugin-lifecycle.html]] provide both 
documentation and a diagram which clearly outline how the legacy plugin 
lifecycle works. Might also be a good idea to make a contribution to PF4J and 
provide them with a diagram to accompany their documentation :)
 *  

 

  was:
h1. Motivation

Plugins provide a large part of the functionality of Nutch. Although the legacy 
plugin framework continues to offer lots of value i.e.,
 # [some aspects e.g. examples, are fairly well 
documented|[https://cwiki.apache.org/confluence/display/NUTCH/PluginCentral]]
 # it is generally stable, and
 # offers reasonable test coverage (on a plugin-by-plugin basis)
 # … probably loads more positives which I am overlooking...

… there are also several aspects which could be improved
 # the [core framework is sparsely 
documented|[https://cwiki.apache.org/confluence/display/NUTCH/WhichTechnicalConceptsAreBehindTheNutchPluginSystem]],
 this extends to very important aspects like the {*}plugin lifecycle{*}, 
{*}classloading{*}, {*}packaging{*}, {*}thread safety{*}, and lots of other 
topics which are of intrinsic value to developers and maintainers. 
 # the core framework is somewhat [sparsely 
tested|[https://github.com/apache/nutch/blob/master/src/test/org/apache/nutch/plugin/TestPluginSystem.java]]…
 only 7 tests. Traditionally, developers have focused on providing unit tests 
on the plugin-level as opposed to the legacy plugin framework.
 # sees very low maintenance/attention. It is my gut feeling (and I may be 
totally wrong here) but I _think_ that not many people know much about the core 
legacy plugin framework.
 # writing plugins is clunky. This largely has to do with the legacy Ant + Ivy 
build and dependency management system, but that being said, it is clunky 
nonetheless.

*This issue therefore proposes to overhaul the* *legacy* *Nutch plugin 
framework and replace it with Plugin Framework for Java (PF4J).*
h1. Task Breakdown

The following is a proposed breakdown of this overall initiative into Epics. 
These Epics should likely be decomposed further but that will be left down to 
the implementer(s).
 * {*}perform feasibility study{*}; touch base with the PF4J community, 
describe the intention to replace the legacy Nutch plugin framework with PF4J. 
Obtain guidance on how to proceed. Document this all in the Nutch wiki.
 * {*}document the legacy Nutch plugin lifecycle{*}; taking inspiration from

[jira] [Created] (NUTCH-3034) Overhaul the legacy Nutch plugin framework and replace it with PF4J

2024-03-12 Thread Lewis John McGibbney (Jira)

Lewis John McGibbney created NUTCH-3034:
---

 Summary: Overhaul the legacy Nutch plugin framework and replace it 
with PF4J
 Key: NUTCH-3034
 URL: https://issues.apache.org/jira/browse/NUTCH-3034
 Project: Nutch
  Issue Type: Improvement
  Components: pf4j, plugin
Reporter: Lewis John McGibbney


h1. Motivation

Plugins provide a large part of the functionality of Nutch. Although the legacy 
plugin framework continues to offer lots of value i.e.,
 # [some aspects e.g. examples, are fairly well 
documented|[https://cwiki.apache.org/confluence/display/NUTCH/PluginCentral]]
 # it is generally stable, and
 # offers reasonable test coverage (on a plugin-by-plugin basis)
 # … probably loads more positives which I am overlooking...

… there are also several aspects which could be improved
 # the [core framework is sparsely 
documented|[https://cwiki.apache.org/confluence/display/NUTCH/WhichTechnicalConceptsAreBehindTheNutchPluginSystem]],
 this extends to very important aspects like the {*}plugin lifecycle{*}, 
{*}classloading{*}, {*}packaging{*}, {*}thread safety{*}, and lots of other 
topics which are of intrinsic value to developers and maintainers. 
 # the core framework is somewhat [sparsely 
tested|[https://github.com/apache/nutch/blob/master/src/test/org/apache/nutch/plugin/TestPluginSystem.java]]…
 only 7 tests. Traditionally, developers have focused on providing unit tests 
on the plugin-level as opposed to the legacy plugin framework.
 # sees very low maintenance/attention. It is my gut feeling (and I may be 
totally wrong here) but I _think_ that not many people know much about the core 
legacy plugin framework.
 # writing plugins is clunky. This largely has to do with the legacy Ant + Ivy 
build and dependency management system, but that being said, it is clunky 
nonetheless.

*This issue therefore proposes to overhaul the* *legacy* *Nutch plugin 
framework and replace it with Plugin Framework for Java (PF4J).*
h1. Task Breakdown

The following is a proposed breakdown of this overall initiative into Epics. 
These Epics should likely be decomposed further but that will be left down to 
the implementer(s).
 * {*}perform feasibility study{*}; touch base with the PF4J community, 
describe the intention to replace the legacy Nutch plugin framework with PF4J. 
Obtain guidance on how to proceed. Document this all in the Nutch wiki.
 * {*}document the legacy Nutch plugin lifecycle{*}; taking inspiration from 
[PF4J’s plugin lifecycle 
documentation|[https://pf4j.org/doc/plugin-lifecycle.html]] provide both 
documentation and a diagram which clearly outline how the legacy plugin 
lifecycle works. Might also be a good idea to make a contribution to PF4J and 
provide them with a diagram to accompany their documentation :)
 *  

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Differences in retrieve pattern between Ivy 2.5.0/2.5.1 & 2.5.2?

2024-03-11 Thread lewis john mcgibbney

Hi ivy-user@,
I am working on upgrading Ivy to the latest release in the Apache Nutch project.
The build works just fine with 2.5.0 and 2.5.1, but with 2.5.2 the CI
fails with the following complaint:

/home/runner/work/nutch/nutch/src/plugin/build-plugin.xml:234:
impossible to ivy retrieve: java.lang.RuntimeException: problem during
retrieve of org.apache.nutch#lib-htmlunit: java.lang.RuntimeException:
Multiple artifacts of the module
io.netty#netty-transport-native-kqueue;4.1.84.Final are retrieved to
the same file! Update the retrieve pattern to fix this error.

I’m not sure what to do here… any ideas would be appreciated.

The Nutch ivysettings.xml can be found at
https://github.com/apache/nutch/blob/master/ivy/ivysettings.xml

Thanks for any assistance.
lewismc


-- 
http://home.apache.org/~lewismc/
http://people.apache.org/keys/committer/lewismc
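
The error text asks for a retrieve pattern that maps every artifact to a
distinct file. One way such a collision can arise, assuming the module
publishes several jars that differ only by classifier, is sketched below in
Java. The sketch does not call Ivy: `expand` is a hypothetical, simplified
stand-in for Ivy's pattern substitution, the two classifier values are
illustrative, and the patterns are not taken from the Nutch build files.
Whether this is the actual cause of the 2.5.2 failure is not established in
this thread.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; it only imitates Ivy-style pattern expansion.
class RetrievePatternDemo {
    // "(...)" marks an optional part, "[name]" marks a token.
    private static final Pattern OPTIONAL = Pattern.compile("\\(([^()]*)\\)");
    private static final Pattern TOKEN = Pattern.compile("\\[([a-z]+)\\]");

    // Replaces every [token]; returns null if any token has no value.
    static String fillTokens(String text, Map<String, String> attrs) {
        Matcher m = TOKEN.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String value = attrs.get(m.group(1));
            if (value == null || value.isEmpty()) {
                return null;
            }
            m.appendReplacement(out, Matcher.quoteReplacement(value));
        }
        m.appendTail(out);
        return out.toString();
    }

    // Drops optional parts whose tokens are unset, then fills the rest.
    static String expand(String pattern, Map<String, String> attrs) {
        Matcher m = OPTIONAL.matcher(pattern);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String filled = fillTokens(m.group(1), attrs);
            m.appendReplacement(out, Matcher.quoteReplacement(filled == null ? "" : filled));
        }
        m.appendTail(out);
        String result = fillTokens(out.toString(), attrs);
        if (result == null) {
            throw new IllegalArgumentException("missing mandatory token in " + pattern);
        }
        return result;
    }

    // Attributes of one artifact; the classifier may be null.
    static Map<String, String> artifact(String classifier) {
        Map<String, String> attrs = new HashMap<>();
        attrs.put("artifact", "netty-transport-native-kqueue");
        attrs.put("revision", "4.1.84.Final");
        attrs.put("ext", "jar");
        if (classifier != null) {
            attrs.put("classifier", classifier);
        }
        return attrs;
    }

    // True if two artifacts would be written to the same target path.
    static boolean collides(String pattern, String... classifiers) {
        Set<String> seen = new HashSet<>();
        for (String classifier : classifiers) {
            if (!seen.add(expand(pattern, artifact(classifier)))) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String plain = "lib/[artifact]-[revision].[ext]";
        String withClassifier = "lib/[artifact]-[revision](-[classifier]).[ext]";
        System.out.println(collides(plain, "osx-x86_64", "osx-aarch_64"));
        System.out.println(collides(withClassifier, "osx-x86_64", "osx-aarch_64"));
    }
}
```

With the first pattern both artifacts map to one path. Adding the optional
`(-[classifier])` part gives each artifact its own file name and leaves
artifacts without a classifier unchanged.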

[jira] [Created] (NUTCH-3033) Upgrade Ivy to v2.5.2

2024-03-11 Thread Lewis John McGibbney (Jira)

Lewis John McGibbney created NUTCH-3033:
---

 Summary: Upgrade Ivy to v2.5.2
 Key: NUTCH-3033
 URL: https://issues.apache.org/jira/browse/NUTCH-3033
 Project: Nutch
  Issue Type: Task
  Components: ivy
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
 Fix For: 1.20


Ivy v2.5.2 was released August 20th 2023. Let’s upgrade.

[https://ant.apache.org/ivy/history/2.5.2/release-notes.html]




[jira] [Work started] (NUTCH-3033) Upgrade Ivy to v2.5.2

2024-03-11 Thread Lewis John McGibbney (Jira)



 [ 
https://issues.apache.org/jira/browse/NUTCH-3033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on NUTCH-3033 started by Lewis John McGibbney.
---
> Upgrade Ivy to v2.5.2
> -
>
> Key: NUTCH-3033
> URL: https://issues.apache.org/jira/browse/NUTCH-3033
> Project: Nutch
>  Issue Type: Task
>  Components: ivy
>    Reporter: Lewis John McGibbney
>    Assignee: Lewis John McGibbney
>Priority: Major
> Fix For: 1.20
>
>
> Ivy v2.5.2 was released August 20th 2023. Let’s upgrade.
> [https://ant.apache.org/ivy/history/2.5.2/release-notes.html]




Re: [DISCUSS] Release Nutch 1.20

2024-03-10 Thread Lewis John McGibbney

Nice  
I see that we are a couple of releases behind on Ivy as well, so I’ll submit a 
patch for that.
I can push this release this time. It’s been a while since I exercised the 
workflow and it would be good to blow away the cobwebs.
lewismc

On 2024/03/10 11:55:20 Markus Jelsma wrote:
> Good idea! I'll finish work on three open issues the next week.
> 
> Op za 9 mrt 2024 om 13:02 schreef Sebastian Nagel <
> wastl.na...@googlemail.com>:
> 
> > Hi Lewis,
> >
> > yes, of course!
> >
> > Some points we should do before the release:
> >
> > - address the ES licensing issue,
> >the easiest way is to downgrade, see NUTCH-3008
> >If done update the license-related files.
> >
> > - there are three short PRs open
> >
> > I'll try to have a look at these points the next days.
> >
> > Best,
> > Sebastian
> >
> >
> > On 3/8/24 01:43, lewis john mcgibbney wrote:
> > > Hi dev@,
> > > As of today, 51 issues have been addressed in the 1.20 development drive.
> > > https://issues.apache.org/jira/projects/NUTCH/versions/12352190
> > > <https://issues.apache.org/jira/projects/NUTCH/versions/12352190>
> > > I would like to push a release soon and ship it to the user community.
> > > Any objections?
> > > Thank you
> > > lewismc
> > >
> >
>

Re: Indexing arbitrary fields

2024-03-08 Thread Lewis John McGibbney

Hi Joe,
Thanks for describing your work in detail. It provides a great utility which I 
think could be of immense value.
Please feel free to create a JIRA ticket which can be used as the basis for 
linking to the prior similar examples you referenced.
A WIP pull request would be ideal.
Thanks
lewismc

On 2024/03/08 01:06:18 Joe Gilvary wrote:
> Good day, all,
> 
> I wanted to index some values that I had to derive from fields in the 
> NutchDocument. I started on an indexing plugin. Then I realized I would 
> need more than one, or I could generalize the plugin. I went with the 
> generalizing and wrote a plugin that will use custom POJOs to process & 
> inject whatever the Nutch user wants, based on properties in 
> NUTCH_CONF_DIR/nutch-site.xml. I've tested it so far with
> 
> one POJO that uses jsoup to extract values from the page based on a CSS 
> selector specified in nutch-site.xml,
> 
> another POJO that takes a regex from nutch-site.xml and applies it to 
> the URL to determine how "deep" the URL directory structure goes for the 
> document,
> 
> and a third toy POJO to take multiple arguments from nutch-site.xml and 
> return their product. That last test was just to be sure the plug-in 
> would handle more than two arguments in the property value.
> 
> There's an optional boolean in the config to set whether to overwrite an 
> existing field, or (by default) add to it. Finally, I hacked a naming 
> convention and the way the plugin uses the setConf() call so the plugin 
> will accept configuration for multiple different POJOs to set multiple 
> fields in the NutchDocument. I didn't see any examples of a plugin 
> running more than once for each document quite that way, so I'm not sure 
> if this conforms to whatever canonical approach might exist.
> 
> I think of this plugin as a way to extend the reach of the plugin 
> architecture's flexibility out to POJO-land :) for anyone who 
> can't/won't for whatever reason write a plugin of their own. The POJOs 
> have to accept a String in a constructor, but they don't work on 
> NutchDocument or CrawlDatum or anything. I think if the plugin wants to 
> pass all that to a POJO for reflection, it's a clever way to waste time 
> when the work could be done in the plugin itself. For some subset of 
> indexing requirements, I think this could be useful to a wider set of 
> users. Still, I'm not a wider set of users, so I'm asking here.
> 
> NUTCH-585 has a lot of discussion about a concern similar to what this 
> jsoup example enables and Solr itself includes the 
> URLClassifierProcessor that addresses the same type of task that the 
> regex example shows, so is there any interest in this kind of 
> generalized plugin? Just from those examples, it could enable some 
> altered version of those capabilities. I've only built and tested with 
> the 1.19 branch and main branch code so far, and only with a Solr 9.2.1 
> cloud install, 'cause that's what I'm running, but if it seems 
> worthwhile to others, I'll beef up the documentation and write JUnit cases.
> 
>   Thanks, stay safe, stay healthy,
> 
>   Joe
> 
>
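
The mechanism Joe describes, a plugin that instantiates user-supplied POJOs
through a String constructor, could look roughly like the Java sketch below.
The `Function<String, String>` contract, the `UrlDepth` class and the `create`
helper are assumptions made for illustration. The email does not show the
plugin's real interface or property names, and the sketch does not touch any
Nutch API.

```java
import java.net.URI;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of Nutch or of the plugin in the email.
class PojoFieldDemo {

    // Hypothetical POJO: configured by one String, knows nothing about Nutch types.
    public static class UrlDepth implements Function<String, String> {
        private final Pattern segment;

        public UrlDepth(String regex) {
            this.segment = Pattern.compile(regex);
        }

        // Counts regex matches in the URL path and returns the count as text.
        @Override
        public String apply(String url) {
            Matcher m = segment.matcher(URI.create(url).getPath());
            int depth = 0;
            while (m.find()) {
                depth++;
            }
            return Integer.toString(depth);
        }
    }

    // Loads a class by name and calls its public (String) constructor.
    @SuppressWarnings("unchecked")
    static Function<String, String> create(String className, String arg) {
        try {
            Class<?> type = Class.forName(className);
            return (Function<String, String>) type.getConstructor(String.class).newInstance(arg);
        } catch (ReflectiveOperationException e) {
            throw new IllegalArgumentException("cannot instantiate " + className, e);
        }
    }

    public static void main(String[] args) {
        // In a real plugin the class name and argument would come from configuration.
        Function<String, String> depth = create(UrlDepth.class.getName(), "/[^/]+");
        System.out.println(depth.apply("https://example.org/a/b/c.html"));
    }
}
```

Keeping the POJO contract to plain Strings matches the point made in the
email: the plugin stays responsible for NutchDocument handling, and the POJO
only computes a value.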

[DISCUSS] Release Nutch 1.20

2024-03-07 Thread lewis john mcgibbney

Hi dev@,
As of today, 51 issues have been addressed in the 1.20 development drive.
https://issues.apache.org/jira/projects/NUTCH/versions/12352190
I would like to push a release soon and ship it to the user community.
Any objections?
Thank you
lewismc

Re: [DISCUSS] Graduate Apache SDAP (Incubating) as a Top Level Project

2024-03-07 Thread Lewis John McGibbney

Julien has very succinctly described the community growth challenges and 
podling direction. For a number of years I acted as mentor for SDAP and was 
puzzled by the inability of the community to push releases. This still 
concerns me...

That being said, there is definitely potential (the software is being used) and 
I do feel that SDAP should graduate.

Please carry my +1 through to a VOTE.

Thanks, and congratulations to the SDAP community… and a HUGE thanks to Julien 
as well.

lewismc

On 2024/02/22 18:01:31 Riley Kuttruff wrote:
> Hi all,
> 
> Apache SDAP joined Incubator in October 2017. In the time since, we've 
> made significant progress towards maturing our community and our 
> project and adopting the Apache Way.
> 
> After community discussion [1][2][3], the community has voted [4] that we 
> would like to proceed with graduation [5]. We now call upon the Incubator 
> PMC to review and discuss our progress and would appreciate any and all 
> feedback towards graduation.
> 
> Below are some facts and project highlights from the incubation phase as 
> well as the draft resolution:
> 
> - Our community consists of 21 committers, with 2 being mentors and 
> the remaining 19 serving as our PPMC
> - Several pending and planned invites to bring on new committers and/or
> PPMC members from additional organizations
> - Completed 2 releases with 2 release managers - with a 3rd release run by
> a 3rd release manager in progress
> - Our software is currently being utilized by organizations such as NASA 
> Jet Propulsion Laboratory, NSF National Center for Atmospheric Research, 
> Florida State University, and George Mason University in support of projects 
> such as the NASA Sea Level Change Portal, Estimating the Circulation and 
> Climate of the Ocean (ECCO) project, GRACE/GRACE-FO, Cloud-based 
> Data Match-Up Service, Integrated Digital Earth Analysis System (IDEAS), 
> and many others.  
> - Opened 400+ PRs across 3 main code repositories, 350+ of which are
> merged or closed (some are pending our next release)
> - Maturity model self assessment [6]
> 
> We have resolved all branding issues we are aware of: logo, GitHub, 
> Website, etc
> 
> We’d like to also extend a sincere thank you to our mentors, current and
> former for their invaluable insight and assistance with getting us to this
> point.
> 
> Thank you, Julian, Jörn, Trevor, Lewis, Suneel, and Raphael!
> 
> ---
> 
> Establish the Apache SDAP Project
> 
> WHEREAS, the Board of Directors deems it to be in the best interests of
> the Foundation and consistent with the Foundation's purpose to establish
> a Project Management Committee charged with the creation and maintenance
> of open-source software, for distribution at no charge to the public,
> related to an integrated data analytic center for Big Science problems.
> 
> NOW, THEREFORE, BE IT RESOLVED, that a Project Management Committee
> (PMC), to be known as the "Apache SDAP Project", be and hereby is
> established pursuant to Bylaws of the Foundation; and be it further
> 
> RESOLVED, that the Apache SDAP Project be and hereby is responsible
> for the creation and maintenance of software related to an integrated data 
> analytic center for Big Science problems; and be it further
> 
> RESOLVED, that the office of "Vice President, Apache SDAP" be and
> hereby is created, the person holding such office to serve at the
> direction of the Board of Directors as the chair of the Apache SDAP
> Project, and to have primary responsibility for management of the
> projects within the scope of responsibility of the Apache SDAP
> Project; and be it further
> 
> RESOLVED, that the persons listed immediately below be and hereby are
> appointed to serve as the initial members of the Apache SDAP Project:
> 
> - Edward M Armstrong 
> - Nga Thien Chung 
> - Thomas Cram 
> - Frank Greguska 
> - Thomas Huang 
> - Julian Hyde 
> - Joseph C. Jacob 
> - Jason Kang 
> - Riley Kuttruff 
> - Thomas G Loubrieu 
> - Kevin Marlis 
> - Stepheny Perez 
> - Wai Linn Phyo 
> 
> NOW, THEREFORE, BE IT FURTHER RESOLVED, that Nga Thien Chung 
> be appointed to the office of Vice President, Apache SDAP, to serve in 
> accordance with and subject to the direction of the Board of Directors 
> and the Bylaws of the Foundation until death, resignation, retirement, 
> removal or disqualification, or until a successor is appointed; and be it 
> further
> 
> RESOLVED, that the Apache SDAP Project be and hereby is tasked with
> the migration and rationalization of the Apache Incubator SDAP
> podling; and be it further
> 
> RESOLVED, that all responsibilities pertaining to the Apache Incubator
> SDAP podling encumbered upon the Apache Incubator PMC are hereafter
> discharged.
> 
> [1] https://lists.apache.org/thread/vjwjmp0h2f22dv423h262cvdg5x7jl03
> [2] https://lists.apache.org/thread/m9vqwv23jdsofwgmhgxg25f5l1v2j7nz
> [3]

Re: [DISCUSS] Incubating Proposal for StormCrawler

2024-03-07 Thread Lewis John McGibbney

I think StormCrawler would be an excellent candidate for the Incubator. 
If the podling is looking for an additional mentor, I would be happy to chip in.
lewismc

On 2024/03/03 23:24:38 PJ Fanning wrote:
> Hi everyone,
> 
> I would like to propose StormCrawler [1] as a new Apache Incubator project,
> and you can examine the proposal [2] for more details.
> 
> StormCrawler is a collection of resources for building low-latency,
> customisable and scalable web crawlers on Apache Storm.
> 
> Proposal
> 
> The aim of StormCrawler is to help build web crawlers that are:
> 
> * scalable
> * resilient
> * low latency
> * easy to extend
> * polite yet efficient
> 
> StormCrawler achieves this partly with Apache Storm, which it is based
> on. To use an analogy, Apache Storm is to StormCrawler what Apache
> Hadoop is to Apache Nutch.
> 
> StormCrawler is mature (26 releases to date) and is used by many
> organisations world-wide.
> 
> Initial Committers
> 
> Julien Nioche [jnio...@apache.org https://github.com/jnioche]
> Sebastian Nagel [sna...@apache.org https://github.com/sebastian-nagel]
> Richard Zowalla [r...@apache.org  https://github.com/rzo1]
> Tim Allison [talli...@apache.org https://github.com/tballison]
> Michael Dinzinger [michael.dinzin...@uni-passau.de
> https://github.com/michaeldinzinger]
> 
> Most of the existing StormCrawler contributors are existing ASF
> committers and are looking to build a vibrant community following the
> Apache Way.
> 
> I will help this project as the champion and mentor. We would welcome
> additional mentors, if anyone has an interest in helping.
> 
> We are looking forward to your questions and feedback.
> 
> Thanks,
> PJ
> 
> [1] https://github.com/DigitalPebble/storm-crawler
> [2] 
> https://cwiki.apache.org/confluence/display/INCUBATOR/StormCrawler+Proposal
> 
> 


Re: [DISCUSS] Graduate Apache Celeborn (Incubating) as a Top Level Project

2024-03-07 Thread Lewis John McGibbney

+1 
Excellent work on the Incubating releases and community building,
lewismc

On 2024/03/05 06:00:49 Yu Li wrote:
> Hi All,
> 
> Apache Celeborn joined Incubator in October 2022 [1]. Since then,
> we've made significant progress towards maturing our community and
> adopting the Apache Way.
> 
> After a thorough discussion [2], the community has voted [3] that we
> would like to proceed with graduation [4]. Furthermore, we'd like to
> call upon the Incubator PMC to review and discuss our progress and
> would appreciate any and all feedback towards graduation.
> 
> Below are some facts and project highlights from the incubation phase
> as well as the draft resolution:
> 
> - Currently, our community consists of 19 committers (including
> mentors) from more than 10 companies, with 13 serving as PPMC members
> [5].
> - So far, we have boasted 81 contributors.
> - Throughout the incubation period, we've made 6 releases [6] in 16
> months, at a stable pace.
> - We've had 6 different release managers to date.
> - Our software is used in production by 10+ well known entities [7].
> - As yet, we have opened 1,302 issues with 1,191 successfully resolved [8].
> - We have submitted a total of 1,840 PRs, out of which 1,830 have been
> merged or closed [9].
> - Through self-assessment [10], we have met all maturity criteria as
> outlined in [11].
> 
> We've resolved all branding issues which include Logo, GitHub repo,
> document, website, and others [12] [13].
> 
> We'd also like to take this opportunity to extend a sincere thank you
> to our mentors, for their invaluable insight and assistance with
> getting us to this point.
> 
> Thanks a lot, Becket Qin, Duo Zhang, Lidong Dai, Willem Ning Jiang and Yu Li!
> 
> ---
> 
> Establish the Apache Celeborn Project
> 
> WHEREAS, the Board of Directors deems it to be in the best interests of
> the Foundation and consistent with the Foundation's purpose to establish
> a Project Management Committee charged with the creation and maintenance
> of open-source software, for distribution at no charge to the public,
> related to an intermediate data service for big data computing engines
> to boost performance, stability, and flexibility.
> 
> NOW, THEREFORE, BE IT RESOLVED, that a Project Management Committee
> (PMC), to be known as the "Apache Celeborn Project", be and hereby is
> established pursuant to Bylaws of the Foundation; and be it further
> 
> RESOLVED, that the Apache Celeborn Project be and hereby is responsible
> for the creation and maintenance of software related to an intermediate
> data service for big data computing engines to boost performance,
> stability, and flexibility; and be it further
> 
> RESOLVED, that the office of "Vice President, Apache Celeborn" be and
> hereby is created, the person holding such office to serve at the
> direction of the Board of Directors as the chair of the Apache Celeborn
> Project, and to have primary responsibility for management of the
> projects within the scope of responsibility of the Apache Celeborn
> Project; and be it further
> 
> RESOLVED, that the persons listed immediately below be and hereby are
> appointed to serve as the initial members of the Apache Celeborn
> Project:
> 
> * Becket Qin 
> * Cheng Pan 
> * Duo Zhang 
> * Ethan Feng 
> * Fu Chen 
> * Jiashu Xiong 
> * Kerwin Zhang 
> * Keyong Zhou 
> * Lidong Dai 
> * Willem Ning Jiang 
> * Wu Wei 
> * Yi Zhu 
> * Yu Li 
> 
> NOW, THEREFORE, BE IT FURTHER RESOLVED, that Keyong Zhou be appointed to
> the office of Vice President, Apache Celeborn, to serve in accordance
> with and subject to the direction of the Board of Directors and the
> Bylaws of the Foundation until death, resignation, retirement, removal
> or disqualification, or until a successor is appointed; and be it
> further
> 
> RESOLVED, that the Apache Celeborn Project be and hereby is tasked with
> the migration and rationalization of the Apache Incubator Celeborn
> podling; and be it further
> 
> RESOLVED, that all responsibilities pertaining to the Apache Incubator
> Celeborn podling encumbered upon the Apache Incubator PMC are hereafter
> discharged.
> 
> ---
> 
> Best Regards,
> Yu (on behalf of the Apache Celeborn PPMC)
> 
> [1] https://incubator.apache.org/projects/celeborn.html
> [2] https://lists.apache.org/thread/z17rs0mw4nyv0s112dklmv7s3j053mby
> [3] https://lists.apache.org/thread/p1gykvxog456v5chvwmr4wk454qzmh3o
> [4] https://lists.apache.org/thread/tqhh28q9r38czx677nh2ktc97tnlndw3
> [5] https://celeborn.apache.org/community/project_management_committee
> [6] 
> https://issues.apache.org/jira/projects/CELEBORN?selectedItem=com.atlassian.jira.jira-projects-plugin:release-page=released
> [7] https://github.com/apache/incubator-celeborn/issues/2140
> [8] https://s.apache.org/celeborn_jira_issues
> [9] https://github.com/apache/incubator-celeborn/pulls
> [10] 
>

[jira] [Closed] (NUTCH-3024) Remove flaky 'dependency check' target

2023-11-24 Thread Lewis John McGibbney (Jira)



 [ 
https://issues.apache.org/jira/browse/NUTCH-3024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney closed NUTCH-3024.
---

> Remove flaky 'dependency check' target
> --
>
> Key: NUTCH-3024
> URL: https://issues.apache.org/jira/browse/NUTCH-3024
> Project: Nutch
>  Issue Type: Task
>  Components: build
>Affects Versions: 1.19
>    Reporter: Lewis John McGibbney
>    Assignee: Lewis John McGibbney
>Priority: Minor
> Fix For: 1.20
>
>
> I [started a 
> thread|https://lists.apache.org/thread/ol3ssjphdqqxwsxhc65qoqg1dj1kjbxb] 
> covering my observations running the ant _*dependency-check*_ target. It 
> fails unpredictably in both GitHub actions and our trusty Jenkins builds on 
> ci-builds.apache.org.
> I propose to simply remove this target (and associated configuration) in a 
> bid to clean up some flaky legacy build code.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Resolved] (NUTCH-3024) Remove flaky 'dependency check' target

2023-11-24 Thread Lewis John McGibbney (Jira)



 [ 
https://issues.apache.org/jira/browse/NUTCH-3024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney resolved NUTCH-3024.
-
Resolution: Fixed

> Remove flaky 'dependency check' target
> --
>
> Key: NUTCH-3024
> URL: https://issues.apache.org/jira/browse/NUTCH-3024
> Project: Nutch
>  Issue Type: Task
>  Components: build
>Affects Versions: 1.19
>    Reporter: Lewis John McGibbney
>    Assignee: Lewis John McGibbney
>Priority: Minor
> Fix For: 1.20
>
>
> I [started a 
> thread|https://lists.apache.org/thread/ol3ssjphdqqxwsxhc65qoqg1dj1kjbxb] 
> covering my observations running the ant _*dependency-check*_ target. It 
> fails unpredictably in both GitHub actions and our trusty Jenkins builds on 
> ci-builds.apache.org.
> I propose to simply remove this target (and associated configuration) in a 
> bid to clean up some flaky legacy build code.




[jira] [Updated] (TIKA-4169) Create a parser for Functional Mockup Unit (FMU) media type with .fmu extension

2023-11-13 Thread Lewis John McGibbney (Jira)



 [ 
https://issues.apache.org/jira/browse/TIKA-4169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated TIKA-4169:
---
Description: 
A Functional Mockup Unit (FMU) is a software component used for exchanging and 
simulating dynamic system models. It is designed to enable simulations of 
system models regardless of the simulation tool, programming language, or 
hardware platform. This is made possible through a standard interface that 
allows FMUs to be exported and imported across different simulation 
environments.

The FMU media type ships with the .fmu file suffix

I think the MIT licensed [NTNU-IHB/FMI4j|https://github.com/NTNU-IHB/FMI4j] can 
be used as the underlying parser implementation.

I will go on the hunt for some sample files we can use in unit tests. I think 
we can make some available via 
[https://github.com/Open-MBEE/perseverance-modelica]

  was:
A Functional Mockup Unit (FMU) is a software component used for exchanging and 
simulating dynamic system models. It is designed to enable simulations of 
system models regardless of the simulation tool, programming language, or 
hardware platform. This is made possible through a standard interface that 
allows FMUs to be exported and imported across different simulation 
environments.

The FMU media type ships with the .fmu file suffix

I think the MIT licensed [NTNU-IHB/FMI4j|https://github.com/NTNU-IHB/FMI4j] can 
be used as the underlying parser implementation.


> Create a parser for Functional Mockup Unit (FMU) media type with .fmu 
> extension
> ---
>
> Key: TIKA-4169
> URL: https://issues.apache.org/jira/browse/TIKA-4169
> Project: Tika
>  Issue Type: New Feature
>  Components: parser
>    Reporter: Lewis John McGibbney
>    Assignee: Lewis John McGibbney
>Priority: Minor
>
> A Functional Mockup Unit (FMU) is a software component used for exchanging 
> and simulating dynamic system models. It is designed to enable simulations of 
> system models regardless of the simulation tool, programming language, or 
> hardware platform. This is made possible through a standard interface that 
> allows FMUs to be exported and imported across different simulation 
> environments.
> The FMU media type ships with the .fmu file suffix
> I think the MIT licensed [NTNU-IHB/FMI4j|https://github.com/NTNU-IHB/FMI4j] 
> can be used as the underlying parser implementation.
> I will go on the hunt for some sample files we can use in unit tests. I think 
> we can make some available via 
> [https://github.com/Open-MBEE/perseverance-modelica]
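
For context, an FMU is distributed as a ZIP archive with a modelDescription.xml file at its root (per the FMI standard), so a detector could start by looking for that entry. The following is only a minimal sketch under that assumption. FmuSniffer and looksLikeFmu are hypothetical names, and the code uses plain java.util.zip rather than the Tika or FMI4j APIs.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper, not part of Tika or FMI4j.
final class FmuSniffer {

  // Assumption: an FMU is a ZIP archive containing "modelDescription.xml"
  // at the archive root.
  static boolean looksLikeFmu(InputStream in) throws IOException {
    try (ZipInputStream zip = new ZipInputStream(in)) {
      ZipEntry entry;
      while ((entry = zip.getNextEntry()) != null) {
        if ("modelDescription.xml".equals(entry.getName())) {
          return true;
        }
      }
    }
    // Not a ZIP archive, or no model description found.
    return false;
  }
}
```

A real parser would go on to read the model description and emit metadata; this check only covers the detection step.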




[jira] [Updated] (TIKA-4169) Create a parser for Functional Mockup Unit (FMU) media type with .fmu extension

2023-11-13 Thread Lewis John McGibbney (Jira)



 [ 
https://issues.apache.org/jira/browse/TIKA-4169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated TIKA-4169:
---
Description: 
A Functional Mockup Unit (FMU) is a software component used for exchanging and 
simulating dynamic system models. It is designed to enable simulations of 
system models regardless of the simulation tool, programming language, or 
hardware platform. This is made possible through a standard interface that 
allows FMUs to be exported and imported across different simulation 
environments.

The FMU media type ships with the .fmu file suffix

I think the MIT licensed [NTNU-IHB/FMI4j|https://github.com/NTNU-IHB/FMI4j] can 
be used as the underlying parser implementation.

  was:
A Functional Mockup Unit (FMU) is a software component used for exchanging and 
simulating dynamic system models. It is designed to enable simulations of 
system models regardless of the simulation tool, programming language, or 
hardware platform. This is made possible through a standard interface that 
allows FMUs to be exported and imported across different simulation 
environments.

 

The FMU media type ships with the .fmu file suffix 

 

I think the MIT licensed [NTNU-IHB/FMI4j|[https://github.com/NTNU-IHB/FMI4j]] 
can be used as the underlying parser implementation.


> Create a parser for Functional Mockup Unit (FMU) media type with .fmu 
> extension
> ---
>
> Key: TIKA-4169
> URL: https://issues.apache.org/jira/browse/TIKA-4169
> Project: Tika
>  Issue Type: New Feature
>  Components: parser
>    Reporter: Lewis John McGibbney
>    Assignee: Lewis John McGibbney
>Priority: Minor
>
> A Functional Mockup Unit (FMU) is a software component used for exchanging 
> and simulating dynamic system models. It is designed to enable simulations of 
> system models regardless of the simulation tool, programming language, or 
> hardware platform. This is made possible through a standard interface that 
> allows FMUs to be exported and imported across different simulation 
> environments.
> The FMU media type ships with the .fmu file suffix
> I think the MIT licensed [NTNU-IHB/FMI4j|https://github.com/NTNU-IHB/FMI4j] 
> can be used as the underlying parser implementation.




[jira] [Created] (TIKA-4169) Create a parser for Functional Mockup Unit (FMU) media type with .fmu extension

2023-11-13 Thread Lewis John McGibbney (Jira)

Lewis John McGibbney created TIKA-4169:
--

 Summary: Create a parser for Functional Mockup Unit (FMU) media 
type with .fmu extension
 Key: TIKA-4169
 URL: https://issues.apache.org/jira/browse/TIKA-4169
 Project: Tika
  Issue Type: New Feature
  Components: parser
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney


A Functional Mockup Unit (FMU) is a software component used for exchanging and 
simulating dynamic system models. It is designed to enable simulations of 
system models regardless of the simulation tool, programming language, or 
hardware platform. This is made possible through a standard interface that 
allows FMUs to be exported and imported across different simulation 
environments.

 

The FMU media type ships with the .fmu file suffix 

 

I think the MIT licensed [NTNU-IHB/FMI4j|[https://github.com/NTNU-IHB/FMI4j]] 
can be used as the underlying parser implementation.




[jira] [Closed] (NUTCH-3007) Fix impossible casts

2023-11-10 Thread Lewis John McGibbney (Jira)



 [ 
https://issues.apache.org/jira/browse/NUTCH-3007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney closed NUTCH-3007.
---

> Fix impossible casts
> 
>
> Key: NUTCH-3007
> URL: https://issues.apache.org/jira/browse/NUTCH-3007
> Project: Nutch
>  Issue Type: Sub-task
>Affects Versions: 1.19
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Major
> Fix For: 1.20
>
>
> Spotbugs reports two occurrences of
>   Impossible cast from java.util.ArrayList to String[] in 
> org.apache.nutch.fetcher.Fetcher.run(Map, String)
> Both were introduced later into the {{run(Map args, String 
> crawlId)}} method and obviously never used (would throw a 
> ClassCastException). The code blocks should be removed.
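
To illustrate the class of bug Spotbugs is flagging here: an ArrayList stored as an Object compiles fine when cast to String[], but the cast always fails at runtime. The ticket's fix is to remove the dead blocks. The sketch below only shows what a safe conversion could look like if such a value were actually needed. ArgCasts and toStringArray are hypothetical names, not Nutch code.

```java
import java.util.List;

// Hypothetical helper, not part of Nutch.
final class ArgCasts {

  // Converts a loosely typed argument value into a String[].
  // A direct "(String[]) value" cast on an ArrayList compiles but throws
  // ClassCastException at runtime, so the list is copied element by element.
  static String[] toStringArray(Object value) {
    if (value instanceof String[]) {
      return (String[]) value;
    }
    if (value instanceof List<?>) {
      List<?> list = (List<?>) value;
      String[] out = new String[list.size()];
      for (int i = 0; i < out.length; i++) {
        out[i] = String.valueOf(list.get(i));
      }
      return out;
    }
    throw new IllegalArgumentException("Expected String[] or List, got: " + value);
  }
}
```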




[jira] [Closed] (NUTCH-2846) Fix various bugs spotted by NUTCH-2815

2023-11-10 Thread Lewis John McGibbney (Jira)



 [ 
https://issues.apache.org/jira/browse/NUTCH-2846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney closed NUTCH-2846.
---

> Fix various bugs spotted by NUTCH-2815
> --
>
> Key: NUTCH-2846
> URL: https://issues.apache.org/jira/browse/NUTCH-2846
> Project: Nutch
>  Issue Type: Sub-task
>Affects Versions: 1.18
>Reporter: Sebastian Nagel
>Priority: Major
> Fix For: 1.19
>
>
> This issue addresses various bugs spotted by Spotbugs (NUTCH-2815):
> - use static method Integer.parseInt(...)
> - use integer arithmetic instead of floating point with rounding floats 
> afterwards
> - erroneous declaration of constructor in BasicURLNormalizer
> - fix bracketing when calculating hash code of CrawlDatum
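
The first two bullets above can be sketched roughly as follows. This is an illustration of the general patterns rather than the actual Nutch changes. SpotbugsFixSketches, parseCount and ceilDiv are hypothetical names.

```java
// Hypothetical examples, not the actual Nutch patches.
final class SpotbugsFixSketches {

  // Integer.parseInt(...) returns a primitive int directly, avoiding the
  // intermediate Integer object created by new Integer(text) or
  // Integer.valueOf(text).
  static int parseCount(String text) {
    return Integer.parseInt(text.trim());
  }

  // Ceiling division with integer arithmetic only, instead of dividing as
  // floating point and rounding afterwards. A long intermediate avoids
  // overflow when total is close to Integer.MAX_VALUE.
  static int ceilDiv(int total, int parts) {
    if (total < 0 || parts <= 0) {
      throw new IllegalArgumentException("total must be >= 0 and parts > 0");
    }
    return (int) (((long) total + parts - 1) / parts);
  }
}
```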




[jira] [Closed] (NUTCH-2852) Method invokes System.exit(...) 9 bugs

2023-11-10 Thread Lewis John McGibbney (Jira)



 [ 
https://issues.apache.org/jira/browse/NUTCH-2852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney closed NUTCH-2852.
---

> Method invokes System.exit(...) 9 bugs
> --
>
> Key: NUTCH-2852
> URL: https://issues.apache.org/jira/browse/NUTCH-2852
> Project: Nutch
>  Issue Type: Sub-task
>Affects Versions: 1.18
>    Reporter: Lewis John McGibbney
>    Assignee: Lewis John McGibbney
>Priority: Major
> Fix For: 1.20
>
>
> org.apache.nutch.indexer.IndexingFiltersChecker since first historized release
> In class org.apache.nutch.indexer.IndexingFiltersChecker
> In method org.apache.nutch.indexer.IndexingFiltersChecker.run(String[])
> At IndexingFiltersChecker.java:[line 96]
> Another occurrence at IndexingFiltersChecker.java:[line 129]
> org.apache.nutch.indexer.IndexingFiltersChecker.run(String[]) invokes 
> System.exit(...), which shuts down the entire virtual machine
> Invoking System.exit shuts down the entire Java virtual machine. This should 
> only be done when it is appropriate. Such calls make it hard or impossible 
> for your code to be invoked by other code. Consider throwing a 
> RuntimeException instead.
> Also occurs in
>org.apache.nutch.net.URLFilterChecker since first historized release
>org.apache.nutch.net.URLNormalizerChecker since first historized release
>org.apache.nutch.parse.ParseSegment since first historized release
>org.apache.nutch.parse.ParserChecker since first historized release
>org.apache.nutch.service.NutchServer since first historized release
>org.apache.nutch.tools.CommonCrawlDataDumper since first historized release
>org.apache.nutch.tools.DmozParser since first historized release
>org.apache.nutch.util.AbstractChecker since first historized release 
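
One common way to address this warning is to have run(...) report failure through its return value (or an exception) and leave any System.exit call to the outermost entry point. The sketch below assumes that approach. CheckerSketch is a hypothetical class, not the actual Nutch fix.

```java
// Hypothetical checker, not part of Nutch.
final class CheckerSketch {

  // Returns an exit code instead of calling System.exit(...), so other code
  // can invoke this method without the JVM being shut down underneath it.
  static int run(String[] args) {
    if (args.length < 1) {
      System.err.println("Usage: CheckerSketch <url>");
      return -1;
    }
    System.out.println("Checking " + args[0]);
    return 0;
  }

  // Only a command-line entry point would translate the result into a
  // process exit status, for example:
  //   public static void main(String[] args) { System.exit(run(args)); }
}
```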




[jira] [Closed] (NUTCH-2819) Move spotbugs "installation" directory to avoid that spotbugs is shipped in Nutch runtime

2023-11-10 Thread Lewis John McGibbney (Jira)



 [ 
https://issues.apache.org/jira/browse/NUTCH-2819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney closed NUTCH-2819.
---

> Move spotbugs "installation" directory to avoid that spotbugs is shipped in 
> Nutch runtime
> -
>
> Key: NUTCH-2819
> URL: https://issues.apache.org/jira/browse/NUTCH-2819
> Project: Nutch
>  Issue Type: Sub-task
>Affects Versions: 1.18
>Reporter: Sebastian Nagel
>Assignee: Shashanka Balakuntala Srinivasa
>Priority: Minor
> Fix For: 1.19
>
>
> With NUTCH-2816 the Spotbugs tool is "installed" in lib/. However, files in 
> lib/ are copied to build/ and runtime/. To avoid shipping the Spotbugs jars in 
> the runtime, and eventually in releases, the Spotbugs installation folder 
> should be moved into a different directory.




[jira] [Closed] (NUTCH-2851) Random object created and used only once

2023-11-10 Thread Lewis John McGibbney (Jira)



 [ 
https://issues.apache.org/jira/browse/NUTCH-2851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney closed NUTCH-2851.
---

> Random object created and used only once
> 
>
> Key: NUTCH-2851
> URL: https://issues.apache.org/jira/browse/NUTCH-2851
> Project: Nutch
>  Issue Type: Sub-task
>  Components: dmoz, generator, indexer, segment
>Affects Versions: 1.18
>    Reporter: Lewis John McGibbney
>    Assignee: Lewis John McGibbney
>Priority: Major
> Fix For: 1.19
>
>
> In class org.apache.nutch.crawl.Generator
> In method org.apache.nutch.crawl.Generator.partitionSegment(Path, Path, int)
> Called method java.util.Random.nextInt()
> At Generator.java:[line 1016]
> Random object created and used only once in 
> org.apache.nutch.crawl.Generator.partitionSegment(Path, Path, int)
> This code creates a java.util.Random object, uses it to generate one random 
> number, and then discards the Random object. This produces mediocre quality 
> random numbers and is inefficient. If possible, rewrite the code so that the 
> Random object is created once and saved, and each time a new random number is 
> required invoke a method on the existing Random object to obtain it.
> If it is important that the generated Random numbers not be guessable, you 
> must not create a new Random for each random number; the values are too 
> easily guessable. You should strongly consider using a 
> java.security.SecureRandom instead (and avoid allocating a new SecureRandom 
> for each random number needed).
> This bad practice also affects the following
> org.apache.nutch.indexer.IndexingJob since first historized release
> org.apache.nutch.segment.SegmentReader since first historized release
> org.apache.nutch.tools.DmozParser$RDFProcessor since first historized release 
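
The remedy Spotbugs describes, creating the Random once and reusing it, might look like the sketch below. TempNames and next are hypothetical names chosen for illustration. This is not the code in Generator.

```java
import java.util.Random;

// Hypothetical helper, not part of Nutch.
final class TempNames {

  // Created once and reused for every call. java.util.Random is thread-safe,
  // so a shared static instance is acceptable here.
  private static final Random RANDOM = new Random();

  // Builds a temporary name such as "part-123456789".
  static String next(String prefix) {
    return prefix + "-" + RANDOM.nextInt(Integer.MAX_VALUE);
  }
}
```

If the values must not be guessable, the same structure applies with a single shared java.security.SecureRandom instead.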




[jira] [Closed] (NUTCH-2850) Method ignores exceptional return value

2023-11-10 Thread Lewis John McGibbney (Jira)



 [ 
https://issues.apache.org/jira/browse/NUTCH-2850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney closed NUTCH-2850.
---

> Method ignores exceptional return value
> ---
>
> Key: NUTCH-2850
> URL: https://issues.apache.org/jira/browse/NUTCH-2850
> Project: Nutch
>  Issue Type: Sub-task
>  Components: dumpers
>Affects Versions: 1.18
>    Reporter: Lewis John McGibbney
>    Assignee: Lewis John McGibbney
>Priority: Major
> Fix For: 1.19
>
>
> In class org.apache.nutch.tools.FileDumper
> In method org.apache.nutch.tools.FileDumper.dump(File, File, String[], 
> boolean, boolean, boolean)
> Called method java.io.File.mkdirs()
> At FileDumper.java:[line 237]
> Exceptional return value of java.io.File.mkdirs() ignored in 
> org.apache.nutch.tools.FileDumper.dump(File, File, String[], boolean, 
> boolean, boolean)
> This method returns a value that is not checked. The return value should be 
> checked since it can indicate an unusual or unexpected function execution. 
> For example, the File.delete() method returns false if the file could not be 
> successfully deleted (rather than throwing an Exception). If you don't check 
> the result, you won't notice if the method invocation signals unexpected 
> behavior by returning an atypical return value. 
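
A minimal sketch of checking the mkdirs() result follows. Dirs and ensureDirectory are hypothetical names, and this is not the change made in FileDumper. Note that mkdirs() also returns false when the directory already exists, so the check needs a second condition.

```java
import java.io.File;
import java.io.IOException;

// Hypothetical helper, not part of Nutch.
final class Dirs {

  // Creates the directory (and any missing parents) and fails loudly if that
  // was not possible. mkdirs() returns false both on failure and when the
  // directory already exists, hence the additional isDirectory() check.
  static void ensureDirectory(File dir) throws IOException {
    if (!dir.mkdirs() && !dir.isDirectory()) {
      throw new IOException("Could not create directory: " + dir);
    }
  }
}
```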




[jira] [Created] (NUTCH-3024) Remove flaky 'dependency check' target

2023-11-03 Thread Lewis John McGibbney (Jira)

Lewis John McGibbney created NUTCH-3024:
---

 Summary: Remove flaky 'dependency check' target
 Key: NUTCH-3024
 URL: https://issues.apache.org/jira/browse/NUTCH-3024
 Project: Nutch
  Issue Type: Task
  Components: build
Affects Versions: 1.19
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
 Fix For: 1.20


I [started a 
thread|https://lists.apache.org/thread/ol3ssjphdqqxwsxhc65qoqg1dj1kjbxb] 
covering my observations running the ant _*dependency-check*_ target. It fails 
unpredictably in both GitHub actions and our trusty Jenkins builds on 
ci-builds.apache.org.

I propose to simply remove this target (and associated configuration) in a bid 
to clean up some flaky legacy build code.




Removing “dependency-check” target from build.xml

2023-11-03 Thread lewis john mcgibbney

Hi dev@,

Recently I was doing a bit of work on CI and made an attempt to activate
the “dependency-check” target (previously named “report-vulnerabilities”).

The underlying “dependency-check” tooling appears to be flaky at best: it
takes an awfully long time to execute and seems to be prone to hanging.

I propose to remove this target and implement something more stable in the
future… when I work on finishing the Gradle build.

lewismc

[jira] [Created] (NUTCH-3023) Use mikepenz/action-junit-report to improve interpretation of failed tests during CI

2023-11-02 Thread Lewis John McGibbney (Jira)

Lewis John McGibbney created NUTCH-3023:
---

 Summary: Use mikepenz/action-junit-report to improve 
interpretation of failed tests during CI
 Key: NUTCH-3023
 URL: https://issues.apache.org/jira/browse/NUTCH-3023
 Project: Nutch
  Issue Type: Task
  Components: build, test
Affects Versions: 1.19
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
 Fix For: 1.20


The following GitHub action could help improve the interpretation of unit test 
anomalies during a CI run.

[https://github.com/mikepenz/action-junit-report]

Rather than having to grep through the GitHub Action log, one could save time 
by interpreting the comments posted to the PR conversation thread.




[jira] [Closed] (NUTCH-3014) Standardize Job names

2023-11-02 Thread Lewis John McGibbney (Jira)



 [ 
https://issues.apache.org/jira/browse/NUTCH-3014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney closed NUTCH-3014.
---

Thanks [~snagel] for the review

> Standardize Job names
> -
>
> Key: NUTCH-3014
> URL: https://issues.apache.org/jira/browse/NUTCH-3014
> Project: Nutch
>  Issue Type: Improvement
>  Components: configuration, runtime
>Affects Versions: 1.19
>    Reporter: Lewis John McGibbney
>    Assignee: Lewis John McGibbney
>Priority: Minor
> Fix For: 1.20
>
>
> There is a large degree of variability when we set the job name:
>  
> {{Job job = NutchJob.getInstance(getConf());}}
> {{job.setJobName("read " + segment);}}
>  
> Some examples mention the job name, others don't. Some use upper case, others 
> don't, etc.
> I think we can standardize the NutchJob job names. This would help when 
> filtering jobs in YARN ResourceManager UI as well.
> I propose we implement the following convention
>  * *Nutch* (mandatory) - static value which prepends the job name, assists 
> with distinguishing the Job as a NutchJob and making it easily findable.
>  * *${ClassName}* (mandatory) - literally the name of the Class the job is 
> encoded in
>  * *${additional info}* (optional) - value could further distinguish the type 
> of job (LinkRank Counter, LinkRank Initializer, LinkRank Inverter, etc.)
> _{*}Nutch ${ClassName}{*}: *${additional info}*_
> _Examples:_
>  * _Nutch LinkRank: Inverter_
>  * _Nutch CrawlDb: + $crawldb_
>  * _Nutch LinkDbReader: + $linkdb_
> Thanks for any suggestions/comments.
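
The proposed convention could be captured in one small helper so that every job builds its name the same way. The sketch below is an assumption about how that might look. JobNames and of are hypothetical names, not the API that was committed.

```java
// Hypothetical helper, not part of Nutch.
final class JobNames {

  private static final String PREFIX = "Nutch";

  // Builds "Nutch <ClassName>" or "Nutch <ClassName>: <additional info>",
  // following the convention proposed in this issue.
  static String of(String className, String additionalInfo) {
    String base = PREFIX + " " + className;
    if (additionalInfo == null || additionalInfo.isEmpty()) {
      return base;
    }
    return base + ": " + additionalInfo;
  }
}
```

A call site would then read something like job.setJobName(JobNames.of("LinkRank", "Inverter")), which yields "Nutch LinkRank: Inverter".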




[jira] [Resolved] (NUTCH-3014) Standardize Job names

2023-11-02 Thread Lewis John McGibbney (Jira)



 [ 
https://issues.apache.org/jira/browse/NUTCH-3014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney resolved NUTCH-3014.
-
Resolution: Fixed

> Standardize Job names
> -
>
> Key: NUTCH-3014
> URL: https://issues.apache.org/jira/browse/NUTCH-3014
> Project: Nutch
>  Issue Type: Improvement
>  Components: configuration, runtime
>Affects Versions: 1.19
>    Reporter: Lewis John McGibbney
>    Assignee: Lewis John McGibbney
>Priority: Minor
> Fix For: 1.20
>
>
> There is a large degree of variability when we set the job name:
>  
> {{Job job = NutchJob.getInstance(getConf());}}
> {{job.setJobName("read " + segment);}}
>  
> Some examples mention the job name, others don't. Some use upper case, others 
> don't, etc.
> I think we can standardize the NutchJob job names. This would help when 
> filtering jobs in YARN ResourceManager UI as well.
> I propose we implement the following convention
>  * *Nutch* (mandatory) - static value which prepends the job name, assists 
> with distinguishing the Job as a NutchJob and making it easily findable.
>  * *${ClassName}* (mandatory) - literally the name of the Class the job is 
> encoded in
>  * *${additional info}* (optional) - value could further distinguish the type 
> of job (LinkRank Counter, LinkRank Initializer, LinkRank Inverter, etc.)
> _{*}Nutch ${ClassName}{*}: *${additional info}*_
> _Examples:_
>  * _Nutch LinkRank: Inverter_
>  * _Nutch CrawlDb: + $crawldb_
>  * _Nutch LinkDbReader: + $linkdb_
> Thanks for any suggestions/comments.




[jira] [Created] (NUTCH-3022) Experiment formatting codebase per google-java-format

2023-11-02 Thread Lewis John McGibbney (Jira)

Lewis John McGibbney created NUTCH-3022:
---

 Summary: Experiment formatting codebase per google-java-format
 Key: NUTCH-3022
 URL: https://issues.apache.org/jira/browse/NUTCH-3022
 Project: Nutch
  Issue Type: Task
  Components: build
Affects Versions: 1.19
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
 Fix For: 1.20


I [started a mailing list 
thread|https://lists.apache.org/thread/ssmm6djyk5syvhmq701zjf0d9bobpk5n] which 
quizzed whether we should integrate code linting/formatting into the CI.

Seb provided some excellent, calculated input which inspired me to create this 
ticket.

I will create a PR which lints the Nutch codebase per the *google-java-format* 
and discuss the results.




[jira] [Work stopped] (NUTCH-3014) Standardize Job names

2023-11-02 Thread Lewis John McGibbney (Jira)



 [ 
https://issues.apache.org/jira/browse/NUTCH-3014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on NUTCH-3014 stopped by Lewis John McGibbney.
---
> Standardize Job names
> -
>
> Key: NUTCH-3014
> URL: https://issues.apache.org/jira/browse/NUTCH-3014
> Project: Nutch
>  Issue Type: Improvement
>  Components: configuration, runtime
>Affects Versions: 1.19
>    Reporter: Lewis John McGibbney
>    Assignee: Lewis John McGibbney
>Priority: Minor
> Fix For: 1.20
>
>
> There is a large degree of variability when we set the job name:
>  
> {{Job job = NutchJob.getInstance(getConf());}}
> {{job.setJobName("read " + segment);}}
>  
> Some examples mention the job name, others don't. Some use upper case, others 
> don't, etc.
> I think we can standardize the NutchJob job names. This would help when 
> filtering jobs in YARN ResourceManager UI as well.
> I propose we implement the following convention
>  * *Nutch* (mandatory) - static value which prepends the job name, assists 
> with distinguishing the Job as a NutchJob and making it easily findable.
>  * *${ClassName}* (mandatory) - literally the name of the Class the job is 
> encoded in
>  * *${additional info}* (optional) - value could further distinguish the type 
> of job (LinkRank Counter, LinkRank Initializer, LinkRank Inverter, etc.)
> _{*}Nutch ${ClassName}{*}: *${additional info}*_
> _Examples:_
>  * _Nutch LinkRank: Inverter_
>  * _Nutch CrawlDb: + $crawldb_
>  * _Nutch LinkDbReader: + $linkdb_
> Thanks for any suggestions/comments.




Re: Nutch codebase formatting

2023-11-02 Thread Lewis John McGibbney

Thanks Seb. I'll go ahead and try to build in the google Java format via 
super-linter and see where we get...!
lewismc

On 2023/10/29 17:04:47 Sebastian Nagel wrote:
> Hi Lewis,
> 
>  >> whether we need a Nutch custom code style at all… why don’t we just use
>  >> some other existing style and then enforce it?
> 
> Enforcing: yes!
> 
> However, I would try hard to keep the changes to a reasonable minimum. For 
> example, if we change the indentation, almost every code line is affected, 
> which means
> - "git annotate" becomes mostly useless (or more difficult to use because 
>   you need to look back)
> - merges of open PRs, custom patches or modifications in custom repositories
>   might get quite painful, until the formatting is synchronized.
> 
> 
>  >> * google Java format [1] which offers a GitHub action for easy integration
>  >> into our CI process, or
> 
> +1
> 
> + available also for Intellij, Eclipse
> + indentation stays the same
> +/- about 25% of the code lines are changed (might be acceptable)
> 
> 
>  >> * superlinter [3] basically emerging as the industry OSS default, offers a
>  >> GitHub action and could also be configured to lint dockerfile, and other
>  >> artifacts. It can also be configured to use the google Java style as well…
> 
> +1 (with Google Java style)
> 
> 
>  > I’ll submit a PR for superlinter so everyone can see what it would look like.
> 
> Great! Thanks!
> 
> 
> Best,
> Sebastian
> 
> On 10/29/23 00:38, Lewis John McGibbney wrote:
> > Any thoughts on this, folks?
> > I’ll submit a PR for superlinter so everyone can see what it would look 
> > like.
> > lewismc
> > 
> > On 2023/10/23 19:28:45 lewis john mcgibbney wrote:
> >> Hi dev@,
> >>
> >> For the longest time the Nutch codebase has shipped with a
> >> eclipse-codeformat.xml [0] file.
> >> Whilst this has been largely successful in keeping the codebase uniform, it
> >> cannot/has not been integrated into continuous integration (CI)  and
> >> subsequently not really enforced!
> >>
> >> Whilst I’m a big fan of “if it ain’t broken don’t fix it”, I think we
> >> should have some CI code formatting checks. Additionally I really question
> >> whether we need a Nutch custom code style at all… why don’t we just use
> >> some other existing style and then enforce it?
> >>
> >> I therefore propose that we replace the legacy code formatter with a
> >> convention such as
> >>
> >> * google Java format [1] which offers a GitHub action for easy integration
> >> into our CI process, or
> >> * check style [2] which offers an Ant task which we could use, this is of
> >> less utility as we think about the move to Gradle
> >> * superlinter [3] basically emerging as the industry OSS default, offers a
> >> GitHub action and could also be configured to lint dockerfile, and other
> >> artifacts. It can also be configured to use the google Java style as well…
> >>
> >> My preference would be [3] because it offers a more comprehensive linting
> >> package for the entire codebase not just the Java code.
> >>
> >> Thanks for your consideration.
> >> lewismc
> >>
> >> [0]
> >> https://github.com/apache/nutch/blob/master/eclipse-codeformat.xml
> >> [1]
> >> https://github.com/google/google-java-format
> >> [2]
> >> https://checkstyle.sourceforge.io/
> >> [3]
> >> https://github.com/marketplace/actions/super-linter
> >>
>

Re: Nutch codebase formatting

2023-10-28 Thread Lewis John McGibbney

Any thoughts on this, folks?
I’ll submit a PR for superlinter so everyone can see what it would look like.
lewismc 

On 2023/10/23 19:28:45 lewis john mcgibbney wrote:
> Hi dev@,
> 
> For the longest time the Nutch codebase has shipped with a
> eclipse-codeformat.xml [0] file.
> Whilst this has been largely successful in keeping the codebase uniform, it
> cannot/has not been integrated into continuous integration (CI)  and
> subsequently not really enforced!
> 
> Whilst I’m a big fan of “if it ain’t broken don’t fix it”, I think we
> should have some CI code formatting checks. Additionally I really question
> whether we need a Nutch custom code style at all… why don’t we just use
> some other existing style and then enforce it?
> 
> I therefore propose that we replace the legacy code formatter with a
> convention such as
> 
> * google Java format [1] which offers a GitHub action for easy integration
> into our CI process, or
> * check style [2] which offers an Ant task which we could use, this is of
> less utility as we think about the move to Gradle
> * superlinter [3] basically emerging as the industry OSS default, offers a
> GitHub action and could also be configured to lint dockerfile, and other
> artifacts. It can also be configured to use the google Java style as well…
> 
> My preference would be [3] because it offers a more comprehensive linting
> package for the entire codebase not just the Java code.
> 
> Thanks for your consideration.
> lewismc
> 
> [0]
> https://github.com/apache/nutch/blob/master/eclipse-codeformat.xml
> [1]
> https://github.com/google/google-java-format
> [2]
> https://checkstyle.sourceforge.io/
> [3]
> https://github.com/marketplace/actions/super-linter
>

[jira] [Work stopped] (NUTCH-3015) Add more CI steps to GitHub master-build.yml

2023-10-27 Thread Lewis John McGibbney (Jira)



 [ 
https://issues.apache.org/jira/browse/NUTCH-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on NUTCH-3015 stopped by Lewis John McGibbney.
---
> Add more CI steps to GitHub master-build.yml
> 
>
> Key: NUTCH-3015
> URL: https://issues.apache.org/jira/browse/NUTCH-3015
> Project: Nutch
>  Issue Type: Improvement
>  Components: build
>Affects Versions: 1.19
>    Reporter: Lewis John McGibbney
>    Assignee: Lewis John McGibbney
>Priority: Minor
> Fix For: 1.20
>
>
> With specific reference to the GitHub master-build.yml, we currently run 
> _*ant clean nightly javadoc -buildfile build.xml*_ as one mammoth task and if 
> something fails it is unclear as to exactly what.
>  
> There are several improvements I want to propose to the GitHub CI
>  * run workflows against multiple environments/OSes, e.g. ubuntu, macos & 
> windows
>  * define multiple jobs which can run in parallel to speed up CI e.g. javadoc 
> and nightly targets
>  * run more targets e.g. linting, rat-sources, report-vulnerabilities, 
> report-licenses, etc.




[jira] [Closed] (NUTCH-3015) Add more CI steps to GitHub master-build.yml

2023-10-27 Thread Lewis John McGibbney (Jira)



 [ 
https://issues.apache.org/jira/browse/NUTCH-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney closed NUTCH-3015.
---

> Add more CI steps to GitHub master-build.yml
> 
>
> Key: NUTCH-3015
> URL: https://issues.apache.org/jira/browse/NUTCH-3015
> Project: Nutch
>  Issue Type: Improvement
>  Components: build
>Affects Versions: 1.19
>    Reporter: Lewis John McGibbney
>    Assignee: Lewis John McGibbney
>Priority: Minor
> Fix For: 1.20
>
>
> With specific reference to the GitHub master-build.yml, we currently run 
> _*ant clean nightly javadoc -buildfile build.xml*_ as one mammoth task and if 
> something fails it is unclear as to exactly what.
>  
> There are several improvements I want to propose to the GitHub CI
>  * run workflows against in multiple Environments/OS e.g. ubuntu, macos & 
> windows
>  * define multiple jobs which can run in parallel to speed up CI e.g. javadoc 
> and nightly targets
>  * run more targets e.g. linting, rat-sources, report-vulnerabilities, 
> report-licenses, etc.




[jira] [Resolved] (NUTCH-3015) Add more CI steps to GitHub master-build.yml

2023-10-27 Thread Lewis John McGibbney (Jira)



 [ 
https://issues.apache.org/jira/browse/NUTCH-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney resolved NUTCH-3015.
-
Resolution: Fixed

> Add more CI steps to GitHub master-build.yml
> 
>
> Key: NUTCH-3015
> URL: https://issues.apache.org/jira/browse/NUTCH-3015
> Project: Nutch
> Issue Type: Improvement
> Components: build
> Affects Versions: 1.19
> Reporter: Lewis John McGibbney
> Assignee: Lewis John McGibbney
> Priority: Minor
> Fix For: 1.20
>
>
> With specific reference to the GitHub master-build.yml, we currently run
> _*ant clean nightly javadoc -buildfile build.xml*_ as one mammoth task, and if
> something fails it is unclear exactly what failed.
>
> There are several improvements I want to propose to the GitHub CI:
>  * run workflows against multiple environments/OSes, e.g. ubuntu, macos &
> windows
>  * define multiple jobs which can run in parallel to speed up CI e.g. javadoc 
> and nightly targets
>  * run more targets e.g. linting, rat-sources, report-vulnerabilities, 
> report-licenses, etc.




[jira] [Work started] (NUTCH-2887) Migrate to JUnit 5 Jupiter

2023-10-24 Thread Lewis John McGibbney (Jira)



 [ 
https://issues.apache.org/jira/browse/NUTCH-2887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on NUTCH-2887 started by Lewis John McGibbney.
---
> Migrate to JUnit 5 Jupiter
> --
>
> Key: NUTCH-2887
> URL: https://issues.apache.org/jira/browse/NUTCH-2887
> Project: Nutch
> Issue Type: Improvement
> Components: test
> Environment: Migrate
> Reporter: Lewis John McGibbney
> Assignee: Lewis John McGibbney
> Priority: Major
> Fix For: 1.20
>
>
> This effort is a bit of a beast. See the [JUnit migration 
> tips|https://junit.org/junit5/docs/current/user-guide/#migrating-from-junit4-tips]
>  for general guidance. A general grep for junit in src produces the following
> {code:bash}
> ./test/nutch-site.xml
> ./test/org/apache/nutch/tools/TestCommonCrawlDataDumper.java
> ./test/org/apache/nutch/net/TestURLNormalizers.java
> ./test/org/apache/nutch/net/protocols/TestHttpDateFormat.java
> ./test/org/apache/nutch/net/TestURLFilters.java
> ./test/org/apache/nutch/util/TestStringUtil.java
> ./test/org/apache/nutch/util/TestSuffixStringMatcher.java
> ./test/org/apache/nutch/util/TestEncodingDetector.java
> ./test/org/apache/nutch/util/TestMimeUtil.java
> ./test/org/apache/nutch/util/TestPrefixStringMatcher.java
> ./test/org/apache/nutch/util/DumpFileUtilTest.java
> ./test/org/apache/nutch/util/TestNodeWalker.java
> ./test/org/apache/nutch/util/WritableTestUtils.java
> ./test/org/apache/nutch/util/TestTableUtil.java
> ./test/org/apache/nutch/util/TestURLUtil.java
> ./test/org/apache/nutch/util/TestGZIPUtils.java
> ./test/org/apache/nutch/parse/TestParseText.java
> ./test/org/apache/nutch/parse/TestOutlinks.java
> ./test/org/apache/nutch/parse/TestParseData.java
> ./test/org/apache/nutch/parse/TestOutlinkExtractor.java
> ./test/org/apache/nutch/parse/TestParserFactory.java
> ./test/org/apache/nutch/segment/TestSegmentMerger.java
> ./test/org/apache/nutch/segment/TestSegmentMergerCrawlDatums.java
> ./test/org/apache/nutch/plugin/TestPluginSystem.java
> ./test/org/apache/nutch/fetcher/TestFetcher.java
> ./test/org/apache/nutch/protocol/TestProtocolFactory.java
> ./test/org/apache/nutch/protocol/TestContent.java
> ./test/org/apache/nutch/protocol/AbstractHttpProtocolPluginTest.java
> ./test/org/apache/nutch/crawl/TestCrawlDbFilter.java
> ./test/org/apache/nutch/crawl/TestTextProfileSignature.java
> ./test/org/apache/nutch/crawl/TestCrawlDbStates.java
> ./test/org/apache/nutch/crawl/TestGenerator.java
> ./test/org/apache/nutch/crawl/TestAdaptiveFetchSchedule.java
> ./test/org/apache/nutch/crawl/TODOTestCrawlDbStates.java
> ./test/org/apache/nutch/crawl/TestSignatureFactory.java
> ./test/org/apache/nutch/crawl/ContinuousCrawlTestUtil.java
> ./test/org/apache/nutch/crawl/TestInjector.java
> ./test/org/apache/nutch/crawl/TestLinkDbMerger.java
> ./test/org/apache/nutch/crawl/TestCrawlDbMerger.java
> ./test/org/apache/nutch/service/TestNutchServer.java
> ./test/org/apache/nutch/metadata/TestMetadata.java
> ./test/org/apache/nutch/metadata/TestSpellCheckedMetadata.java
> ./test/org/apache/nutch/indexer/TestIndexingFilters.java
> ./test/org/apache/nutch/indexer/TestIndexerMapReduce.java
> ./bin/nutch
> ./plugin/scoring-orphan/src/test/org/apache/nutch/scoring/orphan/TestOrphanScoringFilter.java
> ./plugin/index-basic/src/test/org/apache/nutch/indexer/basic/TestBasicIndexingFilter.java
> ./plugin/urlfilter-domaindenylist/build.xml
> ./plugin/urlfilter-domaindenylist/src/test/org/apache/nutch/urlfilter/domaindenylist/TestDomainDenylistURLFilter.java
> ./plugin/protocol-imaps/plugin.xml
> ./plugin/protocol-imaps/ivy.xml
> ./plugin/protocol-imaps/lib/junit-4.13.jar
> ./plugin/protocol-imaps/lib/greenmail-junit4-1.6.0.jar
> ./plugin/protocol-imaps/lib/greenmail-1.6.0.jar
> ./plugin/protocol-imaps/src/test/org/apache/nutch/protocol/imaps/TestImaps.java
> ./plugin/protocol-file/build.xml
> ./plugin/protocol-file/src/test/org/apache/nutch/protocol/file/TestProtocolFile.java
> ./plugin/urlnormalizer-regex/build.xml
> ./plugin/urlnormalizer-regex/src/test/org/apache/nutch/net/urlnormalizer/regex/TestRegexURLNormalizer.java
> ./plugin/build-plugin.xml
> ./plugin/creativecommons/src/test/org/creativecommons/nutch/TestCCParseFilter.java
> ./plugin/urlnormalizer-basic/src/test/org/apache/nutch/net/urlnormalizer/basic/TestBasicURLNormalizer.java
> ./plugin/urlnormalizer-protocol/build.xml
> ./plugin/urlnormalizer-protocol/src/test/org/apache/nutch/net/urlnormalizer/protocol/TestProtocolURLNormalizer.java
> ./plugin/urlfilter-prefix/src/test/org/apache/nutch/urlfilter/prefi

[jira] [Created] (NUTCH-3016) Upgrade Apache Ivy to 2.5.2

2023-10-24 Thread Lewis John McGibbney (Jira)

Lewis John McGibbney created NUTCH-3016:
---

 Summary: Upgrade Apache Ivy to 2.5.2
 Key: NUTCH-3016
 URL: https://issues.apache.org/jira/browse/NUTCH-3016
 Project: Nutch
 Issue Type: Task
 Components: ivy, build
 Affects Versions: 1.19
 Reporter: Lewis John McGibbney
 Assignee: Lewis John McGibbney
 Fix For: 1.20


[Apache Ivy v2.5.2|https://ant.apache.org/ivy/history/2.5.2/release-notes.html] 
was released on August 20 2023!

We should upgrade.




[jira] [Assigned] (NUTCH-2887) Migrate to JUnit 5 Jupiter

2023-10-23 Thread Lewis John McGibbney (Jira)



 [ 
https://issues.apache.org/jira/browse/NUTCH-2887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney reassigned NUTCH-2887:
---

Assignee: Lewis John McGibbney

> Migrate to JUnit 5 Jupiter
> --
>
> Key: NUTCH-2887
> URL: https://issues.apache.org/jira/browse/NUTCH-2887
> Project: Nutch
> Issue Type: Improvement
> Components: test
> Environment: Migrate
> Reporter: Lewis John McGibbney
> Assignee: Lewis John McGibbney
> Priority: Major
> Fix For: 1.20
>
>
> This effort is a bit of a beast. See the [JUnit migration 
> tips|https://junit.org/junit5/docs/current/user-guide/#migrating-from-junit4-tips]
>  for general guidance. A general grep for junit in src produces the following
> {code:bash}
> ./test/nutch-site.xml
> ./test/org/apache/nutch/tools/TestCommonCrawlDataDumper.java
> ./test/org/apache/nutch/net/TestURLNormalizers.java
> ./test/org/apache/nutch/net/protocols/TestHttpDateFormat.java
> ./test/org/apache/nutch/net/TestURLFilters.java
> ./test/org/apache/nutch/util/TestStringUtil.java
> ./test/org/apache/nutch/util/TestSuffixStringMatcher.java
> ./test/org/apache/nutch/util/TestEncodingDetector.java
> ./test/org/apache/nutch/util/TestMimeUtil.java
> ./test/org/apache/nutch/util/TestPrefixStringMatcher.java
> ./test/org/apache/nutch/util/DumpFileUtilTest.java
> ./test/org/apache/nutch/util/TestNodeWalker.java
> ./test/org/apache/nutch/util/WritableTestUtils.java
> ./test/org/apache/nutch/util/TestTableUtil.java
> ./test/org/apache/nutch/util/TestURLUtil.java
> ./test/org/apache/nutch/util/TestGZIPUtils.java
> ./test/org/apache/nutch/parse/TestParseText.java
> ./test/org/apache/nutch/parse/TestOutlinks.java
> ./test/org/apache/nutch/parse/TestParseData.java
> ./test/org/apache/nutch/parse/TestOutlinkExtractor.java
> ./test/org/apache/nutch/parse/TestParserFactory.java
> ./test/org/apache/nutch/segment/TestSegmentMerger.java
> ./test/org/apache/nutch/segment/TestSegmentMergerCrawlDatums.java
> ./test/org/apache/nutch/plugin/TestPluginSystem.java
> ./test/org/apache/nutch/fetcher/TestFetcher.java
> ./test/org/apache/nutch/protocol/TestProtocolFactory.java
> ./test/org/apache/nutch/protocol/TestContent.java
> ./test/org/apache/nutch/protocol/AbstractHttpProtocolPluginTest.java
> ./test/org/apache/nutch/crawl/TestCrawlDbFilter.java
> ./test/org/apache/nutch/crawl/TestTextProfileSignature.java
> ./test/org/apache/nutch/crawl/TestCrawlDbStates.java
> ./test/org/apache/nutch/crawl/TestGenerator.java
> ./test/org/apache/nutch/crawl/TestAdaptiveFetchSchedule.java
> ./test/org/apache/nutch/crawl/TODOTestCrawlDbStates.java
> ./test/org/apache/nutch/crawl/TestSignatureFactory.java
> ./test/org/apache/nutch/crawl/ContinuousCrawlTestUtil.java
> ./test/org/apache/nutch/crawl/TestInjector.java
> ./test/org/apache/nutch/crawl/TestLinkDbMerger.java
> ./test/org/apache/nutch/crawl/TestCrawlDbMerger.java
> ./test/org/apache/nutch/service/TestNutchServer.java
> ./test/org/apache/nutch/metadata/TestMetadata.java
> ./test/org/apache/nutch/metadata/TestSpellCheckedMetadata.java
> ./test/org/apache/nutch/indexer/TestIndexingFilters.java
> ./test/org/apache/nutch/indexer/TestIndexerMapReduce.java
> ./bin/nutch
> ./plugin/scoring-orphan/src/test/org/apache/nutch/scoring/orphan/TestOrphanScoringFilter.java
> ./plugin/index-basic/src/test/org/apache/nutch/indexer/basic/TestBasicIndexingFilter.java
> ./plugin/urlfilter-domaindenylist/build.xml
> ./plugin/urlfilter-domaindenylist/src/test/org/apache/nutch/urlfilter/domaindenylist/TestDomainDenylistURLFilter.java
> ./plugin/protocol-imaps/plugin.xml
> ./plugin/protocol-imaps/ivy.xml
> ./plugin/protocol-imaps/lib/junit-4.13.jar
> ./plugin/protocol-imaps/lib/greenmail-junit4-1.6.0.jar
> ./plugin/protocol-imaps/lib/greenmail-1.6.0.jar
> ./plugin/protocol-imaps/src/test/org/apache/nutch/protocol/imaps/TestImaps.java
> ./plugin/protocol-file/build.xml
> ./plugin/protocol-file/src/test/org/apache/nutch/protocol/file/TestProtocolFile.java
> ./plugin/urlnormalizer-regex/build.xml
> ./plugin/urlnormalizer-regex/src/test/org/apache/nutch/net/urlnormalizer/regex/TestRegexURLNormalizer.java
> ./plugin/build-plugin.xml
> ./plugin/creativecommons/src/test/org/creativecommons/nutch/TestCCParseFilter.java
> ./plugin/urlnormalizer-basic/src/test/org/apache/nutch/net/urlnormalizer/basic/TestBasicURLNormalizer.java
> ./plugin/urlnormalizer-protocol/build.xml
> ./plugin/urlnormalizer-protocol/src/test/org/apache/nutch/net/urlnormalizer/protocol/TestProtocolURLNormalizer.java
> ./plugin/urlfilter-prefix/src/test/org/apache/n

[jira] [Work started] (NUTCH-3015) Add more CI steps to GitHub master-build.yml

2023-10-23 Thread Lewis John McGibbney (Jira)



 [ 
https://issues.apache.org/jira/browse/NUTCH-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on NUTCH-3015 started by Lewis John McGibbney.
---
> Add more CI steps to GitHub master-build.yml
> 
>
> Key: NUTCH-3015
> URL: https://issues.apache.org/jira/browse/NUTCH-3015
> Project: Nutch
> Issue Type: Improvement
> Components: build
> Affects Versions: 1.19
> Reporter: Lewis John McGibbney
> Assignee: Lewis John McGibbney
> Priority: Minor
> Fix For: 1.20
>
>
> With specific reference to the GitHub master-build.yml, we currently run
> _*ant clean nightly javadoc -buildfile build.xml*_ as one mammoth task, and if
> something fails it is unclear exactly what failed.
>
> There are several improvements I want to propose to the GitHub CI:
>  * run workflows against multiple environments/OSes, e.g. ubuntu, macos &
> windows
>  * define multiple jobs which can run in parallel to speed up CI e.g. javadoc 
> and nightly targets
>  * run more targets e.g. linting, rat-sources, report-vulnerabilities, 
> report-licenses, etc.




[jira] [Work started] (NUTCH-3014) Standardize Job names

2023-10-23 Thread Lewis John McGibbney (Jira)



 [ 
https://issues.apache.org/jira/browse/NUTCH-3014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on NUTCH-3014 started by Lewis John McGibbney.
---
> Standardize Job names
> -
>
> Key: NUTCH-3014
> URL: https://issues.apache.org/jira/browse/NUTCH-3014
> Project: Nutch
> Issue Type: Improvement
> Components: configuration, runtime
> Affects Versions: 1.19
> Reporter: Lewis John McGibbney
> Assignee: Lewis John McGibbney
> Priority: Minor
> Fix For: 1.20
>
>
> There is a large degree of variability when we set the job name, e.g.
>  
> {{Job job = NutchJob.getInstance(getConf());}}
> {{job.setJobName("read " + segment);}}
>  
> Some examples mention the job name, others don't. Some use upper case, others 
> don't, etc.
> I think we can standardize the NutchJob job names. This would help when 
> filtering jobs in YARN ResourceManager UI as well.
> I propose we implement the following convention
>  * *Nutch* (mandatory) - static value which prepends the job name, assists 
> with distinguishing the Job as a NutchJob and making it easily findable.
>  * *${ClassName}* (mandatory) - literally the name of the Class the job is 
> encoded in
>  * *${additional info}* (optional) - value could further distinguish the type 
> of job (LinkRank Counter, LinkRank Initializer, LinkRank Inverter, etc.)
> _{*}Nutch ${ClassName}{*}: *${additional info}*_
> _Examples:_
>  * _Nutch LinkRank: Inverter_
>  * _Nutch CrawlDb: + $crawldb_
>  * _Nutch LinkDbReader: + $linkdb_
> Thanks for any suggestions/comments.
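
The convention quoted above could be sketched in Java roughly as follows. This is only an illustration under stated assumptions: the JobNames helper and its jobName method are hypothetical names that do not exist in Nutch, and the nested LinkRank class is a stand-in used to keep the sketch self-contained.

```java
/**
 * Hypothetical helper (not part of Nutch) that builds job names of the
 * form "Nutch <ClassName>" or "Nutch <ClassName>: <additional info>".
 */
final class JobNames {

    /** Static prefix that marks a job as belonging to Nutch. */
    private static final String PREFIX = "Nutch";

    private JobNames() {
        // utility class, not meant to be instantiated
    }

    /**
     * @param jobClass       the class the job is encoded in (mandatory)
     * @param additionalInfo optional detail; null or blank means "none"
     */
    public static String jobName(Class<?> jobClass, String additionalInfo) {
        String base = PREFIX + " " + jobClass.getSimpleName();
        if (additionalInfo == null || additionalInfo.trim().isEmpty()) {
            return base; // the additional info part is optional
        }
        return base + ": " + additionalInfo.trim();
    }

    /** Stand-in for a real job class, used only for the demo below. */
    static final class LinkRank {
    }

    public static void main(String[] args) {
        System.out.println(jobName(LinkRank.class, "Inverter")); // Nutch LinkRank: Inverter
        System.out.println(jobName(LinkRank.class, null));       // Nutch LinkRank
    }
}
```

With a helper along these lines, each call to job.setJobName(...) would pass the owning class and an optional detail string, so every name starts with the same prefix and can be filtered on it in the YARN ResourceManager UI.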




Nutch codebase formatting

2023-10-23 Thread lewis john mcgibbney

Hi dev@,

For the longest time the Nutch codebase has shipped with an
eclipse-codeformat.xml [0] file.
Whilst this has been largely successful in keeping the codebase uniform, it
has not been integrated into continuous integration (CI) and
consequently has never really been enforced!

Whilst I’m a big fan of “if it ain’t broken don’t fix it”, I think we
should have some CI code formatting checks. Additionally I really question
whether we need a Nutch custom code style at all… why don’t we just use
some other existing style and then enforce it?

I therefore propose that we replace the legacy code formatter with a
convention such as

* google-java-format [1], which offers a GitHub action for easy integration
into our CI process, or
* Checkstyle [2], which offers an Ant task which we could use; this is of
less utility as we think about the move to Gradle, or
* Super-Linter [3], which is basically emerging as the industry OSS default,
offers a GitHub action and could also be configured to lint Dockerfiles and
other artifacts. It can also be configured to use the Google Java style as well…

My preference would be [3] because it offers a more comprehensive linting
package for the entire codebase not just the Java code.

Thanks for your consideration.
lewismc

[0]
https://github.com/apache/nutch/blob/master/eclipse-codeformat.xml
[1]
https://github.com/google/google-java-format
[2]
https://checkstyle.sourceforge.io/
[3]
https://github.com/marketplace/actions/super-linter

[jira] [Created] (NUTCH-3015) Add more CI steps to GitHub master-build.yml

2023-10-22 Thread Lewis John McGibbney (Jira)

Lewis John McGibbney created NUTCH-3015:
---

 Summary: Add more CI steps to GitHub master-build.yml
 Key: NUTCH-3015
 URL: https://issues.apache.org/jira/browse/NUTCH-3015
 Project: Nutch
 Issue Type: Improvement
 Components: build
 Affects Versions: 1.19
 Reporter: Lewis John McGibbney
 Assignee: Lewis John McGibbney
 Fix For: 1.20


With specific reference to the GitHub master-build.yml, we currently run
_*ant clean nightly javadoc -buildfile build.xml*_ as one mammoth task, and if
something fails it is unclear exactly what failed.

There are several improvements I want to propose to the GitHub CI:
 * run workflows against multiple environments/OSes, e.g. ubuntu, macos &
windows
 * define multiple jobs which can run in parallel to speed up CI e.g. javadoc 
and nightly targets
 * run more targets e.g. linting, rat-sources, report-vulnerabilities, 
report-licenses, etc.




[jira] [Updated] (NUTCH-3014) Standardize Job names

2023-10-22 Thread Lewis John McGibbney (Jira)



 [ 
https://issues.apache.org/jira/browse/NUTCH-3014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-3014:

Description: 
There is a large degree of variability when we set the job name, e.g.

 

{{Job job = NutchJob.getInstance(getConf());}}

{{job.setJobName("read " + segment);}}

 

Some examples mention the job name, others don't. Some use upper case, others 
don't, etc.

I think we can standardize the NutchJob job names. This would help when 
filtering jobs in YARN ResourceManager UI as well.

I propose we implement the following convention
 * *Nutch* (mandatory) - static value which prepends the job name, assists with 
distinguishing the Job as a NutchJob and making it easily findable.
 * *${ClassName}* (mandatory) - literally the name of the Class the job is 
encoded in
 * *${additional info}* (optional) - value could further distinguish the type 
of job (LinkRank Counter, LinkRank Initializer, LinkRank Inverter, etc.)

_{*}Nutch ${ClassName}{*}: *${additional info}*_

_Examples:_
 * _Nutch LinkRank: Inverter_
 * _Nutch CrawlDb: + $crawldb_
 * _Nutch LinkDbReader: + $linkdb_

Thanks for any suggestions/comments.

  was:
There is a large degree of variability when we set the job name, e.g.

 

{{Job job = NutchJob.getInstance(getConf());}}

{{job.setJobName("read " + segment);}}

 

Some examples mention the job name, others don't. Some use upper case, others 
don't, etc.

I think we can standardize the NutchJob job names. This would help when 
filtering jobs in YARN ResourceManager UI as well.

I propose we implement the following convention
 * *Nutch* (mandatory) - static value which prepends the job name, assists with 
distinguishing the Job as a NutchJob and making it easily findable.
 * *${ClassName}* (mandatory) - literally the name of the Class the job is 
encoded in
 * *${additional info}* (optional) - value could further distinguish the type 
of job (LinkRank Counter, LinkRank Initializer, LinkRank Inverter, etc.)

_{*}Nutch ${ClassName}{*}: *${additional info}*_

_Examples:_
 * _Nutch LinkRank Inverter_
 * _Nutch CrawlDb + $crawldb_
 * _Nutch LinkDbReader + $linkdb_

Thanks for any suggestions/comments.


> Standardize Job names
> -
>
> Key: NUTCH-3014
> URL: https://issues.apache.org/jira/browse/NUTCH-3014
> Project: Nutch
> Issue Type: Improvement
> Components: configuration, runtime
> Affects Versions: 1.19
> Reporter: Lewis John McGibbney
> Assignee: Lewis John McGibbney
> Priority: Minor
> Fix For: 1.20
>
>
> There is a large degree of variability when we set the job name, e.g.
>  
> {{Job job = NutchJob.getInstance(getConf());}}
> {{job.setJobName("read " + segment);}}
>  
> Some examples mention the job name, others don't. Some use upper case, others 
> don't, etc.
> I think we can standardize the NutchJob job names. This would help when 
> filtering jobs in YARN ResourceManager UI as well.
> I propose we implement the following convention
>  * *Nutch* (mandatory) - static value which prepends the job name, assists 
> with distinguishing the Job as a NutchJob and making it easily findable.
>  * *${ClassName}* (mandatory) - literally the name of the Class the job is 
> encoded in
>  * *${additional info}* (optional) - value could further distinguish the type 
> of job (LinkRank Counter, LinkRank Initializer, LinkRank Inverter, etc.)
> _{*}Nutch ${ClassName}{*}: *${additional info}*_
> _Examples:_
>  * _Nutch LinkRank: Inverter_
>  * _Nutch CrawlDb: + $crawldb_
>  * _Nutch LinkDbReader: + $linkdb_
> Thanks for any suggestions/comments.




[jira] [Updated] (NUTCH-3014) Standardize Job names

2023-10-22 Thread Lewis John McGibbney (Jira)



 [ 
https://issues.apache.org/jira/browse/NUTCH-3014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-3014:

Description: 
There is a large degree of variability when we set the job name, e.g.

 

{{Job job = NutchJob.getInstance(getConf());}}

{{job.setJobName("read " + segment);}}

 

Some examples mention the job name, others don't. Some use upper case, others 
don't, etc.

I think we can standardize the NutchJob job names. This would help when 
filtering jobs in YARN ResourceManager UI as well.

I propose we implement the following convention
 * *Nutch* (mandatory) - static value which prepends the job name, assists with 
distinguishing the Job as a NutchJob and making it easily findable.
 * *${ClassName}* (mandatory) - literally the name of the Class the job is 
encoded in
 * *${additional info}* (optional) - value could further distinguish the type 
of job (LinkRank Counter, LinkRank Initializer, LinkRank Inverter, etc.)

_{*}Nutch ${ClassName}{*}: *${additional info}*_

_Examples:_
 * _Nutch LinkRank Inverter_
 * _Nutch CrawlDb + $crawldb_
 * _Nutch LinkDbReader + $linkdb_

Thanks for any suggestions/comments.

  was:
There is a large degree of variability when we set the job name, e.g.

 

{{Job job = NutchJob.getInstance(getConf());}}

{{job.setJobName("read " + segment);}}

 

Some examples mention the job name, others don't. Some use upper case, others 
don't, etc.

I think we can standardize the NutchJob job names. This would help when 
filtering jobs in YARN ResourceManager UI as well.

I propose we implement the following convention
 * *Nutch* (mandatory) - static value which prepends the job name, assists with 
distinguishing the Job as a NutchJob and making it easily findable.
 * *${ClassName}* (mandatory) - literally the name of the Class the job is 
encoded in
 * *${additional info}* (optional) - value could further distinguish the type 
of job (LinkRank Counter, LinkRank Initializer, LinkRank Inverter, etc.)

_*Nutch ${ClassName}* *${additional info}*_

_Examples:_
 * _Nutch LinkRank Inverter_
 * _Nutch CrawlDb + $crawldb_
 * _Nutch LinkDbReader + $linkdb_

Thanks for any suggestions/comments.


> Standardize Job names
> -
>
> Key: NUTCH-3014
> URL: https://issues.apache.org/jira/browse/NUTCH-3014
> Project: Nutch
> Issue Type: Improvement
> Components: configuration, runtime
> Affects Versions: 1.19
> Reporter: Lewis John McGibbney
> Assignee: Lewis John McGibbney
> Priority: Minor
> Fix For: 1.20
>
>
> There is a large degree of variability when we set the job name, e.g.
>  
> {{Job job = NutchJob.getInstance(getConf());}}
> {{job.setJobName("read " + segment);}}
>  
> Some examples mention the job name, others don't. Some use upper case, others 
> don't, etc.
> I think we can standardize the NutchJob job names. This would help when 
> filtering jobs in YARN ResourceManager UI as well.
> I propose we implement the following convention
>  * *Nutch* (mandatory) - static value which prepends the job name, assists 
> with distinguishing the Job as a NutchJob and making it easily findable.
>  * *${ClassName}* (mandatory) - literally the name of the Class the job is 
> encoded in
>  * *${additional info}* (optional) - value could further distinguish the type 
> of job (LinkRank Counter, LinkRank Initializer, LinkRank Inverter, etc.)
> _{*}Nutch ${ClassName}{*}: *${additional info}*_
> _Examples:_
>  * _Nutch LinkRank Inverter_
>  * _Nutch CrawlDb + $crawldb_
>  * _Nutch LinkDbReader + $linkdb_
> Thanks for any suggestions/comments.




[jira] [Updated] (NUTCH-3014) Standardize Job names

2023-10-22 Thread Lewis John McGibbney (Jira)



 [ 
https://issues.apache.org/jira/browse/NUTCH-3014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-3014:

Summary: Standardize Job names  (was: Standardize NutchJob job names)

> Standardize Job names
> -
>
> Key: NUTCH-3014
> URL: https://issues.apache.org/jira/browse/NUTCH-3014
> Project: Nutch
> Issue Type: Improvement
> Components: configuration, runtime
> Affects Versions: 1.19
> Reporter: Lewis John McGibbney
> Assignee: Lewis John McGibbney
> Priority: Minor
> Fix For: 1.20
>
>
> There is a large degree of variability when we set the job name, e.g.
>  
> {{Job job = NutchJob.getInstance(getConf());}}
> {{job.setJobName("read " + segment);}}
>  
> Some examples mention the job name, others don't. Some use upper case, others 
> don't, etc.
> I think we can standardize the NutchJob job names. This would help when 
> filtering jobs in YARN ResourceManager UI as well.
> I propose we implement the following convention
>  * *Nutch* (mandatory) - static value which prepends the job name, assists 
> with distinguishing the Job as a NutchJob and making it easily findable.
>  * *${ClassName}* (mandatory) - literally the name of the Class the job is 
> encoded in
>  * *${additional info}* (optional) - value could further distinguish the type 
> of job (LinkRank Counter, LinkRank Initializer, LinkRank Inverter, etc.)
> _*Nutch ${ClassName}* *${additional info}*_
> _Examples:_
>  * _Nutch LinkRank Inverter_
>  * _Nutch CrawlDb + $crawldb_
>  * _Nutch LinkDbReader + $linkdb_
> Thanks for any suggestions/comments.




[jira] [Resolved] (NUTCH-3013) Employ commons-lang3's StopWatch to simplify timing logic

2023-10-21 Thread Lewis John McGibbney (Jira)



 [ 
https://issues.apache.org/jira/browse/NUTCH-3013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney resolved NUTCH-3013.
-
Resolution: Fixed

Thanks for the review [~snagel] 

> Employ commons-lang3's StopWatch to simplify timing logic
> -
>
> Key: NUTCH-3013
> URL: https://issues.apache.org/jira/browse/NUTCH-3013
> Project: Nutch
> Issue Type: Improvement
> Components: logging, runtime, util
> Affects Versions: 1.19
> Reporter: Lewis John McGibbney
> Assignee: Lewis John McGibbney
> Priority: Minor
> Labels: timing
> Fix For: 1.20
>
>
> I ended up running some experiments integrating Nutch and [Celeborn
> (Incubating)|https://celeborn.apache.org/] and it got me thinking about
> runtime timings. After some investigation I came across [commons-lang3's
> StopWatch
> Class|https://commons.apache.org/proper/commons-lang/javadocs/api-release/index.html?org/apache/commons/lang3/time/StopWatch.html],
> which provides a convenient API for timings.
> Seeing as we already declare the commons-lang3 dependency, I think StopWatch 
> could help us clean up some timing logic in Nutch. Specifically, it would 
> reduce redundancy in terms of duplicated code and logic. It would also open 
> the door to introduce timing _*splits*_ if anyone is so inclined to dig 
> deeper into runtime timings.
> A cursory search for *_"long start = System.currentTimeMillis();"_* returns 
> hits for 32 files so it's fair to say that timing already affects lots of 
> aspects of the Nutch execution workflow.
>  
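
To illustrate the clean-up described above, here is a minimal sketch contrasting the manual System.currentTimeMillis() pattern with a stopwatch-style API. So that it runs with only the JDK, it uses a small hypothetical SimpleStopWatch stand-in rather than commons-lang3 itself; in Nutch the real org.apache.commons.lang3.time.StopWatch, which offers start(), stop() and getTime(), would take its place. All class and method names below are illustrative assumptions, not Nutch code.

```java
import java.util.concurrent.TimeUnit;

/** Sketch only: manual timing versus a stopwatch-style API. */
final class TimingSketch {

    /** Hypothetical stand-in for commons-lang3's StopWatch (start/stop/getTime). */
    static final class SimpleStopWatch {
        private long startNanos;
        private long stopNanos;
        private boolean running;

        public void start() {
            startNanos = System.nanoTime();
            running = true;
        }

        public void stop() {
            stopNanos = System.nanoTime();
            running = false;
        }

        /** Elapsed milliseconds; measured up to "now" while still running. */
        public long getTime() {
            long end = running ? System.nanoTime() : stopNanos;
            return TimeUnit.NANOSECONDS.toMillis(end - startNanos);
        }
    }

    /** The pattern currently repeated across many files: subtract two timestamps. */
    public static long timeManually(Runnable task) {
        long start = System.currentTimeMillis();
        task.run();
        long end = System.currentTimeMillis();
        return end - start;
    }

    /** The same measurement with the bookkeeping hidden behind the stopwatch. */
    public static long timeWithStopWatch(Runnable task) {
        SimpleStopWatch watch = new SimpleStopWatch();
        watch.start();
        task.run();
        watch.stop();
        return watch.getTime();
    }

    /** Sleeps for the given time and restores the interrupt flag if interrupted. */
    public static void sleepQuietly(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        Runnable task = () -> sleepQuietly(50);
        System.out.println("manual:    " + timeManually(task) + " ms");
        System.out.println("stopwatch: " + timeWithStopWatch(task) + " ms");
    }
}
```

The design point is that the start/stop bookkeeping lives in one place, so call sites no longer repeat the subtraction, and split timings could be added later without touching them.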




[jira] [Closed] (NUTCH-3013) Employ commons-lang3's StopWatch to simplify timing logic

2023-10-21 Thread Lewis John McGibbney (Jira)



 [ 
https://issues.apache.org/jira/browse/NUTCH-3013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney closed NUTCH-3013.
---

> Employ commons-lang3's StopWatch to simplify timing logic
> -
>
> Key: NUTCH-3013
> URL: https://issues.apache.org/jira/browse/NUTCH-3013
> Project: Nutch
> Issue Type: Improvement
> Components: logging, runtime, util
> Affects Versions: 1.19
> Reporter: Lewis John McGibbney
> Assignee: Lewis John McGibbney
> Priority: Minor
> Labels: timing
> Fix For: 1.20
>
>
> I ended up running some experiments integrating Nutch and [Celeborn
> (Incubating)|https://celeborn.apache.org/] and it got me thinking about
> runtime timings. After some investigation I came across [commons-lang3's
> StopWatch
> Class|https://commons.apache.org/proper/commons-lang/javadocs/api-release/index.html?org/apache/commons/lang3/time/StopWatch.html],
> which provides a convenient API for timings.
> Seeing as we already declare the commons-lang3 dependency, I think StopWatch 
> could help us clean up some timing logic in Nutch. Specifically, it would 
> reduce redundancy in terms of duplicated code and logic. It would also open 
> the door to introduce timing _*splits*_ if anyone is so inclined to dig 
> deeper into runtime timings.
> A cursory search for *_"long start = System.currentTimeMillis();"_* returns 
> hits for 32 files so it's fair to say that timing already affects lots of 
> aspects of the Nutch execution workflow.
>  




Re: Roll-Call for Apache Flagon

2023-10-21 Thread lewis john mcgibbney

I’m here.
lewismc

On Sat, Oct 21, 2023 at 08:28 Christofer Dutz 
wrote:

> Hi all,
>
>
>
> I was tasked at the last board report to pursue a roll call for Apache
> Flagon after we saw that a VOTE thread has currently been open for over 2
> weeks with only one vote (which was “-0”).
>
> Also seeing that only 2 people have done any commits in the last few
> months feels rather strange for a project that has been a TLP for only 7
> months now.
>
>
>
>
> Please reply to this thread if you’re still willing and able to contribute
> to this project.
>
>
>
> Thanks,
>
>  Chris
>

[jira] [Created] (NUTCH-3014) Standardize NutchJob job names

2023-10-21 Thread Lewis John McGibbney (Jira)

Lewis John McGibbney created NUTCH-3014:
---

 Summary: Standardize NutchJob job names
 Key: NUTCH-3014
 URL: https://issues.apache.org/jira/browse/NUTCH-3014
 Project: Nutch
 Issue Type: Improvement
 Components: configuration, runtime
 Affects Versions: 1.19
 Reporter: Lewis John McGibbney
 Assignee: Lewis John McGibbney
 Fix For: 1.20


There is a large degree of variability when we set the job name, e.g.

 

{{Job job = NutchJob.getInstance(getConf());}}

{{job.setJobName("read " + segment);}}

 

Some examples mention the job name, others don't. Some use upper case, others 
don't, etc.

I think we can standardize the NutchJob job names. This would help when 
filtering jobs in YARN ResourceManager UI as well.

I propose we implement the following convention
 * *Nutch* (mandatory) - static value which prepends the job name, assists with 
distinguishing the Job as a NutchJob and making it easily findable.
 * *${ClassName}* (mandatory) - literally the name of the Class the job is 
encoded in
 * *${additional info}* (optional) - value could further distinguish the type 
of job (LinkRank Counter, LinkRank Initializer, LinkRank Inverter, etc.)

_*Nutch ${ClassName}* *${additional info}*_

_Examples:_
 * _Nutch LinkRank Inverter_
 * _Nutch CrawlDb + $crawldb_
 * _Nutch LinkDbReader + $linkdb_
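
The convention above can be sketched as a small helper. This is only an illustration: the JobNames class and its of(...) methods are hypothetical names, not part of Nutch, and a real patch would presumably pass the result to job.setJobName(...).

```java
import java.util.Objects;

/** Hypothetical helper (not part of Nutch) illustrating the proposed naming convention. */
final class JobNames {

  /** Static prefix which marks a job as a NutchJob. */
  static final String PREFIX = "Nutch";

  private JobNames() {}

  /** Builds "Nutch ${ClassName}" when no additional info is given. */
  static String of(Class<?> clazz) {
    return of(clazz, null);
  }

  /** Builds "Nutch ${ClassName} ${additional info}"; the additional info is optional. */
  static String of(Class<?> clazz, String additionalInfo) {
    Objects.requireNonNull(clazz, "clazz");
    String name = PREFIX + " " + clazz.getSimpleName();
    if (additionalInfo == null || additionalInfo.trim().isEmpty()) {
      return name;
    }
    return name + " " + additionalInfo.trim();
  }

  public static void main(String[] args) {
    // String.class merely stands in for a Nutch job class such as LinkRank.
    System.out.println(JobNames.of(String.class, "Inverter"));
  }
}
```

Deriving the middle part from the Class object rather than a hand-typed string is one way to keep the mandatory ${ClassName} segment consistent across jobs.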

Thanks for any suggestions/comments.




[jira] [Work started] (NUTCH-3013) Employ commons-lang3's StopWatch to simplify timing logic

2023-10-20 Thread Lewis John McGibbney (Jira)



 [ 
https://issues.apache.org/jira/browse/NUTCH-3013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on NUTCH-3013 started by Lewis John McGibbney.
---
> Employ commons-lang3's StopWatch to simplify timing logic
> -
>
> Key: NUTCH-3013
> URL: https://issues.apache.org/jira/browse/NUTCH-3013
> Project: Nutch
>  Issue Type: Improvement
>  Components: logging, runtime, util
>Affects Versions: 1.19
>    Reporter: Lewis John McGibbney
>    Assignee: Lewis John McGibbney
>Priority: Minor
>  Labels: timing
> Fix For: 1.20
>
>
> I ended up running some experiments integrating Nutch and [Celeborn 
> (Incubating)|https://celeborn.apache.org/] and it got me thinking about 
> runtime timings. After some investigation I came across [commons-lang3's 
> StopWatch 
> Class|https://commons.apache.org/proper/commons-lang/javadocs/api-release/index.html?org/apache/commons/lang3/time/StopWatch.html]
>  which provides a convenient API for timings.
> Seeing as we already declare the commons-lang3 dependency, I think StopWatch 
> could help us clean up some timing logic in Nutch. Specifically, it would 
> reduce redundancy in terms of duplicated code and logic. It would also open 
> the door to introduce timing _*splits*_ if anyone is so inclined to dig 
> deeper into runtime timings.
> A cursory search for *_"long start = System.currentTimeMillis();"_* returns 
> hits for 32 files so it's fair to say that timing already affects lots of 
> aspects of the Nutch execution workflow.
>  




[jira] [Created] (NUTCH-3013) Employ commons-lang3's StopWatch to simplify timing logic

2023-10-20 Thread Lewis John McGibbney (Jira)

Lewis John McGibbney created NUTCH-3013:
---

 Summary: Employ commons-lang3's StopWatch to simplify timing logic
 Key: NUTCH-3013
 URL: https://issues.apache.org/jira/browse/NUTCH-3013
 Project: Nutch
  Issue Type: Improvement
  Components: logging, runtime, util
Affects Versions: 1.19
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
 Fix For: 1.20


I ended up running some experiments integrating Nutch and [Celeborn 
(Incubating)|https://celeborn.apache.org/] and it got me thinking about runtime 
timings. After some investigation I came across [commons-lang3's StopWatch 
Class|https://commons.apache.org/proper/commons-lang/javadocs/api-release/index.html?org/apache/commons/lang3/time/StopWatch.html]
 which provides a convenient API for timings.

Seeing as we already declare the commons-lang3 dependency, I think StopWatch 
could help us clean up some timing logic in Nutch. Specifically, it would 
reduce redundancy in terms of duplicated code and logic. It would also open the 
door to introduce timing _*splits*_ if anyone is so inclined to dig deeper into 
runtime timings.

A cursory search for *_"long start = System.currentTimeMillis();"_* returns 
hits for 32 files so it's fair to say that timing already affects lots of 
aspects of the Nutch execution workflow.
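
For illustration, the hand-rolled pattern which the search above matches looks roughly like the following. This is a stdlib-only sketch and not code taken from Nutch; the TimedStep class is hypothetical. The comments note the commons-lang3 StopWatch calls (createStarted(), stop(), getTime()) that could replace each step.

```java
/** Hypothetical example class; it is not taken from the Nutch code base. */
final class TimedStep {

  private TimedStep() {}

  /** Runs the task and returns the elapsed wall-clock time in milliseconds. */
  static long timeMillis(Runnable task) {
    // Hand-rolled pattern; commons-lang3 offers StopWatch.createStarted() for this step.
    long start = System.currentTimeMillis();
    task.run();
    // With a StopWatch one would call stop() and read getTime() instead of subtracting.
    long end = System.currentTimeMillis();
    return end - start;
  }

  public static void main(String[] args) {
    long elapsed = timeMillis(() -> {
      long sum = 0;
      for (int i = 0; i < 1_000_000; i++) {
        sum += i;
      }
      System.out.println("sum: " + sum);
    });
    System.out.println("elapsed: " + elapsed + " ms");
  }
}
```

Each call site currently repeats the start/end bookkeeping; that repetition is the redundancy the proposal aims to remove.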

 




No appenders could be found for logger (org.apache.celeborn.mapreduce.v2.app.MRAppMasterWithCeleborn)

2023-10-18 Thread lewis john mcgibbney

Hi user@,

I am making progress in my experiments integrating Nutch
1.20-SNAPSHOT, Hadoop 3.3.4 and Celeborn 0.4.0-SNAPSHOT-incubating!

In both the Hadoop word count example and with all of the Nutch
MapReduce jobs I run, I see the following output present in the YARN
container stderr log output

log4j:WARN No appenders could be found for logger
(org.apache.celeborn.mapreduce.v2.app.MRAppMasterWithCeleborn).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig
for more info.

Looking into the Celeborn source [0] I see that Celeborn uses Slf4j
over Log4j2 but I am not sure how that plays with the above Hadoop
distribution. I think some further configuration is required...

lewismc

[0] 
https://github.com/apache/incubator-celeborn/blob/a5dfd67d5b9bcb7d5da59f441ed1d60b4bc27cd3/client-mr/mr/src/main/java/org/apache/celeborn/mapreduce/v2/app/MRAppMasterWithCeleborn.java#L50

-- 
http://home.apache.org/~lewismc/
http://people.apache.org/keys/committer/lewismc

Re: java.lang.NumberFormatException: null when running Hadoop Mapreduce Wordcount example

2023-10-18 Thread Lewis John McGibbney

Hi Ethan,

Thanks for the advice!

As I am in an experimental phase, I decided to try again in pseudo-distributed 
mode...
I tried downgrading to Hadoop 3.2.1 (OpenJDK8) but apparently that Hadoop 
distribution doesn't run on Apple M1 chip!

I therefore tried again on Hadoop 3.3.4 and was successfully able to get the 
Hadoop MapReduce word count example running. I needed to configure and start 
YARN (which I did not do previously). 
I will create a pull request such that this is reflected in the Celeborn 
documentation.

I am running into other issues which I will create a new thread for.

Thank you
lewismc

On 2023/10/18 05:05:12 Ethan Feng wrote:
> Hi Lewis,
> 
> Sorry to hear that you're having trouble running the wordcount example
> with Celeborn.
> 
> Based on the information you shared, I would suggest you run MapReduce
> with Celeborn on a Hadoop cluster instead of pseudo-distributed mode.
> Celeborn client in MapReduce needs to write a config file into the
> Hadoop file system.
> 
> If that doesn't resolve the issue, please let me know and I'll be
> happy to help you troubleshoot further.
> 
> Celeborn has a Slack workspace if it is convenient for you to join. (
> https://join.slack.com/t/apachecelebor-kw08030/shared_invite/zt-1ju3hd5j8-4Z5keMdzpcVMspe4UJzF4Q
> )
> 
> Best regards,
> Ethan Feng
>

Re: [NEW FEATURE AVAILABLE] Celeborn support MapReduce engine.

2023-10-18 Thread Lewis John McGibbney

Excellent. Thanks for the heads up :) 
lewismc

On 2023/10/18 03:44:54 Ethan Feng wrote:
> Hi Lewis,
> 
> Thanks for reaching out.
> 
> I can confirm that future Celeborn releases will include the "mr"
> client jars, starting with Celeborn 0.4.0, whose release process will
> begin shortly.
> 
> If you have any further questions or concerns about using MapReduce
> with Celeborn, please don't hesitate to let me know.
> 
> Best regards,
> Ethan Feng

java.lang.NumberFormatException: null when running Hadoop Mapreduce Wordcount example

2023-10-17 Thread lewis john mcgibbney

Hi user@,

I cloned Celeborn (0.4.0-Incubating)
69defcad7f9423c9c24d2d22ead856b4225671c6 today and built it with the
-Pmr profile.

openjdk version "11.0.20.1" 2023-08-24
OpenJDK Runtime Environment Homebrew (build 11.0.20.1+0)
OpenJDK 64-Bit Server VM Homebrew (build 11.0.20.1+0, mixed mode)

Apache Hadoop 3.3.4 running in pseudo-distributed mode.

I made an attempt to start MapReduce with Celeborn as documented at
https://celeborn.apache.org/docs/latest/#start-mapreduce-with-celeborn

Everything goes well until I attempt to run the wordcount example, which
fails. The exceptions and stack traces are available at
https://paste.apache.org/3vvy6.

Is it likely that this has to do with the Hadoop version or is this a
known issue?

Thanks in advance for any help.
lewismc

-- 
http://home.apache.org/~lewismc/
http://people.apache.org/keys/committer/lewismc

Re: [NEW FEATURE AVAILABLE] Celeborn support MapReduce engine.

2023-10-17 Thread Lewis John McGibbney

Hi Ethan,

I'm just picking up Celeborn now and plan on running some experiments with the 
Apache Nutch (https://nutch.apache.org) project.

I downloaded Celeborn 0.3.1-incubating (2023-10-13) from the downloads page and 
noticed that no Celeborn client jars for MapReduce exist at 
$CELEBORN_HOME/mr/*.jar as suggested within the documentation at 
https://celeborn.apache.org/docs/latest/#add-celeborn-client-jar-to-mapreduces-classpath

I've just cloned the source (69defca) and built it with 
./build/make-distribution.sh -Pmr; I now see the 'mr' directory which I can 
use...

Out of curiosity will future Celeborn releases include the "mr" client jars?

Thank you
lewismc

On 2023/09/14 11:43:25 Ethan Feng wrote:
> Hello developers and users,
>   I am glad to announce that Celeborn now supports the MapReduce engine.
> Both Hadoop 2 and 3 are supported. If you are interested, please try it
> and send feedback on anything you like.
> 
> 
>   The quick start guide can be found here [
> https://celeborn.apache.org/docs/latest/ ].
>   The design doc can be found here [
> https://docs.google.com/document/d/1g4irlBucIAFNI42cFSuOVWYqOWSvuqpw_VBHmyyv8zo/edit?usp=sharing
> ].
> 
> Thanks,
> Ethan.
>

Establishing a Nutch development roadmap

2023-09-26 Thread lewis john mcgibbney

Hi dev@,

I've been at arm's length for a while as $dayjob changed and then
changed again over the last number of years.

With that being said, I wanted to start a thread on $title with the
goal of establishing some "big items" we could put on the roadmap and
maybe even publish...

Here are some of the things I've been thinking about (unordered):

* NUTCH-2940 Develop Gradle Core Build for Apache Nutch
* Metrics system integration cf. https://github.com/apache/nutch/pull/712
* Upgrading Javac version > 11
* Trade study to consider integrating (something like) Plugin
Framework for Java (PF4J) into Nutch
* porting Nutch to run on Apache Beam https://beam.apache.org/

Does anyone else have candidates they wish to add?

Thanks for your consideration.

lewismc


-- 
http://home.apache.org/~lewismc/
http://people.apache.org/keys/committer/lewismc

Re: [DISCUSS] Removing Any23 from Nutch?

2023-09-14 Thread lewis john mcgibbney

+1 Tim.


On Wed, Sep 13, 2023 at 16:50 

>
>
>
> -- Forwarded message --
> From: Tim Allison 
> To: user@nutch.apache.org, d...@nutch.apache.org
> Cc:
> Bcc:
> Date: Wed, 13 Sep 2023 10:50:08 -0400
> Subject: [DISCUSS] Removing Any23 from Nutch?
> All,
>   I opened https://issues.apache.org/jira/browse/NUTCH-2998 a few weeks
> ago.  Any23 was moved to the attic in June. Unless there are objections, I
> propose removing it from Nutch before the next release.
>   Any objections?
>
>Best,
>
>Tim
>

Yahoo's Burst

2023-05-18 Thread lewis john mcgibbney

Hi user@,
I stumbled across Burst today...
It looks like it is under active development and the documentation is
lacking for loading data via a client.
https://github.com/yahoo/burst
lewismc

-- 
http://home.apache.org/~lewismc/
http://people.apache.org/keys/committer/lewismc

Re: [VOTE] Move OODT to Attic?

2023-04-05 Thread Lewis John McGibbney

+1 move to the attic.
I share Sean's sentiment entirely. A real success story.
Thanks Imesha for representing the project to the Board.
lewismc

On 2023/04/03 01:02:01 Imesha Sudasingha wrote:
> Hello everyone:
> 
> Due to inactivity, Apache OODT is considering moving to the Attic [1]. This
> email serves as a call to all PMC members to vote whether to retire OODT to
> the attic, or not.  Note that three -1 votes will be sufficient to cancel
> retirement to the attic no matter how many +1 votes there are.
> 
> PMC members, please reply to this email with your vote:
> 
> +1 [ ] I wish for Apache OODT to be retired to the Apache Attic
> +0 [ ] I do not care
> -1 [ ] Apache OODT should not be retired to the Attic
> 
> Here's my +1.
> 
> Thanks,
> Imesha
> 
> [1] https://attic.apache.org/
>

Re: FLAGON IS A TOP LEVEL PROJECT

2023-03-23 Thread lewis john mcgibbney

Congrats community.
lewismc

On Wed, Mar 22, 2023 at 19:55 Joshua Poore  wrote:

> All,
>
> I’m so excited to tell you that the ASF Board unanimously approved the
> resolution to establish Apache Flagon as an ASF Top Level Project.
>
> HUGE thanks to our community—PMC, committers, contributors, users. Apache
> projects are built from communities. Thank You!
>
> I also want to congratulate @Jyyjy—his recent pull request is our first
> official commit to Master as a TLP!
>
> PMC will work to migrate Apache Flagon from incubator to an autonomous
> TLP. Stay Tuned!
>
> Thanks to all of you—you all made this possible.
>
>
> Respectfully,
>
> Josh (VP, Apache Flagon)
>
> --
http://home.apache.org/~lewismc/
http://people.apache.org/keys/committer/lewismc

Re: Tika server crashes

2023-03-20 Thread Lewis John McGibbney

Bit of a plug for tika-helm here folks...
Horizontal pod autoscaling [0] is available (off by default) and can be 
configured via values.yaml or overridden on the CLI.
This would mean that a tika-server would still be available in the event that 
one particular pod went down due to OOM.
See [1] for more details.
lewismc

[0] 
https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale-walkthrough/
[1] https://github.com/apache/tika-helm/blob/main/values.yaml#L99-L105
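
On the client side, the point in the quoted message below (that clients should be able to handle waiting for tika-server restarts) could be approached with a simple retry loop. A minimal sketch which assumes nothing about Tika's own client libraries: Retry.withBackoff is a hypothetical helper, and the attempt count and delay are placeholder values.

```java
import java.util.concurrent.Callable;

/** Hypothetical helper: retries a call while a server is restarting. */
final class Retry {

  private Retry() {}

  /**
   * Invokes the call up to maxAttempts times, sleeping between failed attempts.
   * The delay doubles after each failure (simple exponential backoff).
   */
  static <T> T withBackoff(Callable<T> call, int maxAttempts, long initialDelayMillis) {
    if (maxAttempts < 1) {
      throw new IllegalArgumentException("maxAttempts must be >= 1");
    }
    long delay = initialDelayMillis;
    RuntimeException last = null;
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        return call.call();
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        throw new IllegalStateException("interrupted during call", e);
      } catch (Exception e) {
        // e.g. connection refused while the forked server process restarts
        last = new IllegalStateException("attempt " + attempt + " failed", e);
      }
      if (attempt < maxAttempts) {
        try {
          Thread.sleep(delay);
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
          throw new IllegalStateException("interrupted while waiting", e);
        }
        delay *= 2;
      }
    }
    throw last;
  }
}
```

In practice the Callable would wrap whatever HTTP request the client sends to tika-server; the helper itself is independent of the transport.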

On 2023/03/08 20:29:49 Tim Allison wrote:
> HIT_MAX_FILES is expected.  We designed that in to periodically
> restart the server to avoid memory leaks in badly behaving parsers.
> You can configure a value for the max file threshold if necessary.
> 
> The restart failed, and that's a problem.  Let me look into the code,
> I thought we offered more grace than 6 seconds to restart the server.
> 
> Can you share any server settings in your config.xml?
> 
> Please remember that as of 2.x tika-server will shutdown on oom,
> timeouts and max_files, and clients should be able to handle waiting
> for tika-server restarts.
> 
> On Wed, Mar 8, 2023 at 2:58 PM Konstantin Gribov  wrote:
> >
> > Hello, Artur.
> >
> > How many concurrent requests did you have and are you running Tika Server 
> > on Windows? And what kind of files did you use?
> >
> > You may have hit the open files limit for any of a number of reasons, from 
> > a known Windows issue (the JVM process holds file descriptors for mmapped 
> > files until the process is killed), through a nofile limit that is simply 
> > too low, to a Tika bug in handling, for example, stdin/stdout for forked 
> > processes.
> >
> > Could you provide jvm thread dump and lsof output (or Windows analog)?
> >
> > --
> > Best regards,
> > Konstantin Gribov.
> >
> >
> > On Wed, Mar 8, 2023 at 4:26 PM Artur Auhatov via user 
> >  wrote:
> >>
> >> Hello!
> >>
> >> I have a few questions related to Tika server.
> >>
> >>
> >>
> >> We’ve started using tika-server in our environment. While testing the 
> >> reliability of tika-server we found that it crashes during a fork and can’t 
> >> fork anymore until we restart the main process. Is this a known problem?
> >>
> >> In order to start tika-server we use command: java -jar 
> >> tika-server-2.7.0.jar -c myconfig.xml
> >>
> >>
> >>
> >> There is a log message:
> >>
> >>
> >>
> >> 14:50:30,272 [INFO] [Thread-9]  - Shutting down forked process with 
> >> status: HIT_MAX_FILES [org.apache.tika.server.core.ServerStatusWatcher]
> >>
> >> INFO  [pool-2-thread-1] 14:50:30,816 
> >> org.apache.tika.server.core.TikaServerWatchDog forked process exited with 
> >> exit value 2
> >>
> >> INFO  [pool-2-thread-1] 14:50:36,876 
> >> org.apache.tika.server.core.TikaServerWatchDog about to shutdown process
> >>
> >> ERROR [main] 14:50:36,878 org.apache.tika.server.core.TikaServerCli Can't 
> >> start:
> >>
> >> java.util.concurrent.ExecutionException: java.lang.RuntimeException: 
> >> Forked process failed to start after 6022 (ms)
> >>
> >> at 
> >> java.util.concurrent.FutureTask.report(FutureTask.java:122) ~[?:?]
> >>
> >> at 
> >> java.util.concurrent.FutureTask.get(FutureTask.java:191) ~[?:?]
> >>
> >> at 
> >> org.apache.tika.server.core.TikaServerCli.mainLoop(TikaServerCli.java:121) 
> >> ~[tika-server-standard-2.7.0.jar:2.7.0]
> >>
> >> at 
> >> org.apache.tika.server.core.TikaServerCli.execute(TikaServerCli.java:93) 
> >> ~[tika-server-standard-2.7.0.jar:2.7.0]
> >>
> >> at 
> >> org.apache.tika.server.core.TikaServerCli.main(TikaServerCli.java:80) 
> >> ~[tika-server-standard-2.7.0.jar:2.7.0]
> >>
> >> Caused by: java.lang.RuntimeException: Forked process failed to start 
> >> after 6022 (ms)
> >>
> >> at 
> >> org.apache.tika.server.core.TikaServerWatchDog$ForkedProcess.<init>(TikaServerWatchDog.java:316)
> >>  ~[tika-server-standard-2.7.0.jar:2.7.0]
> >>
> >> at 
> >> org.apache.tika.server.core.TikaServerWatchDog$ForkedProcess.<init>(TikaServerWatchDog.java:287)
> >>  ~[tika-server-standard-2.7.0.jar:2.7.0]
> >>
> >> at 
> >> org.apache.tika.server.core.TikaServerWatchDog.startForkedProcess(TikaServerWatchDog.java:224)
> >>  ~[tika-server-standard-2.7.0.jar:2.7.0]
> >>
> >> at 
> >> org.apache.tika.server.core.TikaServerWatchDog.call(TikaServerWatchDog.java:143)
> >>  ~[tika-server-standard-2.7.0.jar:2.7.0]
> >>
> >> at 
> >> org.apache.tika.server.core.TikaServerWatchDog.call(TikaServerWatchDog.java:53)
> >>  ~[tika-server-standard-2.7.0.jar:2.7.0]
> >>
> >> at 
> >> java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
> >>
> >> at 
> >> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) 
> >> ~[?:?]
> >>
> >> at 
> >> java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
> >>
> >> at 
> >>

[jira] [Updated] (TIKA-3989) Upgrade tika-helm Horizontal Pod Autoscaling from autoscaling/v2beta1 to autoscaling/v2

2023-03-20 Thread Lewis John McGibbney (Jira)



 [ 
https://issues.apache.org/jira/browse/TIKA-3989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated TIKA-3989:
---
Description: The _*autoscaling/v2beta1*_ API is superseded with 
{_}*autoscaling/v2*{_}. This is documented thoroughly at 
[https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale-walkthrough/]
  (was: The _*autoscaling/v2beta1*_ API is superseded with autoscaling/v2. This 
is documented thoroughly at 
https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale-walkthrough/)

> Upgrade tika-helm Horizontal Pod Autoscaling from autoscaling/v2beta1 to 
> autoscaling/v2
> --
>
> Key: TIKA-3989
> URL: https://issues.apache.org/jira/browse/TIKA-3989
> Project: Tika
>  Issue Type: Task
>  Components: helm
>    Reporter: Lewis John McGibbney
>    Assignee: Lewis John McGibbney
>Priority: Minor
>
> The _*autoscaling/v2beta1*_ API is superseded with {_}*autoscaling/v2*{_}. 
> This is documented thoroughly at 
> [https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale-walkthrough/]




[jira] [Created] (TIKA-3989) Upgrade tika-helm Horizontal Pod Autoscaling from autoscaling/v2beta1 to autoscaling/v2

2023-03-20 Thread Lewis John McGibbney (Jira)

Lewis John McGibbney created TIKA-3989:
--

 Summary: Upgrade tika-helm Horizontal Pod Autoscaling from 
autoscaling/v2beta1 to autoscaling/v2
 Key: TIKA-3989
 URL: https://issues.apache.org/jira/browse/TIKA-3989
 Project: Tika
  Issue Type: Task
  Components: helm
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney


The _*autoscaling/v2beta1*_ API is superseded with autoscaling/v2. This is 
documented thoroughly in 

[https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale-walkthrough/|https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale-walkthrough/




[jira] [Updated] (TIKA-3989) Upgrade tika-helm Horizontal Pod Autoscaling from autoscaling/v2beta1 to autoscaling/v2

2023-03-20 Thread Lewis John McGibbney (Jira)



 [ 
https://issues.apache.org/jira/browse/TIKA-3989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated TIKA-3989:
---
Description: The _*autoscaling/v2beta1*_ API is superseded with 
autoscaling/v2. This is documented thoroughly at 
https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale-walkthrough/
  (was: The _*autoscaling/v2beta1*_ API is superseded with autoscaling/v2. This 
is documented thoroughly in 

[https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale-walkthrough/|https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale-walkthrough/)

> Upgrade tika-helm Horizontal Pod Autoscaling from autoscaling/v2beta1 to 
> autoscaling/v2
> --
>
> Key: TIKA-3989
> URL: https://issues.apache.org/jira/browse/TIKA-3989
> Project: Tika
>  Issue Type: Task
>  Components: helm
>    Reporter: Lewis John McGibbney
>    Assignee: Lewis John McGibbney
>Priority: Minor
>
> The _*autoscaling/v2beta1*_ API is superseded with autoscaling/v2. This is 
> documented thoroughly at 
> https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale-walkthrough/




[jira] [Closed] (TIKA-3988) Add Github Action to Lint and Test Charts

2023-03-20 Thread Lewis John McGibbney (Jira)



 [ 
https://issues.apache.org/jira/browse/TIKA-3988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney closed TIKA-3988.
--

> Add Github Action to Lint and Test Charts
> -
>
> Key: TIKA-3988
> URL: https://issues.apache.org/jira/browse/TIKA-3988
> Project: Tika
>  Issue Type: Improvement
>  Components: helm
>Affects Versions: 2.7.0
>    Reporter: Lewis John McGibbney
>    Assignee: Lewis John McGibbney
>Priority: Major
> Fix For: 2.7.0
>
>
> The [chart-testing-action|https://github.com/helm/chart-testing-action] will 
> improve CI for the tika-helm. PR coming up.




[jira] [Resolved] (TIKA-3988) Add Github Action to Lint and Test Charts

2023-03-20 Thread Lewis John McGibbney (Jira)



 [ 
https://issues.apache.org/jira/browse/TIKA-3988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney resolved TIKA-3988.

Resolution: Fixed

> Add Github Action to Lint and Test Charts
> -
>
> Key: TIKA-3988
> URL: https://issues.apache.org/jira/browse/TIKA-3988
> Project: Tika
>  Issue Type: Improvement
>  Components: helm
>Affects Versions: 2.7.0
>    Reporter: Lewis John McGibbney
>    Assignee: Lewis John McGibbney
>Priority: Major
> Fix For: 2.7.0
>
>
> The [chart-testing-action|https://github.com/helm/chart-testing-action] will 
> improve CI for the tika-helm. PR coming up.




[ANNOUNCEMENT] Apache Tika Helm Chart v2.7.0 and v2.7.0-full released

2023-03-19 Thread lewis john mcgibbney

The Tika PMC is happy to announce that tika-helm v2.7.0 and
v2.7.0-full Charts are now available.

Documentation can be found at https://github.com/apache/tika-helm#readme

Please register support and feedback at
https://github.com/apache/tika-helm/pulls

Thanks to everyone who contributed to these releases.

Happy Helm'ing...

lewismc

-- 
http://home.apache.org/~lewismc/
http://people.apache.org/keys/committer/lewismc

[jira] [Commented] (TIKA-3988) Add Github Action to Lint and Test Charts

2023-03-19 Thread Lewis John McGibbney (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-3988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17702421#comment-17702421
 ] 

Lewis John McGibbney commented on TIKA-3988:


It looks like there are some permission issues which need to be resolved 
before the Github action can be run. I got in touch with INFRA about this. The 
Github Action output is as follows:
{quote}
Error: .github#L1
helm/chart-testing-action@v2.3.1 and helm/kind-action@v1.4.0 are not allowed to 
be used in apache/tika-helm. Actions in this workflow must be: within a 
repository owned by apache, created by GitHub, verified in the GitHub 
Marketplace, or matching the following: 
{*}/{*}@[a-f0-9][a-f0-9][a-f0-9][a-f0-9][a-f0-9][a-f0-9][a-f0-9]+, 
AdoptOpenJDK/install-jdk@{*}, 
JamesIves/github-pages-deploy-action@5dc1d5a192aeb5ab5b7d5a77b7d36aea4a7f5c92, 
TobKed/label-when-approved-action@{*}, actions-cool/issues-helper@{*}, 
actions-rs/{*}, al-cheb/configure-pagefile-action@{*}, 
amannn/action-semantic-pull-request@{*}, apache/{*}, 
burrunan/gradle-cache-action@{*}, bytedeco/javacpp-presets/.github/actions/{*}, 
chromaui/action@{*}, codecov/codecov-action@{*}, 
conda-incubator/setup-miniconda@{*}, container-tools/kind-action@{*}, 
container-tools/microshift-action@{*}, dawidd6/action-download-artifact@{*}, 
delaguardo/setup-graalvm@{*}, docker://jekyll/jekyll:{*}, 
docker://pandoc/core:2.9, eps1lon/actions-label-merge-conflict@{*}, 
gaurav-nelson/gith...
{quote}
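
The allow-list pattern in the output above (any owner/repository followed by @ and at least seven hexadecimal characters) suggests that actions referenced by commit SHA, rather than by tag, would be accepted. A rough sketch of that check, assuming the glob translates to the regular expression below; ActionRefs is a hypothetical class and this is not how GitHub or INFRA evaluate the policy internally.

```java
import java.util.regex.Pattern;

/** Hypothetical check for whether an action reference is pinned to a commit SHA. */
final class ActionRefs {

  // Assumed translation of the allow-list glob: owner/repo@ plus 7 or more hex characters.
  private static final Pattern SHA_PINNED = Pattern.compile("[^@]+/[^@]+@[a-f0-9]{7,}");

  private ActionRefs() {}

  /** Returns true when the reference ends in a lower-case hexadecimal commit id. */
  static boolean isShaPinned(String ref) {
    return ref != null && SHA_PINNED.matcher(ref).matches();
  }

  public static void main(String[] args) {
    // Both references are copied from the output quoted above.
    System.out.println(isShaPinned("helm/chart-testing-action@v2.3.1"));
    System.out.println(isShaPinned(
        "JamesIves/github-pages-deploy-action@5dc1d5a192aeb5ab5b7d5a77b7d36aea4a7f5c92"));
  }
}
```

Note that a tag which happens to consist only of hexadecimal characters would also pass this check, so it is a syntactic test rather than proof that the reference is a commit.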

> Add Github Action to Lint and Test Charts
> -
>
> Key: TIKA-3988
> URL: https://issues.apache.org/jira/browse/TIKA-3988
> Project: Tika
>  Issue Type: Improvement
>  Components: helm
>Affects Versions: 2.7.0
>    Reporter: Lewis John McGibbney
>    Assignee: Lewis John McGibbney
>Priority: Major
> Fix For: 2.7.0
>
>
> The [chart-testing-action|https://github.com/helm/chart-testing-action] will 
> improve CI for the tika-helm. PR coming up.




[jira] [Created] (TIKA-3988) Add Github Action to Lint and Test Charts

2023-03-19 Thread Lewis John McGibbney (Jira)

Lewis John McGibbney created TIKA-3988:
--

 Summary: Add Github Action to Lint and Test Charts
 Key: TIKA-3988
 URL: https://issues.apache.org/jira/browse/TIKA-3988
 Project: Tika
  Issue Type: Improvement
  Components: helm
Affects Versions: 2.7.0
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
 Fix For: 2.7.0


The [chart-testing-action|https://github.com/helm/chart-testing-action] will 
improve CI for the tika-helm. PR coming up.




[jira] [Commented] (TIKA-3985) Automate tika-helm Chart releases with helm/chart-releaser-action

2023-03-19 Thread Lewis John McGibbney (Jira)



[ 
https://issues.apache.org/jira/browse/TIKA-3985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17702402#comment-17702402
 ] 

Lewis John McGibbney commented on TIKA-3985:


https://github.com/marketplace/actions/jfrog-cli-for-github-actions
https://github.com/helm/chart-releaser-action

> Automate tika-helm Chart releases with helm/chart-releaser-action 
> --
>
> Key: TIKA-3985
> URL: https://issues.apache.org/jira/browse/TIKA-3985
> Project: Tika
>  Issue Type: Improvement
>  Components: helm
>    Reporter: Lewis John McGibbney
>    Assignee: Lewis John McGibbney
>Priority: Major
> Fix For: 2.7.0
>
>
> I've received several requests for 
> [tika-helm|https://github.com/apache/tika-helm] releases to shadow 
> [tika-docker|https://github.com/apache/tika-docker].
> I found a Github action which will enable that. PR coming up.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Re: Userale Schema

2023-03-15 Thread lewis john mcgibbney

Big +1 on this. Would be useful as we are thinking about potentially
pushing data into OpenSearch in the future.
A schema and data types would be very useful.
Lewis

On Wed, Mar 15, 2023 at 1:48 PM Gedd Johnson  wrote:
>
> Hi all,
>
> As discussed in this PR, we'd like to ideate on the topic of implementing a 
> schema for the Userale client payloads that are sent to backend servers.
>
> First stab at a problem statement: Userale in its current state does not 
> implement any sort of schema for its payloads. Changes to the payload's shape 
> (as referenced in the PR linked above) can break data pipelines for 
> downstream users. How might we:
>
> 1. Validate and version a schema so that downstream users know the shape of 
> data they will receive
>
> 2. Maintain the flexible schema management that Userale currently offers
>
> Looking forward to the discussion!
>
> Best,
> Gedd Johnson
>

-- 
http://home.apache.org/~lewismc/
http://people.apache.org/keys/committer/lewismc

[jira] [Created] (TIKA-3985) Automate tika-helm Chart releases with helm/chart-releaser-action

2023-03-10 Thread Lewis John McGibbney (Jira)

Lewis John McGibbney created TIKA-3985:
--

 Summary: Automate tika-helm Chart releases with 
helm/chart-releaser-action 
 Key: TIKA-3985
 URL: https://issues.apache.org/jira/browse/TIKA-3985
 Project: Tika
  Issue Type: Improvement
  Components: helm
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
 Fix For: 2.7.0


I've received several requests for 
[tika-helm|https://github.com/apache/tika-helm] releases to shadow 
[tika-docker|https://github.com/apache/tika-docker].
I found a Github action which will enable that. PR coming up.




[jira] [Updated] (TIKA-3452) java.nio.file.FileSystemException Read-only file system

2023-03-03 Thread Lewis John McGibbney (Jira)



 [ 
https://issues.apache.org/jira/browse/TIKA-3452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated TIKA-3452:
---
Fix Version/s: 2.7.0
   (was: 2.0.0-BETA)

> java.nio.file.FileSystemException Read-only file system
> ---
>
> Key: TIKA-3452
> URL: https://issues.apache.org/jira/browse/TIKA-3452
> Project: Tika
>  Issue Type: Bug
>  Components: docker, helm
>    Reporter: Lewis John McGibbney
>    Assignee: Lewis John McGibbney
>Priority: Major
> Fix For: 2.7.0
>
>
> The following ExecutionException is thrown when I attempt to run [tika-docker 
> 2.0.0-BETA|https://hub.docker.com/layers/apache/tika/2.0.0-BETA-full/images/sha256-2d735f7bdf86e618a5390d92614a310697f9134d11a2b2e4c1c0cfcde1f68b1d?context=explore]
> {code:bash}
> SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
> SLF4J: Defaulting to no-operation (NOP) logger implementation
> SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further 
> details.
> java.util.concurrent.ExecutionException: java.nio.file.FileSystemException: 
> /tmp/apache-tika-server-forked-tmp-8374629799942405236: Read-only file system
>   at java.base/java.util.concurrent.FutureTask.report(FutureTask.java:122)
>   at java.base/java.util.concurrent.FutureTask.get(FutureTask.java:191)
>   at 
> org.apache.tika.server.core.TikaServerCli.mainLoop(TikaServerCli.java:116)
>   at 
> org.apache.tika.server.core.TikaServerCli.execute(TikaServerCli.java:88)
>   at org.apache.tika.server.core.TikaServerCli.main(TikaServerCli.java:66)
> Caused by: java.nio.file.FileSystemException: 
> /tmp/apache-tika-server-forked-tmp-8374629799942405236: Read-only file system
>   at 
> java.base/sun.nio.fs.UnixException.translateToIOException(UnixException.java:100)
>   at 
> java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:111)
>   at 
> java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:116)
>   at 
> java.base/sun.nio.fs.UnixFileSystemProvider.newByteChannel(UnixFileSystemProvider.java:219)
>   at java.base/java.nio.file.Files.newByteChannel(Files.java:375)
>   at java.base/java.nio.file.Files.createFile(Files.java:652)
>   at 
> java.base/java.nio.file.TempFileHelper.create(TempFileHelper.java:137)
>   at 
> java.base/java.nio.file.TempFileHelper.createTempFile(TempFileHelper.java:160)
>   at java.base/java.nio.file.Files.createTempFile(Files.java:917)
>   at 
> org.apache.tika.server.core.TikaServerWatchDog$ForkedProcess.<init>(TikaServerWatchDog.java:220)
>   at 
> org.apache.tika.server.core.TikaServerWatchDog$ForkedProcess.<init>(TikaServerWatchDog.java:210)
>   at 
> org.apache.tika.server.core.TikaServerWatchDog.call(TikaServerWatchDog.java:117)
>   at 
> org.apache.tika.server.core.TikaServerWatchDog.call(TikaServerWatchDog.java:50)
>   at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
>   at 
> java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
>   at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
>   at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
>   at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630)
>   at java.base/java.lang.Thread.run(Thread.java:832)
> {code}
> There are differences/improvements in the way the [tika-server child process 
> is 
> spawned|https://cwiki.apache.org/confluence/display/TIKA/TikaServer#TikaServer-MakingTikaServerRobusttoOOMs,InfiniteLoopsandMemoryLeaks]
>  in the 2.0.0-BETA docker image. I am investigating a fix.
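
The trace above fails inside Files.createTempFile because the temp directory sits on a read-only file system. As a rough illustration only (this is not the fix applied in Tika, and the class and method names are hypothetical), a startup probe along these lines could detect the condition before a child process is forked:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of Tika: probes whether temp files can be created.
public class TempDirCheck {

    /** Returns true if a temp file can be created (and then deleted) in the given directory. */
    static boolean canCreateTempFile(Path dir) {
        try {
            Path probe = Files.createTempFile(dir, "probe-", ".tmp");
            Files.deleteIfExists(probe);
            return true;
        } catch (IOException | SecurityException | UnsupportedOperationException e) {
            // FileSystemException ("Read-only file system") is a subclass of IOException,
            // as is NoSuchFileException for a missing directory.
            return false;
        }
    }

    public static void main(String[] args) {
        // java.io.tmpdir is the directory Files.createTempFile uses when none is given.
        Path tmp = Paths.get(System.getProperty("java.io.tmpdir"));
        System.out.println(tmp + " usable for temp files: " + canCreateTempFile(tmp));
    }
}
```

If the probe fails in a container, one option is to point java.io.tmpdir at a writable mount (for example with -Djava.io.tmpdir=...); whether that suits a given deployment is an assumption to verify.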



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Resolved] (TIKA-3452) java.nio.file.FileSystemException Read-only file system

2023-03-03 Thread Lewis John McGibbney (Jira)



 [ 
https://issues.apache.org/jira/browse/TIKA-3452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney resolved TIKA-3452.

Resolution: Fixed





[jira] [Assigned] (TEZ-4371) Implement ClientServiceDelegate.getJobCounters

2023-02-28 Thread Lewis John McGibbney (Jira)



 [ 
https://issues.apache.org/jira/browse/TEZ-4371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney reassigned TEZ-4371:
-

Assignee: Lewis John McGibbney

> Implement ClientServiceDelegate.getJobCounters
> --
>
> Key: TEZ-4371
> URL: https://issues.apache.org/jira/browse/TEZ-4371
> Project: Apache Tez
> Issue Type: Improvement
> Reporter: László Bodor
> Assignee: Lewis John McGibbney
> Priority: Major
>
> Details are 
> [here|https://issues.apache.org/jira/browse/NUTCH-2839?focusedCommentId=17471115&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17471115].
> Currently, when the Tez ClientProtocol intercepts MR job submission (YARNRunner), 
> the collection of counters is not implemented:
> {code}
>   public Counters getJobCounters(JobID jobId)
>       throws IOException, InterruptedException {
>     // FIXME needs counters support from DAG
>     // with a translation layer on client side
>     Counters empty = new Counters();
>     return empty;
>   }
> {code}
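
The FIXME in the snippet above asks for a client-side translation layer. A minimal, self-contained sketch of that idea follows. It deliberately uses plain maps and a hypothetical SimpleCounters class instead of the real Hadoop Counters or Tez DAG counter types, so every name here is an assumption for illustration rather than the Tez implementation:

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative only: none of these types are Hadoop or Tez classes.
public class CounterTranslationSketch {

    /** Hypothetical stand-in for a client-side counters object, grouped by counter group name. */
    static final class SimpleCounters {
        private final Map<String, Map<String, Long>> groups = new LinkedHashMap<>();

        void increment(String group, String name, long delta) {
            groups.computeIfAbsent(group, g -> new LinkedHashMap<>())
                  .merge(name, delta, Long::sum);
        }

        long get(String group, String name) {
            return groups.getOrDefault(group, Collections.<String, Long>emptyMap())
                         .getOrDefault(name, 0L);
        }
    }

    /** Copies counters reported by the execution engine (group -> name -> value) to the client-side structure. */
    static SimpleCounters translate(Map<String, Map<String, Long>> engineCounters) {
        SimpleCounters out = new SimpleCounters();
        engineCounters.forEach((group, counters) ->
            counters.forEach((name, value) -> out.increment(group, name, value)));
        return out;
    }

    public static void main(String[] args) {
        // Made-up counter names and values, purely for demonstration.
        Map<String, Map<String, Long>> engine = new LinkedHashMap<>();
        engine.put("ExampleGroup", Collections.singletonMap("RECORDS_READ", 42L));
        SimpleCounters translated = translate(engine);
        System.out.println(translated.get("ExampleGroup", "RECORDS_READ")); // prints 42
    }
}
```

The point of the sketch is only the shape of the mapping: instead of returning an empty object, the delegate would populate the returned counters from whatever the DAG reports.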




[jira] (TEZ-4371) Implement ClientServiceDelegate.getJobCounters

2023-02-28 Thread Lewis John McGibbney (Jira)



[ https://issues.apache.org/jira/browse/TEZ-4371 ]


Lewis John McGibbney deleted comment on TEZ-4371:
---

was (Author: lewismc):
[~abstractdog] I have to finish off NUTCH-2856 then I could make an effort to 
investigate and implement this improvement. I'll write here once I finish 
NUTCH-2856.

> Implement ClientServiceDelegate.getJobCounters
> --
>
> Key: TEZ-4371
> URL: https://issues.apache.org/jira/browse/TEZ-4371
> Project: Apache Tez
> Issue Type: Improvement
> Reporter: László Bodor
> Priority: Major
>
> Details are 
> [here|https://issues.apache.org/jira/browse/NUTCH-2839?focusedCommentId=17471115&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17471115].
> Currently, when the Tez ClientProtocol intercepts MR job submission (YARNRunner), 
> the collection of counters is not implemented:
> {code}
>   public Counters getJobCounters(JobID jobId)
>       throws IOException, InterruptedException {
>     // FIXME needs counters support from DAG
>     // with a translation layer on client side
>     Counters empty = new Counters();
>     return empty;
>   }
> {code}




[jira] [Assigned] (NUTCH-2856) Implement a protocol-smb plugin based on hierynomus/smbj

2023-02-28 Thread Lewis John McGibbney (Jira)



 [ 
https://issues.apache.org/jira/browse/NUTCH-2856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney reassigned NUTCH-2856:
---

Assignee: (was: Lewis John McGibbney)

> Implement a protocol-smb plugin based on hierynomus/smbj
> 
>
> Key: NUTCH-2856
> URL: https://issues.apache.org/jira/browse/NUTCH-2856
> Project: Nutch
> Issue Type: New Feature
> Components: external, plugin, protocol
> Reporter: Hiran Chaudhuri
> Priority: Major
> Fix For: 1.20
>
>
> The plugin protocol-smb advertized on 
> [https://cwiki.apache.org/confluence/display/NUTCH/PluginCentral] actually 
> refers to the JCIFS library. According to this library's homepage 
> [https://www.jcifs.org/]:
> _If you're looking for the latest and greatest open source Java SMB library, 
> this is not it. JCIFS has been in maintenance-mode-only for several years and 
> although what it does support works fine (SMB1, NTLMv2, midlc, MSRPC and 
> various utility classes), jCIFS does not support the newer SMB2/3 variants of 
> the SMB protocol which is slowly becoming required (Windows 10 requires 
> SMB2/3). JCIFS only supports SMB1 but Microsoft has deprecated SMB1 in their 
> products. *So if SMB1 is disabled on your network, JCIFS' file related 
> operations will NOT work.*_
> Looking at 
> [https://en.wikipedia.org/wiki/Server_Message_Block#SMB_/_CIFS_/_SMB1:|https://en.wikipedia.org/wiki/Server_Message_Block#SMB_/_CIFS_/_SMB1]
> _Microsoft added SMB1 to the Windows Server 2012 R2 deprecation list in June 
> 2013. Windows Server 2016 and some versions of Windows 10 Fall Creators 
> Update do not have SMB1 installed by default._
> In conclusion, the chances that the SMB1 protocol is installed and/or 
> configured are shrinking rapidly. Therefore some migration towards 
> SMB2/3 is required. Luckily, the JCIFS homepage lists alternatives:
>  * [jcifs-codelibs|https://github.com/codelibs/jcifs]
>  * [jcifs-ng|https://github.com/AgNO3/jcifs-ng]
>  * [smbj|https://github.com/hierynomus/smbj]




[jira] [Commented] (NUTCH-2988) Elasticsearch 7.13.2 compatible with ASL 2.0?

2023-02-28 Thread Lewis John McGibbney (Jira)



[ 
https://issues.apache.org/jira/browse/NUTCH-2988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17694741#comment-17694741
 ] 

Lewis John McGibbney commented on NUTCH-2988:
-

Actually, digging deeper, it looks like the v7.13.2 we consume is licensed under 
[Elastic License 
2.0|https://raw.githubusercontent.com/elastic/elasticsearch/v7.13.2/licenses/ELASTIC-LICENSE-2.0.txt].
 This is confirmed by
# 
https://central.sonatype.com/artifact/org.elasticsearch.client/elasticsearch-rest-high-level-client/7.13.2,
 and
# 
https://mvnrepository.com/artifact/org.elasticsearch.client/elasticsearch-rest-high-level-client/7.13.2

> Elasticsearch 7.13.2 compatible with ASL 2.0?
> -
>
> Key: NUTCH-2988
> URL: https://issues.apache.org/jira/browse/NUTCH-2988
> Project: Nutch
> Issue Type: Task
> Reporter: Tim Allison
> Priority: Minor
>
> In the latest release of at least the 1.x branch of Nutch, the elasticsearch 
> high level java client is at 7.13.2, which is after the great schism.  Or, 
> the last purely ASL 2.0 license was in 7.10.2.
> So, do we need to downgrade to 7.10.2 or is Elasticsearch's new licensing 
> plan suitable to be released within an ASF project?
> Or, is the client as opposed to the main search project still actually ASL 
> 2.0?
> Ref: https://github.com/elastic/elasticsearch/blob/v7.13.2/LICENSE.txt




[jira] [Commented] (NUTCH-2988) Elasticsearch 7.13.2 compatible with ASL 2.0?

2023-02-28 Thread Lewis John McGibbney (Jira)



[ 
https://issues.apache.org/jira/browse/NUTCH-2988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17694736#comment-17694736
 ] 

Lewis John McGibbney commented on NUTCH-2988:
-

It looks like the [elasticsearch-java 
client|https://github.com/elastic/elasticsearch-java/blob/v8.6.2/LICENSE.txt] 
is licensed under ALv2.0.

> Elasticsearch 7.13.2 compatible with ASL 2.0?
> -
>
> Key: NUTCH-2988
> URL: https://issues.apache.org/jira/browse/NUTCH-2988
> Project: Nutch
> Issue Type: Task
> Reporter: Tim Allison
> Priority: Minor
>
> In the latest release of at least the 1.x branch of Nutch, the elasticsearch 
> high level java client is at 7.13.2, which is after the great schism.  Or, 
> the last purely ASL 2.0 license was in 7.10.2.
> So, do we need to downgrade to 7.10.2 or is Elasticsearch's new licensing 
> plan suitable to be released within an ASF project?
> Or, is the client still actually ASL 2.0?
> Ref: https://github.com/elastic/elasticsearch/blob/v7.13.2/LICENSE.txt




[jira] [Updated] (TIKA-3452) java.nio.file.FileSystemException Read-only file system

2023-02-15 Thread Lewis John McGibbney (Jira)



 [ 
https://issues.apache.org/jira/browse/TIKA-3452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated TIKA-3452:
---
Summary: java.nio.file.FileSystemException Read-only file system  (was: 
java.nio.file.FileSystemException Read-only file system in 2.0.0-BETA 
tika-docker)

> java.nio.file.FileSystemException Read-only file system
> ---
>
> Key: TIKA-3452
> URL: https://issues.apache.org/jira/browse/TIKA-3452
> Project: Tika
> Issue Type: Bug
> Components: docker, helm
> Reporter: Lewis John McGibbney
> Assignee: Lewis John McGibbney
> Priority: Major
> Fix For: 2.0.0-BETA
>
>




Re: [ROLL CALL] Project status of Any23

2023-02-13 Thread Lewis John McGibbney

Hi,
I am here but have been busy with other tasks recently.
The project has been very quiet for quite a while and this has been reported to 
the Board in many previous reports.
I neglected to submit reports for two or three months but will attempt to 
rectify that.
I also try to push releases and ensure that dependency updates are made.
lewismc

On 2023/01/19 02:10:56 Willem Jiang wrote:
> Hi,
> 
> There has been no development activity on this project for more than 6
> months. The PMC failed to submit several board reports. Without a
> community or anyone working on the project, there is no project.
> 
> If any of the PMC members are still active, please indicate so by
> responding to this email.
> 
> Thank you, all!  I hope everyone is safe and healthy.
> 
> Willem Jiang
>

Re: user Digest 8 Nov 2022 10:16:05 -0000 Issue 3169

2022-11-08 Thread lewis john mcgibbney

Hi Mike,

Yes, it is possible to extend the TLD list. In fact, when the TLD list was
compiled the author left a note explicitly stating that it may not be
complete.
https://github.com/apache/nutch/blob/master/conf/domain-suffixes.xml.template
Please submit a PR if you wish to make any changes or additions. You can
use the parser checker tool to validate your change before creating the PR.
Thanks
lewismc
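
To illustrate why a missing TLD entry produces "domain":"google" for about.google, here is a small sketch of suffix-based domain extraction. It is not Nutch's actual code: the class name, the in-memory suffix set, and the fallback to the last host label are all assumptions chosen to mirror the behaviour reported in the quoted message below.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch; Nutch itself reads its suffixes from domain-suffixes.xml.
public class DomainSuffixSketch {

    /**
     * Returns the longest known suffix of the host plus one label to its left.
     * If no suffix is known, falls back to the last label of the host (an assumption).
     */
    static String domainOf(String url, Set<String> knownSuffixes) {
        String host = URI.create(url).getHost();
        if (host == null) {
            throw new IllegalArgumentException("URL has no host: " + url);
        }
        String[] labels = host.toLowerCase(Locale.ROOT).split("\\.");
        // Try candidates from longest to shortest so the longest known suffix wins.
        for (int i = 0; i < labels.length; i++) {
            String candidate = String.join(".", Arrays.copyOfRange(labels, i, labels.length));
            if (knownSuffixes.contains(candidate)) {
                return i == 0 ? candidate : labels[i - 1] + "." + candidate;
            }
        }
        return labels[labels.length - 1];
    }

    public static void main(String[] args) {
        String url = "https://about.google/intl/en_FR/how-our-business-works/";
        Set<String> suffixes = new HashSet<>(Arrays.asList("com", "fr"));
        System.out.println(domainOf(url, suffixes)); // prints google
        suffixes.add("google"); // extend the list with the new TLD
        System.out.println(domainOf(url, suffixes)); // prints about.google
    }
}
```

In Nutch the equivalent change would be an added entry in the suffix configuration rather than a code change.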

On Tue, Nov 8, 2022 at 02:16  wrote:

>
> -- Forwarded message --
> From: Mike 
> To: user@nutch.apache.org
> Cc:
> Bcc:
> Date: Tue, 8 Nov 2022 11:15:51 +0100
> Subject: Incomplete TLD List
> Hi!
> Some of the new TLDs are wrongly indexed by Nutch, is it possible to extend
> the TLD list?
>
> "url":"https://about.google/intl/en_FR/how-our-business-works/",
> "tstamp":"2022-11-06T17:22:14.808Z",
> "domain":"google",
> "digest":"3b9a23d42f200392d12a697bbb8d4d87",
>
>
> Thanks
>
> Mike
>
-- 
http://home.apache.org/~lewismc/
http://people.apache.org/keys/committer/lewismc

Re: [ VOTE ] Graduation of Flagon Project

2022-11-05 Thread lewis john mcgibbney

Excellent, please close the thread off with a RESULT title and then tally
the VOTEs and list who voted.
Thanks

On Sat, Nov 5, 2022 at 15:37 Austin Bennett 
wrote:

> On this thread, we have 6 +1s [ also another on the incubator thread ] and
> no 0 or -1 votes, so *the VOTE passes*, and we can confidently say the
> community is in favor of Graduation.
>
>
> On Thu, Nov 3, 2022 at 8:00 AM Evan Jones  wrote:
>
>> + 1
>>
>> Best
>>
>> Evan Jones
>> Website: www.ea-jones.com
>>
>>
>> On Thu, Nov 3, 2022 at 4:44 AM Furkan KAMACI 
>> wrote:
>>
>> > Definitely +1!
>> >
>> > On Tue, Nov 1, 2022 at 7:58 PM Amir Ghaemi  wrote:
>> >
>> >> +1
>> >>
>> >> Best Regards,
>> >> *Amir M. Ghaemi*
>> >>
>> >>
>> >> On Tue, Nov 1, 2022 at 7:06 AM Gedd Johnson 
>> wrote:
>> >>
>> >> > +1
>> >> >
>> >> > Best,
>> >> > Gedd Johnson
>> >> >
>> >> > On Mon, Oct 31, 2022 at 23:22 Joshua Poore 
>> wrote:
>> >> >
>> >> > > Emphatic +1 for me.
>> >> > >
>> >> > > Sincerely,
>> >> > >
>> >> > > Josh
>> >> > >
>> >> > >
>> >> > > On Oct 31, 2022, at 5:13 PM, lewis john mcgibbney <
>> lewi...@apache.org
>> >> >
>> >> > > wrote:
>> >> > >
>> >> > > +1
>> >> > >
>> >> > > On Mon, Oct 31, 2022 at 09:31 Austin Bennett 
>> >> wrote:
>> >> > >
>> >> > >> Hi Flagon Community,
>> >> > >> +1
>> >> > >> Given recent discussions around the graduation status of the project, it
>> >> > >> is time to work through the process.  We have had a recent discussion
>> >> > >> on-list, and consensus seems to be in favor of graduation.  The next step
>> >> > >> seems to be a recommendation that we make an official VOTE, per:
>> >> > >> https://incubator.apache.org/guides/graduation.html#community_graduation_vote
>> >> > >>
>> >> > >> *Please VOTE* for the actual record.  I will also let the Incubator know
>> >> > >> the vote is occurring [ per the link above ], and imagine that I will tally
>> >> > >> the votes later on Friday the 4th to allow for >= 72 hours; we'll see how
>> >> > >> this thread goes.
>> >> > >>
>> >> > >> Per https://www.apache.org/foundation/voting.html *ideally votes will
>> >> > >> be +1, 0, or -1*. And, as I understand it, only IPMC votes are binding,
>> >> > >> as found in https://incubator.apache.org/guides/participation.html.
>> >> > >>
>> >> > >>
>> >> > >> Please consider this my +1.
>> >> > >>
>> >> > >> The existing community has demonstrated addressing the requirements as
>> >> > >> found in the incubator guidelines for graduation.  Graduation is an
>> >> > >> important milestone signaling project maturity, and a great step towards
>> >> > >> ongoing growth and evolution.
>> >> > >>
>> >> > >> Cheers -
>> >> > >> Austin
>> >> > >>
>> >> > >> --
>> >> > > http://home.apache.org/~lewismc/
>> >> > > http://people.apache.org/keys/committer/lewismc
>> >> > >
>> >> > >
>> >> > >
>> >> >
>> >>
>> >
>>
> --
http://home.apache.org/~lewismc/
http://people.apache.org/keys/committer/lewismc
