[jira] [Created] (NUTCH-3074) Augment Javadoc for org/apache/nutch/protocol/Content.java
Lewis John McGibbney created NUTCH-3074: --- Summary: Augment Javadoc for org/apache/nutch/protocol/Content.java Key: NUTCH-3074 URL: https://issues.apache.org/jira/browse/NUTCH-3074 Project: Nutch Issue Type: Improvement Components: protocol Reporter: Lewis John McGibbney Fix For: 1.21 [~hiranchaudhuri]'s [question on user@|https://lists.apache.org/thread/6o0zsbjp9s5yn0pfkzh9rzjb09hnvh0c] prompted me to open the ticket. In short, we should augment the default and overloaded constructors in [Content.java|org/apache/nutch/protocol/Content.java] with Javadoc. This would help developers who are looking to implement Protocol plugins. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (NUTCH-3074) Augment Javadoc for org/apache/nutch/protocol/Content.java
[ https://issues.apache.org/jira/browse/NUTCH-3074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-3074: Description: [~hiranchaudhuri]'s [question on user@|https://lists.apache.org/thread/6o0zsbjp9s5yn0pfkzh9rzjb09hnvh0c] prompted me to open the ticket. In short, we should augment the default and overloaded constructors in [Content.java|https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/protocol/Content.java] with Javadoc. This would help developers who are looking to implement Protocol plugins. was: [~hiranchaudhuri]'s [question on user@|https://lists.apache.org/thread/6o0zsbjp9s5yn0pfkzh9rzjb09hnvh0c] prompted me to open the ticket. In short, we should augment the default and overloaded constructors in [Content.java|org/apache/nutch/protocol/Content.java] with Javadoc. This would help developers who are looking to implement Protocol plugins. > Augment Javadoc for org/apache/nutch/protocol/Content.java > -- > > Key: NUTCH-3074 > URL: https://issues.apache.org/jira/browse/NUTCH-3074 > Project: Nutch > Issue Type: Improvement > Components: protocol >Reporter: Lewis John McGibbney >Priority: Minor > Fix For: 1.21 > > > [~hiranchaudhuri]'s [question on > user@|https://lists.apache.org/thread/6o0zsbjp9s5yn0pfkzh9rzjb09hnvh0c] > prompted me to open the ticket. > In short, we should augment the default and overloaded constructors in > [Content.java|https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/protocol/Content.java] > with Javadoc. This would help developers who are looking to implement > Protocol plugins.
[jira] [Commented] (NUTCH-2856) Implement a protocol-smb plugin based on hierynomus/smbj
[ https://issues.apache.org/jira/browse/NUTCH-2856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17886549#comment-17886549 ] Lewis John McGibbney commented on NUTCH-2856: - Sorry for late response. I assigned this to you [~hiranchaudhuri] and will check the patch out in the short term. Thanks for the contributions (y) > Implement a protocol-smb plugin based on hierynomus/smbj > > > Key: NUTCH-2856 > URL: https://issues.apache.org/jira/browse/NUTCH-2856 > Project: Nutch > Issue Type: New Feature > Components: external, plugin, protocol >Reporter: Hiran Chaudhuri >Assignee: Hiran Chaudhuri >Priority: Major > Fix For: 1.21 > > > The plugin protocol-smb advertized on > [https://cwiki.apache.org/confluence/display/NUTCH/PluginCentral] actually > refers to the JCIFS library. According to this library's homepage > [https://www.jcifs.org/]: > _If you're looking for the latest and greatest open source Java SMB library, > this is not it. JCIFS has been in maintenance-mode-only for several years and > although what it does support works fine (SMB1, NTLMv2, midlc, MSRPC and > various utility classes), jCIFS does not support the newer SMB2/3 variants of > the SMB protocol which is slowly becoming required (Windows 10 requires > SMB2/3). JCIFS only supports SMB1 but Microsoft has deprecated SMB1 in their > products. *So if SMB1 is disabled on your network, JCIFS' file related > operations will NOT work.*_ > Looking at > [https://en.wikipedia.org/wiki/Server_Message_Block#SMB_/_CIFS_/_SMB1:|https://en.wikipedia.org/wiki/Server_Message_Block#SMB_/_CIFS_/_SMB1] > _Microsoft added SMB1 to the Windows Server 2012 R2 deprecation list in June > 2013. Windows Server 2016 and some versions of Windows 10 Fall Creators > Update do not have SMB1 installed by default._ > As a conclusion, the chances that SMB1 protocol is installed and/or > configured are getting vastly smaller. Therefore some migration towards > SMB2/3 is required. 
Luckily the JCIFS homepage lists alternatives: > * [jcifs-codelibs|https://github.com/codelibs/jcifs] > * [jcifs-ng|https://github.com/AgNO3/jcifs-ng] > * [smbj|https://github.com/hierynomus/smbj]
[jira] [Assigned] (NUTCH-2856) Implement a protocol-smb plugin based on hierynomus/smbj
[ https://issues.apache.org/jira/browse/NUTCH-2856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney reassigned NUTCH-2856: --- Assignee: Hiran Chaudhuri (was: Hiran Chaudhuri) > Implement a protocol-smb plugin based on hierynomus/smbj > > > Key: NUTCH-2856 > URL: https://issues.apache.org/jira/browse/NUTCH-2856 > Project: Nutch > Issue Type: New Feature > Components: external, plugin, protocol >Reporter: Hiran Chaudhuri >Assignee: Hiran Chaudhuri >Priority: Major > Fix For: 1.21 > > > The plugin protocol-smb advertized on > [https://cwiki.apache.org/confluence/display/NUTCH/PluginCentral] actually > refers to the JCIFS library. According to this library's homepage > [https://www.jcifs.org/]: > _If you're looking for the latest and greatest open source Java SMB library, > this is not it. JCIFS has been in maintenance-mode-only for several years and > although what it does support works fine (SMB1, NTLMv2, midlc, MSRPC and > various utility classes), jCIFS does not support the newer SMB2/3 variants of > the SMB protocol which is slowly becoming required (Windows 10 requires > SMB2/3). JCIFS only supports SMB1 but Microsoft has deprecated SMB1 in their > products. *So if SMB1 is disabled on your network, JCIFS' file related > operations will NOT work.*_ > Looking at > [https://en.wikipedia.org/wiki/Server_Message_Block#SMB_/_CIFS_/_SMB1:|https://en.wikipedia.org/wiki/Server_Message_Block#SMB_/_CIFS_/_SMB1] > _Microsoft added SMB1 to the Windows Server 2012 R2 deprecation list in June > 2013. Windows Server 2016 and some versions of Windows 10 Fall Creators > Update do not have SMB1 installed by default._ > As a conclusion, the chances that SMB1 protocol is installed and/or > configured are getting vastly smaller. Therefore some migration towards > SMB2/3 is required. 
Luckily the JCIFS homepage lists alternatives: > * [jcifs-codelibs|https://github.com/codelibs/jcifs] > * [jcifs-ng|https://github.com/AgNO3/jcifs-ng] > * [smbj|https://github.com/hierynomus/smbj]
[jira] [Assigned] (NUTCH-2856) Implement a protocol-smb plugin based on hierynomus/smbj
[ https://issues.apache.org/jira/browse/NUTCH-2856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney reassigned NUTCH-2856: --- Assignee: Hiran Chaudhuri (was: Lewis John McGibbney) > Implement a protocol-smb plugin based on hierynomus/smbj > > > Key: NUTCH-2856 > URL: https://issues.apache.org/jira/browse/NUTCH-2856 > Project: Nutch > Issue Type: New Feature > Components: external, plugin, protocol >Reporter: Hiran Chaudhuri >Assignee: Hiran Chaudhuri >Priority: Major > Fix For: 1.21 > > > The plugin protocol-smb advertized on > [https://cwiki.apache.org/confluence/display/NUTCH/PluginCentral] actually > refers to the JCIFS library. According to this library's homepage > [https://www.jcifs.org/]: > _If you're looking for the latest and greatest open source Java SMB library, > this is not it. JCIFS has been in maintenance-mode-only for several years and > although what it does support works fine (SMB1, NTLMv2, midlc, MSRPC and > various utility classes), jCIFS does not support the newer SMB2/3 variants of > the SMB protocol which is slowly becoming required (Windows 10 requires > SMB2/3). JCIFS only supports SMB1 but Microsoft has deprecated SMB1 in their > products. *So if SMB1 is disabled on your network, JCIFS' file related > operations will NOT work.*_ > Looking at > [https://en.wikipedia.org/wiki/Server_Message_Block#SMB_/_CIFS_/_SMB1:|https://en.wikipedia.org/wiki/Server_Message_Block#SMB_/_CIFS_/_SMB1] > _Microsoft added SMB1 to the Windows Server 2012 R2 deprecation list in June > 2013. Windows Server 2016 and some versions of Windows 10 Fall Creators > Update do not have SMB1 installed by default._ > As a conclusion, the chances that SMB1 protocol is installed and/or > configured are getting vastly smaller. Therefore some migration towards > SMB2/3 is required. 
Luckily the JCIFS homepage lists alternatives: > * [jcifs-codelibs|https://github.com/codelibs/jcifs] > * [jcifs-ng|https://github.com/AgNO3/jcifs-ng] > * [smbj|https://github.com/hierynomus/smbj]
[jira] [Work started] (NUTCH-3064) Upgrade com.maxmind.geoip2:geoip2 dependency in geoip-index to v4.2.0
[ https://issues.apache.org/jira/browse/NUTCH-3064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on NUTCH-3064 started by Lewis John McGibbney. --- > Upgrade com.maxmind.geoip2:geoip2 dependency in geoip-index to v4.2.0 > - > > Key: NUTCH-3064 > URL: https://issues.apache.org/jira/browse/NUTCH-3064 > Project: Nutch > Issue Type: Task > Components: index-geoip, plugin >Reporter: Lewis John McGibbney >Assignee: Lewis John McGibbney >Priority: Minor > Fix For: 1.21 > > > A recent mailing list question about the index-geoip plugin prompted me to > take a look at it and perform any necessary maintenance. > As of writing, the latest dependency can be found at > [https://central.sonatype.com/artifact/com.maxmind.geoip2/geoip2] at v4.2.0. > At a minimum this ticket will accomplish the dependency update. I'll also > have a look at documentation and maybe provide some unit tests... which I > neglected to furnish last time around.
[jira] [Created] (NUTCH-3064) Upgrade com.maxmind.geoip2:geoip2 dependency in geoip-index to v4.2.0
Lewis John McGibbney created NUTCH-3064: --- Summary: Upgrade com.maxmind.geoip2:geoip2 dependency in geoip-index to v4.2.0 Key: NUTCH-3064 URL: https://issues.apache.org/jira/browse/NUTCH-3064 Project: Nutch Issue Type: Task Components: index-geoip, plugin Reporter: Lewis John McGibbney Assignee: Lewis John McGibbney Fix For: 1.21 A recent mailing list question about the index-geoip plugin prompted me to take a look at it and perform any necessary maintenance. As of writing, the latest dependency can be found at [https://central.sonatype.com/artifact/com.maxmind.geoip2/geoip2] at v4.2.0. At a minimum this ticket will accomplish the dependency update. I'll also have a look at documentation and maybe provide some unit tests... which I neglected to furnish last time around.
[jira] [Commented] (NUTCH-3063) Support for "addBinaryContent" from REST API
[ https://issues.apache.org/jira/browse/NUTCH-3063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17867045#comment-17867045 ] Lewis John McGibbney commented on NUTCH-3063: - +1 for this patch. Any other reviewers? > Support for "addBinaryContent" from REST API > > > Key: NUTCH-3063 > URL: https://issues.apache.org/jira/browse/NUTCH-3063 > Project: Nutch > Issue Type: New Feature > Components: indexer >Affects Versions: 1.20 >Reporter: Isabelle Giguere >Assignee: Isabelle Giguere >Priority: Major > Fix For: 1.21 > > Attachments: NUTCH-3063.patch > > > NUTCH-1785 added the possibility of requesting the raw binary content, with > arg `addBinaryContent`, and possibly encode it as `base64`. > This functionality should also be supported from the REST API. > Integrating Nutch using the CLI is out of the question for some applications, > and at the same time some may need the raw content for further processing.
[jira] [Commented] (NUTCH-3063) Support for "addBinaryContent" from REST API
[ https://issues.apache.org/jira/browse/NUTCH-3063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17866863#comment-17866863 ] Lewis John McGibbney commented on NUTCH-3063: - [~igiguere] thanks for the patch. I assigned the issue to you :) Thanks for your contribution to the project. I will review the patch tonight. > Support for "addBinaryContent" from REST API > > > Key: NUTCH-3063 > URL: https://issues.apache.org/jira/browse/NUTCH-3063 > Project: Nutch > Issue Type: New Feature > Components: indexer >Affects Versions: 1.20 >Reporter: Isabelle Giguere >Assignee: Isabelle Giguere >Priority: Major > Fix For: 1.21 > > Attachments: NUTCH-3063.patch > > > NUTCH-1785 added the possibility of requesting the raw binary content, with > arg `addBinaryContent`, and possibly encode it as `base64`. > This functionality should also be supported from the REST API. > Integrating Nutch using the CLI is out of the question for some applications, > and at the same time some may need the raw content for further processing.
[jira] [Assigned] (NUTCH-3063) Support for "addBinaryContent" from REST API
[ https://issues.apache.org/jira/browse/NUTCH-3063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney reassigned NUTCH-3063: --- Assignee: Isabelle Giguere > Support for "addBinaryContent" from REST API > > > Key: NUTCH-3063 > URL: https://issues.apache.org/jira/browse/NUTCH-3063 > Project: Nutch > Issue Type: New Feature > Components: indexer >Affects Versions: 1.20 >Reporter: Isabelle Giguere >Assignee: Isabelle Giguere >Priority: Major > Fix For: 1.21 > > Attachments: NUTCH-3063.patch > > > NUTCH-1785 added the possibility of requesting the raw binary content, with > arg `addBinaryContent`, and possibly encode it as `base64`. > This functionality should also be supported from the REST API. > Integrating Nutch using the CLI is out of the question for some applications, > and at the same time some may need the raw content for further processing.
[jira] [Closed] (NUTCH-3041) Address confusing logging in o.a.n.net.URLExemptionFilters
[ https://issues.apache.org/jira/browse/NUTCH-3041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney closed NUTCH-3041. --- > Address confusing logging in o.a.n.net.URLExemptionFilters > --- > > Key: NUTCH-3041 > URL: https://issues.apache.org/jira/browse/NUTCH-3041 > Project: Nutch > Issue Type: Task > Components: net >Affects Versions: 1.19, 1.20 >Reporter: Lewis John McGibbney >Assignee: Lewis John McGibbney >Priority: Minor > Fix For: 1.21 > > > URLExemptionFilter implementations are used to allow exemptions to external > domain resources by overriding the {{db.ignore.external.links}} configuration > setting. This is useful when the crawl is focused on a domain but resources > like images are hosted on a CDN. > Currently [URLExemptionFilters|#L47-L48]] provides the following logging > {quote}INFO o.a.n.n.URLExemptionFilters [LocalJobRunner Map Task Executor > #0|#0] Found 0 extensions at point:'org.apache.nutch.net.URLExemptionFilter' > {quote} > I find this confusing. It would be better to log *only* if a > URLExemptionFilter implementation is actually configured to be used at > runtime. > I will provide a patch for this.
[jira] [Work stopped] (NUTCH-3041) Address confusing logging in o.a.n.net.URLExemptionFilters
[ https://issues.apache.org/jira/browse/NUTCH-3041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on NUTCH-3041 stopped by Lewis John McGibbney. --- > Address confusing logging in o.a.n.net.URLExemptionFilters > --- > > Key: NUTCH-3041 > URL: https://issues.apache.org/jira/browse/NUTCH-3041 > Project: Nutch > Issue Type: Task > Components: net >Affects Versions: 1.19, 1.20 >Reporter: Lewis John McGibbney >Assignee: Lewis John McGibbney >Priority: Minor > Fix For: 1.21 > > > URLExemptionFilter implementations are used to allow exemptions to external > domain resources by overriding the {{db.ignore.external.links}} configuration > setting. This is useful when the crawl is focused on a domain but resources > like images are hosted on a CDN. > Currently [URLExemptionFilters|#L47-L48]] provides the following logging > {quote}INFO o.a.n.n.URLExemptionFilters [LocalJobRunner Map Task Executor > #0|#0] Found 0 extensions at point:'org.apache.nutch.net.URLExemptionFilter' > {quote} > I find this confusing. It would be better to log *only* if a > URLExemptionFilter implementation is actually configured to be used at > runtime. > I will provide a patch for this.
[jira] [Resolved] (NUTCH-3041) Address confusing logging in o.a.n.net.URLExemptionFilters
[ https://issues.apache.org/jira/browse/NUTCH-3041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney resolved NUTCH-3041. - Resolution: Fixed > Address confusing logging in o.a.n.net.URLExemptionFilters > --- > > Key: NUTCH-3041 > URL: https://issues.apache.org/jira/browse/NUTCH-3041 > Project: Nutch > Issue Type: Task > Components: net >Affects Versions: 1.19, 1.20 >Reporter: Lewis John McGibbney >Assignee: Lewis John McGibbney >Priority: Minor > Fix For: 1.21 > > > URLExemptionFilter implementations are used to allow exemptions to external > domain resources by overriding the {{db.ignore.external.links}} configuration > setting. This is useful when the crawl is focused on a domain but resources > like images are hosted on a CDN. > Currently [URLExemptionFilters|#L47-L48]] provides the following logging > {quote}INFO o.a.n.n.URLExemptionFilters [LocalJobRunner Map Task Executor > #0|#0] Found 0 extensions at point:'org.apache.nutch.net.URLExemptionFilter' > {quote} > I find this confusing. It would be better to log *only* if a > URLExemptionFilter implementation is actually configured to be used at > runtime. > I will provide a patch for this.
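For illustration only, the behaviour requested in NUTCH-3041 (log only when at least one URLExemptionFilter is configured) could be sketched roughly as below. This is not the committed patch. The class `ExemptionFilterLogging` and its methods are hypothetical stand-ins, and `java.util.logging` is used purely to keep the sketch self-contained; it is an assumption that the real class would keep its existing logging framework.

```java
import java.util.List;
import java.util.Optional;
import java.util.logging.Logger;

// Hypothetical stand-in, NOT the Nutch URLExemptionFilters class.
class ExemptionFilterLogging {

  private static final Logger LOG =
      Logger.getLogger(ExemptionFilterLogging.class.getName());

  /**
   * Builds the log message for the configured filters, or returns an empty
   * Optional when no filter is configured so that nothing gets logged.
   */
  static Optional<String> describeActiveFilters(List<String> filterClassNames) {
    if (filterClassNames.isEmpty()) {
      return Optional.empty();
    }
    return Optional.of("Found " + filterClassNames.size()
        + " extensions at point:'org.apache.nutch.net.URLExemptionFilter'");
  }

  /** Emits the INFO line only if there is something to report. */
  static void logActiveFilters(List<String> filterClassNames) {
    describeActiveFilters(filterClassNames).ifPresent(msg -> LOG.info(msg));
  }
}
```

With an empty list, `logActiveFilters` stays silent, which is the case the ticket describes as confusing today.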
[jira] [Closed] (NUTCH-3054) Address deprecation of Node16 for all GitHub Actions
[ https://issues.apache.org/jira/browse/NUTCH-3054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney closed NUTCH-3054. --- > Address deprecation of Node16 for all GitHub Actions > > > Key: NUTCH-3054 > URL: https://issues.apache.org/jira/browse/NUTCH-3054 > Project: Nutch > Issue Type: Task > Components: ci/cd >Affects Versions: 1.20 >Reporter: Lewis John McGibbney >Assignee: Lewis John McGibbney >Priority: Minor > Fix For: 1.21 > > > See > [https://github.blog/changelog/2023-09-22-github-actions-transitioning-from-node-16-to-node-20/] > We need to upgrade the setup-java action in > [https://github.com/apache/nutch/blob/master/.github/workflows/master-build.yml] > > Patch coming up
[jira] [Resolved] (NUTCH-3054) Address deprecation of Node16 for all GitHub Actions
[ https://issues.apache.org/jira/browse/NUTCH-3054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney resolved NUTCH-3054. - Resolution: Fixed > Address deprecation of Node16 for all GitHub Actions > > > Key: NUTCH-3054 > URL: https://issues.apache.org/jira/browse/NUTCH-3054 > Project: Nutch > Issue Type: Task > Components: ci/cd >Affects Versions: 1.20 >Reporter: Lewis John McGibbney >Assignee: Lewis John McGibbney >Priority: Minor > Fix For: 1.21 > > > See > [https://github.blog/changelog/2023-09-22-github-actions-transitioning-from-node-16-to-node-20/] > We need to upgrade the setup-java action in > [https://github.com/apache/nutch/blob/master/.github/workflows/master-build.yml] > > Patch coming up
[jira] [Updated] (NUTCH-3054) Address deprecation of Node16 for all GitHub Actions
[ https://issues.apache.org/jira/browse/NUTCH-3054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-3054: Affects Version/s: 1.20 > Address deprecation of Node16 for all GitHub Actions > > > Key: NUTCH-3054 > URL: https://issues.apache.org/jira/browse/NUTCH-3054 > Project: Nutch > Issue Type: Task > Components: ci/cd >Affects Versions: 1.20 >Reporter: Lewis John McGibbney >Assignee: Lewis John McGibbney >Priority: Minor > Fix For: 1.21 > > > See > [https://github.blog/changelog/2023-09-22-github-actions-transitioning-from-node-16-to-node-20/] > We need to upgrade the setup-java action in > [https://github.com/apache/nutch/blob/master/.github/workflows/master-build.yml] > > Patch coming up
[jira] [Created] (NUTCH-3054) Address deprecation of Node16 for all GitHub Actions
Lewis John McGibbney created NUTCH-3054: --- Summary: Address deprecation of Node16 for all GitHub Actions Key: NUTCH-3054 URL: https://issues.apache.org/jira/browse/NUTCH-3054 Project: Nutch Issue Type: Task Components: ci/cd Reporter: Lewis John McGibbney Assignee: Lewis John McGibbney Fix For: 1.21 See [https://github.blog/changelog/2023-09-22-github-actions-transitioning-from-node-16-to-node-20/] We need to upgrade the setup-java action in [https://github.com/apache/nutch/blob/master/.github/workflows/master-build.yml] Patch coming up
[jira] [Work started] (NUTCH-3054) Address deprecation of Node16 for all GitHub Actions
[ https://issues.apache.org/jira/browse/NUTCH-3054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on NUTCH-3054 started by Lewis John McGibbney. --- > Address deprecation of Node16 for all GitHub Actions > > > Key: NUTCH-3054 > URL: https://issues.apache.org/jira/browse/NUTCH-3054 > Project: Nutch > Issue Type: Task > Components: ci/cd >Reporter: Lewis John McGibbney >Assignee: Lewis John McGibbney >Priority: Minor > Fix For: 1.21 > > > See > [https://github.blog/changelog/2023-09-22-github-actions-transitioning-from-node-16-to-node-20/] > We need to upgrade the setup-java action in > [https://github.com/apache/nutch/blob/master/.github/workflows/master-build.yml] > > Patch coming up
[jira] [Commented] (NUTCH-3049) Investigate using Records
[ https://issues.apache.org/jira/browse/NUTCH-3049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17842208#comment-17842208 ] Lewis John McGibbney commented on NUTCH-3049: - I think that each of the Writable classes mentioned in NutchWritable may be fair game {{ org.apache.nutch.crawl.CrawlDatum.class,}} {{ org.apache.nutch.crawl.Inlink.class,}} {{ org.apache.nutch.crawl.Inlinks.class,}} {{ org.apache.nutch.indexer.NutchIndexAction.class,}} {{ org.apache.nutch.metadata.Metadata.class,}} {{ org.apache.nutch.parse.Outlink.class,}} {{ org.apache.nutch.parse.ParseText.class,}} {{ org.apache.nutch.parse.ParseData.class,}} {{ org.apache.nutch.parse.ParseImpl.class,}} {{ org.apache.nutch.parse.ParseStatus.class,}} {{ org.apache.nutch.protocol.Content.class,}} {{ org.apache.nutch.protocol.ProtocolStatus.class,}} {{ org.apache.nutch.scoring.webgraph.LinkDatum.class,}} {{ org.apache.nutch.hostdb.HostDatum.class}} > Investigate using Records > - > > Key: NUTCH-3049 > URL: https://issues.apache.org/jira/browse/NUTCH-3049 > Project: Nutch > Issue Type: Sub-task >Reporter: Lewis John McGibbney >Priority: Major > > Guidance at [https://www.baeldung.com/java-migrate-8-to-17#records] > I think there are multiple areas where we could use Records. This ticket will > document the opportunities and structure that work.
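As a rough illustration of what a record-based data class with binary serialization might look like, here is a minimal sketch. `LinkRecord` is a hypothetical type, not one of the Nutch classes listed above, and it deliberately does not implement Hadoop's `Writable`. It uses only `java.io.DataInput` and `java.io.DataOutput` so that it compiles on its own with JDK 16 or later.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical record for illustration; NOT a Nutch class.
record LinkRecord(String url, String anchor) {

  // Compact constructor: validate and normalise once, at construction time.
  LinkRecord {
    if (url == null || url.isEmpty()) {
      throw new IllegalArgumentException("url must not be empty");
    }
    if (anchor == null) {
      anchor = "";
    }
  }

  // Serialise the components in a fixed order.
  void write(DataOutput out) throws IOException {
    out.writeUTF(url);
    out.writeUTF(anchor);
  }

  // Records are immutable, so reading is a static factory that returns a
  // new instance instead of filling in the fields of an existing one.
  static LinkRecord read(DataInput in) throws IOException {
    return new LinkRecord(in.readUTF(), in.readUTF());
  }

  // Helper that writes a record to memory and reads it back.
  static LinkRecord roundTrip(LinkRecord link) {
    try {
      ByteArrayOutputStream bytes = new ByteArrayOutputStream();
      link.write(new DataOutputStream(bytes));
      return read(new DataInputStream(
          new ByteArrayInputStream(bytes.toByteArray())));
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

One design point to weigh when documenting the opportunities: Hadoop's `Writable.readFields(DataInput)` populates an already-constructed object, whereas record components are final. The listed Writable classes therefore may not map onto records one-to-one, and the static-factory approach above is only one possible way around that.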
[jira] [Created] (NUTCH-3053) Upgrade build and CI to JDK17
Lewis John McGibbney created NUTCH-3053: --- Summary: Upgrade build and CI to JDK17 Key: NUTCH-3053 URL: https://issues.apache.org/jira/browse/NUTCH-3053 Project: Nutch Issue Type: Sub-task Components: build, ci/cd Reporter: Lewis John McGibbney This will involve changes to * [https://github.com/apache/nutch/blob/master/.github/workflows/master-build.yml] * [https://ci-builds.apache.org/job/Nutch/job/Nutch-trunk/] * [https://github.com/apache/nutch/blob/master/default.properties#L46] * [https://github.com/apache/nutch/blob/master/default.properties#L57] * We should also investigate any deprecation notices in the build output * [https://github.com/apache/nutch/blob/master/ivy/mvn.template#L128-L129]
[jira] [Created] (NUTCH-3052) Investigate using sealed classes
Lewis John McGibbney created NUTCH-3052: --- Summary: Investigate using sealed classes Key: NUTCH-3052 URL: https://issues.apache.org/jira/browse/NUTCH-3052 Project: Nutch Issue Type: Sub-task Reporter: Lewis John McGibbney Guidance available at [https://www.baeldung.com/java-migrate-8-to-17#sealed-classes] First document if and where sealed classes would add value.
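To make the sealed-classes idea concrete, a minimal sketch follows. The `FetchOutcome` hierarchy is entirely hypothetical and is not taken from the Nutch code base. It only shows the language feature (final in JDK 17): the set of permitted subtypes is closed and visible at the declaration.

```java
// Hypothetical hierarchy for illustration only; these are NOT Nutch types.
sealed interface FetchOutcome permits Fetched, Redirected, Failed {}

// Records are implicitly final, so they are valid permitted subtypes.
record Fetched(int httpStatus) implements FetchOutcome {}

record Redirected(String target) implements FetchOutcome {}

record Failed(String reason) implements FetchOutcome {}

class FetchOutcomes {

  // Because the hierarchy is sealed, these three branches cover every
  // possible non-null FetchOutcome.
  static String describe(FetchOutcome outcome) {
    if (outcome instanceof Fetched f) {
      return "fetched with status " + f.httpStatus();
    }
    if (outcome instanceof Redirected r) {
      return "redirected to " + r.target();
    }
    if (outcome instanceof Failed x) {
      return "failed: " + x.reason();
    }
    throw new IllegalStateException("unexpected outcome: " + outcome);
  }
}
```

Sealing would add value mainly where a type is meant to have a fixed, known set of implementations. Extension points that plugins are expected to implement would presumably need to stay open.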
[jira] [Created] (NUTCH-3051) Investigate using new pattern matching syntax in switch expressions
Lewis John McGibbney created NUTCH-3051: --- Summary: Investigate using new pattern matching syntax in switch expressions Key: NUTCH-3051 URL: https://issues.apache.org/jira/browse/NUTCH-3051 Project: Nutch Issue Type: Sub-task Reporter: Lewis John McGibbney Guidance available at [https://www.baeldung.com/java-migrate-8-to-17#2-switch-expressions] Apparently we use switch in 35 files [https://github.com/search?q=repo%3Aapache%2Fnutch+switch+language%3AJava&type=code&l=Java]
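A minimal sketch of the arrow-form switch expression (final since JDK 14) is shown below. The `FetchState` enum and the labels are hypothetical and chosen only for illustration; they do not correspond to existing Nutch constants.

```java
// Hypothetical enum for illustration only; NOT a Nutch type.
enum FetchState { UNFETCHED, FETCHED, GONE, REDIR_TEMP, REDIR_PERM }

class SwitchExpressionSketch {

  // The switch is an expression that yields a value. There is no
  // fall-through, and several constants can share one branch.
  static String label(FetchState state) {
    return switch (state) {
      case UNFETCHED -> "pending";
      case FETCHED -> "done";
      case GONE -> "removed";
      case REDIR_TEMP, REDIR_PERM -> "redirect";
    };
  }
}
```

Because every enum constant is covered, no `default` branch is needed, and the compiler reports an error if a newly added constant is left unhandled. Note that type patterns in `case` labels (pattern matching for switch) are still a preview feature in JDK 17 and were only finalized in JDK 21, so on JDK 17 this ticket would in practice be limited to the arrow form shown here.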
[jira] [Created] (NUTCH-3050) Investigate use of the enhanced instanceof operator
Lewis John McGibbney created NUTCH-3050: --- Summary: Investigate use of the enhanced instanceof operator Key: NUTCH-3050 URL: https://issues.apache.org/jira/browse/NUTCH-3050 Project: Nutch Issue Type: Sub-task Reporter: Lewis John McGibbney Guidance at [https://www.baeldung.com/java-migrate-8-to-17#1-enhanced-instanceof-operator] Apparently we use instanceof operator in 50 files [https://github.com/search?q=repo%3Aapache%2Fnutch%20instanceof&type=code]
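For reference, the change in question is small and mechanical. The sketch below contrasts the classic test-then-cast idiom with the pattern-matching form (final since JDK 16). The class and method names are hypothetical.

```java
// Hypothetical helper class for illustration only.
class InstanceofSketch {

  // Classic style: type test followed by an explicit cast.
  static int lengthOld(Object value) {
    if (value instanceof String) {
      String s = (String) value;
      return s.length();
    }
    return -1;
  }

  // Enhanced instanceof: the binding variable 's' is declared by the
  // pattern and is in scope only where the test has succeeded.
  static int lengthNew(Object value) {
    if (value instanceof String s) {
      return s.length();
    }
    return -1;
  }
}
```

Both methods behave identically; the second simply removes the redundant cast and the extra local variable declaration.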
[jira] [Created] (NUTCH-3049) Investigate using Records
Lewis John McGibbney created NUTCH-3049: --- Summary: Investigate using Records Key: NUTCH-3049 URL: https://issues.apache.org/jira/browse/NUTCH-3049 Project: Nutch Issue Type: Sub-task Reporter: Lewis John McGibbney Guidance at [https://www.baeldung.com/java-migrate-8-to-17#records] I think there are multiple areas where we could use Records. This ticket will document the opportunities and structure that work.
[jira] [Created] (NUTCH-3048) Investigate where/if new string utility methods could be used
Lewis John McGibbney created NUTCH-3048: --- Summary: Investigate where/if new string utility methods could be used Key: NUTCH-3048 URL: https://issues.apache.org/jira/browse/NUTCH-3048 Project: Nutch Issue Type: Sub-task Components: util Reporter: Lewis John McGibbney Guidance at [https://www.baeldung.com/java-migrate-8-to-17#3-new-string-methods] We may also be able to revisit our usage of common-* libraries with the goal of using native methods from the JDK.
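A few of the newer `String` methods are sketched below as a starting point for that investigation. The helper class and method names are hypothetical. The JDK methods used (`isBlank`, `strip`, `lines`, `repeat`) were all added in Java 11.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper class for illustration only.
class StringMethodsSketch {

  // String.isBlank() covers the whitespace check; unlike typical
  // utility-library helpers, null must be handled explicitly.
  static boolean isNullOrBlank(String s) {
    return s == null || s.isBlank();
  }

  // lines() splits on line terminators, strip() trims Unicode whitespace.
  static List<String> nonBlankLines(String text) {
    return text.lines()
        .map(String::strip)
        .filter(line -> !line.isBlank())
        .collect(Collectors.toList());
  }

  // repeat() replaces hand-written loops or padding helpers.
  static String rule(int width) {
    return "-".repeat(width);
  }
}
```

The null handling is the main behavioural difference to watch for when replacing null-safe utility-library calls with the instance methods on `String`.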
[jira] [Created] (NUTCH-3047) Use multi-line text blocks
Lewis John McGibbney created NUTCH-3047: --- Summary: Use multi-line text blocks Key: NUTCH-3047 URL: https://issues.apache.org/jira/browse/NUTCH-3047 Project: Nutch Issue Type: Sub-task Components: CLI Reporter: Lewis John McGibbney Guidance available at [https://www.baeldung.com/java-migrate-8-to-17#2-text-block] This will help to cleanup our CLI *usage()* messages at a bare minimum.
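As a sketch of what a text-block based usage message could look like (text blocks are final since JDK 15), consider the following. The tool name and options are invented for illustration and do not reproduce any actual Nutch usage message.

```java
// Hypothetical class; the tool name and options below are made up.
class UsageTextBlockSketch {

  // The closing delimiter sits at the same indentation as the content, so
  // that common indentation is stripped and the text ends with a newline.
  static String usage() {
    return """
        Usage: ExampleTool <crawldb> [-force] [-threads <n>]
          -force        overwrite existing output
          -threads <n>  number of worker threads
        """;
  }
}
```

Compared with concatenated string literals and explicit `\n` escapes, the message appears in the source the way it is printed, which should make the usage text easier to review and keep aligned.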
[jira] [Updated] (NUTCH-3046) Use compact strings
[ https://issues.apache.org/jira/browse/NUTCH-3046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-3046: Description: Follow the guidance at [https://www.baeldung.com/java-migrate-8-to-17#1-compact-string] It looks like there are 9 instances where we use _*char[]*_ ([https://github.com/search?q=repo%3Aapache%2Fnutch%20char%5B%5D&type=code]). was: Follow the guidance at [https://www.baeldung.com/java-migrate-8-to-17#1-compact-string] It looks like there are [9 instances where we use char[]|https://github.com/search?q=repo%3Aapache%2Fnutch%20char%5B%5D&type=code]. > Use compact strings > --- > > Key: NUTCH-3046 > URL: https://issues.apache.org/jira/browse/NUTCH-3046 > Project: Nutch > Issue Type: Sub-task >Reporter: Lewis John McGibbney >Priority: Major > > Follow the guidance at > [https://www.baeldung.com/java-migrate-8-to-17#1-compact-string] > It looks like there are 9 instances where we use _*char[]*_ > ([https://github.com/search?q=repo%3Aapache%2Fnutch%20char%5B%5D&type=code]).
[jira] [Created] (NUTCH-3046) Use compact strings
Lewis John McGibbney created NUTCH-3046: --- Summary: Use compact strings Key: NUTCH-3046 URL: https://issues.apache.org/jira/browse/NUTCH-3046 Project: Nutch Issue Type: Sub-task Reporter: Lewis John McGibbney Follow the guidance at [https://www.baeldung.com/java-migrate-8-to-17#1-compact-string] It looks like there are [9 instances where we use char[]|https://github.com/search?q=repo%3Aapache%2Fnutch%20char%5B%5D&type=code].
[jira] [Created] (NUTCH-3045) Upgrade from Java 11 to 17
Lewis John McGibbney created NUTCH-3045: --- Summary: Upgrade from Java 11 to 17 Key: NUTCH-3045 URL: https://issues.apache.org/jira/browse/NUTCH-3045 Project: Nutch Issue Type: Task Components: build, ci/cd Reporter: Lewis John McGibbney Fix For: 1.21 This parent issue will track and organize work pertaining to upgrading Nutch to JDK 17. Premier support for Oracle JDK 11 ended 7 months ago (30 Sep 2023).
[jira] [Updated] (NUTCH-3042) Use GitHub cache action to improve CI execution time
[ https://issues.apache.org/jira/browse/NUTCH-3042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-3042: Description: With the Ant+Ivy build architecture, the current GitHub Actions workflow can and regularly does take over 20 minutes to complete. Dependency retrieval takes a significant amount of time. I think we can address the above issue and dramatically reduce the CI runtime by utilizing the official [GitHub cache action|https://github.com/actions/cache]. It appears however that the action does not support the Apache Ivy cache. Both Maven and Gradle are supported. I [created a discussion|https://github.com/actions/cache/discussions/1381] to get confirmation. In the case that we cannot implement a cache for the Ivy build system then we will need to come back to this issue once we migrate to Gradle. was: With the Ant+Ivy build architecture, the current GitHub Actions workflow can and regularly does take over 20 minutes to complete. Dependency retrieval takes a significant amount of time. I think we can address the above issue and dramatically reduce the CI runtime by utilizing the official [GitHub cache action|https://github.com/actions/cache]. It appears however that the action does not support the Apache Ivy cache. Both Maven and Gradle are supported. I created a discussion to get confirmation if this is the case. In the case that we cannot implement a cache for the Ivy build system then we will need to come back to this issue once we migrate to Gradle. > Use GitHub cache action to improve CI execution time > > > Key: NUTCH-3042 > URL: https://issues.apache.org/jira/browse/NUTCH-3042 > Project: Nutch > Issue Type: Task > Components: ci/cd >Reporter: Lewis John McGibbney >Assignee: Lewis John McGibbney >Priority: Major > Fix For: 1.21 > > > With the Ant+Ivy build architecture, the current GitHub Actions workflow can > and regularly does take over 20 minutes to complete. 
Dependency retrieval > takes a significant amount of time. > I think we can address the above issue and dramatically reduce the CI runtime > by utilizing the official [GitHub cache > action|https://github.com/actions/cache]. > It appears however that the action does not support the Apache Ivy cache. > Both Maven and Gradle are supported. I [created a > discussion|https://github.com/actions/cache/discussions/1381] to get > confirmation. > In the case that we cannot implement a cache for the Ivy build system then we > will need to come back to this issue once we migrate to Gradle.
[jira] [Created] (NUTCH-3042) Use GitHub cache action to improve CI execution time
Lewis John McGibbney created NUTCH-3042: --- Summary: Use GitHub cache action to improve CI execution time Key: NUTCH-3042 URL: https://issues.apache.org/jira/browse/NUTCH-3042 Project: Nutch Issue Type: Task Components: ci/cd Reporter: Lewis John McGibbney Assignee: Lewis John McGibbney Fix For: 1.21 With the Ant+Ivy build architecture, the current GitHub Actions workflow can, and regularly does, take over 20 minutes to complete. Dependency retrieval takes a significant amount of time. I think we can address the above issue and dramatically reduce the CI runtime by using the official [GitHub cache action|https://github.com/actions/cache]. It appears, however, that the action does not support the Apache Ivy cache. Both Maven and Gradle are supported. I created a discussion to get confirmation if this is the case. If we cannot implement a cache for the Ivy build system, we will need to come back to this issue once we migrate to Gradle. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Work started] (NUTCH-3041) Address confusing logging in o.a.n.net.URLExemptionFilters
[ https://issues.apache.org/jira/browse/NUTCH-3041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on NUTCH-3041 started by Lewis John McGibbney. --- > Address confusing logging in o.a.n.net.URLExemptionFilters > --- > > Key: NUTCH-3041 > URL: https://issues.apache.org/jira/browse/NUTCH-3041 > Project: Nutch > Issue Type: Task > Components: net >Affects Versions: 1.19, 1.20 >Reporter: Lewis John McGibbney >Assignee: Lewis John McGibbney >Priority: Minor > Fix For: 1.21 > > > URLExemptionFilter implementations are used to allow exemptions to external > domain resources by overriding the {{db.ignore.external.links}} configuration > setting. This is useful when the crawl is focused on a domain but resources > like images are hosted on a CDN. > Currently [URLExemptionFilters|https://github.com/apache/nutch/blob/271f92e11c39b7a3583cfcd8d664262cfac59674/src/java/org/apache/nutch/net/URLExemptionFilters.java#L47-L48] provides the following logging > {quote}INFO o.a.n.n.URLExemptionFilters [LocalJobRunner Map Task Executor > #0] Found 0 extensions at point:'org.apache.nutch.net.URLExemptionFilter' > {quote} > I find this confusing. It would be better to log *only* if a > URLExemptionFilter implementation is actually configured to be used at > runtime. > I will provide a patch for this. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (NUTCH-3041) Address confusing logging in o.a.n.net.URLExemptionFilters
[ https://issues.apache.org/jira/browse/NUTCH-3041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-3041: Description: URLExemptionFilter implementations are used to allow exemptions to external domain resources by overriding the {{db.ignore.external.links}} configuration setting. This is useful when the crawl is focused on a domain but resources like images are hosted on a CDN. Currently [URLExemptionFilters|https://github.com/apache/nutch/blob/271f92e11c39b7a3583cfcd8d664262cfac59674/src/java/org/apache/nutch/net/URLExemptionFilters.java#L47-L48] provides the following logging {quote}INFO o.a.n.n.URLExemptionFilters [LocalJobRunner Map Task Executor #0] Found 0 extensions at point:'org.apache.nutch.net.URLExemptionFilter' {quote} I find this confusing. It would be better to log *only* if a URLExemptionFilter implementation is actually configured to be used at runtime. I will provide a patch for this. was: URLExemptionFilter implementations are used to allow exemptions to external domain resources by overriding the {{db.ignore.external.links}} configuration setting. This is useful when the crawl is focused on a domain but resources like images are hosted on a CDN. Currently [URLExemptionFilters|https://github.com/apache/nutch/blob/271f92e11c39b7a3583cfcd8d664262cfac59674/src/java/org/apache/nutch/net/URLExemptionFilters.java#L47-L48] provides the following logging {quote}INFO o.a.n.n.URLExemptionFilters [LocalJobRunner Map Task Executor #0] Found 0 extensions at point:'org.apache.nutch.net.URLExemptionFilter' {quote} I find this confusing. It would be better to log *only* if a URLExemptionFilter implementation actually exists for a given URL. I will provide a patch for this. 
> Address confusing logging in o.a.n.net.URLExemptionFilters > --- > > Key: NUTCH-3041 > URL: https://issues.apache.org/jira/browse/NUTCH-3041 > Project: Nutch > Issue Type: Task > Components: net >Affects Versions: 1.19, 1.20 >Reporter: Lewis John McGibbney >Assignee: Lewis John McGibbney >Priority: Minor > Fix For: 1.21 > > > URLExemptionFilter implementations are used to allow exemptions to external > domain resources by overriding the {{db.ignore.external.links}} configuration > setting. 
This is useful when the crawl is focused on a domain but resources > like images are hosted on a CDN. > Currently [URLExemptionFilters|https://github.com/apache/nutch/blob/271f92e11c39b7a3583cfcd8d664262cfac59674/src/java/org/apache/nutch/net/URLExemptionFilters.java#L47-L48] provides the following logging > {quote}INFO o.a.n.n.URLExemptionFilters [LocalJobRunner Map Task Executor > #0] Found 0 extensions at point:'org.apache.nutch.net.URLExemptionFilter' > {quote} > I find this confusing. It would be better to log *only* if a > URLExemptionFilter implementation is actually configured to be used at > runtime. > I will provide a patch for this. -- This message was sent by Atlassian Jira (v8.20.10#820010)
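Editor's note on NUTCH-3041: the behaviour requested in the description (log *only* when a URLExemptionFilter implementation is configured) could be sketched roughly as below. This is a minimal, self-contained illustration and not the actual Nutch patch: the class {{ExemptionFilterLogging}}, the nested {{ExemptionFilter}} interface and the {{startupMessage}} helper are hypothetical names introduced here, and plain standard output stands in for the real logger.

```java
import java.util.List;
import java.util.Optional;

public class ExemptionFilterLogging {

  /** Hypothetical stand-in for the URLExemptionFilter extension point. */
  public interface ExemptionFilter {
    boolean filter(String fromUrl, String toUrl);
  }

  /**
   * Returns the message to log after the filters have been loaded, or an
   * empty Optional when no filter is configured, in which case the caller
   * logs nothing at all.
   */
  public static Optional<String> startupMessage(List<ExemptionFilter> filters) {
    if (filters.isEmpty()) {
      return Optional.empty();
    }
    return Optional.of("Found " + filters.size()
        + " extensions at point:'org.apache.nutch.net.URLExemptionFilter'");
  }

  public static void main(String[] args) {
    // No filter configured: nothing is printed.
    startupMessage(List.of()).ifPresent(System.out::println);

    // One hypothetical filter that exempts every link: the message is printed.
    ExemptionFilter exemptAll = (fromUrl, toUrl) -> true;
    startupMessage(List.of(exemptAll)).ifPresent(System.out::println);
  }
}
```

Returning an {{Optional}} rather than logging inside the helper keeps the decision ("is there anything worth reporting?") separate from the logging backend, which also makes the behaviour easy to assert in a unit test.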
[jira] [Updated] (NUTCH-3041) Address confusing logging in o.a.n.net.URLExemptionFilters
[ https://issues.apache.org/jira/browse/NUTCH-3041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-3041: Description: URLExemptionFilter implementations are used to allow exemptions to external domain resources by overriding the {{db.ignore.external.links}} configuration setting. This is useful when the crawl is focused on a domain but resources like images are hosted on a CDN. Currently [URLExemptionFilters|https://github.com/apache/nutch/blob/271f92e11c39b7a3583cfcd8d664262cfac59674/src/java/org/apache/nutch/net/URLExemptionFilters.java#L47-L48] provides the following logging {quote}INFO o.a.n.n.URLExemptionFilters [LocalJobRunner Map Task Executor #0] Found 0 extensions at point:'org.apache.nutch.net.URLExemptionFilter' {quote} I find this confusing. It would be better to log *only* if a URLExemptionFilter implementation actually exists for a given URL. I will provide a patch for this. was: URLExemptionFilter implementations are used to allow exemptions to external domain resources by overriding the {{db.ignore.external.links}} configuration setting. This is useful when the crawl is focused on a domain but resources like images are hosted on a CDN. Currently [URLExemptionFilters|https://github.com/apache/nutch/blob/271f92e11c39b7a3583cfcd8d664262cfac59674/src/java/org/apache/nutch/net/URLExemptionFilters.java#L47-L48] provides some confusing INFO-level logging {quote}INFO o.a.n.n.URLExemptionFilters [LocalJobRunner Map Task Executor #0] Found 0 extensions at point:'org.apache.nutch.net.URLExemptionFilter' {quote} I find this confusing. It would be better to log *only* if a URLExemptionFilter implementation actually exists for a given URL. I will provide a patch for this. 
> Address confusing logging in o.a.n.net.URLExemptionFilters > --- > > Key: NUTCH-3041 > URL: https://issues.apache.org/jira/browse/NUTCH-3041 > Project: Nutch > Issue Type: Task > Components: net >Affects Versions: 1.19, 1.20 >Reporter: Lewis John McGibbney >Assignee: Lewis John McGibbney >Priority: Minor > Fix For: 1.21 > > > URLExemptionFilter implementations are used to allow exemptions to external > domain resources by overriding the {{db.ignore.external.links}} configuration > setting. This is useful when the crawl is focused on a domain but resources > like images are hosted on a CDN. > Currently [URLExemptionFilters|https://github.com/apache/nutch/blob/271f92e11c39b7a3583cfcd8d664262cfac59674/src/java/org/apache/nutch/net/URLExemptionFilters.java#L47-L48] provides the following logging > {quote}INFO o.a.n.n.URLExemptionFilters [LocalJobRunner Map Task Executor > #0] Found 0 extensions at point:'org.apache.nutch.net.URLExemptionFilter' > {quote} > I find this confusing. It would be better to log *only* if a > URLExemptionFilter implementation actually exists for a given URL. > I will provide a patch for this. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (NUTCH-3041) Address confusing logging in o.a.n.net.URLExemptionFilters
Lewis John McGibbney created NUTCH-3041: --- Summary: Address confusing logging in o.a.n.net.URLExemptionFilters Key: NUTCH-3041 URL: https://issues.apache.org/jira/browse/NUTCH-3041 Project: Nutch Issue Type: Task Components: net Affects Versions: 1.19, 1.20 Reporter: Lewis John McGibbney Assignee: Lewis John McGibbney Fix For: 1.21 URLExemptionFilter implementations are used to allow exemptions to external domain resources by overriding the {{db.ignore.external.links}} configuration setting. This is useful when the crawl is focused on a domain but resources like images are hosted on a CDN. Currently [URLExemptionFilters|https://github.com/apache/nutch/blob/271f92e11c39b7a3583cfcd8d664262cfac59674/src/java/org/apache/nutch/net/URLExemptionFilters.java#L47-L48] provides some confusing INFO-level logging {quote}INFO o.a.n.n.URLExemptionFilters [LocalJobRunner Map Task Executor #0] Found 0 extensions at point:'org.apache.nutch.net.URLExemptionFilter' {quote} I find this confusing. It would be better to log *only* if a URLExemptionFilter implementation actually exists for a given URL. I will provide a patch for this. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (NUTCH-3038) Address issues discovered during 1.20 release management dryrun
[ https://issues.apache.org/jira/browse/NUTCH-3038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney resolved NUTCH-3038. - Resolution: Fixed > Address issues discovered during 1.20 release management dryrun > --- > > Key: NUTCH-3038 > URL: https://issues.apache.org/jira/browse/NUTCH-3038 > Project: Nutch > Issue Type: Task > Components: build, docker >Affects Versions: 1.20 >Reporter: Lewis John McGibbney >Assignee: Lewis John McGibbney >Priority: Blocker > Fix For: 1.20 > > > During the 1.20 release management dryrun I discovered the following issues > which I think should be addressed in order to be satisfied with the release > candidate > # Update docker/README to remove broken badge > # Upgrade alpine base image in docker/Dockerfile > # Migrate CHANGES.txt to CHANGES.md > # Upgrade apache parent pom version from 23 to 31 > # Upgrade maven-gpg-plugin dependency from 1.6 to 3.2.2 in build.xml > # Upgrade maven-compiler-plugin version from 3.8.1 to 3.13.0 in > ivy/mvn.template > # Remove miredot plugin usage from ivy/mvn.template -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Closed] (NUTCH-3038) Address issues discovered during 1.20 release management dryrun
[ https://issues.apache.org/jira/browse/NUTCH-3038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney closed NUTCH-3038. --- Thanks [~snagel] > Address issues discovered during 1.20 release management dryrun > --- > > Key: NUTCH-3038 > URL: https://issues.apache.org/jira/browse/NUTCH-3038 > Project: Nutch > Issue Type: Task > Components: build, docker >Affects Versions: 1.20 >Reporter: Lewis John McGibbney >Assignee: Lewis John McGibbney >Priority: Blocker > Fix For: 1.20 > > > During the 1.20 release management dryrun I discovered the following issues > which I think should be addressed in order to be satisfied with the release > candidate > # Update docker/README to remove broken badge > # Upgrade alpine base image in docker/Dockerfile > # Migrate CHANGES.txt to CHANGES.md > # Upgrade apache parent pom version from 23 to 31 > # Upgrade maven-gpg-plugin dependency from 1.6 to 3.2.2 in build.xml > # Upgrade maven-compiler-plugin version from 3.8.1 to 3.13.0 in > ivy/mvn.template > # Remove miredot plugin usage from ivy/mvn.template -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Work stopped] (NUTCH-3038) Address issues discovered during 1.20 release management dryrun
[ https://issues.apache.org/jira/browse/NUTCH-3038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on NUTCH-3038 stopped by Lewis John McGibbney. --- > Address issues discovered during 1.20 release management dryrun > --- > > Key: NUTCH-3038 > URL: https://issues.apache.org/jira/browse/NUTCH-3038 > Project: Nutch > Issue Type: Task > Components: build, docker >Affects Versions: 1.20 >Reporter: Lewis John McGibbney >Assignee: Lewis John McGibbney >Priority: Blocker > Fix For: 1.20 > > > During the 1.20 release management dryrun I discovered the following issues > which I think should be addressed in order to be satisfied with the release > candidate > # Update docker/README to remove broken badge > # Upgrade alpine base image in docker/Dockerfile > # Migrate CHANGES.txt to CHANGES.md > # Upgrade apache parent pom version from 23 to 31 > # Upgrade maven-gpg-plugin dependency from 1.6 to 3.2.2 in build.xml > # Upgrade maven-compiler-plugin version from 3.8.1 to 3.13.0 in > ivy/mvn.template > # Remove miredot plugin usage from ivy/mvn.template -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Work started] (NUTCH-3038) Address issues discovered during 1.20 release management dryrun
[ https://issues.apache.org/jira/browse/NUTCH-3038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on NUTCH-3038 started by Lewis John McGibbney. --- > Address issues discovered during 1.20 release management dryrun > --- > > Key: NUTCH-3038 > URL: https://issues.apache.org/jira/browse/NUTCH-3038 > Project: Nutch > Issue Type: Task > Components: build, docker >Affects Versions: 1.20 >Reporter: Lewis John McGibbney >Assignee: Lewis John McGibbney >Priority: Blocker > Fix For: 1.20 > > > During the 1.20 release management dryrun I discovered the following issues > which I think should be addressed in order to be satisfied with the release > candidate > # Update docker/README to remove broken badge > # Upgrade alpine base image in docker/Dockerfile > # Migrate CHANGES.txt to CHANGES.md > # Upgrade apache parent pom version from 23 to 31 > # Upgrade maven-gpg-plugin dependency from 1.6 to 3.2.2 in build.xml > # Upgrade maven-compiler-plugin version from 3.8.1 to 3.13.0 in > ivy/mvn.template > # Remove miredot plugin usage from ivy/mvn.template -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (NUTCH-3038) Address issues discovered during 1.20 release management dryrun
[ https://issues.apache.org/jira/browse/NUTCH-3038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-3038: Description: During the 1.20 release management dryrun I discovered the following issues which I think should be addressed in order to be satisfied with the release candidate # Update docker/README to remove broken badge # Upgrade alpine base image in docker/Dockerfile # Migrate CHANGES.txt to CHANGES.md # Upgrade apache parent pom version from 23 to 31 # Upgrade maven-gpg-plugin dependency from 1.6 to 3.2.2 in build.xml # Upgrade maven-compiler-plugin version from 3.8.1 to 3.13.0 in ivy/mvn.template # Remove miredot plugin usage from ivy/mvn.template was: During the 1.20 release management dryrun I discovered the following issues which I think should be addressed in order to be satisfied with the release candidate # Update docker/README to remove broken badge # Upgrade alpine base image in docker/Dockerfile # Migrate CHANGES.txt to CHANGES.md # Upgrade maven-gpg-plugin dependency from 1.6 to 3.2.2 in build.xml # Upgrade maven-compiler-plugin version from 3.8.1 to 3.13.0 in ivy/mvn.template # Remove miredot plugin usage from ivy/mvn.template > Address issues discovered during 1.20 release management dryrun > --- > > Key: NUTCH-3038 > URL: https://issues.apache.org/jira/browse/NUTCH-3038 > Project: Nutch > Issue Type: Task > Components: build, docker >Affects Versions: 1.20 >Reporter: Lewis John McGibbney >Assignee: Lewis John McGibbney >Priority: Blocker > Fix For: 1.20 > > > During the 1.20 release management dryrun I discovered the following issues > which I think should be addressed in order to be satisfied with the release > candidate > # Update docker/README to remove broken badge > # Upgrade alpine base image in docker/Dockerfile > # Migrate CHANGES.txt to CHANGES.md > # Upgrade apache parent pom version from 23 to 31 > # Upgrade maven-gpg-plugin dependency from 1.6 to 3.2.2 in build.xml > # Upgrade 
maven-compiler-plugin version from 3.8.1 to 3.13.0 in > ivy/mvn.template > # Remove miredot plugin usage from ivy/mvn.template -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (NUTCH-3038) Address issues discovered during 1.20 release management dryrun
Lewis John McGibbney created NUTCH-3038: --- Summary: Address issues discovered during 1.20 release management dryrun Key: NUTCH-3038 URL: https://issues.apache.org/jira/browse/NUTCH-3038 Project: Nutch Issue Type: Task Components: build, docker Affects Versions: 1.20 Reporter: Lewis John McGibbney Assignee: Lewis John McGibbney Fix For: 1.20 During the 1.20 release management dryrun I discovered the following issues which I think should be addressed in order to be satisfied with the release candidate # Update docker/README to remove broken badge # Upgrade alpine base image in docker/Dockerfile # Migrate CHANGES.txt to CHANGES.md # Upgrade maven-gpg-plugin dependency from 1.6 to 3.2.2 in build.xml # Upgrade maven-compiler-plugin version from 3.8.1 to 3.13.0 in ivy/mvn.template # Remove miredot plugin usage from ivy/mvn.template -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Closed] (NUTCH-3032) Indexing plugin as an adapter for end user's own POJO instances
[ https://issues.apache.org/jira/browse/NUTCH-3032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney closed NUTCH-3032. --- Thanks [~jglvary] and congratulations on your first contribution to Apache Nutch :) > Indexing plugin as an adapter for end user's own POJO instances > --- > > Key: NUTCH-3032 > URL: https://issues.apache.org/jira/browse/NUTCH-3032 > Project: Nutch > Issue Type: Improvement > Components: indexer >Reporter: Joe Gilvary >Assignee: Joe Gilvary >Priority: Major > Labels: indexing > Fix For: 1.20 > > Attachments: NUTCH-3032.patch > > > It could be helpful to let end users manipulate information at indexing time > with their own code without the need for writing their own indexing plugin. I > mentioned this on the dev mailing list > (https://www.mail-archive.com/dev@nutch.apache.org/msg31190.html) with some > description of my work in progress. > One potential use is to address some of the same concerns that NUTCH-585 > discusses regarding an alternative approach to picking and choosing which > content to index, but this approach would allow making index time decisions, > rather than setting the configuration for all content at the start of the > indexing run. > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (NUTCH-3032) Indexing plugin as an adapter for end user's own POJO instances
[ https://issues.apache.org/jira/browse/NUTCH-3032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-3032: Fix Version/s: 1.20 > Indexing plugin as an adapter for end user's own POJO instances > --- > > Key: NUTCH-3032 > URL: https://issues.apache.org/jira/browse/NUTCH-3032 > Project: Nutch > Issue Type: Improvement > Components: indexer >Reporter: Joe Gilvary >Assignee: Joe Gilvary >Priority: Major > Labels: indexing > Fix For: 1.20 > > Attachments: NUTCH-3032.patch > > > It could be helpful to let end users manipulate information at indexing time > with their own code without the need for writing their own indexing plugin. I > mentioned this on the dev mailing list > (https://www.mail-archive.com/dev@nutch.apache.org/msg31190.html) with some > description of my work in progress. > One potential use is to address some of the same concerns that NUTCH-585 > discusses regarding an alternative approach to picking and choosing which > content to index, but this approach would allow making index time decisions, > rather than setting the configuration for all content at the start of the > indexing run. > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (NUTCH-3032) Indexing plugin as an adapter for end user's own POJO instances
[ https://issues.apache.org/jira/browse/NUTCH-3032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney reassigned NUTCH-3032: --- Assignee: Joe Gilvary > Indexing plugin as an adapter for end user's own POJO instances > --- > > Key: NUTCH-3032 > URL: https://issues.apache.org/jira/browse/NUTCH-3032 > Project: Nutch > Issue Type: Improvement > Components: indexer >Reporter: Joe Gilvary >Assignee: Joe Gilvary >Priority: Major > Labels: indexing > Attachments: NUTCH-3032.patch > > > It could be helpful to let end users manipulate information at indexing time > with their own code without the need for writing their own indexing plugin. I > mentioned this on the dev mailing list > (https://www.mail-archive.com/dev@nutch.apache.org/msg31190.html) with some > description of my work in progress. > One potential use is to address some of the same concerns that NUTCH-585 > discusses regarding an alternative approach to picking and choosing which > content to index, but this approach would allow making index time decisions, > rather than setting the configuration for all content at the start of the > indexing run. > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Work stopped] (NUTCH-2856) Implement a protocol-smb plugin based on hierynomus/smbj
[ https://issues.apache.org/jira/browse/NUTCH-2856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on NUTCH-2856 stopped by Lewis John McGibbney. --- > Implement a protocol-smb plugin based on hierynomus/smbj > > > Key: NUTCH-2856 > URL: https://issues.apache.org/jira/browse/NUTCH-2856 > Project: Nutch > Issue Type: New Feature > Components: external, plugin, protocol >Reporter: Hiran Chaudhuri >Assignee: Lewis John McGibbney >Priority: Major > Fix For: 1.20 > > > The plugin protocol-smb advertised on > [https://cwiki.apache.org/confluence/display/NUTCH/PluginCentral] actually > refers to the JCIFS library. According to this library's homepage > [https://www.jcifs.org/]: > _If you're looking for the latest and greatest open source Java SMB library, > this is not it. JCIFS has been in maintenance-mode-only for several years and > although what it does support works fine (SMB1, NTLMv2, midlc, MSRPC and > various utility classes), jCIFS does not support the newer SMB2/3 variants of > the SMB protocol which is slowly becoming required (Windows 10 requires > SMB2/3). JCIFS only supports SMB1 but Microsoft has deprecated SMB1 in their > products. *So if SMB1 is disabled on your network, JCIFS' file related > operations will NOT work.*_ > Looking at > [https://en.wikipedia.org/wiki/Server_Message_Block#SMB_/_CIFS_/_SMB1]: > _Microsoft added SMB1 to the Windows Server 2012 R2 deprecation list in June > 2013. Windows Server 2016 and some versions of Windows 10 Fall Creators > Update do not have SMB1 installed by default._ > In conclusion, the chances that the SMB1 protocol is installed and/or > configured are steadily shrinking. Therefore some migration towards > SMB2/3 is required. 
Luckily the JCIFS homepage lists alternatives: > * [jcifs-codelibs|https://github.com/codelibs/jcifs] > * [jcifs-ng|https://github.com/AgNO3/jcifs-ng] > * [smbj|https://github.com/hierynomus/smbj] -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Work stopped] (NUTCH-2887) Migrate to JUnit 5 Jupiter
[ https://issues.apache.org/jira/browse/NUTCH-2887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on NUTCH-2887 stopped by Lewis John McGibbney. --- > Migrate to JUnit 5 Jupiter > -- > > Key: NUTCH-2887 > URL: https://issues.apache.org/jira/browse/NUTCH-2887 > Project: Nutch > Issue Type: Improvement > Components: test > Environment: Migrate >Reporter: Lewis John McGibbney >Assignee: Lewis John McGibbney >Priority: Major > Fix For: 1.20 > > > This effort is a bit of a beast. See the [JUnit migration > tips|https://junit.org/junit5/docs/current/user-guide/#migrating-from-junit4-tips] > for general guidance. A general grep for junit in src produces the following > {code:bash} > ./test/nutch-site.xml > ./test/org/apache/nutch/tools/TestCommonCrawlDataDumper.java > ./test/org/apache/nutch/net/TestURLNormalizers.java > ./test/org/apache/nutch/net/protocols/TestHttpDateFormat.java > ./test/org/apache/nutch/net/TestURLFilters.java > ./test/org/apache/nutch/util/TestStringUtil.java > ./test/org/apache/nutch/util/TestSuffixStringMatcher.java > ./test/org/apache/nutch/util/TestEncodingDetector.java > ./test/org/apache/nutch/util/TestMimeUtil.java > ./test/org/apache/nutch/util/TestPrefixStringMatcher.java > ./test/org/apache/nutch/util/DumpFileUtilTest.java > ./test/org/apache/nutch/util/TestNodeWalker.java > ./test/org/apache/nutch/util/WritableTestUtils.java > ./test/org/apache/nutch/util/TestTableUtil.java > ./test/org/apache/nutch/util/TestURLUtil.java > ./test/org/apache/nutch/util/TestGZIPUtils.java > ./test/org/apache/nutch/parse/TestParseText.java > ./test/org/apache/nutch/parse/TestOutlinks.java > ./test/org/apache/nutch/parse/TestParseData.java > ./test/org/apache/nutch/parse/TestOutlinkExtractor.java > ./test/org/apache/nutch/parse/TestParserFactory.java > ./test/org/apache/nutch/segment/TestSegmentMerger.java > ./test/org/apache/nutch/segment/TestSegmentMergerCrawlDatums.java > ./test/org/apache/nutch/plugin/TestPluginSystem.java > 
./test/org/apache/nutch/fetcher/TestFetcher.java > ./test/org/apache/nutch/protocol/TestProtocolFactory.java > ./test/org/apache/nutch/protocol/TestContent.java > ./test/org/apache/nutch/protocol/AbstractHttpProtocolPluginTest.java > ./test/org/apache/nutch/crawl/TestCrawlDbFilter.java > ./test/org/apache/nutch/crawl/TestTextProfileSignature.java > ./test/org/apache/nutch/crawl/TestCrawlDbStates.java > ./test/org/apache/nutch/crawl/TestGenerator.java > ./test/org/apache/nutch/crawl/TestAdaptiveFetchSchedule.java > ./test/org/apache/nutch/crawl/TODOTestCrawlDbStates.java > ./test/org/apache/nutch/crawl/TestSignatureFactory.java > ./test/org/apache/nutch/crawl/ContinuousCrawlTestUtil.java > ./test/org/apache/nutch/crawl/TestInjector.java > ./test/org/apache/nutch/crawl/TestLinkDbMerger.java > ./test/org/apache/nutch/crawl/TestCrawlDbMerger.java > ./test/org/apache/nutch/service/TestNutchServer.java > ./test/org/apache/nutch/metadata/TestMetadata.java > ./test/org/apache/nutch/metadata/TestSpellCheckedMetadata.java > ./test/org/apache/nutch/indexer/TestIndexingFilters.java > ./test/org/apache/nutch/indexer/TestIndexerMapReduce.java > ./bin/nutch > ./plugin/scoring-orphan/src/test/org/apache/nutch/scoring/orphan/TestOrphanScoringFilter.java > ./plugin/index-basic/src/test/org/apache/nutch/indexer/basic/TestBasicIndexingFilter.java > ./plugin/urlfilter-domaindenylist/build.xml > ./plugin/urlfilter-domaindenylist/src/test/org/apache/nutch/urlfilter/domaindenylist/TestDomainDenylistURLFilter.java > ./plugin/protocol-imaps/plugin.xml > ./plugin/protocol-imaps/ivy.xml > ./plugin/protocol-imaps/lib/junit-4.13.jar > ./plugin/protocol-imaps/lib/greenmail-junit4-1.6.0.jar > ./plugin/protocol-imaps/lib/greenmail-1.6.0.jar > ./plugin/protocol-imaps/src/test/org/apache/nutch/protocol/imaps/TestImaps.java > ./plugin/protocol-file/build.xml > ./plugin/protocol-file/src/test/org/apache/nutch/protocol/file/TestProtocolFile.java > ./plugin/urlnormalizer-regex/build.xml > 
./plugin/urlnormalizer-regex/src/test/org/apache/nutch/net/urlnormalizer/regex/TestRegexURLNormalizer.java > ./plugin/build-plugin.xml > ./plugin/creativecommons/src/test/org/creativecommons/nutch/TestCCParseFilter.java > ./plugin/urlnormalizer-basic/src/test/org/apache/nutch/net/urlnormalizer/basic/TestBasicURLNormalizer.java > ./plugin/urlnormalizer-protocol/build.xml > ./plugin/urlnormalizer-protocol/src/test/org/apache/nutch/net/urlnormalizer/protocol/TestProtocolURLNormalizer.java > ./plugin/urlfilter-prefix/src/test/org/apache/nutch/urlfilter/prefix/TestPrefixURLFilter.java > ./plugin/urlfilter-suffix/src/test/org/apache/nutch/urlfilter/suffix/TestSuffixURLFilter.java > ./plugin/index-more/src/test/org/apache/nutch/indexer/more/TestMoreIndexingFilter.java > ./plugin/parse-ext/src/test/org/apache/nutc
[jira] [Closed] (NUTCH-2832) Create tutorial on sending Nutch logs to Elasticsearch
[ https://issues.apache.org/jira/browse/NUTCH-2832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney closed NUTCH-2832. --- > Create tutorial on sending Nutch logs to Elasticsearch > -- > > Key: NUTCH-2832 > URL: https://issues.apache.org/jira/browse/NUTCH-2832 > Project: Nutch > Issue Type: New Feature > Components: configuration, deployment >Reporter: Lewis John McGibbney >Assignee: Lewis John McGibbney >Priority: Major > Fix For: 1.20 > > > A while back I used to use [Chukwa|https://chukwa.apache.org/] for log > aggregation and analysis. Chukwa is now retired. > I did a bit of research into logging directly from Log4j2 into Elasticsearch and came > across > [log4j2-elasticsearch|https://github.com/rfoltyns/log4j2-elasticsearch] which > looks pretty simple. > I'm going to have a crack at implementing this functionality as a > configuration option. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (NUTCH-2832) Create tutorial on sending Nutch logs to Elasticsearch
[ https://issues.apache.org/jira/browse/NUTCH-2832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney resolved NUTCH-2832. - Resolution: Won't Fix Given the license changes regarding the concerned backend, I have no interest in implementing this anymore. > Create tutorial on sending Nutch logs to Elasticsearch > -- > > Key: NUTCH-2832 > URL: https://issues.apache.org/jira/browse/NUTCH-2832 > Project: Nutch > Issue Type: New Feature > Components: configuration, deployment >Reporter: Lewis John McGibbney >Assignee: Lewis John McGibbney >Priority: Major > Fix For: 1.20 > > > A while back I used to use [Chukwa|https://chukwa.apache.org/] for log > aggregation and analysis. Chukwa is now retired. > I did a bit of research into logging directly from Log4j2 into Elasticsearch and came > across > [log4j2-elasticsearch|https://github.com/rfoltyns/log4j2-elasticsearch] which > looks pretty simple. > I'm going to have a crack at implementing this functionality as a > configuration option. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (NUTCH-3036) Upgrade org.seleniumhq.selenium:selenium-java dependency in lib-selenium
[ https://issues.apache.org/jira/browse/NUTCH-3036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney resolved NUTCH-3036. - Resolution: Fixed > Upgrade org.seleniumhq.selenium:selenium-java dependency in lib-selenium > > > Key: NUTCH-3036 > URL: https://issues.apache.org/jira/browse/NUTCH-3036 > Project: Nutch > Issue Type: Improvement > Components: plugin, selenium >Reporter: Lewis John McGibbney >Assignee: Lewis John McGibbney >Priority: Major > Fix For: 1.20 > > > lib-selenium currently packages org.seleniumhq.selenium:selenium-java > *v4.7.2* but *v4.18.1* is available on Maven Central. > This ticket will upgrade the java dependency and validate that both > protocol-selenium and protocol-interactiveselenium work as expected in local > mode and via selenium grid. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Closed] (NUTCH-3036) Upgrade org.seleniumhq.selenium:selenium-java dependency in lib-selenium
[ https://issues.apache.org/jira/browse/NUTCH-3036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney closed NUTCH-3036. --- > Upgrade org.seleniumhq.selenium:selenium-java dependency in lib-selenium > > > Key: NUTCH-3036 > URL: https://issues.apache.org/jira/browse/NUTCH-3036 > Project: Nutch > Issue Type: Improvement > Components: plugin, selenium >Reporter: Lewis John McGibbney >Assignee: Lewis John McGibbney >Priority: Major > Fix For: 1.20 > > > lib-selenium currently packages org.seleniumhq.selenium:selenium-java > *v4.7.2* but *v4.18.1* is available on Maven Central. > This ticket will upgrade the java dependency and validate that both > protocol-selenium and protocol-interactiveselenium work as expected in local > mode and via selenium grid. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Closed] (NUTCH-3035) Update license and notice file for release of 1.20
[ https://issues.apache.org/jira/browse/NUTCH-3035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney closed NUTCH-3035. --- > Update license and notice file for release of 1.20 > --- > > Key: NUTCH-3035 > URL: https://issues.apache.org/jira/browse/NUTCH-3035 > Project: Nutch > Issue Type: Bug > Components: documentation >Affects Versions: 1.20 >Reporter: Sebastian Nagel >Assignee: Sebastian Nagel >Priority: Major > Fix For: 1.20 > > > Close to the release of 1.20 the license and notice files should be updated > to contain all (third-party) licenses of all dependencies. Cf. NUTCH-2290 and > NUTCH-2981. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (NUTCH-3035) Update license and notice file for release of 1.20
[ https://issues.apache.org/jira/browse/NUTCH-3035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney resolved NUTCH-3035. - Resolution: Fixed > Update license and notice file for release of 1.20 > --- > > Key: NUTCH-3035 > URL: https://issues.apache.org/jira/browse/NUTCH-3035 > Project: Nutch > Issue Type: Bug > Components: documentation >Affects Versions: 1.20 >Reporter: Sebastian Nagel >Assignee: Sebastian Nagel >Priority: Major > Fix For: 1.20 > > > Close to the release of 1.20 the license and notice files should be updated > to contain all (third-party) licenses of all dependencies. Cf. NUTCH-2290 and > NUTCH-2981. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (NUTCH-3037) Upgrade org.apache.kafka:kafka_2.12: to v3.7.0
[ https://issues.apache.org/jira/browse/NUTCH-3037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney resolved NUTCH-3037. - Resolution: Fixed > Upgrade org.apache.kafka:kafka_2.12: to v3.7.0 > -- > > Key: NUTCH-3037 > URL: https://issues.apache.org/jira/browse/NUTCH-3037 > Project: Nutch > Issue Type: Task > Components: indexer-kafka >Reporter: Lewis John McGibbney >Assignee: Lewis John McGibbney >Priority: Major > Fix For: 1.20 > > > We depend on v1.1.0 which is quite a bit behind the current v3.7.0 artifact, > I therefore propose to upgrade. > I will also state that a _*kafka_2.13*_ artifact exists. This would demand > that the underlying Scala version be also upgraded... but I think this should > be addressed in a separate ticket. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Closed] (NUTCH-3037) Upgrade org.apache.kafka:kafka_2.12: to v3.7.0
[ https://issues.apache.org/jira/browse/NUTCH-3037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney closed NUTCH-3037. --- > Upgrade org.apache.kafka:kafka_2.12: to v3.7.0 > -- > > Key: NUTCH-3037 > URL: https://issues.apache.org/jira/browse/NUTCH-3037 > Project: Nutch > Issue Type: Task > Components: indexer-kafka >Reporter: Lewis John McGibbney >Assignee: Lewis John McGibbney >Priority: Major > Fix For: 1.20 > > > We depend on v1.1.0 which is quite a bit behind the current v3.7.0 artifact, > I therefore propose to upgrade. > I will also state that a _*kafka_2.13*_ artifact exists. This would demand > that the underlying Scala version be also upgraded... but I think this should > be addressed in a separate ticket. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Work stopped] (NUTCH-3037) Upgrade org.apache.kafka:kafka_2.12: to v3.7.0
[ https://issues.apache.org/jira/browse/NUTCH-3037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on NUTCH-3037 stopped by Lewis John McGibbney. --- > Upgrade org.apache.kafka:kafka_2.12: to v3.7.0 > -- > > Key: NUTCH-3037 > URL: https://issues.apache.org/jira/browse/NUTCH-3037 > Project: Nutch > Issue Type: Task > Components: indexer-kafka >Reporter: Lewis John McGibbney >Assignee: Lewis John McGibbney >Priority: Major > Fix For: 1.20 > > > We depend on v1.1.0 which is quite a bit behind the current v3.7.0 artifact, > I therefore propose to upgrade. > I will also state that a _*kafka_2.13*_ artifact exists. This would demand > that the underlying Scala version be also upgraded... but I think this should > be addressed in a separate ticket. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (NUTCH-3037) Upgrade org.apache.kafka:kafka_2.12: to v3.7.0
[ https://issues.apache.org/jira/browse/NUTCH-3037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-3037: Flags: Patch > Upgrade org.apache.kafka:kafka_2.12: to v3.7.0 > -- > > Key: NUTCH-3037 > URL: https://issues.apache.org/jira/browse/NUTCH-3037 > Project: Nutch > Issue Type: Task > Components: indexer-kafka >Reporter: Lewis John McGibbney >Assignee: Lewis John McGibbney >Priority: Major > Fix For: 1.20 > > > We depend on v1.1.0 which is quite a bit behind the current v3.7.0 artifact, > I therefore propose to upgrade. > I will also state that a _*kafka_2.13*_ artifact exists. This would demand > that the underlying Scala version be also upgraded... but I think this should > be addressed in a separate ticket. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Work started] (NUTCH-3037) Upgrade org.apache.kafka:kafka_2.12: to v3.7.0
[ https://issues.apache.org/jira/browse/NUTCH-3037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on NUTCH-3037 started by Lewis John McGibbney. --- > Upgrade org.apache.kafka:kafka_2.12: to v3.7.0 > -- > > Key: NUTCH-3037 > URL: https://issues.apache.org/jira/browse/NUTCH-3037 > Project: Nutch > Issue Type: Task > Components: indexer-kafka >Reporter: Lewis John McGibbney >Assignee: Lewis John McGibbney >Priority: Major > Fix For: 1.20 > > > We depend on v1.1.0 which is quite a bit behind the current v3.7.0 artifact, > I therefore propose to upgrade. > I will also state that a _*kafka_2.13*_ artifact exists. This would demand > that the underlying Scala version be also upgraded... but I think this should > be addressed in a separate ticket. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (NUTCH-3037) Upgrade org.apache.kafka:kafka_2.12: to v3.7.0
Lewis John McGibbney created NUTCH-3037: --- Summary: Upgrade org.apache.kafka:kafka_2.12: to v3.7.0 Key: NUTCH-3037 URL: https://issues.apache.org/jira/browse/NUTCH-3037 Project: Nutch Issue Type: Task Components: indexer-kafka Reporter: Lewis John McGibbney Assignee: Lewis John McGibbney Fix For: 1.20 We depend on v1.1.0 which is quite a bit behind the current v3.7.0 artifact, I therefore propose to upgrade. I will also state that a _*kafka_2.13*_ artifact exists. This would demand that the underlying Scala version be also upgraded... but I think this should be addressed in a separate ticket. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Work stopped] (NUTCH-3036) Upgrade org.seleniumhq.selenium:selenium-java dependency in lib-selenium
[ https://issues.apache.org/jira/browse/NUTCH-3036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on NUTCH-3036 stopped by Lewis John McGibbney. --- > Upgrade org.seleniumhq.selenium:selenium-java dependency in lib-selenium > > > Key: NUTCH-3036 > URL: https://issues.apache.org/jira/browse/NUTCH-3036 > Project: Nutch > Issue Type: Improvement > Components: plugin, selenium >Reporter: Lewis John McGibbney >Assignee: Lewis John McGibbney >Priority: Major > Fix For: 1.20 > > > lib-selenium currently packages org.seleniumhq.selenium:selenium-java > *v4.7.2* but *v4.18.1* is available on Maven Central. > This ticket will upgrade the java dependency and validate that both > protocol-selenium and protocol-interactiveselenium work as expected in local > mode and via selenium grid. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Work started] (NUTCH-3036) Upgrade org.seleniumhq.selenium:selenium-java dependency in lib-selenium
[ https://issues.apache.org/jira/browse/NUTCH-3036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on NUTCH-3036 started by Lewis John McGibbney. --- > Upgrade org.seleniumhq.selenium:selenium-java dependency in lib-selenium > > > Key: NUTCH-3036 > URL: https://issues.apache.org/jira/browse/NUTCH-3036 > Project: Nutch > Issue Type: Improvement > Components: plugin, selenium >Reporter: Lewis John McGibbney >Assignee: Lewis John McGibbney >Priority: Major > Fix For: 1.20 > > > lib-selenium currently packages org.seleniumhq.selenium:selenium-java > *v4.7.2* but *v4.18.1* is available on Maven Central. > This ticket will upgrade the java dependency and validate that both > protocol-selenium and protocol-interactiveselenium work as expected in local > mode and via selenium grid. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (NUTCH-3036) Upgrade org.seleniumhq.selenium:selenium-java dependency in lib-selenium
Lewis John McGibbney created NUTCH-3036: --- Summary: Upgrade org.seleniumhq.selenium:selenium-java dependency in lib-selenium Key: NUTCH-3036 URL: https://issues.apache.org/jira/browse/NUTCH-3036 Project: Nutch Issue Type: Improvement Components: selenium, plugin Reporter: Lewis John McGibbney Assignee: Lewis John McGibbney Fix For: 1.20 lib-selenium currently packages org.seleniumhq.selenium:selenium-java *v4.7.2* but *v4.18.1* is available on Maven Central. This ticket will upgrade the java dependency and validate that both protocol-selenium and protocol-interactiveselenium work as expected in local mode and via selenium grid. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (NUTCH-3029) Host specific max. and min. intervals in adaptive scheduler
[ https://issues.apache.org/jira/browse/NUTCH-3029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17826776#comment-17826776 ] Lewis John McGibbney commented on NUTCH-3029: - Hi [~martin.dj] [~markus17] it looks like we are missing some Javadoc {quote} [javadoc] Standard Doclet version 11.0.22 {quote} {quote} [javadoc] Building tree for all the packages and classes... [javadoc] /home/runner/work/nutch/nutch/src/java/org/apache/nutch/crawl/AdaptiveFetchSchedule.java:193: warning: no @param for url [javadoc] public static String getHostName(String url) throws URISyntaxException { [javadoc] ^ [javadoc] /home/runner/work/nutch/nutch/src/java/org/apache/nutch/crawl/AdaptiveFetchSchedule.java:193: warning: no @return [javadoc] public static String getHostName(String url) throws URISyntaxException { [javadoc] ^ [javadoc] /home/runner/work/nutch/nutch/src/java/org/apache/nutch/crawl/AdaptiveFetchSchedule.java:193: warning: no @throws for java.net.URISyntaxException [javadoc] public static String getHostName(String url) throws URISyntaxException { [javadoc] ^ [javadoc] /home/runner/work/nutch/nutch/src/java/org/apache/nutch/crawl/AdaptiveFetchSchedule.java:205: warning: no @return [javadoc] public float getMaxInterval(Text url, float defaultMaxInterval){ [javadoc] ^ [javadoc] /home/runner/work/nutch/nutch/src/java/org/apache/nutch/crawl/AdaptiveFetchSchedule.java:227: warning: no @return [javadoc] public float getMinInterval(Text url, float defaultMinInterval){ {quote} {quote} [javadoc] ^{quote} > Host specific max. and min. intervals in adaptive scheduler > --- > > Key: NUTCH-3029 > URL: https://issues.apache.org/jira/browse/NUTCH-3029 > Project: Nutch > Issue Type: New Feature >Affects Versions: 1.19, 1.20 >Reporter: Martin Djukanovic >Assignee: Markus Jelsma >Priority: Minor > Attachments: adaptive-host-specific-intervals.txt.template, > new_adaptive_fetch_schedule-1.patch > > > This patch implements custom max. and min. 
refetching intervals for specific > hosts, in the AdaptiveFetchSchedule class. The intervals are set up in a .txt > configuration file (template also attached). -- This message was sent by Atlassian Jira (v8.20.10#820010)
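The Javadoc warnings quoted in the comment above ask for @param, @return and @throws tags on getHostName(String). A minimal sketch of what a fully documented method could look like follows. The class name HostNameJavadocSketch and the method body are assumptions made for illustration; the actual implementation in AdaptiveFetchSchedule may differ. The warnings for getMaxInterval and getMinInterval would be addressed the same way, by adding an @return tag.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical holder class; only the method signature comes from the build log.
class HostNameJavadocSketch {

  /**
   * Extracts the host name from a URL string.
   *
   * @param url the URL to parse, for example
   *          {@code https://nutch.apache.org/index.html}
   * @return the host component of the URL, or {@code null} if the URL has no
   *         host (for example a relative reference)
   * @throws URISyntaxException if the given string cannot be parsed as a URI
   */
  public static String getHostName(String url) throws URISyntaxException {
    // Assumed body, for illustration only: delegate parsing to java.net.URI.
    return new URI(url).getHost();
  }

  public static void main(String[] args) throws URISyntaxException {
    System.out.println(getHostName("https://nutch.apache.org/index.html"));
  }
}
```

With all three tags present, the Standard Doclet should no longer report the "no @param", "no @return" and "no @throws" warnings for this method.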
[jira] [Closed] (NUTCH-3033) Upgrade Ivy to v2.5.2
[ https://issues.apache.org/jira/browse/NUTCH-3033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney closed NUTCH-3033. --- > Upgrade Ivy to v2.5.2 > - > > Key: NUTCH-3033 > URL: https://issues.apache.org/jira/browse/NUTCH-3033 > Project: Nutch > Issue Type: Task > Components: ivy >Reporter: Lewis John McGibbney >Assignee: Lewis John McGibbney >Priority: Major > Fix For: 1.20 > > > Ivy v2.5.2 was released August 20th 2023. Let’s upgrade. > [https://ant.apache.org/ivy/history/2.5.2/release-notes.html] -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (NUTCH-3033) Upgrade Ivy to v2.5.2
[ https://issues.apache.org/jira/browse/NUTCH-3033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney resolved NUTCH-3033. - Resolution: Fixed > Upgrade Ivy to v2.5.2 > - > > Key: NUTCH-3033 > URL: https://issues.apache.org/jira/browse/NUTCH-3033 > Project: Nutch > Issue Type: Task > Components: ivy >Reporter: Lewis John McGibbney >Assignee: Lewis John McGibbney >Priority: Major > Fix For: 1.20 > > > Ivy v2.5.2 was released August 20th 2023. Let’s upgrade. > [https://ant.apache.org/ivy/history/2.5.2/release-notes.html] -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (NUTCH-3033) Upgrade Ivy to v2.5.2
[ https://issues.apache.org/jira/browse/NUTCH-3033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-3033: Due Date: 12/Mar/24 (was: 11/Mar/24) > Upgrade Ivy to v2.5.2 > - > > Key: NUTCH-3033 > URL: https://issues.apache.org/jira/browse/NUTCH-3033 > Project: Nutch > Issue Type: Task > Components: ivy >Reporter: Lewis John McGibbney >Assignee: Lewis John McGibbney >Priority: Major > Fix For: 1.20 > > > Ivy v2.5.2 was released August 20th 2023. Let’s upgrade. > [https://ant.apache.org/ivy/history/2.5.2/release-notes.html] -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Work stopped] (NUTCH-3033) Upgrade Ivy to v2.5.2
[ https://issues.apache.org/jira/browse/NUTCH-3033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on NUTCH-3033 stopped by Lewis John McGibbney. --- > Upgrade Ivy to v2.5.2 > - > > Key: NUTCH-3033 > URL: https://issues.apache.org/jira/browse/NUTCH-3033 > Project: Nutch > Issue Type: Task > Components: ivy >Reporter: Lewis John McGibbney >Assignee: Lewis John McGibbney >Priority: Major > Fix For: 1.20 > > > Ivy v2.5.2 was released August 20th 2023. Let’s upgrade. > [https://ant.apache.org/ivy/history/2.5.2/release-notes.html] -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (NUTCH-3034) Overhaul the legacy Nutch plugin framework and replace it with PF4J
[ https://issues.apache.org/jira/browse/NUTCH-3034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-3034: Description: h1. Motivation Plugins provide a large part of the functionality of Nutch. Although the legacy plugin framework continues to offer lots of value, i.e., # some aspects, e.g. examples, are [fairly well documented|https://cwiki.apache.org/confluence/display/NUTCH/PluginCentral] # it is generally stable, and # offers reasonable test coverage (on a plugin-by-plugin basis) # … probably loads more positives which I am overlooking... … there are also several aspects which could be improved # the [core framework is sparsely documented|https://cwiki.apache.org/confluence/display/NUTCH/WhichTechnicalConceptsAreBehindTheNutchPluginSystem]; this extends to very important aspects like the {*}plugin lifecycle{*}, {*}classloading{*}, {*}packaging{*}, {*}thread safety{*}, and lots of other topics which are of intrinsic value to developers and maintainers. # the core framework is somewhat [sparsely tested|https://github.com/apache/nutch/blob/master/src/test/org/apache/nutch/plugin/TestPluginSystem.java]… currently 7 tests as of writing. Traditionally, developers have focused on providing unit tests at the plugin level as opposed to the legacy plugin framework. # it sees very little maintenance/attention. My gut feeling (and I may be totally wrong here) is that not many people know much about the core legacy plugin framework. # writing plugins is clunky. This largely has to do with the legacy Ant + Ivy build and dependency management system, but that being said, it is clunky nonetheless. # generally speaking, any reduction of code in the Nutch codebase through careful selection of, and dependence on, well-maintained, well-tested 3rd party libraries would be a good thing for the Nutch codebase. 
*This issue therefore proposes to overhaul the legacy Nutch plugin framework and replace it with Plugin Framework for Java (PF4J).* h1. Task Breakdown The following is a proposed breakdown of this overall initiative into Epics. These Epics should likely be decomposed further, but that will be left to the implementer(s). # {*}document the legacy Nutch plugin lifecycle{*}; taking inspiration from [PF4J’s plugin lifecycle documentation|https://pf4j.org/doc/plugin-lifecycle.html], provide both documentation and a diagram which clearly outline how the legacy plugin lifecycle works. It might also be a good idea to make a contribution to PF4J and provide them with a diagram to accompany their documentation :). Generally speaking, just familiarize oneself with the legacy plugin framework and understand where the gaps are. # {*}study the PF4J framework and perform a feasibility study{*}; this will provide an opportunity to identify gaps between what the legacy plugin framework does (and what Nutch needs) vs. what PF4J provides. Touch base with the PF4J community and describe the intention to replace the legacy Nutch plugin framework with PF4J. Obtain guidance on how to proceed. Document this all in the Nutch wiki. Create a mapping of [legacy classes|https://github.com/apache/nutch/tree/master/src/java/org/apache/nutch/plugin] to [PF4J equivalents|https://github.com/pf4j/pf4j/tree/master/pf4j/src/main/java/org/pf4j]. # {*}Restructure the legacy Nutch plugin package{*}: [https://github.com/apache/nutch/tree/master/src/java/org/apache/nutch/plugin] # {*}Restructure each plugin in the plugins directory{*}: [https://github.com/apache/nutch/tree/master/src/plugin] # *Update Nutch plugin documentation* # {*}Create/propose plugin utility tooling{*}: #4 in the motivation section states that developing plugins is clunky. A utility tool which streamlines the creation of new plugins would be ideal. 
For example, this could take the form of a [new bash script|https://github.com/apache/nutch/tree/master/src/bin] which prompts the developer for input and then generates the plugin skeleton. {*}This is a nice-to-have{*}. h1. Google Summer of Code Details This initiative is being proposed as a GSoC 2024 project. {*}Proposed Mentor{*}: [~lewismc] {*}Proposed Co-Mentor{*}: was: h1. Motivation Plugins provide a large part of the functionality of Nutch. Although the legacy plugin framework continues to offer lots of value, i.e., # some aspects, e.g. examples, are [fairly well documented|https://cwiki.apache.org/confluence/display/NUTCH/PluginCentral] # it is generally stable, and # offers reasonable test coverage (on a plugin-by-plugin basis) # … probably loads more positives which I am overlooking... … there are also several aspects which could be improved # the [core framework is sparsely documented|[https://
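The plugin utility tooling proposed in the task breakdown above (a tool that takes a plugin name and generates a plugin skeleton) can be sketched roughly as follows. The ticket suggests a bash script; Java is used here only to stay in the project's main language. The class name PluginSkeletonSketch, the method createSkeleton, the id validation rule and the stub descriptor contents are all hypothetical. The layout (plugin.xml plus src/java and src/test) mirrors how existing plugins under src/plugin are organized, and the attribute values written to plugin.xml are placeholders.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical utility; not part of Nutch.
class PluginSkeletonSketch {

  /**
   * Creates a bare plugin directory layout below the given plugins directory.
   *
   * @param pluginsDir directory that holds all plugins, e.g. src/plugin
   * @param pluginId identifier of the new plugin, e.g. parse-example
   * @return the path of the newly created plugin directory
   * @throws IOException if a directory or file cannot be created
   */
  static Path createSkeleton(Path pluginsDir, String pluginId) throws IOException {
    // Assumed naming rule: lowercase letters, digits and hyphens only.
    if (pluginId == null || !pluginId.matches("[a-z][a-z0-9-]*")) {
      throw new IllegalArgumentException("Invalid plugin id: " + pluginId);
    }
    Path pluginDir = pluginsDir.resolve(pluginId);
    Files.createDirectories(pluginDir.resolve("src").resolve("java"));
    Files.createDirectories(pluginDir.resolve("src").resolve("test"));

    // Stub descriptor only: runtime libraries, required plugins and
    // extensions still have to be declared by the developer.
    String descriptor = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
        + "<plugin id=\"" + pluginId + "\" name=\"" + pluginId + "\"\n"
        + "        version=\"1.0.0\" provider-name=\"example.org\">\n"
        + "  <!-- TODO: add runtime, requires and extension elements -->\n"
        + "</plugin>\n";
    Files.write(pluginDir.resolve("plugin.xml"),
        descriptor.getBytes(StandardCharsets.UTF_8));
    return pluginDir;
  }

  public static void main(String[] args) throws IOException {
    // Writes into a temporary directory so the sketch never touches a checkout.
    Path tmp = Files.createTempDirectory("nutch-plugin-sketch");
    System.out.println("Created " + createSkeleton(tmp, "parse-example"));
  }
}
```

A real tool would also need to generate build.xml and ivy.xml and register the plugin in the top-level plugin build, which this sketch leaves out.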
[jira] [Updated] (NUTCH-3034) Overhaul the legacy Nutch plugin framework and replace it with PF4J
[ https://issues.apache.org/jira/browse/NUTCH-3034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-3034: Description: h1. Motivation Plugins provide a large part of the functionality of Nutch. Although the legacy plugin framework continues to offer lots of value, i.e., # some aspects, e.g. examples, are [fairly well documented|https://cwiki.apache.org/confluence/display/NUTCH/PluginCentral] # it is generally stable, and # offers reasonable test coverage (on a plugin-by-plugin basis) # … probably loads more positives which I am overlooking... … there are also several aspects which could be improved # the [core framework is sparsely documented|https://cwiki.apache.org/confluence/display/NUTCH/WhichTechnicalConceptsAreBehindTheNutchPluginSystem]; this extends to very important aspects like the {*}plugin lifecycle{*}, {*}classloading{*}, {*}packaging{*}, {*}thread safety{*}, and lots of other topics which are of intrinsic value to developers and maintainers. # the core framework is somewhat [sparsely tested|https://github.com/apache/nutch/blob/master/src/test/org/apache/nutch/plugin/TestPluginSystem.java]… currently 7 tests as of writing. Traditionally, developers have focused on providing unit tests at the plugin level as opposed to the legacy plugin framework. # it sees very little maintenance/attention. My gut feeling (and I may be totally wrong here) is that not many people know much about the core legacy plugin framework. # writing plugins is clunky. This largely has to do with the legacy Ant + Ivy build and dependency management system, but that being said, it is clunky nonetheless. # generally speaking, any reduction of code in the Nutch codebase through careful selection of, and dependence on, well-maintained, well-tested 3rd party libraries would be a good thing for the Nutch codebase. 
*This issue therefore proposes to overhaul the legacy Nutch plugin framework and replace it with Plugin Framework for Java (PF4J).* h1. Task Breakdown The following is a proposed breakdown of this overall initiative into Epics. These Epics should likely be decomposed further, but that will be left to the implementer(s). # {*}document the legacy Nutch plugin lifecycle{*}; taking inspiration from [PF4J’s plugin lifecycle documentation|https://pf4j.org/doc/plugin-lifecycle.html], provide both documentation and a diagram which clearly outline how the legacy plugin lifecycle works. It might also be a good idea to make a contribution to PF4J and provide them with a diagram to accompany their documentation :). Generally speaking, just familiarize oneself with the legacy plugin framework and understand where the gaps are. # {*}study the PF4J framework and perform a feasibility study{*}; this will provide an opportunity to identify gaps between what the legacy plugin framework does (and what Nutch needs) vs. what PF4J provides. Touch base with the PF4J community and describe the intention to replace the legacy Nutch plugin framework with PF4J. Obtain guidance on how to proceed. Document this all in the Nutch wiki. Create a mapping of [legacy classes|https://github.com/apache/nutch/tree/master/src/java/org/apache/nutch/plugin] to [PF4J equivalents|https://github.com/pf4j/pf4j/tree/master/pf4j/src/main/java/org/pf4j]. # {*}Restructure the legacy Nutch plugin package{*}: [https://github.com/apache/nutch/tree/master/src/java/org/apache/nutch/plugin] # {*}Restructure each plugin in the plugins directory{*}: [https://github.com/apache/nutch/tree/master/src/plugin] h1. Google Summer of Code Details This initiative is being proposed as a GSoC 2024 project. {*}Proposed Mentor{*}: [~lewismc] {*}Proposed Co-Mentor{*}: was: h1. Motivation Plugins provide a large part of the functionality of Nutch. 
Although the legacy plugin framework continues to offer lots of value, i.e., # some aspects, e.g. examples, are [fairly well documented|https://cwiki.apache.org/confluence/display/NUTCH/PluginCentral] # it is generally stable, and # offers reasonable test coverage (on a plugin-by-plugin basis) # … probably loads more positives which I am overlooking... … there are also several aspects which could be improved # the [core framework is sparsely documented|https://cwiki.apache.org/confluence/display/NUTCH/WhichTechnicalConceptsAreBehindTheNutchPluginSystem]; this extends to very important aspects like the {*}plugin lifecycle{*}, {*}classloading{*}, {*}packaging{*}, {*}thread safety{*}, and lots of other topics which are of intrinsic value to developers and maintainers. # the core framework is somewhat [sparsely tested|[https://github.com/apache/nutch/blob/master/src/test/org/apache/nutch/plugin/TestPluginSystem.
[jira] [Updated] (NUTCH-3034) Overhaul the legacy Nutch plugin framework and replace it with PF4J
[ https://issues.apache.org/jira/browse/NUTCH-3034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-3034: Description: h1. Motivation Plugins provide a large part of the functionality of Nutch. Although the legacy plugin framework continues to offer lots of value, i.e., # some aspects, e.g. examples, are [fairly well documented|https://cwiki.apache.org/confluence/display/NUTCH/PluginCentral] # it is generally stable, and # offers reasonable test coverage (on a plugin-by-plugin basis) # … probably loads more positives which I am overlooking... … there are also several aspects which could be improved # the [core framework is sparsely documented|https://cwiki.apache.org/confluence/display/NUTCH/WhichTechnicalConceptsAreBehindTheNutchPluginSystem]; this extends to very important aspects like the {*}plugin lifecycle{*}, {*}classloading{*}, {*}packaging{*}, {*}thread safety{*}, and lots of other topics which are of intrinsic value to developers and maintainers. # the core framework is somewhat [sparsely tested|https://github.com/apache/nutch/blob/master/src/test/org/apache/nutch/plugin/TestPluginSystem.java]… currently 7 tests as of writing. Traditionally, developers have focused on providing unit tests at the plugin level as opposed to the legacy plugin framework. # it sees very little maintenance/attention. My gut feeling (and I may be totally wrong here) is that not many people know much about the core legacy plugin framework. # writing plugins is clunky. This largely has to do with the legacy Ant + Ivy build and dependency management system, but that being said, it is clunky nonetheless. # generally speaking, any reduction of code in the Nutch codebase through careful selection of, and dependence on, well-maintained, well-tested 3rd party libraries would be a good thing for the Nutch codebase. 
*This issue therefore proposes to overhaul the legacy Nutch plugin framework and replace it with Plugin Framework for Java (PF4J).* h1. Task Breakdown The following is a proposed breakdown of this overall initiative into Epics. These Epics should likely be decomposed further, but that will be left to the implementer(s). # {*}document the legacy Nutch plugin lifecycle{*}; taking inspiration from [PF4J’s plugin lifecycle documentation|https://pf4j.org/doc/plugin-lifecycle.html], provide both documentation and a diagram which clearly outline how the legacy plugin lifecycle works. It might also be a good idea to make a contribution to PF4J and provide them with a diagram to accompany their documentation :). Generally speaking, just familiarize oneself with the legacy plugin framework and understand where the gaps are. # {*}study the PF4J framework and perform a feasibility study{*}; this will provide an opportunity to identify gaps between what the legacy plugin framework does (and what Nutch needs) vs. what PF4J provides. Touch base with the PF4J community and describe the intention to replace the legacy Nutch plugin framework with PF4J. Obtain guidance on how to proceed. Document this all in the Nutch wiki. Create a mapping of [legacy classes|https://github.com/apache/nutch/tree/master/src/java/org/apache/nutch/plugin] to [PF4J equivalents|https://github.com/pf4j/pf4j/tree/master/pf4j/src/main/java/org/pf4j]. # {*}Restructure the legacy Nutch plugin package{*}: [https://github.com/apache/nutch/tree/master/src/java/org/apache/nutch/plugin] # {*}Restructure each plugin in the plugins directory{*}: [https://github.com/apache/nutch/tree/master/src/plugin] # was: h1. Motivation Plugins provide a large part of the functionality of Nutch. Although the legacy plugin framework continues to offer lots of value, i.e., # some aspects, e.g. 
examples, are [fairly well documented|https://cwiki.apache.org/confluence/display/NUTCH/PluginCentral] # it is generally stable, and # offers reasonable test coverage (on a plugin-by-plugin basis) # … probably loads more positives which I am overlooking... … there are also several aspects which could be improved # the [core framework is sparsely documented|https://cwiki.apache.org/confluence/display/NUTCH/WhichTechnicalConceptsAreBehindTheNutchPluginSystem]; this extends to very important aspects like the {*}plugin lifecycle{*}, {*}classloading{*}, {*}packaging{*}, {*}thread safety{*}, and lots of other topics which are of intrinsic value to developers and maintainers. # the core framework is somewhat [sparsely tested|https://github.com/apache/nutch/blob/master/src/test/org/apache/nutch/plugin/TestPluginSystem.java]… only 7 tests. Traditionally, developers have focused on providing unit tests at the plugin level as opposed to the legacy plugin framework. # se
[jira] [Updated] (NUTCH-3034) Overhaul the legacy Nutch plugin framework and replace it with PF4J
[ https://issues.apache.org/jira/browse/NUTCH-3034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-3034: Description: h1. Motivation Plugins provide a large part of the functionality of Nutch. Although the legacy plugin framework continues to offer lots of value, i.e., # some aspects, e.g. examples, are [fairly well documented|https://cwiki.apache.org/confluence/display/NUTCH/PluginCentral] # it is generally stable, and # it offers reasonable test coverage (on a plugin-by-plugin basis) # … probably loads more positives which I am overlooking... … there are also several aspects which could be improved: # the [core framework is sparsely documented|https://cwiki.apache.org/confluence/display/NUTCH/WhichTechnicalConceptsAreBehindTheNutchPluginSystem]; this extends to very important aspects like the {*}plugin lifecycle{*}, {*}classloading{*}, {*}packaging{*}, {*}thread safety{*}, and lots of other topics which are of intrinsic value to developers and maintainers. # the core framework is somewhat [sparsely tested|https://github.com/apache/nutch/blob/master/src/test/org/apache/nutch/plugin/TestPluginSystem.java]… only 7 tests. Traditionally, developers have focused on providing unit tests at the plugin level as opposed to the legacy plugin framework. # it sees very little maintenance/attention. It is my gut feeling (and I may be totally wrong here) but I _think_ that not many people know much about the core legacy plugin framework. # writing plugins is clunky. This largely has to do with the legacy Ant + Ivy build and dependency management system, but that being said, it is clunky nonetheless. *This issue therefore proposes to overhaul the legacy Nutch plugin framework and replace it with Plugin Framework for Java (PF4J).*
h1. Task Breakdown The following is a proposed breakdown of this overall initiative into Epics. These Epics should likely be decomposed further, but that will be left to the implementer(s). * {*}document the legacy Nutch plugin lifecycle{*}; taking inspiration from [PF4J’s plugin lifecycle documentation|https://pf4j.org/doc/plugin-lifecycle.html], provide both documentation and a diagram which clearly outline how the legacy plugin lifecycle works. It might also be a good idea to make a contribution to PF4J and provide them with a diagram to accompany their documentation :). * {*}study the PF4J framework and perform a feasibility study{*}; this will provide an opportunity to identify gaps between what the legacy plugin framework does (and what Nutch needs) vs. what PF4J provides. Touch base with the PF4J community and describe the intention to replace the legacy Nutch plugin framework with PF4J. Obtain guidance on how to proceed. Document this all in the Nutch wiki.
[jira] [Updated] (NUTCH-3034) Overhaul the legacy Nutch plugin framework and replace it with PF4J
[ https://issues.apache.org/jira/browse/NUTCH-3034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-3034: Description: h1. Motivation Plugins provide a large part of the functionality of Nutch. Although the legacy plugin framework continues to offer lots of value, i.e., # some aspects, e.g. examples, are [fairly well documented|https://cwiki.apache.org/confluence/display/NUTCH/PluginCentral] # it is generally stable, and # it offers reasonable test coverage (on a plugin-by-plugin basis) # … probably loads more positives which I am overlooking... … there are also several aspects which could be improved: # the [core framework is sparsely documented|https://cwiki.apache.org/confluence/display/NUTCH/WhichTechnicalConceptsAreBehindTheNutchPluginSystem]; this extends to very important aspects like the {*}plugin lifecycle{*}, {*}classloading{*}, {*}packaging{*}, {*}thread safety{*}, and lots of other topics which are of intrinsic value to developers and maintainers. # the core framework is somewhat [sparsely tested|https://github.com/apache/nutch/blob/master/src/test/org/apache/nutch/plugin/TestPluginSystem.java]… only 7 tests. Traditionally, developers have focused on providing unit tests at the plugin level as opposed to the legacy plugin framework. # it sees very little maintenance/attention. It is my gut feeling (and I may be totally wrong here) but I _think_ that not many people know much about the core legacy plugin framework. # writing plugins is clunky. This largely has to do with the legacy Ant + Ivy build and dependency management system, but that being said, it is clunky nonetheless. *This issue therefore proposes to overhaul the legacy Nutch plugin framework and replace it with Plugin Framework for Java (PF4J).*
h1. Task Breakdown The following is a proposed breakdown of this overall initiative into Epics. These Epics should likely be decomposed further, but that will be left to the implementer(s). * {*}perform feasibility study{*}; touch base with the PF4J community and describe the intention to replace the legacy Nutch plugin framework with PF4J. Obtain guidance on how to proceed. Document this all in the Nutch wiki. * {*}document the legacy Nutch plugin lifecycle{*}; taking inspiration from [PF4J’s plugin lifecycle documentation|https://pf4j.org/doc/plugin-lifecycle.html], provide both documentation and a diagram which clearly outline how the legacy plugin lifecycle works. It might also be a good idea to make a contribution to PF4J and provide them with a diagram to accompany their documentation :)
[jira] [Created] (NUTCH-3034) Overhaul the legacy Nutch plugin framework and replace it with PF4J
Lewis John McGibbney created NUTCH-3034: --- Summary: Overhaul the legacy Nutch plugin framework and replace it with PF4J Key: NUTCH-3034 URL: https://issues.apache.org/jira/browse/NUTCH-3034 Project: Nutch Issue Type: Improvement Components: pf4j, plugin Reporter: Lewis John McGibbney h1. Motivation Plugins provide a large part of the functionality of Nutch. Although the legacy plugin framework continues to offer lots of value, i.e., # some aspects, e.g. examples, are [fairly well documented|https://cwiki.apache.org/confluence/display/NUTCH/PluginCentral] # it is generally stable, and # it offers reasonable test coverage (on a plugin-by-plugin basis) # … probably loads more positives which I am overlooking... … there are also several aspects which could be improved: # the [core framework is sparsely documented|https://cwiki.apache.org/confluence/display/NUTCH/WhichTechnicalConceptsAreBehindTheNutchPluginSystem]; this extends to very important aspects like the {*}plugin lifecycle{*}, {*}classloading{*}, {*}packaging{*}, {*}thread safety{*}, and lots of other topics which are of intrinsic value to developers and maintainers. # the core framework is somewhat [sparsely tested|https://github.com/apache/nutch/blob/master/src/test/org/apache/nutch/plugin/TestPluginSystem.java]… only 7 tests. Traditionally, developers have focused on providing unit tests at the plugin level as opposed to the legacy plugin framework. # it sees very little maintenance/attention. It is my gut feeling (and I may be totally wrong here) but I _think_ that not many people know much about the core legacy plugin framework. # writing plugins is clunky. This largely has to do with the legacy Ant + Ivy build and dependency management system, but that being said, it is clunky nonetheless. *This issue therefore proposes to overhaul the legacy Nutch plugin framework and replace it with Plugin Framework for Java (PF4J).* h1.
Task Breakdown The following is a proposed breakdown of this overall initiative into Epics. These Epics should likely be decomposed further, but that will be left to the implementer(s). * {*}perform feasibility study{*}; touch base with the PF4J community and describe the intention to replace the legacy Nutch plugin framework with PF4J. Obtain guidance on how to proceed. Document this all in the Nutch wiki. * {*}document the legacy Nutch plugin lifecycle{*}; taking inspiration from [PF4J’s plugin lifecycle documentation|https://pf4j.org/doc/plugin-lifecycle.html], provide both documentation and a diagram which clearly outline how the legacy plugin lifecycle works. It might also be a good idea to make a contribution to PF4J and provide them with a diagram to accompany their documentation :) -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (NUTCH-3033) Upgrade Ivy to v2.5.2
Lewis John McGibbney created NUTCH-3033: --- Summary: Upgrade Ivy to v2.5.2 Key: NUTCH-3033 URL: https://issues.apache.org/jira/browse/NUTCH-3033 Project: Nutch Issue Type: Task Components: ivy Reporter: Lewis John McGibbney Assignee: Lewis John McGibbney Fix For: 1.20 Ivy v2.5.2 was released August 20th 2023. Let’s upgrade. [https://ant.apache.org/ivy/history/2.5.2/release-notes.html] -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Work started] (NUTCH-3033) Upgrade Ivy to v2.5.2
[ https://issues.apache.org/jira/browse/NUTCH-3033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on NUTCH-3033 started by Lewis John McGibbney. --- > Upgrade Ivy to v2.5.2 > - > > Key: NUTCH-3033 > URL: https://issues.apache.org/jira/browse/NUTCH-3033 > Project: Nutch > Issue Type: Task > Components: ivy >Reporter: Lewis John McGibbney >Assignee: Lewis John McGibbney >Priority: Major > Fix For: 1.20 > > > Ivy v2.5.2 was released August 20th 2023. Let’s upgrade. > [https://ant.apache.org/ivy/history/2.5.2/release-notes.html] -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Closed] (NUTCH-3024) Remove flaky 'dependency check' target
[ https://issues.apache.org/jira/browse/NUTCH-3024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney closed NUTCH-3024. --- > Remove flaky 'dependency check' target > -- > > Key: NUTCH-3024 > URL: https://issues.apache.org/jira/browse/NUTCH-3024 > Project: Nutch > Issue Type: Task > Components: build >Affects Versions: 1.19 >Reporter: Lewis John McGibbney >Assignee: Lewis John McGibbney >Priority: Minor > Fix For: 1.20 > > > I [started a > thread|https://lists.apache.org/thread/ol3ssjphdqqxwsxhc65qoqg1dj1kjbxb] > covering my observations running the ant _*dependency-check*_ target. It > fails unpredictably in both GitHub actions and our trusty Jenkins builds on > ci-builds.apache.org. > I propose to simply remove this target (and associated configuration) in a > bid to clean up some flaky legacy build code. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (NUTCH-3024) Remove flaky 'dependency check' target
[ https://issues.apache.org/jira/browse/NUTCH-3024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney resolved NUTCH-3024. - Resolution: Fixed > Remove flaky 'dependency check' target > -- > > Key: NUTCH-3024 > URL: https://issues.apache.org/jira/browse/NUTCH-3024 > Project: Nutch > Issue Type: Task > Components: build >Affects Versions: 1.19 >Reporter: Lewis John McGibbney >Assignee: Lewis John McGibbney >Priority: Minor > Fix For: 1.20 > > > I [started a > thread|https://lists.apache.org/thread/ol3ssjphdqqxwsxhc65qoqg1dj1kjbxb] > covering my observations running the ant _*dependency-check*_ target. It > fails unpredictably in both GitHub actions and our trusty Jenkins builds on > ci-builds.apache.org. > I propose to simply remove this target (and associated configuration) in a > bid to clean up some flaky legacy build code. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Closed] (NUTCH-3007) Fix impossible casts
[ https://issues.apache.org/jira/browse/NUTCH-3007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney closed NUTCH-3007. --- > Fix impossible casts > > > Key: NUTCH-3007 > URL: https://issues.apache.org/jira/browse/NUTCH-3007 > Project: Nutch > Issue Type: Sub-task >Affects Versions: 1.19 >Reporter: Sebastian Nagel >Assignee: Sebastian Nagel >Priority: Major > Fix For: 1.20 > > > Spotbugs reports two occurrences of > Impossible cast from java.util.ArrayList to String[] in > org.apache.nutch.fetcher.Fetcher.run(Map, String) > Both were introduced later into the {{run(Map args, String > crawlId)}} method and obviously never used (would throw a > ClassCastException). The code blocks should be removed. -- This message was sent by Atlassian Jira (v8.20.10#820010)
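For readers unfamiliar with the "impossible cast" warning quoted in NUTCH-3007, here is a minimal sketch of the pattern. It is not the actual Fetcher code: the class `ImpossibleCastSketch`, its methods and the `"segments"` key are hypothetical names chosen for illustration. The ticket itself proposes removing the unused blocks, so the conversion shown is only an alternative for code that does need the value.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of the SpotBugs finding; not taken from Nutch.
class ImpossibleCastSketch {

  /** Flagged pattern: compiles, but throws ClassCastException if the value is a List. */
  static String[] brokenRead(Map<String, Object> params, String key) {
    return (String[]) params.get(key);
  }

  /** Converts explicitly instead of casting between unrelated types. */
  static String[] safeRead(Map<String, Object> params, String key) {
    Object value = params.get(key);
    if (value instanceof String[]) {
      return (String[]) value;
    }
    if (value instanceof List<?>) {
      List<?> list = (List<?>) value;
      String[] out = new String[list.size()];
      for (int i = 0; i < out.length; i++) {
        out[i] = String.valueOf(list.get(i));
      }
      return out;
    }
    return new String[0]; // key missing or of an unexpected type
  }

  public static void main(String[] argv) {
    Map<String, Object> params = new HashMap<>();
    List<String> segments = new ArrayList<>();
    segments.add("segment-a");
    params.put("segments", segments);
    try {
      brokenRead(params, "segments");
    } catch (ClassCastException e) {
      System.out.println("cast failed at runtime, as the warning predicts");
    }
    System.out.println(safeRead(params, "segments").length);
  }
}
```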
[jira] [Closed] (NUTCH-2846) Fix various bugs spotted by NUTCH-2815
[ https://issues.apache.org/jira/browse/NUTCH-2846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney closed NUTCH-2846. --- > Fix various bugs spotted by NUTCH-2815 > -- > > Key: NUTCH-2846 > URL: https://issues.apache.org/jira/browse/NUTCH-2846 > Project: Nutch > Issue Type: Sub-task >Affects Versions: 1.18 >Reporter: Sebastian Nagel >Priority: Major > Fix For: 1.19 > > > This issue addresses various bugs spotted by Spotbugs (NUTCH-2815): > - use static method Integer.parseInt(...) > - use integer arithmetic instead of floating point with rounding floats > afterwards > - erroneous declaration of constructor in BasicURLNormalizer > - fix bracketing when calculating hash code of CrawlDatum -- This message was sent by Atlassian Jira (v8.20.10#820010)
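The "fix bracketing when calculating hash code" item in NUTCH-2846 can be illustrated with a general Java precedence pitfall. The sketch below assumes the problem was of this kind; the ticket does not show the code, so treat the example as an illustration rather than the actual CrawlDatum fix. The class and method names are hypothetical.

```java
// Hypothetical illustration of an operator-precedence bug when packing bytes into an int.
class ShiftPrecedenceSketch {

  /** '+' binds tighter than '<<', so this is parsed as hi << (8 + lo). */
  static int unbracketed(int hi, int lo) {
    return hi << 8 + lo;
  }

  /** Explicit brackets give the intended (hi << 8) + lo. */
  static int bracketed(int hi, int lo) {
    return (hi << 8) + lo;
  }

  public static void main(String[] argv) {
    System.out.println(unbracketed(1, 1)); // prints 512, i.e. 1 << 9
    System.out.println(bracketed(1, 1));   // prints 257, i.e. 256 + 1
  }
}
```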
[jira] [Closed] (NUTCH-2852) Method invokes System.exit(...) 9 bugs
[ https://issues.apache.org/jira/browse/NUTCH-2852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney closed NUTCH-2852. --- > Method invokes System.exit(...) 9 bugs > -- > > Key: NUTCH-2852 > URL: https://issues.apache.org/jira/browse/NUTCH-2852 > Project: Nutch > Issue Type: Sub-task >Affects Versions: 1.18 >Reporter: Lewis John McGibbney >Assignee: Lewis John McGibbney >Priority: Major > Fix For: 1.20 > > > org.apache.nutch.indexer.IndexingFiltersChecker since first historized release > In class org.apache.nutch.indexer.IndexingFiltersChecker > In method org.apache.nutch.indexer.IndexingFiltersChecker.run(String[]) > At IndexingFiltersChecker.java:[line 96] > Another occurrence at IndexingFiltersChecker.java:[line 129] > org.apache.nutch.indexer.IndexingFiltersChecker.run(String[]) invokes > System.exit(...), which shuts down the entire virtual machine. > Invoking System.exit shuts down the entire Java virtual machine. This should > only be done when it is appropriate. Such calls make it hard or impossible > for your code to be invoked by other code. Consider throwing a > RuntimeException instead. > Also occurs in >org.apache.nutch.net.URLFilterChecker since first historized release >org.apache.nutch.net.URLNormalizerChecker since first historized release >org.apache.nutch.parse.ParseSegment since first historized release >org.apache.nutch.parse.ParserChecker since first historized release >org.apache.nutch.service.NutchServer since first historized release >org.apache.nutch.tools.CommonCrawlDataDumper since first historized release >org.apache.nutch.tools.DmozParser since first historized release >org.apache.nutch.util.AbstractChecker since first historized release -- This message was sent by Atlassian Jira (v8.20.10#820010)
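The advice quoted in NUTCH-2852 can be sketched as follows: `run` reports problems through its return value or an exception and never terminates the JVM. `CheckerSketch` and its argument handling are hypothetical and do not reproduce any of the Nutch classes listed in the ticket.

```java
// Hypothetical command-line tool; names and behaviour are illustrative only.
class CheckerSketch {

  /**
   * Returns a status code instead of calling System.exit(...), so other code
   * (tests, services, other tools) can invoke it safely.
   */
  static int run(String[] args) {
    if (args.length == 0) {
      System.err.println("Usage: CheckerSketch <url>");
      return -1; // the caller decides what to do with the failure
    }
    if (!args[0].startsWith("http")) {
      // unrecoverable input problems surface as an exception, not a JVM shutdown
      throw new IllegalArgumentException("Unsupported URL: " + args[0]);
    }
    return 0;
  }
}
```

Under this arrangement only a thin `main` method would translate the status into a process exit code, for example with `System.exit(CheckerSketch.run(args))`.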
[jira] [Closed] (NUTCH-2819) Move spotbugs "installation" directory to avoid that spotbugs is shipped in Nutch runtime
[ https://issues.apache.org/jira/browse/NUTCH-2819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney closed NUTCH-2819. --- > Move spotbugs "installation" directory to avoid that spotbugs is shipped in > Nutch runtime > - > > Key: NUTCH-2819 > URL: https://issues.apache.org/jira/browse/NUTCH-2819 > Project: Nutch > Issue Type: Sub-task >Affects Versions: 1.18 >Reporter: Sebastian Nagel >Assignee: Shashanka Balakuntala Srinivasa >Priority: Minor > Fix For: 1.19 > > > With NUTCH-2816 the Spotbugs tool is "installed" in lib/. However, files in > lib/ are copied to build/ and runtime/. To avoid that the spotbugs jars are > shipped in runtime and eventually also releases, the spotbugs installation > folder should be moved into a different directory. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Closed] (NUTCH-2851) Random object created and used only once
[ https://issues.apache.org/jira/browse/NUTCH-2851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney closed NUTCH-2851. --- > Random object created and used only once > > > Key: NUTCH-2851 > URL: https://issues.apache.org/jira/browse/NUTCH-2851 > Project: Nutch > Issue Type: Sub-task > Components: dmoz, generator, indexer, segment >Affects Versions: 1.18 >Reporter: Lewis John McGibbney >Assignee: Lewis John McGibbney >Priority: Major > Fix For: 1.19 > > > In class org.apache.nutch.crawl.Generator > In method org.apache.nutch.crawl.Generator.partitionSegment(Path, Path, int) > Called method java.util.Random.nextInt() > At Generator.java:[line 1016] > Random object created and used only once in > org.apache.nutch.crawl.Generator.partitionSegment(Path, Path, int) > This code creates a java.util.Random object, uses it to generate one random > number, and then discards the Random object. This produces mediocre quality > random numbers and is inefficient. If possible, rewrite the code so that the > Random object is created once and saved, and each time a new random number is > required invoke a method on the existing Random object to obtain it. > If it is important that the generated Random numbers not be guessable, you > must not create a new Random for each random number; the values are too > easily guessable. You should strongly consider using a > java.security.SecureRandom instead (and avoid allocating a new SecureRandom > for each random number needed). > This bad practice also affects the following > org.apache.nutch.indexer.IndexingJob since first historized release > org.apache.nutch.segment.SegmentReader since first historized release > org.apache.nutch.tools.DmozParser$RDFProcessor since first historized release -- This message was sent by Atlassian Jira (v8.20.10#820010)
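A minimal sketch of the remedy described in the NUTCH-2851 report: create the `Random` once and reuse it. `RandomReuseSketch` is a hypothetical class and is not the Generator code. Where values must not be guessable, the report recommends `java.security.SecureRandom`, which can be held in a field in the same way.

```java
import java.util.Random;

// Hypothetical illustration; not the actual Nutch Generator code.
class RandomReuseSketch {

  // Created once and shared; java.util.Random is safe to use from several threads.
  private static final Random RANDOM = new Random();

  /** Flagged pattern: a new generator is allocated for a single number. */
  static int freshEachTime() {
    return new Random().nextInt();
  }

  /** Preferred: every call draws from the same generator. */
  static int reused() {
    return RANDOM.nextInt();
  }

  /** Same idea with an upper bound (exclusive). */
  static int reusedBounded(int bound) {
    return RANDOM.nextInt(bound);
  }
}
```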
[jira] [Closed] (NUTCH-2850) Method ignores exceptional return value
[ https://issues.apache.org/jira/browse/NUTCH-2850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney closed NUTCH-2850. --- > Method ignores exceptional return value > --- > > Key: NUTCH-2850 > URL: https://issues.apache.org/jira/browse/NUTCH-2850 > Project: Nutch > Issue Type: Sub-task > Components: dumpers >Affects Versions: 1.18 >Reporter: Lewis John McGibbney >Assignee: Lewis John McGibbney >Priority: Major > Fix For: 1.19 > > > In class org.apache.nutch.tools.FileDumper > In method org.apache.nutch.tools.FileDumper.dump(File, File, String[], > boolean, boolean, boolean) > Called method java.io.File.mkdirs() > At FileDumper.java:[line 237] > Exceptional return value of java.io.File.mkdirs() ignored in > org.apache.nutch.tools.FileDumper.dump(File, File, String[], boolean, > boolean, boolean) > This method returns a value that is not checked. The return value should be > checked since it can indicate an unusual or unexpected function execution. > For example, the File.delete() method returns false if the file could not be > successfully deleted (rather than throwing an Exception). If you don't check > the result, you won't notice if the method invocation signals unexpected > behavior by returning an atypical return value. -- This message was sent by Atlassian Jira (v8.20.10#820010)
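The check that the NUTCH-2850 report asks for might look like the sketch below. `MkdirsSketch.ensureDirectory` is a hypothetical helper, not the FileDumper code. Because `File.mkdirs()` also returns `false` when the directory already exists, the sketch re-checks `isDirectory()` before treating the result as an error.

```java
import java.io.File;
import java.io.IOException;

// Hypothetical helper; not the actual FileDumper code.
class MkdirsSketch {

  /** Creates the directory (and parents) or throws if it cannot be created. */
  static void ensureDirectory(File dir) throws IOException {
    // mkdirs() returns false on failure AND when the directory already exists,
    // so only report an error if the directory is really missing afterwards
    if (!dir.mkdirs() && !dir.isDirectory()) {
      throw new IOException("Could not create directory: " + dir);
    }
  }
}
```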
[jira] [Created] (NUTCH-3024) Remove flaky 'dependency check' target
Lewis John McGibbney created NUTCH-3024: --- Summary: Remove flaky 'dependency check' target Key: NUTCH-3024 URL: https://issues.apache.org/jira/browse/NUTCH-3024 Project: Nutch Issue Type: Task Components: build Affects Versions: 1.19 Reporter: Lewis John McGibbney Assignee: Lewis John McGibbney Fix For: 1.20 I [started a thread|https://lists.apache.org/thread/ol3ssjphdqqxwsxhc65qoqg1dj1kjbxb] covering my observations running the ant _*dependency-check*_ target. It fails unpredictably in both GitHub actions and our trusty Jenkins builds on ci-builds.apache.org. I propose to simply remove this target (and associated configuration) in a bid to clean up some flaky legacy build code. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (NUTCH-3023) Use mikepenz/action-junit-report to improve interpretation of failed tests during CI
Lewis John McGibbney created NUTCH-3023: --- Summary: Use mikepenz/action-junit-report to improve interpretation of failed tests during CI Key: NUTCH-3023 URL: https://issues.apache.org/jira/browse/NUTCH-3023 Project: Nutch Issue Type: Task Components: build, test Affects Versions: 1.19 Reporter: Lewis John McGibbney Assignee: Lewis John McGibbney Fix For: 1.20 The following GitHub action could help improve the interpretation of unit test anomalies during a CI run. [https://github.com/mikepenz/action-junit-report] Rather than having to grep through the GitHub Action log, one could save time by interpreting the comments posted to the PR conversation thread. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Closed] (NUTCH-3014) Standardize Job names
[ https://issues.apache.org/jira/browse/NUTCH-3014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney closed NUTCH-3014. --- Thanks [~snagel] for the review > Standardize Job names > - > > Key: NUTCH-3014 > URL: https://issues.apache.org/jira/browse/NUTCH-3014 > Project: Nutch > Issue Type: Improvement > Components: configuration, runtime >Affects Versions: 1.19 >Reporter: Lewis John McGibbney >Assignee: Lewis John McGibbney >Priority: Minor > Fix For: 1.20 > > > There is a large degree of variability when we set the job name, for example: > > {{Job job = NutchJob.getInstance(getConf());}} > {{job.setJobName("read " + segment);}} > > Some examples mention the job name, others don't. Some use upper case, others > don't, etc. > I think we can standardize the NutchJob job names. This would help when > filtering jobs in YARN ResourceManager UI as well. > I propose we implement the following convention > * *Nutch* (mandatory) - static value which prepends the job name, assists > with distinguishing the Job as a NutchJob and making it easily findable. > * *${ClassName}* (mandatory) - literally the name of the Class the job is > encoded in > * *${additional info}* (optional) - value could further distinguish the type > of job (LinkRank Counter, LinkRank Initializer, LinkRank Inverter, etc.) > _{*}Nutch ${ClassName}{*}: *${additional info}*_ > _Examples:_ > * _Nutch LinkRank: Inverter_ > * _Nutch CrawlDb: + $crawldb_ > * _Nutch LinkDbReader: + $linkdb_ > Thanks for any suggestions/comments. -- This message was sent by Atlassian Jira (v8.20.10#820010)
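The naming convention proposed in NUTCH-3014 could be captured in a small helper along these lines. This is a sketch under stated assumptions: `JobNameSketch`, `jobName` and the nested placeholder class `LinkRank` are hypothetical and are not the code that was committed for the ticket.

```java
// Hypothetical helper illustrating the "Nutch ${ClassName}: ${additional info}" convention.
class JobNameSketch {

  /** Placeholder standing in for a real job class; exists only for this example. */
  static class LinkRank {}

  /** Builds "Nutch <SimpleClassName>" plus an optional ": <additional info>" suffix. */
  static String jobName(Class<?> jobClass, String additionalInfo) {
    String name = "Nutch " + jobClass.getSimpleName();
    if (additionalInfo != null && !additionalInfo.isEmpty()) {
      name += ": " + additionalInfo;
    }
    return name;
  }

  public static void main(String[] argv) {
    System.out.println(jobName(LinkRank.class, "Inverter")); // prints "Nutch LinkRank: Inverter"
    System.out.println(jobName(LinkRank.class, null));       // prints "Nutch LinkRank"
  }
}
```

With such a helper, a call like `job.setJobName(JobNameSketch.jobName(getClass(), segment.toString()))` would replace ad hoc strings such as `"read " + segment`.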
[jira] [Resolved] (NUTCH-3014) Standardize Job names
[ https://issues.apache.org/jira/browse/NUTCH-3014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney resolved NUTCH-3014. - Resolution: Fixed > Standardize Job names > - > > Key: NUTCH-3014 > URL: https://issues.apache.org/jira/browse/NUTCH-3014 > Project: Nutch > Issue Type: Improvement > Components: configuration, runtime >Affects Versions: 1.19 >Reporter: Lewis John McGibbney >Assignee: Lewis John McGibbney >Priority: Minor > Fix For: 1.20 > > > There is a large degree of variability when we set the job name, for example: > > {{Job job = NutchJob.getInstance(getConf());}} > {{job.setJobName("read " + segment);}} > > Some examples mention the job name, others don't. Some use upper case, others > don't, etc. > I think we can standardize the NutchJob job names. This would help when > filtering jobs in YARN ResourceManager UI as well. > I propose we implement the following convention > * *Nutch* (mandatory) - static value which prepends the job name, assists > with distinguishing the Job as a NutchJob and making it easily findable. > * *${ClassName}* (mandatory) - literally the name of the Class the job is > encoded in > * *${additional info}* (optional) - value could further distinguish the type > of job (LinkRank Counter, LinkRank Initializer, LinkRank Inverter, etc.) > _{*}Nutch ${ClassName}{*}: *${additional info}*_ > _Examples:_ > * _Nutch LinkRank: Inverter_ > * _Nutch CrawlDb: + $crawldb_ > * _Nutch LinkDbReader: + $linkdb_ > Thanks for any suggestions/comments. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (NUTCH-3022) Experiment formatting codebase per google-java-format
Lewis John McGibbney created NUTCH-3022: --- Summary: Experiment formatting codebase per google-java-format Key: NUTCH-3022 URL: https://issues.apache.org/jira/browse/NUTCH-3022 Project: Nutch Issue Type: Task Components: build Affects Versions: 1.19 Reporter: Lewis John McGibbney Assignee: Lewis John McGibbney Fix For: 1.20 I [started a mailing list thread|https://lists.apache.org/thread/ssmm6djyk5syvhmq701zjf0d9bobpk5n] asking whether we should integrate code linting/formatting into the CI. Seb provided some excellent, calculated input which inspired me to create this ticket. I will create a PR which lints the Nutch codebase per *google-java-format* and discuss the results. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Work stopped] (NUTCH-3014) Standardize Job names
[ https://issues.apache.org/jira/browse/NUTCH-3014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on NUTCH-3014 stopped by Lewis John McGibbney. --- > Standardize Job names > - > > Key: NUTCH-3014 > URL: https://issues.apache.org/jira/browse/NUTCH-3014 > Project: Nutch > Issue Type: Improvement > Components: configuration, runtime >Affects Versions: 1.19 >Reporter: Lewis John McGibbney >Assignee: Lewis John McGibbney >Priority: Minor > Fix For: 1.20 > > > There is a large degree of variability in how we set the job name, for example: > > {{Job job = NutchJob.getInstance(getConf());}} > {{job.setJobName("read " + segment);}} > > Some examples mention the job name, others don't. Some use upper case, others > don't, etc. > I think we can standardize the NutchJob job names. This would help when > filtering jobs in the YARN ResourceManager UI as well. > I propose we implement the following convention: > * *Nutch* (mandatory) - a static value prepended to the job name, which helps > distinguish the Job as a NutchJob and makes it easy to find. > * *${ClassName}* (mandatory) - the name of the class in which the job is > implemented > * *${additional info}* (optional) - a value that further distinguishes the type > of job (LinkRank Counter, LinkRank Initializer, LinkRank Inverter, etc.) > _{*}Nutch ${ClassName}{*}: *${additional info}*_ > _Examples:_ > * _Nutch LinkRank: Inverter_ > * _Nutch CrawlDb: + $crawldb_ > * _Nutch LinkDbReader: + $linkdb_ > Thanks for any suggestions/comments. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Work stopped] (NUTCH-3015) Add more CI steps to GitHub master-build.yml
[ https://issues.apache.org/jira/browse/NUTCH-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on NUTCH-3015 stopped by Lewis John McGibbney. --- > Add more CI steps to GitHub master-build.yml > > > Key: NUTCH-3015 > URL: https://issues.apache.org/jira/browse/NUTCH-3015 > Project: Nutch > Issue Type: Improvement > Components: build >Affects Versions: 1.19 >Reporter: Lewis John McGibbney >Assignee: Lewis John McGibbney >Priority: Minor > Fix For: 1.20 > > > With specific reference to the GitHub master-build.yml, we currently run > _*ant clean nightly javadoc -buildfile build.xml*_ as one mammoth task, and if > something fails it is unclear exactly what. > > There are several improvements I want to propose to the GitHub CI: > * run workflows against multiple environments/OSes, e.g. ubuntu, macos & > windows > * define multiple jobs which can run in parallel to speed up CI, e.g. javadoc > and nightly targets > * run more targets, e.g. linting, rat-sources, report-vulnerabilities, > report-licenses, etc. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Closed] (NUTCH-3015) Add more CI steps to GitHub master-build.yml
[ https://issues.apache.org/jira/browse/NUTCH-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney closed NUTCH-3015. --- > Add more CI steps to GitHub master-build.yml > > > Key: NUTCH-3015 > URL: https://issues.apache.org/jira/browse/NUTCH-3015 > Project: Nutch > Issue Type: Improvement > Components: build >Affects Versions: 1.19 >Reporter: Lewis John McGibbney >Assignee: Lewis John McGibbney >Priority: Minor > Fix For: 1.20 > > > With specific reference to the GitHub master-build.yml, we currently run > _*ant clean nightly javadoc -buildfile build.xml*_ as one mammoth task, and if > something fails it is unclear exactly what. > > There are several improvements I want to propose to the GitHub CI: > * run workflows against multiple environments/OSes, e.g. ubuntu, macos & > windows > * define multiple jobs which can run in parallel to speed up CI, e.g. javadoc > and nightly targets > * run more targets, e.g. linting, rat-sources, report-vulnerabilities, > report-licenses, etc. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (NUTCH-3015) Add more CI steps to GitHub master-build.yml
[ https://issues.apache.org/jira/browse/NUTCH-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney resolved NUTCH-3015. - Resolution: Fixed > Add more CI steps to GitHub master-build.yml > > > Key: NUTCH-3015 > URL: https://issues.apache.org/jira/browse/NUTCH-3015 > Project: Nutch > Issue Type: Improvement > Components: build >Affects Versions: 1.19 >Reporter: Lewis John McGibbney >Assignee: Lewis John McGibbney >Priority: Minor > Fix For: 1.20 > > > With specific reference to the GitHub master-build.yml, we currently run > _*ant clean nightly javadoc -buildfile build.xml*_ as one mammoth task, and if > something fails it is unclear exactly what. > > There are several improvements I want to propose to the GitHub CI: > * run workflows against multiple environments/OSes, e.g. ubuntu, macos & > windows > * define multiple jobs which can run in parallel to speed up CI, e.g. javadoc > and nightly targets > * run more targets, e.g. linting, rat-sources, report-vulnerabilities, > report-licenses, etc. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Work started] (NUTCH-2887) Migrate to JUnit 5 Jupiter
[ https://issues.apache.org/jira/browse/NUTCH-2887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on NUTCH-2887 started by Lewis John McGibbney. --- > Migrate to JUnit 5 Jupiter > -- > > Key: NUTCH-2887 > URL: https://issues.apache.org/jira/browse/NUTCH-2887 > Project: Nutch > Issue Type: Improvement > Components: test > Environment: Migrate >Reporter: Lewis John McGibbney >Assignee: Lewis John McGibbney >Priority: Major > Fix For: 1.20 > > > This effort is a bit of a beast. See the [JUnit migration > tips|https://junit.org/junit5/docs/current/user-guide/#migrating-from-junit4-tips] > for general guidance. A general grep for junit in src produces the following > {code:bash} > ./test/nutch-site.xml > ./test/org/apache/nutch/tools/TestCommonCrawlDataDumper.java > ./test/org/apache/nutch/net/TestURLNormalizers.java > ./test/org/apache/nutch/net/protocols/TestHttpDateFormat.java > ./test/org/apache/nutch/net/TestURLFilters.java > ./test/org/apache/nutch/util/TestStringUtil.java > ./test/org/apache/nutch/util/TestSuffixStringMatcher.java > ./test/org/apache/nutch/util/TestEncodingDetector.java > ./test/org/apache/nutch/util/TestMimeUtil.java > ./test/org/apache/nutch/util/TestPrefixStringMatcher.java > ./test/org/apache/nutch/util/DumpFileUtilTest.java > ./test/org/apache/nutch/util/TestNodeWalker.java > ./test/org/apache/nutch/util/WritableTestUtils.java > ./test/org/apache/nutch/util/TestTableUtil.java > ./test/org/apache/nutch/util/TestURLUtil.java > ./test/org/apache/nutch/util/TestGZIPUtils.java > ./test/org/apache/nutch/parse/TestParseText.java > ./test/org/apache/nutch/parse/TestOutlinks.java > ./test/org/apache/nutch/parse/TestParseData.java > ./test/org/apache/nutch/parse/TestOutlinkExtractor.java > ./test/org/apache/nutch/parse/TestParserFactory.java > ./test/org/apache/nutch/segment/TestSegmentMerger.java > ./test/org/apache/nutch/segment/TestSegmentMergerCrawlDatums.java > ./test/org/apache/nutch/plugin/TestPluginSystem.java > 
./test/org/apache/nutch/fetcher/TestFetcher.java > ./test/org/apache/nutch/protocol/TestProtocolFactory.java > ./test/org/apache/nutch/protocol/TestContent.java > ./test/org/apache/nutch/protocol/AbstractHttpProtocolPluginTest.java > ./test/org/apache/nutch/crawl/TestCrawlDbFilter.java > ./test/org/apache/nutch/crawl/TestTextProfileSignature.java > ./test/org/apache/nutch/crawl/TestCrawlDbStates.java > ./test/org/apache/nutch/crawl/TestGenerator.java > ./test/org/apache/nutch/crawl/TestAdaptiveFetchSchedule.java > ./test/org/apache/nutch/crawl/TODOTestCrawlDbStates.java > ./test/org/apache/nutch/crawl/TestSignatureFactory.java > ./test/org/apache/nutch/crawl/ContinuousCrawlTestUtil.java > ./test/org/apache/nutch/crawl/TestInjector.java > ./test/org/apache/nutch/crawl/TestLinkDbMerger.java > ./test/org/apache/nutch/crawl/TestCrawlDbMerger.java > ./test/org/apache/nutch/service/TestNutchServer.java > ./test/org/apache/nutch/metadata/TestMetadata.java > ./test/org/apache/nutch/metadata/TestSpellCheckedMetadata.java > ./test/org/apache/nutch/indexer/TestIndexingFilters.java > ./test/org/apache/nutch/indexer/TestIndexerMapReduce.java > ./bin/nutch > ./plugin/scoring-orphan/src/test/org/apache/nutch/scoring/orphan/TestOrphanScoringFilter.java > ./plugin/index-basic/src/test/org/apache/nutch/indexer/basic/TestBasicIndexingFilter.java > ./plugin/urlfilter-domaindenylist/build.xml > ./plugin/urlfilter-domaindenylist/src/test/org/apache/nutch/urlfilter/domaindenylist/TestDomainDenylistURLFilter.java > ./plugin/protocol-imaps/plugin.xml > ./plugin/protocol-imaps/ivy.xml > ./plugin/protocol-imaps/lib/junit-4.13.jar > ./plugin/protocol-imaps/lib/greenmail-junit4-1.6.0.jar > ./plugin/protocol-imaps/lib/greenmail-1.6.0.jar > ./plugin/protocol-imaps/src/test/org/apache/nutch/protocol/imaps/TestImaps.java > ./plugin/protocol-file/build.xml > ./plugin/protocol-file/src/test/org/apache/nutch/protocol/file/TestProtocolFile.java > ./plugin/urlnormalizer-regex/build.xml > 
./plugin/urlnormalizer-regex/src/test/org/apache/nutch/net/urlnormalizer/regex/TestRegexURLNormalizer.java > ./plugin/build-plugin.xml > ./plugin/creativecommons/src/test/org/creativecommons/nutch/TestCCParseFilter.java > ./plugin/urlnormalizer-basic/src/test/org/apache/nutch/net/urlnormalizer/basic/TestBasicURLNormalizer.java > ./plugin/urlnormalizer-protocol/build.xml > ./plugin/urlnormalizer-protocol/src/test/org/apache/nutch/net/urlnormalizer/protocol/TestProtocolURLNormalizer.java > ./plugin/urlfilter-prefix/src/test/org/apache/nutch/urlfilter/prefix/TestPrefixURLFilter.java > ./plugin/urlfilter-suffix/src/test/org/apache/nutch/urlfilter/suffix/TestSuffixURLFilter.java > ./plugin/index-more/src/test/org/apache/nutch/indexer/more/TestMoreIndexingFilter.java > ./plugin/parse-ext/src/test/org/apache/nutc
[jira] [Created] (NUTCH-3016) Upgrade Apache Ivy to 2.5.2
Lewis John McGibbney created NUTCH-3016: --- Summary: Upgrade Apache Ivy to 2.5.2 Key: NUTCH-3016 URL: https://issues.apache.org/jira/browse/NUTCH-3016 Project: Nutch Issue Type: Task Components: ivy, build Affects Versions: 1.19 Reporter: Lewis John McGibbney Assignee: Lewis John McGibbney Fix For: 1.20 [Apache Ivy v2.5.2|https://ant.apache.org/ivy/history/2.5.2/release-notes.html] was released on August 20 2023! We should upgrade. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (NUTCH-2887) Migrate to JUnit 5 Jupiter
[ https://issues.apache.org/jira/browse/NUTCH-2887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney reassigned NUTCH-2887: --- Assignee: Lewis John McGibbney > Migrate to JUnit 5 Jupiter > -- > > Key: NUTCH-2887 > URL: https://issues.apache.org/jira/browse/NUTCH-2887 > Project: Nutch > Issue Type: Improvement > Components: test > Environment: Migrate >Reporter: Lewis John McGibbney >Assignee: Lewis John McGibbney >Priority: Major > Fix For: 1.20 > > > This effort is a bit of a beast. See the [JUnit migration > tips|https://junit.org/junit5/docs/current/user-guide/#migrating-from-junit4-tips] > for general guidance. A general grep for junit in src produces the following > {code:bash} > ./test/nutch-site.xml > ./test/org/apache/nutch/tools/TestCommonCrawlDataDumper.java > ./test/org/apache/nutch/net/TestURLNormalizers.java > ./test/org/apache/nutch/net/protocols/TestHttpDateFormat.java > ./test/org/apache/nutch/net/TestURLFilters.java > ./test/org/apache/nutch/util/TestStringUtil.java > ./test/org/apache/nutch/util/TestSuffixStringMatcher.java > ./test/org/apache/nutch/util/TestEncodingDetector.java > ./test/org/apache/nutch/util/TestMimeUtil.java > ./test/org/apache/nutch/util/TestPrefixStringMatcher.java > ./test/org/apache/nutch/util/DumpFileUtilTest.java > ./test/org/apache/nutch/util/TestNodeWalker.java > ./test/org/apache/nutch/util/WritableTestUtils.java > ./test/org/apache/nutch/util/TestTableUtil.java > ./test/org/apache/nutch/util/TestURLUtil.java > ./test/org/apache/nutch/util/TestGZIPUtils.java > ./test/org/apache/nutch/parse/TestParseText.java > ./test/org/apache/nutch/parse/TestOutlinks.java > ./test/org/apache/nutch/parse/TestParseData.java > ./test/org/apache/nutch/parse/TestOutlinkExtractor.java > ./test/org/apache/nutch/parse/TestParserFactory.java > ./test/org/apache/nutch/segment/TestSegmentMerger.java > ./test/org/apache/nutch/segment/TestSegmentMergerCrawlDatums.java > 
./test/org/apache/nutch/plugin/TestPluginSystem.java > ./test/org/apache/nutch/fetcher/TestFetcher.java > ./test/org/apache/nutch/protocol/TestProtocolFactory.java > ./test/org/apache/nutch/protocol/TestContent.java > ./test/org/apache/nutch/protocol/AbstractHttpProtocolPluginTest.java > ./test/org/apache/nutch/crawl/TestCrawlDbFilter.java > ./test/org/apache/nutch/crawl/TestTextProfileSignature.java > ./test/org/apache/nutch/crawl/TestCrawlDbStates.java > ./test/org/apache/nutch/crawl/TestGenerator.java > ./test/org/apache/nutch/crawl/TestAdaptiveFetchSchedule.java > ./test/org/apache/nutch/crawl/TODOTestCrawlDbStates.java > ./test/org/apache/nutch/crawl/TestSignatureFactory.java > ./test/org/apache/nutch/crawl/ContinuousCrawlTestUtil.java > ./test/org/apache/nutch/crawl/TestInjector.java > ./test/org/apache/nutch/crawl/TestLinkDbMerger.java > ./test/org/apache/nutch/crawl/TestCrawlDbMerger.java > ./test/org/apache/nutch/service/TestNutchServer.java > ./test/org/apache/nutch/metadata/TestMetadata.java > ./test/org/apache/nutch/metadata/TestSpellCheckedMetadata.java > ./test/org/apache/nutch/indexer/TestIndexingFilters.java > ./test/org/apache/nutch/indexer/TestIndexerMapReduce.java > ./bin/nutch > ./plugin/scoring-orphan/src/test/org/apache/nutch/scoring/orphan/TestOrphanScoringFilter.java > ./plugin/index-basic/src/test/org/apache/nutch/indexer/basic/TestBasicIndexingFilter.java > ./plugin/urlfilter-domaindenylist/build.xml > ./plugin/urlfilter-domaindenylist/src/test/org/apache/nutch/urlfilter/domaindenylist/TestDomainDenylistURLFilter.java > ./plugin/protocol-imaps/plugin.xml > ./plugin/protocol-imaps/ivy.xml > ./plugin/protocol-imaps/lib/junit-4.13.jar > ./plugin/protocol-imaps/lib/greenmail-junit4-1.6.0.jar > ./plugin/protocol-imaps/lib/greenmail-1.6.0.jar > ./plugin/protocol-imaps/src/test/org/apache/nutch/protocol/imaps/TestImaps.java > ./plugin/protocol-file/build.xml > ./plugin/protocol-file/src/test/org/apache/nutch/protocol/file/TestProtocolFile.java > 
./plugin/urlnormalizer-regex/build.xml > ./plugin/urlnormalizer-regex/src/test/org/apache/nutch/net/urlnormalizer/regex/TestRegexURLNormalizer.java > ./plugin/build-plugin.xml > ./plugin/creativecommons/src/test/org/creativecommons/nutch/TestCCParseFilter.java > ./plugin/urlnormalizer-basic/src/test/org/apache/nutch/net/urlnormalizer/basic/TestBasicURLNormalizer.java > ./plugin/urlnormalizer-protocol/build.xml > ./plugin/urlnormalizer-protocol/src/test/org/apache/nutch/net/urlnormalizer/protocol/TestProtocolURLNormalizer.java > ./plugin/urlfilter-prefix/src/test/org/apache/nutch/urlfilter/prefix/TestPrefixURLFilter.java > ./plugin/urlfilter-suffix/src/test/org/apache/nutch/urlfilter/suffix/TestSuffixURLFilter.java > ./plugin/index-more/src/test/org/apache/nutch/indexer/more/TestMoreIndexingFilter.java > ./plugin/parse-ext/src
[jira] [Work started] (NUTCH-3015) Add more CI steps to GitHub master-build.yml
[ https://issues.apache.org/jira/browse/NUTCH-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on NUTCH-3015 started by Lewis John McGibbney. --- > Add more CI steps to GitHub master-build.yml > > > Key: NUTCH-3015 > URL: https://issues.apache.org/jira/browse/NUTCH-3015 > Project: Nutch > Issue Type: Improvement > Components: build >Affects Versions: 1.19 >Reporter: Lewis John McGibbney >Assignee: Lewis John McGibbney >Priority: Minor > Fix For: 1.20 > > > With specific reference to the GitHub master-build.yml, we currently run > _*ant clean nightly javadoc -buildfile build.xml*_ as one mammoth task, and if > something fails it is unclear exactly what. > > There are several improvements I want to propose to the GitHub CI: > * run workflows against multiple environments/OSes, e.g. ubuntu, macos & > windows > * define multiple jobs which can run in parallel to speed up CI, e.g. javadoc > and nightly targets > * run more targets, e.g. linting, rat-sources, report-vulnerabilities, > report-licenses, etc. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Work started] (NUTCH-3014) Standardize Job names
[ https://issues.apache.org/jira/browse/NUTCH-3014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on NUTCH-3014 started by Lewis John McGibbney. --- > Standardize Job names > - > > Key: NUTCH-3014 > URL: https://issues.apache.org/jira/browse/NUTCH-3014 > Project: Nutch > Issue Type: Improvement > Components: configuration, runtime >Affects Versions: 1.19 >Reporter: Lewis John McGibbney >Assignee: Lewis John McGibbney >Priority: Minor > Fix For: 1.20 > > > There is a large degree of variability in how we set the job name, for example: > > {{Job job = NutchJob.getInstance(getConf());}} > {{job.setJobName("read " + segment);}} > > Some examples mention the job name, others don't. Some use upper case, others > don't, etc. > I think we can standardize the NutchJob job names. This would help when > filtering jobs in the YARN ResourceManager UI as well. > I propose we implement the following convention: > * *Nutch* (mandatory) - a static value prepended to the job name, which helps > distinguish the Job as a NutchJob and makes it easy to find. > * *${ClassName}* (mandatory) - the name of the class in which the job is > implemented > * *${additional info}* (optional) - a value that further distinguishes the type > of job (LinkRank Counter, LinkRank Initializer, LinkRank Inverter, etc.) > _{*}Nutch ${ClassName}{*}: *${additional info}*_ > _Examples:_ > * _Nutch LinkRank: Inverter_ > * _Nutch CrawlDb: + $crawldb_ > * _Nutch LinkDbReader: + $linkdb_ > Thanks for any suggestions/comments. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (NUTCH-3015) Add more CI steps to GitHub master-build.yml
Lewis John McGibbney created NUTCH-3015: --- Summary: Add more CI steps to GitHub master-build.yml Key: NUTCH-3015 URL: https://issues.apache.org/jira/browse/NUTCH-3015 Project: Nutch Issue Type: Improvement Components: build Affects Versions: 1.19 Reporter: Lewis John McGibbney Assignee: Lewis John McGibbney Fix For: 1.20 With specific reference to the GitHub master-build.yml, we currently run _*ant clean nightly javadoc -buildfile build.xml*_ as one mammoth task, and if something fails it is unclear exactly what. There are several improvements I want to propose to the GitHub CI: * run workflows against multiple environments/OSes, e.g. ubuntu, macos & windows * define multiple jobs which can run in parallel to speed up CI, e.g. javadoc and nightly targets * run more targets, e.g. linting, rat-sources, report-vulnerabilities, report-licenses, etc. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (NUTCH-3014) Standardize Job names
[ https://issues.apache.org/jira/browse/NUTCH-3014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-3014: Description: There is a large degree of variability when we set the job name}}{}}} {{Job job = NutchJob.getInstance(getConf());}} {{job.setJobName("read " + segment);}} Some examples mention the job name, others don't. Some use upper case, others don't, etc. I think we can standardize the NutchJob job names. This would help when filtering jobs in YARN ResourceManager UI as well. I propose we implement the following convention * *Nutch* (mandatory) - static value which prepends the job name, assists with distinguishing the Job as a NutchJob and making it easily findable. * *${ClassName}* (mandatory) - literally the name of the Class the job is encoded in * *${additional info}* (optional) - value could further distinguish the type of job (LinkRank Counter, LinkRank Initializer, LinkRank Inverter, etc.) _{*}Nutch ${ClassName}{*}: *${additional info}*_ _Examples:_ * _Nutch LinkRank: Inverter_ * _Nutch CrawlDb: + $crawldb_ * _Nutch LinkDbReader: + $linkdb_ Thanks for any suggestions/comments. was: There is a large degree of variability when we set the job name}}{}}} {{Job job = NutchJob.getInstance(getConf());}} {{job.setJobName("read " + segment);}} Some examples mention the job name, others don't. Some use upper case, others don't, etc. I think we can standardize the NutchJob job names. This would help when filtering jobs in YARN ResourceManager UI as well. I propose we implement the following convention * *Nutch* (mandatory) - static value which prepends the job name, assists with distinguishing the Job as a NutchJob and making it easily findable. * *${ClassName}* (mandatory) - literally the name of the Class the job is encoded in * *${additional info}* (optional) - value could further distinguish the type of job (LinkRank Counter, LinkRank Initializer, LinkRank Inverter, etc.) 
_{*}Nutch ${ClassName}{*}: *${additional info}*_ _Examples:_ * _Nutch LinkRank Inverter_ * _Nutch CrawlDb + $crawldb_ * _Nutch LinkDbReader + $linkdb_ Thanks for any suggestions/comments. > Standardize Job names > - > > Key: NUTCH-3014 > URL: https://issues.apache.org/jira/browse/NUTCH-3014 > Project: Nutch > Issue Type: Improvement > Components: configuration, runtime >Affects Versions: 1.19 >Reporter: Lewis John McGibbney >Assignee: Lewis John McGibbney >Priority: Minor > Fix For: 1.20 > > > There is a large degree of variability when we set the job name}}{}}} > > {{Job job = NutchJob.getInstance(getConf());}} > {{job.setJobName("read " + segment);}} > > Some examples mention the job name, others don't. Some use upper case, others > don't, etc. > I think we can standardize the NutchJob job names. This would help when > filtering jobs in YARN ResourceManager UI as well. > I propose we implement the following convention > * *Nutch* (mandatory) - static value which prepends the job name, assists > with distinguishing the Job as a NutchJob and making it easily findable. > * *${ClassName}* (mandatory) - literally the name of the Class the job is > encoded in > * *${additional info}* (optional) - value could further distinguish the type > of job (LinkRank Counter, LinkRank Initializer, LinkRank Inverter, etc.) > _{*}Nutch ${ClassName}{*}: *${additional info}*_ > _Examples:_ > * _Nutch LinkRank: Inverter_ > * _Nutch CrawlDb: + $crawldb_ > * _Nutch LinkDbReader: + $linkdb_ > Thanks for any suggestions/comments. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (NUTCH-3014) Standardize Job names
[ https://issues.apache.org/jira/browse/NUTCH-3014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-3014: Description: There is a large degree of variability when we set the job name}}{}}} {{Job job = NutchJob.getInstance(getConf());}} {{job.setJobName("read " + segment);}} Some examples mention the job name, others don't. Some use upper case, others don't, etc. I think we can standardize the NutchJob job names. This would help when filtering jobs in YARN ResourceManager UI as well. I propose we implement the following convention * *Nutch* (mandatory) - static value which prepends the job name, assists with distinguishing the Job as a NutchJob and making it easily findable. * *${ClassName}* (mandatory) - literally the name of the Class the job is encoded in * *${additional info}* (optional) - value could further distinguish the type of job (LinkRank Counter, LinkRank Initializer, LinkRank Inverter, etc.) _{*}Nutch ${ClassName}{*}: *${additional info}*_ _Examples:_ * _Nutch LinkRank Inverter_ * _Nutch CrawlDb + $crawldb_ * _Nutch LinkDbReader + $linkdb_ Thanks for any suggestions/comments. was: There is a large degree of variability when we set the job name{{{}{}}} {{Job job = NutchJob.getInstance(getConf());}} {{job.setJobName("read " + segment);}} Some examples mention the job name, others don't. Some use upper case, others don't, etc. I think we can standardize the NutchJob job names. This would help when filtering jobs in YARN ResourceManager UI as well. I propose we implement the following convention * *Nutch* (mandatory) - static value which prepends the job name, assists with distinguishing the Job as a NutchJob and making it easily findable. * *${ClassName}* (mandatory) - literally the name of the Class the job is encoded in * *${additional info}* (optional) - value could further distinguish the type of job (LinkRank Counter, LinkRank Initializer, LinkRank Inverter, etc.) 
_*Nutch ${ClassName}* *${additional info}*_ _Examples:_ * _Nutch LinkRank Inverter_ * _Nutch CrawlDb + $crawldb_ * _Nutch LinkDbReader + $linkdb_ Thanks for any suggestions/comments. > Standardize Job names > - > > Key: NUTCH-3014 > URL: https://issues.apache.org/jira/browse/NUTCH-3014 > Project: Nutch > Issue Type: Improvement > Components: configuration, runtime >Affects Versions: 1.19 >Reporter: Lewis John McGibbney >Assignee: Lewis John McGibbney >Priority: Minor > Fix For: 1.20 > > > There is a large degree of variability when we set the job name}}{}}} > > {{Job job = NutchJob.getInstance(getConf());}} > {{job.setJobName("read " + segment);}} > > Some examples mention the job name, others don't. Some use upper case, others > don't, etc. > I think we can standardize the NutchJob job names. This would help when > filtering jobs in YARN ResourceManager UI as well. > I propose we implement the following convention > * *Nutch* (mandatory) - static value which prepends the job name, assists > with distinguishing the Job as a NutchJob and making it easily findable. > * *${ClassName}* (mandatory) - literally the name of the Class the job is > encoded in > * *${additional info}* (optional) - value could further distinguish the type > of job (LinkRank Counter, LinkRank Initializer, LinkRank Inverter, etc.) > _{*}Nutch ${ClassName}{*}: *${additional info}*_ > _Examples:_ > * _Nutch LinkRank Inverter_ > * _Nutch CrawlDb + $crawldb_ > * _Nutch LinkDbReader + $linkdb_ > Thanks for any suggestions/comments. -- This message was sent by Atlassian Jira (v8.20.10#820010)