[jira] [Assigned] (NUTCH-2887) Migrate to JUnit 5 Jupiter

2023-10-23 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney reassigned NUTCH-2887:
---

Assignee: Lewis John McGibbney

> Migrate to JUnit 5 Jupiter
> --
>
> Key: NUTCH-2887
> URL: https://issues.apache.org/jira/browse/NUTCH-2887
> Project: Nutch
>  Issue Type: Improvement
>  Components: test
> Environment: Migrate 
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Major
> Fix For: 1.20
>
>
> This effort is a bit of a beast. See the [JUnit migration 
> tips|https://junit.org/junit5/docs/current/user-guide/#migrating-from-junit4-tips]
>  for general guidance. A general grep for junit in src produces the following:
> {code:bash}
> ./test/nutch-site.xml
> ./test/org/apache/nutch/tools/TestCommonCrawlDataDumper.java
> ./test/org/apache/nutch/net/TestURLNormalizers.java
> ./test/org/apache/nutch/net/protocols/TestHttpDateFormat.java
> ./test/org/apache/nutch/net/TestURLFilters.java
> ./test/org/apache/nutch/util/TestStringUtil.java
> ./test/org/apache/nutch/util/TestSuffixStringMatcher.java
> ./test/org/apache/nutch/util/TestEncodingDetector.java
> ./test/org/apache/nutch/util/TestMimeUtil.java
> ./test/org/apache/nutch/util/TestPrefixStringMatcher.java
> ./test/org/apache/nutch/util/DumpFileUtilTest.java
> ./test/org/apache/nutch/util/TestNodeWalker.java
> ./test/org/apache/nutch/util/WritableTestUtils.java
> ./test/org/apache/nutch/util/TestTableUtil.java
> ./test/org/apache/nutch/util/TestURLUtil.java
> ./test/org/apache/nutch/util/TestGZIPUtils.java
> ./test/org/apache/nutch/parse/TestParseText.java
> ./test/org/apache/nutch/parse/TestOutlinks.java
> ./test/org/apache/nutch/parse/TestParseData.java
> ./test/org/apache/nutch/parse/TestOutlinkExtractor.java
> ./test/org/apache/nutch/parse/TestParserFactory.java
> ./test/org/apache/nutch/segment/TestSegmentMerger.java
> ./test/org/apache/nutch/segment/TestSegmentMergerCrawlDatums.java
> ./test/org/apache/nutch/plugin/TestPluginSystem.java
> ./test/org/apache/nutch/fetcher/TestFetcher.java
> ./test/org/apache/nutch/protocol/TestProtocolFactory.java
> ./test/org/apache/nutch/protocol/TestContent.java
> ./test/org/apache/nutch/protocol/AbstractHttpProtocolPluginTest.java
> ./test/org/apache/nutch/crawl/TestCrawlDbFilter.java
> ./test/org/apache/nutch/crawl/TestTextProfileSignature.java
> ./test/org/apache/nutch/crawl/TestCrawlDbStates.java
> ./test/org/apache/nutch/crawl/TestGenerator.java
> ./test/org/apache/nutch/crawl/TestAdaptiveFetchSchedule.java
> ./test/org/apache/nutch/crawl/TODOTestCrawlDbStates.java
> ./test/org/apache/nutch/crawl/TestSignatureFactory.java
> ./test/org/apache/nutch/crawl/ContinuousCrawlTestUtil.java
> ./test/org/apache/nutch/crawl/TestInjector.java
> ./test/org/apache/nutch/crawl/TestLinkDbMerger.java
> ./test/org/apache/nutch/crawl/TestCrawlDbMerger.java
> ./test/org/apache/nutch/service/TestNutchServer.java
> ./test/org/apache/nutch/metadata/TestMetadata.java
> ./test/org/apache/nutch/metadata/TestSpellCheckedMetadata.java
> ./test/org/apache/nutch/indexer/TestIndexingFilters.java
> ./test/org/apache/nutch/indexer/TestIndexerMapReduce.java
> ./bin/nutch
> ./plugin/scoring-orphan/src/test/org/apache/nutch/scoring/orphan/TestOrphanScoringFilter.java
> ./plugin/index-basic/src/test/org/apache/nutch/indexer/basic/TestBasicIndexingFilter.java
> ./plugin/urlfilter-domaindenylist/build.xml
> ./plugin/urlfilter-domaindenylist/src/test/org/apache/nutch/urlfilter/domaindenylist/TestDomainDenylistURLFilter.java
> ./plugin/protocol-imaps/plugin.xml
> ./plugin/protocol-imaps/ivy.xml
> ./plugin/protocol-imaps/lib/junit-4.13.jar
> ./plugin/protocol-imaps/lib/greenmail-junit4-1.6.0.jar
> ./plugin/protocol-imaps/lib/greenmail-1.6.0.jar
> ./plugin/protocol-imaps/src/test/org/apache/nutch/protocol/imaps/TestImaps.java
> ./plugin/protocol-file/build.xml
> ./plugin/protocol-file/src/test/org/apache/nutch/protocol/file/TestProtocolFile.java
> ./plugin/urlnormalizer-regex/build.xml
> ./plugin/urlnormalizer-regex/src/test/org/apache/nutch/net/urlnormalizer/regex/TestRegexURLNormalizer.java
> ./plugin/build-plugin.xml
> ./plugin/creativecommons/src/test/org/creativecommons/nutch/TestCCParseFilter.java
> ./plugin/urlnormalizer-basic/src/test/org/apache/nutch/net/urlnormalizer/basic/TestBasicURLNormalizer.java
> ./plugin/urlnormalizer-protocol/build.xml
> ./plugin/urlnormalizer-protocol/src/test/org/apache/nutch/net/urlnormalizer/protocol/TestProtocolURLNormalizer.java
> ./plugin/urlfilter-prefix/src/test/org/apache/nutch/urlfilter/prefix/TestPrefixURLFilter.java
> ./plugin/urlfilter-suffix/src/test/org/apache/nutch/urlfilter/suffix/TestSuffixURLFilter.java
> ./plugin/index-more/src/test/org/apache/nutch/indexer/more/TestMoreIndexingFilter.java
> 

[jira] [Commented] (NUTCH-3015) Add more CI steps to GitHub master-build.yml

2023-10-23 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17778813#comment-17778813
 ] 

ASF GitHub Bot commented on NUTCH-3015:
---

lewismc commented on PR #790:
URL: https://github.com/apache/nutch/pull/790#issuecomment-1775944455

   I realize that this is a pretty HUGE pull request but I will qualify that by 
saying that absolutely no functionality has been changed here. The only changes 
are with the GitHub CI.




> Add more CI steps to GitHub master-build.yml
> 
>
> Key: NUTCH-3015
> URL: https://issues.apache.org/jira/browse/NUTCH-3015
> Project: Nutch
>  Issue Type: Improvement
>  Components: build
>Affects Versions: 1.19
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Minor
> Fix For: 1.20
>
>
> With specific reference to the GitHub master-build.yml, we currently run 
> _*ant clean nightly javadoc -buildfile build.xml*_ as one mammoth task, so if 
> something fails it is unclear exactly which step failed.
>  
> There are several improvements I want to propose to the GitHub CI:
>  * run workflows against multiple environments/OSes, e.g. ubuntu, macos & 
> windows
>  * define multiple jobs which can run in parallel to speed up CI, e.g. javadoc 
> and nightly targets
>  * run more targets, e.g. linting, rat-sources, report-vulnerabilities, 
> report-licenses, etc.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (NUTCH-3015) Add more CI steps to GitHub master-build.yml

2023-10-23 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17778812#comment-17778812
 ] 

ASF GitHub Bot commented on NUTCH-3015:
---

lewismc commented on PR #790:
URL: https://github.com/apache/nutch/pull/790#issuecomment-1775942988

   CI has stabilized and we now have passing builds for ubuntu and macos. 
Windows builds were failing so I just disabled them... I can add them back in 
though if we want to...?
   
   I also [kicked off a conversation about 
linting](https://lists.apache.org/thread/ssmm6djyk5syvhmq701zjf0d9bobpk5n) 
which we could add in a future PR.




[jira] [Work started] (NUTCH-3015) Add more CI steps to GitHub master-build.yml

2023-10-23 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on NUTCH-3015 started by Lewis John McGibbney.
---


[jira] [Work started] (NUTCH-3014) Standardize Job names

2023-10-23 Thread Lewis John McGibbney (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on NUTCH-3014 started by Lewis John McGibbney.
---
> Standardize Job names
> -
>
> Key: NUTCH-3014
> URL: https://issues.apache.org/jira/browse/NUTCH-3014
> Project: Nutch
>  Issue Type: Improvement
>  Components: configuration, runtime
>Affects Versions: 1.19
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Minor
> Fix For: 1.20
>
>
> There is a large degree of variability when we set the job name:
>  
> {{Job job = NutchJob.getInstance(getConf());}}
> {{job.setJobName("read " + segment);}}
>  
> Some examples mention the job name, others don't. Some use upper case, others 
> don't, etc.
> I think we can standardize the NutchJob job names. This would help when 
> filtering jobs in YARN ResourceManager UI as well.
> I propose we implement the following convention:
>  * *Nutch* (mandatory) - static value which prefixes the job name; it 
> distinguishes the Job as a NutchJob and makes it easy to find.
>  * *${ClassName}* (mandatory) - literally the name of the Class the job is 
> implemented in
>  * *${additional info}* (optional) - value which could further distinguish the 
> type of job (LinkRank Counter, LinkRank Initializer, LinkRank Inverter, etc.)
> _*Nutch ${ClassName}: ${additional info}*_
> _Examples:_
>  * _Nutch LinkRank: Inverter_
>  * _Nutch CrawlDb: + $crawldb_
>  * _Nutch LinkDbReader: + $linkdb_
> Thanks for any suggestions/comments.
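
The naming convention proposed above could be captured in one small helper so that every job builds its name the same way. What follows is only a minimal sketch under stated assumptions: the class JobNames, its of(...) methods and the nested stand-in class LinkRank are hypothetical names made up for illustration and are not part of Nutch or of the issue. The sketch leaves out Hadoop entirely and only shows the string construction.

```java
import java.util.Objects;

/** Hypothetical helper (not part of Nutch) sketching the proposed convention. */
public final class JobNames {

  /** Static, mandatory prefix from the proposal. */
  private static final String PREFIX = "Nutch";

  private JobNames() {
    // utility class, no instances
  }

  /** Stand-in for a real job class; exists only so the sketch is self-contained. */
  static final class LinkRank {
  }

  /** Builds "Nutch <ClassName>" when there is no additional info. */
  public static String of(Class<?> jobClass) {
    return of(jobClass, null);
  }

  /** Builds "Nutch <ClassName>: <additional info>"; the info part is optional. */
  public static String of(Class<?> jobClass, String additionalInfo) {
    Objects.requireNonNull(jobClass, "jobClass");
    String base = PREFIX + " " + jobClass.getSimpleName();
    if (additionalInfo == null || additionalInfo.trim().isEmpty()) {
      return base;
    }
    return base + ": " + additionalInfo.trim();
  }

  public static void main(String[] args) {
    System.out.println(of(LinkRank.class, "Inverter"));
    System.out.println(of(LinkRank.class));
  }
}
```

With a helper along these lines, a call such as job.setJobName("read " + segment) might become job.setJobName(JobNames.of(getClass(), "read " + segment)), which keeps the prefix and class name consistent without each job repeating them by hand.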





Nutch codebase formatting

2023-10-23 Thread lewis john mcgibbney
Hi dev@,

For the longest time the Nutch codebase has shipped with an
eclipse-codeformat.xml [0] file.
Whilst this has been largely successful in keeping the codebase uniform, it
has not been integrated into continuous integration (CI) and consequently
has never really been enforced!

Whilst I’m a big fan of “if it ain’t broken don’t fix it”, I think we
should have some CI code formatting checks. Additionally I really question
whether we need a Nutch custom code style at all… why don’t we just use
some other existing style and then enforce it?

I therefore propose that we replace the legacy code formatter with one of
the following:

* google-java-format [1], which offers a GitHub action for easy integration
into our CI process, or
* Checkstyle [2], which offers an Ant task which we could use; this is of
less utility as we think about the move to Gradle, or
* Super-Linter [3], which is emerging as a common default in OSS projects,
offers a GitHub action and could also be configured to lint Dockerfiles and
other artifacts. It can also be configured to use the Google Java style…

My preference would be [3] because it offers a more comprehensive linting
package for the entire codebase, not just the Java code.

Thanks for your consideration.
lewismc

[0]
https://github.com/apache/nutch/blob/master/eclipse-codeformat.xml
[1]
https://github.com/google/google-java-format
[2]
https://checkstyle.sourceforge.io/
[3]
https://github.com/marketplace/actions/super-linter