[jira] [Assigned] (NUTCH-2887) Migrate to JUnit 5 Jupiter
[ https://issues.apache.org/jira/browse/NUTCH-2887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lewis John McGibbney reassigned NUTCH-2887:
---
    Assignee: Lewis John McGibbney

> Migrate to JUnit 5 Jupiter
> --
>
> Key: NUTCH-2887
> URL: https://issues.apache.org/jira/browse/NUTCH-2887
> Project: Nutch
> Issue Type: Improvement
> Components: test
> Environment: Migrate
> Reporter: Lewis John McGibbney
> Assignee: Lewis John McGibbney
> Priority: Major
> Fix For: 1.20
>
>
> This effort is a bit of a beast. See the [JUnit migration tips|https://junit.org/junit5/docs/current/user-guide/#migrating-from-junit4-tips] for general guidance. A general grep for junit in src produces the following:
> {code:bash}
> ./test/nutch-site.xml
> ./test/org/apache/nutch/tools/TestCommonCrawlDataDumper.java
> ./test/org/apache/nutch/net/TestURLNormalizers.java
> ./test/org/apache/nutch/net/protocols/TestHttpDateFormat.java
> ./test/org/apache/nutch/net/TestURLFilters.java
> ./test/org/apache/nutch/util/TestStringUtil.java
> ./test/org/apache/nutch/util/TestSuffixStringMatcher.java
> ./test/org/apache/nutch/util/TestEncodingDetector.java
> ./test/org/apache/nutch/util/TestMimeUtil.java
> ./test/org/apache/nutch/util/TestPrefixStringMatcher.java
> ./test/org/apache/nutch/util/DumpFileUtilTest.java
> ./test/org/apache/nutch/util/TestNodeWalker.java
> ./test/org/apache/nutch/util/WritableTestUtils.java
> ./test/org/apache/nutch/util/TestTableUtil.java
> ./test/org/apache/nutch/util/TestURLUtil.java
> ./test/org/apache/nutch/util/TestGZIPUtils.java
> ./test/org/apache/nutch/parse/TestParseText.java
> ./test/org/apache/nutch/parse/TestOutlinks.java
> ./test/org/apache/nutch/parse/TestParseData.java
> ./test/org/apache/nutch/parse/TestOutlinkExtractor.java
> ./test/org/apache/nutch/parse/TestParserFactory.java
> ./test/org/apache/nutch/segment/TestSegmentMerger.java
> ./test/org/apache/nutch/segment/TestSegmentMergerCrawlDatums.java
> ./test/org/apache/nutch/plugin/TestPluginSystem.java
> ./test/org/apache/nutch/fetcher/TestFetcher.java
> ./test/org/apache/nutch/protocol/TestProtocolFactory.java
> ./test/org/apache/nutch/protocol/TestContent.java
> ./test/org/apache/nutch/protocol/AbstractHttpProtocolPluginTest.java
> ./test/org/apache/nutch/crawl/TestCrawlDbFilter.java
> ./test/org/apache/nutch/crawl/TestTextProfileSignature.java
> ./test/org/apache/nutch/crawl/TestCrawlDbStates.java
> ./test/org/apache/nutch/crawl/TestGenerator.java
> ./test/org/apache/nutch/crawl/TestAdaptiveFetchSchedule.java
> ./test/org/apache/nutch/crawl/TODOTestCrawlDbStates.java
> ./test/org/apache/nutch/crawl/TestSignatureFactory.java
> ./test/org/apache/nutch/crawl/ContinuousCrawlTestUtil.java
> ./test/org/apache/nutch/crawl/TestInjector.java
> ./test/org/apache/nutch/crawl/TestLinkDbMerger.java
> ./test/org/apache/nutch/crawl/TestCrawlDbMerger.java
> ./test/org/apache/nutch/service/TestNutchServer.java
> ./test/org/apache/nutch/metadata/TestMetadata.java
> ./test/org/apache/nutch/metadata/TestSpellCheckedMetadata.java
> ./test/org/apache/nutch/indexer/TestIndexingFilters.java
> ./test/org/apache/nutch/indexer/TestIndexerMapReduce.java
> ./bin/nutch
> ./plugin/scoring-orphan/src/test/org/apache/nutch/scoring/orphan/TestOrphanScoringFilter.java
> ./plugin/index-basic/src/test/org/apache/nutch/indexer/basic/TestBasicIndexingFilter.java
> ./plugin/urlfilter-domaindenylist/build.xml
> ./plugin/urlfilter-domaindenylist/src/test/org/apache/nutch/urlfilter/domaindenylist/TestDomainDenylistURLFilter.java
> ./plugin/protocol-imaps/plugin.xml
> ./plugin/protocol-imaps/ivy.xml
> ./plugin/protocol-imaps/lib/junit-4.13.jar
> ./plugin/protocol-imaps/lib/greenmail-junit4-1.6.0.jar
> ./plugin/protocol-imaps/lib/greenmail-1.6.0.jar
> ./plugin/protocol-imaps/src/test/org/apache/nutch/protocol/imaps/TestImaps.java
> ./plugin/protocol-file/build.xml
> ./plugin/protocol-file/src/test/org/apache/nutch/protocol/file/TestProtocolFile.java
> ./plugin/urlnormalizer-regex/build.xml
> ./plugin/urlnormalizer-regex/src/test/org/apache/nutch/net/urlnormalizer/regex/TestRegexURLNormalizer.java
> ./plugin/build-plugin.xml
> ./plugin/creativecommons/src/test/org/creativecommons/nutch/TestCCParseFilter.java
> ./plugin/urlnormalizer-basic/src/test/org/apache/nutch/net/urlnormalizer/basic/TestBasicURLNormalizer.java
> ./plugin/urlnormalizer-protocol/build.xml
> ./plugin/urlnormalizer-protocol/src/test/org/apache/nutch/net/urlnormalizer/protocol/TestProtocolURLNormalizer.java
> ./plugin/urlfilter-prefix/src/test/org/apache/nutch/urlfilter/prefix/TestPrefixURLFilter.java
> ./plugin/urlfilter-suffix/src/test/org/apache/nutch/urlfilter/suffix/TestSuffixURLFilter.java
> ./plugin/index-more/src/test/org/apache/nutch/indexer/more/TestMoreIndexingFilter.java
[jira] [Commented] (NUTCH-3015) Add more CI steps to GitHub master-build.yml
[ https://issues.apache.org/jira/browse/NUTCH-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17778813#comment-17778813 ]

ASF GitHub Bot commented on NUTCH-3015:
---
lewismc commented on PR #790:
URL: https://github.com/apache/nutch/pull/790#issuecomment-1775944455

   I realize that this is a pretty HUGE pull request, but I will qualify that by saying that absolutely no functionality has been changed here. The only changes are to the GitHub CI.

> Add more CI steps to GitHub master-build.yml
>
>
> Key: NUTCH-3015
> URL: https://issues.apache.org/jira/browse/NUTCH-3015
> Project: Nutch
> Issue Type: Improvement
> Components: build
> Affects Versions: 1.19
> Reporter: Lewis John McGibbney
> Assignee: Lewis John McGibbney
> Priority: Minor
> Fix For: 1.20
>
>
> With specific reference to the GitHub master-build.yml, we currently run _*ant clean nightly javadoc -buildfile build.xml*_ as one mammoth task, and if something fails it is unclear exactly what failed.
>
> There are several improvements I want to propose to the GitHub CI:
> * run workflows against multiple environments/OSes, e.g. ubuntu, macos & windows
> * define multiple jobs which can run in parallel to speed up CI, e.g. javadoc and nightly targets
> * run more targets, e.g. linting, rat-sources, report-vulnerabilities, report-licenses, etc.

--
This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (NUTCH-3015) Add more CI steps to GitHub master-build.yml
[ https://issues.apache.org/jira/browse/NUTCH-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17778812#comment-17778812 ]

ASF GitHub Bot commented on NUTCH-3015:
---
lewismc commented on PR #790:
URL: https://github.com/apache/nutch/pull/790#issuecomment-1775942988

   CI has stabilized and we now have passing builds for ubuntu and macos. Windows builds were failing, so I just disabled them... I can add them back in if we want to, though...? I also [kicked off a conversation about linting](https://lists.apache.org/thread/ssmm6djyk5syvhmq701zjf0d9bobpk5n), which we could add in a future PR.

> Add more CI steps to GitHub master-build.yml
>
>
> Key: NUTCH-3015
> URL: https://issues.apache.org/jira/browse/NUTCH-3015
> Project: Nutch
> Issue Type: Improvement
> Components: build
> Affects Versions: 1.19
> Reporter: Lewis John McGibbney
> Assignee: Lewis John McGibbney
> Priority: Minor
> Fix For: 1.20
>
>
> With specific reference to the GitHub master-build.yml, we currently run _*ant clean nightly javadoc -buildfile build.xml*_ as one mammoth task, and if something fails it is unclear exactly what failed.
>
> There are several improvements I want to propose to the GitHub CI:
> * run workflows against multiple environments/OSes, e.g. ubuntu, macos & windows
> * define multiple jobs which can run in parallel to speed up CI, e.g. javadoc and nightly targets
> * run more targets, e.g. linting, rat-sources, report-vulnerabilities, report-licenses, etc.
[jira] [Work started] (NUTCH-3015) Add more CI steps to GitHub master-build.yml
[ https://issues.apache.org/jira/browse/NUTCH-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Work on NUTCH-3015 started by Lewis John McGibbney.
---
> Add more CI steps to GitHub master-build.yml
>
>
> Key: NUTCH-3015
> URL: https://issues.apache.org/jira/browse/NUTCH-3015
> Project: Nutch
> Issue Type: Improvement
> Components: build
> Affects Versions: 1.19
> Reporter: Lewis John McGibbney
> Assignee: Lewis John McGibbney
> Priority: Minor
> Fix For: 1.20
>
>
> With specific reference to the GitHub master-build.yml, we currently run _*ant clean nightly javadoc -buildfile build.xml*_ as one mammoth task, and if something fails it is unclear exactly what failed.
>
> There are several improvements I want to propose to the GitHub CI:
> * run workflows against multiple environments/OSes, e.g. ubuntu, macos & windows
> * define multiple jobs which can run in parallel to speed up CI, e.g. javadoc and nightly targets
> * run more targets, e.g. linting, rat-sources, report-vulnerabilities, report-licenses, etc.
[jira] [Work started] (NUTCH-3014) Standardize Job names
[ https://issues.apache.org/jira/browse/NUTCH-3014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Work on NUTCH-3014 started by Lewis John McGibbney.
---
> Standardize Job names
> -
>
> Key: NUTCH-3014
> URL: https://issues.apache.org/jira/browse/NUTCH-3014
> Project: Nutch
> Issue Type: Improvement
> Components: configuration, runtime
> Affects Versions: 1.19
> Reporter: Lewis John McGibbney
> Assignee: Lewis John McGibbney
> Priority: Minor
> Fix For: 1.20
>
>
> There is a large degree of variability in how we set the job name:
>
> {{Job job = NutchJob.getInstance(getConf());}}
> {{job.setJobName("read " + segment);}}
>
> Some examples mention the job name, others don't. Some use upper case, others don't, etc.
> I think we can standardize the NutchJob job names. This would also help when filtering jobs in the YARN ResourceManager UI.
> I propose we implement the following convention:
> * *Nutch* (mandatory) - static value which is prepended to the job name; it distinguishes the Job as a NutchJob and makes it easy to find.
> * *${ClassName}* (mandatory) - literally the name of the Class the job is encoded in
> * *${additional info}* (optional) - value which could further distinguish the type of job (LinkRank Counter, LinkRank Initializer, LinkRank Inverter, etc.)
> _*Nutch ${ClassName}*: *${additional info}*_
> _Examples:_
> * _Nutch LinkRank: Inverter_
> * _Nutch CrawlDb: + $crawldb_
> * _Nutch LinkDbReader: + $linkdb_
> Thanks for any suggestions/comments.
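The convention proposed in NUTCH-3014 could be captured in one small helper so that every job builds its name the same way. The following is only a minimal sketch under stated assumptions: the class name `JobNames`, its `of` methods, and the nested `LinkRank` stand-in are hypothetical and are not part of Nutch or of the proposal itself.

```java
/** Sketch of a job-name helper for the convention proposed in NUTCH-3014 (hypothetical class). */
final class JobNames {

  /** Static, mandatory prefix that marks a job as a Nutch job. */
  private static final String PREFIX = "Nutch";

  private JobNames() {
    // utility class, not meant to be instantiated
  }

  /** Builds "Nutch ${ClassName}" when there is no additional info. */
  public static String of(Class<?> jobClass) {
    return of(jobClass, null);
  }

  /** Builds "Nutch ${ClassName}: ${additional info}"; the suffix is omitted if the info is blank. */
  public static String of(Class<?> jobClass, String additionalInfo) {
    String name = PREFIX + " " + jobClass.getSimpleName();
    if (additionalInfo == null || additionalInfo.trim().isEmpty()) {
      return name;
    }
    return name + ": " + additionalInfo.trim();
  }

  /** Hypothetical stand-in for a real job class, used only for the demo below. */
  static final class LinkRank {
  }

  public static void main(String[] args) {
    System.out.println(JobNames.of(LinkRank.class, "Inverter")); // prints "Nutch LinkRank: Inverter"
    System.out.println(JobNames.of(LinkRank.class));             // prints "Nutch LinkRank"
  }
}
```

With such a helper, a call like the one quoted in the issue might become `job.setJobName(JobNames.of(getClass(), "read " + segment));`, which keeps the prefix and class name consistent and leaves only the optional suffix to each job.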
Nutch codebase formatting
Hi dev@,

For the longest time the Nutch codebase has shipped with an eclipse-codeformat.xml [0] file. Whilst this has been largely successful in keeping the codebase uniform, it has not been integrated into continuous integration (CI) and consequently has never really been enforced!

Whilst I'm a big fan of "if it ain't broken, don't fix it", I think we should have some CI code formatting checks. Additionally, I really question whether we need a custom Nutch code style at all... why don't we just adopt an existing style and then enforce it?

I therefore propose that we replace the legacy code formatter with a convention such as

* google-java-format [1], which offers a GitHub action for easy integration into our CI process, or
* Checkstyle [2], which offers an Ant task we could use; this is of less utility as we think about the move to Gradle, or
* Super-Linter [3], which is emerging as the de facto default for OSS projects, offers a GitHub action and could also be configured to lint Dockerfiles and other artifacts. It can also be configured to use the Google Java style.

My preference would be [3] because it offers a more comprehensive linting package for the entire codebase, not just the Java code.

Thanks for your consideration.

lewismc

[0] https://github.com/apache/nutch/blob/master/eclipse-codeformat.xml
[1] https://github.com/google/google-java-format
[2] https://checkstyle.sourceforge.io/
[3] https://github.com/marketplace/actions/super-linter