[jira] [Commented] (NUTCH-3014) Standardize Job names
    [ https://issues.apache.org/jira/browse/NUTCH-3014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17782403#comment-17782403 ]

Hudson commented on NUTCH-3014:
-------------------------------

FAILURE: Integrated in Jenkins build Nutch » Nutch-trunk #136 (See [https://ci-builds.apache.org/job/Nutch/job/Nutch-trunk/136/])
NUTCH-3014 Standardize Job names (#789) (github: [https://github.com/apache/nutch/commit/bbf0867263ed1764c56fe7794c17942d0e8bf1c4])
* (edit) src/java/org/apache/nutch/crawl/LinkDb.java
* (edit) src/java/org/apache/nutch/crawl/Generator.java
* (edit) src/java/org/apache/nutch/crawl/CrawlDbMerger.java
* (edit) src/java/org/apache/nutch/crawl/Injector.java
* (edit) src/test/org/apache/nutch/plugin/TestPluginSystem.java
* (edit) src/java/org/apache/nutch/scoring/webgraph/LinkRank.java
* (edit) src/java/org/apache/nutch/scoring/webgraph/WebGraph.java
* (edit) src/java/org/apache/nutch/crawl/CrawlDbReader.java
* (edit) src/test/org/apache/nutch/crawl/TestCrawlDbFilter.java
* (edit) src/java/org/apache/nutch/indexer/CleaningJob.java
* (edit) src/java/org/apache/nutch/hostdb/ReadHostDb.java
* (edit) src/java/org/apache/nutch/scoring/webgraph/NodeDumper.java
* (edit) src/java/org/apache/nutch/segment/SegmentMerger.java
* (edit) src/java/org/apache/nutch/util/NutchJob.java
* (edit) src/java/org/apache/nutch/util/CrawlCompletionStats.java
* (edit) src/java/org/apache/nutch/util/SitemapProcessor.java
* (edit) src/java/org/apache/nutch/crawl/LinkDbReader.java
* (edit) src/java/org/apache/nutch/tools/warc/WARCExporter.java
* (edit) src/java/org/apache/nutch/crawl/LinkDbMerger.java
* (edit) src/java/org/apache/nutch/fetcher/Fetcher.java
* (edit) src/java/org/apache/nutch/util/ProtocolStatusStatistics.java
* (edit) src/java/org/apache/nutch/tools/arc/ArcSegmentCreator.java
* (edit) src/java/org/apache/nutch/hostdb/UpdateHostDb.java
* (edit) src/java/org/apache/nutch/tools/FreeGenerator.java
* (edit) src/java/org/apache/nutch/segment/SegmentReader.java
* (edit) src/java/org/apache/nutch/crawl/DeduplicationJob.java
* (edit) src/java/org/apache/nutch/scoring/webgraph/LinkDumper.java
* (edit) src/java/org/apache/nutch/crawl/CrawlDb.java
* (edit) src/java/org/apache/nutch/parse/ParseSegment.java
* (edit) src/java/org/apache/nutch/scoring/webgraph/ScoreUpdater.java
* (edit) src/java/org/apache/nutch/indexer/IndexingJob.java
* (edit) src/java/org/apache/nutch/util/domain/DomainStatistics.java

> Standardize Job names
> ---------------------
>
>                 Key: NUTCH-3014
>                 URL: https://issues.apache.org/jira/browse/NUTCH-3014
>             Project: Nutch
>          Issue Type: Improvement
>          Components: configuration, runtime
>    Affects Versions: 1.19
>            Reporter: Lewis John McGibbney
>            Assignee: Lewis John McGibbney
>            Priority: Minor
>             Fix For: 1.20
>
>
> There is a large degree of variability when we set the job name:
>
> {{Job job = NutchJob.getInstance(getConf());}}
> {{job.setJobName("read " + segment);}}
>
> Some examples mention the job name, others don't. Some use upper case, others
> don't, etc.
> I think we can standardize the NutchJob job names. This would help when
> filtering jobs in YARN ResourceManager UI as well.
> I propose we implement the following convention
> * *Nutch* (mandatory) - static value which prepends the job name, assists
> with distinguishing the Job as a NutchJob and making it easily findable.
> * *${ClassName}* (mandatory) - literally the name of the Class the job is
> encoded in
> * *${additional info}* (optional) - value could further distinguish the type
> of job (LinkRank Counter, LinkRank Initializer, LinkRank Inverter, etc.)
> _*Nutch ${ClassName}*: *${additional info}*_
> _Examples:_
> * _Nutch LinkRank: Inverter_
> * _Nutch CrawlDb: + $crawldb_
> * _Nutch LinkDbReader: + $linkdb_
> Thanks for any suggestions/comments.

--
This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (NUTCH-3014) Standardize Job names
    [ https://issues.apache.org/jira/browse/NUTCH-3014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17782392#comment-17782392 ]

ASF GitHub Bot commented on NUTCH-3014:
---------------------------------------

lewismc merged PR #789:
URL: https://github.com/apache/nutch/pull/789
[jira] [Commented] (NUTCH-3014) Standardize Job names
    [ https://issues.apache.org/jira/browse/NUTCH-3014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17782382#comment-17782382 ]

ASF GitHub Bot commented on NUTCH-3014:
---------------------------------------

lewismc commented on code in PR #789:
URL: https://github.com/apache/nutch/pull/789#discussion_r138646

## src/java/org/apache/nutch/crawl/CrawlDbReader.java: ##
@@ -812,7 +811,7 @@ public CrawlDatum get(String crawlDb, String url, Configuration config)
   @Override
   protected int process(String line, StringBuilder output) throws Exception {
-    Job job = NutchJob.getInstance(getConf());
+    Job job = Job.getInstance(getConf(), "Nutch CrawlDbReader: process " + this.crawlDb);

Review Comment:
Thanks @sebastian-nagel
[jira] [Commented] (NUTCH-3014) Standardize Job names
    [ https://issues.apache.org/jira/browse/NUTCH-3014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17780720#comment-17780720 ]

ASF GitHub Bot commented on NUTCH-3014:
---------------------------------------

sebastian-nagel commented on code in PR #789:
URL: https://github.com/apache/nutch/pull/789#discussion_r1375421979

## src/java/org/apache/nutch/crawl/CrawlDbReader.java: ##
@@ -812,7 +811,7 @@ public CrawlDatum get(String crawlDb, String url, Configuration config)
   @Override
   protected int process(String line, StringBuilder output) throws Exception {
-    Job job = NutchJob.getInstance(getConf());
+    Job job = Job.getInstance(getConf(), "Nutch CrawlDbReader: process " + this.crawlDb);

Review Comment:
`this` isn't really required here.
[jira] [Commented] (NUTCH-3014) Standardize Job names
    [ https://issues.apache.org/jira/browse/NUTCH-3014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17778448#comment-17778448 ]

ASF GitHub Bot commented on NUTCH-3014:
---------------------------------------

lewismc opened a new pull request, #789:
URL: https://github.com/apache/nutch/pull/789

Addresses https://issues.apache.org/jira/browse/NUTCH-3014

First pass at giving the Nutch Job names a bit of a face-lift.

Additional notes:
- I literally removed `NutchJob.getInstance()`... I didn't see much point in it. Maybe I am wrong.
- TestFetcher is failing locally... looking into that now. Will let CI run and see if we can reproduce this test failure.

> Standardize Job names
> ---------------------
>
>                 Key: NUTCH-3014
>                 URL: https://issues.apache.org/jira/browse/NUTCH-3014
>             Project: Nutch
>          Issue Type: Improvement
>          Components: configuration, runtime
>    Affects Versions: 1.19
>            Reporter: Lewis John McGibbney
>            Assignee: Lewis John McGibbney
>            Priority: Minor
>             Fix For: 1.20
>
>
> There is a large degree of variability when we set the job name:
>
> {{Job job = NutchJob.getInstance(getConf());}}
> {{job.setJobName("read " + segment);}}
>
> Some examples mention the job name, others don't. Some use upper case, others
> don't, etc.
> I think we can standardize the NutchJob job names. This would help when
> filtering jobs in YARN ResourceManager UI as well.
> I propose we implement the following convention
> * *Nutch* (mandatory) - static value which prepends the job name, assists
> with distinguishing the Job as a NutchJob and making it easily findable.
> * *${ClassName}* (mandatory) - literally the name of the Class the job is
> encoded in
> * *${additional info}* (optional) - value could further distinguish the type
> of job (LinkRank Counter, LinkRank Initializer, LinkRank Inverter, etc.)
> _*Nutch ${ClassName}* *${additional info}*_
> _Examples:_
> * _Nutch LinkRank Inverter_
> * _Nutch CrawlDb + $crawldb_
> * _Nutch LinkDbReader + $linkdb_
> Thanks for any suggestions/comments.
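
Editorial note: the naming convention proposed in the issue description can be illustrated with a small helper. This is only a sketch under stated assumptions, not code from PR #789: the class `JobNames` and its method `of` are hypothetical names introduced here, and the colon separator follows the revised form of the convention quoted in the later notifications ("Nutch LinkRank: Inverter"). The resulting string would then be passed as the job name, as in the `Job.getInstance(getConf(), "Nutch CrawlDbReader: process " + ...)` call shown in the review diff.

```java
/**
 * Hypothetical helper (not part of Nutch) that builds job names of the form
 * "Nutch ${ClassName}" or "Nutch ${ClassName}: ${additional info}".
 */
final class JobNames {

  /** Static, mandatory prefix that marks the job as a Nutch job. */
  private static final String PREFIX = "Nutch";

  private JobNames() {
    // utility class, no instances
  }

  /**
   * @param className      simple name of the class the job is defined in (mandatory)
   * @param additionalInfo optional detail such as "Inverter" or a crawldb path;
   *                       null or blank means no suffix is added
   */
  static String of(String className, String additionalInfo) {
    if (className == null || className.trim().isEmpty()) {
      throw new IllegalArgumentException("className is mandatory");
    }
    StringBuilder name = new StringBuilder(PREFIX).append(' ').append(className.trim());
    if (additionalInfo != null && !additionalInfo.trim().isEmpty()) {
      name.append(": ").append(additionalInfo.trim());
    }
    return name.toString();
  }
}
```

For example, `JobNames.of("LinkRank", "Inverter")` yields a name in the "Nutch LinkRank: Inverter" form from the issue's examples, and `JobNames.of("CrawlDb", null)` falls back to the mandatory parts only. Keeping the prefix and separator in one place is one way to keep names uniform for filtering in the YARN ResourceManager UI; the merged change itself passes literal strings to `Job.getInstance`.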