[jira] [Commented] (NUTCH-3014) Standardize Job names

2023-11-02 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17782403#comment-17782403
 ] 

Hudson commented on NUTCH-3014:
---

FAILURE: Integrated in Jenkins build Nutch » Nutch-trunk #136 (See 
[https://ci-builds.apache.org/job/Nutch/job/Nutch-trunk/136/])
NUTCH-3014 Standardize Job names (#789) (github: 
[https://github.com/apache/nutch/commit/bbf0867263ed1764c56fe7794c17942d0e8bf1c4])
* (edit) src/java/org/apache/nutch/crawl/LinkDb.java
* (edit) src/java/org/apache/nutch/crawl/Generator.java
* (edit) src/java/org/apache/nutch/crawl/CrawlDbMerger.java
* (edit) src/java/org/apache/nutch/crawl/Injector.java
* (edit) src/test/org/apache/nutch/plugin/TestPluginSystem.java
* (edit) src/java/org/apache/nutch/scoring/webgraph/LinkRank.java
* (edit) src/java/org/apache/nutch/scoring/webgraph/WebGraph.java
* (edit) src/java/org/apache/nutch/crawl/CrawlDbReader.java
* (edit) src/test/org/apache/nutch/crawl/TestCrawlDbFilter.java
* (edit) src/java/org/apache/nutch/indexer/CleaningJob.java
* (edit) src/java/org/apache/nutch/hostdb/ReadHostDb.java
* (edit) src/java/org/apache/nutch/scoring/webgraph/NodeDumper.java
* (edit) src/java/org/apache/nutch/segment/SegmentMerger.java
* (edit) src/java/org/apache/nutch/util/NutchJob.java
* (edit) src/java/org/apache/nutch/util/CrawlCompletionStats.java
* (edit) src/java/org/apache/nutch/util/SitemapProcessor.java
* (edit) src/java/org/apache/nutch/crawl/LinkDbReader.java
* (edit) src/java/org/apache/nutch/tools/warc/WARCExporter.java
* (edit) src/java/org/apache/nutch/crawl/LinkDbMerger.java
* (edit) src/java/org/apache/nutch/fetcher/Fetcher.java
* (edit) src/java/org/apache/nutch/util/ProtocolStatusStatistics.java
* (edit) src/java/org/apache/nutch/tools/arc/ArcSegmentCreator.java
* (edit) src/java/org/apache/nutch/hostdb/UpdateHostDb.java
* (edit) src/java/org/apache/nutch/tools/FreeGenerator.java
* (edit) src/java/org/apache/nutch/segment/SegmentReader.java
* (edit) src/java/org/apache/nutch/crawl/DeduplicationJob.java
* (edit) src/java/org/apache/nutch/scoring/webgraph/LinkDumper.java
* (edit) src/java/org/apache/nutch/crawl/CrawlDb.java
* (edit) src/java/org/apache/nutch/parse/ParseSegment.java
* (edit) src/java/org/apache/nutch/scoring/webgraph/ScoreUpdater.java
* (edit) src/java/org/apache/nutch/indexer/IndexingJob.java
* (edit) src/java/org/apache/nutch/util/domain/DomainStatistics.java


> Standardize Job names
> -
>
> Key: NUTCH-3014
> URL: https://issues.apache.org/jira/browse/NUTCH-3014
> Project: Nutch
>  Issue Type: Improvement
>  Components: configuration, runtime
>Affects Versions: 1.19
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Minor
> Fix For: 1.20
>
>
> There is a large degree of variability when we set the job name}}{}}}
>  
> {{Job job = NutchJob.getInstance(getConf());}}
> {{job.setJobName("read " + segment);}}
>  
> Some examples mention the job name, others don't. Some use upper case, others 
> don't, etc.
> I think we can standardize the NutchJob job names. This would help when 
> filtering jobs in YARN ResourceManager UI as well.
> I propose we implement the following convention
>  * *Nutch* (mandatory) - static value which prepends the job name, assists 
> with distinguishing the Job as a NutchJob and making it easily findable.
>  * *${ClassName}* (mandatory) - literally the name of the Class the job is 
> encoded in
>  * *${additional info}* (optional) - value could further distinguish the type 
> of job (LinkRank Counter, LinkRank Initializer, LinkRank Inverter, etc.)
> _{*}Nutch ${ClassName}{*}: *${additional info}*_
> _Examples:_
>  * _Nutch LinkRank: Inverter_
>  * _Nutch CrawlDb: + $crawldb_
>  * _Nutch LinkDbReader: + $linkdb_
> Thanks for any suggestions/comments.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (NUTCH-3014) Standardize Job names

2023-11-02 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17782392#comment-17782392
 ] 

ASF GitHub Bot commented on NUTCH-3014:
---

lewismc merged PR #789:
URL: https://github.com/apache/nutch/pull/789




> Standardize Job names
> -
>
> Key: NUTCH-3014
> URL: https://issues.apache.org/jira/browse/NUTCH-3014
> Project: Nutch
>  Issue Type: Improvement
>  Components: configuration, runtime
>Affects Versions: 1.19
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Minor
> Fix For: 1.20
>
>
> There is a large degree of variability when we set the job name}}{}}}
>  
> {{Job job = NutchJob.getInstance(getConf());}}
> {{job.setJobName("read " + segment);}}
>  
> Some examples mention the job name, others don't. Some use upper case, others 
> don't, etc.
> I think we can standardize the NutchJob job names. This would help when 
> filtering jobs in YARN ResourceManager UI as well.
> I propose we implement the following convention
>  * *Nutch* (mandatory) - static value which prepends the job name, assists 
> with distinguishing the Job as a NutchJob and making it easily findable.
>  * *${ClassName}* (mandatory) - literally the name of the Class the job is 
> encoded in
>  * *${additional info}* (optional) - value could further distinguish the type 
> of job (LinkRank Counter, LinkRank Initializer, LinkRank Inverter, etc.)
> _{*}Nutch ${ClassName}{*}: *${additional info}*_
> _Examples:_
>  * _Nutch LinkRank: Inverter_
>  * _Nutch CrawlDb: + $crawldb_
>  * _Nutch LinkDbReader: + $linkdb_
> Thanks for any suggestions/comments.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (NUTCH-3014) Standardize Job names

2023-11-02 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17782382#comment-17782382
 ] 

ASF GitHub Bot commented on NUTCH-3014:
---

lewismc commented on code in PR #789:
URL: https://github.com/apache/nutch/pull/789#discussion_r138646


##
src/java/org/apache/nutch/crawl/CrawlDbReader.java:
##
@@ -812,7 +811,7 @@ public CrawlDatum get(String crawlDb, String url, 
Configuration config)
 
   @Override
   protected int process(String line, StringBuilder output) throws Exception {
-Job job = NutchJob.getInstance(getConf());
+Job job = Job.getInstance(getConf(), "Nutch CrawlDbReader: process " + 
this.crawlDb);

Review Comment:
   Thanks @sebastian-nagel 





> Standardize Job names
> -
>
> Key: NUTCH-3014
> URL: https://issues.apache.org/jira/browse/NUTCH-3014
> Project: Nutch
>  Issue Type: Improvement
>  Components: configuration, runtime
>Affects Versions: 1.19
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Minor
> Fix For: 1.20
>
>
> There is a large degree of variability when we set the job name}}{}}}
>  
> {{Job job = NutchJob.getInstance(getConf());}}
> {{job.setJobName("read " + segment);}}
>  
> Some examples mention the job name, others don't. Some use upper case, others 
> don't, etc.
> I think we can standardize the NutchJob job names. This would help when 
> filtering jobs in YARN ResourceManager UI as well.
> I propose we implement the following convention
>  * *Nutch* (mandatory) - static value which prepends the job name, assists 
> with distinguishing the Job as a NutchJob and making it easily findable.
>  * *${ClassName}* (mandatory) - literally the name of the Class the job is 
> encoded in
>  * *${additional info}* (optional) - value could further distinguish the type 
> of job (LinkRank Counter, LinkRank Initializer, LinkRank Inverter, etc.)
> _{*}Nutch ${ClassName}{*}: *${additional info}*_
> _Examples:_
>  * _Nutch LinkRank: Inverter_
>  * _Nutch CrawlDb: + $crawldb_
>  * _Nutch LinkDbReader: + $linkdb_
> Thanks for any suggestions/comments.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (NUTCH-3014) Standardize Job names

2023-10-29 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17780720#comment-17780720
 ] 

ASF GitHub Bot commented on NUTCH-3014:
---

sebastian-nagel commented on code in PR #789:
URL: https://github.com/apache/nutch/pull/789#discussion_r1375421979


##
src/java/org/apache/nutch/crawl/CrawlDbReader.java:
##
@@ -812,7 +811,7 @@ public CrawlDatum get(String crawlDb, String url, 
Configuration config)
 
   @Override
   protected int process(String line, StringBuilder output) throws Exception {
-Job job = NutchJob.getInstance(getConf());
+Job job = Job.getInstance(getConf(), "Nutch CrawlDbReader: process " + 
this.crawlDb);

Review Comment:
   `this` isn't really required here.





> Standardize Job names
> -
>
> Key: NUTCH-3014
> URL: https://issues.apache.org/jira/browse/NUTCH-3014
> Project: Nutch
>  Issue Type: Improvement
>  Components: configuration, runtime
>Affects Versions: 1.19
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Minor
> Fix For: 1.20
>
>
> There is a large degree of variability when we set the job name}}{}}}
>  
> {{Job job = NutchJob.getInstance(getConf());}}
> {{job.setJobName("read " + segment);}}
>  
> Some examples mention the job name, others don't. Some use upper case, others 
> don't, etc.
> I think we can standardize the NutchJob job names. This would help when 
> filtering jobs in YARN ResourceManager UI as well.
> I propose we implement the following convention
>  * *Nutch* (mandatory) - static value which prepends the job name, assists 
> with distinguishing the Job as a NutchJob and making it easily findable.
>  * *${ClassName}* (mandatory) - literally the name of the Class the job is 
> encoded in
>  * *${additional info}* (optional) - value could further distinguish the type 
> of job (LinkRank Counter, LinkRank Initializer, LinkRank Inverter, etc.)
> _{*}Nutch ${ClassName}{*}: *${additional info}*_
> _Examples:_
>  * _Nutch LinkRank: Inverter_
>  * _Nutch CrawlDb: + $crawldb_
>  * _Nutch LinkDbReader: + $linkdb_
> Thanks for any suggestions/comments.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (NUTCH-3014) Standardize Job names

2023-10-22 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17778448#comment-17778448
 ] 

ASF GitHub Bot commented on NUTCH-3014:
---

lewismc opened a new pull request, #789:
URL: https://github.com/apache/nutch/pull/789

   Addresses https://issues.apache.org/jira/browse/NUTCH-3014
   
   First pass at giving the Nutch Job names a bit of a face-lift.
   
   Additional notes: 
   
   - I literally removed `NutchJob.getInstance()`... I didn't see much point in 
it. Maybe I am wrong.
   - TestFetcher is failing locally... looking into that now. Will let CI run 
and see if we can reproduce this test failure.




> Standardize Job names
> -
>
> Key: NUTCH-3014
> URL: https://issues.apache.org/jira/browse/NUTCH-3014
> Project: Nutch
>  Issue Type: Improvement
>  Components: configuration, runtime
>Affects Versions: 1.19
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Minor
> Fix For: 1.20
>
>
> There is a large degree of variability when we set the job name{{{}{}}}
>  
> {{Job job = NutchJob.getInstance(getConf());}}
> {{job.setJobName("read " + segment);}}
>  
> Some examples mention the job name, others don't. Some use upper case, others 
> don't, etc.
> I think we can standardize the NutchJob job names. This would help when 
> filtering jobs in YARN ResourceManager UI as well.
> I propose we implement the following convention
>  * *Nutch* (mandatory) - static value which prepends the job name, assists 
> with distinguishing the Job as a NutchJob and making it easily findable.
>  * *${ClassName}* (mandatory) - literally the name of the Class the job is 
> encoded in
>  * *${additional info}* (optional) - value could further distinguish the type 
> of job (LinkRank Counter, LinkRank Initializer, LinkRank Inverter, etc.)
> _*Nutch ${ClassName}* *${additional info}*_
> _Examples:_
>  * _Nutch LinkRank Inverter_
>  * _Nutch CrawlDb + $crawldb_
>  * _Nutch LinkDbReader + $linkdb_
> Thanks for any suggestions/comments.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)