[jira] [Commented] (NUTCH-2442) Injector to stop if job fails to avoid loss of CrawlDb

2017-11-06 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16241584#comment-16241584
 ] 

ASF GitHub Bot commented on NUTCH-2442:
---

sebastian-nagel commented on issue #239: NUTCH-2442 Injector to stop if job 
fails to avoid loss of CrawlDb
URL: https://github.com/apache/nutch/pull/239#issuecomment-342390895
 
 
   Yes, that's clear. But the initial challenge is to port the complex build 
system to Maven.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Injector to stop if job fails to avoid loss of CrawlDb
> --
>
> Key: NUTCH-2442
> URL: https://issues.apache.org/jira/browse/NUTCH-2442
> Project: Nutch
> Issue Type: Bug
> Components: injector
> Affects Versions: 1.13
> Reporter: Sebastian Nagel
> Priority: Critical
> Fix For: 1.14
>
>
> Injector does not check whether the MapReduce job is successful. Even if the 
> job fails, it
> - installs the CrawlDb
> -- moves current/ to old/
> -- replaces current/ with an empty or potentially incomplete version
> - exits with code 0 so that scripts running the crawl workflow cannot detect 
> the failure
> -- if Injector is run a second time the CrawlDb is lost (both 
> current/ and old/ are empty or corrupted)
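
The failure mode described in the issue can be reproduced with plain directories. The following is only a sketch under stated assumptions: the class CrawlDbInstallSketch, its methods, and the part-00000 file name are hypothetical stand-ins chosen for illustration, and java.nio.file is used instead of the Hadoop FileSystem API that Nutch actually works with.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical model of the current/ and old/ rotation (not Nutch code). */
class CrawlDbInstallSketch {

  /** Rotates the directories: current/ becomes old/, newDb becomes current/. */
  static void install(Path crawlDb, Path newDb) throws IOException {
    Path current = crawlDb.resolve("current");
    Path old = crawlDb.resolve("old");
    deleteRecursively(old);
    if (Files.exists(current)) {
      Files.move(current, old);
    }
    Files.move(newDb, current);
  }

  /** Deletes a directory tree, children before parents. */
  static void deleteRecursively(Path dir) throws IOException {
    if (!Files.exists(dir)) {
      return;
    }
    List<Path> paths;
    try (Stream<Path> walk = Files.walk(dir)) {
      paths = walk.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
    }
    for (Path p : paths) {
      Files.delete(p);
    }
  }

  /**
   * Simulates one inject run. A failed job leaves an empty output directory.
   * With checkSuccess set, a failed job is cleaned up and reported instead of
   * being installed.
   */
  static void inject(Path crawlDb, boolean jobSucceeds, boolean checkSuccess)
      throws IOException {
    Path temp = Files.createTempDirectory(crawlDb, "crawldb-temp-");
    if (jobSucceeds) {
      Files.write(temp.resolve("part-00000"),
          "records".getBytes(StandardCharsets.UTF_8));
    }
    if (checkSuccess && !jobSucceeds) {
      deleteRecursively(temp);
      throw new IllegalStateException("Injector job did not succeed");
    }
    install(crawlDb, temp);
  }

  /** True if the given generation ("current" or "old") still holds data. */
  static boolean hasData(Path crawlDb, String generation) {
    return Files.exists(crawlDb.resolve(generation).resolve("part-00000"));
  }

  public static void main(String[] args) throws IOException {
    Path db = Files.createTempDirectory("crawldb-demo-");
    inject(db, true, false);  // healthy first run
    inject(db, false, false); // failed job, unchecked: data survives only in old/
    inject(db, false, false); // second failed run: old/ is replaced as well
    System.out.println("current has data: " + hasData(db, "current"));
    System.out.println("old has data: " + hasData(db, "old"));
  }
}
```

Passing true as the last argument of inject models the fix: the failed output is removed and an exception is raised before install is ever called, so current/ keeps its data.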



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (NUTCH-2442) Injector to stop if job fails to avoid loss of CrawlDb

2017-11-06 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16241419#comment-16241419
 ] 

ASF GitHub Bot commented on NUTCH-2442:
---

kpm1985 commented on issue #239: NUTCH-2442 Injector to stop if job fails to 
avoid loss of CrawlDb
URL: https://github.com/apache/nutch/pull/239#issuecomment-342363973
 
 
   This is somewhat of a tangent, but if we mavenize the build, we could use
   some nice Maven plugins. I saw a few emails on this topic this week.
   
   I looked into using nailgun and some other sweet things to ensure
   consistency on a per-commit basis, i.e. the Maven import ordering plugin (in
   use under ASFLv2 in Fluo) for Log4J2.
   
   Obviously, having each developer care enough to learn to import the style
   into Eclipse is good, a standard imo, but in the future we can automate
   these styles. This is fantastic for formatting older class files etc.
   
   Also, I have been lagging because I have lots of work, but I am serious
   about fixing old PR (fyi for Sebastian).
   
   Afaik Ant is less preferred than Maven.
   
   On Nov 6, 2017 1:40 PM, "Sebastian Nagel"  wrote:
   
   > Merged #239 .
   




[jira] [Commented] (NUTCH-2442) Injector to stop if job fails to avoid loss of CrawlDb

2017-11-06 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16240970#comment-16240970
 ] 

Hudson commented on NUTCH-2442:
---

SUCCESS: Integrated in Jenkins build Nutch-trunk #3467 (See 
[https://builds.apache.org/job/Nutch-trunk/3467/])
NUTCH-2442 Injector to stop if job fails to avoid loss of CrawlDb 
(omkarreddy2008: 
[https://github.com/apache/nutch/commit/2352f9a4f47693cd8ca653f0b0629d186593fc4a])
* (edit) src/java/org/apache/nutch/util/domain/DomainStatistics.java
* (edit) src/java/org/apache/nutch/crawl/Injector.java
* (edit) src/java/org/apache/nutch/util/CrawlCompletionStats.java
* (edit) src/java/org/apache/nutch/util/ProtocolStatusStatistics.java
* (edit) src/java/org/apache/nutch/util/SitemapProcessor.java
* (edit) src/java/org/apache/nutch/hostdb/ReadHostDb.java




[jira] [Commented] (NUTCH-2442) Injector to stop if job fails to avoid loss of CrawlDb

2017-11-06 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16240954#comment-16240954
 ] 

ASF GitHub Bot commented on NUTCH-2442:
---

sebastian-nagel closed pull request #239: NUTCH-2442 Injector to stop if job 
fails to avoid loss of CrawlDb
URL: https://github.com/apache/nutch/pull/239
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/src/java/org/apache/nutch/crawl/Injector.java b/src/java/org/apache/nutch/crawl/Injector.java
index 5f5fd15ff..d872dbff5 100644
--- a/src/java/org/apache/nutch/crawl/Injector.java
+++ b/src/java/org/apache/nutch/crawl/Injector.java
@@ -414,7 +414,16 @@ public void inject(Path crawlDb, Path urlDir, boolean overwrite,
 
     try {
       // run the job
-      job.waitForCompletion(true);
+      boolean success = job.waitForCompletion(true);
+      if (!success) {
+        String message = "Injector job did not succeed, job status: "
+            + job.getStatus().getState() + ", reason: "
+            + job.getStatus().getFailureInfo();
+        LOG.error(message);
+        cleanupAfterFailure(tempCrawlDb, lock, fs);
+        // throw exception so that calling routine can exit with error
+        throw new RuntimeException(message);
+      }
 
       // save output and perform cleanup
       CrawlDb.install(job, crawlDb);
@@ -452,11 +461,21 @@ public void inject(Path crawlDb, Path urlDir, boolean overwrite,
         LOG.info("Injector: finished at " + sdf.format(end) + ", elapsed: "
             + TimingUtil.elapsedTime(start, end));
       }
-    } catch (IOException e) {
+    } catch (IOException | InterruptedException | ClassNotFoundException e) {
+      LOG.error("Injector job failed", e);
+      cleanupAfterFailure(tempCrawlDb, lock, fs);
+      throw e;
+    }
+  }
+
+  public void cleanupAfterFailure(Path tempCrawlDb, Path lock, FileSystem fs)
+      throws IOException {
+    try {
       if (fs.exists(tempCrawlDb)) {
         fs.delete(tempCrawlDb, true);
       }
-      LockUtil.removeLockFile(conf, lock);
+      LockUtil.removeLockFile(fs, lock);
+    } catch (IOException e) {
       throw e;
     }
   }
diff --git a/src/java/org/apache/nutch/hostdb/ReadHostDb.java b/src/java/org/apache/nutch/hostdb/ReadHostDb.java
index 28a7eb709..eac3bf645 100644
--- a/src/java/org/apache/nutch/hostdb/ReadHostDb.java
+++ b/src/java/org/apache/nutch/hostdb/ReadHostDb.java
@@ -202,8 +202,17 @@ private void readHostDb(Path hostDb, Path output, boolean dumpHomepages, boolean
     job.setNumReduceTasks(0);
 
     try {
-      job.waitForCompletion(true);
-    } catch (Exception e) {
+      boolean success = job.waitForCompletion(true);
+      if (!success) {
+        String message = "ReadHostDb job did not succeed, job status: "
+            + job.getStatus().getState() + ", reason: "
+            + job.getStatus().getFailureInfo();
+        LOG.error(message);
+        // throw exception so that calling routine can exit with error
+        throw new RuntimeException(message);
+      }
+    } catch (IOException | InterruptedException | ClassNotFoundException e) {
+      LOG.error("ReadHostDb job failed", e);
       throw e;
     }
 
diff --git a/src/java/org/apache/nutch/util/CrawlCompletionStats.java b/src/java/org/apache/nutch/util/CrawlCompletionStats.java
index 4920fbf32..116c3113d 100644
--- a/src/java/org/apache/nutch/util/CrawlCompletionStats.java
+++ b/src/java/org/apache/nutch/util/CrawlCompletionStats.java
@@ -171,8 +171,17 @@ public int run(String[] args) throws Exception {
     job.setNumReduceTasks(numOfReducers);
 
     try {
-      job.waitForCompletion(true);
-    } catch (Exception e) {
+      boolean success = job.waitForCompletion(true);
+      if (!success) {
+        String message = jobName + " job did not succeed, job status: "
+            + job.getStatus().getState() + ", reason: "
+            + job.getStatus().getFailureInfo();
+        LOG.error(message);
+        // throw exception so that calling routine can exit with error
+        throw new RuntimeException(message);
+      }
+    } catch (IOException | InterruptedException | ClassNotFoundException e) {
+      LOG.error(jobName + " job failed");
       throw e;
     }
 
diff --git a/src/java/org/apache/nutch/util/ProtocolStatusStatistics.java b/src/java/org/apache/nutch/util/ProtocolStatusStatistics.java
index a18860634..7e241ffdf 100644
--- a/src/java/org/apache/nutch/util/ProtocolStatusStatistics.java
+++ b/src/java/org/apache/nutch/util/ProtocolStatusStatistics.java
@@ -122,8 +122,17 @@ public int run(String[] args) throws Exception {
     job.setNumReduceTasks(numOfReducers);
 
     try {
-      job.waitForCompletion(true);
-    } catch (Exception e) {

[jira] [Commented] (NUTCH-2442) Injector to stop if job fails to avoid loss of CrawlDb

2017-11-06 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16240650#comment-16240650
 ] 

ASF GitHub Bot commented on NUTCH-2442:
---

Omkar20895 commented on a change in pull request #239: NUTCH-2442 Injector to 
stop if job fails to avoid loss of CrawlDb
URL: https://github.com/apache/nutch/pull/239#discussion_r149160152
 
 

 ##
 File path: src/java/org/apache/nutch/util/ProtocolStatusStatistics.java
 ##
 @@ -122,8 +122,17 @@ public int run(String[] args) throws Exception {
 job.setNumReduceTasks(numOfReducers);
 
 try {
-  job.waitForCompletion(true);
-} catch (Exception e) {
+  boolean success = job.waitForCompletion(true);
+  if(!success){
 
 Review comment:
   Apologies for such a silly query. I use the Vim editor and everything looked 
good to me, but after importing the project into Eclipse I could see the 
differences in the formatting. Thanks.




[jira] [Commented] (NUTCH-2442) Injector to stop if job fails to avoid loss of CrawlDb

2017-11-06 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16240036#comment-16240036
 ] 

ASF GitHub Bot commented on NUTCH-2442:
---

sebastian-nagel commented on a change in pull request #239: NUTCH-2442 Injector 
to stop if job fails to avoid loss of CrawlDb
URL: https://github.com/apache/nutch/pull/239#discussion_r149020659
 
 

 ##
 File path: src/java/org/apache/nutch/util/ProtocolStatusStatistics.java
 ##
 @@ -122,8 +122,17 @@ public int run(String[] args) throws Exception {
 job.setNumReduceTasks(numOfReducers);
 
 try {
-  job.waitForCompletion(true);
-} catch (Exception e) {
+  boolean success = job.waitForCompletion(true);
+  if(!success){
 
 Review comment:
   There is a missing space before opening brackets/braces (also in other 
occurrences). The formatting should follow the style defined by the Eclipse 
Code Formatter rules 
(https://github.com/apache/nutch/blob/master/eclipse-codeformat.xml). The 
easiest way is to import the rules into Eclipse (or another IDE); if in doubt, 
Eclipse can also format the code from the command line.




[jira] [Commented] (NUTCH-2442) Injector to stop if job fails to avoid loss of CrawlDb

2017-11-05 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16239943#comment-16239943
 ] 

ASF GitHub Bot commented on NUTCH-2442:
---

Omkar20895 commented on a change in pull request #239: NUTCH-2442 Injector to 
stop if job fails to avoid loss of CrawlDb
URL: https://github.com/apache/nutch/pull/239#discussion_r148997590
 
 

 ##
 File path: src/java/org/apache/nutch/util/ProtocolStatusStatistics.java
 ##
 @@ -122,8 +122,17 @@ public int run(String[] args) throws Exception {
 job.setNumReduceTasks(numOfReducers);
 
 try {
-  job.waitForCompletion(true);
-} catch (Exception e) {
+  boolean success = job.waitForCompletion(true);
+  if(!success){
 
 Review comment:
   @sebastian-nagel I did not understand; the formatting looks good to me. Can 
you please elaborate on what I am missing here? Thanks.




[jira] [Commented] (NUTCH-2442) Injector to stop if job fails to avoid loss of CrawlDb

2017-11-04 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16239222#comment-16239222
 ] 

ASF GitHub Bot commented on NUTCH-2442:
---

sebastian-nagel commented on a change in pull request #239: NUTCH-2442 Injector 
to stop if job fails to avoid loss of CrawlDb
URL: https://github.com/apache/nutch/pull/239#discussion_r148940951
 
 

 ##
 File path: src/java/org/apache/nutch/util/ProtocolStatusStatistics.java
 ##
 @@ -122,8 +122,17 @@ public int run(String[] args) throws Exception {
 job.setNumReduceTasks(numOfReducers);
 
 try {
-  job.waitForCompletion(true);
-} catch (Exception e) {
+  boolean success = job.waitForCompletion(true);
+  if(!success){
 
 Review comment:
   Please use consistent and uniform formatting.




[jira] [Commented] (NUTCH-2442) Injector to stop if job fails to avoid loss of CrawlDb

2017-11-04 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16239121#comment-16239121
 ] 

ASF GitHub Bot commented on NUTCH-2442:
---

Omkar20895 opened a new pull request #239: NUTCH-2442 Injector to stop if job 
fails to avoid loss of CrawlDb
URL: https://github.com/apache/nutch/pull/239
 
 
   - Added Job status checks in the classes: Injector, ReadHostDb, 
CrawlCompletionStats, ProtocolStatusStatistics, SitemapProcessor and 
DomainStatistics. 




[jira] [Commented] (NUTCH-2442) Injector to stop if job fails to avoid loss of CrawlDb

2017-11-03 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16238510#comment-16238510
 ] 

Sebastian Nagel commented on NUTCH-2442:


Yes, that's it essentially, plus making sure that all exceptions are caught. 
I've also done some work on it and fixed Injector 
([2406c3e|https://github.com/sebastian-nagel/nutch/commit/2406c3e0b7605830321b3a30f81f42a1d197b80a]),
 but didn't continue with the other classes. Feel free to take over. But I 
would prefer to fix this separately from NUTCH-2375: it's better to pick the 
lower-hanging fruit first. Thanks!
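
For the exit-code half of the problem, the idea is that an exception raised by a failed job should reach the command-line entry point and turn into a non-zero status. The following is a minimal sketch of that idea only: ExitCodeSketch and its run method are hypothetical names, and this is not the actual Nutch or Hadoop tool code.

```java
import java.util.concurrent.Callable;

/** Hypothetical entry point that maps job failures to a non-zero exit code. */
class ExitCodeSketch {

  /** Returns 0 if the work completes and -1 if it throws any Exception. */
  static int run(Callable<Void> work) {
    try {
      work.call();
      return 0;
    } catch (Exception e) {
      System.err.println("Job failed: " + e);
      return -1;
    }
  }

  public static void main(String[] args) {
    // The lambda stands in for an inject call whose job did not succeed.
    int exitCode = run(() -> {
      throw new RuntimeException("Injector job did not succeed");
    });
    // A non-zero status lets a calling shell script stop the crawl workflow.
    System.exit(exitCode);
  }
}
```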



[jira] [Commented] (NUTCH-2442) Injector to stop if job fails to avoid loss of CrawlDb

2017-11-03 Thread Omkar Reddy (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16238278#comment-16238278
 ] 

Omkar Reddy commented on NUTCH-2442:


[~wastl-nagel] I am working on this on my local branch of NUTCH-2375. Just so 
that I do not head in the wrong direction, I was thinking the fix should look 
like this:

try {
  boolean complete = job.waitForCompletion(true);
  if (!complete) {
    // cleanup statements to revert any significant changes that happened
    // during or before the job
    throw new Exception(" FAILED.");
  }
} catch (Exception e) {
  throw e;
}

Please let me know if I need to add anything else or if there is any 
discrepancy in what I am doing above. Thanks.
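
The proposed pattern can be tried out without a Hadoop dependency. In the sketch below, SimpleJob is a hypothetical stand-in for Hadoop's Job (only waitForCompletion is modelled), and runChecked and the cleanup callback are illustrative names rather than Nutch code. It assumes cleanup should run both when the job reports failure and when waiting for it throws.

```java
import java.io.IOException;

/** Hypothetical stand-in for Hadoop's Job; only the call used here is modelled. */
interface SimpleJob {
  boolean waitForCompletion(boolean verbose)
      throws IOException, InterruptedException, ClassNotFoundException;
}

class JobCheckSketch {

  /**
   * Runs the job and fails loudly if it did not succeed. The cleanup callback
   * runs before any exception leaves this method, so temporary output and
   * locks can be removed by the caller-supplied code.
   */
  static void runChecked(String jobName, SimpleJob job, Runnable cleanup)
      throws IOException, InterruptedException, ClassNotFoundException {
    boolean success;
    try {
      success = job.waitForCompletion(true);
    } catch (IOException | InterruptedException | ClassNotFoundException e) {
      cleanup.run();
      throw e;
    }
    if (!success) {
      cleanup.run();
      // throw so that the calling routine can exit with an error
      throw new RuntimeException(jobName + " job did not succeed");
    }
  }

  public static void main(String[] args) throws Exception {
    runChecked("Demo", verbose -> true, () -> { });
    System.out.println("job succeeded");
    try {
      runChecked("Injector", verbose -> false,
          () -> System.out.println("cleaning up"));
    } catch (RuntimeException e) {
      System.out.println("caught: " + e.getMessage());
    }
  }
}
```

Throwing an unchecked exception on failure keeps the method signature limited to the checked exceptions that waitForCompletion itself declares.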




[jira] [Commented] (NUTCH-2442) Injector to stop if job fails to avoid loss of CrawlDb

2017-10-17 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16207219#comment-16207219
 ] 

Sebastian Nagel commented on NUTCH-2442:


Actually, it's a couple of jobs based on the new MapReduce API which do not 
check the return value of {{job.waitForCompletion(true)}} or call 
{{job.isSuccessful()}}. Cf. also the discussion in 
[NUTCH-2375|https://issues.apache.org/jira/browse/NUTCH-2375?focusedCommentId=16184721=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16184721].
 I'm working on a fix for the existing jobs (Injector, SitemapProcessor, 
ReadHostDb, and 3 classes in o.a.nutch.util).
