[jira] [Commented] (NUTCH-2442) Injector to stop if job fails to avoid loss of CrawlDb

2017-11-06 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16241584#comment-16241584
 ] 

ASF GitHub Bot commented on NUTCH-2442:
---

sebastian-nagel commented on issue #239: NUTCH-2442 Injector to stop if job 
fails to avoid loss of CrawlDb
URL: https://github.com/apache/nutch/pull/239#issuecomment-342390895
 
 
   Yes, that's clear. But the initial challenge is to port the complex build 
system to Maven.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Injector to stop if job fails to avoid loss of CrawlDb
> --
>
> Key: NUTCH-2442
> URL: https://issues.apache.org/jira/browse/NUTCH-2442
> Project: Nutch
>  Issue Type: Bug
>  Components: injector
>Affects Versions: 1.13
>Reporter: Sebastian Nagel
>Priority: Critical
> Fix For: 1.14
>
>
> Injector does not check whether the MapReduce job is successful. Even if the 
> job fails
> - installs the CrawlDb
> -- move current/ to old/
> -- replace current/ with an empty or potentially incomplete version
> - exits with code 0 so that scripts running the crawl workflow cannot detect 
> the failure -- if Injector is run a second time the CrawlDb is lost (both 
> current/ and old/ are empty or corrupted)
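To make the failure mode above concrete, here is a minimal in-memory sketch in Java. It is not Nutch code: the class CrawlDbInstallSketch and its methods are hypothetical stand-ins that only mimic the "move current/ to old/, then replace current/" sequence from the description. An empty string stands for the output of a failed job.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory model of a CrawlDb directory; not Nutch code.
class CrawlDbInstallSketch {
  final Map<String, String> dirs = new HashMap<>();

  CrawlDbInstallSketch(String currentContent) {
    dirs.put("current", currentContent);
  }

  // Mimics the install sequence from the issue description:
  // current/ is moved to old/, then the job output becomes current/.
  void install(String jobOutput) {
    dirs.put("old", dirs.get("current"));
    dirs.put("current", jobOutput);
  }

  // Guarded variant: a failed job leaves current/ and old/ untouched
  // and raises an exception so the caller can exit with an error.
  void installIfSuccessful(boolean jobSucceeded, String jobOutput) {
    if (!jobSucceeded) {
      throw new RuntimeException("Injector job did not succeed");
    }
    install(jobOutput);
  }

  boolean contains(String content) {
    return dirs.containsValue(content);
  }

  public static void main(String[] args) {
    CrawlDbInstallSketch db = new CrawlDbInstallSketch("good data");
    db.install(""); // first failed run: good data survives in old/
    db.install(""); // second failed run: good data is gone
    System.out.println("good data still present: " + db.contains("good data"));
  }
}
```

After one unchecked failure the good data survives only in old/. A second one overwrites it, which is the loss the issue describes. The guarded variant refuses to install anything when the job reports failure.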



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (NUTCH-2442) Injector to stop if job fails to avoid loss of CrawlDb

2017-11-06 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16241419#comment-16241419
 ] 

ASF GitHub Bot commented on NUTCH-2442:
---

kpm1985 commented on issue #239: NUTCH-2442 Injector to stop if job fails to 
avoid loss of CrawlDb
URL: https://github.com/apache/nutch/pull/239#issuecomment-342363973
 
 
   This is somewhat of a tangent, but if we mavenize the build, we could use
   some nice Maven plugins. I saw a few emails on this topic this week.
   
   I looked into using nailgun and some other tools to ensure consistency on
   a per-commit basis, i.e. the Maven import-ordering plugin (in use under
   ASFLv2 in Fluo) for Log4J2.
   
   Obviously, having each developer care enough to import the code style
   into Eclipse is good, and a standard in my opinion, but in the future we
   can automate these styles. This is fantastic for formatting older class
   files, etc.
   
   Also, I have been lagging because of lots of work, but I am serious about
   fixing the old PR (FYI for Sebastian).
   
   As far as I know, Ant is less preferred than Maven.
   
   On Nov 6, 2017 1:40 PM, "Sebastian Nagel" wrote:
   
   > Merged #239 .
   




> Injector to stop if job fails to avoid loss of CrawlDb
> --
>
> Key: NUTCH-2442
> URL: https://issues.apache.org/jira/browse/NUTCH-2442
> Project: Nutch
>  Issue Type: Bug
>  Components: injector
>Affects Versions: 1.13
>Reporter: Sebastian Nagel
>Priority: Critical
> Fix For: 1.14
>
>
> Injector does not check whether the MapReduce job is successful. Even if the 
> job fails
> - installs the CrawlDb
> -- move current/ to old/
> -- replace current/ with an empty or potentially incomplete version
> - exits with code 0 so that scripts running the crawl workflow cannot detect 
> the failure -- if Injector is run a second time the CrawlDb is lost (both 
> current/ and old/ are empty or corrupted)





[jira] [Comment Edited] (NUTCH-2040) Upgrade to recent version of Crawler-Commons

2017-11-06 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16241007#comment-16241007
 ] 

Sebastian Nagel edited comment on NUTCH-2040 at 11/6/17 10:25 PM:
--

Done for 1.x: the upgrade to 0.6 was done as part of NUTCH-1486, and NUTCH-1465 
upgraded to 0.8.

2.x is still on 0.5


was (Author: wastl-nagel):
Done for 1.x: upgrade to 0.6 done as part of NUTCH-1486, NUTCH-1465 upgraded to 
0.8

> Upgrade to recent version of Crawler-Commons
> 
>
> Key: NUTCH-2040
> URL: https://issues.apache.org/jira/browse/NUTCH-2040
> Project: Nutch
>  Issue Type: Improvement
>  Components: protocol
>Affects Versions: 2.3.1
>Reporter: Lewis John McGibbney
>
> [Crawler Commons 
> 0.6|https://github.com/crawler-commons/crawler-commons#11th-june-2015---crawler-commons-06-is-released]
>  was released. We should upgrade.





[jira] [Updated] (NUTCH-2040) Upgrade to recent version of Crawler-Commons

2017-11-06 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2040:
---
Summary: Upgrade to recent version of Crawler-Commons  (was: Upgrade to 
Crawler Commons 0.6)

> Upgrade to recent version of Crawler-Commons
> 
>
> Key: NUTCH-2040
> URL: https://issues.apache.org/jira/browse/NUTCH-2040
> Project: Nutch
>  Issue Type: Improvement
>  Components: protocol
>Affects Versions: 2.3.1
>Reporter: Lewis John McGibbney
>
> [Crawler Commons 
> 0.6|https://github.com/crawler-commons/crawler-commons#11th-june-2015---crawler-commons-06-is-released]
>  was released. We should upgrade.





[jira] [Commented] (NUTCH-2040) Upgrade to Crawler Commons 0.6

2017-11-06 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16241007#comment-16241007
 ] 

Sebastian Nagel commented on NUTCH-2040:


Done for 1.x: upgrade to 0.6 done as part of NUTCH-1486, NUTCH-1465 upgraded to 
0.8

> Upgrade to Crawler Commons 0.6
> --
>
> Key: NUTCH-2040
> URL: https://issues.apache.org/jira/browse/NUTCH-2040
> Project: Nutch
>  Issue Type: Improvement
>  Components: protocol
>Affects Versions: 2.3.1
>Reporter: Lewis John McGibbney
>
> [Crawler Commons 
> 0.6|https://github.com/crawler-commons/crawler-commons#11th-june-2015---crawler-commons-06-is-released]
>  was released. We should upgrade.





[jira] [Updated] (NUTCH-2040) Upgrade to Crawler Commons 0.6

2017-11-06 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2040:
---
Affects Version/s: 2.3.1

> Upgrade to Crawler Commons 0.6
> --
>
> Key: NUTCH-2040
> URL: https://issues.apache.org/jira/browse/NUTCH-2040
> Project: Nutch
>  Issue Type: Improvement
>  Components: protocol
>Affects Versions: 2.3.1
>Reporter: Lewis John McGibbney
>
> [Crawler Commons 
> 0.6|https://github.com/crawler-commons/crawler-commons#11th-june-2015---crawler-commons-06-is-released]
>  was released. We should upgrade.





[jira] [Commented] (NUTCH-2076) exceptions are not handled when using method waitForCompletion in a try block

2017-11-06 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2076?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16241000#comment-16241000
 ] 

Sebastian Nagel commented on NUTCH-2076:


Solution for NUTCH-2442 is applicable.

> exceptions are not handled when using method waitForCompletion in a try block
> -
>
> Key: NUTCH-2076
> URL: https://issues.apache.org/jira/browse/NUTCH-2076
> Project: Nutch
>  Issue Type: Bug
>  Components: crawldb
>Affects Versions: 2.2, 2.3
>Reporter: songwanging
>Priority: Minor
>
> Location: src\java\org\apache\nutch\crawl\WebTableReader.java
> When waitForCompletion is called in a try block without a catch block, the 
> exceptions it can throw are not handled. waitForCompletion may throw 
> IOException, InterruptedException, or ClassNotFoundException, so a call in a 
> try block should be paired with a catch block that handles them.
> public Map run(Map args) throws Exception {
>   ...
>   try {
>     currentJob.waitForCompletion(true);
>   } finally {
>     ToolUtil.recordJobStatus(null, currentJob, results);
>     if (!currentJob.isSuccessful()) {
>       fileSystem.delete(tmpFolder, true);
>       return results;
>     }
>   }
>   ...
> }
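A minimal sketch of the handling asked for above, assuming nothing about WebTableReader itself. JobLike is a hypothetical stand-in interface that only copies the three checked exceptions listed in the report. runJob and its cleanup callback are illustrative names, not Nutch or Hadoop API.

```java
import java.io.IOException;

// Hypothetical stand-in for the job object; only the exception signature
// mirrors what the report lists for waitForCompletion.
interface JobLike {
  boolean waitForCompletion(boolean verbose)
      throws IOException, InterruptedException, ClassNotFoundException;
}

class WaitForCompletionSketch {
  // Returns true on success. The cleanup callback runs when the job reports
  // failure and when waiting for it throws.
  static boolean runJob(JobLike job, Runnable cleanup) {
    try {
      boolean success = job.waitForCompletion(true);
      if (!success) {
        cleanup.run();
      }
      return success;
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt(); // preserve the interrupt flag
      cleanup.run();
      throw new IllegalStateException("job interrupted", e);
    } catch (IOException | ClassNotFoundException e) {
      cleanup.run();
      throw new IllegalStateException("job failed", e);
    }
  }

  public static void main(String[] args) {
    boolean ok = runJob(verbose -> true, () -> System.out.println("cleanup"));
    System.out.println("job succeeded: " + ok);
  }
}
```

The sketch wraps the checked exceptions in an unchecked one only to stay short. Rethrowing the original exception, as the NUTCH-2442 patch does, is an equally valid choice.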





[jira] [Resolved] (NUTCH-2298) TestCrawlDbStates.testCrawlDbStatTransitionInject broken

2017-11-06 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-2298.

   Resolution: Cannot Reproduce
Fix Version/s: (was: 1.14)

Closing for now. Please reopen if the problem appears again. Thanks!

> TestCrawlDbStates.testCrawlDbStatTransitionInject broken
> 
>
> Key: NUTCH-2298
> URL: https://issues.apache.org/jira/browse/NUTCH-2298
> Project: Nutch
>  Issue Type: Bug
>  Components: test
>Affects Versions: 1.12
>Reporter: Markus Jelsma
>
> No idea what's happening on master:
> {code}Testcase: testCrawlDbStatTransitionInject took 0.105 sec
> Caused an ERROR
> org/mockito/stubbing/Answer
> java.lang.NoClassDefFoundError: org/mockito/stubbing/Answer
> at org.apache.hadoop.mrunit.mapreduce.ReduceDriver.getContextWrapper(ReduceDriver.java:281)
> at org.apache.hadoop.mrunit.mapreduce.ReduceDriver.run(ReduceDriver.java:257)
> at org.apache.nutch.crawl.CrawlDbUpdateTestDriver.update(CrawlDbUpdateTestDriver.java:98)
> at org.apache.nutch.crawl.TestCrawlDbStates.testCrawlDbStatTransitionInject(TestCrawlDbStates.java:233)
> Caused by: java.lang.ClassNotFoundException: org.mockito.stubbing.Answer
> at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
> {code}





[jira] [Updated] (NUTCH-2422) Update information about git repository

2017-11-06 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2422:
---
Component/s: website

> Update information about git repository
> ---
>
> Key: NUTCH-2422
> URL: https://issues.apache.org/jira/browse/NUTCH-2422
> Project: Nutch
>  Issue Type: Task
>  Components: website
>Reporter: Karl Richter
>
> https://github.com/apache/nutch claims to be a mirror of 
> git://git.apache.org/nutch.git, but cloning fails due to `fatal: remote 
> error: access denied or repository not exported: /nutch.git` and nutch isn't 
> listed on https://git.apache.org/. That status isn't consistent.





[jira] [Resolved] (NUTCH-2422) Update information about git repository

2017-11-06 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-2422.

Resolution: Fixed

Misleading link is removed. Thanks!

> Update information about git repository
> ---
>
> Key: NUTCH-2422
> URL: https://issues.apache.org/jira/browse/NUTCH-2422
> Project: Nutch
>  Issue Type: Task
>Reporter: Karl Richter
>
> https://github.com/apache/nutch claims to be a mirror of 
> git://git.apache.org/nutch.git, but cloning fails due to `fatal: remote 
> error: access denied or repository not exported: /nutch.git` and nutch isn't 
> listed on https://git.apache.org/. That status isn't consistent.





[jira] [Created] (NUTCH-2457) Embedded documents likely not correctly parsed by Tika

2017-11-06 Thread Tim Allison (JIRA)
Tim Allison created NUTCH-2457:
--

 Summary: Embedded documents likely not correctly parsed by Tika
 Key: NUTCH-2457
 URL: https://issues.apache.org/jira/browse/NUTCH-2457
 Project: Nutch
  Issue Type: Bug
Reporter: Tim Allison


While working on TIKA-2490, I think I found that Nutch's current method of 
requesting a mime-specific parser for each file will fail to parse embedded 
files, e.g. 
https://github.com/apache/tika/blob/master/tika-server/src/test/resources/test_recursive_embedded.docx

The fix should be straightforward, and I'll submit a PR once I can get Nutch up 
and running in my dev environment. 





RE: quick start for dev Nutch in Intellij?

2017-11-06 Thread Allison, Timothy B.
Duh, of course.  That did it.  Thank you!

Are there any standard/easy ways of dealing with Hadoop on Windows? I'm getting:

ERROR util.Shell (Shell.java:getWinUtilsPath(374)) - Failed to locate the winutils binary in the hadoop binary path
java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.

I saw this: 
https://stackoverflow.com/questions/35652665/java-io-ioexception-could-not-locate-executable-null-bin-winutils-exe-in-the-ha
and a few others, but I haven’t quite gotten their recommendations to work 
with Nutch.


From: Jorge Betancourt [mailto:betancourt.jo...@gmail.com]
Sent: Monday, November 6, 2017 3:39 PM
To: dev@nutch.apache.org
Subject: Re: quick start for dev Nutch in Intellij?

Honestly what I usually do is just:

$ ant eclipse

This will create an Eclipse project that I import directly into IntelliJ. I also 
install/use IvyIDEA and resolve all the dependencies within IntelliJ for all 
modules. With this setup I can run unit tests from within the IDE, although I 
usually run the full test suite from the terminal (ant test), which also works 
inside the plugins directory.

For the IvyIDEA error, check that you have configured the Ivy settings file to 
point to the ivy/ivy.xml file in the Nutch subdirectory.

https://cl.ly/250r1l3g2e2S

Best Regards,
Jorge

On Nov 6, 2017, 6:30 PM +0100, Allison, Timothy B. <talli...@mitre.org> wrote:

All,
Apologies for the newb-to-nutch question, but is there a quick start guide for 
developing Nutch with Intellij?

There has to be a better way than relying on the build directory: 
https://stackoverflow.com/questions/15357462/how-to-open-an-ant-project-nutch-source-at-intellij-idea

I've tried Project Structure->Modules->IvyIDEA and pointing to ivy/ivy.xml, but 
I'm getting an "Error parsing settings file...if you use properties, configure 
those first", so I tried loading default.properties first with no luck.


Again, apologies, and thank you!

Cheers,

Tim


[jira] [Commented] (NUTCH-2442) Injector to stop if job fails to avoid loss of CrawlDb

2017-11-06 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16240970#comment-16240970
 ] 

Hudson commented on NUTCH-2442:
---

SUCCESS: Integrated in Jenkins build Nutch-trunk #3467 (See 
[https://builds.apache.org/job/Nutch-trunk/3467/])
NUTCH-2442 Injector to stop if job fails to avoid loss of CrawlDb 
(omkarreddy2008: 
[https://github.com/apache/nutch/commit/2352f9a4f47693cd8ca653f0b0629d186593fc4a])
* (edit) src/java/org/apache/nutch/util/domain/DomainStatistics.java
* (edit) src/java/org/apache/nutch/crawl/Injector.java
* (edit) src/java/org/apache/nutch/util/CrawlCompletionStats.java
* (edit) src/java/org/apache/nutch/util/ProtocolStatusStatistics.java
* (edit) src/java/org/apache/nutch/util/SitemapProcessor.java
* (edit) src/java/org/apache/nutch/hostdb/ReadHostDb.java


> Injector to stop if job fails to avoid loss of CrawlDb
> --
>
> Key: NUTCH-2442
> URL: https://issues.apache.org/jira/browse/NUTCH-2442
> Project: Nutch
>  Issue Type: Bug
>  Components: injector
>Affects Versions: 1.13
>Reporter: Sebastian Nagel
>Priority: Critical
> Fix For: 1.14
>
>
> Injector does not check whether the MapReduce job is successful. Even if the 
> job fails
> - installs the CrawlDb
> -- move current/ to old/
> -- replace current/ with an empty or potentially incomplete version
> - exits with code 0 so that scripts running the crawl workflow cannot detect 
> the failure -- if Injector is run a second time the CrawlDb is lost (both 
> current/ and old/ are empty or corrupted)





[jira] [Resolved] (NUTCH-2442) Injector to stop if job fails to avoid loss of CrawlDb

2017-11-06 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-2442.

Resolution: Fixed

Thanks, [~omkar20895]!

> Injector to stop if job fails to avoid loss of CrawlDb
> --
>
> Key: NUTCH-2442
> URL: https://issues.apache.org/jira/browse/NUTCH-2442
> Project: Nutch
>  Issue Type: Bug
>  Components: injector
>Affects Versions: 1.13
>Reporter: Sebastian Nagel
>Priority: Critical
> Fix For: 1.14
>
>
> Injector does not check whether the MapReduce job is successful. Even if the 
> job fails
> - installs the CrawlDb
> -- move current/ to old/
> -- replace current/ with an empty or potentially incomplete version
> - exits with code 0 so that scripts running the crawl workflow cannot detect 
> the failure -- if Injector is run a second time the CrawlDb is lost (both 
> current/ and old/ are empty or corrupted)





[jira] [Commented] (NUTCH-2442) Injector to stop if job fails to avoid loss of CrawlDb

2017-11-06 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16240954#comment-16240954
 ] 

ASF GitHub Bot commented on NUTCH-2442:
---

sebastian-nagel closed pull request #239: NUTCH-2442 Injector to stop if job 
fails to avoid loss of CrawlDb
URL: https://github.com/apache/nutch/pull/239
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/src/java/org/apache/nutch/crawl/Injector.java b/src/java/org/apache/nutch/crawl/Injector.java
index 5f5fd15ff..d872dbff5 100644
--- a/src/java/org/apache/nutch/crawl/Injector.java
+++ b/src/java/org/apache/nutch/crawl/Injector.java
@@ -414,7 +414,16 @@ public void inject(Path crawlDb, Path urlDir, boolean overwrite,
 
     try {
       // run the job
-      job.waitForCompletion(true);
+      boolean success = job.waitForCompletion(true);
+      if (!success) {
+        String message = "Injector job did not succeed, job status: "
+            + job.getStatus().getState() + ", reason: "
+            + job.getStatus().getFailureInfo();
+        LOG.error(message);
+        cleanupAfterFailure(tempCrawlDb, lock, fs);
+        // throw exception so that calling routine can exit with error
+        throw new RuntimeException(message);
+      }
 
       // save output and perform cleanup
       CrawlDb.install(job, crawlDb);
@@ -452,11 +461,21 @@ public void inject(Path crawlDb, Path urlDir, boolean overwrite,
         LOG.info("Injector: finished at " + sdf.format(end) + ", elapsed: "
             + TimingUtil.elapsedTime(start, end));
       }
-    } catch (IOException e) {
+    } catch (IOException | InterruptedException | ClassNotFoundException e) {
+      LOG.error("Injector job failed", e);
+      cleanupAfterFailure(tempCrawlDb, lock, fs);
+      throw e;
+    }
+  }
+
+  public void cleanupAfterFailure(Path tempCrawlDb, Path lock, FileSystem fs)
+      throws IOException {
+    try {
       if (fs.exists(tempCrawlDb)) {
         fs.delete(tempCrawlDb, true);
       }
-      LockUtil.removeLockFile(conf, lock);
+      LockUtil.removeLockFile(fs, lock);
+    } catch (IOException e) {
       throw e;
     }
   }
diff --git a/src/java/org/apache/nutch/hostdb/ReadHostDb.java b/src/java/org/apache/nutch/hostdb/ReadHostDb.java
index 28a7eb709..eac3bf645 100644
--- a/src/java/org/apache/nutch/hostdb/ReadHostDb.java
+++ b/src/java/org/apache/nutch/hostdb/ReadHostDb.java
@@ -202,8 +202,17 @@ private void readHostDb(Path hostDb, Path output, boolean dumpHomepages, boolean
     job.setNumReduceTasks(0);
 
     try {
-      job.waitForCompletion(true);
-    } catch (Exception e) {
+      boolean success = job.waitForCompletion(true);
+      if (!success) {
+        String message = "ReadHostDb job did not succeed, job status: "
+            + job.getStatus().getState() + ", reason: "
+            + job.getStatus().getFailureInfo();
+        LOG.error(message);
+        // throw exception so that calling routine can exit with error
+        throw new RuntimeException(message);
+      }
+    } catch (IOException | InterruptedException | ClassNotFoundException e) {
+      LOG.error("ReadHostDb job failed", e);
       throw e;
     }
 
diff --git a/src/java/org/apache/nutch/util/CrawlCompletionStats.java b/src/java/org/apache/nutch/util/CrawlCompletionStats.java
index 4920fbf32..116c3113d 100644
--- a/src/java/org/apache/nutch/util/CrawlCompletionStats.java
+++ b/src/java/org/apache/nutch/util/CrawlCompletionStats.java
@@ -171,8 +171,17 @@ public int run(String[] args) throws Exception {
     job.setNumReduceTasks(numOfReducers);
 
     try {
-      job.waitForCompletion(true);
-    } catch (Exception e) {
+      boolean success = job.waitForCompletion(true);
+      if (!success) {
+        String message = jobName + " job did not succeed, job status: "
+            + job.getStatus().getState() + ", reason: "
+            + job.getStatus().getFailureInfo();
+        LOG.error(message);
+        // throw exception so that calling routine can exit with error
+        throw new RuntimeException(message);
+      }
+    } catch (IOException | InterruptedException | ClassNotFoundException e) {
+      LOG.error(jobName + " job failed");
       throw e;
     }
 
diff --git a/src/java/org/apache/nutch/util/ProtocolStatusStatistics.java b/src/java/org/apache/nutch/util/ProtocolStatusStatistics.java
index a18860634..7e241ffdf 100644
--- a/src/java/org/apache/nutch/util/ProtocolStatusStatistics.java
+++ b/src/java/org/apache/nutch/util/ProtocolStatusStatistics.java
@@ -122,8 +122,17 @@ public int run(String[] args) throws Exception {
     job.setNumReduceTasks(numOfReducers);
 
     try {
-      job.waitForCompletion(true);
-    } catch 

Re: quick start for dev Nutch in Intellij?

2017-11-06 Thread Jorge Betancourt
Honestly what I usually do is just:

$ ant eclipse

This will create an Eclipse project that I import directly into IntelliJ. I also 
install/use IvyIDEA and resolve all the dependencies within IntelliJ for all 
modules. With this setup I can run unit tests from within the IDE, although I 
usually run the full test suite from the terminal (ant test), which also works 
inside the plugins directory.

For the IvyIDEA error, check that you have configured the Ivy settings file to 
point to the ivy/ivy.xml file in the Nutch subdirectory.

https://cl.ly/250r1l3g2e2S

Best Regards,
Jorge

On Nov 6, 2017, 6:30 PM +0100, Allison, Timothy B. , wrote:
> All,
> Apologies for the newb-to-nutch question, but is there a quick start guide 
> for developing Nutch with Intellij?
>
> There has to be a better way than relying on the build directory: 
> https://stackoverflow.com/questions/15357462/how-to-open-an-ant-project-nutch-source-at-intellij-idea
>
> I've tried Project Structure->Modules->IvyIDEA and pointing to ivy/ivy.xml, 
> but I'm getting an "Error parsing settings file...if you use properties, 
> configure those first", so I tried loading default.properties first with no 
> luck.
>
>
> Again, apologies, and thank you!
>
> Cheers,
>
> Tim
>


[jira] [Commented] (NUTCH-2442) Injector to stop if job fails to avoid loss of CrawlDb

2017-11-06 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16240650#comment-16240650
 ] 

ASF GitHub Bot commented on NUTCH-2442:
---

Omkar20895 commented on a change in pull request #239: NUTCH-2442 Injector to 
stop if job fails to avoid loss of CrawlDb
URL: https://github.com/apache/nutch/pull/239#discussion_r149160152
 
 

 ##
 File path: src/java/org/apache/nutch/util/ProtocolStatusStatistics.java
 ##
 @@ -122,8 +122,17 @@ public int run(String[] args) throws Exception {
     job.setNumReduceTasks(numOfReducers);
 
     try {
-      job.waitForCompletion(true);
-    } catch (Exception e) {
+      boolean success = job.waitForCompletion(true);
+      if(!success){
 
 Review comment:
   Apologies for such a silly query 😅. I use the vim editor and everything looked 
good to me, but after importing the project into Eclipse I could see the 
formatting differences. Thanks.




> Injector to stop if job fails to avoid loss of CrawlDb
> --
>
> Key: NUTCH-2442
> URL: https://issues.apache.org/jira/browse/NUTCH-2442
> Project: Nutch
>  Issue Type: Bug
>  Components: injector
>Affects Versions: 1.13
>Reporter: Sebastian Nagel
>Priority: Critical
> Fix For: 1.14
>
>
> Injector does not check whether the MapReduce job is successful. Even if the 
> job fails
> - installs the CrawlDb
> -- move current/ to old/
> -- replace current/ with an empty or potentially incomplete version
> - exits with code 0 so that scripts running the crawl workflow cannot detect 
> the failure -- if Injector is run a second time the CrawlDb is lost (both 
> current/ and old/ are empty or corrupted)





[jira] [Commented] (NUTCH-2420) Bug in variable generate.max.count and fetcher.server.delay

2017-11-06 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16240651#comment-16240651
 ] 

Hudson commented on NUTCH-2420:
---

SUCCESS: Integrated in Jenkins build Nutch-trunk #3466 (See 
[https://builds.apache.org/job/Nutch-trunk/3466/])
NUTCH-2420 Bug in variable generate.max.count and fetcher.server.delay (markus: 
[https://github.com/apache/nutch/commit/6199492f5e1e8811022257c88dbf63f1e1c739d0])
* (edit) src/java/org/apache/nutch/crawl/Generator.java


> Bug in variable generate.max.count and fetcher.server.delay
> ---
>
> Key: NUTCH-2420
> URL: https://issues.apache.org/jira/browse/NUTCH-2420
> Project: Nutch
>  Issue Type: Bug
>  Components: generator
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Fix For: 1.14
>
> Attachments: NUTCH-2420.patch
>
>
> The feature added by NUTCH-2368 does not work for multiple hosts. Once a 
> HostDatum has been read by getHostDatum(), the next host cannot be read. 
> Apparently I need to open and close the SequenceFile.Readers for every 
> HostDatum that is needed. The Reader has no reset() method or anything similar.
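One reading of the problem above is that a forward-only reader cannot serve a second lookup once it has been advanced. The sketch below illustrates that reading with a plain Java Iterator. It does not use SequenceFile.Reader or getHostDatum(): the class ForwardOnlyLookupSketch, its methods, and the sample host names are all hypothetical.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical model of host lookups against a reader that cannot be reset.
class ForwardOnlyLookupSketch {
  // Advances the reader until the host is found or the input is exhausted.
  static String scan(Iterator<Map.Entry<String, String>> reader, String host) {
    while (reader.hasNext()) {
      Map.Entry<String, String> entry = reader.next();
      if (entry.getKey().equals(host)) {
        return entry.getValue();
      }
    }
    return null;
  }

  // Opens a fresh reader for every lookup, so earlier reads cannot hide entries.
  static String lookup(Supplier<Iterator<Map.Entry<String, String>>> open,
      String host) {
    return scan(open.get(), host);
  }

  static List<Map.Entry<String, String>> sampleData() {
    List<Map.Entry<String, String>> data = new ArrayList<>();
    data.add(new AbstractMap.SimpleEntry<String, String>("a.example", "delay=1"));
    data.add(new AbstractMap.SimpleEntry<String, String>("b.example", "delay=2"));
    return data;
  }

  public static void main(String[] args) {
    List<Map.Entry<String, String>> data = sampleData();
    Iterator<Map.Entry<String, String>> shared = data.iterator();
    System.out.println(scan(shared, "b.example")); // found
    System.out.println(scan(shared, "a.example")); // missed: reader exhausted
    System.out.println(lookup(data::iterator, "a.example")); // found again
  }
}
```

Re-opening per lookup is simple but repeats the scan for every host, which is presumably why a faster approach was split out into NUTCH-2455.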





[jira] [Updated] (NUTCH-2456) Redirected documents are not indexed

2017-11-06 Thread Yossi Tamari (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yossi Tamari updated NUTCH-2456:

Description: 
If http.redirect.max is set to a positive value, the Fetcher will follow 
redirects, creating a new CrawlDatum.
If the redirected URL is fetched and parsed, during indexing for it we have a 
special case: dbDatum is null. This means that in 
[https://github.com/apache/nutch/blob/6199492f5e1e8811022257c88dbf63f1e1c739d0/src/java/org/apache/nutch/indexer/IndexerMapReduce.java#L259]
 the document is not indexed, as it is assumed it only has inlinks (actually it 
has everything but dbDatum).
I'm not sure what the correct fix is here. It seems to me the condition should 
use AND instead of OR anyway, but I may not understand the original intent. It 
is clear that it is too strict as is.
However, the code following that line assumes all 4 objects are not null, so a 
patch would need to change more than just the condition.

  was:
If http.redirect.max is set to a positive value, the Fetcher will follow 
redirects, creating a new CrawlDatum.
If the redirected URL is fetched and parsed, during indexing for it we have a 
special case: dbDatum is null. This means that in 
[https://github.com/apache/nutch/blob/6199492f5e1e8811022257c88dbf63f1e1c739d0/src/java/org/apache/nutch/indexer/IndexerMapReduce.java#L259]
 the document is not indexed, as it is assumed it only has inlinks (actually it 
has everything but dbDatum).
I'm not sure what the correct fix is here. It seems to me the condition should 
use AND instead of OR anyway, but I may not understand the original intent. It 
is clear that it is too strict as is.


> Redirected documents are not indexed
> 
>
> Key: NUTCH-2456
> URL: https://issues.apache.org/jira/browse/NUTCH-2456
> Project: Nutch
>  Issue Type: Bug
>  Components: indexer
>Affects Versions: 1.13
>Reporter: Yossi Tamari
>Priority: Critical
>
> If http.redirect.max is set to a positive value, the Fetcher will follow 
> redirects, creating a new CrawlDatum.
> If the redirected URL is fetched and parsed, during indexing for it we have a 
> special case: dbDatum is null. This means that in 
> [https://github.com/apache/nutch/blob/6199492f5e1e8811022257c88dbf63f1e1c739d0/src/java/org/apache/nutch/indexer/IndexerMapReduce.java#L259]
>  the document is not indexed, as it is assumed it only has inlinks (actually 
> it has everything but dbDatum).
> I'm not sure what the correct fix is here. It seems to me the condition 
> should use AND instead of OR anyway, but I may not understand the original 
> intent. It is clear that it is too strict as is.
> However, the code following that line assumes all 4 objects are not null, so 
> a patch would need to change more than just the condition.
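The difference between the OR and AND conditions discussed above can be sketched with plain null checks. This is an illustration under assumptions, not the code at the linked line: only dbDatum is named in the description, and the other three parameter names (fetchDatum, parseText, parseData) are assumed names for the remaining objects. The third predicate is one possible relaxation, not a proposed patch.

```java
// Hypothetical predicates; parameter names other than dbDatum are assumptions.
class IndexGuardSketch {
  // OR guard: skip the document if any of the four objects is missing.
  static boolean skipIfAnyMissing(Object fetchDatum, Object dbDatum,
      Object parseText, Object parseData) {
    return fetchDatum == null || dbDatum == null
        || parseText == null || parseData == null;
  }

  // AND guard: skip only if all four objects are missing.
  static boolean skipIfAllMissing(Object fetchDatum, Object dbDatum,
      Object parseText, Object parseData) {
    return fetchDatum == null && dbDatum == null
        && parseText == null && parseData == null;
  }

  // One possible relaxation: tolerate a missing dbDatum only. Code after the
  // guard would then have to handle dbDatum == null itself.
  static boolean skipUnlessFetchedAndParsed(Object fetchDatum, Object dbDatum,
      Object parseText, Object parseData) {
    return fetchDatum == null || parseText == null || parseData == null;
  }

  public static void main(String[] args) {
    Object present = new Object();
    // Redirect target: everything is present except dbDatum.
    System.out.println(skipIfAnyMissing(present, null, present, present));
    System.out.println(skipIfAllMissing(present, null, present, present));
    System.out.println(skipUnlessFetchedAndParsed(present, null, present, present));
  }
}
```

For the redirect case the OR guard skips the document while the other two let it through. The AND guard also lets through documents with no parse data at all, which is why the description notes that the code following the condition would need changes too.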





[jira] [Created] (NUTCH-2456) Redirected documents are not indexed

2017-11-06 Thread Yossi Tamari (JIRA)
Yossi Tamari created NUTCH-2456:
---

 Summary: Redirected documents are not indexed
 Key: NUTCH-2456
 URL: https://issues.apache.org/jira/browse/NUTCH-2456
 Project: Nutch
  Issue Type: Bug
  Components: indexer
Affects Versions: 1.13
Reporter: Yossi Tamari
Priority: Critical


If http.redirect.max is set to a positive value, the Fetcher will follow 
redirects, creating a new CrawlDatum.
If the redirected URL is fetched and parsed, during indexing for it we have a 
special case: dbDatum is null. This means that in 
[https://github.com/apache/nutch/blob/6199492f5e1e8811022257c88dbf63f1e1c739d0/src/java/org/apache/nutch/indexer/IndexerMapReduce.java#L259]
 the document is not indexed, as it is assumed it only has inlinks (actually it 
has everything but dbDatum).
I'm not sure what the correct fix is here. It seems to me the condition should 
use AND instead of OR anyway, but I may not understand the original intent. It 
is clear that it is too strict as is.





quick start for dev Nutch in Intellij?

2017-11-06 Thread Allison, Timothy B.
All,
  Apologies for the newb-to-nutch question, but is there a quick start guide 
for developing Nutch with Intellij?

  There has to be a better way than relying on the build directory: 
https://stackoverflow.com/questions/15357462/how-to-open-an-ant-project-nutch-source-at-intellij-idea
  
I've tried Project Structure->Modules->IvyIDEA and pointing to ivy/ivy.xml, but 
I'm getting an "Error parsing settings file...if you use properties, configure 
those first", so I tried loading default.properties first, with no luck.
  

  Again, apologies, and thank you!

  Cheers,

   Tim



[jira] [Commented] (NUTCH-2420) Bug in variable generate.max.count and fetcher.server.delay

2017-11-06 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16240491#comment-16240491
 ] 

Markus Jelsma commented on NUTCH-2420:
--

Committed and created NUTCH-2455

Thanks!

> Bug in variable generate.max.count and fetcher.server.delay
> ---
>
> Key: NUTCH-2420
> URL: https://issues.apache.org/jira/browse/NUTCH-2420
> Project: Nutch
>  Issue Type: Bug
>  Components: generator
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Fix For: 1.14
>
> Attachments: NUTCH-2420.patch
>
>
> Feature added by NUTCH-2368 does not work for multiple hosts. Once a 
> HostDatum has been read by getHostDatum(), the next host cannot be read. 
> Apparently I need to open and close the SequenceFile.Readers for every 
> HostDatum it needs. The Reader has no reset() method or anything similar.





[jira] [Updated] (NUTCH-2455) Speed up the merging of HostDb entries for variable fetch delay

2017-11-06 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-2455:
-
Description: 
Citing Sebastian at NUTCH-2420:

??The correct solution would be to use  pairs as keys in the 
Selector job, with a partitioner and secondary sorting so that all keys with 
same host end up in the same call of the reducer. If values can also hold a 
HostDb entry and the sort comparator guarantees that the HostDb entry (entries 
if partitioned by domain or IP) comes in front of all CrawlDb entries. But that 
would be a substantial improvement...??

  was:??The correct solution would be to use  pairs as keys in the 
Selector job, with a partitioner and secondary sorting so that all keys with 
same host end up in the same call of the reducer. If values can also hold a 
HostDb entry and the sort comparator guarantees that the HostDb entry (entries 
if partitioned by domain or IP) comes in front of all CrawlDb entries. But that 
would be a substantial improvement...??


> Speed up the merging of HostDb entries for variable fetch delay
> ---
>
> Key: NUTCH-2455
> URL: https://issues.apache.org/jira/browse/NUTCH-2455
> Project: Nutch
>  Issue Type: Improvement
>  Components: generator
>Affects Versions: 1.13
>Reporter: Markus Jelsma
>
> Citing Sebastian at NUTCH-2420:
> ??The correct solution would be to use  pairs as keys in the 
> Selector job, with a partitioner and secondary sorting so that all keys with 
> same host end up in the same call of the reducer. If values can also hold a 
> HostDb entry and the sort comparator guarantees that the HostDb entry 
> (entries if partitioned by domain or IP) comes in front of all CrawlDb 
> entries. But that would be a substantial improvement...??





[jira] [Created] (NUTCH-2455) Speed up the merging of HostDb entries for variable fetch delay

2017-11-06 Thread Markus Jelsma (JIRA)
Markus Jelsma created NUTCH-2455:


 Summary: Speed up the merging of HostDb entries for variable fetch 
delay
 Key: NUTCH-2455
 URL: https://issues.apache.org/jira/browse/NUTCH-2455
 Project: Nutch
  Issue Type: Improvement
  Components: generator
Affects Versions: 1.13
Reporter: Markus Jelsma


??The correct solution would be to use  pairs as keys in the 
Selector job, with a partitioner and secondary sorting so that all keys with 
same host end up in the same call of the reducer. If values can also hold a 
HostDb entry and the sort comparator guarantees that the HostDb entry (entries 
if partitioned by domain or IP) comes in front of all CrawlDb entries. But that 
would be a substantial improvement...??





[jira] [Resolved] (NUTCH-2420) Bug in variable generate.max.count and fetcher.server.delay

2017-11-06 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma resolved NUTCH-2420.
--
Resolution: Fixed

remote:517dbdf..6199492  6199492f5e1e8811022257c88dbf63f1e1c739d0 -> master


> Bug in variable generate.max.count and fetcher.server.delay
> ---
>
> Key: NUTCH-2420
> URL: https://issues.apache.org/jira/browse/NUTCH-2420
> Project: Nutch
>  Issue Type: Bug
>  Components: generator
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Fix For: 1.14
>
> Attachments: NUTCH-2420.patch
>
>
> Feature added by NUTCH-2368 does not work for multiple hosts. Once a 
> HostDatum has been read by getHostDatum(), the next host cannot be read. 
> Apparently I need to open and close the SequenceFile.Readers for every 
> HostDatum it needs. The Reader has no reset() method or anything similar.





[jira] [Commented] (NUTCH-2420) Bug in variable generate.max.count and fetcher.server.delay

2017-11-06 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16240416#comment-16240416
 ] 

Sebastian Nagel commented on NUTCH-2420:


That's really ugly! It will work in local mode or on a small cluster with not 
too many hosts, but on a larger cluster, with a HostDb of many partitions and 
millions of hosts, it will slow down the Generator significantly. The correct 
solution would be to use  pairs as keys in the Selector job, with a 
partitioner and secondary sorting so that all keys with same host end up in the 
same call of the reducer. If values can also hold a HostDb entry and the sort 
comparator guarantees that the HostDb entry (entries if partitioned by domain 
or IP) comes in front of all CrawlDb entries. But that would be a substantial 
improvement...

+1 to commit this fix and to open a new one to speed up the merging of HostDb 
entries
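
The partitioning and secondary-sort idea can be sketched in plain Java without 
any Hadoop classes. Everything here is illustrative: the Entry class, the 
Source enum and the (host, source) ordering are assumptions made for the 
example, not the key layout proposed in the comment, and the partition() 
helper only mimics what a hash partitioner keyed on the host would do.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

final class SecondarySortSketch {

  // HOSTDB is declared first, so it sorts in front of CRAWLDB.
  enum Source { HOSTDB, CRAWLDB }

  static final class Entry {
    final String host;
    final Source source;
    final String value;

    Entry(String host, Source source, String value) {
      this.host = host;
      this.source = source;
      this.value = value;
    }
  }

  // Sort comparator: order by host, then put the HostDb entry first.
  static final Comparator<Entry> SORT =
      Comparator.comparing((Entry e) -> e.host)
          .thenComparing((Entry e) -> e.source);

  // Partitioner stand-in: only the host decides the partition, so all
  // entries of one host reach the same reducer.
  static int partition(Entry e, int numPartitions) {
    return (e.host.hashCode() & Integer.MAX_VALUE) % numPartitions;
  }

  static List<Entry> sorted(List<Entry> input) {
    List<Entry> out = new ArrayList<>(input);
    out.sort(SORT); // stable sort, CrawlDb entries keep their input order
    return out;
  }

  public static void main(String[] args) {
    List<Entry> input = Arrays.asList(
        new Entry("b.org", Source.CRAWLDB, "http://b.org/1"),
        new Entry("a.org", Source.CRAWLDB, "http://a.org/1"),
        new Entry("b.org", Source.HOSTDB, "stats-b"),
        new Entry("a.org", Source.HOSTDB, "stats-a"));
    for (Entry e : sorted(input)) {
      System.out.println(e.host + " " + e.source + " " + e.value);
    }
  }
}
```

Because the HostDb entry arrives first within each host, a reducer could read 
it once and apply it to the CrawlDb entries that follow, without reopening any 
reader per host.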

> Bug in variable generate.max.count and fetcher.server.delay
> ---
>
> Key: NUTCH-2420
> URL: https://issues.apache.org/jira/browse/NUTCH-2420
> Project: Nutch
>  Issue Type: Bug
>  Components: generator
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Fix For: 1.14
>
> Attachments: NUTCH-2420.patch
>
>
> Feature added by NUTCH-2368 does not work for multiple hosts. Once a 
> HostDatum has been read by getHostDatum(), the next host cannot be read. 
> Apparently I need to open and close the SequenceFile.Readers for every 
> HostDatum it needs. The Reader has no reset() method or anything similar.





[jira] [Commented] (NUTCH-2422) Update information about git repository

2017-11-06 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16240316#comment-16240316
 ] 

Sebastian Nagel commented on NUTCH-2422:


Nutch is a [GitHub dual master|https://gitbox.apache.org/] repository, so the 
"mirrored from" link should point to 
https://gitbox.apache.org/repos/asf/nutch.git. Opened INFRA-15451 to address 
this problem. The information in the wiki 
([UsingGit|https://wiki.apache.org/nutch/UsingGit]), linked from 
http://nutch.apache.org/version_control.html, points to gitbox.

> Update information about git repository
> ---
>
> Key: NUTCH-2422
> URL: https://issues.apache.org/jira/browse/NUTCH-2422
> Project: Nutch
>  Issue Type: Task
>Reporter: Karl Richter
>
> https://github.com/apache/nutch claims to be a mirror of 
> git://git.apache.org/nutch.git, but cloning fails due to `fatal: remote 
> error: access denied or repository not exported: /nutch.git` and nutch isn't 
> listed on https://git.apache.org/. That status isn't consistent.





[jira] [Commented] (NUTCH-2431) Filterchecker to implement Tool-interface

2017-11-06 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16240277#comment-16240277
 ] 

Sebastian Nagel commented on NUTCH-2431:


+1 but
- could add {{-D...}} or {{-Dproperty=value}} to the command-line help to make 
clear that it's possible to override properties
- patch does not apply to master because it requires NUTCH-2320: [~jurian], 
could you either decouple it or integrate the latter first and then make sure 
that the patch still applies?
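
For illustration, this is roughly what handling {{-Dproperty=value}} amounts 
to. When a Hadoop Tool is run through ToolRunner, the generic options, 
including -D, are parsed for you. The plain-Java sketch below only imitates 
that split so it can run without Hadoop. The class name, the usage text and 
the property "example.property" are hypothetical, and only the joined 
-Dname=value form is handled.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class DashDOptionsSketch {

  final Map<String, String> overrides = new LinkedHashMap<>();
  final List<String> remaining = new ArrayList<>();

  // Split -Dname=value overrides from the tool's own arguments.
  static DashDOptionsSketch parse(String[] args) {
    DashDOptionsSketch parsed = new DashDOptionsSketch();
    for (String arg : args) {
      int eq = arg.indexOf('=');
      if (arg.startsWith("-D") && eq > 2) {
        parsed.overrides.put(arg.substring(2, eq), arg.substring(eq + 1));
      } else {
        parsed.remaining.add(arg);
      }
    }
    return parsed;
  }

  // Hypothetical help text that mentions the override syntax.
  static String usage() {
    return "Usage: FilterCheckerSketch [-Dproperty=value]... <options>";
  }

  public static void main(String[] args) {
    DashDOptionsSketch parsed =
        parse(new String[] {"-Dexample.property=42", "-stdin"});
    System.out.println(usage());
    System.out.println(parsed.overrides + " " + parsed.remaining);
  }
}
```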

> Filterchecker to implement Tool-interface
> -
>
> Key: NUTCH-2431
> URL: https://issues.apache.org/jira/browse/NUTCH-2431
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.13
>Reporter: Jurian Broertjes
>Priority: Minor
>  Labels: urlfilter
> Attachments: NUTCH-2431.patch
>
>
> The current implementation of the URLFilterChecker does not allow for 
> commandline config overrides. It needs to implement the Tool interface for 
> this. 
> Please see the attached patch





[jira] [Commented] (NUTCH-2442) Injector to stop if job fails to avoid loss of CrawlDb

2017-11-06 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16240036#comment-16240036
 ] 

ASF GitHub Bot commented on NUTCH-2442:
---

sebastian-nagel commented on a change in pull request #239: NUTCH-2442 Injector 
to stop if job fails to avoid loss of CrawlDb
URL: https://github.com/apache/nutch/pull/239#discussion_r149020659
 
 

 ##
 File path: src/java/org/apache/nutch/util/ProtocolStatusStatistics.java
 ##
 @@ -122,8 +122,17 @@ public int run(String[] args) throws Exception {
 job.setNumReduceTasks(numOfReducers);
 
 try {
-  job.waitForCompletion(true);
-} catch (Exception e) {
+  boolean success = job.waitForCompletion(true);
+  if(!success){
 
 Review comment:
   There is a missing space before opening brackets/braces (also in other 
occurrences). The formatting should follow the style defined by the Eclipse 
Code Formatter rules 
(https://github.com/apache/nutch/blob/master/eclipse-codeformat.xml). The 
easiest way is to import the rules into Eclipse (or another IDE); if in doubt, 
Eclipse can also format the code from the command line.




> Injector to stop if job fails to avoid loss of CrawlDb
> --
>
> Key: NUTCH-2442
> URL: https://issues.apache.org/jira/browse/NUTCH-2442
> Project: Nutch
>  Issue Type: Bug
>  Components: injector
>Affects Versions: 1.13
>Reporter: Sebastian Nagel
>Priority: Critical
> Fix For: 1.14
>
>
> Injector does not check whether the MapReduce job is successful. Even if the 
> job fails
> - installs the CrawlDb
> -- move current/ to old/
> -- replace current/ with an empty or potentially incomplete version
> - exits with code 0 so that scripts running the crawl workflow cannot detect 
> the failure -- if Injector is run a second time the CrawlDb is lost (both 
> current/ and old/ are empty or corrupted)
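
The failure mode described above suggests a simple guard: install the new 
CrawlDb only after the job reports success, and return a non-zero exit code 
otherwise. The following plain-Java sketch shows that control flow under 
stated assumptions. JobGuardSketch and its parameters are hypothetical, the 
BooleanSupplier stands in for a call such as job.waitForCompletion(true), and 
the two Runnables stand in for the install and cleanup steps. It is not the 
actual patch.

```java
import java.util.function.BooleanSupplier;

final class JobGuardSketch {

  // Returns the exit code: 0 only if the job succeeded and the new
  // version was installed.
  static int runAndInstall(BooleanSupplier job, Runnable install,
      Runnable cleanup) {
    boolean success;
    try {
      success = job.getAsBoolean(); // stand-in for waiting on the job
    } catch (RuntimeException e) {
      success = false;
    }
    if (!success) {
      cleanup.run(); // drop temporary output, leave current/ and old/ alone
      return 1;      // non-zero, so calling scripts can detect the failure
    }
    install.run();   // only now move current/ to old/ and install the new db
    return 0;
  }

  public static void main(String[] args) {
    int rc = runAndInstall(() -> false,
        () -> System.out.println("install"),
        () -> System.out.println("cleanup"));
    System.out.println("exit code " + rc);
  }
}
```

Because nothing is moved when the job fails, a second Injector run after a 
failure would still find the previous current/ intact.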


