[jira] [Commented] (NUTCH-2565) MergeDB incorrectly handles unfetched CrawlDatums

2018-06-21 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16519500#comment-16519500
 ] 

Hudson commented on NUTCH-2565:
---

SUCCESS: Integrated in Jenkins build Nutch-trunk #3537 (See 
[https://builds.apache.org/job/Nutch-trunk/3537/])
- fix for NUTCH-2565 contributed by Jurian Broertjes (jurian.broertjes: 
[https://github.com/apache/nutch/commit/415dc90326455e2c8b9ba8a4c8612b2e451e8662])
* (edit) src/java/org/apache/nutch/crawl/AbstractFetchSchedule.java


> MergeDB incorrectly handles unfetched CrawlDatums
> -
>
> Key: NUTCH-2565
> URL: https://issues.apache.org/jira/browse/NUTCH-2565
> Project: Nutch
>  Issue Type: Bug
>  Components: crawldb
>Affects Versions: 1.14
>Reporter: Jurian Broertjes
>Priority: Minor
> Fix For: 1.15
>
>
> I ran into this issue when merging a crawlDB originating from sitemaps into 
> our normal crawlDB. CrawlDatums are merged based on the output of 
> AbstractFetchSchedule::calculateLastFetchTime(). When CrawlDatums are 
> unfetched, this can overwrite fetchTime or other fields.
> I assume this is a bug and have a simple fix for it that checks whether the 
> CrawlDatum has status db_unfetched.

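The failure mode can be made concrete with a minimal, self-contained Java sketch (a hypothetical `lastFetchTime` helper that mirrors only the arithmetic of the pre-fix AbstractFetchSchedule::calculateLastFetchTime; not the actual Nutch classes). A freshly injected, never-fetched entry with a short fetch interval can compare as "fetched more recently" than a datum that was genuinely fetched days ago:

```java
// Hypothetical stand-alone sketch; mirrors only the arithmetic of the
// pre-fix AbstractFetchSchedule::calculateLastFetchTime.
public class MergeSketch {

  // fetchTime is the next scheduled fetch (ms); subtracting the interval
  // is meant to recover the time of the last fetch.
  static long lastFetchTime(long fetchTimeMs, int fetchIntervalSecs) {
    return fetchTimeMs - (long) fetchIntervalSecs * 1000;
  }

  public static void main(String[] args) {
    long now = 1_700_000_000_000L;
    long day = 86_400_000L;

    // Genuinely fetched a week ago, 30-day interval: next fetch is in 23 days.
    long fetchedLast = lastFetchTime(now + 23 * day, 30 * 86_400); // == now - 7 days

    // Unfetched sitemap entry injected just now with an hourly interval.
    long unfetchedLast = lastFetchTime(now, 3_600);                // == now - 1 hour

    // The unfetched entry "wins" the merge and overwrites the fetched datum.
    if (unfetchedLast <= fetchedLast) throw new AssertionError();
    System.out.println("unfetched entry compares as newer");
  }
}
```

With these (illustrative) numbers the unfetched datum's computed last-fetch time is newer, which is exactly the overwrite described above.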


--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (NUTCH-2565) MergeDB incorrectly handles unfetched CrawlDatums

2018-06-21 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16519443#comment-16519443
 ] 

ASF GitHub Bot commented on NUTCH-2565:
---

sebastian-nagel closed pull request #311: - fix for NUTCH-2565 contributed by 
Jurian Broertjes
URL: https://github.com/apache/nutch/pull/311
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

diff --git a/src/java/org/apache/nutch/crawl/AbstractFetchSchedule.java b/src/java/org/apache/nutch/crawl/AbstractFetchSchedule.java
index 38fde9f02..a3209894b 100755
--- a/src/java/org/apache/nutch/crawl/AbstractFetchSchedule.java
+++ b/src/java/org/apache/nutch/crawl/AbstractFetchSchedule.java
@@ -157,7 +157,11 @@ public CrawlDatum setPageRetrySchedule(Text url, CrawlDatum datum,
    * @return the date as a long.
    */
   public long calculateLastFetchTime(CrawlDatum datum) {
-    return datum.getFetchTime() - (long) datum.getFetchInterval() * 1000;
+    if (datum.getStatus() == CrawlDatum.STATUS_DB_UNFETCHED) {
+      return 0L;
+    } else {
+      return datum.getFetchTime() - (long) datum.getFetchInterval() * 1000;
+    }
   }
 
   /**
diff --git a/src/java/org/apache/nutch/crawl/CrawlDbMerger.java b/src/java/org/apache/nutch/crawl/CrawlDbMerger.java
index d8756fd5e..475ee855d 100644
--- a/src/java/org/apache/nutch/crawl/CrawlDbMerger.java
+++ b/src/java/org/apache/nutch/crawl/CrawlDbMerger.java
@@ -31,6 +31,7 @@
 import org.apache.hadoop.conf.Configured;
 import org.apache.hadoop.fs.FileSystem;
 import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.io.MapWritable;
 import org.apache.hadoop.io.Text;
 import org.apache.hadoop.io.Writable;
 import org.apache.hadoop.mapreduce.Job;
@@ -70,8 +71,6 @@
 
   public static class Merger extends
       Reducer<Text, CrawlDatum, Text, CrawlDatum> {
-    private org.apache.hadoop.io.MapWritable meta;
-    private CrawlDatum res = new CrawlDatum();
     private FetchSchedule schedule;
 
     public void close() throws IOException {
@@ -85,40 +84,40 @@ public void setup(Reducer.Context context) {
     public void reduce(Text key, Iterable<CrawlDatum> values,
         Context context)
         throws IOException, InterruptedException {
-      long resTime = 0L;
-      boolean resSet = false;
-      long valTime = 0L;
-      meta = new org.apache.hadoop.io.MapWritable();
+
+      CrawlDatum res = new CrawlDatum();
+      res.setFetchTime(-1); // We want everything to be newer!
+      MapWritable meta = new MapWritable();
+
       for (CrawlDatum val : values) {
-        if (!resSet) {
-          res.set(val);
-          resSet = true;
-          resTime = schedule.calculateLastFetchTime(res);
-          for (Entry<Writable, Writable> e : res.getMetaData().entrySet()) {
-            meta.put(e.getKey(), e.getValue());
-          }
-          continue;
-        }
-        // compute last fetch time, and pick the latest
-        valTime = schedule.calculateLastFetchTime(val);
-        if (valTime > resTime) {
+        if (isNewer(res, val)) {
           // collect all metadata, newer values override older values
-          for (Entry<Writable, Writable> e : val.getMetaData().entrySet()) {
-            meta.put(e.getKey(), e.getValue());
-          }
+          meta = mergeMeta(val.getMetaData(), meta);
           res.set(val);
-          resTime = valTime;
         } else {
-          // insert older metadata before newer
-          for (Entry<Writable, Writable> e : meta.entrySet()) {
-            val.getMetaData().put(e.getKey(), e.getValue());
-          }
-          meta = val.getMetaData();
+          // overwrite older metadata with current metadata
+          meta = mergeMeta(meta, val.getMetaData());
         }
       }
+
       res.setMetaData(meta);
       context.write(key, res);
     }
+
+    // Determine which CrawlDatum is the latest, according to calculateLastFetchTime()
+    // and getFetchTime() as fallback in case calculateLastFetchTime()s are equal (eg: DB_UNFETCHED)
+    private boolean isNewer(CrawlDatum cd1, CrawlDatum cd2) {
+      return schedule.calculateLastFetchTime(cd2) > schedule.calculateLastFetchTime(cd1)
+          || schedule.calculateLastFetchTime(cd2) == schedule.calculateLastFetchTime(cd1)
+              && cd2.getFetchTime() > cd1.getFetchTime();
+    }
+
+    private MapWritable mergeMeta(MapWritable from, MapWritable to) {
+      for (Entry<Writable, Writable> e : from.entrySet()) {
+        to.put(e.getKey(), e.getValue());
+      }
+      return to;
+    }
   }
 
   public CrawlDbMerger() {


 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
u

[jira] [Commented] (NUTCH-2565) MergeDB incorrectly handles unfetched CrawlDatums

2018-06-14 Thread Jurian Broertjes (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16512589#comment-16512589
 ] 

Jurian Broertjes commented on NUTCH-2565:
-

Updated the PR with the proposed solution.



[jira] [Commented] (NUTCH-2565) MergeDB incorrectly handles unfetched CrawlDatums

2018-06-12 Thread Jurian Broertjes (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16509801#comment-16509801
 ] 

Jurian Broertjes commented on NUTCH-2565:
-

Maybe it would be sufficient to only test for STATUS_DB_UNFETCHED in 
calculateLastFetchTime(datum), but fall back on CrawlDatum.getFetchTime() in the 
merger and pick the newest according to that.

That way we could also just take the retries value from the newest one and keep 
it simple.

I'll add a PR later for review.
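That scheme can be sketched in a few lines of plain Java (a hypothetical, simplified comparison using bare values in place of the Nutch CrawlDatum API): compare by the calculateLastFetchTime() result first, and only on a tie, e.g. two DB_UNFETCHED entries both mapping to 0L, fall back to the raw fetchTime:

```java
// Hypothetical sketch of the proposed merge comparison; plain values stand in
// for CrawlDatum fields.
public class NewerSketch {
  static final int DB_UNFETCHED = 1;
  static final int DB_FETCHED = 2;

  // Unfetched entries have no meaningful last fetch time; map them to 0L.
  static long lastFetchTime(int status, long fetchTimeMs, int intervalSecs) {
    if (status == DB_UNFETCHED) return 0L;
    return fetchTimeMs - (long) intervalSecs * 1000;
  }

  // True if the candidate (status2/fetchTime2) should replace the current one.
  static boolean isNewer(int status1, long fetchTime1, int interval1,
                         int status2, long fetchTime2, int interval2) {
    long t1 = lastFetchTime(status1, fetchTime1, interval1);
    long t2 = lastFetchTime(status2, fetchTime2, interval2);
    if (t2 != t1) return t2 > t1;
    return fetchTime2 > fetchTime1; // fallback: e.g. both DB_UNFETCHED
  }

  public static void main(String[] args) {
    // Two unfetched entries: the one with the later fetchTime wins the tie.
    if (!isNewer(DB_UNFETCHED, 1_000L, 60, DB_UNFETCHED, 2_000L, 60))
      throw new AssertionError();
    // A fetched entry with a positive last fetch time beats an unfetched one.
    if (!isNewer(DB_UNFETCHED, 9_000L, 60, DB_FETCHED, 5_000_000L, 60))
      throw new AssertionError();
    System.out.println("ok");
  }
}
```

This mirrors the structure the merged PR later adopts (an isNewer-style helper), but simplified for illustration.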



[jira] [Commented] (NUTCH-2565) MergeDB incorrectly handles unfetched CrawlDatums

2018-06-12 Thread Sebastian Nagel (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16509596#comment-16509596
 ] 

Sebastian Nagel commented on NUTCH-2565:


I first thought about making the condition in calculateLastFetchTime(datum) 
stricter:
{code}
if (datum.getStatus() == CrawlDatum.STATUS_DB_UNFETCHED
    && datum.getRetriesSinceFetch() == 0) {
  return 0L;
}
{code}
This guarantees that we do not prefer an older DB_FETCHED over a newer 
DB_UNFETCHED with a "transient" failure. If there are two DB_UNFETCHED with 
retries > 0 to be merged, it's important that
# the fetch time is the latest (for scheduling)
# yes, we could sum the retry counts, but then we would also need to trigger a 
status change if retries > db.fetch.retry.max, and make sure not to cause a 
retry counter overflow (it's only a signed byte) if many CrawlDbs are merged.
In short, this looks too complex to me. What do you think?
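The overflow concern is easy to reproduce: with the retry counter stored as a signed byte, naively summing two counters can wrap negative. A stand-alone demonstration in plain Java (no Nutch types; `sumRetries` is a hypothetical helper for illustration):

```java
// Demonstrates why summing retry counters stored in a signed byte is unsafe.
public class RetryOverflow {

  // Naive sum of two byte-sized retry counters, truncated back to a byte.
  static byte sumRetries(byte a, byte b) {
    return (byte) (a + b);
  }

  public static void main(String[] args) {
    // 100 + 100 = 200 exceeds Byte.MAX_VALUE (127) and wraps to -56.
    byte merged = sumRetries((byte) 100, (byte) 100);
    System.out.println(merged); // prints -56
    if (merged >= 0) throw new AssertionError("expected wrap-around");
  }
}
```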



[jira] [Commented] (NUTCH-2565) MergeDB incorrectly handles unfetched CrawlDatums

2018-06-12 Thread Jurian Broertjes (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16509573#comment-16509573
 ] 

Jurian Broertjes commented on NUTCH-2565:
-

One solution would be to sum the retries of both CrawlDatums. We could do this 
only for db_unfetched or for others as well. What do you think would be 
appropriate?

 

 



[jira] [Commented] (NUTCH-2565) MergeDB incorrectly handles unfetched CrawlDatums

2018-04-10 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16432456#comment-16432456
 ] 

Sebastian Nagel commented on NUTCH-2565:


Ok, good point. I think the CrawlDbMerger was never designed to merge an 
unfetched CrawlDb.

What about CrawlDatums with status db_unfetched and retries > 0? Shouldn't 
failed fetches also take precedence over newly injected items?



[jira] [Commented] (NUTCH-2565) MergeDB incorrectly handles unfetched CrawlDatums

2018-04-10 Thread Jurian Broertjes (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16432083#comment-16432083
 ] 

Jurian Broertjes commented on NUTCH-2565:
-

PR: https://github.com/apache/nutch/pull/311
