[ https://issues.apache.org/jira/browse/NUTCH-532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12517402 ]

Andrzej Bialecki  commented on NUTCH-532:
-----------------------------------------

Float values were originally intended to express fractions of a day, when the 
fetch interval was expressed in days. Now that the unit is seconds, there is 
little purpose to them.

However, we need to be careful about the size of the data: long values are, 
well, long ;), and every operation that involves CrawlDatum would pay a 
performance cost for them. Is it really useful to keep the re-fetch interval in 
milliseconds? If we limit the resolution to seconds, as it is now, then I think 
an int value should be enough, which means that sizeof(CrawlDatum) stays the 
same.
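
To put a number on "an int value should be enough", here is a minimal sketch of
the range an int covers when the unit is seconds. The class and method names
are made up for illustration and are not part of Nutch.

```java
// Hypothetical helper, not part of Nutch: how far does an int reach
// when it counts seconds?
class IntervalRange {
    static final long SECONDS_PER_YEAR = 365L * 24 * 3600;

    // Whole years representable by a non-negative int number of seconds.
    static long maxYearsInSeconds() {
        return Integer.MAX_VALUE / SECONDS_PER_YEAR;
    }

    // True if an interval given in seconds can be stored in an int.
    static boolean fitsInInt(long intervalSeconds) {
        return intervalSeconds >= 0 && intervalSeconds <= Integer.MAX_VALUE;
    }

    public static void main(String[] args) {
        System.out.println("max years: " + maxYearsInSeconds());   // 68
        long tenYears = 10 * SECONDS_PER_YEAR;
        System.out.println("10 years fits: " + fitsInInt(tenYears)); // true
    }
}
```

So an int holds re-fetch intervals of up to roughly 68 years at one-second
resolution, in the same four bytes a float takes.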

+1 on adding a getLastFetchTime, with a good javadoc that explains the formula 
and assumptions. Perhaps it should be called calculateLastFetchTime, to avoid 
misunderstandings, because in reality we don't keep that value. The method 
should be added to FetchSchedule interface, and it should be implemented in 
AbstractFetchSchedule.
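
A rough sketch of what such a method might look like. Everything below is a
stand-in: DatumSketch, FetchScheduleSketch and AbstractFetchScheduleSketch are
hypothetical simplifications, not the real CrawlDatum, FetchSchedule or
AbstractFetchSchedule, and the formula assumes that the stored fetch time is
the next scheduled fetch and that the interval is an int number of seconds.

```java
// Hypothetical stand-in for CrawlDatum; only the two fields used here.
class DatumSketch {
    private final long fetchTime;     // next scheduled fetch, ms since epoch
    private final int fetchInterval;  // re-fetch interval, in seconds (assumed)

    DatumSketch(long fetchTime, int fetchInterval) {
        this.fetchTime = fetchTime;
        this.fetchInterval = fetchInterval;
    }

    long getFetchTime() { return fetchTime; }
    int getFetchInterval() { return fetchInterval; }
}

// Hypothetical stand-in for the FetchSchedule interface.
interface FetchScheduleSketch {
    /**
     * Calculates (rather than looks up) the last fetch time.
     * Assumes fetchTime == lastFetchTime + fetchInterval * 1000, i.e. the
     * datum was not rescheduled by any other rule since the last fetch.
     */
    long calculateLastFetchTime(DatumSketch datum);
}

// Hypothetical stand-in for AbstractFetchSchedule.
abstract class AbstractFetchScheduleSketch implements FetchScheduleSketch {
    @Override
    public long calculateLastFetchTime(DatumSketch datum) {
        // 1000L forces long arithmetic, so the product cannot overflow an int.
        return datum.getFetchTime() - datum.getFetchInterval() * 1000L;
    }
}
```

The "calculate" prefix signals that the result is derived from two stored
fields and is only as good as the assumption stated in the javadoc.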

Re: datum.setFetchTime - IMHO it's a premature optimization; this expression is 
used just twice in the whole code base.


> CrawlDbMerger: wrong computation of last fetch time
> ---------------------------------------------------
>
>                 Key: NUTCH-532
>                 URL: https://issues.apache.org/jira/browse/NUTCH-532
>             Project: Nutch
>          Issue Type: Bug
>            Reporter: Emmanuel Joke
>            Assignee: Emmanuel Joke
>             Fix For: 1.0.0
>
>         Attachments: NUTCH-532.patch
>
>
> CrawlDbMerger.reduce analyses the last fetch time of each record and keeps 
> the more recent record.
> This comparison is based on a FetchInterval in days: resTime = 
> res.getFetchTime() - Math.round(res.getFetchInterval() * 3600 * 24 * 1000);
> The problem was not really noticeable, because the Math.round method returns 
> Integer.MAX_VALUE, i.e. about 25 days in milliseconds.
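
The saturation described in the issue can be reproduced in isolation. This is
a minimal sketch, not the code from the attached patch: the class and method
names are invented, and the seconds-based variant is only one possible way to
write the corrected expression.

```java
// Hypothetical demo class, not part of Nutch.
class RoundOverflowDemo {
    // The expression from the issue, applied to an interval that is already
    // in seconds. float * int stays float, and Math.round(float) returns an
    // int that saturates at Integer.MAX_VALUE for large arguments.
    static long daysBasedLastFetch(long fetchTime, float fetchInterval) {
        return fetchTime - Math.round(fetchInterval * 3600 * 24 * 1000);
    }

    // Assumed alternative: treat the interval as seconds, use long arithmetic.
    static long secondsBasedLastFetch(long fetchTime, float fetchInterval) {
        return fetchTime - (long) fetchInterval * 1000L;
    }

    public static void main(String[] args) {
        long fetchTime = 1_000_000_000_000L; // arbitrary time, ms since epoch
        float thirtyDays = 2_592_000f;       // 30 days, in seconds

        // Both lines print 2147483647: every interval collapses to
        // Integer.MAX_VALUE milliseconds, a little under 25 days.
        System.out.println(fetchTime - daysBasedLastFetch(fetchTime, thirtyDays));
        System.out.println(fetchTime - daysBasedLastFetch(fetchTime, 2 * thirtyDays));

        // Prints 2592000000: the real 30 days in milliseconds.
        System.out.println(fetchTime - secondsBasedLastFetch(fetchTime, thirtyDays));
    }
}
```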


