This message describes how a self-adjusting fetch interval can be
implemented.  It presents an idea, and not Nutch source code, but it
does end in a suggestion to change the datatype for the fetch interval
from a one byte unsigned integer to a four byte float.

After my questions about "News harvesting" to nutch-general on June 23
and Doug's answer on June 29, I decided to learn more about what Nutch
puts in its database.  Doug's "this is in the database, this is not
yet in the database" led me to believe the database format is an open
laboratory.  The lack of documentation was frustrating, so I put my
notes in the wiki hoping that this might help others,
http://www.nutch.org/cgi-bin/twiki/view/Main/NutchFileFormats

So far I have only looked at Nutch 0.4 and I understand the format
might change in 0.5, but it still seems complicated to change this
format all too often.  One of my applications is to keep track of when
I first discovered a page, similar to what the Internet Archive does,
so starting over with a clean database is not an option.

Realizing that Nutch cannot immediately fill my need for a flexible
search engine laboratory, I invested some time into rewriting my
crontab/shell/wget script for news harvesting in Perl, and moved the
fetch list (which was hard coded in the shell script) to the MySQL
database that my application already uses.  In this, I also
implemented the self-adjustment of the fetch interval that I asked for
in my nutch-general posting.

I found it useful to represent the fetch interval in seconds.  When
fetching a page, the program determines if the page was updated.  In
my case, this means finding new outgoing links.  I don't know if it
could be useful to consider any change in the contents as an update,
or if that would mean every fetch is an update.  If the page was not
updated, the fetch interval is increased by 20%.  If it was updated,
the fetch interval is reduced by 50%.  After this, a constant value is
added to the fetch interval, in my case 1800 seconds.  In my case, a
failure to fetch a page is treated as no change, and there is no upper
limit to the fetch interval.  This means broken or gone pages will be
retrieved more and more seldom, and this seems like the sound thing to
do.  Here is the SQL statement, where 'changed' is 1 for true and 0
for false:

  update fetchlist set
    lastvisited = now(),
    fetchinterval = (1.20 - 0.70 * changed) * fetchinterval + 1800
  where link = thisurl;

Adding the constant has the effect of setting a floor for the fetch
interval at twice the value of that constant, or 3600 seconds.  If the
page is updated every hour, the previous fetch interval will be cut in
half but when the 1800 seconds are added, it will be back at 3600
seconds.  For pages that are seldom updated, these 1800 seconds have
little effect.

For a page that is updated every 30 days, the fetch interval will
adjust like this:

   fetch      updated?      change             fetch interval
   ----------+-------------+------------------+--------------
                                                10   days
   day   1    1 day ago     -50 % or -5 days     5   days
   day   6    no            +20 % or +1 days     6   days
   day  12    no            +20 % or +1.2 days   7.2 days
   day  19.2  no            +20 % or +1.4 days   8.6 days
   day  27.8  no            +20 % or +1.7 days  10.3 days
   day  38.1  8.1 days ago  -50 % or -5.1 days   5.2 days
   day  43.3  no            +20 % or +1 days     6.2 days
   day  49.5  no            +20 % or +1.2 days   7.4 days
   day  56.9  no            +20 % or +1.5 days   8.9 days
   day  65.8  5.8 days ago  -50 % or -4.4 days   4.5 days
   day  70.3  no            +20 % or +0.9 days   5.4 days
   day  75.7  no            +20 % or +1.1 days   6.5 days
   day  82.2  no            +20 % or +1.3 days   7.8 days
   day  90.0  0.0 days ago  -50 % or -3.9 days   3.9 days
   day  93.9  no            +20 % or +0.8 days   4.7 days
   day  98.6  no            +20 % or +0.9 days   5.6 days
   day 104.2  no            +20 % or +1.1 days   6.7 days
   day 110.9  no            +20 % or +1.3 days   8.0 days
   day 118.9  no            +20 % or +1.6 days   9.6 days
   day 128.5  8.5 days ago  -50 % or -4.8 days   4.8 days
   ----------+-------------+------------------+--------------

In this example where the page is updated every 30 days, it will be
fetched five times more often and the delay from update to fetch is
less than nine days or one third of the update interval.

This algorithm seems to me like an OK compromise between fast news
discovery and economic fetching.  The algorithm requires a minimum of
stored historic information (it's all in the fetch interval) and
computational complexity.  However, I doubt that the one byte value
that Nutch 0.4 uses for the fetch interval (in days) is optimal. If
the database format is about to change in version 0.5, a change of the
fetch interval to a 4 byte short float should be considered.


-- 
  Lars Aronsson ([EMAIL PROTECTED])
  Aronsson Datateknik - http://aronsson.se/



-------------------------------------------------------
This SF.Net email sponsored by Black Hat Briefings & Training.
Attend Black Hat Briefings & Training, Las Vegas July 24-29 - 
digital self defense, top technical experts, no vendor pitches, 
unmatched networking opportunities. Visit www.blackhat.com
_______________________________________________
Nutch-developers mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/nutch-developers

Reply via email to