This message describes how a self-adjusting fetch interval can be implemented. It presents an idea, and not Nutch source code, but it does end in a suggestion to change the datatype for the fetch interval from a one byte unsigned integer to a four byte float.
After my questions about "News harvesting" to nutch-general on June 23 and Doug's answer on June 29, I decided to learn more about what Nutch puts in its database. Doug's "this is in the database, this is not yet in the database" led me to believe the database format is an open laboratory. The lack of documentation was frustrating, so I put my notes in the wiki hoping that this might help others, http://www.nutch.org/cgi-bin/twiki/view/Main/NutchFileFormats So far I have only looked at Nutch 0.4 and I understand the format might change in 0.5, but it still seems complicated to change this format all too often. One of my applications is to keep track of when I first discovered a page, similar to what the Internet Archive does, so starting over with a clean database is not an option. Realizing that Nutch cannot immediately fill my need for a flexible search engine laboratory, I invested some time into rewriting my crontab/shell/wget script for news harvesting in Perl, and moved the fetch list (which was hard coded in the shell script) to the MySQL database that my application already uses. In this, I also implemented the self-adjustment of the fetch interval that I asked for in my nutch-general posting. I found it useful to represent the fetch interval in seconds. When fetching a page, the program determines if the page was updated. In my case, this means finding new outgoing links. I don't know if it could be useful to consider any change in the contents as an update, or if that would mean every fetch is an update. If the page was not updated, the fetch interval is increased by 20%. If it was updated, the fetch interval is reduced by 50%. After this, a constant value is added to the fetch interval, in my case 1800 seconds. In my case, a failure to fetch a page is treated as no change, and there is no upper limit to the fetch interval. This means broken or gone pages will be retrieved more and more seldom, and this seems like the sound thing to do. Here is the SQL statement, where 'changed' is 1 for true and 0 for false: update fetchlist set lastvisited = now(), fetchinterval = (1.20 - 0.70 * changed) * fetchinterval + 1800 where link = thisurl; Adding the constant has the effect of setting a floor for the fetch interval at twice the value of that constant, or 3600 seconds. If the page is updated every hour, the previous fetch interval will be cut in half but when the 1800 seconds are added, it will be back at 3600 seconds. For pages that are seldom updated, these 1800 seconds have little effect. For a page that is updated every 30 days, the fetch interval will adjust like this: fetch updated? change fetch interval ----------+-------------+------------------+-------------- 10 days day 1 1 day ago -50 % or -5 days 5 days day 6 no +20 % or +1 days 6 days day 12 no +20 % or +1.2 days 7.2 days day 19.2 no +20 % or +1.4 days 8.6 days day 27.8 no +20 % or +1.7 days 10.3 days day 38.1 8.1 days ago -50 % or -5.1 days 5.2 days day 43.3 no +20 % or +1 days 6.2 days day 49.5 no +20 % or +1.2 days 7.4 days day 56.9 no +20 % or +1.5 days 8.9 days day 65.8 5.8 days ago -50 % or -4.4 days 4.5 days day 70.3 no +20 % or +0.9 days 5.4 days day 75.7 no +20 % or +1.1 days 6.5 days day 82.2 no +20 % or +1.3 days 7.8 days day 90.0 0.0 days ago -50 % or -3.9 days 3.9 days day 93.9 no +20 % or +0.8 days 4.7 days day 98.6 no +20 % or +0.9 days 5.6 days day 104.2 no +20 % or +1.1 days 6.7 days day 110.9 no +20 % or +1.3 days 8.0 days day 118.9 no +20 % or +1.6 days 9.6 days day 128.5 8.5 days ago -50 % or -4.8 days 4.8 days ----------+-------------+------------------+-------------- In this example where the page is updated every 30 days, it will be fetched five times more often and the delay from update to fetch is less than nine days or one third of the update interval. This algorithm seems to me like an OK compromise between fast news discovery and economic fetching. The algorithm requires a minimum of stored historic information (it's all in the fetch interval) and computational complexity. However, I doubt that the one byte value that Nutch 0.4 uses for the fetch interval (in days) is optimal. If the database format is about to change in version 0.5, a change of the fetch interval to a 4 byte short float should be considered. -- Lars Aronsson ([EMAIL PROTECTED]) Aronsson Datateknik - http://aronsson.se/ ------------------------------------------------------- This SF.Net email sponsored by Black Hat Briefings & Training. Attend Black Hat Briefings & Training, Las Vegas July 24-29 - digital self defense, top technical experts, no vendor pitches, unmatched networking opportunities. Visit www.blackhat.com _______________________________________________ Nutch-developers mailing list [EMAIL PROTECTED] https://lists.sourceforge.net/lists/listinfo/nutch-developers
