With the help of Earle Ady, we've found and fixed the large load corruption problem with the 0.9.2.2 release. To get the fixed version, please pull the latest code from the git repository <http://code.google.com/p/hypertable/wiki/SourceCode?tm=4>. We'll be releasing 0.9.2.3 soon.
Here's a summary of the problem: With the fix of issue 246 <http://code.google.com/p/hypertable/issues/detail?id=246>, compactions are now happening regularly, as they should. However, this has added substantial load to the system. When a range split and the Master was notified of the newly split-off range, the Master selected (round-robin) a new RangeServer to own the range. Due to the increased load on the system and a 30-second hardcoded timeout in the Master, the RangeServer::load_range() command was timing out (it was taking 32 to 37 seconds). This timeout was reported back to the originating RangeServer, which paused for fifteen seconds and tried again. But on the second attempt to notify the Master of the newly split-off range, the Master would (round-robin) select another RangeServer and invoke RangeServer::load_range() on that (different) server. This had the effect of the same range being loaded by three different RangeServers, which was wreaking havoc with the system.

There were two fixes for this problem:

1. The hardcoded timeout was removed, and (almost) all timeouts in the system are now based on the "Hypertable.Request.Timeout" property, which has a default value of 180 seconds.

2. An interim fix was put in place in the Master: upon RangeServer::load_range() failure, the Master will remember which RangeServer it attempted to do the load on. The next time it gets notified and attempts to load the same range, it will choose that same RangeServer. If it gets a RANGE_ALREADY_LOADED error back, it will interpret that as success. The reason this fix is interim is that it does not persist the Range-to-RangeServer mapping information, so if the Master were to fail at an inopportune time and come back up, we'd be subject to the same failure. This will get fixed with Issue 74 - Master directed RangeServer recovery <http://code.google.com/p/hypertable/issues/detail?id=79>, since the Master will have a meta-log and will be able to persist this mapping as re-constructible state information.

After we fixed this problem, the next problem that Earle ran into was that the RangeServer was exhausting memory and crashing. To fix this, we added the following property to the hypertable.cfg file on the machine that was doing the LOAD DATA INFILE:

  Hypertable.Lib.Mutator.FlushDelay=100

Keep this in mind if you encounter the same problem.
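For anyone hitting the same memory problem, here is roughly what the relevant entries might look like in hypertable.cfg on the machine doing the LOAD DATA INFILE (this is only a sketch: the FlushDelay line is the one given above, the comments are mine, and no Hypertable.Request.Timeout entry is needed unless you want to override the new 180-second default):

  # hypertable.cfg on the machine performing the LOAD DATA INFILE
  #
  # Delay each mutator buffer flush slightly so a bulk load doesn't
  # overwhelm the RangeServers and exhaust their memory.
  Hypertable.Lib.Mutator.FlushDelay=100
  #
  # Request timeouts are now governed by Hypertable.Request.Timeout
  # (default 180 seconds, per the notes above); no entry is needed
  # here unless you want to override that default.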
- Doug