With the help of Earle Ady, we've found and fixed the large load corruption problem with the 0.9.2.2 release. To get the fixed version, please pull the latest code from the git repository <http://code.google.com/p/hypertable/wiki/SourceCode?tm=4>. We'll be releasing 0.9.2.3 soon.
Here's a summary of the problem: With the fix of issue 246 <http://code.google.com/p/hypertable/issues/detail?id=246>, compactions are now happening regularly, as they should. However, this has added substantial load to the system. When a range split and the Master was notified of the newly split-off range, the Master selected (round-robin) a new RangeServer to own the range. Due to the increased load on the system and a 30-second hardcoded timeout in the Master, the RangeServer::load_range() command was timing out (it was taking 32 to 37 seconds). This timeout was reported back to the originating RangeServer, which paused for fifteen seconds and tried again. But on the second attempt to notify the Master of the newly split-off range, the Master would (round-robin) select another RangeServer and invoke RangeServer::load_range() on that (different) server. This had the effect of the same range being loaded by three different RangeServers, which was wreaking havoc with the system.

There were two fixes for this problem:

1. The hardcoded timeout was removed, and (almost) all timeouts in the system are now based on the "Hypertable.Request.Timeout" property, which has a default value of 180 seconds.

2. An interim fix was put in place in the Master: upon RangeServer::load_range() failure, the Master will remember which RangeServer it attempted to do the load on. The next time it gets notified and attempts to load the same range, it will choose that same RangeServer. If it gets a RANGE_ALREADY_LOADED error back, it will interpret that as success. The reason this fix is interim is that it does not persist the Range-to-RangeServer mapping information, so if the Master were to fail at an inopportune time and come back up, we'd be subject to the same failure. This will get fixed with Issue 74 - Master directed RangeServer recovery <http://code.google.com/p/hypertable/issues/detail?id=79>, since the Master will have a meta-log and will be able to persist this mapping as re-constructible state information.

After we fixed this problem, the next problem that Earle ran into was that the RangeServer was exhausting memory and crashing. To fix this, we added the following property to the hypertable.cfg file on the machine that was doing the LOAD DATA INFILE:

  Hypertable.Lib.Mutator.FlushDelay=100

Keep this in mind if you encounter the same problem.
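For anyone hitting the same memory problem, here is roughly what the relevant entries might look like in hypertable.cfg on the machine doing the LOAD DATA INFILE (this is only a sketch: the FlushDelay line is the one given above, the comments are mine, and no Hypertable.Request.Timeout entry is needed unless you want to override the new 180-second default):

  # hypertable.cfg on the machine performing the LOAD DATA INFILE
  #
  # Delay each mutator buffer flush slightly so a bulk load doesn't
  # overwhelm the RangeServers and exhaust their memory.
  Hypertable.Lib.Mutator.FlushDelay=100
  #
  # Request timeouts are now governed by Hypertable.Request.Timeout
  # (default 180 seconds, per the notes above); no entry is needed
  # here unless you want to override that default.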
- Doug