Regarding this duplicated assignment issue.
In my view, neither the interim fix nor the persistence fix may be
robust enough.

The following MSC charts are my proposal.
I am not familiar with the latest Hypertable code (I studied 0.9.0.7),
so if I am wrong, please correct me.

chart 1: successful assignment case; we should design an acknowledgement
mechanism.

origRS --------------------- Master --------------------- RS1
       --split range notify-->
                              select a RS
                              ------assign to RS1------->
                              <--------succ ack----------
       <------succ ack-------
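To make the idea concrete, here is a rough C++ sketch of chart 1 (a
minimal sketch; all class and method names are invented for
illustration, not the actual Hypertable API):

// Sketch of chart 1 (hypothetical names): the Master acks origRS only
// after the target RangeServer has acked the load.
#include <iostream>
#include <string>

struct RangeId { std::string table, end_row; };

// Stand-in for an RPC proxy to a RangeServer.
struct RangeServerProxy {
  std::string name;
  bool load_range(const RangeId &) { return true; }   // returns succ ack
  void ack_split(const RangeId &r) {
    std::cout << name << ": got succ ack for range of " << r.table << "\n";
  }
};

int main() {
  RangeId range{"Test", "row999"};
  RangeServerProxy orig_rs{"origRS"}, rs1{"RS1"};
  // Master side: forward the assignment, then propagate the ack.
  if (rs1.load_range(range))     // wait for succ ack from RS1
    orig_rs.ack_split(range);    // only then send succ ack to origRS
  return 0;
}

The point is simply that the succ ack to origRS is sent only after RS1
has acked the load, so origRS never needs to re-notify a range that was
actually assigned.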


chart 2: failure/timeout assignment case

origRS ------------- Master ------------- RS1 ------------- RS2
       --split range notify-->
                              select a RS
                              ----assign to RS1---->
                              timeout or failed
                              --retry 2 times assign-->
                              still timeout or failed
                              select another RS
                              -------deassign------>
                              ---------assign to another RS2--------->
                              still timeout or failed
       <--report failure--
...................(another round)...................
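A rough sketch of the chart 2 retry loop (again with invented names; the
retry count of 2 is just my suggestion):

// Sketch of chart 2 (hypothetical names): retry the same RangeServer a
// couple of times, then deassign and move on to another one; report
// failure to origRS only if every candidate fails.
#include <iostream>
#include <string>
#include <vector>

struct RangeId { std::string table, end_row; };

struct RangeServerProxy {
  std::string name;
  bool healthy;                                    // simulates timeout/failure
  bool load_range(const RangeId &) { return healthy; }
  void deassign_range(const RangeId &) {}          // forget the assignment
};

// Master-side assignment loop.
bool assign_with_retry(std::vector<RangeServerProxy> &servers,
                       const RangeId &range) {
  const int kRetries = 2;
  for (auto &rs : servers) {
    for (int attempt = 0; attempt <= kRetries; ++attempt)
      if (rs.load_range(range))
        return true;                               // succ ack
    rs.deassign_range(range);                      // clean up before moving on
    std::cout << rs.name << " failed; selecting another RS\n";
  }
  return false;                                    // report failure to origRS
}

int main() {
  std::vector<RangeServerProxy> servers{{"RS1", false}, {"RS2", true}};
  RangeId range{"Test", "row999"};
  std::cout << (assign_with_retry(servers, range)
                    ? "assigned\n"
                    : "report failure to origRS\n");
  return 0;
}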

chart 3: a mechanism to avoid duplicated or wrong assignment

origRS --------------------- Master --------------------- RS1
                              <--------succ ack----------
                              check, but find that
                              the range is in RS2
                              ---------deassign--------->
                              <--------succ ack----------
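And a sketch of the chart 3 check (invented names; the owner map would
of course have to be persisted, e.g. in the planned Master meta-log, to
survive a Master restart):

// Sketch of chart 3 (hypothetical names): when a succ ack arrives, the
// Master checks its range-to-RangeServer map; if the range is already
// owned by a different server, it deassigns the duplicate.
#include <iostream>
#include <map>
#include <string>

using RangeKey = std::string;                      // e.g. "Test[..row999]"

class Master {
public:
  // Called when some RangeServer acks a successful load of `range`.
  void handle_succ_ack(const RangeKey &range, const std::string &rs) {
    auto it = owner_.find(range);
    if (it != owner_.end() && it->second != rs) {
      // Range is already in another RS: tell `rs` to deassign it.
      std::cout << "deassign " << range << " on " << rs
                << " (owned by " << it->second << ")\n";
      return;
    }
    owner_[range] = rs;                            // record/confirm ownership
    std::cout << range << " confirmed on " << rs << "\n";
  }
private:
  std::map<RangeKey, std::string> owner_;          // should be persisted
};

int main() {
  Master master;
  master.handle_succ_ack("Test[..row999]", "RS2"); // first ack wins
  master.handle_succ_ack("Test[..row999]", "RS1"); // duplicate -> deassign
  return 0;
}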


On Mar 21, 1:41 am, Doug Judd <[email protected]> wrote:
> P.S. The memory exhaustion problem will be fixed in the 0.9.2.4 release.
>
> On Fri, Mar 20, 2009 at 10:37 AM, Doug Judd <[email protected]> wrote:
> > With the help of Earle Ady, we've found and fixed the large load corruption
> > problem with the 0.9.2.2 release.  To get the fixed version, please pull the
> > latest code from the git repository
> > <http://code.google.com/p/hypertable/wiki/SourceCode?tm=4>.
> > We'll be releasing 0.9.2.3 soon.
>
> > Here's a summary of the problem:
>
> > With the fix of issue 246
> > <http://code.google.com/p/hypertable/issues/detail?id=246>,
> > compactions are now happening regularly as they should.  However, this has
> > added substantial load on the system.  When a range split and the master was
> > notified of the newly split-off range, the master selected (round-robin) a
> > new RangeServer to own the range.  However, due to the increased load on the
> > system and a 30 second hardcoded timeout in the Master, the
> > RangeServer::load_range() command was timing out (it was taking 32 to 37
> > seconds).  This timeout was reported back to the originating RangeServer,
> > which paused for fifteen seconds and tried it again.  But on the second
> > attempt to notify the Master of the newly split-off range, the Master would
> > (round-robin) select another RangeServer and invoke
> > RangeServer::load_range() on that (different) server.  This had the effect
> > of the same range being loaded by three different RangeServers, which was
> > wreaking havoc with the system.  There were two fixes for this problem:
>
> > 1. The hardcoded timeout was removed and (almost) all timeouts in the
> > system are based on the "Hypertable.Request.Timeout" property which now has
> > a default value of 180 seconds.
>
> > 2. An interim fix was put in place in the Master where, upon
> > RangeServer::load_range() failure, the Master will remember what RangeServer
> > it attempted to do the load on.  The next time it gets notified and attempts
> > to load the same range, it will choose the same RangeServer.  If it gets an
> > error message back, RANGE_ALREADY_LOADED, it will interpret that as
> > success.  The reason this fix is interim is because it does not persist the
> > Range-to-RangeServer mapping information, so if it were to fail at an
> > inopportune time and come back up, we'd be subject to the same failure.
> > This will get fixed with Issue 74 - Master directed RangeServer recovery
> > <http://code.google.com/p/hypertable/issues/detail?id=79>, since the
> > Master will have a meta-log and will be able to persist this mapping as
> > re-constructible state information.
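
As I understand it, this interim fix behaves roughly like the sketch
below (my own invented names, not the actual Hypertable code; note that
the attempted-load map lives only in memory, which is exactly the
robustness gap):

// Rough sketch of the interim fix as I understand it (invented names):
// remember which RangeServer a load was attempted on, retry the same
// one next time, and treat RANGE_ALREADY_LOADED as success.
#include <iostream>
#include <map>
#include <string>
#include <vector>

enum class LoadResult { OK, TIMEOUT, RANGE_ALREADY_LOADED };

class Master {
public:
  // Handle a "newly split-off range" notification from the originating RS.
  bool handle_split_notify(const std::string &range) {
    auto it = attempted_.find(range);
    std::string rs = (it != attempted_.end())
                         ? it->second                // retry the SAME server
                         : servers_[next_++ % servers_.size()];
    attempted_[range] = rs;                          // in memory only (interim!)
    LoadResult r = load_range(rs, range);
    if (r == LoadResult::RANGE_ALREADY_LOADED)       // earlier attempt worked
      r = LoadResult::OK;
    if (r != LoadResult::OK)
      return false;                                  // origRS will retry later
    attempted_.erase(range);
    return true;
  }
private:
  LoadResult load_range(const std::string &, const std::string &) {
    // Stand-in for the real RPC; the first call "times out", the second
    // finds the range already loaded on the same server.
    static int calls = 0;
    return (calls++ == 0) ? LoadResult::TIMEOUT
                          : LoadResult::RANGE_ALREADY_LOADED;
  }
  std::vector<std::string> servers_{"RS1", "RS2", "RS3"};
  size_t next_ = 0;
  std::map<std::string, std::string> attempted_;     // lost on Master restart
};

int main() {
  Master m;
  std::cout << m.handle_split_notify("Test[..row999]") << "\n";  // 0: timeout
  std::cout << m.handle_split_notify("Test[..row999]") << "\n";  // 1: same RS,
                                                                 // already loaded
  return 0;
}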
>
> > After we fixed this problem, the next problem that Earle ran into was that
> > the RangeServer was exhausting memory and crashing.  To fix this, we added
> > the following property to the hypertable.cfg file on the machine that was
> > doing the LOAD DATA INFILE:
>
> > Hypertable.Lib.Mutator.FlushDelay=100
>
> > Keep this in mind if you encounter the same problem.
>
> > - Doug