Hi Doug,

I'm afraid there are still deeper causes of this bug. With your fix
applied it doesn't happen as frequently as before, but it still occurs
after inserting a few hundred gigabytes of data. We need to fix this
because the maintenance task is currently the bottleneck of the Range
Server.

The underlying problem is that Range Server workers can accept updates
much faster than the maintenance task can compact them, and that makes
range servers unreliable. If we feed Hypertable from MapReduce tasks,
the range servers soon fill up with over-sized ranges waiting for
compaction. The situation gets worse over time because workers keep
accepting updates without knowing that the maintenance tasks are
seriously lagging and memory will soon run out. In our application,
range servers die several times a week from out-of-memory errors, and
each crash means heavy manual work because Hypertable doesn't have
usable auto-recovery functionality yet. To make range servers more
reliable, we need a mechanism to slow down incoming updates when
compaction falls behind.
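
To make the idea concrete, here is a minimal sketch of the kind of
back-pressure check I have in mind. UpdateThrottle, admit(), release()
and the soft limit are made-up names for illustration, not existing
Hypertable interfaces:

  #include <atomic>
  #include <cstdint>

  // Hypothetical throttle tracking how much uncompacted data is buffered.
  class UpdateThrottle {
  public:
    explicit UpdateThrottle(uint64_t soft_limit)
      : m_soft_limit(soft_limit) {}

    // Called on the update path before buffering new data.  Returns
    // false when uncompacted memory is over the limit, meaning the
    // worker should reply with a retryable "busy" error instead of
    // accepting the update.
    bool admit(uint64_t update_size) {
      if (m_uncompacted.load() + update_size > m_soft_limit)
        return false;
      m_uncompacted.fetch_add(update_size);
      return true;
    }

    // Called by the maintenance task after a compaction frees memory.
    void release(uint64_t freed) { m_uncompacted.fetch_sub(freed); }

  private:
    std::atomic<uint64_t> m_uncompacted{0};
    const uint64_t m_soft_limit;
  };

With something like this in place, a client that gets the "busy" error
can back off and retry instead of pushing the server into an
out-of-memory crash.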

On the other hand, why should compactions be handled by background
maintenance tasks at all? IMHO, if we did compactions directly in
RangeServer::update(), a lot of trouble could be saved. It wouldn't
block the client initiating the current update, as long as the response
message is sent before the compaction starts. Subsequent updates
wouldn't block either, because no lock is needed during compaction and
other workers can handle them. The only situation that could block
client updates is when all workers are busy doing compactions, which is
exactly when clients should slow down.
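
Here is a rough sketch of the flow I'm imagining. The types below are
stand-ins (Range, ResponseCallback, handle_update); the real
RangeServer::update() signature and the compaction interfaces differ:

  #include <cstdint>

  // Stand-in interfaces, just enough to show the control flow.
  struct Range {
    void add(const char *buf, uint32_t len); // buffer the update
    bool needs_compaction() const;           // e.g. cache over a threshold
    void run_compaction();                   // merge cache into CellStores
  };

  struct ResponseCallback {
    void response_ok();                      // send the reply to the client
  };

  void handle_update(ResponseCallback &cb, Range &range,
                     const char *buf, uint32_t len) {
    range.add(buf, len);  // buffer the update as today

    // Acknowledge the client first, so the client that triggered the
    // compaction never waits for it.
    cb.response_ok();

    // Compact inline on this worker thread.  No lock is held across
    // the compaction, so other workers keep serving updates; if every
    // worker is busy compacting, new requests queue up, which is
    // exactly when clients should slow down.
    if (range.needs_compaction())
      range.run_compaction();
  }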

What do you think?

Donald

On Dec 4, 9:32 am, "Doug Judd" <[email protected]> wrote:
> Hi Donald,
>
> I've reproduced this problem and have checked in a fix to the 'next'
> branch.  This was introduced with the major overhaul.  I have added a
> multiple maintenance thread system test to prevent this from happening in
> the future.
>
> BTW, if you do pull the 'next' branch, it has a number of changes that make
> it incompatible with the previous versions.  You'll have to start with a
> clean database.  The 'next' branch will be compatible with 0.9.1.0 which
> should get released tomorrow.
>
> - Doug
>
> On Tue, Dec 2, 2008 at 7:10 PM, donald <[email protected]> wrote:
>
> > Hi Doug,
>
> > I think it's better to open a new thread on this topic :)
>
> > The multiple maintenance thread crash is easy to reproduce: just set
> > Hypertable.RangeServer.MaintenanceThreads=2, start all servers locally
> > on a single node and run random_write_test 10000000000. The range
> > server will crash in a minute. But the reason is sort of hard to
> > track.
>
> > What we know so far:
> > 1. The bug was introduced in version 0.9.0.11; earlier versions
> > don't have this problem.
> > 2. According to RangeServer.log, the crash usually happens when two
> > adjacent ranges are both splitting in two maintenance threads
> > concurrently. If we forbid this behavior by modifying the
> > MaintenanceTaskQueue code, the crash goes away, but the root cause
> > remains unknown. (Pheonix discovered this.)
> > 3. Sometimes the Range Server fails at
> > HT_EXPECT(m_immutable_cache_ptr, Error::FAILED_EXPECTATION) in
> > AccessGroup::run_compaction(). m_immutable_cache_ptr is set to 0 in
> > multiple places with m_mutex locked, but it is not always read under
> > that lock, which looks suspicious.
>
> > Do you have any idea based on these facts?
>
> > Donald