Hi Doug,

I'm afraid there are still deeper causes of this bug. With your fix applied, it doesn't happen as frequently as before, but it still happens after inserting a few hundred gigabytes of data. We need to fix this because the maintenance task is currently the bottleneck of the Range Server.
Actually, Range Server workers can accept updates much faster than the maintenance task can compact them, and that makes range servers unreliable. If we feed Hypertable from MapReduce tasks, the range servers soon fill up with over-sized ranges waiting for compaction. The situation gets worse and worse over time because the workers keep accepting updates without knowing that the maintenance tasks are seriously lagging and that memory will soon run out. In fact, in our application range servers die several times a week from out-of-memory, and that makes operations painful because Hypertable doesn't have usable auto-recovery functionality yet. To make range servers more reliable, we need a mechanism to slow clients down.

On the other hand, why should compactions be handled by background maintenance tasks at all? IMHO, if we did compactions directly in RangeServer::update(), a lot of trouble could be saved. It wouldn't block the client initiating the current update, as long as the response message is sent before the compaction starts. Upcoming updates wouldn't block either, because no lock is needed while doing the compaction, so other workers can handle them. The only situation that could block client updates is when all the workers are busy doing compactions, and that is exactly when clients should slow down. I've appended a rough sketch of what I mean below the quoted thread.

What do you think?

Donald

On Dec 4, 9:32 am, "Doug Judd" <[email protected]> wrote:
> Hi Donald,
>
> I've reproduced this problem and have checked in a fix to the 'next'
> branch. This was introduced with the major overhaul. I have added a
> multiple maintenance thread system test to prevent this from happening
> in the future.
>
> BTW, if you do pull the 'next' branch, it has a number of changes that
> make it incompatible with the previous versions. You'll have to start
> with a clean database. The 'next' branch will be compatible with
> 0.9.1.0, which should get released tomorrow.
>
> - Doug
>
> On Tue, Dec 2, 2008 at 7:10 PM, donald <[email protected]> wrote:
>
> > Hi Doug,
> >
> > I think it's better to open a new thread on this topic :)
> >
> > The multiple maintenance thread crash is easy to reproduce: just set
> > Hypertable.RangeServer.MaintenanceThreads=2, start all servers locally
> > on a single node, and run random_write_test 10000000000. The range
> > server will crash in a minute. But the reason is rather hard to track.
> >
> > What we know so far:
> > 1. The bug was introduced in version 0.9.0.11; earlier versions don't
> > have this problem.
> > 2. According to RangeServer.log, the crash usually happens when two
> > adjacent ranges are both splitting in two maintenance threads
> > concurrently. If we forbid this behavior by modifying the
> > MaintenanceTaskQueue code, the crash goes away, but the reason is
> > unknown. (Pheonix discovered this.)
> > 3. Sometimes the Range Server fails at
> > HT_EXPECT(m_immutable_cache_ptr, Error::FAILED_EXPECTATION) in
> > AccessGroup::run_compaction(). m_immutable_cache_ptr is set to 0 in
> > multiple places with m_mutex locked, but it is not always checked in a
> > locked environment, which is questionable.
> >
> > Do you have any ideas based on these facts?
> >
> > Donald
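Here is the sketch I mentioned above. This is not actual Hypertable code; all the names (Range, Request, apply_update, needs_compaction, etc.) are made up for illustration. It just shows the flow I'm proposing: the worker applies the update, sends the response first, and only then compacts inline on the same thread, with the cell cache swapped out under the lock so concurrent updates are never blocked.

// Sketch only -- illustrative names, not real Hypertable identifiers.
#include <cstddef>
#include <cstdio>
#include <mutex>
#include <string>
#include <vector>

struct Request {
  std::vector<std::string> cells;
  void send_response() { std::puts("response sent"); }  // reply to client
};

class Range {
  std::mutex m_mutex;
  std::vector<std::string> m_cell_cache;           // in-memory updates
  static const std::size_t kCompactThreshold = 4;  // tiny, for the demo
public:
  void apply_update(Request &req) {
    std::lock_guard<std::mutex> lock(m_mutex);
    m_cell_cache.insert(m_cell_cache.end(), req.cells.begin(), req.cells.end());
  }
  bool needs_compaction() {
    std::lock_guard<std::mutex> lock(m_mutex);
    return m_cell_cache.size() >= kCompactThreshold;
  }
  void run_compaction() {
    // Swap the cache out under the lock, then write it out with no lock
    // held, so other workers keep serving updates in the meantime.
    std::vector<std::string> frozen;
    {
      std::lock_guard<std::mutex> lock(m_mutex);
      frozen.swap(m_cell_cache);
    }
    std::printf("compacting %zu cells to a CellStore\n", frozen.size());
  }
};

// The worker's update path: respond first, then compact inline.
void handle_update(Range &range, Request &req) {
  range.apply_update(req);
  req.send_response();        // the client is not blocked by what follows
  if (range.needs_compaction())
    range.run_compaction();   // runs on this worker thread
}

int main() {
  Range range;
  for (int i = 0; i < 10; ++i) {
    Request req{{"row" + std::to_string(i)}};
    handle_update(range, req);
  }
}

The nice property is that back-pressure comes for free: if every worker is busy compacting, new requests simply queue up, which throttles clients at precisely the moment they should slow down.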
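And regarding point 3 in my earlier message quoted above: the kind of race I suspect, and the locked check that would avoid it, looks roughly like this. Again, this is a simplified stand-in, not the real AccessGroup; std::shared_ptr and runtime_error here merely stand in for Hypertable's smart pointer and HT_EXPECT.

#include <cstdio>
#include <memory>
#include <mutex>
#include <stdexcept>

struct CellCache {};
typedef std::shared_ptr<CellCache> CellCachePtr;

class AccessGroup {
  std::mutex m_mutex;
  CellCachePtr m_immutable_cache_ptr;
public:
  void clear_immutable_cache() {
    std::lock_guard<std::mutex> lock(m_mutex);  // writers hold m_mutex
    m_immutable_cache_ptr.reset();              // sets the pointer to 0
  }
  void run_compaction() {
    // Suspect: reading m_immutable_cache_ptr without m_mutex can race
    // with reset() above when two maintenance threads work on adjacent
    // ranges. Safer: copy the pointer under the lock, then check the copy.
    CellCachePtr cache;
    {
      std::lock_guard<std::mutex> lock(m_mutex);
      cache = m_immutable_cache_ptr;
    }
    if (!cache)  // stand-in for HT_EXPECT(..., Error::FAILED_EXPECTATION)
      throw std::runtime_error("FAILED_EXPECTATION");
  }
};

int main() {
  AccessGroup ag;
  ag.clear_immutable_cache();
  try { ag.run_compaction(); }   // throws here, since the cache is null
  catch (const std::exception &e) { std::puts(e.what()); }
}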
