Hi Josh,

No problem. BTW, you might actually lose some data with this approach. It looks like data may not have come in strictly increasing order of row key (as can be seen by the sizes of some of the range directories). You'll probably lose some of the data that was submitted out of order.

- Doug
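For reference, here is a rough sketch of the cleanup procedure described in the quoted message below: find the newest CellStore under the busy range's directory, then remove commit-log fragments older than it. It assumes the stock "hadoop fs" CLI; the -ls output parsing, the LOG_DIR path, and the --apply flag are illustrative placeholders rather than anything shipped with Hypertable, and by default it only prints what it would remove.

#!/usr/bin/env python
# Sketch only: newest CellStore timestamp becomes the cutoff, and any
# commit-log fragment older than the cutoff is a candidate for removal.

import re
import subprocess
import sys

TABLE_DIR = "/hypertable/tables/X/default/AB2A0D28DE6B77FFDD6C72AF"
LOG_DIR = "/hypertable/servers/SERVER/log/user"   # placeholder path
APPLY = "--apply" in sys.argv                     # pass --apply to really delete

TS_RE = re.compile(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}")

def listing(path):
    # Yield (modification time, entry path) for each line of "hadoop fs -ls".
    # Field layout differs across Hadoop versions; this assumes the path is
    # the last column and the timestamp matches TS_RE.
    out = subprocess.check_output(["hadoop", "fs", "-ls", path], text=True)
    for line in out.splitlines():
        m = TS_RE.search(line)
        if m:                             # skips the "Found N items" header
            yield m.group(0), line.split()[-1]

# The newest CellStore's timestamp stands in for the creation time t
# (HDFS files are write-once, so modification time is close enough).
cellstores = [entry for entry in listing(TABLE_DIR) if "/cs" in entry[1]]
if not cellstores:
    sys.exit("no CellStore files found under " + TABLE_DIR)
t = max(ts for ts, _ in cellstores)
print("newest CellStore timestamp:", t)

# Remove (or just report) every log fragment strictly older than t;
# "YYYY-MM-DD HH:MM" strings compare correctly as plain strings.
for ts, frag in listing(LOG_DIR):
    if ts < t:
        if APPLY:
            subprocess.check_call(["hadoop", "fs", "-rm", frag])
        else:
            print("would remove", frag, "(modified", ts + ")")

As Doug notes above, anything written out of row-key order after that CellStore was flushed would still be lost, so treat this as a get-the-cluster-back-up measure rather than a safe cleanup.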
On Wed, Sep 10, 2008 at 5:18 PM, Joshua Taylor <[EMAIL PROTECTED]> wrote:

> Thanks Doug, I'll give it a try.
>
> On Wed, Sep 10, 2008 at 4:53 PM, Doug Judd <[EMAIL PROTECTED]> wrote:
>
>> Hi Josh,
>>
>> If you're just trying to get the system up and running and don't mind if you potentially lose some data, you could try this. Do a directory listing of the /hypertable/tables/X/default/AB2A0D28DE6B77FFDD6C72AF directory, find the newest CellStore file csNNNN, and remember its creation time t. Then, in the log/user/ directory of the server that is handling all of the load, delete all of the log fragments that have a creation time less than t. I think that should actually work without data loss.
>>
>> - Doug
>>
>> On Wed, Sep 10, 2008 at 4:37 PM, Joshua Taylor <[EMAIL PROTECTED]> wrote:
>>
>>> I'm still trying to get my Hypertable cluster running again. After seeing half the RangeServers die because they lost their Hyperspace session when loading with 5 concurrent clients, I decided to take Donald's advice and give the Master processes (Hypertable + Hyperspace) a dedicated node. Then I tried restarting the failed RangeServers. This time the one with the 350+ GB of commit logs spent 6 hours trying to recover before I noticed it had grown to 15 GB of memory (7 GB RSS). I shot it down since it was just thrashing at that point.
>>>
>>> So now I seem to have two problems:
>>>
>>> 1) Log cleanup doesn't seem to be working, so I have to replay 350+ GB when I restart.
>>>
>>> 2) When replaying the logs, I run out of memory.
>>>
>>> I've been trying to figure out #2, since I can no longer keep the servers running long enough to address #1. It looks like all compactions are deferred until the recovery is done. Commits get loaded into memory until the machine runs out, then boom. I don't have the best understanding of the recovery strategy, but I'd guess that fixing this problem would require some major surgery.
>>>
>>> One argument is that #2 isn't worth fixing: if #1 were working properly, the system wouldn't get itself into such a bad state, and the recovery could just assume there's enough memory most of the time.
>>>
>>> Most of the time is not all of the time, though. I can imagine some normal use cases where this problem would pop up:
>>>
>>> A) One server is falling behind on compactions due to hardware issues or resource contention and eventually crashes for lack of memory. When another server comes up to recover, it has to recover the same memory load that just caused the last process to crash.
>>>
>>> B) Cluster management software decides to take a RangeServer machine out of service. Say it's a machine with 8 GB of RAM and Hypertable has buffered up 5 GB in memory. It doesn't get a chance to compact before being taken down. The machine chosen as a replacement server only has 4 GB of available RAM. It will somehow have to recover the 5 GB memory state of the old server.
>>>
>>> Maybe these are post-1.0 concerns. I'm wondering what I can do now. The "solution" I'm looking at is to wipe out my entire Hypertable installation and try to isolate #1 from a clean slate. Any suggestions for a less drastic fix?
>>>
>>> Josh
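Luke's suggestion in the next quoted message is, roughly: notice when a range is being filled in ascending key order and, at split time, move the upper (hot) half to another server so the write load migrates instead of staying on the same machine. Below is a toy sketch of what the detection side could look like; the class name, the 95% threshold, and the return values are all invented for illustration, and this is not Hypertable code.

class SplitDirectionTracker:
    """Toy sketch: detect an append-style (ascending-key) load so that,
    at split time, the hot half can be shipped to another RangeServer."""

    def __init__(self):
        self.last_key = None
        self.in_order = 0       # inserts that were >= the previous key
        self.out_of_order = 0

    def record_insert(self, key):
        if self.last_key is not None:
            if key >= self.last_key:
                self.in_order += 1
            else:
                self.out_of_order += 1
        self.last_key = key

    def half_to_move(self):
        total = self.in_order + self.out_of_order
        if total == 0:
            return "either"
        ascending = self.in_order / total
        if ascending > 0.95:
            return "upper"    # writes chase the high end; move it to another server
        if ascending < 0.05:
            return "lower"    # descending load; move the low end instead
        return "either"       # mixed or random load; current behavior is fine

if __name__ == "__main__":
    t = SplitDirectionTracker()
    for k in ("row-0001", "row-0002", "row-0003", "row-0004"):
        t.record_insert(k)
    print(t.half_to_move())   # prints "upper" for this ascending sample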
>>> On Mon, Sep 8, 2008 at 11:38 AM, Luke <[EMAIL PROTECTED]> wrote:
>>>
>>>> Maybe we should consider an option to split off (moving to another range server) the lower or higher half of a range, depending on the loading pattern of the data. The range server can dynamically detect whether the row keys are arriving in ascending order and split off the higher half of the range, or vice versa, to balance the data better (this is better than rebalancing the data later, which involves extra copies).
>>>>
>>>> __Luke
>>>>
>>>> On Sep 8, 9:14 am, "Doug Judd" <[EMAIL PROTECTED]> wrote:
>>>>
>>>>> Hi Josh,
>>>>>
>>>>> The problem here is that this particular workload (loading data in ascending order of primary key) is worst-case from Hypertable's perspective. It works optimally with random updates that are uniform across the primary key space.
>>>>>
>>>>> The way the system works is that a single range server ends up handling all of the load. When a range fills up and splits, the lower half will get re-assigned to another range server. However, since there will be no more updates to that lower half, there will be no activity on that range. When a range splits, it first does a major compaction. After the split, both ranges (lower half and upper half) will share the same CellStore file in the DFS. This is why you see 4313 range directories that are empty (their key/value pairs are inside a CellStore file that is shared with range AB2A0D28DE6B77FFDD6C72AF and sits inside that range's directory). So the ranges are getting round-robin assigned to all of the RangeServers; it's just that the RangeServer that holds range AB2A0D28DE6B77FFDD6C72AF is doing all of the work.
>>>>>
>>>>> There is probably a bug that is preventing the commit log from getting garbage collected in this scenario. I have a couple of high priority things on my stack right now, so I probably won't get to it until later this week or early next week. If you have any time to investigate, the place to look would be RangeServer::log_cleanup(). This method gets called once per minute to do log fragment garbage collection.
>>>>>
>>>>> Also, this workload seems to be more common than we initially expected. In fact, it is the same workload that we here at Zvents see in our production log processing deployment. We should definitely spend some time optimizing Hypertable for this type of workload.
>>>>>
>>>>> - Doug
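For readers unfamiliar with the mechanism Doug points at above: the rule behind commit-log garbage collection, in very simplified form, is that a log fragment only becomes reclaimable once every range has flushed (compacted) all updates at least as new as that fragment; otherwise a crash would lose unflushed data. The sketch below is a generic illustration of that rule with invented names, not the actual RangeServer::log_cleanup() code.

from dataclasses import dataclass

@dataclass
class Fragment:
    path: str
    latest_revision: int   # newest update timestamp written to this fragment

@dataclass
class RangeState:
    name: str
    flushed_revision: int  # everything <= this is already persisted in CellStores

def reclaimable(fragments, ranges):
    # A fragment is safe to delete only if no range still depends on it for
    # recovery, i.e. every range has flushed past the fragment's newest update.
    cutoff = min(r.flushed_revision for r in ranges)
    return [f for f in fragments if f.latest_revision <= cutoff]

# If one busy range never gets a compaction scheduled, its flushed_revision
# never advances, the cutoff stays ancient, and no fragment is ever
# reclaimed, which is one way to end up with 350+ GB of commit log.

if __name__ == "__main__":
    frags = [Fragment("frag-0", 9), Fragment("frag-1", 500)]
    ranges = [RangeState("a", 10_000), RangeState("b", 10)]
    print([f.path for f in reclaimable(frags, ranges)])   # ['frag-0']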
>>>>> On Sun, Sep 7, 2008 at 1:11 PM, Joshua Taylor <[EMAIL PROTECTED]> wrote:
>>>>>
>>>>>> Hi Donald,
>>>>>>
>>>>>> Thanks for the insights! That's interesting that the server has so many ranges loaded on it. Does Hypertable not yet redistribute ranges for balancing?
>>>>>>
>>>>>> Looking in /hypertable/tables/X/default/, I see 4313 directories, which I guess correspond to the ranges. If what you're saying is true, then that one server has all the ranges. When I was looking at the METADATA table earlier, I seem to remember that the ranges appeared to be spread around as far as the METADATA table was concerned. I can't verify that now because half of the RangeServers in the cluster went down after I tried the 15-way load last night. Maybe these log directories indicate that each range was created on this one server but isn't necessarily still hosted there.
>>>>>>
>>>>>> Looking in the table range directories, I see that most of them are empty. Of the 4313 table range directories, only 12 have content, with the following size distribution:
>>>>>>
>>>>>> Name                       Size in bytes
>>>>>> 71F33965BA815E48705DB484          772005
>>>>>> D611DD0EE66B8CF9FB4AA997        40917711
>>>>>> 38D1E3EA8AD2F6D4BA9A4DF8        74199178
>>>>>> AB2A0D28DE6B77FFDD6C72AF    659455660576
>>>>>> 4F07C111DD9998285C68F405             900
>>>>>> F449F89DDE481715AE83F46C        29046097
>>>>>> 1A0950A7883F9AC068C6B5FD        54621737
>>>>>> 9213BEAADBFF69E633617D98             900
>>>>>> 6224D36D9A7D3C5B4AE941B2       131677668
>>>>>> 6C33339858EDF470B771637C       132973214
>>>>>> 64365528C0D82ED25FC7FFB0       170159530
>>>>>> C874EFC44725DB064046A0FF             900
>>>>>>
>>>>>> It's really skewed, but maybe this isn't a big deal. I'm going to guess that the 650 GB slice corresponds to the end range of the table, since most of the data gets created there. When a split happens, the new range holds a reference to the files in the original range and never has the need to do a compaction into its own data space.
>>>>>>
>>>>>> As for the log recovery process... when I wrote the last message, the recovery was still happening and had been running for 115 minutes. I let it continue to run to see if it would actually finish, and it did. Looking at the log, it appears that it actually took around 180 minutes to complete and get back to the outstanding scanner request, which had long since timed out. After the recovery, the server is back up to 2.8 GB of memory. The log directory still contains the 4300+ split directories, and the user commit log directory still contains 350+ GB of data.
>>>>>>
>>>>>> You suggest that the log data is supposed to be cleaned up. I'm using a post-0.9.0.10 build (v0.9.0.10-14-g50e5f71 to be exact). It contains what I think is the patch you're referencing:
>>>>>>
>>>>>> commit 38bbfd60d1a52aff3230dea80aa4f3c0c07daae4
>>>>>> Author: Donald <[EMAIL PROTECTED]>
>>>>>>     Fixed a bug in RangeServer::schedule_log_cleanup_compactions that prevents log cleanup com...
>>>>>>
>>>>>> I'm hoping the maintenance task threads weren't too busy for this workload, as it was pretty light. This is a 15-server cluster with a single active client writing to the table and nobody reading from it. Like I said earlier, I tried a 15-way write after the recovery completed and half the RangeServers died. It looks like they all lost their Hyperspace lease, and the Hyperspace.master machine was 80% in the iowait state with a load average of 20 for a while. That server hosts an HDFS data node, a RangeServer, and Hyperspace.master. Maybe Hyperspace.master needs a dedicated server? I should probably take that issue to another thread.
>>>>>>
>>>>>> I'll look into it further, probably tomorrow.
>>>>>>
>>>>>> Josh
>>>>>>
>>>>>> On Sat, Sep 6, 2008 at 9:29 PM, Liu Kejia(Donald) <[EMAIL PROTECTED]> wrote:
>>>>>>> Hi Josh,
>>>>>>>
>>>>>>> The 4311 directories are for split logs; they are used while a range is splitting into two. This indicates you have at least 4K+ ranges on that server, which is pretty big (I usually have several hundred per server). The 3670 files are commit log files. I think it's actually quite good performance to take 115 minutes to replay roughly 350 GB of logs; that works out to about 50 MB/s of replay throughput. The problem is that many of these commit log files should be removed over time, after compactions of the ranges take place. Ideally you'll only have 1 or 2 of these files left after all the maintenance tasks are done, in which case the replay process only costs a few seconds.
>>>>>>>
>>>>>>> One reason why the commit log files are not getting reclaimed is a bug in the range server code; I've pushed out a fix for it and it should be included in the latest 0.9.0.10 release. Another reason could be that your maintenance task threads are too busy to get the work done in time; you may try to increase the number of maintenance tasks by setting Hypertable.RangeServer.MaintenanceThreads in your hypertable.cfg file.
>>>>>>>
>>>>>>> About load balance, I think your guess is right. About HDFS, it seems HDFS always tries to put one copy of each file block on the local datanode. This is good for performance, but certainly bad for load balance if you keep writing from one server.
>>>>>>>
>>>>>>> Donald
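For anyone looking for the exact spot: assuming hypertable.cfg uses the usual key=value property format, the setting Donald mentions would look something like the line below. The value 4 is only an example, not a recommendation from this thread; pick something suited to the hardware.

Hypertable.RangeServer.MaintenanceThreads=4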
>>>>>>> On Sun, Sep 7, 2008 at 10:20 AM, Joshua Taylor <[EMAIL PROTECTED]> wrote:
>>>>>>>
>>>>>>>> I had a RangeServer process that was taking up around 5.8 GB of memory, so I shot it down and restarted it. The RangeServer has spent the last 80 CPU-minutes (>115 minutes on the clock) in local_recover(). Is this normal?
>>>>>>>>
>>>>>>>> Looking around HDFS, I see around 3670 files in the server's /.../log/user/ directory, most of which are around 100 MB in size (total directory size: 351,031,700,665 bytes). I also see 4311 directories in the parent directory, of which 4309 are named with a 24-character hex string. Spot inspection of these shows that most (all?) of them contain a single 0-byte file named "0".
>>>>>>>>
>>>>>>>> The RangeServer log file since the restart currently contains over 835,000 lines. The bulk seems to be lines like:
>>>>>>>>
>>>>>>>> 1220752472 INFO Hypertable.RangeServer : (/home/josh/hypertable/src/cc/Hypertable/RangeServer/RangeServer.cc:1553) replay_update - length=30
>>>>>>>> 1220752472 INFO Hypertable.RangeServer : (/home/josh/hypertable/src/cc/Hypertable/RangeServer/RangeServer.cc:1553) replay_update - length=30
>>>>>>>> 1220752472 INFO Hypertable.RangeServer : (/home/josh/hypertable/src/cc/Hypertable/RangeServer/RangeServer.cc:1553) replay_update - length=30
>>>>>>>> 1220752472 INFO Hypertable.RangeServer : (/home/josh/hypertable/src/cc/Hypertable/RangeServer/RangeServer.cc:1553) replay_update - length=30
>>>>>>>> 1220752472 INFO Hypertable.RangeServer : (/home/josh/hypertable/src/cc/Hypertable/RangeServer/RangeServer.cc:1553) replay_update - length=30
>>>>>>>>
>>>>>>>> The memory usage may be the same issue that Donald was reporting earlier in his discussion of fragmentation. The new RangeServer process has grown to 1.5 GB of memory again, but the max cache size is 200 MB (the default).
>>>>>>>>
>>>>>>>> I'd been loading into a 15-node Hypertable cluster all week using a single loader process. I'd loaded about 5 billion cells, or around 1.5 TB of data, before I decided to kill the loader because it was taking too long (and that one server was getting huge). The total data set is around 3.5 TB and it took under a week to generate the original set (using 15-way parallelism, not just a single loader), so I decided to try to load the rest in a distributed manner.
>>>>>>>>
>>>>>>>> The loading was happening in ascending row order, and it seems like all of the loading was happening on the same server. I'm guessing that when splits happened, the low range got moved off and the same server continued to load the end range. That might explain why one server was getting all the traffic.
>>>>>>>>
>>>>>>>> Looking at HDFS disk usage, the loaded server has 954 GB of disk used for Hadoop and the other 14 all have around 140 GB of disk usage. This behavior also has me wondering what happens when that one machine fills up (another couple hundred GB). Does the whole system crash, or does HDFS get smarter about balancing?
>>>>>
>>>>> ...
