Hi Josh,

The 4311 directories are for split logs; they are used while a range is splitting in two. This means you have at least 4K+ ranges on that server, which is pretty big (I usually see several hundred per server).

The 3670 files are commit log files. Taking 115 minutes to replay roughly 350GB of logs (3670 files at ~100MB each) is actually quite good performance; that works out to about 50MB/s of replay throughput. The real problem is that many of these commit log files should be removed over time, after compactions of the ranges take place. Ideally you'll only have one or two of these files left once all the maintenance tasks are done, and in that case the replay process only takes a few seconds.
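If you want to keep an eye on whether the logs are actually being reclaimed, a quick way is to count the fragment files in that log/user directory from HDFS. A rough sketch; the path below is just a placeholder, not the real layout, so substitute the actual .../log/user directory you found:

    # count commit log fragments for one RangeServer
    # (placeholder path; use the log/user directory you saw in HDFS)
    hadoop fs -ls /path/to/rangeserver/log/user | wc -l

If that count keeps growing instead of settling down to one or two files, the maintenance work isn't keeping up.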
One reason the commit log files are not getting reclaimed is a bug in the range server code; I've pushed out a fix for it, and it should be included in the latest 0.9.0.10 release. Another reason could be that your maintenance task threads are too busy to get the work done in time. You can try increasing the number of maintenance threads by setting Hypertable.RangeServer.MaintenanceThreads in your hypertable.cfg file (there's a small sketch of that setting at the bottom of this message).

About load balance, I think your guess is right.

About HDFS: it seems HDFS always tries to put one copy of each file block on the local datanode. That gives good write performance, but certainly bad load balance if you keep writing from a single server.

Donald

On Sun, Sep 7, 2008 at 10:20 AM, Joshua Taylor <[EMAIL PROTECTED]> wrote:
> I had a RangeServer process that was taking up around 5.8 GB of memory so I
> shot it down and restarted it. The RangeServer has spent the last 80
> CPU-minutes (>115 minutes on the clock) in local_recover(). Is this normal?
>
> Looking around HDFS, I see around 3670 files in the server's /.../log/user/
> directory, most of which are around 100 MB in size (total directory size:
> 351,031,700,665 bytes). I also see 4311 directories in the parent
> directory, of which 4309 are named with a 24-character hex string. Spot
> inspection of these shows that most (all?) of them contain a single 0-byte
> file named "0".
>
> The RangeServer log file since the restart currently contains over 835,000
> lines. The bulk seems to be lines like:
>
> 1220752472 INFO Hypertable.RangeServer :
> (/home/josh/hypertable/src/cc/Hypertable/RangeServer/RangeServer.cc:1553)
> replay_update - length=30
> 1220752472 INFO Hypertable.RangeServer :
> (/home/josh/hypertable/src/cc/Hypertable/RangeServer/RangeServer.cc:1553)
> replay_update - length=30
> 1220752472 INFO Hypertable.RangeServer :
> (/home/josh/hypertable/src/cc/Hypertable/RangeServer/RangeServer.cc:1553)
> replay_update - length=30
> 1220752472 INFO Hypertable.RangeServer :
> (/home/josh/hypertable/src/cc/Hypertable/RangeServer/RangeServer.cc:1553)
> replay_update - length=30
> 1220752472 INFO Hypertable.RangeServer :
> (/home/josh/hypertable/src/cc/Hypertable/RangeServer/RangeServer.cc:1553)
> replay_update - length=30
>
> The memory usage may be the same issue that Donald was reporting earlier in
> his discussion of fragmentation. The new RangeServer process has grown up
> to 1.5 GB of memory again, but the max cache size is 200 MB (default).
>
> I'd been loading into a 15-node Hypertable cluster all week using a single
> loader process. I'd loaded about 5 billion cells, or around 1.5 TB of data,
> before I decided to kill the loader because it was taking too long (and that
> one server was getting huge). The total data set size is around 3.5 TB and
> it took under a week to generate the original set (using 15-way parallelism,
> not just a single loader), so I decided to try to load the rest in a
> distributed manner.
>
> The loading was happening in ascending row order. It seems like all of the
> loading was happening on the same server. I'm guessing that when splits
> happened, the low range got moved off, and the same server continued to load
> the end range. That might explain why one server was getting all the
> traffic.
>
> Looking at HDFS disk usage, the loaded server has 954 GB of disk used for
> Hadoop and the other 14 all have around 140 GB of disk usage. This behavior
> also has me wondering what happens when that one machine fills up (another
> couple hundred GB).
> Does the whole system crash, or does HDFS get smarter
> about balancing?
>
> Josh
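For reference, the maintenance thread setting mentioned above is a single line in hypertable.cfg. A minimal sketch, assuming the usual key=value config format; the value 4 is only an illustration, so pick something that fits your hardware:

    # hypertable.cfg (illustrative value, not a recommendation)
    Hypertable.RangeServer.MaintenanceThreads=4

The RangeServer needs a restart to pick up the change.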
