Hi Josh,

The 4311 directories are for split logs; they are used while a range is splitting in two. This means you have at least 4K+ ranges on that server, which is pretty big (I usually see several hundred per server).

The 3670 files are commit log files. Taking 115 minutes to replay roughly 350GB of logs (3670 files at ~100MB each) is actually quite good performance; that works out to about 50MB/s of replay throughput. The real problem is that many of these commit log files should be removed over time, after compactions of the ranges take place. Ideally you'll only have one or two of these files left once all the maintenance tasks are done, and in that case the replay process only takes a few seconds.
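If you want to keep an eye on whether the logs are actually being reclaimed, a quick way is to count the fragment files in that log/user directory from HDFS. A rough sketch; the path below is just a placeholder, not the real layout, so substitute the actual .../log/user directory you found:

    # count commit log fragments for one RangeServer
    # (placeholder path; use the log/user directory you saw in HDFS)
    hadoop fs -ls /path/to/rangeserver/log/user | wc -l

If that count keeps growing instead of settling down to one or two files, the maintenance work isn't keeping up.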
One reason the commit log files are not getting reclaimed is a bug in the range server code; I've pushed out a fix for it, and it should be included in the latest 0.9.0.10 release. Another reason could be that your maintenance task threads are too busy to get the work done in time. You can try increasing the number of maintenance threads by setting Hypertable.RangeServer.MaintenanceThreads in your hypertable.cfg file (there's a small sketch of that setting at the bottom of this message).

About load balance, I think your guess is right.

About HDFS: it seems HDFS always tries to put one copy of each file block on the local datanode. That gives good write performance, but certainly bad load balance if you keep writing from a single server.

Donald

On Sun, Sep 7, 2008 at 10:20 AM, Joshua Taylor <[EMAIL PROTECTED]> wrote:
> I had a RangeServer process that was taking up around 5.8 GB of memory so I
> shot it down and restarted it. The RangeServer has spent the last 80
> CPU-minutes (>115 minutes on the clock) in local_recover(). Is this normal?
>
> Looking around HDFS, I see around 3670 files in the server's /.../log/user/
> directory, most of which are around 100 MB in size (total directory size:
> 351,031,700,665 bytes). I also see 4311 directories in the parent
> directory, of which 4309 are named with a 24-character hex string. Spot
> inspection of these shows that most (all?) of them contain a single 0-byte
> file named "0".
>
> The RangeServer log file since the restart currently contains over 835,000
> lines. The bulk seems to be lines like:
>
> 1220752472 INFO Hypertable.RangeServer :
> (/home/josh/hypertable/src/cc/Hypertable/RangeServer/RangeServer.cc:1553)
> replay_update - length=30
> 1220752472 INFO Hypertable.RangeServer :
> (/home/josh/hypertable/src/cc/Hypertable/RangeServer/RangeServer.cc:1553)
> replay_update - length=30
> 1220752472 INFO Hypertable.RangeServer :
> (/home/josh/hypertable/src/cc/Hypertable/RangeServer/RangeServer.cc:1553)
> replay_update - length=30
> 1220752472 INFO Hypertable.RangeServer :
> (/home/josh/hypertable/src/cc/Hypertable/RangeServer/RangeServer.cc:1553)
> replay_update - length=30
> 1220752472 INFO Hypertable.RangeServer :
> (/home/josh/hypertable/src/cc/Hypertable/RangeServer/RangeServer.cc:1553)
> replay_update - length=30
>
> The memory usage may be the same issue that Donald was reporting earlier in
> his discussion of fragmentation. The new RangeServer process has grown up
> to 1.5 GB of memory again, but the max cache size is 200 MB (default).
>
> I'd been loading into a 15-node Hypertable cluster all week using a single
> loader process. I'd loaded about 5 billion cells, or around 1.5 TB of data,
> before I decided to kill the loader because it was taking too long (and that
> one server was getting huge). The total data set size is around 3.5 TB and
> it took under a week to generate the original set (using 15-way parallelism,
> not just a single loader), so I decided to try to load the rest in a
> distributed manner.
>
> The loading was happening in ascending row order. It seems like all of the
> loading was happening on the same server. I'm guessing that when splits
> happened, the low range got moved off, and the same server continued to load
> the end range. That might explain why one server was getting all the
> traffic.
>
> Looking at HDFS disk usage, the loaded server has 954 GB of disk used for
> Hadoop and the other 14 all have around 140 GB of disk usage. This behavior
> also has me wondering what happens when that one machine fills up (another
> couple hundred GB).
> Does the whole system crash, or does HDFS get smarter
> about balancing?
>
> Josh
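For reference, the maintenance thread setting mentioned above is a single line in hypertable.cfg. A minimal sketch, assuming the usual key=value config format; the value 4 is only an illustration, so pick something that fits your hardware:

    # hypertable.cfg (illustrative value, not a recommendation)
    Hypertable.RangeServer.MaintenanceThreads=4

The RangeServer needs a restart to pick up the change.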
