I had a RangeServer process that was taking up around 5.8 GB of memory, so I shut it down and restarted it. Since the restart it has spent the last 80 CPU-minutes (more than 115 minutes of wall-clock time) in local_recover(). Is this normal?
Looking around HDFS, I see around 3670 files in the server's /.../log/user/ directory, most of which are around 100 MB in size (total directory size: 351,031,700,665 bytes). I also see 4311 directories in the parent directory, 4309 of which are named with a 24-character hex string. Spot inspection shows that most (all?) of these contain a single zero-byte file named "0".

The RangeServer log file since the restart already contains over 835,000 lines, the bulk of which look like this:

1220752472 INFO Hypertable.RangeServer : (/home/josh/hypertable/src/cc/Hypertable/RangeServer/RangeServer.cc:1553) replay_update - length=30
1220752472 INFO Hypertable.RangeServer : (/home/josh/hypertable/src/cc/Hypertable/RangeServer/RangeServer.cc:1553) replay_update - length=30
1220752472 INFO Hypertable.RangeServer : (/home/josh/hypertable/src/cc/Hypertable/RangeServer/RangeServer.cc:1553) replay_update - length=30

The memory usage may be the same issue Donald was reporting earlier in his discussion of fragmentation. The new RangeServer process has already grown back to 1.5 GB of memory, even though the max cache size is the 200 MB default.

I'd been loading into a 15-node Hypertable cluster all week using a single loader process. I'd loaded about 5 billion cells, around 1.5 TB of data, before I decided to kill the loader because it was taking too long (and that one server was getting huge). The total data set is around 3.5 TB and took under a week to generate (using 15-way parallelism, not a single loader), so I've decided to try loading the rest in a distributed manner.

The loading was happening in ascending row order, and it looks like all of it was landing on the same server. I'm guessing that when splits happened, the low range got moved off while the same server kept serving the end range, which would explain why one server was getting all the traffic. Looking at HDFS disk usage, the loaded server has 954 GB of disk used for Hadoop, while the other 14 all have around 140 GB.

This behavior also has me wondering what happens when that one machine fills up (another couple hundred GB). Does the whole system crash, or does HDFS get smarter about balancing?

Josh
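P.S. A rough back-of-envelope on the recovery time, assuming local_recover() is dominated by re-reading the commit log from HDFS. The byte count is the log/user directory total quoted above; the sustained read rates are guesses, not measurements:

    # Back-of-envelope: how long does it take just to re-read ~351 GB of commit log?
    # The byte count is the log/user directory size above; the read rates are assumed.
    log_bytes = 351_031_700_665

    for mb_per_sec in (25, 50, 100):   # assumed sustained HDFS read rates
        minutes = log_bytes / (mb_per_sec * 1e6) / 60
        print(f"at {mb_per_sec:3d} MB/s: ~{minutes:.0f} minutes just to read the log")

At ~50 MB/s that already comes out close to two hours, so the time spent in local_recover() may simply be proportional to the amount of commit log it has to replay.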
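P.P.S. A toy Python sketch of the split behavior I'm guessing at above: keys arrive in strictly ascending row order, and on each split the low half of the range is reassigned to some other server while the open-ended high half stays put. The server count matches the cluster, but the split threshold and the assignment of low halves are arbitrary; this only illustrates the guess, not Hypertable's actual split logic:

    # Toy model: ascending-order loading against "a split moves the low half away,
    # the high half stays".  Not Hypertable's real code -- just the guess above.
    import random

    NUM_SERVERS = 15
    SPLIT_THRESHOLD = 1000        # cells per range before a split (arbitrary)
    TOTAL_CELLS = 100_000

    writes = [0] * NUM_SERVERS    # write requests handled by each server
    ranges = [0] * NUM_SERVERS    # ranges hosted by each server after loading
    ranges[0] = 1                 # the single initial range lives on server 0

    hot = 0                       # server holding the open-ended (end) range
    in_range = 0

    for _ in range(TOTAL_CELLS):  # cells arrive in strictly ascending row order
        writes[hot] += 1          # every new key falls in the end range
        in_range += 1
        if in_range >= SPLIT_THRESHOLD:
            # Split: the low half moves to some (here random) server,
            # the high half stays where it is.
            ranges[random.randrange(NUM_SERVERS)] += 1
            in_range //= 2

    print("writes handled:", writes)
    print("ranges hosted :", ranges)

Running it, the ~200 ranges end up spread fairly evenly across the 15 servers, but every single write still lands on the one server holding the end range, which matches the traffic pattern I was seeing.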
