I had a RangeServer process that was taking up around 5.8 GB of memory, so I shut it down and restarted it. Since the restart it has spent the last 80 CPU-minutes (more than 115 minutes of wall-clock time) in local_recover(). Is this normal?
Looking around HDFS, I see around 3670 files in the server's /.../log/user/ directory, most of which are around 100 MB in size (total directory size: 351,031,700,665 bytes). I also see 4311 directories in the parent directory, 4309 of which are named with a 24-character hex string. Spot inspection shows that most (all?) of these contain a single zero-byte file named "0".

The RangeServer log file since the restart already contains over 835,000 lines, the bulk of which look like this:

1220752472 INFO Hypertable.RangeServer : (/home/josh/hypertable/src/cc/Hypertable/RangeServer/RangeServer.cc:1553) replay_update - length=30
1220752472 INFO Hypertable.RangeServer : (/home/josh/hypertable/src/cc/Hypertable/RangeServer/RangeServer.cc:1553) replay_update - length=30
1220752472 INFO Hypertable.RangeServer : (/home/josh/hypertable/src/cc/Hypertable/RangeServer/RangeServer.cc:1553) replay_update - length=30

The memory usage may be the same issue Donald was reporting earlier in his discussion of fragmentation. The new RangeServer process has already grown back to 1.5 GB of memory, even though the max cache size is the 200 MB default.

I'd been loading into a 15-node Hypertable cluster all week using a single loader process. I'd loaded about 5 billion cells, around 1.5 TB of data, before I decided to kill the loader because it was taking too long (and that one server was getting huge). The total data set is around 3.5 TB and took under a week to generate (using 15-way parallelism, not a single loader), so I've decided to try loading the rest in a distributed manner.

The loading was happening in ascending row order, and it looks like all of it was landing on the same server. I'm guessing that when splits happened, the low range got moved off while the same server kept serving the end range, which would explain why one server was getting all the traffic. Looking at HDFS disk usage, the loaded server has 954 GB of disk used for Hadoop, while the other 14 all have around 140 GB.

This behavior also has me wondering what happens when that one machine fills up (another couple hundred GB). Does the whole system crash, or does HDFS get smarter about balancing?

Josh
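P.S. A rough back-of-envelope on the recovery time, assuming local_recover() is dominated by re-reading the commit log from HDFS. The byte count is the log/user directory total quoted above; the sustained read rates are guesses, not measurements:

    # Back-of-envelope: how long does it take just to re-read ~351 GB of commit log?
    # The byte count is the log/user directory size above; the read rates are assumed.
    log_bytes = 351_031_700_665

    for mb_per_sec in (25, 50, 100):   # assumed sustained HDFS read rates
        minutes = log_bytes / (mb_per_sec * 1e6) / 60
        print(f"at {mb_per_sec:3d} MB/s: ~{minutes:.0f} minutes just to read the log")

At ~50 MB/s that already comes out close to two hours, so the time spent in local_recover() may simply be proportional to the amount of commit log it has to replay.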
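P.P.S. A toy Python sketch of the split behavior I'm guessing at above: keys arrive in strictly ascending row order, and on each split the low half of the range is reassigned to some other server while the open-ended high half stays put. The server count matches the cluster, but the split threshold and the assignment of low halves are arbitrary; this only illustrates the guess, not Hypertable's actual split logic:

    # Toy model: ascending-order loading against "a split moves the low half away,
    # the high half stays".  Not Hypertable's real code -- just the guess above.
    import random

    NUM_SERVERS = 15
    SPLIT_THRESHOLD = 1000        # cells per range before a split (arbitrary)
    TOTAL_CELLS = 100_000

    writes = [0] * NUM_SERVERS    # write requests handled by each server
    ranges = [0] * NUM_SERVERS    # ranges hosted by each server after loading
    ranges[0] = 1                 # the single initial range lives on server 0

    hot = 0                       # server holding the open-ended (end) range
    in_range = 0

    for _ in range(TOTAL_CELLS):  # cells arrive in strictly ascending row order
        writes[hot] += 1          # every new key falls in the end range
        in_range += 1
        if in_range >= SPLIT_THRESHOLD:
            # Split: the low half moves to some (here random) server,
            # the high half stays where it is.
            ranges[random.randrange(NUM_SERVERS)] += 1
            in_range //= 2

    print("writes handled:", writes)
    print("ranges hosted :", ranges)

Running it, the ~200 ranges end up spread fairly evenly across the 15 servers, but every single write still lands on the one server holding the end range, which matches the traffic pattern I was seeing.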
