I was running Riak 2.0pre11 but now see the same problem on pre20.
I've reduced the Riak cluster to one single node, to eliminate the
inter-node communication from the issue. From another server I run a script
to do 100,000 inserts using the Python client (presumably 1.4.something).
Each insert is in a loop with 3 retries and it always specifies a server
timeout value. For this test, the HTML docs are small enough that the
60,000 ms default timeout value is always the one specified. Currently there
is no socket timeout specified on the client side.
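
For reference, a minimal sketch of how a retry loop like this could add a client-side deadline, so that a server which never answers cannot block the loop forever. This is not the actual test script. The names call_with_deadline, insert_with_retries and do_insert are hypothetical, do_insert stands in for whatever call performs one insert with the Python client, and the 65-second default is only an assumed value slightly above the 60,000 ms server timeout.

```python
import threading


class InsertTimeout(Exception):
    """Raised when one attempt exceeds the client-side deadline."""


def call_with_deadline(fn, deadline_s):
    """Run fn() in a daemon thread and stop waiting after deadline_s seconds."""
    result = {}

    def runner():
        try:
            result["value"] = fn()
        except Exception as exc:  # hand the error back to the caller
            result["error"] = exc

    worker = threading.Thread(target=runner, daemon=True)
    worker.start()
    worker.join(deadline_s)
    if worker.is_alive():
        # The call is still blocked. The thread is abandoned, not killed.
        raise InsertTimeout("no reply within %.1f s" % deadline_s)
    if "error" in result:
        raise result["error"]
    return result["value"]


def insert_with_retries(do_insert, retries=3, deadline_s=65.0):
    """Try do_insert() up to `retries` times, each bounded by deadline_s.

    do_insert is a hypothetical zero-argument callable that performs one
    insert; the 65 s default is an assumption, not a Riak setting.
    """
    last_error = None
    for _ in range(retries):
        try:
            return call_with_deadline(do_insert, deadline_s)
        except Exception as exc:
            last_error = exc
    raise last_error
```

An abandoned thread stays blocked in the background, so this only keeps the loop moving. A socket timeout on the connection itself would be the cleaner fix if the client exposes one.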
Part way through, one of the inserts hung. On investigation the Riak server
seemed in an OK state. I ran strace ps ax and it did not hang. strace
riak-admin status also was OK. top showed one of the riak processes and
strace -p PID showed that it was waiting in select. But then, after a
retry, the client continued to do inserts. That particular strace showed no
change so not sure whether the process was important. Ran top again and the
same process showed 80% CPU utilization.
Then we got a full hang of Riak. The client did not retry because the
server did not time out. It just hung and hung, for over 15 minutes as I write
this. When I ran strace ps ax on the Riak server, it hung reading
/proc/PID/cmdline where PID was the same as the one mentioned above. When I
run pstree -p (which never hangs) it shows this:
|-run_erl(4171)---beam.smp(4173)-+-cpu_sup(4473)
4173 is the PID that I have been talking about. Oddly enough, when I opened
a new SSH connection to this server, the strace ps ax which had been hung
on opening a /proc file, suddenly ran to completion. However, running it
again, hung again on the same line. Here are a few lines of strace ps ax:
stat("/proc/4173", {st_mode=S_IFDIR|0555, st_size=0, ...}) = 0
open("/proc/4173/stat", O_RDONLY) = 6
read(6, "4173 (beam.smp) S 4171 4173 4173"..., 1024) = 277
read(6, "", 747) = 0
close(6) = 0
open("/proc/4173/status", O_RDONLY) = 6
read(6, "Name:\tbeam.smp\nState:\tS (sleepin"..., 1024) = 787
read(6, "", 237) = 0
close(6) = 0
open("/proc/4173/cmdline", O_RDONLY) = 6
read(6,
Any idea what is happening?
When Riak is running normally, is there a way to identify a PID which would
be useful to attach to strace if I see this problem developing? Or some
other way to look at status of all the different beam.smp processes and
identify where the problem is located?
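
One possible starting point, offered only as a sketch: beam.smp is a single OS process with many threads, and Linux lists them under /proc/PID/task. The hypothetical helper below (thread_states, with a proc_root parameter added purely so it can be pointed at a test directory) reads each thread's stat file and returns its name and state. A thread sitting in state D (uninterruptible sleep) would be a candidate for strace -p TID. It assumes these stat files stay readable during the hang, which has not been verified here.

```python
import os


def thread_states(pid, proc_root="/proc"):
    """Return {tid: (name, state)} for every thread of pid, read from procfs."""
    states = {}
    task_dir = os.path.join(proc_root, str(pid), "task")
    for tid in os.listdir(task_dir):
        try:
            with open(os.path.join(task_dir, tid, "stat")) as f:
                stat = f.read()
        except OSError:
            continue  # the thread exited between listdir() and open()
        # The name is wrapped in parentheses and may contain spaces, so
        # locate the state field relative to the last ')'.
        close = stat.rindex(")")
        name = stat[stat.index("(") + 1:close]
        state = stat[close + 2]
        states[int(tid)] = (name, state)
    return states


if __name__ == "__main__":
    # Example: show the threads of the current Python process.
    for tid, (name, state) in sorted(thread_states(os.getpid()).items()):
        print(tid, name, state)
```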
Doesn't this indicate a problem with the way that Riak implements the
server timeout? Shouldn't some supervisor be killing and restarting a child
process or subtree when this occurs?
--
Michael Dillon - Senior Software Engineer
PageFreezer.com
#200 - 311 Water Street
Vancouver, BC V6B 1B8