I am running 208 MapReduce jobs in rapid-fire succession using anonymous
JavaScript functions. I am sending the MapReduce jobs to a single node,
riak01. There are about 75,000 keys in the bucket.
Erlang: R13B04
Riak: 0.14.2
When I had my MapReduce timeout set to 120,000 ("timeout":120000), I was getting
mapexec_error, {error,timeout}
The first timeout wrote to the error log after seven seconds, the second and
third after five seconds, and the fourth after eight seconds. The beam
process never crashed.
So, I increased the value to 30,000,000 ("timeout":30000000). In the first
run, all MapReduce jobs completed without error, each one taking about 1 to 3
seconds to complete.
The CPU usage on riak01 was about 50 percent for all 208 jobs.
Below is a sample output from iostat -x
avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
          51.00   0.00     5.01     0.10    0.00  43.89

Device:  rrqm/s  wrqm/s   r/s   w/s  rsec/s  wsec/s  avgrq-sz  avgqu-sz  await  svctm  %util
hda        0.00    8.22  0.00  3.21    0.00   91.38     28.50      0.01   2.62   2.12   0.68
In the second run, on the 53rd MapReduce job, the job was still waiting to
complete after 10 minutes. So, there was never a timeout, and nothing was
written to the error logs. However, the beam process obviously crashed. On
riak01, I executed the following commands:
./riak-admin status
Node is not running!
./riak ping
Node '[email protected]' not responding to pings.
./riak attach
Node is not running!
However, ps and top showed the process running.
ps output:
1003 31807 1.0 8.7 172080 132584 pts/1 Rsl+ Jun22 28:53
/home/DMitchell/riak2/riak/rel/riak/erts-5.7.5/bin/beam -K true -A 64 -- -root
/home/DMitchell/riak2/riak/rel/riak -progname riak -- -home /home/DMitchell --
-boot /home/DMitchell/riak2/riak/rel/riak/releases/0.14.2/riak -embedded
-config /home/DMitchell/riak2/riak/rel/riak/etc/app.config -name
[email protected] -setcookie riak -- console
top output:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
31807 DMitchel 25 0 168m 129m 4360 R 99.3 8.7 30:46.08 beam
Below is a sample output from iostat -x when beam was in the crashed state:
avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
         100.00   0.00     0.00     0.00    0.00   0.00

Device:  rrqm/s  wrqm/s   r/s   w/s  rsec/s  wsec/s  avgrq-sz  avgqu-sz  await  svctm  %util
hda        0.00    1.82  0.00  0.61    0.00   19.45     32.00      0.00   2.00   2.00   0.12
Note the 100 percent CPU usage for the beam process. I terminated the beam
process with kill -s TERM 31807 and then restarted Riak.
There were no errors on the other two nodes, except for:
=ERROR REPORT==== 24-Jun-2011::12:29:36 ===
** Node '[email protected]' not responding **
** Removing (timedout) connection **
The MapReduce job is not that complex. I am using a key filter. The map phase
looks for a "LoadRange" field and emits a new counter key (e.g., "data.Load1")
if there is a match. The reduce phase counts the matches.
{
  "inputs" : {
    "bucket" : "names-51013",
    "key_filters" : [["starts_with", "22204-1-3"]]
  },
  "query" : [{
    "map" : {
      "keep" : false,
      "language" : "javascript",
      "arg" : null,
      "source" : "function(value,keyData,arg){var data=Riak.mapValuesJson(value)[0];if(data.LoadRange&&data.LoadRange==1) return[{\"data.Load1\":1}];else if(data.LoadRange&&data.LoadRange==2) return[{\"data.Load2\":1}];else if(data.LoadRange&&data.LoadRange==3) return[{\"data.Load3\":1}];else if(data.LoadRange&&data.LoadRange==4) return[{\"data.Load4\":1}];else if(data.LoadRange&&data.LoadRange==5) return[{\"data.Load5\":1}];else if(data.LoadRange&&data.LoadRange==6) return[{\"data.Load6\":1}];else if(data.LoadRange&&data.LoadRange==7) return[{\"data.Load7\":1}];else if(data.LoadRange&&data.LoadRange==8) return[{\"data.Load8\":1}];else if(data.LoadRange&&data.LoadRange==9) return[{\"data.Load9\":1}];else if(data.LoadRange&&data.LoadRange==10) return[{\"data.Load10\":1}];else return[];}"
    }
  }, {
    "reduce" : {
      "keep" : true,
      "language" : "javascript",
      "arg" : null,
      "source" : "function(v){var s={};for(var i in v){for(var n in v[i]){if(n in s) s[n]+=v[i][n];else s[n]=v[i][n];}} return[s];}"
    }
  }],
  "timeout" : 30000000
}
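For readability, here are the same map and reduce functions unminified so they can be run outside Riak. The Riak.mapValuesJson stub below is my own stand-in for Riak's built-in helper, and I have collapsed the ten if/else branches into one, assuming LoadRange only ever holds the integers 1 through 10:

```javascript
// Stub for Riak's built-in helper: decode each sibling value's JSON data.
// Inside Riak this function is provided by the JS VM; this is just a sketch.
var Riak = {
  mapValuesJson: function (value) {
    return value.values.map(function (v) { return JSON.parse(v.data); });
  }
};

// Map phase: emit one counter object keyed "data.LoadN" per matching object.
// Assumes LoadRange, when present, is an integer 1..10 (as in the original
// ten-branch version).
function map(value, keyData, arg) {
  var data = Riak.mapValuesJson(value)[0];
  if (data.LoadRange && data.LoadRange >= 1 && data.LoadRange <= 10) {
    var out = {};
    out["data.Load" + data.LoadRange] = 1;
    return [out];
  }
  return [];
}

// Reduce phase: sum the per-key counters emitted by the map phase.
function reduce(v) {
  var s = {};
  for (var i in v) {
    for (var n in v[i]) {
      if (n in s) s[n] += v[i][n];
      else s[n] = v[i][n];
    }
  }
  return [s];
}
```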
The MapReduce timeouts seem to be happening at different places, e.g., during
the map phase, during the reduce phase, and during the key-filtering phase
(#Fun<riak_kv_mapred_json.jsonify_not_found.1>,[],[]}).
See the URL below for the complete sasl-error.log right before a recent beam
crash.
https://gist.github.com/1045386
Can anyone shed any light on why I am getting timeouts and crashes?
David
_______________________________________________
riak-users mailing list
[email protected]
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com