DashengJu created STORM-738:
-------------------------------
Summary: Multilang needs Overflow-Control and HeartBeat bug
Key: STORM-738
URL: https://issues.apache.org/jira/browse/STORM-738
Project: Apache Storm
Issue Type: Bug
Affects Versions: 0.10.0, 0.9.3-rc2, 0.9.4, 0.11.0
Reporter: DashengJu
Priority: Critical
Hi all,
We have a topology with 3 components (spout -> parser -> saver); the
parser is a Multilang bolt written in Python. We do not use the ACK mechanism.
We found 2 problems with the Multilang Python script:
1) the parser's Python script may hold too many tuples and consume too much
memory;
2) with the Multilang heartbeat mechanism described in
https://issues.apache.org/jira/browse/STORM-513, the Python script always
times out on heartbeat, even when the parser bolt is working normally, which
causes the supervisor to restart itself.
!http://yun.baidu.com/share/link?shareid=3956686758&uk=1124463074!
ShellBolt process === Father-Process
PythonScript process === Child-Process
The reasons are:
1) when a topology does not use the ACK mechanism, the spout has no
overflow control; if too many tuples arrive on the stream, the spout will
send all of them to the parser's ShellBolt process (Father-Process);
2) the parser's ShellBolt process just puts the tuples into the _pendingWrites
queue, which can grow without bound if it has no size limit;
3) the parser's PythonScript process (Child-Process) calls readMsg() to read a
tuple from STDIN, handles the tuple, emits a new tuple to its father process
through STDOUT, and then calls readTaskIds() to read from STDIN. Because the
Father-Process's queue already holds many other tuples, the Child-Process will
read all of those tuples into pending_commands until it receives the TaskIds;
4) so the Child-Process's pending_commands may contain too many tuples and
consume too much memory.
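The buffering in steps 3) and 4) can be sketched as a small simulation. This is a simplified model written for illustration, not the actual storm.py code: the names simulate, read_msg and read_task_ids are hypothetical, STDIN is modelled as an in-memory deque, and the task id value 7 is arbitrary.

```python
from collections import deque

def simulate(backlog):
    """Model one emit by the child while `backlog` tuples sit in its STDIN."""
    # Assumption: the parent has already written `backlog` tuples to the pipe.
    stdin = deque({"tuple": i} for i in range(backlog))
    pending_commands = deque()

    def read_msg():
        return stdin.popleft()

    def read_task_ids():
        # Buffer every non-list message until the task-id list shows up.
        msg = read_msg()
        while not isinstance(msg, list):
            pending_commands.append(msg)
            msg = read_msg()
        return msg

    read_msg()          # the child reads one tuple and handles it
    stdin.append([7])   # the task-id reply lands behind the queued tuples
    task_ids = read_task_ids()
    return task_ids, len(pending_commands)

if __name__ == "__main__":
    print(simulate(1000))
```

In this model, a single emit with 1000 tuples already queued leaves 999 of them buffered in pending_commands before the task ids are seen.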
As for the heartbeat: because there are so many pending_commands for the
Child-Process to handle, every emit by the Child-Process requires additional
read operations from STDIN. It may need 10 seconds to handle one tuple, so the
heartbeat tuple is not handled quickly enough and a timeout occurs.
Even if the Father-Process's _pendingWrites queue had a limit, for example
1000, the Child-Process might need 1000 x 1000 read operations before it can
handle the heartbeat tuple.
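The 1000 x 1000 estimate can be reproduced with a rough simulation. It is a sketch under stated assumptions, not a measurement of Storm: the parent queue is always kept full by the unthrottled spout, the heartbeat starts at the tail of that full queue, every tuple triggers exactly one emit, and each task-id reply is written behind whatever is already queued. All names (reads_until_heartbeat, read_command, and so on) are hypothetical.

```python
from collections import deque

HEARTBEAT = "heartbeat"
TUPLE = "tuple"

def reads_until_heartbeat(limit):
    """Count STDIN reads before the child handles a heartbeat queued at the
    tail of a full parent queue of size `limit` (limit >= 1)."""
    stdin = deque([TUPLE] * (limit - 1) + [HEARTBEAT])
    pending_commands = deque()
    reads = 0

    def read_msg():
        nonlocal reads
        reads += 1
        msg = stdin.popleft()
        # Assumption: the spout is not throttled, so the queue is refilled.
        while len(stdin) < limit:
            stdin.append(TUPLE)
        return msg

    def read_task_ids():
        # Buffer commands until the task-id list arrives.
        msg = read_msg()
        while not isinstance(msg, list):
            pending_commands.append(msg)
            msg = read_msg()
        return msg

    def read_command():
        # Buffered commands are served before reading STDIN again.
        if pending_commands:
            return pending_commands.popleft()
        return read_msg()

    while True:
        command = read_command()
        if command == HEARTBEAT:
            return reads
        stdin.append([0])   # the parent's task-id reply for this emit
        read_task_ids()

if __name__ == "__main__":
    print(reads_until_heartbeat(100))
```

Under these assumptions the read count grows with the square of the queue limit, which matches the 1000 x 1000 figure for a limit of 1000.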
[~revans2] [~kabhwan] this is related to Multilang and heartbeat; please help
confirm the two problems.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)