Dear All,
I have an issue where nimbus detects executors as not alive and reassigns their tasks:
We are running Storm 0.9.2 with ZooKeeper 3.4.6, in a single-node setup: one Storm instance relying on one ZooKeeper instance on the same machine.
Unexpectedly, after about 1-2 hours of running, nimbus.log reports that no executor is alive, like the following:
2014-10-17 01:30:58 b.s.d.nimbus [INFO] Executor xxx :[2 2] not alive
... (remaining "not alive" lines omitted)
2014-10-17 01:30:58 b.s.d.nimbus [INFO] Executor xxx :[64 67] not alive
2014-10-17 01:30:58 b.s.s.EvenScheduler [INFO] Available slots:
(["e18633fb-6613-4afc-bea1-9941084508c5" 6702]
["e18633fb-6613-4afc-bea1-9941084508c5" 6701]
["e18633fb-6613-4afc-bea1-9941084508c5" 6700])
2014-10-17 01:30:58 b.s.d.nimbus [INFO] Reassigning xxx to 1 slots
2014-10-17 01:30:58 b.s.d.nimbus [INFO] Reassign executors: [[2 2] [3 3] [4 4]
[5 5] [6 6] [7 7] [8 8] [9 9] [10 10] [27 27] [92 92] [93 93] [1 1] [36 39] [68
71] [40 43] [72 75] [11 14] [44 47] [76 79] [15 18] [48 51] [80 83] [19 22] [52
55] [84 87] [23 26] [56 59] [88 91] [28 31] [60 63] [32 35] [64 67]]
2014-10-17 01:30:58 b.s.d.nimbus [INFO] Setting new assignment for topology id
xxx: #backtype.storm.daemon.common.Assignment{:master-code-dir "yyy",
:node->host {"e18633fb-6613-4afc-bea1-9941084508c5" "pfdwlnx1u"},
:executor->node+port {[2 2] ["e18633fb-6613-4afc-bea1-9941084508c5" 6702], [3
3] ["e18633fb-6613-4afc-bea1-9941084508c5" 6702], [4 4]
["e18633fb-6613-4afc-bea1-9941084508c5" 6702], [5 5]
["e18633fb-6613-4afc-bea1-9941084508c5" 6702], [6 6]
["e18633fb-6613-4afc-bea1-9941084508c5" 6702], [7 7]
["e18633fb-6613-4afc-bea1-9941084508c5" 6702], [8 8]
["e18633fb-6613-4afc-bea1-9941084508c5" 6702], [9 9]
["e18633fb-6613-4afc-bea1-9941084508c5" 6702], [10 10]
["e18633fb-6613-4afc-bea1-9941084508c5" 6702], [27 27]
["e18633fb-6613-4afc-bea1-9941084508c5" 6702], [92 92]
["e18633fb-6613-4afc-bea1-9941084508c5" 6702], [93 93]
["e18633fb-6613-4afc-bea1-9941084508c5" 6702], [1 1]
["e18633fb-6613-4afc-bea1-9941084508c5" 6702], [36 39]
["e18633fb-6613-4afc-bea1-9941084508c5" 6702], [68 71]
["e18633fb-6613-4afc-bea1-9941084508c5" 6702], [40 43]
["e18633fb-6613-4afc-bea1-9941084508c5" 6702], [72 75]
["e18633fb-6613-4afc-bea1-9941084508c5" 6702], [11 14]
["e18633fb-6613-4afc-bea1-9941084508c5" 6702], [44 47]
["e18633fb-6613-4afc-bea1-9941084508c5" 6702], [76 79]
["e18633fb-6613-4afc-bea1-9941084508c5" 6702], [15 18]
["e18633fb-6613-4afc-bea1-9941084508c5" 6702], [48 51]
["e18633fb-6613-4afc-bea1-9941084508c5" 6702], [80 83]
["e18633fb-6613-4afc-bea1-9941084508c5" 6702], [19 22]
["e18633fb-6613-4afc-bea1-9941084508c5" 6702], [52 55]
["e18633fb-6613-4afc-bea1-9941084508c5" 6702], [84 87]
["e18633fb-6613-4afc-bea1-9941084508c5" 6702], [23 26]
["e18633fb-6613-4afc-bea1-9941084508c5" 6702], [56 59]
["e18633fb-6613-4afc-bea1-9941084508c5" 6702], [88 91]
["e18633fb-6613-4afc-bea1-9941084508c5" 6702], [28 31]
["e18633fb-6613-4afc-bea1-9941084508c5" 6702], [60 63]
["e18633fb-6613-4afc-bea1-9941084508c5" 6702], [32 35]
["e18633fb-6613-4afc-bea1-9941084508c5" 6702], [64 67]
["e18633fb-6613-4afc-bea1-9941084508c5" 6702]}, :executor->start-time-secs {[2
2] 1413523858, [3 3] 1413523858, [4 4] 1413523858, [5 5] 1413523858, [6 6]
1413523858, [7 7] 1413523858, [8 8] 1413523858, [9 9] 1413523858, [10 10]
1413523858, [27 27] 1413523858, [92 92] 1413523858, [93 93] 1413523858, [1 1]
1413523858, [36 39] 1413523858, [68 71] 1413523858, [40 43] 1413523858, [72 75]
1413523858, [11 14] 1413523858, [44 47] 1413523858, [76 79] 1413523858, [15 18]
1413523858, [48 51] 1413523858, [80 83] 1413523858, [19 22] 1413523858, [52 55]
1413523858, [84 87] 1413523858, [23 26] 1413523858, [56 59] 1413523858, [88 91]
1413523858, [28 31] 1413523858, [60 63] 1413523858, [32 35] 1413523858, [64 67]
1413523858}}
Nimbus then reassigns the tasks to the supervisor. The problem is that we use a Kafka spout to pull messages from the Kafka server, and we observed that certain messages which had already been processed (persisted into a database) before the reassignment were replayed after it. I believe this is because the offsets of those already-processed Kafka messages had not yet been committed to the ZooKeeper server.
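For context, here is roughly how we wire up the spout; this is only a minimal sketch assuming the storm-kafka SpoutConfig API in 0.9.2, and the broker address, topic name, zkRoot and consumer id are placeholders, not our real values. As I understand it, the spout only writes consumed offsets to ZooKeeper every stateUpdateIntervalMs, so anything acked after the last commit but before the reassignment would be replayed.

import backtype.storm.spout.SchemeAsMultiScheme;
import storm.kafka.KafkaSpout;
import storm.kafka.SpoutConfig;
import storm.kafka.StringScheme;
import storm.kafka.ZkHosts;

public class KafkaSpoutSketch {
    public static KafkaSpout buildSpout() {
        // Placeholder broker/topic/zkRoot values -- not our real configuration.
        ZkHosts hosts = new ZkHosts("localhost:2181");
        SpoutConfig spoutConfig = new SpoutConfig(
                hosts,            // ZooKeeper used to discover Kafka brokers
                "my-topic",       // Kafka topic (placeholder)
                "/kafka-spout",   // zkRoot: where consumed offsets are stored in ZooKeeper
                "my-consumer-id"  // consumer id appended under zkRoot
        );
        spoutConfig.scheme = new SchemeAsMultiScheme(new StringScheme());
        // Offsets are committed to ZooKeeper only periodically (default 2000 ms),
        // so messages acked since the last commit are replayed after a reassignment.
        spoutConfig.stateUpdateIntervalMs = 2000;
        return new KafkaSpout(spoutConfig);
    }
}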
So my questions are:
1) I know replay is part of the fault-tolerance behavior provided by Storm and Kafka, but is this a normal case? Or does the "not alive" issue indicate that something is wrong?
2) If the "not alive" executors do point to a potential issue, how can I check in detail why a task is considered not alive? As far as I can tell, the nimbus and supervisor processes never died during our test. The timeout settings we have been reading about while investigating are sketched below.
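For reference, a minimal sketch of the heartbeat/timeout settings we believe are involved when nimbus decides an executor is not alive; the values shown are just the documented defaults as we understand them (we have not tuned any of these), not something we have verified fixes the problem.

import backtype.storm.Config;

public class TimeoutSettingsSketch {
    public static Config timeouts() {
        Config conf = new Config();
        // Executors write heartbeats into ZooKeeper this often (seconds).
        conf.put(Config.TASK_HEARTBEAT_FREQUENCY_SECS, 3);
        // Nimbus marks an executor "not alive" and reassigns it if no
        // heartbeat has been seen for this long (seconds).
        conf.put(Config.NIMBUS_TASK_TIMEOUT_SECS, 30);
        // The supervisor restarts a worker whose local heartbeat is older
        // than this (seconds).
        conf.put(Config.SUPERVISOR_WORKER_TIMEOUT_SECS, 30);
        // ZooKeeper client session timeout (milliseconds); e.g. a GC pause
        // longer than this could make heartbeats arrive late.
        conf.put(Config.STORM_ZOOKEEPER_SESSION_TIMEOUT, 20000);
        return conf;
    }
}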
I hope this is clear; if anything is not, please let me know.
Thanks and regards,
Yiming