Dear All,
I have an issue where nimbus detects executors as not alive and reassigns their tasks:
We are running Storm 0.9.2 with ZooKeeper 3.4.6, in a single-node setup: one Storm instance relying on one ZooKeeper instance on the same machine.
Unexpectedly, after about 1-2 hours of running, nimbus.log reports that no executor is alive, like the following:
2014-10-17 01:30:58 b.s.d.nimbus [INFO] Executor xxx :[2 2] not alive
... (remaining "not alive" lines omitted)
2014-10-17 01:30:58 b.s.d.nimbus [INFO] Executor xxx :[64 67] not alive
2014-10-17 01:30:58 b.s.s.EvenScheduler [INFO] Available slots:
(["e18633fb-6613-4afc-bea1-9941084508c5" 6702]
["e18633fb-6613-4afc-bea1-9941084508c5" 6701]
["e18633fb-6613-4afc-bea1-9941084508c5" 6700])
2014-10-17 01:30:58 b.s.d.nimbus [INFO] Reassigning xxx to 1 slots
2014-10-17 01:30:58 b.s.d.nimbus [INFO] Reassign executors: [[2 2] [3 3] [4 4]
[5 5] [6 6] [7 7] [8 8] [9 9] [10 10] [27 27] [92 92] [93 93] [1 1] [36 39] [68
71] [40 43] [72 75] [11 14] [44 47] [76 79] [15 18] [48 51] [80 83] [19 22] [52
55] [84 87] [23 26] [56 59] [88 91] [28 31] [60 63] [32 35] [64 67]]
2014-10-17 01:30:58 b.s.d.nimbus [INFO] Setting new assignment for topology id
xxx: #backtype.storm.daemon.common.Assignment{:master-code-dir "yyy",
:node->host {"e18633fb-6613-4afc-bea1-9941084508c5" "pfdwlnx1u"},
:executor->node+port {[2 2] ["e18633fb-6613-4afc-bea1-9941084508c5" 6702], [3
3] ["e18633fb-6613-4afc-bea1-9941084508c5" 6702], [4 4]
["e18633fb-6613-4afc-bea1-9941084508c5" 6702], [5 5]
["e18633fb-6613-4afc-bea1-9941084508c5" 6702], [6 6]
["e18633fb-6613-4afc-bea1-9941084508c5" 6702], [7 7]
["e18633fb-6613-4afc-bea1-9941084508c5" 6702], [8 8]
["e18633fb-6613-4afc-bea1-9941084508c5" 6702], [9 9]
["e18633fb-6613-4afc-bea1-9941084508c5" 6702], [10 10]
["e18633fb-6613-4afc-bea1-9941084508c5" 6702], [27 27]
["e18633fb-6613-4afc-bea1-9941084508c5" 6702], [92 92]
["e18633fb-6613-4afc-bea1-9941084508c5" 6702], [93 93]
["e18633fb-6613-4afc-bea1-9941084508c5" 6702], [1 1]
["e18633fb-6613-4afc-bea1-9941084508c5" 6702], [36 39]
["e18633fb-6613-4afc-bea1-9941084508c5" 6702], [68 71]
["e18633fb-6613-4afc-bea1-9941084508c5" 6702], [40 43]
["e18633fb-6613-4afc-bea1-9941084508c5" 6702], [72 75]
["e18633fb-6613-4afc-bea1-9941084508c5" 6702], [11 14]
["e18633fb-6613-4afc-bea1-9941084508c5" 6702], [44 47]
["e18633fb-6613-4afc-bea1-9941084508c5" 6702], [76 79]
["e18633fb-6613-4afc-bea1-9941084508c5" 6702], [15 18]
["e18633fb-6613-4afc-bea1-9941084508c5" 6702], [48 51]
["e18633fb-6613-4afc-bea1-9941084508c5" 6702], [80 83]
["e18633fb-6613-4afc-bea1-9941084508c5" 6702], [19 22]
["e18633fb-6613-4afc-bea1-9941084508c5" 6702], [52 55]
["e18633fb-6613-4afc-bea1-9941084508c5" 6702], [84 87]
["e18633fb-6613-4afc-bea1-9941084508c5" 6702], [23 26]
["e18633fb-6613-4afc-bea1-9941084508c5" 6702], [56 59]
["e18633fb-6613-4afc-bea1-9941084508c5" 6702], [88 91]
["e18633fb-6613-4afc-bea1-9941084508c5" 6702], [28 31]
["e18633fb-6613-4afc-bea1-9941084508c5" 6702], [60 63]
["e18633fb-6613-4afc-bea1-9941084508c5" 6702], [32 35]
["e18633fb-6613-4afc-bea1-9941084508c5" 6702], [64 67]
["e18633fb-6613-4afc-bea1-9941084508c5" 6702]}, :executor->start-time-secs {[2
2] 1413523858, [3 3] 1413523858, [4 4] 1413523858, [5 5] 1413523858, [6 6]
1413523858, [7 7] 1413523858, [8 8] 1413523858, [9 9] 1413523858, [10 10]
1413523858, [27 27] 1413523858, [92 92] 1413523858, [93 93] 1413523858, [1 1]
1413523858, [36 39] 1413523858, [68 71] 1413523858, [40 43] 1413523858, [72 75]
1413523858, [11 14] 1413523858, [44 47] 1413523858, [76 79] 1413523858, [15 18]
1413523858, [48 51] 1413523858, [80 83] 1413523858, [19 22] 1413523858, [52 55]
1413523858, [84 87] 1413523858, [23 26] 1413523858, [56 59] 1413523858, [88 91]
1413523858, [28 31] 1413523858, [60 63] 1413523858, [32 35] 1413523858, [64 67]
1413523858}}
Nimbus then reassigns the tasks to the supervisor. The problem is that we use a Kafka spout to pull messages from the Kafka server, and we observed that certain messages which had already been processed (persisted into a database) before the reassignment were replayed after it. I believe this is because the offsets of those already-processed Kafka messages had not yet been committed to the ZooKeeper server.
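For context, here is roughly how we wire up the spout; this is only a minimal sketch assuming the storm-kafka SpoutConfig API in 0.9.2, and the broker address, topic name, zkRoot and consumer id are placeholders, not our real values. As I understand it, the spout only writes consumed offsets to ZooKeeper every stateUpdateIntervalMs, so anything acked after the last commit but before the reassignment would be replayed.

import backtype.storm.spout.SchemeAsMultiScheme;
import storm.kafka.KafkaSpout;
import storm.kafka.SpoutConfig;
import storm.kafka.StringScheme;
import storm.kafka.ZkHosts;

public class KafkaSpoutSketch {
    public static KafkaSpout buildSpout() {
        // Placeholder broker/topic/zkRoot values -- not our real configuration.
        ZkHosts hosts = new ZkHosts("localhost:2181");
        SpoutConfig spoutConfig = new SpoutConfig(
                hosts,            // ZooKeeper used to discover Kafka brokers
                "my-topic",       // Kafka topic (placeholder)
                "/kafka-spout",   // zkRoot: where consumed offsets are stored in ZooKeeper
                "my-consumer-id"  // consumer id appended under zkRoot
        );
        spoutConfig.scheme = new SchemeAsMultiScheme(new StringScheme());
        // Offsets are committed to ZooKeeper only periodically (default 2000 ms),
        // so messages acked since the last commit are replayed after a reassignment.
        spoutConfig.stateUpdateIntervalMs = 2000;
        return new KafkaSpout(spoutConfig);
    }
}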
So my questions are:
1) I know replay is part of the fault-tolerance behavior provided by Storm and Kafka, but is this a normal case? Or does the "not alive" issue indicate that something is wrong?
2) If the "not alive" executors do point to a potential issue, how can I check in detail why a task is considered not alive? As far as I can tell, the nimbus and supervisor processes never died during our test. The timeout settings we have been reading about while investigating are sketched below.
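For reference, a minimal sketch of the heartbeat/timeout settings we believe are involved when nimbus decides an executor is not alive; the values shown are just the documented defaults as we understand them (we have not tuned any of these), not something we have verified fixes the problem.

import backtype.storm.Config;

public class TimeoutSettingsSketch {
    public static Config timeouts() {
        Config conf = new Config();
        // Executors write heartbeats into ZooKeeper this often (seconds).
        conf.put(Config.TASK_HEARTBEAT_FREQUENCY_SECS, 3);
        // Nimbus marks an executor "not alive" and reassigns it if no
        // heartbeat has been seen for this long (seconds).
        conf.put(Config.NIMBUS_TASK_TIMEOUT_SECS, 30);
        // The supervisor restarts a worker whose local heartbeat is older
        // than this (seconds).
        conf.put(Config.SUPERVISOR_WORKER_TIMEOUT_SECS, 30);
        // ZooKeeper client session timeout (milliseconds); e.g. a GC pause
        // longer than this could make heartbeats arrive late.
        conf.put(Config.STORM_ZOOKEEPER_SESSION_TIMEOUT, 20000);
        return conf;
    }
}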
I hope this is clear; if anything is not, please let me know.
Thanks and regards,
Yiming