Do you see any errors in the worker logs that are causing executor failures?
Are you acking the tuple only once it's successfully written to the database?
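If not, here is a minimal sketch of that ordering. The `Db` interface and the `acked`/`failed` lists are illustrative stand-ins, not the Storm API; in a real bolt the same pattern applies to `OutputCollector.ack`/`fail` on the input `Tuple`:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class AckAfterWrite {
    // Stand-in for the database write; a real bolt would persist the tuple here.
    interface Db { void insert(String msg) throws Exception; }

    static final List<String> acked = new ArrayList<>();
    static final List<String> failed = new ArrayList<>();

    // Ack only AFTER the write succeeds; fail on error so the spout replays it.
    static void process(String tuple, Db db) {
        try {
            db.insert(tuple);
            acked.add(tuple);   // collector.ack(tuple) in a real bolt
        } catch (Exception e) {
            failed.add(tuple);  // collector.fail(tuple) in a real bolt
        }
    }

    public static void main(String[] args) {
        Set<String> stored = new HashSet<>();
        Db db = msg -> {
            if (msg.startsWith("bad")) throw new Exception("simulated write error");
            stored.add(msg);
        };
        process("msg-1", db);
        process("bad-2", db);
        System.out.println("acked=" + acked + " failed=" + failed);
    }
}
```

If you ack before the write, a crash between the ack and the insert loses the message for good; acking after the write trades that for occasional replays.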
On Thu, Oct 16, 2014, at 11:15 PM, Fang, Yiming wrote:
> Dear All,
>
> I have an issue where Nimbus detects executors as not alive and reassigns tasks:
>
> We are running Storm 0.9.2 with ZooKeeper 3.4.6, as a single node: one
> instance of Storm relying on one instance of ZooKeeper on the same
> machine.
>
> After about 1-2 hours of running, we unexpectedly hit the following issue:
> nimbus.log reports that no executor is alive, like this:
>
> 2014-10-17 01:30:58 b.s.d.nimbus [INFO] Executor xxx :[2 2] not alive
> ... (similar "not alive" lines omitted)
> 2014-10-17 01:30:58 b.s.d.nimbus [INFO] Executor xxx :[64 67] not alive
> 2014-10-17 01:30:58 b.s.s.EvenScheduler [INFO] Available slots:
> (["e18633fb-6613-4afc-bea1-9941084508c5" 6702]
> ["e18633fb-6613-4afc-bea1-9941084508c5" 6701]
> ["e18633fb-6613-4afc-bea1-9941084508c5" 6700])
> 2014-10-17 01:30:58 b.s.d.nimbus [INFO] Reassigning xxx to 1 slots
> 2014-10-17 01:30:58 b.s.d.nimbus [INFO] Reassign executors: [[2 2] [3 3]
> [4 4] [5 5] [6 6] [7 7] [8 8] [9 9] [10 10] [27 27] [92 92] [93 93] [1 1]
> [36 39] [68 71] [40 43] [72 75] [11 14] [44 47] [76 79] [15 18] [48 51]
> [80 83] [19 22] [52 55] [84 87] [23 26] [56 59] [88 91] [28 31] [60 63]
> [32 35] [64 67]]
> 2014-10-17 01:30:58 b.s.d.nimbus [INFO] Setting new assignment for
> topology id xxx:
> #backtype.storm.daemon.common.Assignment{:master-code-dir "yyy",
> :node->host {"e18633fb-6613-4afc-bea1-9941084508c5" "pfdwlnx1u"},
> :executor->node+port {[2 2] ["e18633fb-6613-4afc-bea1-9941084508c5"
> 6702], [3 3] ["e18633fb-6613-4afc-bea1-9941084508c5" 6702], [4 4]
> ["e18633fb-6613-4afc-bea1-9941084508c5" 6702], [5 5]
> ["e18633fb-6613-4afc-bea1-9941084508c5" 6702], [6 6]
> ["e18633fb-6613-4afc-bea1-9941084508c5" 6702], [7 7]
> ["e18633fb-6613-4afc-bea1-9941084508c5" 6702], [8 8]
> ["e18633fb-6613-4afc-bea1-9941084508c5" 6702], [9 9]
> ["e18633fb-6613-4afc-bea1-9941084508c5" 6702], [10 10]
> ["e18633fb-6613-4afc-bea1-9941084508c5" 6702], [27 27]
> ["e18633fb-6613-4afc-bea1-9941084508c5" 6702], [92 92]
> ["e18633fb-6613-4afc-bea1-9941084508c5" 6702], [93 93]
> ["e18633fb-6613-4afc-bea1-9941084508c5" 6702], [1 1]
> ["e18633fb-6613-4afc-bea1-9941084508c5" 6702], [36 39]
> ["e18633fb-6613-4afc-bea1-9941084508c5" 6702], [68 71]
> ["e18633fb-6613-4afc-bea1-9941084508c5" 6702], [40 43]
> ["e18633fb-6613-4afc-bea1-9941084508c5" 6702], [72 75]
> ["e18633fb-6613-4afc-bea1-9941084508c5" 6702], [11 14]
> ["e18633fb-6613-4afc-bea1-9941084508c5" 6702], [44 47]
> ["e18633fb-6613-4afc-bea1-9941084508c5" 6702], [76 79]
> ["e18633fb-6613-4afc-bea1-9941084508c5" 6702], [15 18]
> ["e18633fb-6613-4afc-bea1-9941084508c5" 6702], [48 51]
> ["e18633fb-6613-4afc-bea1-9941084508c5" 6702], [80 83]
> ["e18633fb-6613-4afc-bea1-9941084508c5" 6702], [19 22]
> ["e18633fb-6613-4afc-bea1-9941084508c5" 6702], [52 55]
> ["e18633fb-6613-4afc-bea1-9941084508c5" 6702], [84 87]
> ["e18633fb-6613-4afc-bea1-9941084508c5" 6702], [23 26]
> ["e18633fb-6613-4afc-bea1-9941084508c5" 6702], [56 59]
> ["e18633fb-6613-4afc-bea1-9941084508c5" 6702], [88 91]
> ["e18633fb-6613-4afc-bea1-9941084508c5" 6702], [28 31]
> ["e18633fb-6613-4afc-bea1-9941084508c5" 6702], [60 63]
> ["e18633fb-6613-4afc-bea1-9941084508c5" 6702], [32 35]
> ["e18633fb-6613-4afc-bea1-9941084508c5" 6702], [64 67]
> ["e18633fb-6613-4afc-bea1-9941084508c5" 6702]},
> :executor->start-time-secs {[2 2] 1413523858, [3 3] 1413523858, [4 4]
> 1413523858, [5 5] 1413523858, [6 6] 1413523858, [7 7] 1413523858, [8 8]
> 1413523858, [9 9] 1413523858, [10 10] 1413523858, [27 27] 1413523858, [92
> 92] 1413523858, [93 93] 1413523858, [1 1] 1413523858, [36 39] 1413523858,
> [68 71] 1413523858, [40 43] 1413523858, [72 75] 1413523858, [11 14]
> 1413523858, [44 47] 1413523858, [76 79] 1413523858, [15 18] 1413523858,
> [48 51] 1413523858, [80 83] 1413523858, [19 22] 1413523858, [52 55]
> 1413523858, [84 87] 1413523858, [23 26] 1413523858, [56 59] 1413523858,
> [88 91] 1413523858, [28 31] 1413523858, [60 63] 1413523858, [32 35]
> 1413523858, [64 67] 1413523858}}
>
> Then it reassigns the tasks to the supervisor. The problem is that we use
> the Kafka spout to get messages from the Kafka server, and we observed
> that certain messages which had already been processed (persisted into a
> database) before the reassignment were replayed afterwards. I believe the
> offsets of those already-processed Kafka messages had not yet been
> committed to the ZooKeeper server.
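Regarding the replays: yes, that is the expected at-least-once behavior. If the spout's offset was not yet committed to ZooKeeper when the tasks were reassigned, the restarted spout replays from the last committed offset. A common mitigation is to make the database write idempotent, for example keyed by partition and offset. A minimal sketch, where the `Set` is a stand-in for a unique-key constraint in the real database:

```java
import java.util.HashSet;
import java.util.Set;

public class IdempotentSink {
    // Stand-in for a unique-key / "insert if absent" constraint in the database.
    private final Set<String> seenIds = new HashSet<>();

    // Returns true only the first time a given message id is stored; a
    // message replayed after a reassignment is recognized and skipped.
    public boolean store(String messageId, String payload) {
        if (!seenIds.add(messageId)) {
            return false; // already persisted before the reassignment
        }
        // ... real database insert would happen here ...
        return true;
    }

    public static void main(String[] args) {
        IdempotentSink sink = new IdempotentSink();
        System.out.println(sink.store("partition0-offset42", "hello")); // first delivery
        System.out.println(sink.store("partition0-offset42", "hello")); // replay
    }
}
```

With a real database you would enforce this with a unique index on the message key instead of in-memory state, so it survives worker restarts.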
>
> So my questions are:
> 1) I know this is one of the fault-tolerance features provided by Storm
> and Kafka, but is this a normal case? Is anything wrong with the "not
> alive" issue?
> 2) If there is a potential issue behind tasks being not alive, how can I
> check in detail why a task is not alive? I believe the Nimbus and
> supervisor processes never died in our test case.
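Regarding 2): Nimbus marks an executor "not alive" when its heartbeat in ZooKeeper is older than `nimbus.task.timeout.secs`, so the worker JVM does not have to die; a long GC pause or a ZooKeeper session timeout on a loaded single node is enough. Check the worker logs around 01:30 for GC pauses or ZooKeeper disconnects. The relevant storm.yaml knobs (the values below are, to the best of my knowledge, the 0.9.x defaults; verify against your defaults.yaml):

```yaml
# Raise these if long GC pauses cause missed heartbeats (defaults shown).
nimbus.task.timeout.secs: 30            # executor declared "not alive" past this
nimbus.monitor.freq.secs: 10            # how often Nimbus checks heartbeats
supervisor.worker.timeout.secs: 30      # supervisor restarts a silent worker
storm.zookeeper.session.timeout: 20000  # ms; session expiry also loses heartbeats
```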
>
> I hope I have made things clear; if anything is unclear, let me know.
>
> Thanks and regards,
> Yiming
>