Can you take a stack trace and post it here? Maybe that could give a clue as to where the task is stuck.
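(For reference: a thread dump of the stuck worker JVM can be captured with the standard JDK tool jstack, e.g. "jstack <worker-pid>", where the worker PID can be found with "jps"; threads blocked in a bolt's prepare() will show up there.)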
On Thu, Sep 3, 2015 at 6:42 PM, John Yost <[email protected]> wrote:

Hi Nick,

Gotcha, OK, the worker starts but the task does not. That's interesting. I am pretty sure that, in my experience, if tasks assigned to a worker don't start, then the worker is killed by the supervisor. Admittedly, I am still a Storm newbie, so I may be missing something. :) Can you confirm whether the supervisor is killing the workers when the tasks don't start?

--John

On Thu, Sep 3, 2015 at 9:09 AM, Nick R. Katsipoulakis <[email protected]> wrote:

Hello all,

@John: The supervisor logs say nothing, except for the normal startup messages.

@Abhishek: No, the worker starts properly. The task itself does not start.

Thanks,
Nick

2015-09-03 9:00 GMT-04:00 Abhishek Agarwal <[email protected]>:

When you say that tasks do not start, do you mean that the worker process itself is not starting?

On Thu, Sep 3, 2015 at 5:20 PM, John Yost <[email protected]> wrote:

Hi Nick,

What do the nimbus and supervisor logs say? One or both may contain clues as to why your workers are not starting up.

--John

On Thu, Sep 3, 2015 at 4:44 AM, Matthias J. Sax <[email protected]> wrote:

I am currently working with version 0.11.0-SNAPSHOT and cannot observe the behavior you describe. If I submit a sample topology with 1 spout (dop=1) and 1 bolt (dop=10) connected via shuffle grouping, and have 12 supervisors available (each with 12 worker slots), each of the 11 executors runs on a single worker of a single supervisor (host).

I have no idea why you observe a different behavior...

-Matthias

On 09/03/2015 12:20 AM, Nick R. Katsipoulakis wrote:

When I say co-locate, what I have seen in my experiments is the following:

If the number of executors can be served by the workers of one node, the scheduler spawns all the executors in the workers of that node. I have also seen that the default scheduler tries to fill up one node before provisioning an additional one for the topology.

Going back to your sentence "and the executors should be evenly distributed over all available workers," I have to say that I do not see that often in my experiments. Actually, I often come across workers handling 2-3 executors/tasks while others do nothing. Am I missing something? Is it just a coincidence that this happened in my experiments?

Thank you,
Nick

2015-09-02 17:38 GMT-04:00 Matthias J. Sax <[email protected]>:

I agree. The load is not high.

About the higher latencies: how many ackers did you configure? As a rule of thumb, there should be one acker per executor. If you have fewer ackers and an increasing number of executors, this might cause the increased latency, as the ackers could become a bottleneck.

What do you mean by "trying to co-locate tasks and executors as much as possible"? Tasks are logical units of work that are processed by executors (which are threads). Furthermore (as far as I know), the default scheduler does an evenly distributed assignment of tasks and executors to the available workers. In your case, as you set the number of tasks equal to the number of executors, each executor processes a single task, and the executors should be evenly distributed over all available workers.

However, you are right: intra-worker channels are "cheaper" than inter-worker channels. In order to exploit this, you should use shuffle-or-local grouping instead of shuffle. The disadvantage of shuffle-or-local might be missing load balancing; shuffle always ensures good load balancing.

-Matthias
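(Matthias's two suggestions above would look roughly like this in the 0.x Java API; a minimal sketch only, where MySpout and MyBolt are hypothetical placeholders for your own components.)

import backtype.storm.Config;
import backtype.storm.StormSubmitter;
import backtype.storm.topology.TopologyBuilder;

public class AckerSketch {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("spout", new MySpout(), 1);   // dop = 1
        builder.setBolt("bolt", new MyBolt(), 10)      // dop = 10 -> 10 executors
               // Prefers in-worker channels over the network; plain
               // shuffleGrouping("spout") balances load better.
               .localOrShuffleGrouping("spout");

        Config conf = new Config();
        conf.setNumWorkers(8);   // e.g., 4 supervisors x 2 worker slots
        conf.setNumAckers(11);   // rule of thumb above: one acker per executor
                                 // (1 spout + 10 bolt executors)
        StormSubmitter.submitTopology("acker-sketch", conf, builder.createTopology());
    }
}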
On 09/02/2015 10:31 PM, Nick R. Katsipoulakis wrote:

Well, my input load is 4 streams at 4000 tuples per second, and each tuple is about 128 bytes long. Therefore, I do not think my load is too much for my hardware.

No, I am running only this topology in my cluster.

For some reason, when I set the task-to-executor ratio to 1, my topology does not hang at all. The strange thing now is that I see higher latency with more executors, and I am trying to figure this out. Also, I see that the default scheduler is trying to co-locate tasks and executors as much as possible. Is this true? If yes, is it because the intra-worker latencies are much lower than the inter-worker latencies?

Thanks,
Nick

2015-09-02 16:27 GMT-04:00 Matthias J. Sax <[email protected]>:

So (for each node) you have 4 cores available for 1 supervisor JVM and 2 worker JVMs that execute up to 5 threads each (if the ~40 executors are distributed evenly over all workers). Thus, about 12 threads for 4 cores. Of course, Storm starts a few more threads within each worker/supervisor.

If your load is not huge, this might be sufficient. However, at a high data rate, it might be problematic.

One more question: do you run a single topology in your cluster, or multiple? Storm isolates topologies for fault-tolerance reasons; thus, a single worker cannot process executors from different topologies. If you run out of workers, a topology might not start up completely.

-Matthias

On 09/02/2015 09:54 PM, Nick R. Katsipoulakis wrote:

Hello Matthias, and thank you for your reply. See my answers below:

- I have 4 supervisor nodes in my AWS cluster of m4.xlarge instances (4 cores per node). On top of that, I have 3 more nodes for ZooKeeper and Nimbus.
- 2 workers per supervisor node.
- The task number for each bolt ranges from 1 to 4, and I use a 1:1 task-to-executor assignment.
- The total number of executors for the topology ranges from 14 to 41.

Thanks,
Nick
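(The setup Nick describes maps onto the topology configuration roughly as below; a sketch under the stated assumptions, with MySpout/MyBolt again as hypothetical placeholders.)

import backtype.storm.Config;
import backtype.storm.topology.TopologyBuilder;

public class ParallelismSketch {
    public static void main(String[] args) {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("spout", new MySpout(), 1);
        builder.setBolt("bolt", new MyBolt(), 4)  // 4 executors (parallelism hint)...
               .setNumTasks(4)                    // ...and exactly 4 tasks: the 1:1
                                                  // ratio that avoided the hang
               .shuffleGrouping("spout");

        Config conf = new Config();
        conf.setNumWorkers(8);  // 4 supervisor nodes x 2 worker slots each
        // Omitting setNumTasks defaults the task count to the executor count;
        // a higher value (e.g., setNumTasks(8)) yields a >1 task-to-executor ratio.
    }
}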
2015-09-02 15:42 GMT-04:00 Matthias J. Sax <[email protected]>:

Without any exception/error message it is hard to tell.

What is your cluster setup?
- Hardware, i.e., number of cores per node?
- How many nodes/supervisors are available?
- Configured number of workers for the topology?
- What is the number of tasks for each spout/bolt?
- What is the number of executors for each spout/bolt?

-Matthias

On 09/02/2015 08:02 PM, Nick R. Katsipoulakis wrote:

Hello all,

I am working on a project in which I submit a topology to my Storm cluster, but for some reason some of my tasks do not start executing.

I can see that this is happening because every bolt I have needs to connect to an external server and register with a service; however, some of the bolts do not seem to connect.

I have to say that the number of tasks I have is larger than the number of workers in my cluster. Also, when I check my worker log files, I see that the workers that do not register are also not writing some initialization messages I have them print at the beginning.

Any idea why this is happening? Could it be that my resources are not enough to start all of the tasks?

Thank you,
Nick

--
Nikolaos Romanos Katsipoulakis,
University of Pittsburgh, PhD candidate

--
Regards,
Abhishek Agarwal
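(The registration pattern Nick describes in his first message typically lives in a bolt's prepare() method, and a blocking registration call there with no timeout can keep an executor from ever reaching execute(). A hedged sketch only; ExternalRegistrationClient and its methods are hypothetical stand-ins for the real service client.)

import java.util.Map;
import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.tuple.Tuple;

public class RegisteringBolt extends BaseRichBolt {
    private OutputCollector collector;

    @Override
    public void prepare(Map stormConf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
        // Hypothetical client for the external registration service.
        // Registering with an explicit timeout (and logging failures) makes a
        // stuck registration visible in the worker log instead of silently
        // hanging the task before its first initialization message.
        ExternalRegistrationClient client = new ExternalRegistrationClient("service-host", 9999);
        client.registerWithTimeout(context.getThisTaskId(), 10_000 /* ms */);
    }

    @Override
    public void execute(Tuple input) {
        // ... actual processing ...
        collector.ack(input);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // no output streams in this sketch
    }
}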
