Can you take a stack trace and post it here? Maybe that could give a clue as to where the task is stuck.
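(For reference: a thread dump of the stuck worker JVM can be captured with the standard JDK tool jstack, e.g. "jstack <worker-pid>", where the worker PID can be found with "jps"; threads blocked in a bolt's prepare() will show up there.)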
On Thu, Sep 3, 2015 at 6:42 PM, John Yost <[email protected]> wrote:

Hi Nick,

Gotcha, OK, the worker starts but the task does not. That's interesting. I am pretty sure that, in my experience, if tasks assigned to a worker don't start, then the worker is killed by the supervisor. Admittedly, I am still a Storm newbie, so I may be missing something. :) Can you confirm whether the supervisor is killing the workers when the tasks don't start?

--John

On Thu, Sep 3, 2015 at 9:09 AM, Nick R. Katsipoulakis <[email protected]> wrote:

Hello all,

@John: The supervisor logs say nothing, except for the normal startup messages.

@Abhishek: No, the worker starts properly. The task itself does not start.

Thanks,
Nick

2015-09-03 9:00 GMT-04:00 Abhishek Agarwal <[email protected]>:

When you say that tasks do not start, do you mean that the worker process itself is not starting?

On Thu, Sep 3, 2015 at 5:20 PM, John Yost <[email protected]> wrote:

Hi Nick,

What do the nimbus and supervisor logs say? One or both may contain clues as to why your workers are not starting up.

--John

On Thu, Sep 3, 2015 at 4:44 AM, Matthias J. Sax <[email protected]> wrote:

I am currently working with version 0.11.0-SNAPSHOT and cannot observe the behavior you describe. If I submit a sample topology with 1 spout (dop=1) and 1 bolt (dop=10) connected via shuffle grouping, and have 12 supervisors available (each with 12 worker slots), each of the 11 executors runs on a single worker of a single supervisor (host).

I have no idea why you observe a different behavior...

-Matthias

On 09/03/2015 12:20 AM, Nick R. Katsipoulakis wrote:

When I say co-locate, what I have seen in my experiments is the following:

If the number of executors can be served by the workers of one node, the scheduler spawns all the executors in the workers of that node. I have also seen that the default scheduler tries to fill up one node before provisioning an additional one for the topology.

Going back to your sentence "and the executors should be evenly distributed over all available workers," I have to say that I do not see that often in my experiments. Actually, I often come across workers handling 2-3 executors/tasks while others do nothing. Am I missing something? Is it just a coincidence that this happened in my experiments?

Thank you,
Nick

2015-09-02 17:38 GMT-04:00 Matthias J. Sax <[email protected]>:

I agree. The load is not high.

About the higher latencies: how many ackers did you configure? As a rule of thumb, there should be one acker per executor. If you have fewer ackers and an increasing number of executors, this might cause the increased latency, as the ackers could become a bottleneck.

What do you mean by "trying to co-locate tasks and executors as much as possible"? Tasks are logical units of work that are processed by executors (which are threads). Furthermore (as far as I know), the default scheduler does an evenly distributed assignment of tasks and executors to the available workers. In your case, as you set the number of tasks equal to the number of executors, each executor processes a single task, and the executors should be evenly distributed over all available workers.

However, you are right: intra-worker channels are "cheaper" than inter-worker channels. In order to exploit this, you should use shuffle-or-local grouping instead of shuffle. The disadvantage of shuffle-or-local might be missing load balancing; shuffle always ensures good load balancing.

-Matthias
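(Matthias's two suggestions above would look roughly like this in the 0.x Java API; a minimal sketch only, where MySpout and MyBolt are hypothetical placeholders for your own components.)

import backtype.storm.Config;
import backtype.storm.StormSubmitter;
import backtype.storm.topology.TopologyBuilder;

public class AckerSketch {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("spout", new MySpout(), 1);   // dop = 1
        builder.setBolt("bolt", new MyBolt(), 10)      // dop = 10 -> 10 executors
               // Prefers in-worker channels over the network; plain
               // shuffleGrouping("spout") balances load better.
               .localOrShuffleGrouping("spout");

        Config conf = new Config();
        conf.setNumWorkers(8);   // e.g., 4 supervisors x 2 worker slots
        conf.setNumAckers(11);   // rule of thumb above: one acker per executor
                                 // (1 spout + 10 bolt executors)
        StormSubmitter.submitTopology("acker-sketch", conf, builder.createTopology());
    }
}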
On 09/02/2015 10:31 PM, Nick R. Katsipoulakis wrote:

Well, my input load is 4 streams at 4000 tuples per second, and each tuple is about 128 bytes long. Therefore, I do not think my load is too much for my hardware.

No, I am running only this topology in my cluster.

For some reason, when I set the task-to-executor ratio to 1, my topology does not hang at all. The strange thing now is that I see higher latency with more executors, and I am trying to figure this out. Also, I see that the default scheduler is trying to co-locate tasks and executors as much as possible. Is this true? If yes, is it because the intra-worker latencies are much lower than the inter-worker latencies?

Thanks,
Nick

2015-09-02 16:27 GMT-04:00 Matthias J. Sax <[email protected]>:

So (for each node) you have 4 cores available for 1 supervisor JVM and 2 worker JVMs that execute up to 5 threads each (if the ~40 executors are distributed evenly over all workers). Thus, about 12 threads for 4 cores. Of course, Storm starts a few more threads within each worker/supervisor.

If your load is not huge, this might be sufficient. However, at a high data rate, it might be problematic.

One more question: do you run a single topology in your cluster, or multiple? Storm isolates topologies for fault-tolerance reasons; thus, a single worker cannot process executors from different topologies. If you run out of workers, a topology might not start up completely.

-Matthias

On 09/02/2015 09:54 PM, Nick R. Katsipoulakis wrote:

Hello Matthias, and thank you for your reply. See my answers below:

- I have 4 supervisor nodes in my AWS cluster of m4.xlarge instances (4 cores per node). On top of that, I have 3 more nodes for ZooKeeper and Nimbus.
- 2 workers per supervisor node.
- The task number for each bolt ranges from 1 to 4, and I use a 1:1 task-to-executor assignment.
- The total number of executors for the topology ranges from 14 to 41.

Thanks,
Nick
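(The setup Nick describes maps onto the topology configuration roughly as below; a sketch under the stated assumptions, with MySpout/MyBolt again as hypothetical placeholders.)

import backtype.storm.Config;
import backtype.storm.topology.TopologyBuilder;

public class ParallelismSketch {
    public static void main(String[] args) {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("spout", new MySpout(), 1);
        builder.setBolt("bolt", new MyBolt(), 4)  // 4 executors (parallelism hint)...
               .setNumTasks(4)                    // ...and exactly 4 tasks: the 1:1
                                                  // ratio that avoided the hang
               .shuffleGrouping("spout");

        Config conf = new Config();
        conf.setNumWorkers(8);  // 4 supervisor nodes x 2 worker slots each
        // Omitting setNumTasks defaults the task count to the executor count;
        // a higher value (e.g., setNumTasks(8)) yields a >1 task-to-executor ratio.
    }
}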
2015-09-02 15:42 GMT-04:00 Matthias J. Sax <[email protected]>:

Without any exception/error message it is hard to tell.

What is your cluster setup?
- Hardware, i.e., number of cores per node?
- How many nodes/supervisors are available?
- Configured number of workers for the topology?
- What is the number of tasks for each spout/bolt?
- What is the number of executors for each spout/bolt?

-Matthias

On 09/02/2015 08:02 PM, Nick R. Katsipoulakis wrote:

Hello all,

I am working on a project in which I submit a topology to my Storm cluster, but for some reason some of my tasks do not start executing.

I can see that this is happening because every bolt I have needs to connect to an external server and register with a service; however, some of the bolts do not seem to connect.

I have to say that the number of tasks I have is larger than the number of workers in my cluster. Also, when I check my worker log files, I see that the workers that do not register are also not writing some initialization messages I have them print at the beginning.

Any idea why this is happening? Could it be that my resources are not enough to start all of the tasks?

Thank you,
Nick

--
Nikolaos Romanos Katsipoulakis,
University of Pittsburgh, PhD candidate

--
Regards,
Abhishek Agarwal
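(The registration pattern Nick describes in his first message typically lives in a bolt's prepare() method, and a blocking registration call there with no timeout can keep an executor from ever reaching execute(). A hedged sketch only; ExternalRegistrationClient and its methods are hypothetical stand-ins for the real service client.)

import java.util.Map;
import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.tuple.Tuple;

public class RegisteringBolt extends BaseRichBolt {
    private OutputCollector collector;

    @Override
    public void prepare(Map stormConf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
        // Hypothetical client for the external registration service.
        // Registering with an explicit timeout (and logging failures) makes a
        // stuck registration visible in the worker log instead of silently
        // hanging the task before its first initialization message.
        ExternalRegistrationClient client = new ExternalRegistrationClient("service-host", 9999);
        client.registerWithTimeout(context.getThisTaskId(), 10_000 /* ms */);
    }

    @Override
    public void execute(Tuple input) {
        // ... actual processing ...
        collector.ack(input);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // no output streams in this sketch
    }
}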
