Hello all,

For some reason, when I set the ratio of tasks to executors to 1:1, I do not see the problem of non-starting tasks.
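To illustrate why the 1:1 ratio might matter here, a toy sequential model (made-up names, not Storm's actual executor internals) of one executor thread serving two tasks from a single queue; one blocking call holds up everything behind it:

```python
import time
from collections import deque

# Toy model of one Storm executor: a single thread draining one queue
# for all of its tasks. If the first task blocks on I/O, tuples for the
# second task just sit in the queue until the blocking call returns.

def blocking_task(tup, log):
    log.append(("A", tup))
    time.sleep(0.2)          # stands in for the custom blocking I/O call

def quick_task(tup, log):
    log.append(("B", tup))

log = []
queue = deque([(blocking_task, "t1"), (quick_task, "t2")])

start = time.monotonic()
while queue:                 # one executor = one sequential drain loop
    task, tup = queue.popleft()
    task(tup, log)
elapsed = time.monotonic() - start

print(log)                   # [('A', 't1'), ('B', 't2')]
print(elapsed >= 0.2)        # True: B's tuple waited out A's "I/O"
```

With one task per executor, each of these tasks would have its own thread and B would not have to wait for A.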
From what I understand, when I have an executor (thread) handling 2 tasks and the first task blocks on I/O (in my custom code), the other task is not executed, because the executor is waiting on that I/O.

Cheers,
Nick

2015-09-03 9:08 GMT-04:00 Ritesh Sinha <[email protected]>:

> If your workers are not starting, try executing the command which the
> supervisor executes to start the workers. You can get those commands from
> the supervisor logs. Execute the command to manually start the workers and
> see if you are getting any error.
>
> On Thu, Sep 3, 2015 at 6:30 PM, Abhishek Agarwal <[email protected]> wrote:
>
>> When you say that tasks do not start, do you mean that the worker process
>> itself is not starting?
>>
>> On Thu, Sep 3, 2015 at 5:20 PM, John Yost <[email protected]> wrote:
>>
>>> Hi Nick,
>>>
>>> What do the nimbus and supervisor logs say? One or both may contain
>>> clues as to why your workers are not starting up.
>>>
>>> --John
>>>
>>> On Thu, Sep 3, 2015 at 4:44 AM, Matthias J. Sax <[email protected]> wrote:
>>>
>>>> I am currently working with version 0.11.0-SNAPSHOT and cannot observe
>>>> the behavior you describe. If I submit a sample topology with 1 spout
>>>> (dop=1) and 1 bolt (dop=10) connected via shuffle grouping, and have
>>>> 12 supervisors available (each with 12 worker slots), each of the 11
>>>> executors is running on a single worker of a single supervisor (host).
>>>>
>>>> I have no idea why you observe a different behavior...
>>>>
>>>> -Matthias
>>>>
>>>> On 09/03/2015 12:20 AM, Nick R. Katsipoulakis wrote:
>>>> > When I say co-locate, what I have seen in my experiments is the
>>>> > following:
>>>> >
>>>> > If the number of executors can be served by the workers of one node,
>>>> > the scheduler spawns all the executors in the workers of that node.
>>>> > I have also seen that the default scheduler tries to fill up one
>>>> > node before provisioning an additional one for the topology.
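For comparison with the fill-up behavior described above, here is a minimal round-robin sketch of an even executor-to-slot assignment (hypothetical helper and slot names; the real default scheduler is more involved than this):

```python
# Hypothetical sketch of an even (round-robin) assignment of executors to
# worker slots -- the "evenly distributed" behavior discussed in this thread.
def assign_round_robin(executors, slots):
    placement = {s: [] for s in slots}
    for i, e in enumerate(executors):
        placement[slots[i % len(slots)]].append(e)
    return placement

slots = ["node1:6700", "node1:6701", "node2:6700", "node2:6701"]
executors = ["e%d" % i for i in range(6)]
placement = assign_round_robin(executors, slots)

print(placement)
# Even spread: no slot is left idle while another holds everything.
print(all(1 <= len(v) <= 2 for v in placement.values()))  # True
```

Under round-robin, 6 executors over 4 slots gives each slot 1 or 2 executors; the fill-up behavior reported above would instead pack all 6 onto the two node1 slots.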
>>>> >
>>>> > Going back to your sentence "and the executors should be evenly
>>>> > distributed over all available workers": I have to say that I do
>>>> > not see that often in my experiments. Actually, I often come across
>>>> > workers handling 2-3 executors/tasks while others do nothing. Am I
>>>> > missing something? Is it just a coincidence that this happened in
>>>> > my experiments?
>>>> >
>>>> > Thank you,
>>>> > Nick
>>>> >
>>>> > 2015-09-02 17:38 GMT-04:00 Matthias J. Sax <[email protected]>:
>>>> >
>>>> > I agree. The load is not high.
>>>> >
>>>> > About the higher latencies: how many ackers did you configure? As a
>>>> > rule of thumb, there should be one acker per executor. If you have
>>>> > fewer ackers and an increasing number of executors, this might
>>>> > cause the increased latency, as the ackers could become a
>>>> > bottleneck.
>>>> >
>>>> > What do you mean by "trying to co-locate tasks and executors as
>>>> > much as possible"? Tasks are logical units of work that are
>>>> > processed by executors (which are threads). Furthermore (as far as
>>>> > I know), the default scheduler assigns tasks and executors evenly
>>>> > over the available workers. In your case, as you set the number of
>>>> > tasks equal to the number of executors, each executor processes a
>>>> > single task, and the executors should be evenly distributed over
>>>> > all available workers.
>>>> >
>>>> > However, you are right: intra-worker channels are "cheaper" than
>>>> > inter-worker channels. In order to exploit this, you should use
>>>> > local-or-shuffle grouping instead of shuffle. The disadvantage of
>>>> > local-or-shuffle might be missing load balancing; shuffle always
>>>> > ensures good load balancing.
>>>> >
>>>> > -Matthias
>>>> >
>>>> > On 09/02/2015 10:31 PM, Nick R. Katsipoulakis wrote:
>>>> > > Well, my input load is 4 streams at 4,000 tuples per second, and
>>>> > > each tuple is about 128 bytes long. Therefore, I do not think my
>>>> > > load is too much for my hardware.
>>>> > >
>>>> > > No, I am running only this topology in my cluster.
>>>> > >
>>>> > > For some reason, when I set the task-to-executor ratio to 1, my
>>>> > > topology does not hang at all. The strange thing now is that I
>>>> > > see higher latency with more executors, and I am trying to figure
>>>> > > this out. Also, I see that the default scheduler is trying to
>>>> > > co-locate tasks and executors as much as possible. Is this true?
>>>> > > If yes, is it because the intra-worker latencies are much lower
>>>> > > than the inter-worker latencies?
>>>> > >
>>>> > > Thanks,
>>>> > > Nick
>>>> > >
>>>> > > 2015-09-02 16:27 GMT-04:00 Matthias J. Sax <[email protected]>:
>>>> > >
>>>> > > So (for each node) you have 4 cores available for 1 supervisor
>>>> > > JVM and 2 worker JVMs that execute up to 5 threads each (if the
>>>> > > 40 executors are distributed evenly over all workers). Thus,
>>>> > > about 12 threads for 4 cores. Of course, Storm starts a few more
>>>> > > threads within each worker/supervisor.
>>>> > >
>>>> > > If your load is not huge, this might be sufficient. However, with
>>>> > > a high data rate, it might be problematic.
>>>> > >
>>>> > > One more question: do you run a single topology in your cluster,
>>>> > > or multiple? Storm isolates topologies for fault-tolerance
>>>> > > reasons; thus, a single worker cannot process executors from
>>>> > > different topologies. If you run out of workers, a topology might
>>>> > > not start up completely.
>>>> > >
>>>> > > -Matthias
>>>> > >
>>>> > > On 09/02/2015 09:54 PM, Nick R. Katsipoulakis wrote:
>>>> > > > Hello Matthias, and thank you for your reply. See my answers
>>>> > > > below:
>>>> > > >
>>>> > > > - I have 4 supervisor nodes in my AWS cluster of m4.xlarge
>>>> > > >   instances (4 cores per node). On top of that, I have 3 more
>>>> > > >   nodes for ZooKeeper and Nimbus.
>>>> > > > - 2 workers per supervisor node.
>>>> > > > - The task number for each bolt ranges from 1 to 4, and I use a
>>>> > > >   1:1 task-to-executor assignment.
>>>> > > > - The total number of executors for the topology ranges from 14
>>>> > > >   to 41.
>>>> > > >
>>>> > > > Thanks,
>>>> > > > Nick
>>>> > > >
>>>> > > > 2015-09-02 15:42 GMT-04:00 Matthias J. Sax <[email protected]>:
>>>> > > >
>>>> > > > Without any exception/error message it is hard to tell.
>>>> > > >
>>>> > > > What is your cluster setup?
>>>> > > > - Hardware, i.e., number of cores per node?
>>>> > > > - How many nodes/supervisors are available?
>>>> > > > - Configured number of workers for the topology?
>>>> > > > - What is the number of tasks for each spout/bolt?
>>>> > > > - What is the number of executors for each spout/bolt?
>>>> > > >
>>>> > > > -Matthias
>>>> > > >
>>>> > > > On 09/02/2015 08:02 PM, Nick R. Katsipoulakis wrote:
>>>> > > > > Hello all,
>>>> > > > >
>>>> > > > > I am working on a project in which I submit a topology to my
>>>> > > > > Storm cluster, but for some reason some of my tasks do not
>>>> > > > > start executing.
>>>> > > > >
>>>> > > > > I can see that this is happening because every bolt I have
>>>> > > > > needs to connect to an external server and register with a
>>>> > > > > service. However, some of the bolts do not seem to connect.
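One way to narrow down a silent non-connecting bolt is to make the registration step fail loudly. A hedged sketch (plain sockets and a hypothetical `register` helper, not Storm API) of registration with a timeout and explicit logging, so a worker that never registers still leaves a trace in its log:

```python
import logging
import socket
import threading

logging.basicConfig(level=logging.INFO)

def register(host, port, timeout=5.0):
    """Register with the external service; log loudly either way."""
    logging.info("registering with %s:%d", host, port)
    try:
        with socket.create_connection((host, port), timeout=timeout) as s:
            s.sendall(b"REGISTER\n")   # stand-in for the real handshake
            return True
    except OSError as e:
        logging.error("registration failed: %s", e)
        return False

# Demo against a throwaway local listener.
srv = socket.socket()
srv.bind(("127.0.0.1", 0))
srv.listen(1)
port = srv.getsockname()[1]
threading.Thread(target=lambda: srv.accept(), daemon=True).start()

ok = register("127.0.0.1", port)
print(ok)   # True
```

If the registration log line never appears for some workers, the problem is upstream of your code (the worker JVM never started), which is what the log-checking suggestions in this thread would confirm.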
>>>> > > > >
>>>> > > > > I have to say that the number of tasks I have is larger than
>>>> > > > > the number of workers in my cluster. Also, I checked my
>>>> > > > > worker log files, and I see that the workers that do not
>>>> > > > > register are also not writing some initialization messages I
>>>> > > > > have them print at the beginning.
>>>> > > > >
>>>> > > > > Any idea why this is happening? Can it be because my
>>>> > > > > resources are not enough to start all of the tasks?
>>>> > > > >
>>>> > > > > Thank you,
>>>> > > > > Nick
>>
>> --
>> Regards,
>> Abhishek Agarwal

--
Nikolaos Romanos Katsipoulakis,
University of Pittsburgh, PhD candidate
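As a quick check of the per-node thread budget Matthias works out in the thread above (numbers taken from the thread; the overhead figure is a rough assumption, not something Storm reports):

```python
import math

# Back-of-envelope thread budget per node, using the numbers from the thread:
# 40 executors spread over 4 nodes x 2 workers = 8 worker JVMs.
executors_total  = 40
nodes            = 4
workers_per_node = 2
cores_per_node   = 4

workers_total        = nodes * workers_per_node                    # 8
executors_per_worker = math.ceil(executors_total / workers_total)  # 5
busy_threads_per_node = workers_per_node * executors_per_worker    # 10

print(busy_threads_per_node)       # 10 executor threads on 4 cores
print(busy_threads_per_node + 2)   # ~12 with supervisor/worker overhead (assumed)
```

Roughly 3 busy threads per core is survivable at low load but a plausible source of the latency growth reported when the executor count rises toward 41.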
