Hello all,

@John: The supervisor logs say nothing, except for the normal startup messages.
@Abhishek: No, the worker starts properly. The task itself does not start.

Thanks,
Nick

2015-09-03 9:00 GMT-04:00 Abhishek Agarwal <[email protected]>:

> When you say that tasks do not start, do you mean that the worker process itself is not starting?
>
> On Thu, Sep 3, 2015 at 5:20 PM, John Yost <[email protected]> wrote:
>
>> Hi Nick,
>>
>> What do the nimbus and supervisor logs say? One or both may contain clues as to why your workers are not starting up.
>>
>> --John
>>
>> On Thu, Sep 3, 2015 at 4:44 AM, Matthias J. Sax <[email protected]> wrote:
>>
>>> I am currently working with version 0.11.0-SNAPSHOT and cannot observe the behavior you describe. If I submit a sample topology with 1 spout (dop=1) and 1 bolt (dop=10) connected via shuffle grouping, and have 12 supervisors available (each with 12 worker slots), each of the 11 executors runs in its own worker on its own supervisor (host).
>>>
>>> I have no idea why you observe a different behavior...
>>>
>>> -Matthias
>>>
>>> On 09/03/2015 12:20 AM, Nick R. Katsipoulakis wrote:
>>>> When I say co-locate, what I have seen in my experiments is the following:
>>>>
>>>> If the number of executors can be served by the workers of one node, the scheduler spawns all the executors in the workers of that node. I have also seen the default scheduler try to fill up one node before provisioning an additional one for the topology.
>>>>
>>>> Going back to your sentence "and the executors should be evenly distributed over all available workers": I have to say that I do not see that often in my experiments. Actually, I often come across workers handling 2-3 executors/tasks while others do nothing. Am I missing something? Is it just a coincidence that happened in my experiments?
>>>>
>>>> Thank you,
>>>> Nick
>>>>
>>>> 2015-09-02 17:38 GMT-04:00 Matthias J. Sax <[email protected]>:
>>>>
>>>>> I agree. The load is not high.
>>>>>
>>>>> About the higher latencies: how many ackers did you configure? As a rule of thumb, there should be one acker per executor. If you have fewer ackers and an increasing number of executors, this might cause the increased latency, as the ackers could become a bottleneck.
>>>>>
>>>>> What do you mean by "trying to co-locate tasks and executors as much as possible"? Tasks are logical units of work that are processed by executors (which are threads). Furthermore (as far as I know), the default scheduler does an evenly distributed assignment of tasks and executors to the available workers. In your case, as you set the number of tasks equal to the number of executors, each executor processes a single task, and the executors should be evenly distributed over all available workers.
>>>>>
>>>>> However, you are right: intra-worker channels are "cheaper" than inter-worker channels. In order to exploit this, you should use local-or-shuffle grouping instead of shuffle. The disadvantage of local-or-shuffle might be missing load balancing. Shuffle always ensures good load balancing.
>>>>>
>>>>> -Matthias
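For reference, a minimal sketch of the local-or-shuffle wiring and acker sizing discussed above, assuming the backtype.storm Java API of the 0.9.x/0.10.x line. DemoSpout, DemoBolt, the component names, and all parallelism numbers are made-up placeholders, not the actual topology from this thread.

import java.util.Map;

import backtype.storm.Config;
import backtype.storm.StormSubmitter;
import backtype.storm.spout.SpoutOutputCollector;
import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.topology.base.BaseRichSpout;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;
import backtype.storm.utils.Utils;

public class LocalOrShuffleDemo {

    // Placeholder spout: emits a constant word once per second.
    public static class DemoSpout extends BaseRichSpout {
        private SpoutOutputCollector collector;

        public void open(Map conf, TopologyContext ctx, SpoutOutputCollector collector) {
            this.collector = collector;
        }

        public void nextTuple() {
            Utils.sleep(1000);
            collector.emit(new Values("hello"));
        }

        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("word"));
        }
    }

    // Placeholder bolt: acks every tuple and emits nothing.
    public static class DemoBolt extends BaseRichBolt {
        private OutputCollector collector;

        public void prepare(Map conf, TopologyContext ctx, OutputCollector collector) {
            this.collector = collector;
        }

        public void execute(Tuple tuple) {
            collector.ack(tuple);
        }

        public void declareOutputFields(OutputFieldsDeclarer declarer) {
        }
    }

    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("spout", new DemoSpout(), 1);

        // localOrShuffleGrouping keeps traffic inside a worker whenever a local
        // target executor exists and falls back to shuffle across workers otherwise;
        // plain shuffleGrouping always balances load across all executors.
        builder.setBolt("bolt", new DemoBolt(), 10).localOrShuffleGrouping("spout");

        Config conf = new Config();
        conf.setNumWorkers(4);
        // Scale the acker count with the executor count (the rule of thumb above)
        // so that acking does not become a bottleneck as parallelism grows.
        conf.setNumAckers(10);

        StormSubmitter.submitTopology("local-or-shuffle-demo", conf, builder.createTopology());
    }
}

Whether local-or-shuffle actually helps depends on how many bolt executors end up in the same worker as the spout; with plain shuffle, load balancing is guaranteed but every tuple may cross worker boundaries.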
>>>>>
>>>>> On 09/02/2015 10:31 PM, Nick R. Katsipoulakis wrote:
>>>>>> Well, my input load is 4 streams at 4000 tuples per second, and each tuple is about 128 bytes long. Therefore, I do not think my load is too much for my hardware.
>>>>>>
>>>>>> No, I am running only this topology in my cluster.
>>>>>>
>>>>>> For some reason, when I set the task-to-executor ratio to 1, my topology does not hang at all. The strange thing now is that I see higher latency with more executors, and I am trying to figure this out. Also, I see that the default scheduler is trying to co-locate tasks and executors as much as possible. Is this true? If yes, is it because the intra-worker latencies are much lower than the inter-worker latencies?
>>>>>>
>>>>>> Thanks,
>>>>>> Nick
>>>>>>
>>>>>> 2015-09-02 16:27 GMT-04:00 Matthias J. Sax <[email protected]>:
>>>>>>
>>>>>>> So (for each node) you have 4 cores available for 1 supervisor JVM and 2 worker JVMs that execute up to 5 threads each (if 40 executors are distributed evenly over all workers), i.e., about 12 threads for 4 cores. Of course, Storm starts a few more threads within each worker/supervisor.
>>>>>>>
>>>>>>> If your load is not huge, this might be sufficient. However, with a high data rate, it might be problematic.
>>>>>>>
>>>>>>> One more question: do you run a single topology in your cluster, or multiple? Storm isolates topologies for fault-tolerance reasons. Thus, a single worker cannot process executors from different topologies. If you run out of workers, a topology might not start up completely.
>>>>>>>
>>>>>>> -Matthias
>>>>>>>
>>>>>>> On 09/02/2015 09:54 PM, Nick R. Katsipoulakis wrote:
>>>>>>>> Hello Matthias, and thank you for your reply. See my answers below:
>>>>>>>>
>>>>>>>> - I have 4 supervisor nodes in my AWS cluster of m4.xlarge instances (4 cores per node). On top of that, I have 3 more nodes for ZooKeeper and Nimbus.
>>>>>>>> - 2 workers per supervisor node.
>>>>>>>> - The number of tasks for each bolt ranges from 1 to 4, and I use a 1:1 task-to-executor assignment.
>>>>>>>> - The number of executors in total for the topology ranges from 14 to 41.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Nick
>>>>>>>>
>>>>>>>> 2015-09-02 15:42 GMT-04:00 Matthias J. Sax <[email protected]>:
>>>>>>>>
>>>>>>>>> Without any exception/error message it is hard to tell.
>>>>>>>>>
>>>>>>>>> What is your cluster setup?
>>>>>>>>> - Hardware, i.e., number of cores per node?
>>>>>>>>> - How many nodes/supervisors are available?
>>>>>>>>> - Configured number of workers for the topology?
>>>>>>>>> - What is the number of tasks for each spout/bolt?
>>>>>>>>> - What is the number of executors for each spout/bolt?
>>>>>>>>>
>>>>>>>>> -Matthias
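The knobs Matthias asks about here (workers, executors, tasks) are all set on the topology itself; a hedged sketch follows, reusing the placeholder DemoSpout and DemoBolt classes from the LocalOrShuffleDemo sketch earlier in this thread. The numbers are illustrative only, not Nick's real configuration.

import backtype.storm.Config;
import backtype.storm.topology.TopologyBuilder;

public class ParallelismKnobsSketch {

    public static TopologyBuilder build() {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("spout", new LocalOrShuffleDemo.DemoSpout(), 1);  // parallelism hint = 1 executor
        builder.setBolt("bolt", new LocalOrShuffleDemo.DemoBolt(), 4)      // parallelism hint = 4 executors
               .setNumTasks(4)        // 4 tasks over 4 executors: the 1:1 task-to-executor setup from this thread
               .shuffleGrouping("spout");
        return builder;
    }

    public static Config config() {
        Config conf = new Config();
        conf.setNumWorkers(2);        // number of worker JVMs requested for the topology
        return conf;
    }
}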
>>>>>>>>>
>>>>>>>>> On 09/02/2015 08:02 PM, Nick R. Katsipoulakis wrote:
>>>>>>>>>> Hello all,
>>>>>>>>>>
>>>>>>>>>> I am working on a project in which I submit a topology to my Storm cluster, but for some reason, some of my tasks do not start executing.
>>>>>>>>>>
>>>>>>>>>> I can see that this is happening because every bolt I have needs to connect to an external server and register with a service. However, some of the bolts do not seem to connect.
>>>>>>>>>>
>>>>>>>>>> I have to say that the number of tasks I have is larger than the number of workers in my cluster. Also, when I check my worker log files, I see that the workers that do not register are also not writing the initialization messages I have them print at the beginning.
>>>>>>>>>>
>>>>>>>>>> Any idea why this is happening? Could it be that my resources are not enough to start all of the tasks?
>>>>>>>>>>
>>>>>>>>>> Thank you,
>>>>>>>>>> Nick
>>>>>>>>
>>>>>>>> --
>>>>>>>> Nikolaos Romanos Katsipoulakis,
>>>>>>>> University of Pittsburgh, PhD candidate
>>>>>>
>>>>>> --
>>>>>> Nikolaos Romanos Katsipoulakis,
>>>>>> University of Pittsburgh, PhD candidate
>>>>
>>>> --
>>>> Nikolaos Romanos Katsipoulakis,
>>>> University of Pittsburgh, PhD candidate
>
> --
> Regards,
> Abhishek Agarwal

--
Nikolaos Romanos Katsipoulakis,
University of Pittsburgh, PhD candidate
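For completeness, a sketch of the register-in-prepare() pattern described in the original message at the bottom of this thread. It mainly shows that the startup log line only appears once the executor has actually been scheduled, which helps tell "executor never started" apart from "registration failed". The logging and the external-service call are hypothetical stand-ins, not code from this thread; the backtype.storm 0.9.x/0.10.x API with SLF4J logging is assumed.

import java.util.Map;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.tuple.Tuple;

public class RegisteringBolt extends BaseRichBolt {
    private static final Logger LOG = LoggerFactory.getLogger(RegisteringBolt.class);

    private OutputCollector collector;

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
        // This line only shows up in the worker log once the executor has been
        // scheduled and started; if it is missing, the task never came up at all.
        LOG.info("Initializing task {} of component {}",
                 context.getThisTaskId(), context.getThisComponentId());
        registerWithExternalService(context.getThisTaskId());
    }

    private void registerWithExternalService(int taskId) {
        // Hypothetical placeholder for the registration call to the external
        // server; the real topology would open a connection here.
        LOG.info("Registered task {} with external service", taskId);
    }

    @Override
    public void execute(Tuple input) {
        collector.ack(input);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // No output streams in this sketch.
    }
}

If neither log line appears for some tasks, the issue is scheduling (not enough free worker slots, or executors never assigned) rather than the external service itself.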
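There is also the sizing angle raised in the thread: a small sketch for the cluster layout mentioned (4 supervisor nodes with 2 worker slots each, so 8 worker JVMs in total). The slot ports and numbers here are assumptions for illustration; supervisor.slots.ports lives in each supervisor's storm.yaml, while the per-topology worker count is requested in the topology's Config.

import backtype.storm.Config;

public class ClusterSizingSketch {

    // Assumed layout: each of the 4 supervisor nodes advertises 2 worker slots
    // in its storm.yaml, e.g.
    //
    //   supervisor.slots.ports:
    //       - 6700
    //       - 6701
    //
    // which yields 8 worker slots cluster-wide. Workers are never shared across
    // topologies, so a topology is only fully scheduled if enough free slots remain.
    public static Config sizedConfig() {
        Config conf = new Config();
        // Request no more workers than there are free slots; asking for more than
        // the cluster can provide can leave part of the topology unscheduled.
        conf.setNumWorkers(8);
        return conf;
    }
}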
