It is real-time. I get streaming JSONs from Kafka.
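The fan-out step discussed below — Bolt-A turning one JSON's array into several tuples — only joins back up cleanly if each split carries its parent's id and the total part count. A plain-Java sketch of that tagging (illustrative names, not Storm's API):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch (not Storm API) of Bolt-A's fan-out: each element of
// the incoming JSON's array becomes its own payload, tagged with the parent
// message's id and the total part count, so a downstream join bolt can tell
// when the whole set has arrived.
class Splitter {
    static List<String[]> split(String parentId, List<String> elements) {
        List<String[]> parts = new ArrayList<>();
        int total = elements.size();
        for (int i = 0; i < elements.size(); i++) {
            // fields: parent id, part index, total parts, element payload
            parts.add(new String[] {
                parentId, String.valueOf(i), String.valueOf(total), elements.get(i)
            });
        }
        return parts;
    }
}
```

In a Storm bolt these four fields would simply be the emitted tuple's fields; the parent id is what the aggregation stage later groups on.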
On Wed, Sep 21, 2016 at 4:15 AM, Ambud Sharma <asharma52...@gmail.com> wrote:

> Is this real-time or batch?
>
> If batch, this is perfect for MapReduce or Spark.
>
> If real-time, then you should use Spark or Storm Trident.
>
> On Sep 20, 2016 9:39 AM, "Harsh Choudhary" <shry.ha...@gmail.com> wrote:
>
>> My use case is that I have a JSON which contains an array. I need to
>> split that array into multiple JSONs and do some computations on them.
>> After that, the results from each JSON have to be used together in a
>> further calculation to come up with the final result.
>>
>> Cheers!
>>
>> Harsh Choudhary / Software Engineer
>> Blog / express.harshti.me
>>
>> On Tue, Sep 20, 2016 at 9:09 PM, Ambud Sharma <asharma52...@gmail.com> wrote:
>>
>>> What's your use case?
>>>
>>> The complexities can be managed as long as your tuple branching is
>>> reasonable, i.e. one tuple creates several other tuples and you need
>>> to sync results between them.
>>>
>>> On Sep 20, 2016 8:19 AM, "Harsh Choudhary" <shry.ha...@gmail.com> wrote:
>>>
>>>> You're right. For that I have to manage the queue and all those
>>>> complexities of timeouts. If Storm is not the right place to do this,
>>>> then what else is?
>>>>
>>>> On Tue, Sep 20, 2016 at 8:25 PM, Ambud Sharma <asharma52...@gmail.com> wrote:
>>>>
>>>>> The correct way is to perform time-window aggregation using bucketing.
>>>>>
>>>>> Use the timestamp on your event, computed from the various stages, and
>>>>> send it to a single bolt where the aggregation happens. You only emit
>>>>> from this bolt once you receive results from both parts.
>>>>>
>>>>> It's like creating a barrier, or the join phase of a fork-join pool.
>>>>>
>>>>> That said, the more important question is: is Storm the right place to
>>>>> do this? When you perform time-window aggregation you are susceptible
>>>>> to tuple timeouts, and you also have to make sure your aggregation is
>>>>> idempotent.
>>>>>
>>>>> On Sep 20, 2016 7:49 AM, "Harsh Choudhary" <shry.ha...@gmail.com> wrote:
>>>>>
>>>>>> But how would that solve the syncing problem?
>>>>>>
>>>>>> On Tue, Sep 20, 2016 at 8:12 PM, Alberto São Marcos <alberto....@gmail.com> wrote:
>>>>>>
>>>>>>> I would dump the Bolt-A results in a shared data store/queue and
>>>>>>> have a separate workflow, with another spout and Bolt-B, draining
>>>>>>> from there.
>>>>>>>
>>>>>>> On Tue, Sep 20, 2016 at 9:20 AM, Harsh Choudhary <shry.ha...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi
>>>>>>>>
>>>>>>>> I am thinking of doing the following.
>>>>>>>>
>>>>>>>> A spout subscribes to Kafka and gets JSONs. The spout emits the
>>>>>>>> JSONs as individual tuples.
>>>>>>>>
>>>>>>>> Bolt-A subscribes to the spout. Bolt-A creates multiple JSONs from
>>>>>>>> each JSON and emits them as multiple streams.
>>>>>>>>
>>>>>>>> Bolt-B receives these streams and does the computation on them.
>>>>>>>>
>>>>>>>> I need to build a cumulative result, in a bolt, from all the
>>>>>>>> multiple JSONs that emerged from a single JSON. But a bolt's static
>>>>>>>> instance variable is only shared between tasks per worker. How do I
>>>>>>>> achieve this syncing?
>>>>>>>>
>>>>>>>>              --->
>>>>>>>> Spout ---> Bolt-A ---> Bolt-B ---> Final result
>>>>>>>>              --->
>>>>>>>>
>>>>>>>> The final result is per JSON read from Kafka.
>>>>>>>>
>>>>>>>> Or is there any other, better way to achieve this?
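The barrier/fork-join bookkeeping Ambud describes can be sketched in plain Java, outside Storm's API (names here are illustrative): partial results are bucketed by the original message's id, and the combined result is released only when every split has reported in.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch (plain Java, not Storm API) of the join-bolt
// bookkeeping: partials are bucketed by the id of the original Kafka
// message, and the combined result is released only once all parts arrive.
class JoinBuffer {
    private final Map<String, List<Integer>> partials = new HashMap<>();

    // Returns the combined (here: summed) result once all expectedParts
    // have arrived, or null while the barrier is still waiting.
    Integer accept(String parentId, int partResult, int expectedParts) {
        List<Integer> bucket = partials.computeIfAbsent(parentId, k -> new ArrayList<>());
        bucket.add(partResult);
        if (bucket.size() < expectedParts) {
            return null;               // still waiting for sibling parts
        }
        partials.remove(parentId);     // barrier complete; free the bucket
        int sum = 0;
        for (int r : bucket) sum += r;
        return sum;
    }
}
```

In a real topology this map would live inside a single aggregator bolt reached via `fieldsGrouping` on the parent id, and — as Ambud warns — it would need a timeout/eviction policy so that failed or replayed tuples don't leak buckets or double-count.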
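Alberto's alternative — decoupling the stages through a shared store/queue — can be sketched with an in-memory queue standing in for the real thing (in production this would be Kafka, Redis, or similar; the class and method names below are made up for illustration):

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical sketch of a shared-queue handoff: Bolt-A publishes each
// partial result; an independent workflow drains them later, so the two
// stages no longer need in-topology synchronization.
class SharedQueueHandoff {
    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();

    void publish(String partialJson) {   // Bolt-A side
        queue.offer(partialJson);
    }

    String drain() {                     // separate workflow / Bolt-B side
        return queue.poll();             // null when nothing is pending
    }
}
```

The trade-off is latency and operational overhead for the extra hop, in exchange for removing tuple-timeout pressure from the aggregation step.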