It is real-time. I get streaming JSONs from Kafka.
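The fan-out step discussed below — Bolt-A turning one JSON's array into several tuples — only joins back up cleanly if each split carries its parent's id and the total part count. A plain-Java sketch of that tagging (illustrative names, not Storm's API):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch (not Storm API) of Bolt-A's fan-out: each element of
// the incoming JSON's array becomes its own payload, tagged with the parent
// message's id and the total part count, so a downstream join bolt can tell
// when the whole set has arrived.
class Splitter {
    static List<String[]> split(String parentId, List<String> elements) {
        List<String[]> parts = new ArrayList<>();
        int total = elements.size();
        for (int i = 0; i < elements.size(); i++) {
            // fields: parent id, part index, total parts, element payload
            parts.add(new String[] {
                parentId, String.valueOf(i), String.valueOf(total), elements.get(i)
            });
        }
        return parts;
    }
}
```

In a Storm bolt these four fields would simply be the emitted tuple's fields; the parent id is what the aggregation stage later groups on.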
On Wed, Sep 21, 2016 at 4:15 AM, Ambud Sharma <asharma52...@gmail.com> wrote:

> Is this real-time or batch?
>
> If batch, this is perfect for MapReduce or Spark.
>
> If real-time, then you should use Spark or Storm Trident.
>
> On Sep 20, 2016 9:39 AM, "Harsh Choudhary" <shry.ha...@gmail.com> wrote:
>
>> My use case is that I have a JSON which contains an array. I need to
>> split that array into multiple JSONs and do some computations on them.
>> After that, the results from each JSON have to be used together in a
>> further calculation to come up with the final result.
>>
>> Cheers!
>>
>> Harsh Choudhary / Software Engineer
>> Blog / express.harshti.me
>>
>> On Tue, Sep 20, 2016 at 9:09 PM, Ambud Sharma <asharma52...@gmail.com> wrote:
>>
>>> What's your use case?
>>>
>>> The complexities can be managed as long as your tuple branching is
>>> reasonable, i.e. one tuple creates several other tuples and you need
>>> to sync results between them.
>>>
>>> On Sep 20, 2016 8:19 AM, "Harsh Choudhary" <shry.ha...@gmail.com> wrote:
>>>
>>>> You're right. For that I have to manage the queue and all those
>>>> complexities of timeouts. If Storm is not the right place to do this,
>>>> then what else is?
>>>>
>>>> On Tue, Sep 20, 2016 at 8:25 PM, Ambud Sharma <asharma52...@gmail.com> wrote:
>>>>
>>>>> The correct way is to perform time-window aggregation using bucketing.
>>>>>
>>>>> Use the timestamp on your event, computed from the various stages, and
>>>>> send it to a single bolt where the aggregation happens. You only emit
>>>>> from this bolt once you receive results from both parts.
>>>>>
>>>>> It's like creating a barrier, or the join phase of a fork-join pool.
>>>>>
>>>>> That said, the more important question is: is Storm the right place to
>>>>> do this? When you perform time-window aggregation you are susceptible
>>>>> to tuple timeouts, and you also have to make sure your aggregation is
>>>>> idempotent.
>>>>>
>>>>> On Sep 20, 2016 7:49 AM, "Harsh Choudhary" <shry.ha...@gmail.com> wrote:
>>>>>
>>>>>> But how would that solve the syncing problem?
>>>>>>
>>>>>> On Tue, Sep 20, 2016 at 8:12 PM, Alberto São Marcos <alberto....@gmail.com> wrote:
>>>>>>
>>>>>>> I would dump the Bolt-A results in a shared data store/queue and
>>>>>>> have a separate workflow, with another spout and Bolt-B, draining
>>>>>>> from there.
>>>>>>>
>>>>>>> On Tue, Sep 20, 2016 at 9:20 AM, Harsh Choudhary <shry.ha...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi
>>>>>>>>
>>>>>>>> I am thinking of doing the following.
>>>>>>>>
>>>>>>>> A spout subscribes to Kafka and gets JSONs. The spout emits the
>>>>>>>> JSONs as individual tuples.
>>>>>>>>
>>>>>>>> Bolt-A subscribes to the spout. Bolt-A creates multiple JSONs from
>>>>>>>> each JSON and emits them as multiple streams.
>>>>>>>>
>>>>>>>> Bolt-B receives these streams and does the computation on them.
>>>>>>>>
>>>>>>>> I need to build a cumulative result, in a bolt, from all the
>>>>>>>> multiple JSONs that emerged from a single JSON. But a bolt's static
>>>>>>>> instance variable is only shared between tasks per worker. How do I
>>>>>>>> achieve this syncing?
>>>>>>>>
>>>>>>>>              --->
>>>>>>>> Spout ---> Bolt-A ---> Bolt-B ---> Final result
>>>>>>>>              --->
>>>>>>>>
>>>>>>>> The final result is per JSON read from Kafka.
>>>>>>>>
>>>>>>>> Or is there any other, better way to achieve this?
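The barrier/fork-join bookkeeping Ambud describes can be sketched in plain Java, outside Storm's API (names here are illustrative): partial results are bucketed by the original message's id, and the combined result is released only when every split has reported in.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch (plain Java, not Storm API) of the join-bolt
// bookkeeping: partials are bucketed by the id of the original Kafka
// message, and the combined result is released only once all parts arrive.
class JoinBuffer {
    private final Map<String, List<Integer>> partials = new HashMap<>();

    // Returns the combined (here: summed) result once all expectedParts
    // have arrived, or null while the barrier is still waiting.
    Integer accept(String parentId, int partResult, int expectedParts) {
        List<Integer> bucket = partials.computeIfAbsent(parentId, k -> new ArrayList<>());
        bucket.add(partResult);
        if (bucket.size() < expectedParts) {
            return null;               // still waiting for sibling parts
        }
        partials.remove(parentId);     // barrier complete; free the bucket
        int sum = 0;
        for (int r : bucket) sum += r;
        return sum;
    }
}
```

In a real topology this map would live inside a single aggregator bolt reached via `fieldsGrouping` on the parent id, and — as Ambud warns — it would need a timeout/eviction policy so that failed or replayed tuples don't leak buckets or double-count.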
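Alberto's alternative — decoupling the stages through a shared store/queue — can be sketched with an in-memory queue standing in for the real thing (in production this would be Kafka, Redis, or similar; the class and method names below are made up for illustration):

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical sketch of a shared-queue handoff: Bolt-A publishes each
// partial result; an independent workflow drains them later, so the two
// stages no longer need in-topology synchronization.
class SharedQueueHandoff {
    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();

    void publish(String partialJson) {   // Bolt-A side
        queue.offer(partialJson);
    }

    String drain() {                     // separate workflow / Bolt-B side
        return queue.poll();             // null when nothing is pending
    }
}
```

The trade-off is latency and operational overhead for the extra hop, in exchange for removing tuple-timeout pressure from the aggregation step.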