Re: Spark Streaming Suggestion

ayan guha Tue, 15 Sep 2015 14:01:10 -0700

I think you need to make up your mind about Storm vs Spark. Using both in
this context does not make much sense to me.
On 15 Sep 2015 22:54, "David Morales" <dmora...@stratio.com> wrote:


> Hi there,
>
> This is exactly our goal in Stratio Sparkta, a real-time aggregation
> engine fully developed with Spark Streaming (and fully open source).
>
> Take a look at:
>
>
>    - the docs: http://docs.stratio.com/modules/sparkta/development/
>    - the repository: https://github.com/Stratio/sparkta
>    - and some slides explaining how Sparkta was born and what it does:
>    http://www.slideshare.net/Stratio/strata-sparkta
>
>
> Feel free to ask us anything about the project.
>
> 2015-09-15 8:10 GMT+02:00 srungarapu vamsi <srungarapu1...@gmail.com>:
>
>> The batch approach I implemented takes about 10 minutes to complete
>> all the pre-computation tasks for one hour's worth of data. When I went
>> through my code, I found that the most time-consuming tasks are
>> the ones that read data from Cassandra and the places where I call
>> sparkContext.union(Array[RDD]).
>> The ask now is to run the pre-computation tasks in near real time, so I am
>> exploring the streaming approach.
>>
>> My pre-computation tasks include not only finding the unique numbers
>> for a given device every minute, every hour, every day, but also
>> the following tasks:
>> 1. Find the number of unique numbers across a set of devices every
>> minute, every hour, every day
>> 2. Find the number of unique numbers which are commonly occurring across
>> a set of devices every minute, every hour, every day
>> 3. Find (total time a number occurred across a set of devices)/(total
>> unique numbers occurred across the set of devices)
>> The above-mentioned pre-computation tasks are just a few of the ones I will
>> need, and there are many more coming my way :)
>> All of these problems call for a data-parallel approach, and hence I
>> am interested in doing this on the Spark Streaming end.
>>
>>
>> On Tue, Sep 15, 2015 at 11:04 AM, Jörn Franke <jornfra...@gmail.com>
>> wrote:
>>
>>> Why did you not stay with the batch approach? To me the architecture
>>> looks very complex for the simple thing you want to achieve. Why don't you
>>> process the data in Storm already?
>>>
>>> On Tue, 15 Sep 2015 at 6:20, srungarapu vamsi <srungarapu1...@gmail.com>
>>> wrote:
>>>
>>>> I am pretty new to Spark. Please suggest a better model for the
>>>> following use case.
>>>>
>>>> I have a few (about 1500) devices in the field which keep emitting about
>>>> 100KB of data every minute. The data sent by the devices is just
>>>> a list of numbers.
>>>> As of now, we have Storm in the architecture, which receives this
>>>> data, sanitizes it and writes it to Cassandra.
>>>> Now I have a requirement to process this data. The processing includes
>>>> finding the unique numbers emitted by one or more devices for every minute,
>>>> every hour, every day, every month.
>>>> I had implemented this processing part as a batch job, and now
>>>> I am interested in making it a streaming application, i.e. calculating the
>>>> processed data as and when devices emit the data.
>>>>
>>>> I have the following two approaches:
>>>> 1. Storm writes the actual data to Cassandra and writes a message on
>>>> the Kafka bus saying that the data for device D and minute M has been
>>>> written to Cassandra.
>>>>
>>>> Spark Streaming then reads this message from Kafka, reads the
>>>> data of device D at minute M from Cassandra and starts processing it.
>>>>
>>>> 2. Storm writes the data to both Cassandra and Kafka; Spark reads the
>>>> actual data from Kafka, processes it and writes the results to Cassandra.
>>>> The second approach avoids the additional hit of reading from Cassandra
>>>> every minute a device has written data to it, at the cost of
>>>> putting the actual heavy messages instead of light events on Kafka.
>>>>
>>>> I am torn between the two approaches. Please suggest which one
>>>> is better and, if both are bad, how I can handle this use case.
>>>>
>>>>
>>>> --
>>>> /Vamsi
>>>>
>>>
>>
>>
>> --
>> /Vamsi
>>
>
>
>
> --
>
> David Morales de Frías  ::  +34 607 010 411 :: @dmoralesdf
> <https://twitter.com/dmoralesdf>
>
>
> <http://www.stratio.com/>
> Vía de las dos Castillas, 33, Ática 4, 3ª Planta
> 28224 Pozuelo de Alarcón, Madrid
> Tel: +34 91 828 6473 // www.stratio.com // *@stratiobd
> <https://twitter.com/StratioBD>*
>

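For reference, the three pre-computation tasks listed in the quoted message can be sketched for a single time window in plain Python. This is only an illustration of the set logic, not Spark code and not the poster's implementation. The `readings` layout (device id mapped to the numbers emitted in that window) and all function names are assumptions made for the example, and task 3 is read here as "total occurrences divided by the count of unique numbers", which is one possible interpretation of the original wording.

```python
from collections import Counter
from typing import Dict, Iterable, List, Set

# Hypothetical input shape: device id -> numbers emitted in one time window.
Readings = Dict[str, List[int]]


def unique_across(readings: Readings, devices: Iterable[str]) -> Set[int]:
    """Task 1: unique numbers seen on any device in the set."""
    result: Set[int] = set()
    for device in devices:
        result.update(readings.get(device, []))
    return result


def common_across(readings: Readings, devices: Iterable[str]) -> Set[int]:
    """Task 2: numbers that occur on every device in the set."""
    sets = [set(readings.get(device, [])) for device in devices]
    return set.intersection(*sets) if sets else set()


def occurrences_per_unique(readings: Readings, devices: Iterable[str]) -> float:
    """Task 3 (assumed reading): total occurrences / number of unique values."""
    counts: Counter = Counter()
    for device in devices:
        counts.update(readings.get(device, []))
    return sum(counts.values()) / len(counts) if counts else 0.0


# Made-up sample data for one window.
readings = {"d1": [1, 2, 2, 3], "d2": [2, 3, 4], "d3": [3, 3, 5]}

print(unique_across(readings, ["d1", "d2", "d3"]))
print(common_across(readings, ["d1", "d2"]))
print(occurrences_per_unique(readings, ["d1", "d2", "d3"]))
```

Hourly and daily figures follow the same pattern: union the per-minute sets (or sum the per-minute counters) for the windows that make up the longer period.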