@David, I am going through the articles you have shared. Will message you
if I need any help. Thanks.

@Ayan, yes, it looks like I can get everything done with Spark Streaming.
In fact, we already have Storm in the architecture, sanitizing the data and
dumping it into Cassandra. Now I have some new requirements for which Spark
Streaming is the right tool. I just wanted to see whether there can be a
smooth marriage between the existing Storm pipeline and Spark Streaming; a
sketch of the wiring I have in mind follows.
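
A minimal sketch of that wiring, assuming Storm publishes its sanitized
records to a Kafka topic as "deviceId,number" strings (the topic name,
broker list, and record format here are assumptions, nothing decided yet):

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object StormToSparkBridge {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("storm-to-spark-bridge")
    val ssc  = new StreamingContext(conf, Seconds(60))  // 1-minute batches

    // Hypothetical topic that the existing Storm topology writes to.
    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
    val stream = KafkaUtils.createDirectStream[String, String,
      StringDecoder, StringDecoder](ssc, kafkaParams, Set("sanitized-data"))

    // Parse each "deviceId,number" record into a typed pair.
    val parsed = stream.map { case (_, value) =>
      val Array(deviceId, number) = value.split(",")
      (deviceId, number.toLong)
    }

    parsed.print()  // placeholder for the actual pre-computation logic

    ssc.start()
    ssc.awaitTermination()
  }
}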

Thanks for the inputs.

On Wed, Sep 16, 2015 at 2:30 AM, ayan guha <guha.a...@gmail.com> wrote:

> I think you need to make up your mind about storm vs spark. Using both in
> this context does not make much sense to me.
> On 15 Sep 2015 22:54, "David Morales" <dmora...@stratio.com> wrote:
>
>> Hi there,
>>
>> This is exactly our goal in Stratio Sparkta, a real-time aggregation
>> engine fully developed with spark streaming (and fully open source).
>>
>> Take a look at:
>>
>>
>>    - the docs: http://docs.stratio.com/modules/sparkta/development/
>>    - the repository: https://github.com/Stratio/sparkta
>>    - and some slides explaining how Sparkta was born and what it does:
>>    http://www.slideshare.net/Stratio/strata-sparkta
>>
>>
>> Feel free to ask us anything about the project.
>>
>> 2015-09-15 8:10 GMT+02:00 srungarapu vamsi <srungarapu1...@gmail.com>:
>>
>>> The batch approach I had implemented takes about 10 minutes to complete
>>> all the pre-computation tasks for one hour's worth of data. When I went
>>> through my code, I found that the most time-consuming tasks are the ones
>>> which read data from Cassandra and the places where I perform
>>> sparkContext.union(Array[RDD]).
>>> Now the ask is to get the pre-computation tasks near real time, so I am
>>> exploring the streaming approach.
>>>
>>> My pre-computation tasks include not only finding the unique numbers
>>> for a given device every minute, every hour and every day, but also the
>>> following (a sketch of task 1 follows this list):
>>> 1. Find the number of unique numbers across a set of devices every
>>> minute, every hour, every day.
>>> 2. Find the number of unique numbers which occur in common across a
>>> set of devices every minute, every hour, every day.
>>> 3. Find (total time a number occurred across a set of devices) /
>>> (total unique numbers that occurred across the set of devices).
>>> The tasks mentioned above are just a few of what I will be needing, and
>>> there are many more coming my way :)
>>> I see that all these problems need more of a data-parallel approach, and
>>> hence I am interested in doing this on the Spark Streaming end.
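>>>
>>> A minimal sketch of task 1 for the per-minute case, assuming a 1-minute
>>> batch interval and a DStream[(String, Long)] of (deviceId, number) pairs
>>> named `parsed` (the name and record shape are assumptions):
>>>
>>> // Unique numbers across all devices in each 1-minute batch; a specific
>>> // device set could be selected with a filter on deviceId first.
>>> val uniqueAcrossDevices = parsed
>>>   .map { case (_, number) => number }  // keep only the emitted numbers
>>>   .transform(_.distinct())             // de-duplicate within the batch
>>>   .count()                             // DStream[Long] of unique counts
>>>
>>> uniqueAcrossDevices.print()
>>>
>>> // Hour/day rollups could reuse the same logic over window(), or be
>>> // left to a periodic batch job.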
>>>
>>>
>>> On Tue, Sep 15, 2015 at 11:04 AM, Jörn Franke <jornfra...@gmail.com>
>>> wrote:
>>>
>>>> Why did you not stay with the batch approach? To me, the architecture
>>>> looks very complex for the simple thing you want to achieve. Why don't
>>>> you process the data in Storm already?
>>>>
>>>> On Tue, Sep 15, 2015 at 6:20 AM, srungarapu vamsi <
>>>> srungarapu1...@gmail.com> wrote:
>>>>
>>>>> I am pretty new to Spark. Please suggest a better model for the
>>>>> following use case.
>>>>>
>>>>> I have a few (about 1500) devices in the field which each emit about
>>>>> 100KB of data every minute. The data sent by the devices is just a
>>>>> list of numbers.
>>>>> As of now, we have Storm in the architecture; it receives this data,
>>>>> sanitizes it and writes it to Cassandra.
>>>>> Now I have a requirement to process this data. The processing includes
>>>>> finding the unique numbers emitted by one or more devices for every
>>>>> minute, every hour, every day and every month.
>>>>> I had implemented this processing as a batch job, and now I am
>>>>> interested in making it a streaming application, i.e. calculating the
>>>>> processed data as and when the devices emit it.
>>>>>
>>>>> I have the following two approaches (a sketch of the second follows):
>>>>> 1. Storm writes the actual data to Cassandra and puts a light message
>>>>> on the Kafka bus saying that the data for device D and minute M has
>>>>> been written to Cassandra.
>>>>> Spark Streaming then reads this message from Kafka, reads the data of
>>>>> device D at minute M from Cassandra, and starts processing it.
>>>>>
>>>>> 2. Storm writes the data to both Cassandra and Kafka; Spark reads the
>>>>> actual data from Kafka, processes it, and writes the results back to
>>>>> Cassandra.
>>>>> The second approach avoids the additional hit of reading from Cassandra
>>>>> every time a device writes data, at the cost of putting the heavy
>>>>> messages themselves on Kafka instead of light events.
>>>>>
>>>>> I am a bit confused between the two approaches. Please suggest which
>>>>> one is better, and if both are bad, how can I handle this use case?
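>>>>>
>>>>> A minimal sketch of approach 2, assuming Spark Streaming 1.x with the
>>>>> Kafka direct stream and the spark-cassandra-connector; the topic,
>>>>> keyspace, table and "deviceId,number" record format are hypothetical:
>>>>>
>>>>> import com.datastax.spark.connector.SomeColumns
>>>>> import com.datastax.spark.connector.streaming._  // saveToCassandra on DStreams
>>>>> import kafka.serializer.StringDecoder
>>>>> import org.apache.spark.streaming.kafka.KafkaUtils
>>>>>
>>>>> // `ssc` is a StreamingContext with spark.cassandra.connection.host set.
>>>>> val stream = KafkaUtils.createDirectStream[String, String,
>>>>>   StringDecoder, StringDecoder](ssc,
>>>>>   Map("metadata.broker.list" -> "broker1:9092"), Set("device-data"))
>>>>>
>>>>> stream
>>>>>   .map { case (_, v) => val Array(d, n) = v.split(","); (d, n.toLong) }
>>>>>   .transform(_.distinct())     // unique (device, number) pairs per batch
>>>>>   .map { case (d, _) => (d, 1L) }
>>>>>   .reduceByKey(_ + _)          // unique-number count per device
>>>>>   .saveToCassandra("metrics", "unique_per_minute",
>>>>>     SomeColumns("device_id", "uniques"))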
>>>>>
>>>>>
>>>>> --
>>>>> /Vamsi
>>>>>
>>>>
>>>
>>>
>>> --
>>> /Vamsi
>>>
>>
>>
>>
>> --
>>
>> David Morales de Frías  ::  +34 607 010 411 :: @dmoralesdf
>> <https://twitter.com/dmoralesdf>
>>
>>
>> <http://www.stratio.com/>
>> Vía de las dos Castillas, 33, Ática 4, 3ª Planta
>> 28224 Pozuelo de Alarcón, Madrid
>> Tel: +34 91 828 6473 // www.stratio.com // *@stratiobd
>> <https://twitter.com/StratioBD>*
>>
>


-- 
/Vamsi
