Re: Spark Streaming Suggestion

David Morales Tue, 15 Sep 2015 05:55:37 -0700

Hi there,

This is exactly our goal in Stratio Sparkta, a real-time aggregation engine
built entirely on Spark Streaming (and fully open source).


Take a look at:


   - the docs: http://docs.stratio.com/modules/sparkta/development/
   - the repository: https://github.com/Stratio/sparkta
   - and some slides explaining how Sparkta was born and what it does:
   http://www.slideshare.net/Stratio/strata-sparkta


Feel free to ask us anything about the project.








2015-09-15 8:10 GMT+02:00 srungarapu vamsi <srungarapu1...@gmail.com>:

> The batch approach I implemented takes about 10 minutes to complete
> all the pre-computation tasks for one hour's worth of data. When I went
> through my code, I found that most of the time is spent in the tasks
> that read data from Cassandra and in the places where I call
> sparkContext.union(Array[RDD]).
> The requirement now is to run the pre-computation tasks in near real time,
> so I am exploring the streaming approach.
>
> My pre-computation tasks include not only finding the unique numbers
> for a given device every minute, every hour, and every day, but also
> the following:
> 1. Find the number of unique numbers across a set of devices every minute,
> every hour, every day
> 2. Find the number of unique numbers that occur in common across a
> set of devices every minute, every hour, every day
> 3. Find (total time a number occurred across a set of devices)/(total
> unique numbers occurred across the set of devices)
> These pre-computation tasks are just a few of the ones I will need,
> and many more are on the way :)
> All of these problems seem to call for a data-parallel approach, which
> is why I am interested in doing this with Spark Streaming.
>
>
> On Tue, Sep 15, 2015 at 11:04 AM, Jörn Franke <jornfra...@gmail.com>
> wrote:
>
>> Why did you not stay with the batch approach? To me the architecture
>> looks very complex for the simple thing you want to achieve. Why don't you
>> process the data in Storm already?
>>
>> On Tue, Sep 15, 2015 at 6:20, srungarapu vamsi <srungarapu1...@gmail.com>
>> wrote:
>>
>>> I am pretty new to Spark. Please suggest a better model for the
>>> following use case.
>>>
>>> I have about 1500 devices in the field which keep emitting about 100KB
>>> of data every minute. The data sent by the devices is just a list
>>> of numbers.
>>> As of now, we have Storm in the architecture, which receives this
>>> data, sanitizes it and writes it to Cassandra.
>>> Now I have a requirement to process this data. The processing includes
>>> finding the unique numbers emitted by one or more devices for every minute,
>>> every hour, every day, every month.
>>> I implemented this processing as a batch job, and now
>>> I am interested in making it a streaming application, i.e. computing the
>>> results as and when the devices emit the data.
>>>
>>> I have the following two approaches:
>>> 1. Storm writes the actual data to Cassandra and writes a message on
>>> the Kafka bus saying that the data for device D and minute M has been
>>> written to Cassandra.
>>>
>>> Spark Streaming then reads this message from Kafka, reads the data
>>> of device D at minute M from Cassandra and starts processing it.
>>>
>>> 2. Storm writes the data to both Cassandra and Kafka; Spark reads the
>>> actual data from Kafka, processes it and writes the results to Cassandra.
>>> The second approach avoids the additional hit of reading from Cassandra
>>> every time a device has written a minute of data, at the cost of
>>> putting the actual heavy messages on Kafka instead of light events.
>>>
>>> I am torn between the two approaches. Please suggest which one
>>> is better and, if both are bad, how I can handle this use case.
>>>
>>>
>>> --
>>> /Vamsi
>>>
>>
>
>
> --
> /Vamsi
>
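
To make the aggregations in the quoted thread concrete, here is a minimal
sketch in plain Python (not Spark or Sparkta code) of distinct numbers per
device per time window, plus the union and intersection across a set of
devices (tasks 1 and 2 in the quoted list). The record shape
(device_id, timestamp, number), the function names and the sample data are
all assumptions made for illustration; none of them come from the thread.

```python
from collections import defaultdict
from datetime import datetime


def bucket(ts, granularity):
    """Truncate a datetime to the start of its minute, hour or day."""
    if granularity == "minute":
        return ts.replace(second=0, microsecond=0)
    if granularity == "hour":
        return ts.replace(minute=0, second=0, microsecond=0)
    if granularity == "day":
        return ts.replace(hour=0, minute=0, second=0, microsecond=0)
    raise ValueError("unknown granularity: %r" % granularity)


def unique_per_device(records, granularity):
    """Map (device_id, window_start) -> set of distinct numbers seen."""
    seen = defaultdict(set)
    for device_id, ts, number in records:
        seen[(device_id, bucket(ts, granularity))].add(number)
    return dict(seen)


def across_devices(per_device, devices, window):
    """Return (union, intersection) of the per-device sets for one window."""
    sets = [per_device.get((d, window), set()) for d in devices]
    union = set().union(*sets)
    # set.intersection needs at least one operand, so guard the empty case.
    common = set.intersection(*sets) if sets else set()
    return union, common


# Hypothetical sample data: two devices emitting a few numbers.
records = [
    ("d1", datetime(2015, 9, 15, 8, 10, 5), 7),
    ("d1", datetime(2015, 9, 15, 8, 10, 40), 7),
    ("d1", datetime(2015, 9, 15, 8, 10, 50), 9),
    ("d2", datetime(2015, 9, 15, 8, 10, 20), 9),
    ("d2", datetime(2015, 9, 15, 8, 11, 0), 3),
]

per_minute = unique_per_device(records, "minute")
minute_union, minute_common = across_devices(
    per_minute, ["d1", "d2"], datetime(2015, 9, 15, 8, 10))

per_hour = unique_per_device(records, "hour")
hour_union, hour_common = across_devices(
    per_hour, ["d1", "d2"], datetime(2015, 9, 15, 8, 0))
```

Because the per-window results are sets, the hourly and daily sets for a
device are simply the union of its minute-level sets. In a streaming job,
that would let the coarser windows be maintained incrementally instead of
being recomputed from the raw data.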



-- 

David Morales de Frías  ::  +34 607 010 411 :: @dmoralesdf
<https://twitter.com/dmoralesdf>


<http://www.stratio.com/>
Vía de las dos Castillas, 33, Ática 4, 3ª Planta
28224 Pozuelo de Alarcón, Madrid
Tel: +34 91 828 6473 // www.stratio.com // *@stratiobd
<https://twitter.com/StratioBD>*
