Hi there,

This is exactly our goal in Stratio Sparkta, a real-time aggregation engine built entirely on Spark Streaming (and fully open source).
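To make the aggregations asked about in the quoted message concrete, here is a minimal plain-Python sketch of the first two pre-computation tasks (unique numbers across a set of devices, and numbers common to all of them, per time window). It is not Sparkta or Spark code; the function names, the `(device_id, ts_seconds, number)` record shape, and the sample data are all assumptions made for illustration.

```python
from collections import defaultdict

def minute_bucket(ts_seconds):
    """Map a timestamp in seconds to the start of its one-minute window."""
    return ts_seconds - ts_seconds % 60

def unique_per_device(readings, bucket=minute_bucket):
    """Group readings into one set of unique numbers per (device, window).

    `readings` is an iterable of (device_id, ts_seconds, number) tuples;
    this record shape is an assumption, not something the thread specifies.
    """
    sets = defaultdict(set)
    for device, ts, number in readings:
        sets[(device, bucket(ts))].add(number)
    return sets

def across_devices(sets, devices, window):
    """Return (numbers seen by any device, numbers seen by every device)."""
    per_device = [sets.get((d, window), set()) for d in devices]
    union = set().union(*per_device)
    common = set.intersection(*per_device) if per_device else set()
    return union, common

# Hypothetical sample data: two devices, two one-minute windows.
readings = [
    ("d1", 0, 7), ("d1", 30, 7), ("d1", 45, 9),
    ("d2", 10, 9), ("d2", 70, 5),
]
sets = unique_per_device(readings)
union, common = across_devices(sets, ["d1", "d2"], 0)
print(len(union), len(common))
```

Hourly or daily figures would follow from passing a different `bucket` function. In a streaming job, the same per-window sets would presumably be kept as state and updated as each batch arrives, rather than being rebuilt from Cassandra reads.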
Take a look at:

- the docs: http://docs.stratio.com/modules/sparkta/development/
- the repository: https://github.com/Stratio/sparkta
- some slides explaining how Sparkta was born and what it does: http://www.slideshare.net/Stratio/strata-sparkta

Feel free to ask us anything about the project.

2015-09-15 8:10 GMT+02:00 srungarapu vamsi <srungarapu1...@gmail.com>:

> The batch approach I implemented takes about 10 minutes to complete
> all the pre-computation tasks for one hour's worth of data. When I went
> through my code, I found that the most time-consuming tasks are
> the ones that read data from Cassandra and the places where I perform
> sparkContext.union(Array[RDD]).
> Now the ask is to run the pre-computation tasks in near real time, so I am
> exploring the streaming approach.
>
> My pre-computation tasks include not only finding the unique numbers
> for a given device every minute, every hour, and every day, but also
> the following tasks:
> 1. Find the number of unique numbers across a set of devices every minute,
> every hour, every day
> 2. Find the number of unique numbers which commonly occur across a
> set of devices every minute, every hour, every day
> 3. Find (total time a number occurred across a set of devices)/(total
> unique numbers occurred across the set of devices)
> The tasks above are just a few of the ones I will
> need, and there are many more coming my way :)
> All of these problems seem to call for a data-parallel approach, and hence I
> am interested in doing this on the Spark Streaming end.
>
>
> On Tue, Sep 15, 2015 at 11:04 AM, Jörn Franke <jornfra...@gmail.com>
> wrote:
>
>> Why did you not stay with the batch approach? To me the architecture
>> looks very complex for the simple thing you want to achieve. Why don't you
>> process the data already in Storm?
>>
>> On Tue, Sep 15, 2015 at 6:20, srungarapu vamsi <srungarapu1...@gmail.com>
>> wrote:
>>
>>> I am pretty new to Spark. Please suggest a better model for the
>>> following use case.
>>>
>>> I have a few (about 1500) devices in the field which each keep emitting about 100KB
>>> of data every minute. The data sent by the devices is just a list
>>> of numbers.
>>> As of now, we have Storm in the architecture, which receives this
>>> data, sanitizes it and writes it to Cassandra.
>>> Now, I have a requirement to process this data. The processing includes
>>> finding the unique numbers emitted by one or more devices for every minute,
>>> every hour, every day, every month.
>>> I had implemented this processing part as a batch job, and now
>>> I am interested in making it a streaming application, i.e. calculating the
>>> processed data as and when devices emit the data.
>>>
>>> I have the following two approaches:
>>> 1. Storm writes the actual data to Cassandra and writes a message on the
>>> Kafka bus saying that the data for device D and minute M has been written
>>> to Cassandra.
>>>
>>> Spark Streaming then reads this message from Kafka, reads the data
>>> of device D at minute M from Cassandra, and starts processing the data.
>>>
>>> 2. Storm writes the data to both Cassandra and Kafka; Spark reads the
>>> actual data from Kafka, processes the data and writes to Cassandra.
>>> The second approach avoids the additional hit of reading from Cassandra
>>> every minute that a device has written data to Cassandra, at the cost of
>>> putting the actual heavy messages, instead of light events, on Kafka.
>>>
>>> I am a bit confused between the two approaches. Please suggest which one
>>> is better and, if both are bad, how I can handle this use case.
>>>
>>>
>>> --
>>> /Vamsi
>>>
>>
>
>
> --
> /Vamsi
>

--
David Morales de Frías :: +34 607 010 411 :: @dmoralesdf <https://twitter.com/dmoralesdf>

<http://www.stratio.com/>
Vía de las dos Castillas, 33, Ática 4, 3ª Planta
28224 Pozuelo de Alarcón, Madrid
Tel: +34 91 828 6473 // www.stratio.com // *@stratiobd <https://twitter.com/StratioBD>*