I think you need to make up your mind about Storm vs. Spark. Using both in this context does not make much sense to me.

On 15 Sep 2015 22:54, "David Morales" <dmora...@stratio.com> wrote:
> Hi there,
>
> This is exactly our goal in Stratio Sparkta, a real-time aggregation
> engine fully developed with Spark Streaming (and fully open source).
>
> Take a look at:
>
> - the docs: http://docs.stratio.com/modules/sparkta/development/
> - the repository: https://github.com/Stratio/sparkta
> - some slides explaining how Sparkta was born and what it does:
>   http://www.slideshare.net/Stratio/strata-sparkta
>
> Feel free to ask us anything about the project.
>
> 2015-09-15 8:10 GMT+02:00 srungarapu vamsi <srungarapu1...@gmail.com>:
>
>> The batch approach I implemented takes about 10 minutes to complete
>> all the pre-computation tasks for one hour's worth of data. When I went
>> through my code, I found that the most time-consuming tasks are
>> the ones that read data from Cassandra and the places where I call
>> sparkContext.union(Array[RDD]).
>> The ask now is to run the pre-computation tasks in near real time, so I
>> am exploring the streaming approach.
>>
>> My pre-computation tasks include not only finding the unique numbers
>> for a given device every minute, every hour, and every day, but also
>> the following:
>> 1. Find the number of unique numbers across a set of devices every
>> minute, every hour, every day.
>> 2. Find the number of unique numbers that occur in common across
>> a set of devices every minute, every hour, every day.
>> 3. Find (total time a number occurred across a set of devices)/(total
>> unique numbers that occurred across the set of devices).
>> These pre-computation tasks are just a few of the ones I will
>> need, and there are many more coming my way :)
>> As I see it, all these problems call for a more data-parallel approach,
>> and hence I am interested in doing this on the Spark Streaming side.
>>
>> On Tue, Sep 15, 2015 at 11:04 AM, Jörn Franke <jornfra...@gmail.com>
>> wrote:
>>
>>> Why did you not stay with the batch approach?
>>> For me the architecture looks very complex for the simple thing you
>>> want to achieve. Why don't you process the data in Storm already?
>>>
>>> On Tue, 15 Sep 2015 at 6:20, srungarapu vamsi <srungarapu1...@gmail.com>
>>> wrote:
>>>
>>>> I am pretty new to Spark. Please suggest a better model for the
>>>> following use case.
>>>>
>>>> I have a number of devices (about 1500) in the field which keep
>>>> emitting about 100 KB of data every minute. The data sent by the
>>>> devices is just a list of numbers.
>>>> As of now, we have Storm in the architecture, which receives this
>>>> data, sanitizes it, and writes it to Cassandra.
>>>> Now I have a requirement to process this data. The processing includes
>>>> finding the unique numbers emitted by one or more devices for every
>>>> minute, every hour, every day, and every month.
>>>> I had implemented this processing as a batch job, and now
>>>> I am interested in making it a streaming application, i.e. computing
>>>> the processed data as and when the devices emit the data.
>>>>
>>>> I have the following two approaches:
>>>> 1. Storm writes the actual data to Cassandra and writes a message on
>>>> the Kafka bus saying that the data for device D and minute M has been
>>>> written to Cassandra.
>>>>
>>>> Spark Streaming then reads this message from Kafka, reads the
>>>> data of device D at minute M from Cassandra, and starts processing it.
>>>>
>>>> 2. Storm writes the data to both Cassandra and Kafka; Spark reads the
>>>> actual data from Kafka, processes it, and writes to Cassandra.
>>>> The second approach avoids the additional hit of reading from Cassandra
>>>> for every minute in which a device has written data to Cassandra, at
>>>> the cost of putting the actual heavy messages instead of light events
>>>> on Kafka.
>>>>
>>>> I am a bit torn between the two approaches. Please suggest which one
>>>> is better and, if both are bad, how I can handle this use case.
>>>>
>>>> --
>>>> /Vamsi
>>
>> --
>> /Vamsi
>
> --
>
> David Morales de Frías :: +34 607 010 411 :: @dmoralesdf
> <https://twitter.com/dmoralesdf>
>
> <http://www.stratio.com/>
> Vía de las dos Castillas, 33, Ática 4, 3ª Planta
> 28224 Pozuelo de Alarcón, Madrid
> Tel: +34 91 828 6473 // www.stratio.com // *@stratiobd
> <https://twitter.com/StratioBD>*
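
For reference, the aggregations discussed in this thread (unique numbers per device per minute/hour/day, unique numbers across a set of devices, and numbers common to a set of devices) come down to set unions and intersections keyed by device and time bucket. Below is a minimal sketch of that logic in plain Python. It is not Spark, Storm, or Sparkta code, and the record layout, function names, and sample values are all assumptions made up for illustration. It also assumes the numbers themselves are available to the job, as in approach 2.

```python
from collections import defaultdict
from datetime import datetime


def bucket(ts, granularity):
    """Truncate a datetime to the start of its minute, hour, or day."""
    if granularity == "minute":
        return ts.replace(second=0, microsecond=0)
    if granularity == "hour":
        return ts.replace(minute=0, second=0, microsecond=0)
    if granularity == "day":
        return ts.replace(hour=0, minute=0, second=0, microsecond=0)
    raise ValueError("unknown granularity: %s" % granularity)


def unique_per_device(records, granularity):
    """Map (device_id, bucket) -> set of distinct numbers seen in that bucket."""
    out = defaultdict(set)
    for device_id, ts, numbers in records:
        out[(device_id, bucket(ts, granularity))].update(numbers)
    return out


def unique_across(per_device, devices, when):
    """Distinct numbers seen by any device of the set during one bucket."""
    result = set()
    for d in devices:
        result |= per_device.get((d, when), set())
    return result


def common_across(per_device, devices, when):
    """Numbers seen by every device of the set during one bucket."""
    sets = [per_device.get((d, when), set()) for d in devices]
    return set.intersection(*sets) if sets else set()


# Hypothetical sample: (device_id, timestamp, numbers emitted).
records = [
    ("d1", datetime(2015, 9, 15, 6, 20, 5), [1, 2, 2, 3]),
    ("d1", datetime(2015, 9, 15, 6, 20, 40), [3, 4]),
    ("d2", datetime(2015, 9, 15, 6, 20, 10), [3, 4, 5]),
    ("d2", datetime(2015, 9, 15, 6, 45, 0), [9]),
]

per_minute = unique_per_device(records, "minute")
per_hour = unique_per_device(records, "hour")
minute = datetime(2015, 9, 15, 6, 20)

print(len(unique_across(per_minute, ["d1", "d2"], minute)))
print(sorted(common_across(per_minute, ["d1", "d2"], minute)))
```

Because set union is associative and commutative, the per-bucket sets can be merged in any order, which is the property a keyed reduce in a streaming job would rely on. Hourly and daily results can also be built by merging the minute-level sets rather than re-reading the raw data.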