The batch approach I had implemented takes about 10 minutes to complete all
the pre-computation tasks for one hour's worth of data. When I went through
my code, I found that the most time-consuming tasks are the ones that read
data from Cassandra and the places where I perform
sparkContext.union(Array[RDD]).
Now the requirement is to get the pre-computation tasks done in near real
time, so I am exploring the streaming approach.
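The pattern that dominates the runtime looks roughly like this (a simplified
sketch only; the keyspace, table and column names are made up, and it assumes
the DataStax spark-cassandra-connector):

import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object BatchPrecompute {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("batch-precompute"))

    val deviceIds: Seq[String] = (1 to 1500).map(i => s"device-$i")

    // One small Cassandra read per device -> ~1500 tiny RDDs, each with its own scan
    val perDevice: Seq[RDD[Long]] = deviceIds.map { id =>
      sc.cassandraTable("sensor_ks", "raw_numbers")
        .where("device_id = ?", id)
        .map(row => row.getLong("number"))
    }

    // The union itself is cheap to declare, but it pulls in one scan per device
    val all: RDD[Long] = sc.union(perDevice)
    println(s"Unique numbers across all devices: ${all.distinct().count()}")

    sc.stop()
  }
}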

My pre-computation tasks include not just finding the unique numbers for a
given device every minute, every hour and every day, but also the following:
1. Find the number of unique numbers across a set of devices every minute,
every hour and every day.
2. Find the number of unique numbers that occur in common across a set of
devices every minute, every hour and every day.
3. Find (total time a number occurred across a set of devices) / (total
unique numbers that occurred across the set of devices).
The tasks above are just a few of what I will need, and there are many more
coming my way :)
All of these problems call for a data-parallel approach, which is why I am
interested in doing this with Spark Streaming. A rough sketch of task 1 on a
DStream is below.
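This is only a sketch, assuming the sanitized records arrive on a Kafka topic
as "deviceId,number" strings (the topic name, broker address and device set
are hypothetical); hour and day rollups would come from windowed operations
or from re-aggregating the per-minute results:

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object StreamingUniqueNumbers {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("streaming-precompute")
    val ssc  = new StreamingContext(conf, Minutes(1))   // one batch per minute

    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder,
      StringDecoder](ssc, kafkaParams, Set("sanitized-numbers"))

    val watched = Set("device-1", "device-2", "device-3")   // the "set of devices"

    // Each Kafka value is "deviceId,number"
    val numbers = stream
      .map(_._2.split(","))
      .filter(parts => watched.contains(parts(0)))
      .map(parts => parts(1).toLong)

    // Task 1 for the current minute: distinct numbers across the watched devices
    numbers.transform(_.distinct()).count().print()

    // Same metric over the last hour, recomputed every minute
    numbers.window(Minutes(60), Minutes(1)).transform(_.distinct()).count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}

In a real job the print() calls would be replaced by writes of the per-minute,
per-hour and per-day results back to Cassandra.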


On Tue, Sep 15, 2015 at 11:04 AM, Jörn Franke <jornfra...@gmail.com> wrote:

> Why did you not stay with the batch approach? To me the architecture looks
> very complex for the simple thing you want to achieve. Why don't you
> process the data in Storm already?
>
> On Tue, Sep 15, 2015 at 6:20 AM, srungarapu vamsi <srungarapu1...@gmail.com>
> wrote:
>
>> I am pretty new to Spark. Please suggest a better model for the following
>> use case.
>>
>> I have a few (about 1500) devices in the field which keep emitting about
>> 100KB of data every minute. The data sent by the devices is just a list of
>> numbers.
>> As of now, we have Storm in the architecture, which receives this data,
>> sanitizes it and writes it to Cassandra.
>> Now I have a requirement to process this data. The processing includes
>> finding the unique numbers emitted by one or more devices for every minute,
>> every hour, every day and every month.
>> I had implemented this processing as a batch job, and now I am interested
>> in making it a streaming application, i.e. calculating the processed data
>> as and when the devices emit it.
>>
>> I have the following two approaches:
>> 1. Storm writes the actual data to Cassandra and writes a message on the
>> Kafka bus saying that the data corresponding to device D and minute M has
>> been written to Cassandra.
>>
>> Spark Streaming then reads this message from Kafka, reads the data of
>> device D at minute M from Cassandra and starts processing it.
>>
>> 2. Storm writes the data to both Cassandra and Kafka; Spark reads the
>> actual data from Kafka, processes it and writes the results to Cassandra.
>> The second approach avoids the additional hit of reading from Cassandra
>> every minute after a device has written its data, at the cost of putting
>> the actual heavy messages on Kafka instead of light events.
>>
>> I am a bit confused between the two approaches. Please suggest which one
>> is better, and if both are bad, how I can handle this use case.
>>
>>
>> --
>> /Vamsi
>>
>


-- 
/Vamsi
