Why did you not stay with the batch approach? To me, the architecture looks
very complex for such a simple goal. Why don't you process the data
directly in Storm?
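For illustration, a bolt doing that could look roughly like this (Scala,
against the pre-1.0 backtype.storm API; the field names "device", "minute"
and "numbers" are invented, and a real bolt would also need to evict old
minutes, e.g. on tick tuples):

    import scala.collection.mutable
    import backtype.storm.topology.base.BaseBasicBolt
    import backtype.storm.topology.{BasicOutputCollector, OutputFieldsDeclarer}
    import backtype.storm.tuple.{Fields, Tuple, Values}

    // Accumulates the distinct numbers seen per (device, minute) and emits
    // each number only the first time it appears, so a downstream bolt can
    // write the uniques to a rollup table. In-memory state only.
    class UniquePerMinuteBolt extends BaseBasicBolt {
      private val seen = mutable.Map.empty[(String, Long), mutable.Set[Long]]

      override def execute(input: Tuple, collector: BasicOutputCollector): Unit = {
        val device = input.getStringByField("device")
        val minute: Long = input.getLongByField("minute")
        val numbers = input.getStringByField("numbers").split(" ").map(_.toLong)
        val set = seen.getOrElseUpdate((device, minute), mutable.Set.empty[Long])
        for (n <- numbers if set.add(n))  // add() is true only for new values
          collector.emit(new Values(device, minute: java.lang.Long, n: java.lang.Long))
      }

      override def declareOutputFields(declarer: OutputFieldsDeclarer): Unit =
        declarer.declare(new Fields("device", "minute", "value"))
    }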

On Tue, Sep 15, 2015 at 6:20 AM, srungarapu vamsi <srungarapu1...@gmail.com>
wrote:

> I am pretty new to Spark. Please suggest a better model for the following
> use case.
>
> I have a few (about 1,500) devices in the field which keep emitting about
> 100KB of data every minute. The data sent by the devices is just a list
> of numbers.
> As of now, Storm is in the architecture: it receives this data, sanitizes
> it, and writes it to Cassandra.
> Now I have a requirement to process this data. The processing includes
> finding the unique numbers emitted by one or more devices for every
> minute, every hour, every day, and every month.
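[One property worth noting here: per-minute sets of unique numbers roll up
to hours, days and months by plain set union (counts alone would not), so
only the minute level has to be computed from raw data. A toy Scala
illustration with invented data:

    // minute-of-epoch -> unique numbers seen in that minute (toy data)
    val perMinute: Map[Long, Set[Long]] =
      Map(0L -> Set(1L, 2L), 1L -> Set(2L, 3L), 60L -> Set(3L))
    // hourly uniques are just the union of that hour's minute sets
    val perHour: Map[Long, Set[Long]] =
      perMinute
        .groupBy { case (minute, _) => minute / 60 }
        .map { case (hour, ms) => hour -> ms.values.reduce(_ ++ _) }
    // perHour == Map(0L -> Set(1, 2, 3), 1L -> Set(3))
]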
> I had implemented this processing as a batch job, and now I am interested
> in making it a streaming application, i.e. computing the processed data
> as and when the devices emit it.
>
> I have the following two approaches:
> 1. Storm writes the actual data to Cassandra and publishes a message on
> the Kafka bus saying that the data for device D and minute M has been
> written to Cassandra.
>
> Spark Streaming then reads this message from Kafka, fetches the data for
> device D at minute M from Cassandra, and processes it.
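[For what it's worth, approach 1 might look roughly like this (Scala,
Spark 1.x Streaming plus the spark-cassandra-connector; the topic
"data-ready", the event format "deviceId,minute", and the keyspace, table
and column names are all invented):

    import kafka.serializer.StringDecoder
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils
    import com.datastax.spark.connector._

    object Approach1 {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("unique-numbers")
          .set("spark.cassandra.connection.host", "127.0.0.1")
        val ssc = new StreamingContext(conf, Seconds(60))

        val kafkaParams = Map("metadata.broker.list" -> "localhost:9092")
        // Lightweight "data is ready" events, one per device-minute.
        val events = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
          ssc, kafkaParams, Set("data-ready"))

        events.foreachRDD { rdd =>
          val sc = rdd.sparkContext
          // Each event names the device-minute that was just written.
          for (Array(deviceId, minute) <- rdd.map(_._2.split(",")).collect()) {
            val uniques = sc
              .cassandraTable[Long]("sensor", "readings") // extra read per event
              .select("value")
              .where("device_id = ? AND minute = ?", deviceId, minute.toLong)
              .distinct()
            // ... save `uniques` to a rollup table, fold into hour/day/month ...
          }
        }

        ssc.start()
        ssc.awaitTermination()
      }
    }
]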
>
> 2. Storm writes the data to both Cassandra and Kafka; Spark reads the
> actual data from Kafka, processes it, and writes the results to Cassandra.
> The second approach avoids the additional hit of reading back from
> Cassandra for every minute of data a device has written, at the cost of
> putting the actual heavy payloads on Kafka instead of lightweight events.
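[Approach 2, by contrast, does all the work on the stream itself, something
like the following (same invented names as above; here the Kafka message is
assumed to carry the full payload as "deviceId,minute,n1 n2 n3 ..."):

    import kafka.serializer.StringDecoder
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils
    import com.datastax.spark.connector._           // SomeColumns
    import com.datastax.spark.connector.streaming._ // saveToCassandra on DStreams

    object Approach2 {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("unique-numbers-direct")
          .set("spark.cassandra.connection.host", "127.0.0.1")
        val ssc = new StreamingContext(conf, Seconds(60))

        val kafkaParams = Map("metadata.broker.list" -> "localhost:9092")
        // Full payloads on the bus: "deviceId,minute,n1 n2 n3 ..."
        val payloads = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
          ssc, kafkaParams, Set("device-data"))

        // Explode each payload into (device, minute, value) rows, then keep
        // one row per distinct value within the batch.
        val uniques = payloads
          .flatMap { case (_, msg) =>
            val Array(deviceId, minute, numbers) = msg.split(",", 3)
            numbers.split(" ").map(n => (deviceId, minute.toLong, n.toLong))
          }
          .transform(_.distinct())

        uniques.saveToCassandra("sensor", "minute_uniques",
          SomeColumns("device_id", "minute", "value"))

        ssc.start()
        ssc.awaitTermination()
      }
    }
]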
>
> I am a bit confused between the two approaches. Please suggest which one
> is better, and if both are bad, how I can handle this use case.
>
>
> --
> /Vamsi
>
