I am pretty new to Spark. Please suggest a better model for the following use case.
I have about 1500 devices in the field, each emitting about 100 KB of data every minute. The data sent by the devices is just a list of numbers. At present, Storm is in the architecture: it receives this data, sanitizes it, and writes it to Cassandra.

I now have a requirement to process this data. The processing involves finding the unique numbers emitted by one or more devices for every minute, hour, day, and month. I had implemented this processing as a batch job, and I am now interested in making it a streaming application, i.e. computing the results as and when the devices emit the data.

I have the following two approaches:

1. Storm writes the actual data to Cassandra and puts a message on the Kafka bus saying that the data for device D and minute M has been written to Cassandra. Spark Streaming then reads this message from Kafka, reads the data of device D at minute M from Cassandra, and processes it.

2. Storm writes the data to both Cassandra and Kafka. Spark reads the actual data from Kafka, processes it, and writes the results to Cassandra.

The second approach avoids the additional read from Cassandra each time a device has written a minute of data, at the cost of putting the actual heavy messages, instead of light events, on Kafka.

I am a bit confused between the two approaches. Please suggest which one is better and, if both are bad, how I can handle this use case.

--
/Vamsi
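
P.S. To make the processing concrete, here is a minimal sketch of the uniqueness roll-up described above, in plain Python rather than Spark. It assumes each incoming batch is one device's list of numbers for one minute, and that "unique" means distinct across all devices within a time bucket. The names `BUCKETS` and `update_uniques`, and the sample timestamps and values, are made up for illustration only.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical bucket key formats: one per granularity mentioned above.
BUCKETS = {
    "minute": "%Y-%m-%d %H:%M",
    "hour": "%Y-%m-%d %H",
    "day": "%Y-%m-%d",
    "month": "%Y-%m",
}


def update_uniques(state, ts, numbers):
    """Merge one device's one-minute batch into every bucket's unique set."""
    for name, fmt in BUCKETS.items():
        # Set union keeps only distinct numbers, regardless of which device sent them.
        state[(name, ts.strftime(fmt))].update(numbers)
    return state


# Made-up sample data: two devices in the same minute, then one device a minute later.
state = defaultdict(set)
update_uniques(state, datetime(2015, 1, 1, 10, 0), [1, 2, 3])  # device D1
update_uniques(state, datetime(2015, 1, 1, 10, 0), [3, 4])     # device D2
update_uniques(state, datetime(2015, 1, 1, 10, 1), [4, 5])     # device D1
```

In either approach, the streaming job would apply this kind of incremental merge as each batch arrives; the two approaches differ only in where the list of numbers is read from.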