Hi,

Is there a way to keep DStreams / RDDs read from Kafka in memory so they can be
processed at a later time? We need to read data from Kafka, key it by an
attribute present in the Kafka messages, and, for each key, write out the
accumulated data once it is large enough to produce a file close to the HDFS
block size, say 64 MB. We are looking for a way to avoid periodically writing
the entire Kafka content to files and then running a second job to read those
files and split them into another set of files as necessary.
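For what it's worth, here is a minimal, Spark-independent sketch of the
accumulate-and-flush idea being described: buffer records per key and emit a
key's batch once its accumulated size reaches a target. The names
(`KeyedBuffer`, `BLOCK_SIZE`, the `flush` callback) are purely illustrative,
not any Spark or Kafka API; in Spark Streaming the per-key state itself could
be carried across batches with something like `updateStateByKey` (with
checkpointing enabled), with a write triggered when a key's state crosses the
size threshold.

```python
from collections import defaultdict

# Illustrative target: flush a key once its buffered payload nears
# an HDFS-block-sized file (64 MB is used as in the message above).
BLOCK_SIZE = 64 * 1024 * 1024

class KeyedBuffer:
    """Accumulates records per key; flushes a key when its size hits a target.

    `flush` is a hypothetical callback standing in for "write one
    block-sized file for this key" (e.g. an HDFS write in a real job).
    """

    def __init__(self, block_size=BLOCK_SIZE, flush=None):
        self.block_size = block_size
        self.flush = flush or (lambda key, records: None)
        self.buffers = defaultdict(list)   # key -> buffered records
        self.sizes = defaultdict(int)      # key -> accumulated bytes

    def add(self, key, record):
        self.buffers[key].append(record)
        self.sizes[key] += len(record)
        if self.sizes[key] >= self.block_size:
            # Enough data for this key: emit it and reset its buffer.
            self.flush(key, self.buffers.pop(key))
            del self.sizes[key]
```

The open question for the list is where such state can safely live between
micro-batches; keeping it inside driver/executor memory only works if it
survives failures, which is why checkpointed stateful operations are the
usual suggestion.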

Thanks.
