I want to write a Spark Streaming consumer for Kafka in Java. I want to
process the data in real time and also store it in HDFS in a
year/month/day/hour/ directory layout. I am not sure how to achieve this.
Should I write two separate Kafka consumers, one for writing data to HDFS
and one for Spark Streaming?
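
To make it concrete, below is a rough sketch of what I have in mind so far,
assuming the spark-streaming-kafka-0-10 integration; the broker list, group
id, topic name, batch interval and HDFS base path are all placeholders:

import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka010.ConsumerStrategies;
import org.apache.spark.streaming.kafka010.KafkaUtils;
import org.apache.spark.streaming.kafka010.LocationStrategies;

public class KafkaToHdfs {
  public static void main(String[] args) throws InterruptedException {
    SparkConf conf = new SparkConf().setAppName("kafka-to-hdfs");
    // placeholder batch interval
    JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.minutes(1));

    Map<String, Object> kafkaParams = new HashMap<>();
    kafkaParams.put("bootstrap.servers", "broker1:9092");         // placeholder
    kafkaParams.put("key.deserializer", StringDeserializer.class);
    kafkaParams.put("value.deserializer", StringDeserializer.class);
    kafkaParams.put("group.id", "my-streaming-group");            // placeholder
    kafkaParams.put("auto.offset.reset", "latest");

    JavaInputDStream<ConsumerRecord<String, String>> stream =
        KafkaUtils.createDirectStream(
            jssc,
            LocationStrategies.PreferConsistent(),
            ConsumerStrategies.<String, String>Subscribe(
                Arrays.asList("mytopic"), kafkaParams));          // placeholder topic

    JavaDStream<String> events = stream.map(ConsumerRecord::value);

    // Real-time processing would go here (counts, alerts, ...). For the HDFS
    // side, write each micro-batch under its hour directory; the batch time
    // in the path keeps micro-batches from overwriting each other.
    events.foreachRDD((rdd, time) -> {
      String hourDir = Instant.ofEpochMilli(time.milliseconds())
          .atZone(ZoneOffset.UTC)
          .format(DateTimeFormatter.ofPattern("yyyy/MM/dd/HH"));
      if (!rdd.isEmpty()) {
        rdd.saveAsTextFile("hdfs:///data/mytopic/" + hourDir      // placeholder base path
            + "/batch-" + time.milliseconds());
      }
    });

    jssc.start();
    jssc.awaitTermination();
  }
}

This would mean a single streaming job doing both the real-time processing
and the HDFS writes, rather than two consumers, but I am not sure if that is
the right way to go.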

I would also like to ask: what do people generally do with the results of
Spark Streaming aggregations? Is it okay to update a NoSQL DB with aggregated
counts every batch interval, or are the results generally stored in HDFS?
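
For the per-batch counts, this is roughly what I am picturing, continuing
from the events stream in the sketch above; CounterStore is a made-up
stand-in for whatever real NoSQL client would be used, and the key
extraction is just a placeholder:

// Hypothetical per-batch counting followed by an upsert into a NoSQL store.
static void updateCountsPerBatch(JavaDStream<String> events) {
  events
      .mapToPair(e -> new scala.Tuple2<>(e.split(",")[0], 1L))   // placeholder key extraction
      .reduceByKey(Long::sum)
      .foreachRDD(rdd -> rdd.foreachPartition(iter -> {
        CounterStore store = CounterStore.connect("nosql-host"); // hypothetical client
        while (iter.hasNext()) {
          scala.Tuple2<String, Long> kv = iter.next();
          store.incrementBy(kv._1(), kv._2());  // add this batch's count to the running total
        }
        store.close();
      }));
}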

Is it possible to store the micro-batch data from Spark Streaming to HDFS so
that the data is aggregated hourly and written into its "hour" folder? I
would not want a large number of small files, one per micro-batch, piling up
each hour; that would be inefficient for running Hadoop jobs later.
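
One idea I had for that is to window the stream over an hour (sliding every
hour) and coalesce before writing, something like the rough sketch below;
the partition count and path are placeholders, and it assumes an hour of
data fits comfortably in the cluster:

// Again continuing from the events stream above: one write per hour instead
// of one write per micro-batch.
events.window(Durations.minutes(60), Durations.minutes(60))
    .foreachRDD((rdd, time) -> {
      // label the directory by the hour the window covers, i.e. the window
      // start (window end minus one hour)
      long windowStart = time.milliseconds() - Durations.minutes(60).milliseconds();
      String hourDir = Instant.ofEpochMilli(windowStart)
          .atZone(ZoneOffset.UTC)
          .format(DateTimeFormatter.ofPattern("yyyy/MM/dd/HH"));
      if (!rdd.isEmpty()) {
        rdd.coalesce(4).saveAsTextFile("hdfs:///data/mytopic/" + hourDir); // placeholders
      }
    });

The alternative I can think of is to keep writing small per-batch files and
run a separate hourly compaction job over them, but I don't know which
approach people usually take.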

Is anyone working on the same problem?

Any help and comments would be great.


Regards

Mohit
