Mohit,

> I want to process the data in real-time as well as store the data in hdfs in
> year/month/day/hour/ format.

Are you wanting to process it and then put it into HDFS, or just put the raw data into HDFS? If the latter, why not just use Camus (https://github.com/linkedin/camus)? It will easily put the data into the directory structure you are after.
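If you do end up rolling your own writer instead, the main piece is mapping each record's timestamp to an hourly partition path. A minimal sketch of that mapping (the `PartitionPath` class and `partitionFor` helper are hypothetical names, not from Camus; UTC is assumed as the partition time zone):

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

public class PartitionPath {

    // Formats an epoch-millis timestamp as a year/month/day/hour
    // directory path, e.g. "2015/02/06/00". Zero-padded months,
    // days, and hours keep the directories lexically sortable.
    static final DateTimeFormatter HOURLY =
            DateTimeFormatter.ofPattern("yyyy/MM/dd/HH").withZone(ZoneOffset.UTC);

    static String partitionFor(long epochMillis) {
        return HOURLY.format(Instant.ofEpochMilli(epochMillis));
    }

    public static void main(String[] args) {
        // 2015-02-06T00:00:00Z -> "2015/02/06/00"
        System.out.println(partitionFor(1423180800000L));
    }
}
```

You would prepend your HDFS base directory to the returned path and append writes for the hour to a single file per partition, rather than one file per mini-batch.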
On Fri, Feb 6, 2015 at 12:19 AM, Mohit Durgapal <durgapalmo...@gmail.com> wrote:
> I want to write a spark streaming consumer for kafka in java. I want to
> process the data in real-time as well as store the data in hdfs in
> year/month/day/hour/ format. I am not sure how to achieve this. Should I
> write separate kafka consumers, one for writing data to HDFS and one for
> spark streaming?
>
> Also I would like to ask what do people generally do with the result of
> spark streams after aggregating over it? Is it okay to update a NoSQL DB
> with aggregated counts per batch interval or is it generally stored in hdfs?
>
> Is it possible to store the mini batch data from spark streaming to HDFS
> in a way that the data is aggregated hourly and put into HDFS in its
> "hour" folder. I would not want a lot of small files equal to the mini
> batches of spark per hour, that would be inefficient for running hadoop
> jobs later.
>
> Is anyone working on the same problem?
>
> Any help and comments would be great.
>
>
> Regards
> Mohit