spark streaming from kafka real time + batch processing in java

2015-02-06 Thread Mohit Durgapal
I want to write a Spark Streaming consumer for Kafka in Java. I want to process the data in real time as well as store the data in HDFS in year/month/day/hour/ format. I am not sure how to achieve this. Should I write separate Kafka consumers, one for writing data to HDFS and one for Spark
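
The year/month/day/hour/ layout mentioned above can be derived from a record or batch timestamp. The following is only a minimal sketch of that one step, not a full consumer: the class name `HourlyPath`, the method `hourlyDir`, the base directory, and the choice of UTC are all assumptions made for illustration and are not part of Spark, Kafka, or HDFS.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

// Hypothetical helper (not a Spark, Kafka, or HDFS API): maps a timestamp
// to a year/month/day/hour directory under a base path.
final class HourlyPath {

    // Assumption: hours are bucketed in UTC. Swap the zone if the cluster
    // convention is local time.
    private static final DateTimeFormatter HOUR_DIR =
            DateTimeFormatter.ofPattern("yyyy/MM/dd/HH").withZone(ZoneOffset.UTC);

    private HourlyPath() {
    }

    // Returns baseDir/yyyy/MM/dd/HH for the given epoch milliseconds.
    static String hourlyDir(String baseDir, long epochMillis) {
        // Drop a single trailing slash so the result never contains "//".
        String base = baseDir.endsWith("/")
                ? baseDir.substring(0, baseDir.length() - 1)
                : baseDir;
        return base + "/" + HOUR_DIR.format(Instant.ofEpochMilli(epochMillis));
    }
}
```

For example, `HourlyPath.hourlyDir("/data/events", 1423230300000L)` yields `/data/events/2015/02/06/13`. In a streaming job the timestamp could plausibly be the batch time in milliseconds, and the resulting string could be used as the output directory for that batch; whether one directory per batch or per hour is appropriate depends on the batch interval.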

Re: spark streaming from kafka real time + batch processing in java

2015-02-06 Thread Andrew Psaltis
Mohit, "I want to process the data in real-time as well as store the data in hdfs in year/month/day/hour/ format." Do you want to process it and then put it into HDFS, or just put the raw data into HDFS? If the latter, then why not just use Camus (https://github.com/linkedin/camus), it will

Re: spark streaming from kafka real time + batch processing in java

2015-02-06 Thread Charles Feduke
Good questions, some of which I'd like to know the answer to. "Is it okay to update a NoSQL DB with aggregated counts per batch interval or is it generally stored in hdfs?" This depends on how you are going to use the aggregate data. 1. Is there a lot of data? If so, and you are going to use

spark streaming from kafka real time + batch processing in java

2015-02-05 Thread Mohit Durgapal
I want to write a Spark Streaming consumer for Kafka in Java. I want to process the data in real time as well as store the data in HDFS in year/month/day/hour/ format. I am not sure how to achieve this. Should I write separate Kafka consumers, one for writing data to HDFS and one for Spark