Mohit,

> I want to process the data in real-time as well as store the data in hdfs in
> year/month/day/hour/ format.

Are you wanting to process it and then put it into HDFS, or just put the raw data into HDFS? If the latter, why not just use Camus (https://github.com/linkedin/camus)? It will easily put the data into the directory structure you are after.
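If you do end up rolling your own writer instead, the main piece is mapping each record's timestamp to an hourly partition path. A minimal sketch of that mapping (the `PartitionPath` class and `partitionFor` helper are hypothetical names, not from Camus; UTC is assumed as the partition time zone):

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

public class PartitionPath {

    // Formats an epoch-millis timestamp as a year/month/day/hour
    // directory path, e.g. "2015/02/06/00". Zero-padded months,
    // days, and hours keep the directories lexically sortable.
    static final DateTimeFormatter HOURLY =
            DateTimeFormatter.ofPattern("yyyy/MM/dd/HH").withZone(ZoneOffset.UTC);

    static String partitionFor(long epochMillis) {
        return HOURLY.format(Instant.ofEpochMilli(epochMillis));
    }

    public static void main(String[] args) {
        // 2015-02-06T00:00:00Z -> "2015/02/06/00"
        System.out.println(partitionFor(1423180800000L));
    }
}
```

You would prepend your HDFS base directory to the returned path and append writes for the hour to a single file per partition, rather than one file per mini-batch.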
On Fri, Feb 6, 2015 at 12:19 AM, Mohit Durgapal <durgapalmo...@gmail.com> wrote:
> I want to write a spark streaming consumer for kafka in java. I want to
> process the data in real-time as well as store the data in hdfs in
> year/month/day/hour/ format. I am not sure how to achieve this. Should I
> write separate kafka consumers, one for writing data to HDFS and one for
> spark streaming?
>
> Also I would like to ask what do people generally do with the result of
> spark streams after aggregating over it? Is it okay to update a NoSQL DB
> with aggregated counts per batch interval or is it generally stored in hdfs?
>
> Is it possible to store the mini batch data from spark streaming to HDFS
> in a way that the data is aggregated hourly and put into HDFS in its
> "hour" folder. I would not want a lot of small files equal to the mini
> batches of spark per hour, that would be inefficient for running hadoop
> jobs later.
>
> Is anyone working on the same problem?
>
> Any help and comments would be great.
>
>
> Regards
> Mohit