Re: Spark Streaming with Kafka - batch DStreams in memory

2016-02-02 Thread Cody Koeninger
It's possible you could (ab)use updateStateByKey or mapWithState for this.
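
Roughly, a mapWithState version might look like the sketch below (untested;
the 64MB threshold and the String / Array[Byte] record types are assumptions
pulled from your description, and mapWithState requires checkpointing to be
enabled on the StreamingContext):

    import org.apache.spark.streaming.{State, StateSpec}

    val blockSize = 64L * 1024 * 1024   // target flush size from the question

    // Append each incoming value to the per-key buffer held in state; once
    // the buffered bytes cross the threshold, emit the batch and clear state.
    val mappingFunc = (key: String, value: Option[Array[Byte]],
                       state: State[Vector[Array[Byte]]]) => {
      val buffered = state.getOption().getOrElse(Vector.empty) ++ value
      if (buffered.iterator.map(_.length.toLong).sum >= blockSize) {
        state.remove()           // flushed; start fresh for this key
        Some((key, buffered))    // downstream code writes this out as a file
      } else {
        state.update(buffered)   // keep accumulating
        None
      }
    }

    // keyedStream: DStream[(String, Array[Byte])] built from the Kafka stream
    val flushes = keyedStream.mapWithState(StateSpec.function(mappingFunc))
    flushes.flatMap(_.toSeq).foreachRDD { rdd =>
      rdd.foreach { case (key, records) =>
        // write `records` for `key` to HDFS here (sink left hypothetical)
      }
    }

Keep in mind all of that buffered state gets checkpointed every batch, which
is part of why I'd call holding ~64MB per key in state an abuse of the API.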

But honestly it's probably a lot more straightforward to just choose a
reasonable batch size that gets you a reasonable file size for most of your
keys, then use filecrush or something similar to deal with the HDFS
small-file problem.
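
For the write side, one common way to get per-key files out of each batch is
the MultipleTextOutputFormat trick, then compact afterwards. A rough sketch
(the class name and output path are mine, not from this thread, and it
assumes a DStream of String pairs):

    import org.apache.hadoop.io.NullWritable
    import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat

    // Route each record to a directory named after its key, so a later
    // compaction pass (filecrush etc.) can merge the per-key small files.
    class KeyBasedOutput extends MultipleTextOutputFormat[Any, Any] {
      override def generateFileNameForKeyValue(key: Any, value: Any,
                                               name: String): String =
        key.toString + "/" + name            // e.g. <key>/part-00000
      override def generateActualKey(key: Any, value: Any): Any =
        NullWritable.get()                   // don't repeat the key in the file
    }

    // keyedStream: DStream[(String, String)] keyed by the message attribute
    keyedStream.foreachRDD { (rdd, time) =>
      rdd.saveAsHadoopFile(s"/data/out/batch-${time.milliseconds}",
                           classOf[String], classOf[String],
                           classOf[KeyBasedOutput])
    }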

On Mon, Feb 1, 2016 at 10:11 PM, p pathiyil  wrote:

> Hi,
>
> Are there any ways to store DStreams / RDDs read from Kafka in memory to
> be processed at a later time? What we need to do is read data from Kafka,
> key it by some attribute present in the Kafka messages, and write out the
> data for each key once enough has accumulated for that key to fill a file
> close to the HDFS block size, say 64MB. We are looking for ways to avoid
> periodically writing out files covering the entire Kafka content and then
> running a second job to read those files and split them into another set
> of files as necessary.
>
> Thanks.
>


Spark Streaming with Kafka - batch DStreams in memory

2016-02-01 Thread p pathiyil
Hi,

Are there any ways to store DStreams / RDDs read from Kafka in memory to
be processed at a later time? What we need to do is read data from Kafka,
key it by some attribute present in the Kafka messages, and write out the
data for each key once enough has accumulated for that key to fill a file
close to the HDFS block size, say 64MB. We are looking for ways to avoid
periodically writing out files covering the entire Kafka content and then
running a second job to read those files and split them into another set
of files as necessary.

Thanks.