great question, wei.  this is very important to understand from a
performance perspective.  and this extends beyond kinesis - it applies to
any streaming source that supports shards/partitions.

i need to do a little research into the internals to confirm my theory.

lemme get back to you!

-chris


On Tue, Aug 26, 2014 at 11:37 AM, Wei Liu <wei....@stellarloyalty.com>
wrote:

> We are exploring using Kinesis and spark streaming together. I took a
> look at the kinesis receiver code in 1.1.0. I have a question regarding
> kinesis partition & spark streaming partition. It seems to be pretty
> difficult to align these partitions.
>
> Kinesis partitions a stream of data into shards. If we follow the example,
> we will have multiple kinesis receivers reading from the same stream in
> spark streaming. It seems like kinesis workers will coordinate among
> themselves and assign shards to themselves dynamically. For a particular
> shard, it can be consumed by different kinesis workers (thus different
> spark workers) dynamically (not at the same time). Blocks are generated
> based on time intervals, and RDDs are created based on blocks. RDDs are
> partitioned based on blocks. At the end, the data for a given shard will be
> spread into multiple blocks (possibly located on different spark worker
> nodes).
>
> We will probably need to group these data again for a given shard and
> shuffle data around to achieve the same partition we had in Kinesis.
>
> Is there a better way to achieve this to avoid data reshuffling?
>
> Thanks,
> Wei
>
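
To make the spread described above concrete, here is a toy model in plain
Python. It is not Spark or Kinesis code, only a sketch under stated
assumptions: the records, the receiver names "A" and "B", the 200 ms block
interval, and both helper functions are hypothetical. It assumes one block per
receiver per interval and one RDD partition per block, as the question
describes, and that shard-0 is picked up by a different receiver partway
through the batch.

```python
from collections import defaultdict

BLOCK_INTERVAL_MS = 200  # assumed block interval for this toy model

# Hypothetical records: (shard_id, receiver_id, arrival_ms).
# shard-0 is read by receiver A at first, then by receiver B.
records = [
    ("shard-0", "A", 50),
    ("shard-1", "B", 60),
    ("shard-0", "A", 250),
    ("shard-1", "B", 270),
    ("shard-0", "B", 450),
    ("shard-1", "B", 460),
]


def blocks_by_receiver_and_interval(recs):
    """Cut one block per (receiver, time interval); each block is one partition."""
    blocks = defaultdict(list)
    for shard, receiver, ts in recs:
        blocks[(receiver, ts // BLOCK_INTERVAL_MS)].append((shard, ts))
    return dict(blocks)


def regroup_by_shard(blocks):
    """Stand-in for the shuffle: gather every record of a shard into one group."""
    grouped = defaultdict(list)
    for block in blocks.values():
        for shard, ts in block:
            grouped[shard].append(ts)
    return {shard: sorted(ts_list) for shard, ts_list in grouped.items()}


blocks = blocks_by_receiver_and_interval(records)

# Which blocks (partitions) hold data for each shard?
blocks_per_shard = defaultdict(set)
for block_id, block in blocks.items():
    for shard, _ in block:
        blocks_per_shard[shard].add(block_id)

for shard in sorted(blocks_per_shard):
    print(shard, "appears in", len(blocks_per_shard[shard]), "blocks")
print(regroup_by_shard(blocks))
```

In this model each shard ends up in three different blocks, and shard-0's
blocks belong to two different receivers, so restoring the per-shard grouping
means moving records between partitions, which is the reshuffle the question
asks about.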
