We are exploring using Kinesis and Spark Streaming together. I took a
look at the Kinesis receiver code in 1.1.0, and I have a question about
how Kinesis partitions (shards) relate to Spark Streaming partitions. It
seems pretty difficult to align the two.

Kinesis partitions a stream of data into shards. If we follow the example,
we will have multiple Kinesis receivers reading from the same stream in
Spark Streaming. It seems that the Kinesis workers coordinate among
themselves and assign shards to themselves dynamically, so a particular
shard can be consumed by different Kinesis workers (and thus different
Spark workers) over time, though not at the same time. Blocks are generated
based on time intervals, RDDs are created from blocks, and RDDs are
partitioned by block. In the end, the data for a given shard will be
spread across multiple blocks (possibly located on different Spark worker
nodes).

We will probably need to group the data by shard again and shuffle it
around to get back the same partitioning we had in Kinesis.
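
To make the concern concrete, here is a minimal sketch in plain Python. It
is not Spark or Kinesis code, only a simulation of my understanding: the
records, the receiver names A and B, the 200 ms block interval, and the
helpers to_blocks and regroup_by_shard are all made up for illustration.

```python
from collections import defaultdict

# Hypothetical records: (arrival_time_ms, shard_id, receiver_id, payload).
# shard-0 is first read by receiver A, then picked up by receiver B after
# the Kinesis workers rebalance.
records = [
    (0,   "shard-0", "A", "r0"),
    (120, "shard-1", "B", "r1"),
    (250, "shard-0", "A", "r2"),
    (410, "shard-0", "B", "r3"),
    (430, "shard-1", "B", "r4"),
]

BLOCK_INTERVAL_MS = 200  # assumed value, for illustration only


def to_blocks(records, interval_ms):
    """Cut each receiver's records into blocks by time interval."""
    blocks = defaultdict(list)
    for ts, shard, receiver, payload in records:
        blocks[(receiver, ts // interval_ms)].append((shard, payload))
    return dict(blocks)


def regroup_by_shard(blocks):
    """The shuffle-like step: collect every record of a shard in one place."""
    by_shard = defaultdict(list)
    # Keys are sorted only to make the result deterministic.
    for key in sorted(blocks):
        for shard, payload in blocks[key]:
            by_shard[shard].append(payload)
    return dict(by_shard)


blocks = to_blocks(records, BLOCK_INTERVAL_MS)

# Blocks (receiver, interval index) that hold at least one shard-0 record.
shard0_blocks = sorted(
    key for key, items in blocks.items()
    if any(shard == "shard-0" for shard, _ in items)
)

by_shard = regroup_by_shard(blocks)
```

In this toy run, shard-0 ends up in three blocks across two receivers, and
regroup_by_shard is the extra pass over all blocks that I would like to
avoid.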

Is there a better way to achieve this that avoids reshuffling the data?

Thanks,
Wei
