great question, wei. this is very important to understand from a performance perspective. and this extends is beyond kinesis - it's for any streaming source that supports shards/partitions.
i need to do a little research into the internals to confirm my theory. lemme get back to you! -chris On Tue, Aug 26, 2014 at 11:37 AM, Wei Liu <wei....@stellarloyalty.com> wrote: > We are exploring using Kinesis and spark streaming together. I took at a > look at the kinesis receiver code in 1.1.0. I have a question regarding > kinesis partition & spark streaming partition. It seems to be pretty > difficult to align these partitions. > > Kinesis partitions a stream of data into shards, if we follow the example, > we will have multiple kinesis receivers reading from the same stream in > spark streaming. It seems like kinesis workers will coordinate among > themselves and assign shards to themselves dynamically. For a particular > shard, it can be consumed by different kinesis workers (thus different > spark workers) dynamically (not at the same time). Blocks are generated > based on time intervals, RDD are created based on blocks. RDDs are > partitioned based on blocks. At the end, the data for a given shard will be > spread into multiple blocks (possible located on different spark worker > nodes). > > We will probably need to group these data again for a given shard and > shuffle data around to achieve the same partition we had in Kinesis. > > Is there a better way to achieve this to avoid data reshuffling? > > Thanks, > Wei >