We are exploring using Kinesis and Spark Streaming together. I took a look at the Kinesis receiver code in 1.1.0, and I have a question about how Kinesis partitions (shards) relate to Spark Streaming partitions. It seems pretty difficult to align the two.
Kinesis partitions a stream of data into shards. If we follow the example, we will have multiple Kinesis receivers reading from the same stream in Spark Streaming. It seems that the Kinesis workers coordinate among themselves and assign shards to themselves dynamically, so a particular shard can be consumed by different Kinesis workers (and thus different Spark workers) over time, though not at the same time.
Blocks are generated based on time intervals, RDDs are created from blocks, and RDDs are partitioned by block. In the end, the data for a given shard will be spread across multiple blocks (possibly located on different Spark worker nodes). We will probably need to group the data by shard again and shuffle it around to get back the partitioning we had in Kinesis.
Is there a better way to achieve this that avoids reshuffling the data?
Thanks,
Wei
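
P.S. To make the concern concrete, here is a small toy model in plain Python. It is only a sketch of my understanding, not Spark or Kinesis code: the record layout, the `cut_blocks` and `regroup_by_shard` helpers, and the 200 ms interval are all assumptions made up for illustration. It cuts records into blocks purely by arrival time, then regroups them by shard, which is the step that stands in for the shuffle.

```python
from collections import defaultdict

# Hypothetical toy records: (arrival_time_ms, shard_id, payload).
# This layout is an assumption for illustration, not a Kinesis or Spark type.
records = [
    (0, "shard-0", "a"),
    (50, "shard-1", "b"),
    (120, "shard-0", "c"),
    (250, "shard-1", "d"),
    (310, "shard-0", "e"),
]

BLOCK_INTERVAL_MS = 200  # assumed block interval, for illustration only


def cut_blocks(records, interval_ms):
    """Group records into blocks by arrival-time window only, ignoring shard."""
    blocks = defaultdict(list)
    for ts, shard, payload in records:
        blocks[ts // interval_ms].append((shard, payload))
    # Return blocks in time order; each block would become one RDD partition.
    return [blocks[k] for k in sorted(blocks)]


def regroup_by_shard(blocks):
    """Stand-in for the shuffle: collect each shard's records across all blocks."""
    by_shard = defaultdict(list)
    for block in blocks:
        for shard, payload in block:
            by_shard[shard].append(payload)
    return dict(by_shard)


blocks = cut_blocks(records, BLOCK_INTERVAL_MS)
per_shard = regroup_by_shard(blocks)

for i, block in enumerate(blocks):
    print("block", i, block)
print("regrouped", per_shard)
```

In this toy run both shards end up in both blocks, so recovering per-shard data means touching every block. In a real job I would expect that regrouping to be a key-based operation on a shard identifier (assuming the records carry one), which is exactly the shuffle I would like to avoid.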