Chris, Think I will check back with you to see if you made progress on this issue. Any good news so far? Thanks. Once again, I really appreciate you look into this issue.
Thanks, Wei On Thu, Aug 28, 2014 at 4:44 PM, Chris Fregly <ch...@fregly.com> wrote: > great question, wei. this is very important to understand from a > performance perspective. and this extends is beyond kinesis - it's for any > streaming source that supports shards/partitions. > > i need to do a little research into the internals to confirm my theory. > > lemme get back to you! > > -chris > > > On Tue, Aug 26, 2014 at 11:37 AM, Wei Liu <wei....@stellarloyalty.com> > wrote: > >> We are exploring using Kinesis and spark streaming together. I took at a >> look at the kinesis receiver code in 1.1.0. I have a question regarding >> kinesis partition & spark streaming partition. It seems to be pretty >> difficult to align these partitions. >> >> Kinesis partitions a stream of data into shards, if we follow the >> example, we will have multiple kinesis receivers reading from the same >> stream in spark streaming. It seems like kinesis workers will coordinate >> among themselves and assign shards to themselves dynamically. For a >> particular shard, it can be consumed by different kinesis workers (thus >> different spark workers) dynamically (not at the same time). Blocks are >> generated based on time intervals, RDD are created based on blocks. RDDs >> are partitioned based on blocks. At the end, the data for a given shard >> will be spread into multiple blocks (possible located on different spark >> worker nodes). >> >> We will probably need to group these data again for a given shard and >> shuffle data around to achieve the same partition we had in Kinesis. >> >> Is there a better way to achieve this to avoid data reshuffling? >> >> Thanks, >> Wei >> > >