Chris,

I wanted to check back with you to see if you have made any progress on this
issue. Any good news so far? Once again, I really appreciate you looking into
this.

Thanks,
Wei

On Thu, Aug 28, 2014 at 4:44 PM, Chris Fregly <ch...@fregly.com> wrote:

> great question, wei.  this is very important to understand from a
> performance perspective.  and this extends beyond kinesis - it's for any
> streaming source that supports shards/partitions.
>
> i need to do a little research into the internals to confirm my theory.
>
> lemme get back to you!
>
> -chris
>
>
> On Tue, Aug 26, 2014 at 11:37 AM, Wei Liu <wei....@stellarloyalty.com>
> wrote:
>
>> We are exploring using Kinesis and Spark Streaming together. I took a
>> look at the kinesis receiver code in 1.1.0. I have a question regarding
>> kinesis partition & spark streaming partition. It seems to be pretty
>> difficult to align these partitions.
>>
>> Kinesis partitions a stream of data into shards. If we follow the
>> example, we will have multiple Kinesis receivers reading from the same
>> stream in Spark Streaming. It seems that the Kinesis workers coordinate
>> among themselves and assign shards to themselves dynamically, so a
>> particular shard can be consumed by different Kinesis workers (and thus
>> different Spark workers) over time, though not at the same time. Blocks
>> are generated based on time intervals, RDDs are created from blocks, and
>> RDDs are partitioned by block. In the end, the data for a given shard
>> will be spread across multiple blocks (possibly located on different
>> Spark worker nodes).
>>
>> We will probably need to group this data again by shard and shuffle it
>> around to recover the partitioning we had in Kinesis.
>>
>> Is there a better way to achieve this that avoids reshuffling the data?
>>
>> Thanks,
>> Wei
>>
>
>
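
The block behaviour described in the quoted question can be illustrated with a
small simulation. This is only a sketch in plain Python, not Spark or Kinesis
code: the record layout (timestamp, shard id, payload), the 200 ms block
interval, and the helper names assign_blocks, blocks_per_shard and
regroup_by_shard are all assumptions made for illustration, and it models a
single receiver only.

```python
from collections import defaultdict


def assign_blocks(records, block_interval_ms):
    """Cut records into blocks purely by arrival time, ignoring the shard."""
    blocks = defaultdict(list)
    for timestamp_ms, shard_id, payload in records:
        block_id = timestamp_ms // block_interval_ms
        blocks[block_id].append((shard_id, payload))
    return dict(blocks)


def blocks_per_shard(blocks):
    """List the block ids in which each shard's data ended up."""
    seen = defaultdict(set)
    for block_id, items in blocks.items():
        for shard_id, _ in items:
            seen[shard_id].add(block_id)
    return {shard_id: sorted(ids) for shard_id, ids in seen.items()}


def regroup_by_shard(blocks):
    """Gather payloads per shard across all blocks (the regrouping step)."""
    by_shard = defaultdict(list)
    for block_id in sorted(blocks):
        for shard_id, payload in blocks[block_id]:
            by_shard[shard_id].append(payload)
    return dict(by_shard)


# Hypothetical records: (arrival time in ms, shard id, payload).
records = [
    (0, "shard-0", "a"),
    (50, "shard-1", "b"),
    (210, "shard-0", "c"),
    (260, "shard-1", "d"),
    (430, "shard-0", "e"),
]

blocks = assign_blocks(records, block_interval_ms=200)
print(blocks_per_shard(blocks))
print(regroup_by_shard(blocks))
```

Because blocks are cut by time rather than by shard, each shard's records land
in several blocks, and recovering the per-shard grouping means collecting
records across all of them. In a real job that regrouping is the step that
would involve a shuffle.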
