[jira] [Created] (FLINK-3231) Handle Kinesis-side resharding in Kinesis streaming consumer

2016-01-13 Thread Tzu-Li (Gordon) Tai (JIRA)
Tzu-Li (Gordon) Tai created FLINK-3231:
--

 Summary: Handle Kinesis-side resharding in Kinesis streaming 
consumer
 Key: FLINK-3231
 URL: https://issues.apache.org/jira/browse/FLINK-3231
 Project: Flink
  Issue Type: Sub-task
  Components: Streaming Connectors
Affects Versions: 1.0.0
Reporter: Tzu-Li (Gordon) Tai


A big difference between Kinesis shards and Kafka partitions is that Kinesis 
users can choose to "merge" and "split" shards at any time for adjustable 
stream throughput capacity. This article explains this quite clearly: 
https://brandur.org/kinesis-by-example.

This will break the static shard-to-task mapping implemented in the basic 
version of the Kinesis consumer 
(https://issues.apache.org/jira/browse/FLINK-3211). The static shard-to-task 
mapping is done in a simple round-robin-like distribution which can be locally 
determined at each Flink consumer task (Flink Kafka consumer does this too).
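
The locally-determinable round-robin distribution mentioned above could look roughly like the following sketch (class and method names are hypothetical, not from the actual FLINK-3211 code): each subtask derives its owned shard indices from nothing more than the total shard count, the parallelism, and its own subtask index.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of a static round-robin shard-to-task mapping.
// Each parallel subtask can compute its assignment locally, with no
// coordination, as long as the shard list is static.
public class ShardAssigner {

    /** Returns the shard indices handled by the given subtask. */
    public static List<Integer> assignedShards(
            int totalShards, int parallelism, int subtaskIndex) {
        List<Integer> assigned = new ArrayList<>();
        for (int shard = 0; shard < totalShards; shard++) {
            // Round-robin: shard i goes to subtask (i mod parallelism).
            if (shard % parallelism == subtaskIndex) {
                assigned.add(shard);
            }
        }
        return assigned;
    }
}
```

This only works while the shard list never changes, which is exactly the assumption that resharding breaks.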

To handle Kinesis resharding, we will need some way for the Flink consumer 
tasks to coordinate which shards they are currently handling, and to let a task 
ask the coordinator for a shard reassignment when it discovers a closed shard 
at runtime (Kinesis closes shards when they are merged or split).

A possible approach is a centralized coordinator state store that is visible to 
all Flink consumer tasks. Tasks can use this state store to locally determine 
which shards they can be reassigned. ZooKeeper could serve as this state store, 
but that would require users to set up ZooKeeper for the connector to work.
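
As a rough illustration of what such a coordinator state store might expose (all names are hypothetical; a real implementation might back this with ZooKeeper rather than an in-memory map), tasks atomically claim shards and release them once Kinesis closes them, so that the child shards produced by a split or merge can be picked up:

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch only: a coordinator state store mapping shard ids to
// the subtask index that currently owns them.
public class ShardCoordinator {

    private final Map<String, Integer> shardOwners = new ConcurrentHashMap<>();

    /** Atomically claims an unowned shard for a subtask; true on success. */
    public boolean tryClaim(String shardId, int subtaskIndex) {
        return shardOwners.putIfAbsent(shardId, subtaskIndex) == null;
    }

    /** Releases a shard once a subtask discovers Kinesis has closed it,
     *  so the child shards from a split/merge can be claimed instead. */
    public void releaseClosedShard(String shardId) {
        shardOwners.remove(shardId);
    }

    /** Current owner of a shard, if any. */
    public Optional<Integer> ownerOf(String shardId) {
        return Optional.ofNullable(shardOwners.get(shardId));
    }
}
```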

Since this feature involves extensive work, it is opened as a sub-task separate 
from the basic implementation 
(https://issues.apache.org/jira/browse/FLINK-3211).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (FLINK-3211) Add AWS Kinesis streaming connector

2016-01-08 Thread Tzu-Li (Gordon) Tai (JIRA)
Tzu-Li (Gordon) Tai created FLINK-3211:
--

 Summary: Add AWS Kinesis streaming connector
 Key: FLINK-3211
 URL: https://issues.apache.org/jira/browse/FLINK-3211
 Project: Flink
  Issue Type: New Feature
  Components: Streaming Connectors
Reporter: Tzu-Li (Gordon) Tai
 Fix For: 1.0.0


AWS Kinesis is a widely adopted message queue used by AWS users, much like a 
cloud-service version of Apache Kafka. Support for AWS Kinesis would be a great 
addition to Flink's streaming connectors to external systems and great outreach 
to the AWS community.

After a first look at the AWS KCL (Kinesis Client Library), KCL already 
supports reading a stream beginning from a specific offset (a "record sequence 
number" in Kinesis terminology). For external checkpointing, KCL is designed to 
use AWS DynamoDB to checkpoint application state, where the progress of each 
partition (a "shard" in Kinesis terminology) corresponds to a single row in the 
KCL-managed DynamoDB table.

So, implementing the AWS Kinesis connector will very much resemble the work 
done on the Kafka connector, with a few tweaks as follows (I'm mainly just 
rewording [~StephanEwen]'s original description [1]):

1. Determine the KCL shard-worker-to-Flink-source-task mapping. KCL already 
offers worker tasks per shard, so we will need to do the mapping much like [2].

2. Let the Flink connector also maintain a local copy of application state, 
accessed using KCL API, for the distributed snapshot checkpointing.

3. Restart the KCL from the last Flink-checkpointed record sequence number upon 
failure. However, KCL is originally designed to reference the external DynamoDB 
table when it restarts after a failure. We need to look further into how to 
handle this so that the Flink checkpoint and the external checkpoint in 
DynamoDB are properly synced.
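
Points 2 and 3 can be sketched as follows (a minimal illustration with hypothetical names, not the actual connector API): the source task tracks the last-read sequence number per shard locally, hands an immutable copy to Flink's distributed snapshot, and restores from that copy after a failure so reading can resume from the checkpointed positions.

```java
import java.util.HashMap;
import java.util.Map;

// Minimal sketch of locally maintained per-shard read positions that are
// snapshotted by Flink's checkpointing and restored after a failure.
public class ShardPositionState {

    // shard id -> last record sequence number emitted downstream
    private final Map<String, String> lastSequenceNumbers = new HashMap<>();

    /** Called after each record is emitted downstream. */
    public void advance(String shardId, String sequenceNumber) {
        lastSequenceNumbers.put(shardId, sequenceNumber);
    }

    /** Called during a distributed snapshot; returns an immutable copy. */
    public Map<String, String> snapshot() {
        return Map.copyOf(lastSequenceNumbers);
    }

    /** Restores positions from the last completed checkpoint on recovery. */
    public void restore(Map<String, String> checkpointed) {
        lastSequenceNumbers.clear();
        lastSequenceNumbers.putAll(checkpointed);
    }
}
```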

Most of the details regarding KCL's state checkpointing, sharding, shard 
workers, and failure recovery can be found here [3].

As for the Kinesis sink connector, it should be fairly straightforward and 
almost, if not completely, identical to the Kafka sink.

References:
[1] 
http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Kinesis-Connector-td2872.html
[2] http://data-artisans.com/kafka-flink-a-practical-how-to/
[3] http://docs.aws.amazon.com/kinesis/latest/dev/advanced-consumers.html




