Could you suggest how to dynamically partition data with Flink streaming?

We've looked at RollingSink, which takes care of writing batches to S3, but
it doesn't allow defining the partition dynamically based on the tuple's
fields.

Our data comes from Kafka and essentially carries the Kafka topic and a
date, among other fields.

We'd like to consume all topics (and automatically subscribe to new ones as
they appear) and write to S3 partitioned by topic and date, for example:

s3://bucket/path/topic=topic2/date=20160522/
s3://bucket/path/topic=topic2/date=20160523/
s3://bucket/path/topic=topic1/date=20160522/
s3://bucket/path/topic=topic1/date=20160523/
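
For clarity, the per-record path derivation we have in mind is roughly the
following (just a sketch; the class and method names are ours, nothing that
exists in Flink):

import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

// Sketch only: build the partition path from a record's topic and event time.
public class PartitionPaths {

    /** Returns e.g. "s3://bucket/path/topic=topic1/date=20160522". */
    public static String partitionPath(String basePath, String topic, long eventTimeMillis) {
        SimpleDateFormat fmt = new SimpleDateFormat("yyyyMMdd");
        fmt.setTimeZone(TimeZone.getTimeZone("UTC"));
        return basePath + "/topic=" + topic + "/date=" + fmt.format(new Date(eventTimeMillis));
    }
}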

There are two problems with RollingSink as it is now:
- It only allows partitioning by date.
- It flushes the batch every time the path changes. In our case the stream
can contain a random mix of topics, which means RollingSink can't respect
the maximum batch size and ends up flushing files on pretty much every
tuple.

One idea we've had is to implement a sink that internally creates and
manages multiple RollingSink-style writers, one per partition, as sketched
below. But it would be great to first hear any suggestions you might have.
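
Very roughly, something along these lines (only a sketch: it ignores
checkpointing and exactly-once guarantees, and leaves out the actual
rolling/S3 writing that RollingSink handles; all the names are placeholders
we made up):

import java.io.Serializable;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;

// The partitioning function we'd like to pass in: tuple -> partition path.
interface PartitionPathFunction<T> extends Serializable {
    String getPartitionPath(T element);
}

// Rough sketch of a sink keeping one batch per partition path, so a stream
// mixing topics doesn't force a flush whenever the path changes between tuples.
public class PartitionedSink<T> extends RichSinkFunction<T> {

    private final PartitionPathFunction<T> partitioner;
    private final int batchSize;
    private transient Map<String, List<T>> batches;

    public PartitionedSink(PartitionPathFunction<T> partitioner, int batchSize) {
        this.partitioner = partitioner;
        this.batchSize = batchSize;
    }

    @Override
    public void open(Configuration parameters) {
        batches = new HashMap<>();
    }

    @Override
    public void invoke(T value) throws Exception {
        String path = partitioner.getPartitionPath(value);
        List<T> batch = batches.get(path);
        if (batch == null) {
            batch = new ArrayList<>();
            batches.put(path, batch);
        }
        batch.add(value);
        if (batch.size() >= batchSize) {
            writeBatch(path, batch);
            batch.clear();
        }
    }

    @Override
    public void close() throws Exception {
        // Flush whatever is left in each partition's batch.
        for (Map.Entry<String, List<T>> e : batches.entrySet()) {
            writeBatch(e.getKey(), e.getValue());
        }
    }

    private void writeBatch(String path, List<T> batch) {
        // Here we'd hand the batch to a per-partition RollingSink-like writer
        // (the part we haven't figured out yet); omitted in this sketch.
    }
}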

If we have to extend RollingSink, it would be nice to make it take a
partitioning function as a parameter. The function would be called for each
tuple to create the output path.
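
Usage of such a sink would then look roughly like this (KafkaRecord is just
a placeholder for whatever record type our Kafka source produces):

import org.apache.flink.streaming.api.datastream.DataStream;

// Placeholder record type standing in for whatever our Kafka source emits.
class KafkaRecord {
    public String topic;
    public long eventTimeMillis;
    public String payload;
}

public class Wiring {
    // Hypothetical wiring of the sketch above.
    public static void attachSink(DataStream<KafkaRecord> records) {
        records.addSink(new PartitionedSink<KafkaRecord>(
                new PartitionPathFunction<KafkaRecord>() {
                    @Override
                    public String getPartitionPath(KafkaRecord r) {
                        return PartitionPaths.partitionPath(
                                "s3://bucket/path", r.topic, r.eventTimeMillis);
                    }
                },
                10000)); // max records per batch, per partition
    }
}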


