Re: [DISCUSS] streaming shuffle to improve data clustering and tame small files problem

Jark Wu Mon, 30 Jan 2023 18:16:23 -0800

Thank Steven, for starting this discussion.

As I suggested in the previous thread, this can be a joint effort
beneficial for various projects.
I would also like to hear opinions from @Jingsong Li
<jingsongl...@gmail.com>, who is maintaining Flink Table Store.


Best,
Jark

On Tue, 31 Jan 2023 at 08:46, Steven Wu <stevenz...@gmail.com> wrote:

> Hi,
>
> We had a proposal to add a streaming shuffling stage in the Flink Iceberg
> sink to to improve data clustering and tame the small files problem [1].
>
> Here are a couple of common use cases.
> * Event time partitioned table where we can get small files problem due to
> skewed and long-tail distribution on event time hours.
> * Improve data clustering on non-partitioned columns (e.g. device_id) where
> table format can leverage min-max value range for effective file pruning.
>
> The main idea is to calculate (skewed) traffic distribution statistics and
> shuffle records based on the computed statistics. This can achieve good
> data clustering on the writer subtasks while largely avoiding small files
> and maintaining relatively balanced traffic volume across writer subtasks.
> We finished a PoC on event time partitioned tables and saw 20x reduction on
> number of files.
>
> In another thread, there is a question if it makes sense to add this
> clustering shuffle feature to Flink DataStream [2], as it can potentially
> be useful for other sinks (like files, Apache Hudi, Delta Lake). Hence we
> would like to gauge the community's initial interests first before writing
> up a large FLIP.
>
> Thanks,
> Steven
>
> [1]
>
> https://docs.google.com/document/d/13N8cMqPi-ZPSKbkXGOBMPOzbv2Fua59j8bIjjtxLWqo
> [2]
>
> https://lists.apache.org/list?dev@flink.apache.org:lte=1M:%22[DISCUSS]%20FLIP-264%20Extract%20BaseCoordinatorContext%22
>

Re: [DISCUSS] streaming shuffle to improve data clustering and tame small files problem

Reply via email to