Thank Steven, for starting this discussion. As I suggested in the previous thread, this can be a joint effort beneficial for various projects. I would also like to hear opinions from @Jingsong Li <jingsongl...@gmail.com>, who is maintaining Flink Table Store.
Best, Jark On Tue, 31 Jan 2023 at 08:46, Steven Wu <stevenz...@gmail.com> wrote: > Hi, > > We had a proposal to add a streaming shuffling stage in the Flink Iceberg > sink to to improve data clustering and tame the small files problem [1]. > > Here are a couple of common use cases. > * Event time partitioned table where we can get small files problem due to > skewed and long-tail distribution on event time hours. > * Improve data clustering on non-partitioned columns (e.g. device_id) where > table format can leverage min-max value range for effective file pruning. > > The main idea is to calculate (skewed) traffic distribution statistics and > shuffle records based on the computed statistics. This can achieve good > data clustering on the writer subtasks while largely avoiding small files > and maintaining relatively balanced traffic volume across writer subtasks. > We finished a PoC on event time partitioned tables and saw 20x reduction on > number of files. > > In another thread, there is a question if it makes sense to add this > clustering shuffle feature to Flink DataStream [2], as it can potentially > be useful for other sinks (like files, Apache Hudi, Delta Lake). Hence we > would like to gauge the community's initial interests first before writing > up a large FLIP. > > Thanks, > Steven > > [1] > > https://docs.google.com/document/d/13N8cMqPi-ZPSKbkXGOBMPOzbv2Fua59j8bIjjtxLWqo > [2] > > https://lists.apache.org/list?dev@flink.apache.org:lte=1M:%22[DISCUSS]%20FLIP-264%20Extract%20BaseCoordinatorContext%22 >