Hi Wencong,
I have two small questions.
1. Will all records be emitted from `Upstream Operator` to `Local
Sample`? If not, what is the sampling rule?
2. From the PIP, I infer that records in `Assign Range Index` must
wait for the broadcast result from `Global Sample`. How long do they
wait? Until all records have been consumed by `Local Sample`, or not?

Best,
wangwj

On Mon, Apr 22, 2024 at 6:20 PM Jingsong Li <[email protected]> wrote:
>
> +1 for your proposal.
>
> You can add to the description.
>
> Best,
> Jingsong
>
> On Mon, Apr 22, 2024 at 6:15 PM Wencong Liu <[email protected]> wrote:
> >
> > Hi Jingsong,
> >
> > This topic requires discussion, so it wasn't addressed directly in the PIP.
> >
> > I believe the choice of sorting algorithm depends on the number of fields
> > the user specifies for comparison. When only one comparison field is
> > specified, it is best to compare the values directly by their basic data
> > type, which gives the most accurate ordering. For multiple comparison
> > fields, both the Z-order curve and the Hilbert curve are suitable. With
> > either curve, the data retains a certain level of order on every
> > comparison field. Generally, the Z-order curve is cheaper to compute than
> > the Hilbert curve. However, in high-dimensional scenarios, the Hilbert
> > curve has an advantage in maintaining data clustering.
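
A minimal sketch of the Z-order idea mentioned above, for two unsigned 32-bit keys: the bits of both fields are interleaved into one 64-bit value, so sorting by that value keeps some order on each field. The class and method names are hypothetical and this is not Paimon's implementation; signed values would first need their sign bit flipped, which the sketch leaves out.

```java
/** Illustrative only: interleaves two 32-bit keys into one Z-order (Morton) value. */
final class ZOrderSketch {
    private ZOrderSketch() {}

    /** Spreads the low 32 bits of v onto the even bit positions of a long. */
    static long spread(long v) {
        v &= 0xFFFFFFFFL;                            // treat the input as unsigned 32-bit
        v = (v | (v << 16)) & 0x0000FFFF0000FFFFL;
        v = (v | (v << 8))  & 0x00FF00FF00FF00FFL;
        v = (v | (v << 4))  & 0x0F0F0F0F0F0F0F0FL;
        v = (v | (v << 2))  & 0x3333333333333333L;
        v = (v | (v << 1))  & 0x5555555555555555L;
        return v;
    }

    /** Bits of x land on even positions, bits of y on odd positions. */
    static long interleave(int x, int y) {
        return spread(x) | (spread(y) << 1);
    }
}
```

For example, `ZOrderSketch.interleave(3, 5)` yields 39 (binary 100111). Because the result packs unsigned bits, it should be compared with `Long.compareUnsigned` when the high bits of the keys may be set.
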
> >
> >
> > Therefore, I propose an automatic selection based on the number of
> > comparison columns:
> >
> > 1 column: Basic type comparison algorithm.
> > 2 to 4 columns: Z-order curve algorithm.
> > 5 or more columns: Hilbert curve algorithm.
> >
> > The threshold of 5 columns is based on Ververica's practice with Paimon
> > Append Scalable tables, which was also discussed offline with Junhao Ye.
> > In addition to automatic configuration, users can fine-tune for specific
> > scenarios by explicitly specifying the desired comparison strategy.
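
The automatic selection described above could be sketched roughly as follows. The enum, class, and method names are hypothetical and only illustrate the thresholds from this thread (1 column, fewer than 5 columns, 5 or more); they are not taken from the Paimon code base.

```java
import java.util.List;

/** Illustrative only: picks a comparison strategy from the number of clustering columns. */
final class ClusteringStrategySketch {
    private ClusteringStrategySketch() {}

    /** Hypothetical strategy names for the three cases discussed in the thread. */
    enum Strategy { ORDER, ZORDER, HILBERT }

    /** Threshold proposed in the thread: 5 or more columns use the Hilbert curve. */
    static final int HILBERT_THRESHOLD = 5;

    static Strategy choose(List<String> clusteringColumns) {
        if (clusteringColumns == null || clusteringColumns.isEmpty()) {
            throw new IllegalArgumentException("at least one clustering column is required");
        }
        int n = clusteringColumns.size();
        if (n == 1) {
            return Strategy.ORDER;      // single column: direct basic-type comparison
        }
        if (n < HILBERT_THRESHOLD) {
            return Strategy.ZORDER;     // 2 to 4 columns: cheaper Z-order curve
        }
        return Strategy.HILBERT;        // 5 or more columns: better clustering
    }
}
```

An explicitly configured strategy would simply bypass `choose`, which matches the fine-tuning option mentioned above.
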
> >
> >
> > WDYT?
> >
> >
> >
> > Best,
> >
> > Wencong
> >
> > At 2024-04-22 15:08:09, "Jingsong Li" <[email protected]> wrote:
> > >Hi Wencong,
> > >
> > >Mostly looks good to me.
> > >
> > >"it will automatically determine the algorithm based on the number of
> > >columns in 'sink.clustering.by-columns'. "
> > >
> > >Please describe this clearly in the `Description`.
> > >
> > >Best,
> > >Jingsong
> > >
> > >On Mon, Apr 22, 2024 at 2:36 PM Wencong Liu <[email protected]> wrote:
> > >>
> > >> Hi devs,
> > >>
> > >> I'm proposing a new feature that introduces range partitioning and sorting
> > >> in append scalable table writing for Flink. The goal is to optimize query
> > >> performance by reducing data scans on large datasets.
> > >>
> > >> The proposal includes:
> > >>
> > >> 1. Configurable range partitioning and sorting during data writing, which
> > >> allows for a more efficient data distribution strategy.
> > >>
> > >> 2. New configurations that let users specify columns for comparison, choose
> > >> a comparison algorithm for range partitioning, and further sort each
> > >> partition if required.
> > >>
> > >> 3. A detailed explanation of how the processing steps are divided when range
> > >> partitioning is enabled, and of when the sorting phase is included.
> > >>
> > >> Looking forward to discussing this in the upcoming PIP [1].
> > >>
> > >> Best regards,
> > >> Wencong Liu
> > >>
> > >> [1] 
> > >> https://cwiki.apache.org/confluence/display/PAIMON/PIP-21%3A+Introduce+Range+Partition+And+Sort+in+Append+Scalable+Table+Batch+Writing+for+Flink
