[
https://issues.apache.org/jira/browse/FLINK-37844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
ASF GitHub Bot updated FLINK-37844:
-----------------------------------
Labels: pull-request-available (was: )
> Push down projections for StreamingMultiJoinOperator
> ----------------------------------------------------
>
> Key: FLINK-37844
> URL: https://issues.apache.org/jira/browse/FLINK-37844
> Project: Flink
> Issue Type: Sub-task
> Reporter: Gustavo de Morais
> Assignee: Gustavo de Morais
> Priority: Major
> Labels: pull-request-available
>
> We're currently adding support for a StreamingMultiJoinOperator which is able
> to join N inputs. There are multiple minor optimizations we might be able to
> do that weren't so easy to do with multiple chained binary joins. One of them
> is materializing into state only attributes that are either joined in any of
> the N - 1 join conditions or are projected in the final output. We'd have to
> do the following:
>
> * We already have the information of used fields for each input in
> joinAttributeMap and can either pass that to the operator or add a new method
> to the join extractor.
> * The MultiJoin will contain the list of fields to be projected. We might
> have to adapt and expose that as a map per inputid when creating the
> FlinkMultiJoin.
> * When adding a record to state, we remove attributes that will not be used
> in join conditions or projected.
> * If we use null for these attributes, we don't have to adapt the logic. If
> we recreate rows with a smaller arity, multiple places have to be adjusted so
> that all our index-based logic is updated and correct.
>
> Obs: this was a even more significant problem for binary joins, since we
> materialized all attributes for all intermediate results. However, it's also
> relevant here. I plan to measure impacts for each of the optimizations before
> adding them [based on a
> benchmark|https://github.com/apache/flink-benchmarks?tab=readme-ov-file#general-remarks],
> and we'll first merge the operator. However, I'll be documenting the
> optimizations with tickets so we track them here. This ticket arose from a
> discussion with [~roman]
> [here.|https://github.com/apache/flink/pull/26313#discussion_r2105917437]
--
This message was sent by Atlassian Jira
(v8.20.10#820010)