+1 for this FLIP. I think this is similar to SQL hints, where users can provide optional information to help the engine execute the workload more efficiently. Having a unified mechanism for such kind of hints should improve usability compared to introducing tons of configuration knobs.
Best, Xintong On Wed, Jun 26, 2024 at 8:09 PM Wencong Liu <liuwencle...@163.com> wrote: > Hi devs, > > > I'm proposing a new FLIP[1] to introduce the ProcessFunction Attribute in > the > DataStream API V2. The goal is to optimize job execution by leveraging > additional information about users' ProcessFunction logic. The proposal > includes > the following scenarios where the ProcessFunction Attribute can > significantly > enhance optimization: > > > Scenario 1: If the framework recognizes that the ProcessFunction outputs > data > only after all input is received, the downstream operators can be > scheduled until > the ProcessFunction is finished, which effectively reduces resource > consumption. > Ignoring this information could lead to premature scheduling of downstream > operators with no data to process. This scenario is addressed and > optimized by FLIP-331[2]. > > > Scenario 2: For stream processing, where users are only interested in the > latest > result per key at the current time, the framework can optimize by > adjusting the > frequency of ProcessFunction outputs. This reduces shuffle data volume and > downstream operator workload. If this optimization is ignored, each new > input > would trigger a new output. This scenario is addressed and > optimized by FLIP-365[3]. > > > Scenario 3: If a user's ProcessFunction neither caches inputs nor outputs, > recognizing this can enable object reuse for this data within the > OperatorChain, > enhancing performance. Without this optimization, data would be copied > before > being passed to the next operator. This scenario is addressed and > optimized by FLIP-329[4]. > > > To unify the mechanism for utilizing additional information and optimizing > jobs, > we propose introducing the ProcessFunction Attribute represented by > Java annotations, which allow users to provide relevant information about > their > ProcessFunctions. The framework can then use this to optimize job > execution. > Importantly, regular job execution remains unaffected whether users use > this > attribute or not. > > > Looking forward to discussing this in the upcoming FLIP. > > > Best regards, > Wencong Liu > > > [1] > https://cwiki.apache.org/confluence/display/FLINK/FLIP-466%3A+Introduce+ProcessFunction+Attribute+in+DataStream+API+V2 > [2] > https://cwiki.apache.org/confluence/display/FLINK/FLIP-331%3A+Support+EndOfStreamTrigger+and+isOutputOnlyAfterEndOfStream+operator+attribute+to+optimize+task+deployment > [3] > https://cwiki.apache.org/confluence/display/FLINK/FLIP-365%3A+Introduce+flush+interval+to+adjust+the+interval+of+emitting+results+with+idempotent+semantics > [4] > https://cwiki.apache.org/confluence/display/FLINK/FLIP-329%3A+Add+operator+attribute+to+specify+support+for+object-reuse