Re: [DISCUSS] FLIP-459: Support Flink hybrid shuffle integration with Apache Celeborn

2024-06-17 Thread Venkatakrishnan Sowrirajan
> > >> > > > > support for external storage.
> > >> > > > >
> > >> > > > > Big +1 for this one!
> > >> > > > >
> > >> > > > > Best regards,
> > >> > > > >
> > >> > > > > Weijie
> > >> > > > >
> > >> > > > >
> > >> > > > > rexxiong wrote on Wed, Jun 5, 2024 00:08:
> > >> > > > >
> > >> > > > > > Thanks Yuxin for the proposal. +1. As a member of the Apache
> > >> > > > > > Celeborn community, I am very excited about the integration
> > >> > > > > > of Flink's Hybrid Shuffle with Apache Celeborn. The whole
> > >> > > > > > design of CIP-6 looks good to me. I am looking forward to
> > >> > > > > > this integration.
> > >> > > > > >
> > >> > > > > > Thanks,
> > >> > > > > > Jiashu Xiong
> > >> > > > > >
> > >> > > > > > Ethan Feng wrote on Tue, Jun 4, 2024 16:47:
> > >> > > > > >
> > >> > > > > > > +1 for this proposal.
> > >> > > > > > >
> > >> > > > > > > After internally reviewing the prototype of CIP-6, this
> > >> > > > > > > would improve performance and stability for Flink users
> > >> > > > > > > using Celeborn.
> > >> > > > > > >
> > >> > > > > > > I look forward to seeing this feature released to the
> > >> > > > > > > community.
> > >> > > > > > >
> > >> > > > > > > As I come from the Celeborn community, I hope more users
> > >> > > > > > > will try Celeborn for their Flink batch jobs.
> > >> > > > > > >
> > >> > > > > > > Thanks,
> > >> > > > > > > Ethan Feng
> > >> > > > > > >
> > >> > > > > > > Yuxin Tan wrote on Tue, Jun 4, 2024 16:34:
> > >> > > > > > > >
> > >> > > > > > > > Hi, Venkatakrishnan,
> > >> > > > > > > >
> > >> > > > > > > > Thanks for joining the discussion. We appreciate your
> > >> > > > > > > > interest in contributing to the work. Once the FLIP and
> > >> > > > > > > > CIP proposals have been approved, we will create some
> > >> > > > > > > > JIRA tickets in the Flink and Celeborn projects. Please
> > >> > > > > > > > feel free to take a look at the tickets and select any
> > >> > > > > > > > that resonate with your interests.
> > >> > > > > > > >
> > >> > > > > > > > Best,
> > >> > > > > > > > Yuxin
> > >> > > > > > > >
> > >> > > > > > > >
> > >> > > > > > > > Venkatakrishnan Sowrirajan wrote on Fri, May 31, 2024 23:11:
> > >> > > > > > > >
> > >> > > > > > > > > Thanks for this FLIP. We are also interested in
> > >> > > > > > > > > learning/contributing to the hybrid shuffle
> > >> > > > > > > > > integration with Celeborn for batch executions.
> > >> > > > > > > > >
> > >> > > > > > > > > On Tue, May 28, 2024, 7:07 PM Yuxin Tan
> > >> > > > > > > > > <tanyuxinw...@gmail.com> wrote:
> > >> > > > > > > > >
> > >> > > > > > > > > > Hi, Xintong,
> > >> > > > > > > > > >
> > >> > > > > > > > > > > I think we can also publish the prototype codes
> > >> > > > > > > > > > so the community c

Re: [VOTE] FLIP-464: Merge "flink run" and "flink run-application"

2024-06-16 Thread Venkatakrishnan Sowrirajan
+1. Thanks for driving this proposal, Ferenc!

Regards
Venkata krishnan


On Thu, Jun 13, 2024 at 10:54 AM Jeyhun Karimov 
wrote:

> Thanks for driving this.
> +1 (non-binding)
>
> Regards,
> Jeyhun
>
> On Thu, Jun 13, 2024 at 5:23 PM Gabor Somogyi 
> wrote:
>
> > +1 (binding)
> >
> > G
> >
> >
> > On Wed, Jun 12, 2024 at 5:23 PM Ferenc Csaky
> > wrote:
> >
> > > Hello devs,
> > >
> > > I would like to start a vote about FLIP-464 [1]. The FLIP proposes to
> > > merge the "flink run-application" functionality back into "flink run",
> > > so the latter will be capable of deploying jobs in all deployment
> > > modes. More details are in the FLIP. Discussion thread: [2].
> > >
> > > The vote will be open for at least 72 hours (until 2024 March 23 14:03
> > > UTC) unless there
> > > are any objections or insufficient votes.
> > >
> > > Thanks,
> > > Ferenc
> > >
> > > [1]
> > >
> >
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=311626179
> > > [2]
> https://lists.apache.org/thread/xh58xs0y58kqjmfvd4yor79rv6dlcg5g
> >
>


Re: [VOTE] FLIP-459: Support Flink hybrid shuffle integration with Apache Celeborn

2024-06-09 Thread Venkatakrishnan Sowrirajan
Thanks for adding this new support. +1 (non-binding)

On Sat, Jun 8, 2024, 3:26 PM Ahmed Hamdy  wrote:

> +1 (non-binding)
> Best Regards
> Ahmed Hamdy
>
>
> On Sat, 8 Jun 2024 at 22:26, Jeyhun Karimov  wrote:
>
> > Hi Yuxin,
> >
> > Thanks for driving this.
> > +1 (non-binding)
> >
> > Regards,
> > Jeyhun
> >
> > On Fri, Jun 7, 2024 at 6:05 PM Jim Hughes 
> > wrote:
> >
> > > HI all,
> > >
> > > +1 (non-binding)
> > >
> > > Cheers,
> > >
> > > Jim
> > >
> > > On Fri, Jun 7, 2024 at 4:03 AM Yuxin Tan 
> wrote:
> > >
> > > > Hi everyone,
> > > >
> > > > Thanks for all the feedback about the FLIP-459 Support Flink
> > > > hybrid shuffle integration with Apache Celeborn[1].
> > > > The discussion thread is here [2].
> > > >
> > > > I'd like to start a vote for it. The vote will be open for at least
> > > > 72 hours unless there is an objection or insufficient votes.
> > > >
> > > > [1]
> > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-459%3A+Support+Flink+hybrid+shuffle+integration+with+Apache+Celeborn
> > > > [2]
> https://lists.apache.org/thread/gy7sm7qrf7yrv1rl5f4vtk5fo463ts33
> > > >
> > > > Best,
> > > > Yuxin
> > > >
> > >
> >
>


Re: [DISCUSS] FLIP-459: Support Flink hybrid shuffle integration with Apache Celeborn

2024-05-31 Thread Venkatakrishnan Sowrirajan
Thanks for this FLIP. We are also interested in learning/contributing to
the hybrid shuffle integration with Celeborn for batch executions.
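For context on the mechanism under discussion: in recent Flink releases (1.18+), hybrid shuffle is enabled through the batch shuffle mode option, and an optional remote-storage tier can be configured for it. A minimal flink-conf.yaml sketch follows; the two option names are taken from the current Flink configuration reference, the HDFS path is a made-up example, and a Celeborn-backed remote tier is exactly what FLIP-459/CIP-6 propose (its configuration keys are not yet defined):

```yaml
# Enable hybrid shuffle for all exchanges of a batch job
# (existing Flink option; HYBRID_SELECTIVE is the other hybrid variant).
execution.batch-shuffle-mode: ALL_EXCHANGES_HYBRID_FULL

# Remote storage tier used when the memory and local-disk tiers are full
# (existing Flink option; the path below is a hypothetical example).
taskmanager.network.hybrid-shuffle.remote.path: hdfs://namenode:9000/flink/shuffle
```

Under the FLIP-459/CIP-6 proposal, Celeborn would serve as such a remote tier instead of a plain distributed file system.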

On Tue, May 28, 2024, 7:07 PM Yuxin Tan  wrote:

> Hi, Xintong,
>
> >  I think we can also publish the prototype codes so the
> community can better understand and help with it.
>
> OK, I agree on that point. I will prepare and publish the code
> soon.
>
> Rui,
>
> > Kindly reminder: the image of CIP-6[1] cannot be loaded.
>
> Thanks for the reminder. I've updated the images.
>
>
> Best,
> Yuxin
>
>
> Rui Fan <1996fan...@gmail.com> wrote on Wed, May 29, 2024 09:33:
>
> > Thanks Yuxin for driving this proposal!
> >
> > Kindly reminder: the image of CIP-6[1] cannot be loaded.
> >
> > [1]
> >
> >
> https://cwiki.apache.org/confluence/display/CELEBORN/CIP-6+Support+Flink+hybrid+shuffle+integration+with+Apache+Celeborn
> >
> > Best,
> > Rui
> >
> > On Wed, May 29, 2024 at 9:03 AM Xintong Song 
> > wrote:
> >
> > > +1 for this proposal.
> > >
> > > We have been prototyping this feature internally at Alibaba for a
> couple
> > of
> > > months. Yuxin, I think we can also publish the prototype codes so the
> > > community can better understand and help with it.
> > >
> > > Best,
> > >
> > > Xintong
> > >
> > >
> > >
> > > On Tue, May 28, 2024 at 8:34 PM Yuxin Tan 
> > wrote:
> > >
> > > > Hi all,
> > > >
> > > > I would like to start a discussion on FLIP-459: Support Flink hybrid
> > > > shuffle integration with Apache Celeborn [1]. Flink hybrid shuffle
> > > > supports transitions between memory, disk, and remote storage to
> > > > improve performance and job stability. Concurrently, Apache Celeborn
> > > > provides a stable, performant, and scalable remote shuffle service.
> > > > This integration proposal aims to harness the benefits of both hybrid
> > > > shuffle and Celeborn simultaneously.
> > > >
> > > > Note that this proposal has two parts.
> > > > 1. The Flink-side modifications are in FLIP-459[1].
> > > > 2. The Celeborn-side changes are in CIP-6[2].
> > > >
> > > > Looking forward to everyone's feedback and suggestions. Thank you!
> > > >
> > > > [1]
> > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-459%3A+Support+Flink+hybrid+shuffle+integration+with+Apache+Celeborn
> > > > [2]
> > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/CELEBORN/CIP-6+Support+Flink+hybrid+shuffle+integration+with+Apache+Celeborn
> > > >
> > > > Best,
> > > > Yuxin
> > > >
> > >
> >
>


Re: Question around Flink's AdaptiveBatchScheduler

2024-05-09 Thread Venkatakrishnan Sowrirajan
Xia,

Thanks for the reviews. Unfortunately, due to work commitments, I am a little
delayed in addressing your review comments. They should mostly be addressed
by the end of this week. Just a quick heads-up.

Junrui,

Thanks, that would be great.

On Mon, May 6, 2024, 12:45 AM Junrui Lee  wrote:

> Hi,
> Thanks for the reminder. I will review it soon during my free time.
>
> Venkatakrishnan Sowrirajan wrote on Sat, May 4, 2024 10:10:
>
> > Junrui and Xia
> >
> > Gentle ping for reviews.
> >
> > On Mon, Apr 29, 2024, 8:28 PM Venkatakrishnan Sowrirajan <
> vsowr...@asu.edu
> > >
> > wrote:
> >
> > > Hi Xia and Junrui,
> > >
> > > Filed
> https://github.com/apache/flink/pull/24736
> to address the above
> > > described issue. Please take a look whenever you can.
> > >
> > > Thanks
> > > Venkat
> > >
> > >
> > > On Thu, Apr 18, 2024 at 12:16 PM Venkatakrishnan Sowrirajan <
> > > vsowr...@asu.edu> wrote:
> > >
> > >> Filed
> https://issues.apache.org/jira/browse/FLINK-35165
> to address the
> > >> above described issue. Will share the PR here once it is ready for
> > review.
> > >>
> > >> Regards
> > >> Venkata krishnan
> > >>
> > >>
> > >> On Wed, Apr 17, 2024 at 5:32 AM Junrui Lee 
> wrote:
> > >>
> > >>> Thanks Venkata and Xia for providing further clarification. I think
> > your
> > >>> example illustrates the significance of this proposal very well.
> Please
> > >>> feel free to go ahead and address the concerns.
> > >>>
> > >>> Best,
> > >>> Junrui
> > >>>
> > >>> Venkatakrishnan Sowrirajan wrote on Tue, Apr 16, 2024 07:01:
> > >>>
> > >>> > Thanks for adding your thoughts to this discussion.
> > >>> >
> > >>> > If we all agree that the source vertex parallelism shouldn't be
> bound
> > >>> by
> > >>> > the downstream max parallelism
> > >>> > (jobmanager.adaptive-batch-scheduler.max-parallelism)
> > >>> > based on the rationale and the issues described above, I can take a
> > >>> stab at
> > >>> > addressing the issue.
> > >>> >
> > >>> > Let me file a ticket to track this issue. Otherwise, I'm looking
> > >>> forward to
> > >>> > hearing more thoughts from others as well, especially Lijie and
> > Junrui
> > >>> who
> > >>> > have more context on the AdaptiveBatchScheduler.
> > >>> >
> > >>> > Regards
> > >>> > Venkata krishnan
> > >>> >
> > >>> >
> > >>> > On Mon, Apr 15, 2024 at 12:54 AM Xia Sun 
> > wrote:
> > >>> >
> > >>> > > Hi Venkat,
> > >>> > > I agree that the parallelism of the source vertex should not be
> > >>> > > upper bounded by the job's global max parallelism. The case you
> > >>> > > mentioned ("High filter selectivity with huge amounts of data to
> > >>> > > read") excellently supports this
> > >>> > > viewpoint. (In fact, in the current implementation, if the source
> > >>> > > parallelism is pre-specified at job create stage, rather than
> > >>> relying on
> > >>> > > the dynamic parallelism inference of the AdaptiveBatchScheduler,
> > the
> > >>> > source
> > >>> > > vertex's parallelism can indeed exceed the job's global max
> > >>> parallelism.)
> > >>> > >
> > >>> > > As Lijie and Junrui pointed out, the key issue is "semantic
> > >>> consistency."
> > >>> > > Currently, if a vertex has not set maxParallelism, the
> > >>> > > AdaptiveBatchScheduler will use
> > >>> > > `execution.batch.adaptive.auto-parallelism.max-parallelism` as
> the
> > >>> > vertex's
> > >>> > > maxParallelism. Since the current implementation does not
> > distinguish

Re: Question around Flink's AdaptiveBatchScheduler

2024-05-03 Thread Venkatakrishnan Sowrirajan
Junrui and Xia

Gentle ping for reviews.

On Mon, Apr 29, 2024, 8:28 PM Venkatakrishnan Sowrirajan 
wrote:

> Hi Xia and Junrui,
>
> Filed https://github.com/apache/flink/pull/24736 to address the above
> described issue. Please take a look whenever you can.
>
> Thanks
> Venkat
>
>
> On Thu, Apr 18, 2024 at 12:16 PM Venkatakrishnan Sowrirajan <
> vsowr...@asu.edu> wrote:
>
>> Filed https://issues.apache.org/jira/browse/FLINK-35165 to address the
>> above described issue. Will share the PR here once it is ready for review.
>>
>> Regards
>> Venkata krishnan
>>
>>
>> On Wed, Apr 17, 2024 at 5:32 AM Junrui Lee  wrote:
>>
>>> Thanks Venkata and Xia for providing further clarification. I think your
>>> example illustrates the significance of this proposal very well. Please
>>> feel free to go ahead and address the concerns.
>>>
>>> Best,
>>> Junrui
>>>
>>> Venkatakrishnan Sowrirajan wrote on Tue, Apr 16, 2024 07:01:
>>>
>>> > Thanks for adding your thoughts to this discussion.
>>> >
>>> > If we all agree that the source vertex parallelism shouldn't be bound
>>> by
>>> > the downstream max parallelism
>>> > (jobmanager.adaptive-batch-scheduler.max-parallelism)
>>> > based on the rationale and the issues described above, I can take a
>>> stab at
>>> > addressing the issue.
>>> >
>>> > Let me file a ticket to track this issue. Otherwise, I'm looking
>>> forward to
>>> > hearing more thoughts from others as well, especially Lijie and Junrui
>>> who
>>> > have more context on the AdaptiveBatchScheduler.
>>> >
>>> > Regards
>>> > Venkata krishnan
>>> >
>>> >
>>> > On Mon, Apr 15, 2024 at 12:54 AM Xia Sun  wrote:
>>> >
>>> > > Hi Venkat,
>>> > > I agree that the parallelism of the source vertex should not be
>>> > > upper bounded by the job's global max parallelism. The case you
>>> > > mentioned ("High filter selectivity with huge amounts of data to
>>> > > read") excellently supports this
>>> > > viewpoint. (In fact, in the current implementation, if the source
>>> > > parallelism is pre-specified at job create stage, rather than
>>> relying on
>>> > > the dynamic parallelism inference of the AdaptiveBatchScheduler, the
>>> > source
>>> > > vertex's parallelism can indeed exceed the job's global max
>>> parallelism.)
>>> > >
>>> > > As Lijie and Junrui pointed out, the key issue is "semantic
>>> consistency."
>>> > > Currently, if a vertex has not set maxParallelism, the
>>> > > AdaptiveBatchScheduler will use
>>> > > `execution.batch.adaptive.auto-parallelism.max-parallelism` as the
>>> > vertex's
>>> > > maxParallelism. Since the current implementation does not distinguish
>>> > > between source vertices and downstream vertices, source vertices are
>>> also
>>> > > subject to this limitation.
>>> > >
>>> > > Therefore, I believe that if the issue of "semantic consistency" can
>>> be
>>> > > well explained in the code and configuration documentation, the
>>> > > AdaptiveBatchScheduler should support that the parallelism of source
>>> > > vertices can exceed the job's global max parallelism.
>>> > >
>>> > > Best,
>>> > > Xia
>>> > >
> > >>> > > Venkatakrishnan Sowrirajan wrote on Sun, Apr 14, 2024 10:31:
>>> > >
> > >>> > > > Let me state why I think
> > >>> > > > "jobmanager.adaptive-batch-scheduler.default-source-parallelism"
> > >>> > > > should not be bound by
> > >>> > > > "jobmanager.adaptive-batch-scheduler.max-parallelism".
>>> > > >
> > >>> > > >- Source vertex is unique and does not have any upstream
> > >>> > > >vertices
> > >>> > > >- Downstream vertices read shuffled data partitioned by key,
> > >>> > > >which is not the case for the Source vertex
> > >>> > > >- Limiting source parallelism by downstream vertices' max
> > >>> > > >parallelism is incorrect
>>> > > >
>>> > > > If 

Re: Question around Flink's AdaptiveBatchScheduler

2024-04-29 Thread Venkatakrishnan Sowrirajan
Hi Xia and Junrui,

Filed https://github.com/apache/flink/pull/24736 to address the above
described issue. Please take a look whenever you can.

Thanks
Venkat


On Thu, Apr 18, 2024 at 12:16 PM Venkatakrishnan Sowrirajan <
vsowr...@asu.edu> wrote:

> Filed https://issues.apache.org/jira/browse/FLINK-35165 to address the
> above described issue. Will share the PR here once it is ready for review.
>
> Regards
> Venkata krishnan
>
>
> On Wed, Apr 17, 2024 at 5:32 AM Junrui Lee  wrote:
>
>> Thanks Venkata and Xia for providing further clarification. I think your
>> example illustrates the significance of this proposal very well. Please
>> feel free to go ahead and address the concerns.
>>
>> Best,
>> Junrui
>>
>> Venkatakrishnan Sowrirajan wrote on Tue, Apr 16, 2024 07:01:
>>
>> > Thanks for adding your thoughts to this discussion.
>> >
>> > If we all agree that the source vertex parallelism shouldn't be bound by
>> > the downstream max parallelism
>> > (jobmanager.adaptive-batch-scheduler.max-parallelism)
>> > based on the rationale and the issues described above, I can take a
>> stab at
>> > addressing the issue.
>> >
>> > Let me file a ticket to track this issue. Otherwise, I'm looking
>> forward to
>> > hearing more thoughts from others as well, especially Lijie and Junrui
>> who
>> > have more context on the AdaptiveBatchScheduler.
>> >
>> > Regards
>> > Venkata krishnan
>> >
>> >
>> > On Mon, Apr 15, 2024 at 12:54 AM Xia Sun  wrote:
>> >
>> > > Hi Venkat,
>> > > I agree that the parallelism of the source vertex should not be upper
>> > > bounded by the job's global max parallelism. The case you mentioned
>> > > ("High filter selectivity with huge amounts of data to read")
>> > > excellently supports this
>> > > viewpoint. (In fact, in the current implementation, if the source
>> > > parallelism is pre-specified at job create stage, rather than relying
>> on
>> > > the dynamic parallelism inference of the AdaptiveBatchScheduler, the
>> > source
>> > > vertex's parallelism can indeed exceed the job's global max
>> parallelism.)
>> > >
>> > > As Lijie and Junrui pointed out, the key issue is "semantic
>> consistency."
>> > > Currently, if a vertex has not set maxParallelism, the
>> > > AdaptiveBatchScheduler will use
>> > > `execution.batch.adaptive.auto-parallelism.max-parallelism` as the
>> > vertex's
>> > > maxParallelism. Since the current implementation does not distinguish
>> > > between source vertices and downstream vertices, source vertices are
>> also
>> > > subject to this limitation.
>> > >
>> > > Therefore, I believe that if the issue of "semantic consistency" can
>> be
>> > > well explained in the code and configuration documentation, the
>> > > AdaptiveBatchScheduler should support that the parallelism of source
>> > > vertices can exceed the job's global max parallelism.
>> > >
>> > > Best,
>> > > Xia
>> > >
>> > > Venkatakrishnan Sowrirajan wrote on Sun, Apr 14, 2024 10:31:
>> > >
>> > > > Let me state why I think
>> > > > "jobmanager.adaptive-batch-scheduler.default-source-parallelism"
>> > > > should not be bound by
>> > > > "jobmanager.adaptive-batch-scheduler.max-parallelism".
>> > > >
>> > > >- Source vertex is unique and does not have any upstream vertices
>> > > >- Downstream vertices read shuffled data partitioned by key, which
>> > > >is not the case for the Source vertex
>> > > >- Limiting source parallelism by downstream vertices' max
>> > > >parallelism is incorrect
>> > > >
>> > > > If we say that, for "semantic consistency", the source vertex
>> > > > parallelism has to be bound by the overall job's max parallelism,
>> > > > it can lead to the following issues:
>> > > >
>> > > >- High filter selectivity with huge amounts of data to read:
>> > > >setting "jobmanager.adaptive-batch-scheduler.max-parallelism" high
>> > > >so that source parallelism can be set higher can lead to small bloc

Re: [DISCUSS] FLIP-445: Support dynamic parallelism inference for HiveSource

2024-04-24 Thread Venkatakrishnan Sowrirajan
Hi Xia,

+1 on introducing dynamic parallelism inference for HiveSource.

Orthogonal to this discussion: out of curiosity, how commonly is HiveSource
used in the industry these days, given the popularity of table
formats/sources like Iceberg, Hudi, and Delta Lake?

Thanks
Venkat
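The thread below debates how the deprecated boolean switch `table.exec.hive.infer-source-parallelism` interacts with the new `table.exec.hive.infer-source-parallelism.mode` option from FLIP-445. A minimal sketch of the priority rule as described in the discussion follows; this is illustrative code, not actual Flink internals — the option names come from the FLIP, while the enum and method here are hypothetical:

```java
// Illustrative sketch of the option-priority rule discussed in this thread:
// the deprecated boolean option remains the main switch gating inference,
// and the mode option chooses between static and dynamic inference.
enum InferMode { NONE, STATIC, DYNAMIC }

class HiveSourceParallelismOptions {
    /**
     * @param inferEnabled value of table.exec.hive.infer-source-parallelism
     *                     (deprecated, default true, acts as the main switch)
     * @param mode         value of table.exec.hive.infer-source-parallelism.mode
     *                     (default DYNAMIC per the FLIP discussion)
     */
    static InferMode effectiveMode(boolean inferEnabled, InferMode mode) {
        // If the main switch is off, no parallelism inference happens at all,
        // regardless of the configured mode.
        if (!inferEnabled) {
            return InferMode.NONE;
        }
        return mode;
    }
}
```

Per the discussion, a manually set source parallelism would take precedence over either inference mode.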

On Wed, Apr 24, 2024, 7:41 PM Xia Sun  wrote:

> Hi everyone,
>
> Thanks for all the feedback!
>
> If there are no more comments, I would like to start the vote thread,
> thanks again!
>
> Best,
> Xia
>
> Ahmed Hamdy wrote on Thu, Apr 18, 2024 21:31:
>
> > Hi Xia,
> > I have read through the FLIP and discussion and the new version of the
> FLIP
> > looks better.
> > +1 for the proposal.
> > Best Regards
> > Ahmed Hamdy
> >
> >
> > On Thu, 18 Apr 2024 at 12:21, Ron Liu  wrote:
> >
> > > Hi, Xia
> > >
> > > Thanks for updating, looks good to me.
> > >
> > > Best,
> > > Ron
> > >
> > > Xia Sun wrote on Thu, Apr 18, 2024 19:11:
> > >
> > > > Hi Ron,
> > > > Yes, presenting it in a table might be more intuitive. I have already
> > > added
> > > > the table in the "Public Interfaces | New Config Option" chapter of
> > FLIP.
> > > > PTAL~
> > > >
> > > > Ron Liu wrote on Thu, Apr 18, 2024 18:10:
> > > >
> > > > > Hi, Xia
> > > > >
> > > > > Thanks for your reply.
> > > > >
> > > > > > That means, in terms
> > > > > of priority, `table.exec.hive.infer-source-parallelism` >
> > > > > `table.exec.hive.infer-source-parallelism.mode`.
> > > > >
> > > > > I still have some confusion: if
> > > > > `table.exec.hive.infer-source-parallelism` takes priority over
> > > > > `table.exec.hive.infer-source-parallelism.mode`, and
> > > > > `table.exec.hive.infer-source-parallelism` currently defaults to
> > > > > true, does that mean static parallelism inference always applies?
> > > > > Or perhaps, after this FLIP, the default behavior of
> > > > > `table.exec.hive.infer-source-parallelism` changes to mean dynamic
> > > > > parallelism inference when enabled?
> > > > > I think you should list in the FLIP, in a table, the behaviors of
> > > > > these two coexisting options; only then can users know how dynamic
> > > > > and static parallelism inference work.
> > > > >
> > > > > Best,
> > > > > Ron
> > > > >
> > > > > Xia Sun wrote on Thu, Apr 18, 2024 16:33:
> > > > >
> > > > > > Hi Ron and Lijie,
> > > > > > Thanks for joining the discussion and sharing your suggestions.
> > > > > >
> > > > > > > the InferMode class should also be introduced in the Public
> > > > Interfaces
> > > > > > > section!
> > > > > >
> > > > > >
> > > > > > Thanks for the reminder, I have now added the InferMode class to
> > the
> > > > > Public
> > > > > > Interfaces section as well.
> > > > > >
> > > > > > > `table.exec.hive.infer-source-parallelism.max` is 1024, I
> checked
> > > > > through
> > > > > > > the code that the default value is 1000?
> > > > > >
> > > > > >
> > > > > > I have checked and the default value of
> > > > > > `table.exec.hive.infer-source-parallelism.max` is indeed 1000.
> This
> > > has
> > > > > > been corrected in the FLIP.
> > > > > >
> > > > > > > how are`table.exec.hive.infer-source-parallelism` and
> > > > > > > `table.exec.hive.infer-source-parallelism.mode` compatible?
> > > > > >
> > > > > >
> > > > > > This is indeed a critical point. The current plan is to deprecate
> > > > > > `table.exec.hive.infer-source-parallelism` but still utilize it
> as
> > > the
> > > > > main
> > > > > > switch for enabling automatic parallelism inference. That means,
> in
> > > > terms
> > > > > > of priority, `table.exec.hive.infer-source-parallelism` >
> > > > > > `table.exec.hive.infer-source-parallelism.mode`. In future
> > versions,
> > > if
> > > > > > `table.exec.hive.infer-source-parallelism` is removed, this logic
> > > will
> > > > > also
> > > > > > need to be revised, leaving only
> > > > > > `table.exec.hive.infer-source-parallelism.mode` as the basis for
> > > > deciding
> > > > > > whether to enable parallelism inference. I have also added this
> > > > > description
> > > > > > to the FLIP.
> > > > > >
> > > > > >
> > > > > > > In FLIP-367 it is supported to be able to set the Source's
> > > > parallelism
> > > > > > > individually, if in the future HiveSource also supports this
> > > feature,
> > > > > > > however, the default value of
> > > > > > > `table.exec.hive.infer-source-parallelism.mode` is
> > > > `InferMode.DYNAMIC`,
> > > > > > at
> > > > > > > this point will the parallelism be dynamically derived or will
> > the
> > > > > > manually
> > > > > > > set parallelism take effect, and who has the higher priority?
> > > > > >
> > > > > >
> > > > > > From my understanding, 'manually set parallelism' has the higher
> > > > > priority,
> > > > > > just like one of the preconditions for the effectiveness of
> dynamic
> > > > > > parallelism inference in the AdaptiveBatchScheduler is that the
> > > > vertex's
> > > > > > parallelism isn't set. I believe whether it's static inference or
> > > > dynamic
> > > > > > inference, the manually set parallelism by the user 

Re: Question around Flink's AdaptiveBatchScheduler

2024-04-18 Thread Venkatakrishnan Sowrirajan
Filed https://issues.apache.org/jira/browse/FLINK-35165 to address the
above described issue. Will share the PR here once it is ready for review.

Regards
Venkata krishnan


On Wed, Apr 17, 2024 at 5:32 AM Junrui Lee  wrote:

> Thanks Venkata and Xia for providing further clarification. I think your
> example illustrates the significance of this proposal very well. Please
> feel free to go ahead and address the concerns.
>
> Best,
> Junrui
>
> Venkatakrishnan Sowrirajan wrote on Tue, Apr 16, 2024 07:01:
>
> > Thanks for adding your thoughts to this discussion.
> >
> > If we all agree that the source vertex parallelism shouldn't be bound by
> > the downstream max parallelism
> > (jobmanager.adaptive-batch-scheduler.max-parallelism)
> > based on the rationale and the issues described above, I can take a stab
> at
> > addressing the issue.
> >
> > Let me file a ticket to track this issue. Otherwise, I'm looking forward
> to
> > hearing more thoughts from others as well, especially Lijie and Junrui
> who
> > have more context on the AdaptiveBatchScheduler.
> >
> > Regards
> > Venkata krishnan
> >
> >
> > On Mon, Apr 15, 2024 at 12:54 AM Xia Sun  wrote:
> >
> > > Hi Venkat,
> > > I agree that the parallelism of the source vertex should not be upper
> > > bounded by the job's global max parallelism. The case you mentioned
> > > ("High filter selectivity with huge amounts of data to read")
> > > excellently supports this
> > > viewpoint. (In fact, in the current implementation, if the source
> > > parallelism is pre-specified at job create stage, rather than relying
> on
> > > the dynamic parallelism inference of the AdaptiveBatchScheduler, the
> > source
> > > vertex's parallelism can indeed exceed the job's global max
> parallelism.)
> > >
> > > As Lijie and Junrui pointed out, the key issue is "semantic
> consistency."
> > > Currently, if a vertex has not set maxParallelism, the
> > > AdaptiveBatchScheduler will use
> > > `execution.batch.adaptive.auto-parallelism.max-parallelism` as the
> > vertex's
> > > maxParallelism. Since the current implementation does not distinguish
> > > between source vertices and downstream vertices, source vertices are
> also
> > > subject to this limitation.
> > >
> > > Therefore, I believe that if the issue of "semantic consistency" can be
> > > well explained in the code and configuration documentation, the
> > > AdaptiveBatchScheduler should support that the parallelism of source
> > > vertices can exceed the job's global max parallelism.
> > >
> > > Best,
> > > Xia
> > >
> > > Venkatakrishnan Sowrirajan wrote on Sun, Apr 14, 2024 10:31:
> > >
> > > > Let me state why I think
> > > > "jobmanager.adaptive-batch-scheduler.default-source-parallelism"
> > > > should not be bound by
> > > > "jobmanager.adaptive-batch-scheduler.max-parallelism".
> > > >
> > > >- Source vertex is unique and does not have any upstream vertices
> > > >- Downstream vertices read shuffled data partitioned by key, which
> > > >is not the case for the Source vertex
> > > >- Limiting source parallelism by downstream vertices' max
> > > >parallelism is incorrect
> > > >
> > > > If we say that, for "semantic consistency", the source vertex
> > > > parallelism has to be bound by the overall job's max parallelism, it
> > > > can lead to the following issues:
> > > >
> > > >- High filter selectivity with huge amounts of data to read:
> > > >setting "jobmanager.adaptive-batch-scheduler.max-parallelism" high
> > > >so that source parallelism can be set higher can lead to small
> > > >blocks and sub-optimal performance.
> > > >- Setting "jobmanager.adaptive-batch-scheduler.max-parallelism" high
> > > >requires careful tuning of network buffer configurations, which is
> > > >unnecessary when not otherwise required, just so that the source
> > > >parallelism can be set high.
> > > >
> > > > Regards
> > > > Venkata krishnan
> > > >
> > > > On Thu, Apr 11, 2024 at 9:30 PM Junrui Lee 
> > wrote:
> > > >
> > > > > Hello Venkata krishnan,
> &

Re: Question around Flink's AdaptiveBatchScheduler

2024-04-15 Thread Venkatakrishnan Sowrirajan
Thanks for adding your thoughts to this discussion.

If we all agree that the source vertex parallelism shouldn't be bound by
the downstream max parallelism
(jobmanager.adaptive-batch-scheduler.max-parallelism)
based on the rationale and the issues described above, I can take a stab at
addressing the issue.

Let me file a ticket to track this issue. Otherwise, I'm looking forward to
hearing more thoughts from others as well, especially Lijie and Junrui who
have more context on the AdaptiveBatchScheduler.

Regards
Venkata krishnan
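The rule being argued for above can be sketched as a tiny decision function: the global max-parallelism cap would continue to apply to downstream vertices but no longer to source vertices. This is an illustration of the proposal only, not Flink's actual scheduler code; the method and class names are hypothetical:

```java
// Illustrative sketch of the proposed capping rule for the
// AdaptiveBatchScheduler: source vertices read unpartitioned input rather
// than keyed shuffle data, so the job-wide max would not bound them.
class ParallelismCap {
    static int cappedParallelism(int desired, boolean isSourceVertex, int globalMaxParallelism) {
        if (isSourceVertex) {
            // Proposed: source parallelism comes from
            // default-source-parallelism / dynamic inference, uncapped by
            // the job-wide max.
            return desired;
        }
        // Downstream vertices stay bounded, preserving the "semantic
        // consistency" of keyed shuffles discussed in the thread.
        return Math.min(desired, globalMaxParallelism);
    }
}
```

For example, with a global max of 128, a source could still run at 500 while downstream vertices would be capped at 128.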


On Mon, Apr 15, 2024 at 12:54 AM Xia Sun  wrote:

> Hi Venkat,
> I agree that the parallelism of the source vertex should not be upper bounded
> by the job's global max parallelism. The case you mentioned ("High filter
> selectivity with huge amounts of data to read") excellently supports this
> viewpoint. (In fact, in the current implementation, if the source
> parallelism is pre-specified at job create stage, rather than relying on
> the dynamic parallelism inference of the AdaptiveBatchScheduler, the source
> vertex's parallelism can indeed exceed the job's global max parallelism.)
>
> As Lijie and Junrui pointed out, the key issue is "semantic consistency."
> Currently, if a vertex has not set maxParallelism, the
> AdaptiveBatchScheduler will use
> `execution.batch.adaptive.auto-parallelism.max-parallelism` as the vertex's
> maxParallelism. Since the current implementation does not distinguish
> between source vertices and downstream vertices, source vertices are also
> subject to this limitation.
>
> Therefore, I believe that if the issue of "semantic consistency" can be
> well explained in the code and configuration documentation, the
> AdaptiveBatchScheduler should support that the parallelism of source
> vertices can exceed the job's global max parallelism.
>
> Best,
> Xia
>
> Venkatakrishnan Sowrirajan wrote on Sun, Apr 14, 2024 10:31:
>
> > Let me state why I think
> > "jobmanager.adaptive-batch-scheduler.default-source-parallelism" should
> > not be bound by "jobmanager.adaptive-batch-scheduler.max-parallelism".
> >
> >- Source vertex is unique and does not have any upstream vertices
> >- Downstream vertices read shuffled data partitioned by key, which is
> >not the case for the Source vertex
> >- Limiting source parallelism by downstream vertices' max parallelism
> is
> >incorrect
> >
> > If we say for ""semantic consistency" the source vertex parallelism has
> to
> > be bound by the overall job's max parallelism, it can lead to following
> > issues:
> >
> >- High filter selectivity with huge amounts of data to read - setting
> >high "*jobmanager.adaptive-batch-scheduler.max-parallelism*" so that
> >source parallelism can be set higher can lead to small blocks and
> >sub-optimal performance.
> >- Setting high "*jobmanager.adaptive-batch-scheduler.max-parallelism*"
> >requires careful tuning of network buffer configurations which is
> >unnecessary in cases where it is not required just so that the source
> >parallelism can be set high.
> >
> > Regards
> > Venkata krishnan
> >
> > On Thu, Apr 11, 2024 at 9:30 PM Junrui Lee  wrote:
> >
> > > Hello Venkata krishnan,
> > >
> > > I think the term "semantic inconsistency" defined by
> > > jobmanager.adaptive-batch-scheduler.max-parallelism refers to
> > maintaining a
> > > uniform upper limit on parallelism across all vertices within a job. As
> > the
> > > source vertices are part of the global execution graph, they should
> also
> > > respect this rule to ensure consistent application of parallelism
> > > constraints.
> > >
> > > Best,
> > > Junrui
> > >
> > > Venkatakrishnan Sowrirajan  于2024年4月12日周五 02:10写道:
> > >
> > > > Gentle bump on this question. cc @Becket Qin 
> as
> > > > well.
> > > >
> > > > Regards
> > > > Venkata krishnan
> > > >
> > > >
> > > > On Tue, Mar 12, 2024 at 10:11 PM Venkatakrishnan Sowrirajan <
> > > > vsowr...@asu.edu> wrote:
> > > >
> > > > > Thanks for the response Lijie and Junrui. Sorry for the late reply.
> > Few
> > > > > follow up questions.
> > > > >
> > > > > > Source can actually ignore this limit
> > > > > because it has no upstream, but this will lead to semantic
> > > inconsistency.
> > > > >
> > > > > 

Re: [DISCUSS] FLIP-438: Make Flink's Hadoop and YARN configuration probing consistent

2024-04-15 Thread Venkatakrishnan Sowrirajan
Sorry for the late reply, Ferenc.

I understand the rationale behind the current implementation: the problem is
slightly different between YARN (configs are always prefixed with `yarn.`)
and Hadoop (not all Hadoop configs are guaranteed to be prefixed with
`hadoop.`).
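To make the asymmetry concrete, here is a minimal sketch of the two probing behaviors as I understand them. This is plain Python for illustration only, not Flink's actual code; the function names are made up:

```python
# Sketch of the asymmetric prefix handling described above (assumption:
# simplified from Flink's Hadoop/YARN config probing; not the actual code).
def probe_hadoop_key(flink_key: str):
    # Hadoop: the whole "flink.hadoop." prefix is stripped, so a Hadoop key
    # like "hadoop.tmp.dir" must be written as "flink.hadoop.hadoop.tmp.dir".
    prefix = "flink.hadoop."
    return flink_key[len(prefix):] if flink_key.startswith(prefix) else None

def probe_yarn_key(flink_key: str):
    # YARN: only "flink." is stripped, keeping the "yarn." part, so
    # "flink.yarn.application.classpath" maps to "yarn.application.classpath".
    if flink_key.startswith("flink.yarn."):
        return flink_key[len("flink."):]
    return None

print(probe_hadoop_key("flink.hadoop.hadoop.tmp.dir"))
print(probe_yarn_key("flink.yarn.application.classpath"))
```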

From the dev UX perspective it is confusing, and it only becomes evident if
you pay close attention to the docs. I understand your point on the added
complexity until Flink 3.0, but if we agree it should be made consistent, it
has to be done at some point, right?

Regards
Venkata krishnan


On Wed, Apr 3, 2024 at 4:51 AM Ferenc Csaky 
wrote:

> Hi Venkata,
>
> Thank you for opening the discussion about this!
>
> After taking a look at the YARN and Hadoop configurations, the
> reason why it was implemented this way is that, in case of YARN,
> every YARN-specific property is prefixed with "yarn.", so to get
> the final, YARN-side property it is enough to remove the "flink."
> prefix.
>
> In the case of Hadoop, there are properties that are not prefixed with
> "hadoop.", e.g. "dfs.replication", so to identify and get the
> Hadoop-side property it is necessary to duplicate the "hadoop" part
> in the properties.
>
> Taking this into consideration I would personally say -0 to this
> change. IMO the current behavior can be justified as giving
> slightly different solutions to slightly different problems, which
> are well documented. Handling both prefixes would complicate the
> parsing logic until the APIs can be removed, which as it looks at
> the moment would only be possible in Flink 3.0, which probably will
> not happen in the foreseeable future, so I do not see the benefit
> of the added complexity.
>
> Regarding the FLIP, in the "YARN configuration override example"
> part, I think you should present an example that works correctly
> at the moment: "flink.yarn.application.classpath" ->
> "yarn.application.classpath".
>
> Best,
> Ferenc
>
>
> On Friday, March 29th, 2024 at 23:45, Venkatakrishnan Sowrirajan <
> vsowr...@asu.edu> wrote:
>
> >
> >
> > Hi Flink devs,
> >
> > I would like to start a discussion on FLIP-XXX: Make Flink's Hadoop and
> > YARN configuration probing consistent
> >
> https://urldefense.com/v3/__https://docs.google.com/document/d/1I2jBFI0eVkofAVCAEeajNQRfOqKGJsRfZd54h79AIYc/edit?usp=sharing__;!!IKRxdwAv5BmarQ!d0XJO_mzLCJZNkrjJDMyRGP95zPLW8Cuym88l7CoAUG8aD_KRYJbll3K-q1Ypplyqe6-jcsWq3S8YJqrDMCpK4IhpT4cZPXy$
> .
> >
> > This stems from an earlier discussion thread here
> >
> https://urldefense.com/v3/__https://lists.apache.org/thread/l2fh5shbf59fjgbt1h73pmmsqj038ppv__;!!IKRxdwAv5BmarQ!d0XJO_mzLCJZNkrjJDMyRGP95zPLW8Cuym88l7CoAUG8aD_KRYJbll3K-q1Ypplyqe6-jcsWq3S8YJqrDMCpK4IhpW60A99X$
> .
> >
> >
> > This FLIP is proposing to make the configuration probing behavior between
> > Hadoop and YARN configuration to be consistent.
> >
> > Regards
> > Venkata krishnan
>


Re: Question around Flink's AdaptiveBatchScheduler

2024-04-13 Thread Venkatakrishnan Sowrirajan
Let me state why I think "*jobmanager.adaptive-batch-sche*
*duler.default-source-parallelism*" should not be bound by the "
*jobmanager.adaptive-batch-sche**duler.max-parallelism*".

   - Source vertex is unique and does not have any upstream vertices
   - Downstream vertices read shuffled data partitioned by key, which is
   not the case for the Source vertex
   - Limiting source parallelism by downstream vertices' max parallelism is
   incorrect

If we say for ""semantic consistency" the source vertex parallelism has to
be bound by the overall job's max parallelism, it can lead to following
issues:

   - High filter selectivity with huge amounts of data to read - setting
   high "*jobmanager.adaptive-batch-scheduler.max-parallelism*" so that
   source parallelism can be set higher can lead to small blocks and
   sub-optimal performance.
   - Setting high "*jobmanager.adaptive-batch-scheduler.max-parallelism*"
   requires careful tuning of network buffer configurations which is
   unnecessary in cases where it is not required just so that the source
   parallelism can be set high.
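The small-blocks concern in the first bullet can be illustrated with some back-of-the-envelope arithmetic. The numbers below are made up, and `avg_block_bytes` is an illustrative helper, not a Flink API; the premise is that each upstream task produces one subpartition per potential downstream task, so the number of shuffle blocks grows with the max parallelism:

```python
# Illustrative sketch only: raising max-parallelism just to unlock a higher
# source parallelism multiplies the number of shuffle blocks and shrinks
# each one, which hurts sequential-read performance.
def avg_block_bytes(shuffle_bytes: int, upstream_parallelism: int,
                    max_parallelism: int) -> float:
    # One subpartition (block) per upstream task per potential consumer.
    num_blocks = upstream_parallelism * max_parallelism
    return shuffle_bytes / num_blocks

shuffle_bytes = 1 * 1024**4  # 1 TiB of shuffle data, for illustration

print(avg_block_bytes(shuffle_bytes, 1000, 128))   # ~8 MiB per block
print(avg_block_bytes(shuffle_bytes, 1000, 4096))  # ~260 KiB per block
```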

Regards
Venkata krishnan

On Thu, Apr 11, 2024 at 9:30 PM Junrui Lee  wrote:

> Hello Venkata krishnan,
>
> I think the term "semantic inconsistency" defined by
> jobmanager.adaptive-batch-scheduler.max-parallelism refers to maintaining a
> uniform upper limit on parallelism across all vertices within a job. As the
> source vertices are part of the global execution graph, they should also
> respect this rule to ensure consistent application of parallelism
> constraints.
>
> Best,
> Junrui
>
> Venkatakrishnan Sowrirajan  于2024年4月12日周五 02:10写道:
>
> > Gentle bump on this question. cc @Becket Qin  as
> > well.
> >
> > Regards
> > Venkata krishnan
> >
> >
> > On Tue, Mar 12, 2024 at 10:11 PM Venkatakrishnan Sowrirajan <
> > vsowr...@asu.edu> wrote:
> >
> > > Thanks for the response Lijie and Junrui. Sorry for the late reply. Few
> > > follow up questions.
> > >
> > > > Source can actually ignore this limit
> > > because it has no upstream, but this will lead to semantic
> inconsistency.
> > >
> > > Lijie, can you please elaborate on the above comment further? What do
> you
> > > mean when you say it will lead to "semantic inconsistency"?
> > >
> > > > Secondly, we first need to limit the max parallelism of (downstream)
> > > vertex, and then we can decide how many subpartitions (upstream vertex)
> > > should produce. The limit should be effective, otherwise some
> downstream
> > > tasks will have no data to process.
> > >
> > > This makes sense in the context of any other vertices other than the
> > > source vertex. As you mentioned above ("Source can actually ignore this
> > > limit because it has no upstream"), therefore I feel "
> > > jobmanager.adaptive-batch-scheduler.default-source-parallelism" need
> not
> > > be upper bounded by
> > "jobmanager.adaptive-batch-scheduler.max-parallelism".
> > >
> > > Regards
> > > Venkata krishnan
> > >
> > >
> > > On Thu, Feb 29, 2024 at 2:11 AM Junrui Lee 
> wrote:
> > >
> > >> Hi Venkat,
> > >>
> > >> As Lijie mentioned,  in Flink, the parallelism is required to be less
> > than
> > >> or equal to the maximum parallelism. The config option
> > >> jobmanager.adaptive-batch-scheduler.max-parallelism and
> > >> jobmanager.adaptive-batch-scheduler.default-source-parallelism will be
> > set
> > >> as the source's parallelism and max-parallelism, respectively.
> > Therefore,
> > >> the check failed situation you encountered is in line with the
> > >> expectations.
> > >>
> > >> Best,
> > >> Junrui
> > >>
> > >> Lijie Wang  于2024年2月29日周四 17:35写道:
> > >>
> > >> > Hi Venkat,
> > >> >
> > >> > >> default-source-parallelism config should be independent from the
> > >> > max-parallelism
> > >> >
> > >> > Actually, it's not.
> > >> >
> > >> > Firstly, it's obvious that the parallelism should be less than or
> > equal
> > >> to
> > >> > the max parallelism(both literally and execution). The
> > >> > "jobmanager.adaptive-batch-scheduler.max-parallelism" will be used
> as
> > >> the
> > >> > max parallelism for a vertex if

Re: Question around Flink's AdaptiveBatchScheduler

2024-04-11 Thread Venkatakrishnan Sowrirajan
Gentle bump on this question. cc @Becket Qin  as well.

Regards
Venkata krishnan


On Tue, Mar 12, 2024 at 10:11 PM Venkatakrishnan Sowrirajan <
vsowr...@asu.edu> wrote:

> Thanks for the response, Lijie and Junrui. Sorry for the late reply. A few
> follow-up questions.
>
> > Source can actually ignore this limit
> because it has no upstream, but this will lead to semantic inconsistency.
>
> Lijie, can you please elaborate on the above comment further? What do you
> mean when you say it will lead to "semantic inconsistency"?
>
> > Secondly, we first need to limit the max parallelism of (downstream)
> vertex, and then we can decide how many subpartitions (upstream vertex)
> should produce. The limit should be effective, otherwise some downstream
> tasks will have no data to process.
>
> This makes sense in the context of any other vertices other than the
> source vertex. As you mentioned above ("Source can actually ignore this
> limit because it has no upstream"), therefore I feel "
> jobmanager.adaptive-batch-scheduler.default-source-parallelism" need not
> be upper bounded by "jobmanager.adaptive-batch-scheduler.max-parallelism".
>
> Regards
> Venkata krishnan
>
>
> On Thu, Feb 29, 2024 at 2:11 AM Junrui Lee  wrote:
>
>> Hi Venkat,
>>
>> As Lijie mentioned,  in Flink, the parallelism is required to be less than
>> or equal to the maximum parallelism. The config option
>> jobmanager.adaptive-batch-scheduler.max-parallelism and
>> jobmanager.adaptive-batch-scheduler.default-source-parallelism will be set
>> as the source's parallelism and max-parallelism, respectively. Therefore,
>> the check failed situation you encountered is in line with the
>> expectations.
>>
>> Best,
>> Junrui
>>
>> Lijie Wang  于2024年2月29日周四 17:35写道:
>>
>> > Hi Venkat,
>> >
>> > >> default-source-parallelism config should be independent from the
>> > max-parallelism
>> >
>> > Actually, it's not.
>> >
>> > Firstly, it's obvious that the parallelism should be less than or equal
>> to
>> > the max parallelism(both literally and execution). The
>> > "jobmanager.adaptive-batch-scheduler.max-parallelism" will be used as
>> the
>> > max parallelism for a vertex if you don't set max parallelism for it
>> > individually (Just like the source in your case).
>> >
>> > Secondly, we first need to limit the max parallelism of (downstream)
>> > vertex, and then we can decide how many subpartitions (upstream vertex)
>> > should produce. The limit should be effective, otherwise some downstream
>> > tasks will have no data to process. Source can actually ignore this
>> limit
>> > because it has no upstream, but this will lead to semantic
>> inconsistency.
>> >
>> > Best,
>> > Lijie
>> >
>> > Venkatakrishnan Sowrirajan  于2024年2月29日周四 05:49写道:
>> >
>> > > Hi Flink devs,
>> > >
>> > > With Flink's AdaptiveBatchScheduler
>> > > <
>> > >
>> >
>> https://urldefense.com/v3/__https://nightlies.apache.org/flink/flink-docs-release-1.15/docs/deployment/elastic_scaling/*adaptive-batch-scheduler__;Iw!!IKRxdwAv5BmarQ!fwD4qP-fTEiqJH9CC3AHgXbR5MJPGm7ll1dYwElK-zrtujWDWio6_yvBa4rHlaZHP_lefLs4bZQAISrg5BrHLw$
>> > > >
>> > > (Note:
>> > > this is different from AdaptiveScheduler
>> > > <
>> > >
>> >
>> https://urldefense.com/v3/__https://nightlies.apache.org/flink/flink-docs-release-1.15/docs/deployment/elastic_scaling/*adaptive-scheduler__;Iw!!IKRxdwAv5BmarQ!fwD4qP-fTEiqJH9CC3AHgXbR5MJPGm7ll1dYwElK-zrtujWDWio6_yvBa4rHlaZHP_lefLs4bZQAISqUzURivw$
>> > > >),
>> > > the scheduler automatically determines the correct number of
>> downstream
>> > > tasks required to process the shuffle generated by the upstream
>> vertex.
>> > >
>> > > I have a question regarding the current behavior. There are 2 configs
>> > which
>> > > are in interplay here.
>> > > 1. jobmanager.adaptive-batch-scheduler.default-source-parallelism
>> > > <
>> > >
>> >
>> https://urldefense.com/v3/__https://nightlies.apache.org/flink/flink-docs-release-1.15/docs/deployment/config/*jobmanager-adaptive-batch-scheduler-default-source-parallelism__;Iw!!IKRxdwAv5BmarQ!fwD4qP-fTEiqJH9CC3AHgXbR5MJPGm7ll1dYwElK-zrtujWDWio6_yvBa4rHlaZHP_lefLs4bZQAISoOTMiiCA$
>> > > >
>> > >  - The default parallelism 

Re: [DISCUSS] FLIP-435: Introduce a New Dynamic Table for Simplifying Data Pipelines

2024-03-29 Thread Venkatakrishnan Sowrirajan
Ron and Lincoln,

Great proposal and interesting discussion for adding support for dynamic
tables within Flink.

At LinkedIn, we are also trying to solve compute/storage convergence for
similar problems discussed as part of this FLIP, specifically periodic
backfill, bootstrap + nearline update use cases using single implementation
of business logic (single script).

A few clarifying questions:

1. In the proposed FLIP, given the example for the dynamic table, do the
data sources always come from a single lake storage such as Paimon, or does
the same proposal also cover two disparate storage systems like Kafka and
Iceberg, where Kafka events are ETLed to Iceberg similarly to Paimon?
Basically, the lambda architecture that is mentioned in the FLIP as well.
I'm wondering if it is possible to switch between sources based on the
execution mode, e.g. switch to a data lake storage system like Iceberg for a
backfill operation, and otherwise to an event streaming system like Kafka.
2. What happens in the context of a bootstrap (batch) + nearline update
(streaming) case for stateful applications? That is, will the state from the
batch application be transferred to the nearline application after the
bootstrap execution is complete?

Regards
Venkata krishnan


On Mon, Mar 25, 2024 at 8:03 PM Ron liu  wrote:

> Hi, Timo
>
> Thanks for your quick response, and your suggestion.
>
> Yes, this discussion has turned into confirming whether it's a special
> table or a special MV.
>
> 1. The key problem with MVs is that they don't support modification, so I
> prefer it to be a special table. Although the periodic refresh behavior is
> more characteristic of an MV, since we are already a special table,
> supporting periodic refresh behavior is quite natural, similar to Snowflake
> dynamic tables.
>
> 2. Regarding the keyword UPDATING, since the current Regular Table is a
> Dynamic Table, which implies support for updating through Continuous Query,
> I think it is redundant to add the keyword UPDATING. In addition, UPDATING
> cannot reflect the Continuous Query part and cannot express our goal of
> simplifying the data pipeline through Dynamic Table + Continuous
> Query.
>
> 3. From the perspective of the SQL standard definition, I can understand
> your concerns about Derived Table, but is it possible to make a slight
> adjustment to meet our needs? Additionally, as Lincoln mentioned, the
> Google Looker platform has introduced Persistent Derived Table, and there
> are precedents in the industry; could Derived Table be a candidate?
>
> Of course, look forward to your better suggestions.
>
> Best,
> Ron
>
>
>
> Timo Walther  于2024年3月25日周一 18:49写道:
>
> > After thinking about this more, this discussion boils down to whether
> > this is a special table or a special materialized view. In both cases,
> > we would need to add a special keyword:
> >
> > Either
> >
> > CREATE UPDATING TABLE
> >
> > or
> >
> > CREATE UPDATING MATERIALIZED VIEW
> >
> > I still feel that the periodic refreshing behavior is closer to a MV. If
> > we add a special keyword to MV, the optimizer would know that the data
> > cannot be used for query optimizations.
> >
> > I will ask more people for their opinion.
> >
> > Regards,
> > Timo
> >
> >
> > On 25.03.24 10:45, Timo Walther wrote:
> > > Hi Ron and Lincoln,
> > >
> > > thanks for the quick response and the very insightful discussion.
> > >
> > >  > we might limit future opportunities to optimize queries
> > >  > through automatic materialization rewriting by allowing data
> > >  > modifications, thus losing the potential for such optimizations.
> > >
> > > This argument makes a lot of sense to me. Due to the updates, the
> system
> > > is not in full control of the persisted data. However, the system is
> > > still in full control of the job that powers the refresh. So if the
> > > system manages all updating pipelines, it could still leverage
> automatic
> > > materialization rewriting but without leveraging the data at rest (only
> > > the data in flight).
> > >
> > >  > we are considering another candidate, Derived Table, the term
> 'derive'
> > >  > suggests a query, and 'table' retains modifiability. This approach
> > >  > would not disrupt our current concept of a dynamic table
> > >
> > > I did some research on this term. The SQL standard uses the term
> > > "derived table" extensively (defined in section 4.17.3). Thus, a lot of
> > > vendors adopt this for simply referring to a table within a subclause:
> > >
> > >
> https://urldefense.com/v3/__https://dev.mysql.com/doc/refman/8.0/en/derived-tables.html__;!!IKRxdwAv5BmarQ!dVYcp9PUyjpBGzkYFxb2sdnmB0E22koc-YLdxY2LidExEHUJKRkyvRbAveqjlYFKWevFvmE1Z-j735ghdiMp$
> > >
> > >
> >
> https://urldefense.com/v3/__https://infocenter.sybase.com/help/topic/com.sybase.infocenter.dc32300.1600/doc/html/san1390612291252.html__;!!IKRxdwAv5BmarQ!dVYcp9PUyjpBGzkYFxb2sdnmB0E22koc-YLdxY2LidExEHUJKRkyvRbAveqjlYFKWevFvmE1Z-j737h1gRux$
> > >
> > >
> >
> 

[DISCUSS] FLIP-438: Make Flink's Hadoop and YARN configuration probing consistent

2024-03-29 Thread Venkatakrishnan Sowrirajan
Hi Flink devs,

I would like to start a discussion on FLIP-XXX: Make Flink's Hadoop and
YARN configuration probing consistent. This stems from an earlier
discussion thread.

This FLIP is proposing to make the configuration probing behavior between
Hadoop and YARN configuration to be consistent.

Regards
Venkata krishnan


Re: [DISCUSS] FLIP-441: Show the JobType and remove Execution Mode on Flink WebUI

2024-03-27 Thread Venkatakrishnan Sowrirajan
Rui,

I assume the current proposal would also handle the case of mixed mode
(BATCH + STREAMING within the same app) in the future, right?

Regards
Venkat

On Wed, Mar 27, 2024 at 10:15 AM Venkatakrishnan Sowrirajan <
vsowr...@asu.edu> wrote:

> This will be a very useful addition to Flink UI. Thanks Rui for starting a
> FLIP for this improvement.
>
> Regards
> Venkata krishnan
>
>
> On Wed, Mar 27, 2024 at 4:49 AM Muhammet Orazov
>  wrote:
>
>> Hello Rui,
>>
>> Thanks for the proposal! It looks good!
>>
>> I have a minor clarification from my side:
>>
>> The execution mode is also used for the DataStream API [1],
>> would that also affect/hide the DataStream execution mode
>> if we remove it from the WebUI?
>>
>> Best,
>> Muhammet
>>
>> [1]:
>>
>> https://urldefense.com/v3/__https://nightlies.apache.org/flink/flink-docs-release-1.18/docs/dev/datastream/execution_mode/__;!!IKRxdwAv5BmarQ!eFyqVJyje_8hu1vMSUwKGBsj8vqsFDisEvJ5AxPV0LduhhHWF3rPKYEEE-09biA0unFbfMy5AVQZMgBv1AOa5oTHmcYlkUE$
>>
>>
>> On 2024-03-27 06:23, Rui Fan wrote:
>> > Hi flink developers,
>> >
>> > I'd like to start a discussion to discuss FLIP-441:
>> > Show the JobType and remove Execution Mode on Flink WebUI[1].
>> >
>> > Currently, the jobType has 2 types in Flink: STREAMING and BATCH.
>> > They work on completely different principles, such as: scheduler,
>> > shuffle, join, etc. These differences lead to different troubleshooting
>> > processes, so when users are maintaining a job or troubleshooting,
>> > it's needed to know whether the current job is a STREAMING or
>> > BATCH job. Unfortunately, Flink WebUI doesn't expose it to the
>> > users so far.
>> >
>> > Also, Execution Mode is related to DataSet api, it has been marked
>> > as @Deprecated in FLINK-32258 (1.18), but it's still shown in Flink
>> > WebUI.
>> >
>> > Looking forward to hearing more thoughts about it! Thank you~
>> >
>> > [1]
>> https://urldefense.com/v3/__https://cwiki.apache.org/confluence/x/agrPEQ__;!!IKRxdwAv5BmarQ!eFyqVJyje_8hu1vMSUwKGBsj8vqsFDisEvJ5AxPV0LduhhHWF3rPKYEEE-09biA0unFbfMy5AVQZMgBv1AOa5oTHayPyFj8$
>> > [2]
>> https://urldefense.com/v3/__https://issues.apache.org/jira/browse/FLINK-32558__;!!IKRxdwAv5BmarQ!eFyqVJyje_8hu1vMSUwKGBsj8vqsFDisEvJ5AxPV0LduhhHWF3rPKYEEE-09biA0unFbfMy5AVQZMgBv1AOa5oTHftYeOLE$
>> >
>> > Best,
>> > Rui
>>
>


Re: [DISCUSS] FLIP-441: Show the JobType and remove Execution Mode on Flink WebUI

2024-03-27 Thread Venkatakrishnan Sowrirajan
This will be a very useful addition to Flink UI. Thanks Rui for starting a
FLIP for this improvement.

Regards
Venkata krishnan


On Wed, Mar 27, 2024 at 4:49 AM Muhammet Orazov
 wrote:

> Hello Rui,
>
> Thanks for the proposal! It looks good!
>
> I have a minor clarification from my side:
>
> The execution mode is also used for the DataStream API [1],
> would that also affect/hide the DataStream execution mode
> if we remove it from the WebUI?
>
> Best,
> Muhammet
>
> [1]:
>
> https://urldefense.com/v3/__https://nightlies.apache.org/flink/flink-docs-release-1.18/docs/dev/datastream/execution_mode/__;!!IKRxdwAv5BmarQ!eFyqVJyje_8hu1vMSUwKGBsj8vqsFDisEvJ5AxPV0LduhhHWF3rPKYEEE-09biA0unFbfMy5AVQZMgBv1AOa5oTHmcYlkUE$
>
>
> On 2024-03-27 06:23, Rui Fan wrote:
> > Hi flink developers,
> >
> > I'd like to start a discussion to discuss FLIP-441:
> > Show the JobType and remove Execution Mode on Flink WebUI[1].
> >
> > Currently, the jobType has 2 types in Flink: STREAMING and BATCH.
> > They work on completely different principles, such as: scheduler,
> > shuffle, join, etc. These differences lead to different troubleshooting
> > processes, so when users are maintaining a job or troubleshooting,
> > it's needed to know whether the current job is a STREAMING or
> > BATCH job. Unfortunately, Flink WebUI doesn't expose it to the
> > users so far.
> >
> > Also, Execution Mode is related to DataSet api, it has been marked
> > as @Deprecated in FLINK-32258 (1.18), but it's still shown in Flink
> > WebUI.
> >
> > Looking forward to hearing more thoughts about it! Thank you~
> >
> > [1]
> https://urldefense.com/v3/__https://cwiki.apache.org/confluence/x/agrPEQ__;!!IKRxdwAv5BmarQ!eFyqVJyje_8hu1vMSUwKGBsj8vqsFDisEvJ5AxPV0LduhhHWF3rPKYEEE-09biA0unFbfMy5AVQZMgBv1AOa5oTHayPyFj8$
> > [2]
> https://urldefense.com/v3/__https://issues.apache.org/jira/browse/FLINK-32558__;!!IKRxdwAv5BmarQ!eFyqVJyje_8hu1vMSUwKGBsj8vqsFDisEvJ5AxPV0LduhhHWF3rPKYEEE-09biA0unFbfMy5AVQZMgBv1AOa5oTHftYeOLE$
> >
> > Best,
> > Rui
>


Re: Flink's treatment to "hadoop" and "yarn" configuration overrides seems unintuitive

2024-03-12 Thread Venkatakrishnan Sowrirajan
Thanks for your response. Sorry for the late reply, Ferenc.

Yes, I totally agree with you on deprecating this behavior as part of 1.20.
Let me follow up with a FLIP that deprecates the current behavior and
proposes a solution. We can discuss further in the [DISCUSS] thread of the
FLIP.

Regards
Venkata krishnan


On Mon, Feb 26, 2024 at 11:25 PM Ferenc Csaky 
wrote:

> Thanks for the additional details, Venkata. I have not used Spark widely
> myself, but if there are more examples
> that follow that approach, it makes sense to comply and be
> consistent.
>
> Regarding the planned release schedule on the 2.0 wiki page [1] the
> expected releases are 1.19 -> 1.20 -> 2.0. I am not sure how realistic
> that is, or whether there is any chance there will be a 1.21, but even if
> not, deprecating the current behavior even for 1.20 would not hurt IMO.
>
> WDYT?
>
> Looking for other opinios as well of course.
>
> Regards,
> Ferenc
>
> [1]
> https://urldefense.com/v3/__https://cwiki.apache.org/confluence/display/FLINK/2.0*Release__;Kw!!IKRxdwAv5BmarQ!ZafwWMtIzLzlT2CxT8XSct-ZjnETPvL0KFMGi13v1KWCEE7GXBq4NXZK-kjhRG1AZfiU0aALIPsSkEDcwzDHxrhej3L0i2YI$
>
>
> On Tuesday, February 27th, 2024 at 07:08, Venkatakrishnan Sowrirajan <
> vsowr...@asu.edu> wrote:
>
> >
> >
> > Thanks for sharing your thoughts, Ferenc.
> >
> > Just my 2 cents, coming from Spark background that uses "spark.hadoop."
> > prefix to handle all Hadoop configs, I prefer the "flink.hadoop." prefix
> > and "flink.yarn." prefix. Generally, users use both these systems and I
> > prefer to be consistent that way. Having said that, Flink doesn't need to
> > be consistent with Spark so I am fine with the other approach as well.
> >
> > I believe in order to make backwards incompatible changes, Flink needs
> the
> > change to be in deprecated status for at least 2 minor versions which
> means
> > we will already have 2.0, therefore this can probably go in 3.0 only.
> >
> > It is still good to deprecate the current behavior and fix with the right
> > behavior and get rid of this in 3.0 totally.
> >
> > Looking for more thoughts from others in the community to make sure that
> I
> > don't miss anything. Once the discussion settles, I can start a FLIP with
> > the new proposal.
> >
> > Thanks
> > Venkat
> >
> >
> > On Mon, Feb 26, 2024, 1:09 AM Ferenc Csaky ferenc.cs...@pm.me.invalid
> >
> > wrote:
> >
> > > Hi Venkata krishnan,
> > >
> > > Thanks for starting a discussion on this topic. I completely
> > > agree with you on that, this behavior can create confusion and
> > > cause debugging sessions that could be spared with aligning how Flink
> > > parses external properties.
> > >
> > > Personally, I find the Yarn props prefixing more intuitive, but
> > > I do not have strong opinions other than prefixing configs for
> > > external systems should follow the same semantics and behavior.
> > >
> > > It would make sense to align these in Flink 2.0 IMO, but I would
> > > be curious about other opinions.
> > >
> > > On Saturday, February 24th, 2024 at 07:36, Venkatakrishnan Sowrirajan <
> > > vsowr...@asu.edu> wrote:
> > >
> > > > Gentle ping on the ^^ question to surface this back up again. Any
> > > > thoughts?
> > > >
> > > > Regards
> > > > Venkata krishnan
> > > >
> > > > On Fri, Feb 16, 2024 at 7:32 PM Venkatakrishnan Sowrirajan
> > > > vsowr...@asu.edu
> > > >
> > > > wrote:
> > > >
> > > > > Hi Flink devs,
> > > > >
> > > > > Flink supports overriding "hadoop" and "yarn" configuration. As
> part of
> > > > > the override mechanism, users have to prefix `hadoop` configs with
> "
> > > > > flink.hadoop." and the prefix will be removed, while with `yarn`
> > > > > configs
> > > > > users have to prefix it with "flink.yarn." but "flink." only is
> > > > > removed,
> > > > > not "flink.yarn.".
> > > > >
> > > > > Following is an example:
> > > > >
> > > > > 1. "Hadoop" config
> > > > >
> > > > > Hadoop config key = hadoop.tmp.dir => Flink config =
> > > > > flink.hadoop.hadoop.tmp.dir => Hadoop's configuration object would
> have
> > > > &

Re: Question around Flink's AdaptiveBatchScheduler

2024-03-12 Thread Venkatakrishnan Sowrirajan
Thanks for the response, Lijie and Junrui. Sorry for the late reply. A few
follow-up questions.

> Source can actually ignore this limit
because it has no upstream, but this will lead to semantic inconsistency.

Lijie, can you please elaborate on the above comment further? What do you
mean when you say it will lead to "semantic inconsistency"?

> Secondly, we first need to limit the max parallelism of (downstream)
vertex, and then we can decide how many subpartitions (upstream vertex)
should produce. The limit should be effective, otherwise some downstream
tasks will have no data to process.

This makes sense in the context of any other vertices other than the source
vertex. As you mentioned above ("Source can actually ignore this limit
because it has no upstream"), therefore I feel "
jobmanager.adaptive-batch-scheduler.default-source-parallelism" need not be
upper bounded by "jobmanager.adaptive-batch-scheduler.max-parallelism".

Regards
Venkata krishnan


On Thu, Feb 29, 2024 at 2:11 AM Junrui Lee  wrote:

> Hi Venkat,
>
> As Lijie mentioned,  in Flink, the parallelism is required to be less than
> or equal to the maximum parallelism. The config option
> jobmanager.adaptive-batch-scheduler.max-parallelism and
> jobmanager.adaptive-batch-scheduler.default-source-parallelism will be set
> as the source's parallelism and max-parallelism, respectively. Therefore,
> the check failed situation you encountered is in line with the
> expectations.
>
> Best,
> Junrui
>
> Lijie Wang  于2024年2月29日周四 17:35写道:
>
> > Hi Venkat,
> >
> > >> default-source-parallelism config should be independent from the
> > max-parallelism
> >
> > Actually, it's not.
> >
> > Firstly, it's obvious that the parallelism should be less than or equal
> to
> > the max parallelism(both literally and execution). The
> > "jobmanager.adaptive-batch-scheduler.max-parallelism" will be used as the
> > max parallelism for a vertex if you don't set max parallelism for it
> > individually (Just like the source in your case).
> >
> > Secondly, we first need to limit the max parallelism of (downstream)
> > vertex, and then we can decide how many subpartitions (upstream vertex)
> > should produce. The limit should be effective, otherwise some downstream
> > tasks will have no data to process. Source can actually ignore this limit
> > because it has no upstream, but this will lead to semantic inconsistency.
> >
> > Best,
> > Lijie
> >
> > Venkatakrishnan Sowrirajan  于2024年2月29日周四 05:49写道:
> >
> > > Hi Flink devs,
> > >
> > > With Flink's AdaptiveBatchScheduler
> > > <
> > >
> >
> https://urldefense.com/v3/__https://nightlies.apache.org/flink/flink-docs-release-1.15/docs/deployment/elastic_scaling/*adaptive-batch-scheduler__;Iw!!IKRxdwAv5BmarQ!fwD4qP-fTEiqJH9CC3AHgXbR5MJPGm7ll1dYwElK-zrtujWDWio6_yvBa4rHlaZHP_lefLs4bZQAISrg5BrHLw$
> > > >
> > > (Note:
> > > this is different from AdaptiveScheduler
> > > <
> > >
> >
> https://urldefense.com/v3/__https://nightlies.apache.org/flink/flink-docs-release-1.15/docs/deployment/elastic_scaling/*adaptive-scheduler__;Iw!!IKRxdwAv5BmarQ!fwD4qP-fTEiqJH9CC3AHgXbR5MJPGm7ll1dYwElK-zrtujWDWio6_yvBa4rHlaZHP_lefLs4bZQAISqUzURivw$
> > > >),
> > > the scheduler automatically determines the correct number of downstream
> > > tasks required to process the shuffle generated by the upstream vertex.
> > >
> > > I have a question regarding the current behavior. There are 2 configs
> > which
> > > are in interplay here.
> > > 1. jobmanager.adaptive-batch-scheduler.default-source-parallelism
> > > <
> > >
> >
> https://urldefense.com/v3/__https://nightlies.apache.org/flink/flink-docs-release-1.15/docs/deployment/config/*jobmanager-adaptive-batch-scheduler-default-source-parallelism__;Iw!!IKRxdwAv5BmarQ!fwD4qP-fTEiqJH9CC3AHgXbR5MJPGm7ll1dYwElK-zrtujWDWio6_yvBa4rHlaZHP_lefLs4bZQAISoOTMiiCA$
> > > >
> > >  - The default parallelism of data source.
> > > 2. jobmanager.adaptive-batch-scheduler.max-parallelism
> > > <
> > >
> >
> https://nightlies.apache.org/flink/flink-docs-release-1.15/docs/deployment/config/#jobmanager-adaptive-batch-scheduler-max-parallelism
> > > >
> > > -
> > > Upper bound of allowed parallelism to set adaptively.
> > >
> > > Currently, if
> > > "jobmanager.adaptive-batch-scheduler.default-source-parallelism" is
> > > greater than "jobmanager.adaptive-batch-scheduler.max-parallelism", the
> > > Flink application fails with: "Vertex's parallelism should be smaller
> > > than or equal to vertex's max parallelism."

Question around Flink's AdaptiveBatchScheduler

2024-02-28 Thread Venkatakrishnan Sowrirajan
Hi Flink devs,

With Flink's AdaptiveBatchScheduler (note: this is different from
AdaptiveScheduler), the scheduler automatically determines the correct
number of downstream tasks required to process the shuffle generated by
the upstream vertex.

I have a question regarding the current behavior. There are 2 configs which
are in interplay here.
1. jobmanager.adaptive-batch-scheduler.default-source-parallelism - the
default parallelism of the data source.
2. jobmanager.adaptive-batch-scheduler.max-parallelism - the upper bound
of allowed parallelism to set adaptively.

Currently, if "jobmanager.adaptive-batch-scheduler.default-source-parallelism"
is greater than "jobmanager.adaptive-batch-scheduler.max-parallelism", the
Flink application fails with the below message:

"Vertex's parallelism should be smaller than or equal to vertex's max
parallelism."

This is the corresponding code in Flink's DefaultVertexParallelismInfo.
My question is: shouldn't the "default-source-parallelism" config be
independent of the "max-parallelism" config? The former controls the
default source parallelism, while the latter controls the max number of
partitions to write the intermediate shuffle.

If this is true, then the above check should be fixed. Otherwise, I wanted
to understand why "default-source-parallelism" should be less than
"max-parallelism".

Thanks
Venkat


Re: Flink's treatment to "hadoop" and "yarn" configuration overrides seems unintuitive

2024-02-26 Thread Venkatakrishnan Sowrirajan
Thanks for sharing your thoughts, Ferenc.

Just my 2 cents: coming from a Spark background, where the "spark.hadoop."
prefix handles all Hadoop configs, I prefer the "flink.hadoop." and
"flink.yarn." prefixes. Generally, users use both these systems and I
prefer to be consistent that way. Having said that, Flink doesn't need to
be consistent with Spark so I am fine with the other approach as well.

I believe that in order to make backwards-incompatible changes, Flink needs
the change to be in deprecated status for at least 2 minor versions, which
means we will already be at 2.0; therefore this can probably only go in 3.0.

It would still be good to deprecate the current behavior now, introduce the
fixed behavior, and remove the old one entirely in 3.0.

Looking for more thoughts from others in the community to make sure that I
don't miss anything. Once the discussion settles, I can start a FLIP with
the new proposal.

Thanks
Venkat


On Mon, Feb 26, 2024, 1:09 AM Ferenc Csaky 
wrote:

> Hi Venkata krishnan,
>
> Thanks for starting a discussion on this topic. I completely
> agree with you on that, this behavior can create confusion and
> cause debugging sessions that could be spared with aligning how Flink
> parses external properties.
>
> Personally, I find the Yarn props prefixing more intuitive, but
> I do not have strong opinions other than prefixing configs for
> external systems should follow the same semantics and behavior.
>
> It would make sense to align these in Flink 2.0 IMO, but I would
> be curious about other opinions.
>
>
>
>
> On Saturday, February 24th, 2024 at 07:36, Venkatakrishnan Sowrirajan <
> vsowr...@asu.edu> wrote:
>
> >
> >
> > Gentle ping on the ^^ question to surface this back up again. Any
> thoughts?
> >
> > Regards
> > Venkata krishnan
> >
> >
> > On Fri, Feb 16, 2024 at 7:32 PM Venkatakrishnan Sowrirajan
> vsowr...@asu.edu
> >
> > wrote:
> >
> > > Hi Flink devs,
> > >
> > > Flink supports overriding "hadoop" and "yarn" configuration. As part of
> > > the override mechanism, users have to prefix `hadoop` configs with "
> > > flink.hadoop." and the prefix will be removed, while with `yarn`
> configs
> > > users have to prefix it with "flink.yarn." but "flink." only is
> removed,
> > > not "flink.yarn.".
> > >
> > > Following is an example:
> > >
> > > 1. "Hadoop" config
> > >
> > > Hadoop config key = hadoop.tmp.dir => Flink config =
> > > flink.hadoop.hadoop.tmp.dir => Hadoop's configuration object would have
> > > hadoop.tmp.dir.
> > >
> > > 2. "YARN" config
> > >
> > > YARN config key = yarn.application.classpath => Flink config =
> > > flink.yarn.yarn.application.classpath => YARN's configuration object
> > > would have yarn.yarn.application.classpath.
> > >
> > > Although this is documented
> > >
> https://nightlies.apache.org/flink/flink-docs-release-1.18/docs/deployment/config/#flink-yarn-%3Ckey%3E
> 
> > > properly, it feels unintuitive and it tripped me, took quite a while to
> > > understand why the above YARN configuration override was not working as
> > > expected. Is this something that should be fixed? The problem with
> fixing
> > > it is, it will become backwards incompatible. Therefore, can this be
> > > addressed as part of Flink-2.0?
> > >
> > > Any thoughts?
> > >
> > > Regards
> > > Venkata krishnan
>


Re: Flink's treatment to "hadoop" and "yarn" configuration overrides seems unintuitive

2024-02-23 Thread Venkatakrishnan Sowrirajan
Gentle ping on the ^^ question to surface this back up again. Any thoughts?

Regards
Venkata krishnan


On Fri, Feb 16, 2024 at 7:32 PM Venkatakrishnan Sowrirajan 
wrote:

> Hi Flink devs,
>
> Flink supports overriding "hadoop" and "yarn" configuration. As part of
> the override mechanism, users have to prefix `hadoop` configs with "
> flink.hadoop." and the prefix will be removed, while with `yarn` configs
> users have to prefix it with "flink.yarn." but "flink." only is removed,
> not "flink.yarn.".
>
> Following is an example:
>
> 1. "*Hadoop*" config
>
> Hadoop config key = hadoop.tmp.dir => Flink config =
> flink.hadoop.hadoop.tmp.dir => Hadoop's configuration object would have
> hadoop.tmp.dir.
>
> *2. "YARN" config*
>
> YARN config key = yarn.application.classpath => Flink config =
> flink.yarn.yarn.application.classpath => YARN's configuration object
> would have yarn.yarn.application.classpath.
>
> Although this is documented
> <https://nightlies.apache.org/flink/flink-docs-release-1.18/docs/deployment/config/#flink-yarn-%3Ckey%3E>
> properly, it feels unintuitive and it tripped me, took quite a while to
> understand why the above YARN configuration override was not working as
> expected. Is this something that should be fixed? The problem with fixing
> it is, it will become backwards incompatible. Therefore, can this be
> addressed as part of Flink-2.0?
>
> Any thoughts?
>
> Regards
> Venkata krishnan
>


Flink's treatment to "hadoop" and "yarn" configuration overrides seems unintuitive

2024-02-16 Thread Venkatakrishnan Sowrirajan
Hi Flink devs,

Flink supports overriding "hadoop" and "yarn" configuration. As part of the
override mechanism, users have to prefix `hadoop` configs with "
flink.hadoop." and the prefix will be removed, while with `yarn` configs
users have to prefix it with "flink.yarn." but "flink." only is removed,
not "flink.yarn.".

Following is an example:

1. "*Hadoop*" config

Hadoop config key = hadoop.tmp.dir => Flink config =
flink.hadoop.hadoop.tmp.dir => Hadoop's configuration object would have
hadoop.tmp.dir.

2. "*YARN*" config

YARN config key = yarn.application.classpath => Flink config =
flink.yarn.yarn.application.classpath => YARN's configuration object would
have yarn.yarn.application.classpath.
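The asymmetry described above can be sketched with a small, hypothetical helper (Python purely for illustration; the function names are made up and Flink's actual implementation in its Hadoop/YARN utility classes is not reproduced here):

```python
def to_hadoop_key(flink_key):
    # The whole "flink.hadoop." prefix is stripped before the key is
    # handed to Hadoop's Configuration object.
    prefix = "flink.hadoop."
    if flink_key.startswith(prefix):
        return flink_key[len(prefix):]
    return None  # not a Hadoop override key


def to_yarn_key(flink_key):
    # For YARN keys only the leading "flink." is removed, so the "yarn."
    # part of the prefix survives in the resulting key.
    prefix = "flink.yarn."
    if flink_key.startswith(prefix):
        return flink_key[len("flink."):]
    return None  # not a YARN override key
```

This reproduces the surprise from the example above: `to_hadoop_key("flink.hadoop.hadoop.tmp.dir")` yields `hadoop.tmp.dir`, while naively prefixing the full YARN key, `to_yarn_key("flink.yarn.yarn.application.classpath")`, yields the doubled `yarn.yarn.application.classpath`.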

Although this is documented

properly, it feels unintuitive and it tripped me; it took quite a while to
understand why the above YARN configuration override was not working as
expected. Is this something that should be fixed? The problem with fixing
it is, it will become backwards incompatible. Therefore, can this be
addressed as part of Flink-2.0?

Any thoughts?

Regards
Venkata krishnan


Re: [DISCUSS] FLIP-377: Support configuration to disable filter push down for Table/SQL Sources

2023-10-27 Thread Venkatakrishnan Sowrirajan
Thanks for the proposal, Jiabao.

I agree with Becket: if a *Source* implements the *SupportsXXXPushDown*
(in this case *SupportsFilterPushDown*) interface, then the *Source* (in
your FLIP example, a database) is designed to support filter pushdown. The
corresponding Source can have mechanisms built into it to detect cases
where applying the filter pushdown adds computation pressure that can
affect the stability of the system, and if so, disable it.

Could you please elaborate on the use cases where users know upfront (but
it is not detectable at the source level) that, for a specific job or SQL
query, *applyFilters* could negatively affect the overall performance of
the query or the external system, or any other use cases where the
*PushDown* support has to be selectively disabled for specific sources?

Regards
Venkata krishnan


On Fri, Oct 27, 2023 at 2:48 AM Jark Wu  wrote:

> Hi Becket,
>
> I checked the history of
> "*table.optimizer.source.predicate-pushdown-enabled*";
> it seems it was introduced with the legacy FilterableTableSource interface,
> which might have been an experimental feature at that time. I don't see the
> necessity
> of this option at the moment. Maybe we can deprecate this option and drop
> it
> in Flink 2.0[1] if it is not necessary anymore. This may help to
> simplify this discussion.
>
>
> Best,
> Jark
>
> [1]:
> https://issues.apache.org/jira/browse/FLINK-32383
>
>
>
> On Thu, 26 Oct 2023 at 10:14, Becket Qin  wrote:
>
> > Thanks for the proposal, Jiabao. My two cents below:
> >
> > 1. If I understand correctly, the motivation of the FLIP is mainly to
> make
> > predicate pushdown optional on SOME of the Sources. If so, intuitively
> the
> > configuration should be Source specific instead of general. Otherwise, we
> > will end up with general configurations that may not take effect for some
> > of the Source implementations. This violates the basic rule of a
> > configuration - it does what it says, regardless of the implementation.
> > While configuration standardization is usually a good thing, it should
> not
> > break the basic rules.
> > If we really want to have this general configuration, for the sources
> this
> > configuration does not apply, they should throw an exception to make it
> > clear that this configuration is not supported. However, that seems ugly.
> >
> > 2. I think the actual motivation of this FLIP is about "how a source
> > should implement predicate pushdown efficiently", not "whether predicate
> > pushdown should be applied to the source." For example, if a source wants
> > to avoid additional computing load in the external system, it can always
> > read the entire record and apply the predicates by itself. However, from
> > the Flink perspective, the predicate pushdown is applied, it is just
> > implemented differently by the source. So the design principle here is
> that
> > Flink only cares about whether a source supports predicate pushdown or
> not,
> > it does not care about the implementation efficiency / side effect of the
> > predicates pushdown. It is the Source implementation's responsibility to
> > ensure the predicates pushdown is implemented efficiently and does not
> > impose excessive pressure on the external system. And it is OK to have
> > additional configurations to achieve this goal. Obviously, such
> > configurations will be source specific in this case.
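Becket's second point — that a source can accept pushed-down predicates yet evaluate them itself instead of forwarding them to the external system — can be sketched roughly like this (Python with made-up names; Flink's real interface is Java's `SupportsFilterPushDown`, whose `applyFilters` reports accepted and remaining filters to the planner):

```python
class SelfFilteringSource:
    """Hypothetical source that accepts all pushed-down filters but,
    to spare the external system, evaluates them locally per record."""

    def __init__(self, records):
        self.records = records
        self.filters = []

    def apply_filters(self, filters):
        # Tell the planner every filter was accepted (so Flink will not
        # re-apply them), while keeping the freedom to evaluate them
        # inside the source rather than in the external system.
        self.filters = list(filters)
        return []  # no remaining filters for Flink to apply

    def read(self):
        # Read full records from the external system, then filter locally.
        for record in self.records:
            if all(f(record) for f in self.filters):
                yield record
```

From Flink's perspective the pushdown "happened" either way; only the implementation strategy inside the source differs, which is exactly why the decision is source-specific.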
> >
> > 3. Regarding the existing configurations of
> *table.optimizer.source.predicate-pushdown-enabled.
> > *I am not sure why we need it. Supposedly, if a source implements a
> > SupportsXXXPushDown interface, the optimizer should push the
> corresponding
> > predicates to the Source. I am not sure in which case this configuration
> > would be used. Any ideas @Jark Wu ?
> >
> > Thanks,
> >
> > Jiangjie (Becket) Qin
> >
> >
> > On Wed, Oct 25, 2023 at 11:55 PM Jiabao Sun
> >  wrote:
> >
> >> Thanks Jane for the detailed explanation.
> >>
> >> I think that for users, we should respect conventions over
> >> configurations.
> >> Conventions can be default values explicitly specified in
> configurations,
> >> or they can be behaviors that follow previous versions.
> >> If the same code has different behaviors in different versions, it would
> >> be a very bad thing.
> >>
> >> I agree that for regular users, it is not necessary to understand all
> the
> >> configurations related to Flink.
> >> By following conventions, they can have a good experience.
> >>
> >> Let's get back to the practical situation and consider it.
> >>
> >> Case 1:
> >> The user is not familiar with the purpose of the
> >> table.optimizer.source.predicate-pushdown-enabled configuration but
> follows
> >> the convention of allowing predicate pushdown to the source by default.
> >> Just understanding the source.predicate-pushdown-enabled configuration
> >> and performing 

Re: FW: RE: Close orphaned/stale PRs

2023-10-23 Thread Venkatakrishnan Sowrirajan
Sorry for the delay. I filed
https://issues.apache.org/jira/projects/FLINK/issues/FLINK-33343 to track
and address the proposal here.

Regards
Venkata krishnan


On Tue, Oct 17, 2023 at 7:49 PM Venkatakrishnan Sowrirajan 
wrote:

> Thanks Martijn, David, Ryan and others for contributing to this great
> discussion!
>
> 1. From a project perspective, we can have a discussion about closing
>> PRs automatically that a) are not followed-up within X number of days
>> after a review and/or b) PRs that don't have a passing build and/or
>> don't follow contribution guidelines and/or C) need to be rebased
>> 2. In order to help understand which PRs are OK to get reviewed, we
>> could consider automatically adding a label "Ready for review" in case
>> 1b (passing build/contribution guidelines met) is the case.
>> 3. In order to help contributors, we could consider automatically
>> adding a label in case their PR isn't mergeable for the situations
>> that are displayed in situation 1
>
>
> I'm +1 on Martijn's proposal. We can get started on this and incrementally
> improve/amend as needed. Thanks everyone once again! Let me file tickets
> for each of the items.
>
> Regards
> Venkata krishnan
>
>
> On Thu, Oct 12, 2023 at 3:32 AM David Radley 
> wrote:
>
>> Hi everyone,
>> Martjin, I like your ideas. I think these labels will help, make it
>> obvious what work is actionable. I really feel these sort of process
>> improvements will incrementally help work to flow through appropriately.
>>
>> 2 additional thoughts – I hope these help this discussion:
>>
>>   *   A triaged label on the issue would indicate that a maintainer has
>> agreed this is a valid issue – this would be a better pool of issues for
>> contributors to pick up. I am not sure if maintainers currently do this sort
>> of work.
>>   *   I like the codeowners idea; did you find a way through this within
>> the Apache rules? An extension to this is that we are increasingly moving
>> parts of the code out of the main Flink repository into other repositories;
>> would this be doable there? Could experts in those repositories be given
>> write access to those repos, so that each non-core repo can work through
>> its issues and merge its PRs more independently? This is how the LF project
>> Egeria works with its connectors and UIs. I guess the concern is that in
>> the ASF these people would need to be committers, or could they be a
>> committer on a subset of repos? Another way to manage who can merge PRs is
>> to gate the PR process using Git actions, so that if an approved approver
>> indicates a PR is good then the raiser can merge – this would give us
>> granularity on write access – PyTorch follows this sort of process.
>>
>>   kind regards, David.
>>
>>
>> From: Martijn Visser 
>> Date: Thursday, 12 October 2023 at 10:32
>> To: dev@flink.apache.org 
>> Subject: [EXTERNAL] Re: FW: RE: Close orphaned/stale PRs
>> Hi everyone,
>>
>> I'm overall +1 on Ryan's comment.
>> When we're talking about component ownership, I've started a
>> discussion on the Infra mailing list in the beginning of the year on
>> it. In principle, the "codeowners" idea goes against ASF principles.
>>
>> Let's summarize things:
>> 1. From a project perspective, we can have a discussion about closing
>> PRs automatically that a) are not followed-up within X number of days
>> after a review and/or b) PRs that don't have a passing build and/or
>> don't follow contribution guidelines and/or C) need to be rebased
>> 2. In order to help understand which PRs are OK to get reviewed, we
>> could consider automatically adding a label "Ready for review" in case
>> 1b (passing build/contribution guidelines met) is the case.
>> 3. In order to help contributors, we could consider automatically
>> adding a label in case their PR isn't mergeable for the situations
>> that are displayed in situation 1
>>
>> When that's done, we can see what the effect is on the PRs queue.
>>
>> Best regards,
>>
>> Martijn
>>
>> On Wed, Oct 4, 2023 at 5:13 PM David Radley 
>> wrote:
>> >
>> > Hi Ryan,
>> >
>> > I agree that good communication is key to determining what can be
>> worked on.
>> >
>> > In terms of metrics , we can use the gh cli to list prs and we can
>> export issues from Jira. A view across them, you could join on the Flink
>> issue (at the start of the pr comment and the flink issue itself – you
>> could then see which prs have an assigned Jira would be expecte

Re: FW: RE: Close orphaned/stale PRs

2023-10-17 Thread Venkatakrishnan Sowrirajan
> > > Gaining over 500 issues to the backlog every 3 months.
> > >
> > > We have over 1000 open prs. This is a lot of technical debt. I came
> across a 6 month old pr recently that had not been merged. A second Jira
> issue was raised  for the same problem and a second pr fixed the issue
> (identically). The first pr was still on the backlog until we noticed it.
> > >
> > > I am looking to contribute to the community to be able to identify
> issues I can work on and then be reasonably certain they will be reviewed
> and merged so I can build on contributions. I have worked as a maintainer
> and committer in other communities and managed to spend some of the week
> addressing incoming work; I am happy to do this in some capacity with the
> support of committer(s) for Flink.  It seems to me it is virtuous circle to
> enable more contributions, to get more committers , builds those committers
> that can help merge and review the backlog.
> > >
> > > Some thoughts ( I am new to this – so apologise if I have
> misunderstood something or am unaware of other existing mechanisms) :
> > >
> > >   1.  If there is an issue that a committer has assigned to a
> contributor as per the process<
> https://flink.apache.org/how-to-contribute/contribute-code/
>   > , and there is a pr then it should be with the committer to review the
> pr, or return it to the work queue. I do not know how many prs are like
> this. It seems to me that if a committer assigns an issue, they are
> indicating they will review, unassign themselves or merge. I do not think
> these prs should be closed as stale.
> > >   2.  Could we have a Git action to notify committers (tagged in the
> pr?) if a pr (that has an assigned Jira)  has not been reviewed in a
> certain period (7 days?) then subsequent nags if there has been no response
> . In this way busy committers can see that a pr needs looking at.
> > >   3.  Other prs have been raised without a committer saying that they
> will fix it.  In this case there is likely to be value, but the merging and
> review work has not been taken on by anyone. I notice spelling mistake prs
> that have not been merged (there are 8 with this query
> https://github.com/apache/flink/pulls?q=is%3Apr+is%3Aopen+spelling
>   ), these are typical newbie prs as they are simple but useful
> improvements.; it would be great if these simpler ones could just be merged
> – maybe they should be marked as a [hotfix] to indicate they should be
> merged.  If simpler prs are not merged – it is very difficult for new
> contributors to gain eminence to get towards being a committer.
> > >   4.  There are also issues that have been raised by people who do not
> want to fix them. It seems to me that we need a “triaged” state to indicate
> the issue looks valid and reasonable, so could be picked up by someone – at
> which time they would need to agree with a committer to get the associated
> pr reviewed and merged. This triaged state would be a pool of issues that
> new contributors to choose from
> > >
> > >
> > >
> > > I am happy to help to improve – once we have consensus,
> > >
> > >
> > >
> > > Kind regards, David.
> > >
> > >
> > >
> > >
> > > From: Venkatakrishnan Sowrirajan 
> > > Date: Wednesday, 4 October 2023 at 00:36
> > > To: dev@flink.apache.org 
> > > Subject: [EXTERNAL] Re: Close orphaned/stale PRs
> > > Gentle ping to surface this up for more discussions.
> > >
> > > Regards
> > > Venkata krishnan
> > >
> > >
> > > On Tue, Sep 26, 2023 at 4:59 PM Venkatakrishnan Sowrirajan <
> vsowr...@asu.edu>
> > > wrote:
> > >
> > > > Hi Martijn,
> > > >
> > > > Agree with your point that closing a PR without any review feedback
> even
> > > > after 'X' days is discouraging to a new contributor. I understand
> that this
> > > > is a capacity problem. Capacity problem cannot be solved by this
> proposal
> > > > and it is beyond the scope of this proposal.
> > > >
> > 

Re: [ANNOUNCE] New Apache Flink Committer - Jane Chan

2023-10-17 Thread Venkatakrishnan Sowrirajan
Congratulations Jane!

Regards
Venkata krishnan


On Tue, Oct 17, 2023 at 6:45 AM Matthias Pohl
 wrote:

> Congratulations :)
>
> On Mon, Oct 16, 2023 at 1:04 PM Jing Ge 
> wrote:
>
> > Congrats!
> >
> > Best regards,
> > Jing
> >
> > On Mon, Oct 16, 2023 at 10:33 AM Yuepeng Pan 
> > wrote:
> >
> > > Congratulations, Jane !
> > >
> > >
> > >
> > > Best,
> > > Yuepeng Pan
> > >
> > > At 2023-10-16 09:58:02, "Jark Wu"  wrote:
> > > >Hi, everyone
> > > >
> > > >On behalf of the PMC, I'm very happy to announce Jane Chan as a new
> > Flink
> > > >Committer.
> > > >
> > > >Jane started code contribution in Jan 2021 and has been active in the
> > > Flink
> > > >community since. She authored more than 60 PRs and reviewed more than
> 40
> > > >PRs. Her contribution mainly revolves around Flink SQL, including Plan
> > > >Advice (FLIP-280), operator-level state TTL (FLIP-292), and ALTER
> TABLE
> > > >statements (FLINK-21634). Jane participated deeply in development
> > > >discussions and also helped answer user question emails. Jane was
> also a
> > > >core contributor of Flink Table Store (now Paimon) when the project
> was
> > in
> > > >the early days.
> > > >
> > > >Please join me in congratulating Jane Chan for becoming a Flink
> > Committer!
> > > >
> > > >Best,
> > > >Jark Wu (on behalf of the Flink PMC)
> > >
> >
>


Re: [ANNOUNCE] New Apache Flink Committer - Ron Liu

2023-10-16 Thread Venkatakrishnan Sowrirajan
Congrats Ron!

On Mon, Oct 16, 2023, 9:34 AM David Radley  wrote:

> Congratulations Ron!
>
> From: Jark Wu 
> Date: Sunday, 15 October 2023 at 18:57
> To: dev 
> Cc: ron9@gmail.com 
> Subject: [EXTERNAL] [ANNOUNCE] New Apache Flink Committer - Ron Liu
> Hi, everyone
>
> On behalf of the PMC, I'm very happy to announce Ron Liu as a new Flink
> Committer.
>
> Ron has been continuously contributing to the Flink project for many years,
> authored and reviewed a lot of codes. He mainly works on Flink SQL parts
> and drove several important FLIPs, e.g., USING JAR (FLIP-214), Operator
> Fusion CodeGen (FLIP-315), Runtime Filter (FLIP-324). He has a great
> knowledge of the Batch SQL and improved a lot of batch performance in the
> past several releases. He is also quite active in mailing lists,
> participating in discussions and answering user questions.
>
> Please join me in congratulating Ron Liu for becoming a Flink Committer!
>
> Best,
> Jark Wu (on behalf of the Flink PMC)
>
> Unless otherwise stated above:
>
> IBM United Kingdom Limited
> Registered in England and Wales with number 741598
> Registered office: PO Box 41, North Harbour, Portsmouth, Hants. PO6 3AU
>


Re: Close orphaned/stale PRs

2023-10-03 Thread Venkatakrishnan Sowrirajan
Gentle ping to surface this up for more discussions.

Regards
Venkata krishnan


On Tue, Sep 26, 2023 at 4:59 PM Venkatakrishnan Sowrirajan 
wrote:

> Hi Martijn,
>
> Agree with your point that closing a PR without any review feedback even
> after 'X' days is discouraging to a new contributor. I understand that this
> is a capacity problem. Capacity problem cannot be solved by this proposal
> and it is beyond the scope of this proposal.
>
> Regarding your earlier question,
> > What's the added value of
> closing these PRs
>
>- Having lots of inactive PRs lingering around shows the project is
>less active. I am not saying this is the only way to determine how active a
>project is, but this is one of the key factors.
>- A large number of PRs open can be discouraging for (new)
>contributors but on the other hand I agree closing an inactive PR without
>any reviews can also drive contributors away.
>
> Having said all of that, I agree closing PRs that don't have any reviews
> to start with should be avoided from the final proposal.
>
> > I'm +1 for (automatically) closing up PRs after X days which:
> a) Don't have a CI that has passed
> b) Don't follow the code contribution guide (like commit naming
> conventions)
> c) Have changes requested but aren't being followed-up by the contributor
>
> In general, I'm largely +1 on your above proposal except for the
> implementation feasibility.
>
> Also, I have picked a few other popular projects that have implemented the
> Github's actions stale rule to see if we can borrow some ideas. Below
> projects are listed in the order of the most invasive (for lack of a better
> word) to the least invasive actions taken wrt PR without any updates for a
> long period of time.
>
> 1. Trino
>
> TL;DR - No updates in the PR for the last 21 days, tag other maintainers
> for review. If there are no updates for 21 days after that, close the PR
> with this message - "*Closing this pull request, as it has been stale for
> six weeks. Feel free to re-open at any time.*"
> Trino's stale PR Github action rule (stale.yaml)
> <https://github.com/trinodb/trino/blob/master/.github/workflows/stale.yml>
>
>
> 2. Apache Spark
>
> TL;DR - No updates in the PR in the last 100 days, closing the PR with
> this message - "*We're closing this PR because it hasn't been updated in
> a while. This isn't a judgement on the merit of the PR in any way. It's
> just a way of keeping the PR queue manageable. If you'd like to revive this
> PR, please reopen it and ask a committer to remove the Stale tag!*"
> Spark's discussion in their mailing list
> <https://lists.apache.org/thread/yg3ggtvpt2dbjpnb2q0yblq30sc1g2yx> on
> closing stale PRs. Spark's stale PR github action rule (stale.yaml
> <https://github.com/apache/spark/blob/master/.github/workflows/stale.yml>
> ).
>
> 3. Python
>
> TL;DR - No updates in the PR for the last 30 days, then tag the PR as
> stale. Note: Python project *doesn't* close the stale PRs.
>
> Python discussion
> <https://discuss.python.org/t/decision-needed-should-we-close-stale-prs-and-how-many-lapsed-days-are-prs-considered-stale/4637>
> in the mailing list to close stale PRs. Python's stale PR github action
> rule (stale.yaml
> <https://github.com/python/cpython/blob/main/.github/workflows/stale.yml>)
>
> A few others: Apache Beam
> <https://github.com/apache/beam/blob/master/.github/workflows/stale.yml>
> (closes inactive PRs after 60+ days), Apache Airflow
> <https://github.com/apache/airflow/blob/main/.github/workflows/stale.yml>
> (closes inactive PRs after 50 days).
>
> Let me know what you think. Looking forward to hearing from others in the
> community and their experiences.
>
> [1] Github Action - Close Stale Issues -
> https://github.com/marketplace/actions/close-stale-issues
>
> Regards
> Venkata krishnan
>
>
> On Thu, Sep 21, 2023 at 6:03 AM Martijn Visser 
> wrote:
>
>> Hi all,
>>
>> I really believe that the problem of the number of open PRs is just
>> that there aren't enough reviewers/resources available to review them.
>>
>> > Stale PRs can clutter the repository, and closing them helps keep it
>> organized and ensures that only relevant and up-to-date PRs are present.
>>
>> Sure, but what's the indicator that the PR is stale? The fact that
>> there has been no reviewer yet to review it, doesn't mean that the PR
>> is stale. For me, a stale PR is a PR that has been reviewed, changes
>> have been requested and the contributor isn't participating in the
>> discussion anymore. But that's a different story compared to closing
>&

Re: Close orphaned/stale PRs

2023-09-26 Thread Venkatakrishnan Sowrirajan
her the PR will be
> continually worked on and eventually get a closure or not and therefore
> will be closed.
>
> Having other PRs being closed doesn't increase the guarantee that
> other PRs will be reviewed. It's still a capacity problem.
>
> > It would be demotivating for any contributor when there is no feedback
> for a PR within a sufficient period of time anyway.
>
> Definitely. But I think it would be even worse if someone makes a
> contribution, there is no response but after X days they get a message
> that their PR was closed automatically.
>
> I'm +1 for (automatically) closing up PRs after X days which:
> a) Don't have a CI that has passed
> b) Don't follow the code contribution guide (like commit naming
> conventions)
> c) Have changes requested but aren't being followed-up by the contributor
>
> I'm -1 for automatically closing PRs where no maintainers have taken a
> review for the reasons I've listed above.
>
> Best regards,
>
> Martijn
>
> On Wed, Sep 20, 2023 at 7:41 AM Venkatakrishnan Sowrirajan
>  wrote:
> >
> > Thanks for your response, Martijn.
> >
> > > What's the added value of
> > closing these PRs
> >
> > It mainly helps the project maintainers/reviewers to focus on only the
> > actively updated trimmed list of PRs that are ready for review.
> >
> > It gives Flink users who are waiting on a PR that enhances an existing
> > feature or fixes an issue a clear indication of whether the PR will
> > continue to be worked on and eventually reach closure, or will instead
> > be closed.
> >
> > Btw, I am open to other suggestions or enhancements on top of the
> proposal
> > as well.
> >
> > > it would
> > just close PRs where maintainers haven't been able to perform a
> > review, but getting a PR closed without any feedback is also
> > demotivating for a (potential new) contributor
> >
> > It would be demotivating for any contributor when there is no feedback
> > for a PR within a sufficient period of time anyway. I don't see how
> > closing a PR that is inactive after a sufficient period of time (say 60
> > to 90 days) would be any more discouraging than not getting any
> > feedback. The problem of not getting feedback due to insufficient
> > maintainer bandwidth has to be solved through other mechanisms.
> >
> > > I think the important
> > thing is that we get into a cycle where maintainers can see which PRs
> > are ready for review, and also a way to divide the bulk of the work.
> >
> > Yes, exactly my point as well. It helps the maintainers to see a trimmed
> > list which is ready to be reviewed.
> >
> > +1 for the other automation to nudge/help the contributor to fix the PR
> > so that it follows the contribution guide, passes CI checks, etc.
> >
> > > IIRC we can't really fix that until we can
> > finally move to dedicated Github Action Runners instead of the current
> > setup with Azure, but that's primarily blocked by ASF Infra.
> >
> > Curious if you can share the JIRA or prior discussion on this topic. I
> > would like to learn more about why GitHub Actions cannot be used for
> > Apache Flink.
> >
> > Regards
> > Venkata krishnan
> >
> >
> > On Tue, Sep 19, 2023 at 2:00 PM Martijn Visser  >
> > wrote:
> >
> > > Hi Venkata,
> > >
> > > Thanks for opening the discussion, I've been thinking about it quite a
> > > bit but I'm not sure what's the right approach.
> > >
> > > From your proposal, the question would be "What's the added value of
> > > closing these PRs"? I don't see an immediate value of that: it would
> > > just close PRs where maintainers haven't been able to perform a
> > > review, but getting a PR closed without any feedback is also
> > > demotivating for a (potential new) contributor. I think the important
> > > thing is that we get into a cycle where maintainers can see which PRs
> > > are ready for review, and also a way to divide the bulk of the work.
> > > Because doing proper reviews requires time, and these resources are
> > > scarce.
> > >
> > > I do think that we can make lives a bit easier with some automation:
> > > * There are a lot of PRs which don't follow the contribution guide (No
> > > Jira ticket, no correct commit message etc). For the externalized
> > > connector repositories, we've been trying Boring Cyborg to provide
> > > information back to contributors if their PRs are as expected. If the
> > > PR doesn'

Re: [VOTE] FLIP-327: Support switching from batch to stream mode to improve throughput when processing backlog data

2023-09-25 Thread Venkatakrishnan Sowrirajan
+1 (non-binding)

On Sun, Sep 24, 2023, 6:49 PM Xintong Song  wrote:

> +1 (binding)
>
> Best,
>
> Xintong
>
>
>
> On Sat, Sep 23, 2023 at 10:16 PM Yuepeng Pan 
> wrote:
>
> > +1(non-binding), thank you for driving this proposal.
> >
> > Best,
> > Yuepeng Pan.
> > At 2023-09-22 14:07:45, "Dong Lin"  wrote:
> > >Hi all,
> > >
> > >We would like to start the vote for FLIP-327: Support switching from
> batch
> > >to stream mode to improve throughput when processing backlog data [1].
> > This
> > >FLIP was discussed in this thread [2].
> > >
> > >The vote will be open until at least Sep 27th (at least 72
> > >hours), following the consensus voting process.
> > >
> > >Cheers,
> > >Xuannan and Dong
> > >
> > >[1]
> > >
> >
> https://urldefense.com/v3/__https://cwiki.apache.org/confluence/display/FLINK/FLIP-327*3A*Support*switching*from*batch*to*stream*mode*to*improve*throughput*when*processing*backlog*data__;JSsrKysrKysrKysrKysr!!IKRxdwAv5BmarQ!ZP19r7-3IBaBI-kV0olZvaz5a2TFN3uxge2TJM7WQvovjfRbOl71NaC3SEh_UEJmH7Lssqu0bx4FKResPPc7$
> > >[2]
> https://urldefense.com/v3/__https://lists.apache.org/thread/29nvjt9sgnzvs90browb8r6ng31dcs3n__;!!IKRxdwAv5BmarQ!ZP19r7-3IBaBI-kV0olZvaz5a2TFN3uxge2TJM7WQvovjfRbOl71NaC3SEh_UEJmH7Lssqu0bx4FKZEAz9yp$
> >
>


Re: [DISCUSS] FLIP-356: Support Nested Fields Filter Pushdown

2023-09-22 Thread Venkatakrishnan Sowrirajan
Thanks Martijn.

Regards
Venkata krishnan


On Fri, Sep 22, 2023 at 1:51 AM Martijn Visser 
wrote:

> Sure thing, I've made the necessary changes. Thnx for clarifying
>
> On Thu, Sep 21, 2023 at 8:24 PM Venkatakrishnan Sowrirajan
>  wrote:
> >
> > Got it, Martijn.
> >
> > Unfortunately, I don't have edit access to the already created JIRA -
> > FLINK-20767 <
> https://urldefense.com/v3/__https://issues.apache.org/jira/browse/FLINK-20767__;!!IKRxdwAv5BmarQ!c8v3R7qHJbTKwloBaUOXL-W6HQU65q11mB4vgpETHmUEkDx0nPNXpt_ZZceCQtPpnAgrau3ua42_YMRK6jvX0h8h$
> >. If you can
> > remove the task from the EPIC FLINK-16987 FLIP-95: Add new table source
> and
> > sink interfaces <
> https://urldefense.com/v3/__https://issues.apache.org/jira/browse/FLINK-16987__;!!IKRxdwAv5BmarQ!c8v3R7qHJbTKwloBaUOXL-W6HQU65q11mB4vgpETHmUEkDx0nPNXpt_ZZceCQtPpnAgrau3ua42_YMRK6t5j3KbX$
> >, can
> > you please change it?
> >
> > If not, I can open a new ticket, close this one and link the 2 tickets as
> > duplicated by.
> >
> > Regards
> > Venkata krishnan
> >
> >
> > On Thu, Sep 21, 2023 at 12:40 AM Martijn Visser <
> martijnvis...@apache.org>
> > wrote:
> >
> > > Hi Venkatakrishnan,
> > >
> > > The reason why I thought it's abandoned is that the Jira ticket is
> > > part of the umbrella ticket for FLIP-95. Let's move the Jira ticket to
> > > its own dedicated task instead of nested to a FLIP-95 ticket.
> > >
> > > Thanks,
> > >
> > > Martijn
> > >
> > > On Wed, Sep 20, 2023 at 4:34 PM Becket Qin 
> wrote:
> > > >
> > > > Hi Martijn,
> > > >
> > > > This FLIP has passed voting[1]. It is a modification on top of the
> > > FLIP-95
> > > > interface.
> > > >
> > > > Thanks,
> > > >
> > > > Jiangjie (Becket) Qin
> > > >
> > > > [1]
> > >
> https://urldefense.com/v3/__https://lists.apache.org/thread/hysv9y1f48gtpr5vx3x40wtjb6cp9ky6__;!!IKRxdwAv5BmarQ!f6gg5kDGU-9tqM2rGpbK81L_7sOscG4aVKSjp0RchK1pevsMz_YrhlStS1tXduiLHoxjMP8_BjpJYrQc28-X111i$
> > > >
> > > > On Wed, Sep 20, 2023 at 9:29 PM Martijn Visser <
> martijnvis...@apache.org
> > > >
> > > > wrote:
> > > >
> > > > > For clarity purposes, this FLIP is being abandoned because it was
> part
> > > > > of FLIP-95?
> > > > >
> > > > > On Thu, Sep 7, 2023 at 3:01 AM Venkatakrishnan Sowrirajan
> > > > >  wrote:
> > > > > >
> > > > > > Hi everyone,
> > > > > >
> > > > > > Posted a PR (
> > >
> https://urldefense.com/v3/__https://github.com/apache/flink/pull/23313__;!!IKRxdwAv5BmarQ!f6gg5kDGU-9tqM2rGpbK81L_7sOscG4aVKSjp0RchK1pevsMz_YrhlStS1tXduiLHoxjMP8_BjpJYrQc2-0vM_Ac$
> > > ) to add nested
> > > > > > fields filter pushdown. Please review. Thanks.
> > > > > >
> > > > > > Regards
> > > > > > Venkata krishnan
> > > > > >
> > > > > >
> > > > > > On Tue, Sep 5, 2023 at 10:04 PM Venkatakrishnan Sowrirajan <
> > > > > vsowr...@asu.edu>
> > > > > > wrote:
> > > > > >
> > > > > > > Based on an offline discussion with Becket Qin, I added
> > > *fieldIndices *
> > > > > > > back which is the field index of the nested field at every
> level to
> > > > > the *NestedFieldReferenceExpression
> > > > > > > *in FLIP-356
> > > > > > > <
> > > > >
> > >
> https://urldefense.com/v3/__https://cwiki.apache.org/confluence/display/FLINK/FLIP-356*3A*Support*Nested*Fields*Filter*Pushdown__;JSsrKysr!!IKRxdwAv5BmarQ!f6gg5kDGU-9tqM2rGpbK81L_7sOscG4aVKSjp0RchK1pevsMz_YrhlStS1tXduiLHoxjMP8_BjpJYrQc27ttpcO0$
> > > > > >
> > > > > > > *. *2 reasons to do it:
> > > > > > >
> > > > > > > 1. Agree with using *fieldIndices *as the only contract to
> refer
> > > to the
> > > > > > > column from the underlying datasource.
> > > > > > > 2. To keep it consistent with *FieldReferenceExpression*
> > > > > > >
> > > > > > > Having said that, I see that with *projection pushdown, *index
> of
> > > the
> > > > > > > fields are used whereas with *filter pushdown (*based on
> sc

Re: [DISCUSS] FLIP-356: Support Nested Fields Filter Pushdown

2023-09-21 Thread Venkatakrishnan Sowrirajan
Got it, Martijn.

Unfortunately, I don't have edit access to the already created JIRA -
FLINK-20767 <https://issues.apache.org/jira/browse/FLINK-20767>. If you can
remove the task from the EPIC FLINK-16987 FLIP-95: Add new table source and
sink interfaces <https://issues.apache.org/jira/browse/FLINK-16987>, can
you please change it?

If not, I can open a new ticket, close this one and link the 2 tickets as
duplicated by.

Regards
Venkata krishnan


On Thu, Sep 21, 2023 at 12:40 AM Martijn Visser 
wrote:

> Hi Venkatakrishnan,
>
> The reason why I thought it's abandoned is that the Jira ticket is
> part of the umbrella ticket for FLIP-95. Let's move the Jira ticket to
> its own dedicated task instead of nested to a FLIP-95 ticket.
>
> Thanks,
>
> Martijn
>
> On Wed, Sep 20, 2023 at 4:34 PM Becket Qin  wrote:
> >
> > Hi Martijn,
> >
> > This FLIP has passed voting[1]. It is a modification on top of the
> FLIP-95
> > interface.
> >
> > Thanks,
> >
> > Jiangjie (Becket) Qin
> >
> > [1]
> https://urldefense.com/v3/__https://lists.apache.org/thread/hysv9y1f48gtpr5vx3x40wtjb6cp9ky6__;!!IKRxdwAv5BmarQ!f6gg5kDGU-9tqM2rGpbK81L_7sOscG4aVKSjp0RchK1pevsMz_YrhlStS1tXduiLHoxjMP8_BjpJYrQc28-X111i$
> >
> > On Wed, Sep 20, 2023 at 9:29 PM Martijn Visser  >
> > wrote:
> >
> > > For clarity purposes, this FLIP is being abandoned because it was part
> > > of FLIP-95?
> > >
> > > On Thu, Sep 7, 2023 at 3:01 AM Venkatakrishnan Sowrirajan
> > >  wrote:
> > > >
> > > > Hi everyone,
> > > >
> > > > Posted a PR (
> https://urldefense.com/v3/__https://github.com/apache/flink/pull/23313__;!!IKRxdwAv5BmarQ!f6gg5kDGU-9tqM2rGpbK81L_7sOscG4aVKSjp0RchK1pevsMz_YrhlStS1tXduiLHoxjMP8_BjpJYrQc2-0vM_Ac$
> ) to add nested
> > > > fields filter pushdown. Please review. Thanks.
> > > >
> > > > Regards
> > > > Venkata krishnan
> > > >
> > > >
> > > > On Tue, Sep 5, 2023 at 10:04 PM Venkatakrishnan Sowrirajan <
> > > vsowr...@asu.edu>
> > > > wrote:
> > > >
> > > > > Based on an offline discussion with Becket Qin, I added
> *fieldIndices *
> > > > > back which is the field index of the nested field at every level to
> > > the *NestedFieldReferenceExpression
> > > > > *in FLIP-356
> > > > > <
> > >
> https://urldefense.com/v3/__https://cwiki.apache.org/confluence/display/FLINK/FLIP-356*3A*Support*Nested*Fields*Filter*Pushdown__;JSsrKysr!!IKRxdwAv5BmarQ!f6gg5kDGU-9tqM2rGpbK81L_7sOscG4aVKSjp0RchK1pevsMz_YrhlStS1tXduiLHoxjMP8_BjpJYrQc27ttpcO0$
> > > >
> > > > > *. *2 reasons to do it:
> > > > >
> > > > > 1. Agree with using *fieldIndices *as the only contract to refer
> to the
> > > > > column from the underlying datasource.
> > > > > 2. To keep it consistent with *FieldReferenceExpression*
> > > > >
> > > > > Having said that, I see that with *projection pushdown, *index of
> the
> > > > > fields are used whereas with *filter pushdown (*based on scanning
> few
> > > > > tablesources) *FieldReferenceExpression*'s name is used for eg:
> even in
> > > > > Flink's *FileSystemTableSource, IcebergSource, JDBCDatasource*. This
> > > > > way, I feel the contract is not quite clear and explicit. Wanted to
> > > > > understand other's thoughts as well.
> > > > >
> > > > > Regards
> > > > > Venkata krishnan
> > > > >
> > > > >
> > > > > On Tue, Sep 5, 2023 at 5:34 PM Becket Qin 
> > > wrote:
> > > > >
> > > > >> Hi Venkata,
> > > > >>
> > > > >>
> > > > >> > Also I made minor changes to the
> *NestedFieldReferenceExpression,
> > > > >> *instead
> > > > >> > of *fieldIndexArray* we can just do away with *fieldNames *array
> > > that
> > > > >> > includes fieldName at every level for the nested field.
> > > > >>
> > > > >>
> > > > >> I don't think keeping only the field names array would work. At
> the
> > > end of
> > > > >> the day, the contract between Flink SQL and the connectors is
> based
> > > on the
> > > > >> indexes, not the names. Technically speaking, the connectors only
> > > emit a

Re: [DISCUSS] FLIP-356: Support Nested Fields Filter Pushdown

2023-09-20 Thread Venkatakrishnan Sowrirajan
No, this is not abandoned. It was accepted with enough binding votes. I
didn't get why you think this FLIP is abandoned. Could you please
clarify?

On Wed, Sep 20, 2023, 6:11 AM Martijn Visser 
wrote:

> For clarity purposes, this FLIP is being abandoned because it was part
> of FLIP-95?
>
> On Thu, Sep 7, 2023 at 3:01 AM Venkatakrishnan Sowrirajan
>  wrote:
> >
> > Hi everyone,
> >
> > Posted a PR (
> https://urldefense.com/v3/__https://github.com/apache/flink/pull/23313__;!!IKRxdwAv5BmarQ!bKDafb_iRxZNnrFgbAXlTZnZVaYGrfO5tgH2Ppe_aWDCOqcYA1yTESHMqC0GH-MNhpWzu3TufV13bcBohNmkCEFj$
> ) to add nested
> > fields filter pushdown. Please review. Thanks.
> >
> > Regards
> > Venkata krishnan
> >
> >
> > On Tue, Sep 5, 2023 at 10:04 PM Venkatakrishnan Sowrirajan <
> vsowr...@asu.edu>
> > wrote:
> >
> > > Based on an offline discussion with Becket Qin, I added *fieldIndices *
> > > back which is the field index of the nested field at every level to
> the *NestedFieldReferenceExpression
> > > *in FLIP-356
> > > <
> https://urldefense.com/v3/__https://cwiki.apache.org/confluence/display/FLINK/FLIP-356*3A*Support*Nested*Fields*Filter*Pushdown__;JSsrKysr!!IKRxdwAv5BmarQ!bKDafb_iRxZNnrFgbAXlTZnZVaYGrfO5tgH2Ppe_aWDCOqcYA1yTESHMqC0GH-MNhpWzu3TufV13bcBohH8nNZ1n$
> >
> > > *. *2 reasons to do it:
> > >
> > > 1. Agree with using *fieldIndices *as the only contract to refer to the
> > > column from the underlying datasource.
> > > 2. To keep it consistent with *FieldReferenceExpression*
> > >
> > > Having said that, I see that with *projection pushdown, *index of the
> > > fields are used whereas with *filter pushdown (*based on scanning few
> > > tablesources) *FieldReferenceExpression*'s name is used for eg: even in
> > Flink's *FileSystemTableSource, IcebergSource, JDBCDatasource*. This
> > > way, I feel the contract is not quite clear and explicit. Wanted to
> > > understand other's thoughts as well.
> > >
> > > Regards
> > > Venkata krishnan
> > >
> > >
> > > On Tue, Sep 5, 2023 at 5:34 PM Becket Qin 
> wrote:
> > >
> > >> Hi Venkata,
> > >>
> > >>
> > >> > Also I made minor changes to the *NestedFieldReferenceExpression,
> > >> *instead
> > >> > of *fieldIndexArray* we can just do away with *fieldNames *array
> that
> > >> > includes fieldName at every level for the nested field.
> > >>
> > >>
> > >> I don't think keeping only the field names array would work. At the
> end of
> > >> the day, the contract between Flink SQL and the connectors is based
> on the
> > >> indexes, not the names. Technically speaking, the connectors only
> emit a
> > >> bunch of RowData which is based on positions. The field names are
> added by
> > >> the SQL framework via the DDL for those RowData. In this sense, the
> > >> connectors may not be aware of the field names in Flink DDL at all.
> The
> > >> common language between Flink SQL and source is just positions. This
> is
> > >> also why ProjectionPushDown would work by only relying on the
> indexes, not
> > >> the field names. So I think the field index array is a must have here
> in
> > >> the NestedFieldReferenceExpression.
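The positional contract described above can be illustrated with a toy resolver (plain Python, not Flink's actual RowData API; the schema and helper name are invented):

```python
def resolve_nested_field(row, index_path):
    """Walk a positional row down an index path such as [2, 1, 0].

    Only indexes appear here: like RowData, the row carries no field
    names, so an index path at every nesting level is the whole
    contract between the planner and the connector.
    """
    value = row
    for idx in index_path:
        value = value[idx]
    return value

# Illustrative schema: (id, name, address := (city, geo := (lat, lon)))
row = (42, "flink", ("berlin", (52.52, 13.40)))
print(resolve_nested_field(row, [2, 1, 0]))  # 52.52
```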
> > >>
> > >> Thanks,
> > >>
> > >> Jiangjie (Becket) Qin
> > >>
> > >> On Fri, Sep 1, 2023 at 8:12 AM Venkatakrishnan Sowrirajan <
> > >> vsowr...@asu.edu>
> > >> wrote:
> > >>
> > >> > Gentle ping on the vote for FLIP-356: Support Nested fields filter
> > >> pushdown
> > >> > <
> > >>
> https://urldefense.com/v3/__https://www.mail-archive.com/dev@flink.apache.org/msg69289.html__;!!IKRxdwAv5BmarQ!bOW26WlafOQQcb32eWtUiXBAl0cTCK1C6iYhDI2f_z__eczudAWmTRvjDiZg6gzlXmPXrDV4KJS5cFxagFE$
> > >> >.
> > >> >
> > >> > Regards
> > >> > Venkata krishnan
> > >> >
> > >> >
> > >> > On Tue, Aug 29, 2023 at 9:18 PM Venkatakrishnan Sowrirajan <
> > >> > vsowr...@asu.edu>
> > >> > wrote:
> > >> >
> > >> > > Sure, will reference this discussion to resume where we started as
> > >> part
> > >> > of
> > >> > > the flip to refactor SupportsProjectionPushDown.
> > >> > >

Re: Close orphaned/stale PRs

2023-09-19 Thread Venkatakrishnan Sowrirajan
Thanks for your response, Martijn.

> What's the added value of
closing these PRs

It mainly helps the project maintainers/reviewers to focus only on the
actively updated, trimmed list of PRs that are ready for review.

It gives Flink users who are waiting on a PR that enhances an existing
feature or fixes an issue a clear indication of whether the PR will be
continually worked on or will eventually be closed.

Btw, I am open to other suggestions or enhancements on top of the proposal
as well.

> it would
just close PRs where maintainers haven't been able to perform a
review, but getting a PR closed without any feedback is also
demotivating for a (potential new) contributor

It would be demotivating for any contributor when there is no feedback for
a PR within a sufficient period of time anyway. I don't see how closing a
PR that is inactive after a sufficient period of time (say 60 to 90 days)
would be any more discouraging than not getting any feedback. The problem
of not getting feedback due to insufficient maintainer bandwidth has to be
solved through other mechanisms.

> I think the important
thing is that we get into a cycle where maintainers can see which PRs
are ready for review, and also a way to divide the bulk of the work.

Yes, exactly my point as well. It helps the maintainers to see a trimmed
list which is ready to be reviewed.

+1 for the other automation to nudge/help the contributor to fix the PR so
that it follows the contribution guide, passes CI checks, etc.

> IIRC we can't really fix that until we can
finally move to dedicated Github Action Runners instead of the current
setup with Azure, but that's primarily blocked by ASF Infra.

Curious if you can share the JIRA or prior discussion on this topic. I
would like to learn more about why GitHub Actions cannot be used for Apache
Flink.

Regards
Venkata krishnan


On Tue, Sep 19, 2023 at 2:00 PM Martijn Visser 
wrote:

> Hi Venkata,
>
> Thanks for opening the discussion, I've been thinking about it quite a
> bit but I'm not sure what's the right approach.
>
> From your proposal, the question would be "What's the added value of
> closing these PRs"? I don't see an immediate value of that: it would
> just close PRs where maintainers haven't been able to perform a
> review, but getting a PR closed without any feedback is also
> demotivating for a (potential new) contributor. I think the important
> thing is that we get into a cycle where maintainers can see which PRs
> are ready for review, and also a way to divide the bulk of the work.
> Because doing proper reviews requires time, and these resources are
> scarce.
>
> I do think that we can make lives a bit easier with some automation:
> * There are a lot of PRs which don't follow the contribution guide (No
> Jira ticket, no correct commit message etc). For the externalized
> connector repositories, we've been trying Boring Cyborg to provide
> information back to contributors if their PRs are as expected. If the
> PR doesn't follow the contribution guide, I'm inclined to give such a
> PR less review attention. That's primarily because there are other PRs
> out there that do follow these guides.
> * There are even more PRs where the CI has failed: in those cases, a
> review also makes less sense, given that the PR can't be merged as is.
> I do see that contributors sometimes don't know where to look for the
> status of the CI, but IIRC we can't really fix that until we can
> finally move to dedicated Github Action Runners instead of the current
> setup with Azure, but that's primarily blocked by ASF Infra.
>
> I'm curious what others in the community think.
>
> Best regards,
>
> Martijn
>
> On Tue, Sep 19, 2023 at 10:33 PM Venkatakrishnan Sowrirajan
>  wrote:
> >
> > Hi Flink devs,
> >
> > There are currently over 1,000 open pull requests
> > <
> https://urldefense.com/v3/__https://github.com/apache/flink/pulls?q=is*3Aopen*is*3Apr*sort*3Aupdated-asc__;JSslKyU!!IKRxdwAv5BmarQ!eG42_TepSvUmlfJU0BqDRdhjGzm0eyim7YuyxEuawv0TakQL8aCI3EkbRc0ktXoGbZxFQsyB4uHYVM2yfzhRnomS$
> >
> > (PRs) in the Apache Flink repository, with only 162 having been updated
> in
> > the last two months
> > <
> https://urldefense.com/v3/__https://github.com/apache/flink/pulls?q=is*3Aopen*is*3Apr*sort*3Aupdated-asc*updated*3A*3E2023-07-19__;JSslKyUrJSU!!IKRxdwAv5BmarQ!eG42_TepSvUmlfJU0BqDRdhjGzm0eyim7YuyxEuawv0TakQL8aCI3EkbRc0ktXoGbZxFQsyB4uHYVM2yf4kpxy62$
> >.
> > This means that more than 85% of the PRs are stale and have not been
> > touched.
> >
> > I suggest setting up GitHub Actions to monitor these stale PRs, and
> > automatically closing them if they have not been updated in the last 'x'
> > days. Authors would still be able to reopen the closed PRs if they
> continue
> > with their work. This would help to keep the PR queue manageable.
> >
> > Not sure if this has been discussed in the Apache Flink community before.
> >
> > Thoughts?
> >
> > Regards
> > Venkata krishnan
>


Close orphaned/stale PRs

2023-09-19 Thread Venkatakrishnan Sowrirajan
Hi Flink devs,

There are currently over 1,000 open pull requests (PRs) in the Apache Flink
repository, with only 162 having been updated in the last two months.
This means that more than 85% of the PRs are stale and have not been
touched.

I suggest setting up GitHub Actions to monitor these stale PRs, and
automatically closing them if they have not been updated in the last 'x'
days. Authors would still be able to reopen the closed PRs if they continue
with their work. This would help to keep the PR queue manageable.
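A rough sketch of the selection step such an automation would run; PR numbers, dates, and the helper name are invented for illustration, and selection remains reversible since authors can reopen a closed PR:

```python
from datetime import datetime, timedelta

def select_stale_prs(prs, now, max_idle_days=90):
    """Return the numbers of PRs not updated in the last `max_idle_days`.

    `prs` is a list of dicts with a `number` and a `last_updated`
    datetime, mirroring the fields a GitHub Actions job would read from
    the API. This only selects candidates; it does not close anything.
    """
    cutoff = now - timedelta(days=max_idle_days)
    return [pr["number"] for pr in prs if pr["last_updated"] < cutoff]

now = datetime(2023, 9, 19)
prs = [
    {"number": 23164, "last_updated": datetime(2023, 8, 20)},  # recently active
    {"number": 18000, "last_updated": datetime(2022, 11, 1)},  # untouched for ~10 months
]
print(select_stale_prs(prs, now))  # [18000]
```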

Not sure if this has been discussed in the Apache Flink community before.

Thoughts?

Regards
Venkata krishnan


Re: [VOTE] FLIP-361: Improve GC Metrics

2023-09-13 Thread Venkatakrishnan Sowrirajan
+1 (non-binding)

On Wed, Sep 13, 2023, 6:55 PM Matt Wang  wrote:

> +1 (non-binding)
>
>
> Thanks for driving this FLIP
>
>
>
>
> --
>
> Best,
> Matt Wang
>
>
>  Replied Message 
> | From | Xintong Song |
> | Date | 09/14/2023 09:54 |
> | To |  |
> | Subject | Re: [VOTE] FLIP-361: Improve GC Metrics |
> +1 (binding)
>
> Best,
>
> Xintong
>
>
>
> On Thu, Sep 14, 2023 at 2:40 AM Samrat Deb  wrote:
>
> +1 ( non binding)
>
> These improved GC metrics will be a great addition.
>
> Bests,
> Samrat
>
> On Wed, 13 Sep 2023 at 7:58 PM, ConradJam  wrote:
>
> +1 (non-binding)
> gc metrics help with autoscale tuning features
>
Chen Zhanghao  wrote on Wed, Sep 13, 2023, at 22:16:
>
+1 (non-binding). Looking forward to it
>
> Best,
> Zhanghao Chen
> 
> From: Gyula Fóra 
> Sent: September 13, 2023 21:16
> To: dev 
> Subject: [VOTE] FLIP-361: Improve GC Metrics
>
> Hi All!
>
> Thanks for all the feedback on FLIP-361: Improve GC Metrics [1][2]
>
> I'd like to start a vote for it. The vote will be open for at least 72
> hours unless there is an objection or insufficient votes.
>
> Cheers,
> Gyula
>
> [1]
>
>
>
>
> https://urldefense.com/v3/__https://cwiki.apache.org/confluence/display/FLINK/FLIP-361*3A*Improve*GC*Metrics__;JSsrKw!!IKRxdwAv5BmarQ!dpHSOqsSHlPJ5gCvZ2yxSGjcR4xA2N-mpGZ1w2jPuKb78aujNpbzENmi1e7B26d6v4UQ8bQZ7IQaUcI$
> [2]
> https://urldefense.com/v3/__https://lists.apache.org/thread/qqqv54vyr4gbp63wm2d12q78m8h95xb2__;!!IKRxdwAv5BmarQ!dpHSOqsSHlPJ5gCvZ2yxSGjcR4xA2N-mpGZ1w2jPuKb78aujNpbzENmi1e7B26d6v4UQ8bQZFdEMnAg$
>
>
>
> --
> Best
>
> ConradJam
>
>
>


Re: [DISCUSS] FLIP-361: Improve GC Metrics

2023-09-13 Thread Venkatakrishnan Sowrirajan
Hi Gyula,

Thanks for driving this FLIP.

The proposal looks good to me. One minor suggestion: can we also include
the percentage of time spent in GC relative to the overall CPU time? This
is especially useful for TMs, where it helps in easily identifying
GC-related issues. Thoughts?

Regards
Venkata krishnan


On Wed, Sep 13, 2023 at 6:13 AM Gyula Fóra  wrote:

> Thanks for all the feedback, I will start the vote on this.
>
> Gyula
>
> On Wed, Sep 6, 2023 at 11:11 AM Xintong Song 
> wrote:
>
> > >
> > > I added the average time metric to the FLIP document. I also included
> it
> > > for the aggregate (total) across all collectors. But maybe it doesn't
> > make
> > > too much sense as collection times usually differ greatly depending on
> > the
> > > collector.
> > >
> >
> > LGTM
> >
> >
> > Best,
> >
> > Xintong
> >
> >
> >
> > On Wed, Sep 6, 2023 at 4:31 PM Gyula Fóra  wrote:
> >
> > > I added the average time metric to the FLIP document. I also included
> it
> > > for the aggregate (total) across all collectors. But maybe it doesn't
> > make
> > > too much sense as collection times usually differ greatly depending on
> > the
> > > collector.
> > >
> > > Gyula
> > >
> > > On Wed, Sep 6, 2023 at 10:21 AM Xintong Song 
> > > wrote:
> > >
> > > > Thank you :)
> > > >
> > > > Best,
> > > >
> > > > Xintong
> > > >
> > > >
> > > >
> > > > On Wed, Sep 6, 2023 at 4:17 PM Gyula Fóra 
> > wrote:
> > > >
> > > > > Makes sense Xintong, I am happy to extend the proposal with the
> > average
> > > > gc
> > > > > time metric +1
> > > > >
> > > > > Gyula
> > > > >
> > > > > On Wed, Sep 6, 2023 at 10:09 AM Xintong Song <
> tonysong...@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > >
> > > > > > > Just so I understand correctly, do you suggest adding a metric
> > for
> > > > > > > delta(Time) / delta(Count) since the last reporting ?
> > > > > > > .TimePerGc or .AverageTime would make
> > sense.
> > > > > > > AverageTime may be a bit nicer :)
> > > > > > >
> > > > > >
> > > > > > Yes, that's what I mean.
> > > > > >
> > > > > > My only concern is how useful this will be in reality. If there
> are
> > > > only
> > > > > > > (or several) long pauses then the msPerSec metrics will show it
> > > > > already,
> > > > > > > and if there is a single long pause that may not be shown at
> all
> > if
> > > > > there
> > > > > > > are several shorter pauses as well with this metric.
> > > > > >
> > > > > >
> > > > > > Let's say we measure this for every minute and see a 900 msPerSec
> > > > (which
> > > > > > means 54s within the minute are spent on GC). This may come from
> a
> > > > single
> > > > > > GC that lasts for 54s, or 2 GCs each lasting for ~27s, or more
> GCs
> > > with
> > > > > > less time each. As the default heartbeat timeout is 50s, the
> former
> > > > means
> > > > > > there's likely a heartbeat timeout due to the GC pause, while the
> > > > latter
> > > > > > means otherwise.
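The distinction above can be seen with a small calculation over the cumulative (count, time) counters that GarbageCollectorMXBean exposes; the snapshot values and metric names here are made up to match the 54s example, not Flink's actual metric names:

```python
def gc_rate_metrics(prev, curr, interval_s):
    """Derive rate metrics from two snapshots of the cumulative
    (collection_count, collection_time_ms) pair.

    ms_per_sec is the msPerSec-style rate already discussed;
    avg_time_ms is the proposed delta(Time) / delta(Count) average
    pause length for the same measurement period.
    """
    d_count = curr[0] - prev[0]
    d_time_ms = curr[1] - prev[1]
    ms_per_sec = d_time_ms / interval_s
    avg_time_ms = d_time_ms / d_count if d_count else 0.0
    return ms_per_sec, avg_time_ms

# 54s of GC within one minute: a single 54s pause and 27 pauses of 2s
# each produce the same msPerSec but very different average pauses.
print(gc_rate_metrics((10, 100_000), (11, 154_000), 60))  # (900.0, 54000.0)
print(gc_rate_metrics((10, 100_000), (37, 154_000), 60))  # (900.0, 2000.0)
```

Only the first case (average pause of 54s) would exceed the default 50s heartbeat timeout, which msPerSec alone cannot reveal.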
> > > > > >
> > > > > >
> > > > > > Mathematically, it is possible that there's 1 long pause together
> > > with
> > > > > > several short pauses within the same measurement period, making
> the
> > > > long
> > > > > > pause not observable with AverageTime. However, from my
> experience,
> > > > such
> > > > > a
> > > > > > pattern is not normal in reality. My observation is that GCs
> happen
> > > at
> > > > a
> > > > > > similar time usually take a similar length of time. Admittedly,
> > this
> > > is
> > > > > not
> > > > > > a hard guarantee.
> > > > > >
> > > > > >
> > > > > > Best,
> > > > > >
> > > > > > Xintong
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Wed, Sep 6, 2023 at 3:59 PM Gyula Fóra 
> > > > wrote:
> > > > > >
> > > > > > > Matt Wang,
> > > > > > >
> > > > > > > I think the currently exposed info is all that is available
> > through
> > > > > > > GarbageCollectorMXBean. This FLIP does not aim to introduce a
> new
> > > > more
> > > > > > > granular way of reporting the per collector metrics, that would
> > > > > require a
> > > > > > > new mechanism and may be a breaking change.
> > > > > > >
> > > > > > > We basically want to simply extend the current reporting here
> > with
> > > > the
> > > > > > rate
> > > > > > > metrics and the total metrics.
> > > > > > >
> > > > > > > Gyula
> > > > > > >
> > > > > > > On Wed, Sep 6, 2023 at 9:24 AM Matt Wang 
> > wrote:
> > > > > > >
> > > > > > > > Hi Gyula,
> > > > > > > >
> > > > > > > > +1 for this proposal.
> > > > > > > >
> > > > > > > > Do we need to add a metric to record the count of different
> > > > > > > > collectors? Now there is only a total count. For example,
> > > > > > > > for G1, there is no way to distinguish whether it is the
> > > > > > > > young generation or the old generation.
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > --
> > > > > > > >
> > > > > > > > Best,
> > > > > > > > Matt Wang
> > > > > > > >
> > > > > > > >
> > > > > > > >  Replied Message 
> > > > > > > > | From | Gyula Fóra |
> > > > > > > > | Date | 09/6/2023 

Re: [VOTE] FLIP-355: Add parent dir of files to classpath using yarn.provided.lib.dirs

2023-09-13 Thread Venkatakrishnan Sowrirajan
Thanks for driving this FLIP, Archit. +1 (non-binding)

Regards
Venkata krishnan


On Wed, Sep 13, 2023 at 10:47 AM Archit Goyal 
wrote:

> Hi everyone,
>
> Thanks for reviewing the FLIP-355 Add parent dir of files to classpath
> using yarn.provided.lib.dirs :
>
> https://urldefense.com/v3/__https://cwiki.apache.org/confluence/display/FLINK/FLIP*355*3A*Add*parent*dir*of*files*to*classpath*using*yarn.provided.lib.dirs__;KyUrKysrKysrKys!!IKRxdwAv5BmarQ!YSiXPEWg6eTjX8YpOOMp_5Qt56yawUg5mjBsdMS4XgrzA9st0aIeaYNZF9DTuNHI-ThSZE7F49ECm52VIR10wMur$
>
> Following is the discussion thread :
>
> https://urldefense.com/v3/__https://lists.apache.org/thread/gv0ro4jsq4o206wg5gz9z5cww15qkvb9__;!!IKRxdwAv5BmarQ!YSiXPEWg6eTjX8YpOOMp_5Qt56yawUg5mjBsdMS4XgrzA9st0aIeaYNZF9DTuNHI-ThSZE7F49ECm52VIZr7rK1T$
>
> I'd like to start a vote for it. The vote will be open for at least 72
> hours (until September 15, 12:00AM GMT) unless there is an objection or an
> insufficient number of votes.
>
> Thanks,
> Archit Goyal
>


Re: [Discuss] FLIP-355: Add parent dir of files to classpath using yarn.provided.lib.dirs

2023-09-13 Thread Venkatakrishnan Sowrirajan
In this case, we required hive-site.xml, which is available only on the
client side and not on all the TM nodes. But the workaround you mentioned
for HADOOP_CONF_DIR should work.
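The core idea of the FLIP discussed here — putting each shipped file's parent directory on the classpath so resources such as hive-site.xml become loadable — boils down to something like the following sketch; the paths and helper name are invented, and the real change lives in Flink's YARN classpath assembly:

```python
from pathlib import PurePosixPath

def parent_dirs_for_classpath(shipped_files):
    """Collect each shipped file's parent directory, deduplicated and
    in first-seen order, so config files in those directories can be
    found as classpath resources on the TM side.
    """
    seen, result = set(), []
    for f in shipped_files:
        parent = str(PurePosixPath(f).parent)
        if parent not in seen:
            seen.add(parent)
            result.append(parent)
    return result

files = ["/opt/conf/hive-site.xml", "/opt/conf/core-site.xml", "/opt/lib/udf.jar"]
print(parent_dirs_for_classpath(files))  # ['/opt/conf', '/opt/lib']
```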

On Tue, Sep 12, 2023, 7:35 PM Biao Geng  wrote:

> +1 for the FLIP.
> Another side thought is that in my experience, when users want to make
> Flink TM use a specific hadoop/hive configuration, an easier way is to ship
> the corresponding conf dir and set the env
> variable via containerized.taskmanager.env.HADOOP_CONF_DIR.
>
> Best,
> Biao Geng
>
> Archit Goyal  wrote on Wed, Sep 13, 2023, at 08:00:
>
> > Hi All,
> >
> > If there are no further concerns about this FLIP, I will start a vote
> > thread tomorrow.
> >
> > Thanks,
> > Archit Goyal
> >
> >
> > From: Venkatakrishnan Sowrirajan 
> > Date: Thursday, August 24, 2023 at 10:21 PM
> > To: dev@flink.apache.org 
> > Subject: Re: [Discuss] FLIP-355: Add parent dir of files to classpath
> > using yarn.provided.lib.dirs
> > Thanks Yang Wang.
> >
> > In that case, whenever you get a chance could you please review the PR?
> >
> >
> > On Wed, Aug 23, 2023, 8:15 PM Yang Wang  wrote:
> >
> > > +1 for this FLIP
> > >
> > > Maybe a FLIP is an overkill for this enhancement.
> > >
> > >
> > > Best,
> > > Yang
> > >
> > > Venkatakrishnan Sowrirajan  wrote on Wed, Aug 23, 2023, at 01:44:
> > >
> > > > Thanks for the FLIP, Archit.
> > > >
> > > > This is definitely quite a useful addition to the
> > > *yarn.provided.lib.dirs*
> > > > . +1.
> > > >
> > > > IMO, except for the fact that *yarn.provided.lib.dirs* (platform
> > specific
> > > > jars can be cached) takes only directories whereas *yarn.ship-files*
> > > (user
> > > > files) takes both files and dirs, the overall logic in terms of
> > > > constructing the classpath in both the cases should be roughly the
> > same.
> > > >
> > > > Referencing the PR (
> > > > https://github.com/apache/flink/pull/23164
> > > > ) with the
> > > > initial implementation you created as well here.
> > > >
> > > > Regards
> > > > Venkata krishnan
> > > >
> > > >
> > > > On Tue, Aug 22, 2023 at 10:09 AM Archit Goyal
> > >  > > > >
> > > > wrote:
> > > >
> > > > > Hi all,
> > > > >
> > > > > Gentle ping if I can get a review on the FLIP.
> > > > >
> > > > > Thanks,
> > > > > Archit Goyal
> > > > >
> > > > > From: Archit Goyal 
> > > > > Date: Thursday, August 17, 2023 at 5:51 PM
> > > > > To: dev@flink.apache.org 
> > > > > Subject: [Discuss] FLIP-355: Add parent dir of files to classpath
> > using
> > > > > yarn.provided.lib.dirs
> > > > > Hi All,
> > > > >
> > > > > I am opening this thread to discuss the proposal to add parent
> > > > directories
> > > > > of files to classpath when using yarn.provided.lib.dirs. This is
> > > > documented
> > > > > in FLIP-355 <
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP+355%3A+Add+parent+dir+of+files+to+classpath+using+yarn.provided.lib.dirs

Re: [VOTE] FLIP-323: Support Attached Execution on Flink Application Completion for Batch Jobs

2023-09-08 Thread Venkatakrishnan Sowrirajan
Thanks for driving this FLIP, Allison.

+1 (non-binding)

Regards
Venkata krishnan


On Thu, Sep 7, 2023 at 1:20 PM Allison Chang 
wrote:

> Hi everyone,
>
> Would like to start the VOTE for FLIP-323<
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-323%3A+Support+Attached+Execution+on+Flink+Application+Completion+for+Batch+Jobs
> > which proposes to introduce attached execution for batch jobs. The
> discussion thread can be found here<
> https://lists.apache.org/thread/d3toldk6qqjh2fnbmqthlfkj9rc6lwgl
> >:
>
>
> Best,
>
> Allison Chang
>
>


Re: [DISCUSS] FLIP-356: Support Nested Fields Filter Pushdown

2023-09-06 Thread Venkatakrishnan Sowrirajan
Hi everyone,

Posted a PR (https://github.com/apache/flink/pull/23313) to add nested
fields filter pushdown. Please review. Thanks.

Regards
Venkata krishnan


On Tue, Sep 5, 2023 at 10:04 PM Venkatakrishnan Sowrirajan 
wrote:

> Based on an offline discussion with Becket Qin, I added *fieldIndices *
> back which is the field index of the nested field at every level to the 
> *NestedFieldReferenceExpression
> *in FLIP-356
> <https://cwiki.apache.org/confluence/display/FLINK/FLIP-356%3A+Support+Nested+Fields+Filter+Pushdown>
> *. *2 reasons to do it:
>
> 1. Agree with using *fieldIndices *as the only contract to refer to the
> column from the underlying datasource.
> 2. To keep it consistent with *FieldReferenceExpression*
>
> Having said that, I see that with *projection pushdown, *index of the
> fields are used whereas with *filter pushdown (*based on scanning few
> tablesources) *FieldReferenceExpression*'s name is used for eg: even in
> the Flink's *FileSystemTableSource, IcebergSource, JDBCDatsource*. This
> way, I feel the contract is not quite clear and explicit. Wanted to
> understand other's thoughts as well.
>
> Regards
> Venkata krishnan
>
>
> On Tue, Sep 5, 2023 at 5:34 PM Becket Qin  wrote:
>
>> Hi Venkata,
>>
>>
>> > Also I made minor changes to the *NestedFieldReferenceExpression,
>> *instead
>> > of *fieldIndexArray* we can just do away with *fieldNames *array that
>> > includes fieldName at every level for the nested field.
>>
>>
>> I don't think keeping only the field names array would work. At the end of
>> the day, the contract between Flink SQL and the connectors is based on the
>> indexes, not the names. Technically speaking, the connectors only emit a
>> bunch of RowData which is based on positions. The field names are added by
>> the SQL framework via the DDL for those RowData. In this sense, the
>> connectors may not be aware of the field names in Flink DDL at all. The
>> common language between Flink SQL and source is just positions. This is
>> also why ProjectionPushDown would work by only relying on the indexes, not
>> the field names. So I think the field index array is a must have here in
>> the NestedFieldReferenceExpression.
>>
>> Thanks,
>>
>> Jiangjie (Becket) Qin
>>
>> On Fri, Sep 1, 2023 at 8:12 AM Venkatakrishnan Sowrirajan <
>> vsowr...@asu.edu>
>> wrote:
>>
>> > Gentle ping on the vote for FLIP-356: Support Nested fields filter
>> pushdown
>> > <
>> https://www.mail-archive.com/dev@flink.apache.org/msg69289.html
>> >.
>> >
>> > Regards
>> > Venkata krishnan
>> >
>> >
>> > On Tue, Aug 29, 2023 at 9:18 PM Venkatakrishnan Sowrirajan <
>> > vsowr...@asu.edu>
>> > wrote:
>> >
>> > > Sure, will reference this discussion to resume where we started as
>> part
>> > of
>> > > the flip to refactor SupportsProjectionPushDown.
>> > >
>> > > On Tue, Aug 29, 2023, 7:22 PM Jark Wu  wrote:
>> > >
>> > >> I'm fine with this. `ReferenceExpression` and
>> > `SupportsProjectionPushDown`
>> > >> can be another FLIP. However, could you summarize the design of this
>> > part
>> > >> in the future part of the FLIP? This can be easier to get started
>> with
>> > in
>> > >> the future.
>> > >>
>> > >>
>> > >> Best,
>> > >> Jark
>> > >>
>> > >>
>> > >> On Wed, 30 Aug 2023 at 02:45, Venkatakrishnan Sowrirajan <
>> > >> vsowr...@asu.edu>
>> > >> wrote:
>> > >>
>> > >> > Thanks Jark. Sounds good.
>> > >> >
>> > >> > One more thing, earlier in my summary I mentioned,
>> > >> >
>> > >> > Introduce a new *ReferenceExpression* (or
>> *BaseReferenceExpression*)
>> > >> > > abstract class which will be extended by both
>> > >> *FieldReferenceExpression*
>> > >> > >  and *NestedFieldReferenceExpression* (to be introduced as part
>> of
>> > >> this
>> > >> > > FLIP)
>> > >> >
>> > >> > This can be punted for now and can be handled as part of
>> refactoring
>> > >> > SupportsProjectionPushDown.

Re: [VOTE] FLIP-356: Support Nested Fields Filter Pushdown

2023-09-06 Thread Venkatakrishnan Sowrirajan
The voting time for [VOTE] FLIP-356: Support Nested Fields Filter Pushdown
<https://cwiki.apache.org/confluence/display/FLINK/FLIP-356%3A+Support+Nested+Fields+Filter+Pushdown>
 has passed. I'm closing the vote now.

There were 5 +1 votes which were binding and 3 +1 votes which were
non-binding:

- Jark Wu (binding)
- Martijn Visser (binding)
- Sergey Nuyanzin (binding)
- Becket Qin (binding)
- Jingsong Li (binding)

- Yuepeng Pan (non-binding)
- Conrad Jam (non-binding)
- Jiabao Sun (non-binding)

There were no -1 votes.

Thus, FLIP-356 has been accepted.

Thanks everyone for joining the discussion and giving feedback!

Regards
Venkata krishnan


On Tue, Sep 5, 2023 at 10:25 PM Jingsong Li  wrote:

> +1
>
> On Wed, Sep 6, 2023 at 1:18 PM Becket Qin  wrote:
> >
> > Thanks for pushing the FLIP through.
> >
> > +1 on the updated FLIP wiki.
> >
> > Cheers,
> >
> > Jiangjie (Becket) Qin
> >
> > On Wed, Sep 6, 2023 at 1:12 PM Venkatakrishnan Sowrirajan <
> vsowr...@asu.edu>
> > wrote:
> >
> > > Based on the recent discussions in the thread [DISCUSS] FLIP-356:
> Support
> > > Nested Fields Filter Pushdown
> > > <
> https://lists.apache.org/thread/686bhgwrrb4xmbfzlk60szwxos4z64t7
> >, I made
> > > some changes to the FLIP-356
> > > <
> > >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-356%3A+Support+Nested+Fields+Filter+Pushdown
> > > >.
> > > Unless anyone else has any concerns, we can continue with this vote to
> > > reach consensus.
> > >
> > > Regards
> > > Venkata krishnan
> > >
> > >
> > > On Tue, Sep 5, 2023 at 8:04 AM Sergey Nuyanzin 
> > > wrote:
> > >
> > > > +1 (binding)
> > > >
> > > > On Tue, Sep 5, 2023 at 4:55 PM Jiabao Sun  > > > .invalid>
> > > > wrote:
> > > >
> > > > > +1 (non-binding)
> > > > >
> > > > > Best,
> > > > > Jiabao
> > > > >
> > > > >
> > > > > > On Sep 5, 2023, at 10:33 PM, Martijn Visser  wrote:
> > > > > >
> > > > > > +1 (binding)
> > > > > >
> > > > > > On Tue, Sep 5, 2023 at 4:16 PM ConradJam 
> > > wrote:
> > > > > >
> > > > > >> +1 (non-binding)
> > > > > >>
> > > > > >> Yuepeng Pan  wrote on Fri, Sep 1, 2023 at 15:43:
> > > > > >>
> > > > > >>> +1 (non-binding)
> > > > > >>>
> > > > > >>> Best,
> > > > > >>> Yuepeng
> > > > > >>>
> > > > > >>>
> > > > > >>>
> > > > > >>> At 2023-09-01 14:32:19, "Jark Wu"  wrote:
> > > > > >>>> +1 (binding)
> > > > > >>>>
> > > > > >>>> Best,
> > > > > >>>> Jark
> > > > > >>>>
> > > > > >>>>> On Aug 30, 2023, at 02:40, Venkatakrishnan Sowrirajan <
> vsowr...@asu.edu>
> > > wrote:
> > > > > >>>>>
> > > > > >>>>> Hi everyone,
> > > > > >>>>>
> > > > > >>>>> Thank you all for your feedback on FLIP-356. I'd like to
> start a
> > > > > vote.
> > > > > >>>>>
> > > > > >>>>> Discussion thread:
> > > > > >>>>>
> > > >
> > >
> https://lists.apache.org/thread/686bhgwrrb4xmbfzlk60szwxos4z64t7
> > > > > >>>>> FLIP:
> > > > > >>>>>
> > > > > >>>
> > > > > >>
> > > > >
> > > >
> > >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-356%3A+Support+Nested+Fields+Filter+Pushdown
> > > > > >>>>>
> > > > > >>>>> Regards
> > > > > >>>>> Venkata krishnan
> > > > > >>>
> > > > > >>
> > > > >
> > > > >
> > > >
> > > > --
> > > > Best regards,
> > > > Sergey
> > > >
> > >
>


Re: [VOTE] FLIP-356: Support Nested Fields Filter Pushdown

2023-09-05 Thread Venkatakrishnan Sowrirajan
Based on the recent discussions in the thread [DISCUSS] FLIP-356: Support
Nested Fields Filter Pushdown
<https://lists.apache.org/thread/686bhgwrrb4xmbfzlk60szwxos4z64t7>, I made
some changes to the FLIP-356
<https://cwiki.apache.org/confluence/display/FLINK/FLIP-356%3A+Support+Nested+Fields+Filter+Pushdown>.
Unless anyone else has any concerns, we can continue with this vote to
reach consensus.

Regards
Venkata krishnan


On Tue, Sep 5, 2023 at 8:04 AM Sergey Nuyanzin  wrote:

> +1 (binding)
>
> On Tue, Sep 5, 2023 at 4:55 PM Jiabao Sun  .invalid>
> wrote:
>
> > +1 (non-binding)
> >
> > Best,
> > Jiabao
> >
> >
> > > On Sep 5, 2023, at 10:33 PM, Martijn Visser  wrote:
> > >
> > > +1 (binding)
> > >
> > > On Tue, Sep 5, 2023 at 4:16 PM ConradJam  wrote:
> > >
> > >> +1 (non-binding)
> > >>
> > >> Yuepeng Pan  wrote on Fri, Sep 1, 2023 at 15:43:
> > >>
> > >>> +1 (non-binding)
> > >>>
> > >>> Best,
> > >>> Yuepeng
> > >>>
> > >>>
> > >>>
> > >>> At 2023-09-01 14:32:19, "Jark Wu"  wrote:
> > >>>> +1 (binding)
> > >>>>
> > >>>> Best,
> > >>>> Jark
> > >>>>
> > >>>>> On Aug 30, 2023, at 02:40, Venkatakrishnan Sowrirajan  wrote:
> > >>>>>
> > >>>>> Hi everyone,
> > >>>>>
> > >>>>> Thank you all for your feedback on FLIP-356. I'd like to start a
> > vote.
> > >>>>>
> > >>>>> Discussion thread:
> > >>>>>
> https://lists.apache.org/thread/686bhgwrrb4xmbfzlk60szwxos4z64t7
> > >>>>> FLIP:
> > >>>>>
> > >>>
> > >>
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-356%3A+Support+Nested+Fields+Filter+Pushdown
> > >>>>>
> > >>>>> Regards
> > >>>>> Venkata krishnan
> > >>>
> > >>
> >
> >
>
> --
> Best regards,
> Sergey
>


Re: [DISCUSS] FLIP-356: Support Nested Fields Filter Pushdown

2023-09-05 Thread Venkatakrishnan Sowrirajan
Based on an offline discussion with Becket Qin, I added *fieldIndices
*back which
is the field index of the nested field at every level to the
*NestedFieldReferenceExpression
*in FLIP-356
<https://cwiki.apache.org/confluence/display/FLINK/FLIP-356%3A+Support+Nested+Fields+Filter+Pushdown>
*. *2 reasons to do it:

1. Agree with using *fieldIndices *as the only contract to refer to the
column from the underlying datasource.
2. To keep it consistent with *FieldReferenceExpression*

Having said that, I see that with *projection pushdown, *index of the
fields are used whereas with *filter pushdown (*based on scanning few
tablesources) *FieldReferenceExpression*'s name is used for eg: even in the
Flink's *FileSystemTableSource, IcebergSource, JDBCDatsource*. This way, I
feel the contract is not quite clear and explicit. Wanted to understand
other's thoughts as well.

Regards
Venkata krishnan


On Tue, Sep 5, 2023 at 5:34 PM Becket Qin  wrote:

> Hi Venkata,
>
>
> > Also I made minor changes to the *NestedFieldReferenceExpression,
> *instead
> > of *fieldIndexArray* we can just do away with *fieldNames *array that
> > includes fieldName at every level for the nested field.
>
>
> I don't think keeping only the field names array would work. At the end of
> the day, the contract between Flink SQL and the connectors is based on the
> indexes, not the names. Technically speaking, the connectors only emit a
> bunch of RowData which is based on positions. The field names are added by
> the SQL framework via the DDL for those RowData. In this sense, the
> connectors may not be aware of the field names in Flink DDL at all. The
> common language between Flink SQL and source is just positions. This is
> also why ProjectionPushDown would work by only relying on the indexes, not
> the field names. So I think the field index array is a must have here in
> the NestedFieldReferenceExpression.
>
> Thanks,
>
> Jiangjie (Becket) Qin
>
> On Fri, Sep 1, 2023 at 8:12 AM Venkatakrishnan Sowrirajan <
> vsowr...@asu.edu>
> wrote:
>
> > Gentle ping on the vote for FLIP-356: Support Nested fields filter
> pushdown
> > <
> https://www.mail-archive.com/dev@flink.apache.org/msg69289.html
> >.
> >
> > Regards
> > Venkata krishnan
> >
> >
> > On Tue, Aug 29, 2023 at 9:18 PM Venkatakrishnan Sowrirajan <
> > vsowr...@asu.edu>
> > wrote:
> >
> > > Sure, will reference this discussion to resume where we started as part
> > of
> > > the flip to refactor SupportsProjectionPushDown.
> > >
> > > On Tue, Aug 29, 2023, 7:22 PM Jark Wu  wrote:
> > >
> > >> I'm fine with this. `ReferenceExpression` and
> > `SupportsProjectionPushDown`
> > >> can be another FLIP. However, could you summarize the design of this
> > part
> > >> in the future part of the FLIP? This can be easier to get started with
> > in
> > >> the future.
> > >>
> > >>
> > >> Best,
> > >> Jark
> > >>
> > >>
> > >> On Wed, 30 Aug 2023 at 02:45, Venkatakrishnan Sowrirajan <
> > >> vsowr...@asu.edu>
> > >> wrote:
> > >>
> > >> > Thanks Jark. Sounds good.
> > >> >
> > >> > One more thing, earlier in my summary I mentioned,
> > >> >
> > >> > Introduce a new *ReferenceExpression* (or *BaseReferenceExpression*)
> > >> > > abstract class which will be extended by both
> > >> *FieldReferenceExpression*
> > >> > >  and *NestedFieldReferenceExpression* (to be introduced as part of
> > >> this
> > >> > > FLIP)
> > >> >
> > >> > This can be punted for now and can be handled as part of refactoring
> > >> > SupportsProjectionPushDown.
> > >> >
> > >> > Also I made minor changes to the *NestedFieldReferenceExpression,
> > >> *instead
> > >> > of *fieldIndexArray* we can just do away with *fieldNames *array
> that
> > >> > includes fieldName at every level for the nested field.
> > >> >
> > >> > Updated the FLIP-356
> > >> > <
> > >> >
> > >>
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-356%3A+Support+Nested+Fields+Filter+Pushdown
> > >>

Re: [DISCUSS] FLIP-356: Support Nested Fields Filter Pushdown

2023-08-31 Thread Venkatakrishnan Sowrirajan
Gentle ping on the vote for FLIP-356: Support Nested fields filter pushdown
<https://www.mail-archive.com/dev@flink.apache.org/msg69289.html>.

Regards
Venkata krishnan


On Tue, Aug 29, 2023 at 9:18 PM Venkatakrishnan Sowrirajan 
wrote:

> Sure, will reference this discussion to resume where we started as part of
> the flip to refactor SupportsProjectionPushDown.
>
> On Tue, Aug 29, 2023, 7:22 PM Jark Wu  wrote:
>
>> I'm fine with this. `ReferenceExpression` and `SupportsProjectionPushDown`
>> can be another FLIP. However, could you summarize the design of this part
>> in the future part of the FLIP? This can be easier to get started with in
>> the future.
>>
>>
>> Best,
>> Jark
>>
>>
>> On Wed, 30 Aug 2023 at 02:45, Venkatakrishnan Sowrirajan <
>> vsowr...@asu.edu>
>> wrote:
>>
>> > Thanks Jark. Sounds good.
>> >
>> > One more thing, earlier in my summary I mentioned,
>> >
>> > Introduce a new *ReferenceExpression* (or *BaseReferenceExpression*)
>> > > abstract class which will be extended by both
>> *FieldReferenceExpression*
>> > >  and *NestedFieldReferenceExpression* (to be introduced as part of
>> this
>> > > FLIP)
>> >
>> > This can be punted for now and can be handled as part of refactoring
>> > SupportsProjectionPushDown.
>> >
>> > Also I made minor changes to the *NestedFieldReferenceExpression,
>> *instead
>> > of *fieldIndexArray* we can just do away with *fieldNames *array that
>> > includes fieldName at every level for the nested field.
>> >
>> > Updated the FLIP-356
>> > <
>> >
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-356%3A+Support+Nested+Fields+Filter+Pushdown
>> > >
>> > wiki as well.
>> >
>> > Regards
>> > Venkata krishnan
>> >
>> >
>> > On Tue, Aug 29, 2023 at 5:21 AM Jark Wu  wrote:
>> >
>> > > Hi Venkata,
>> > >
>> > > Your summary looks good to me. +1 to start a vote.
>> > >
>> > > I think we don't need "inputIndex" in NestedFieldReferenceExpression.
>> > > Actually, I think it is also not needed in FieldReferenceExpression,
>> > > and we should try to remove it (another topic). The RexInputRef in
>> > Calcite
>> > > also doesn't require an inputIndex because the field index should
>> > represent
>> > > index of the field in the underlying row type. Field references
>> shouldn't
>> > > be
>> > >  aware of the number of inputs.
>> > >
>> > > Best,
>> > > Jark
>> > >
>> > >
>> > > On Tue, 29 Aug 2023 at 02:24, Venkatakrishnan Sowrirajan <
>> > vsowr...@asu.edu
>> > > >
>> > > wrote:
>> > >
>> > > > Hi Jinsong,
>> > > >
>> > > > Thanks for your comments.
>> > > >
>> > > > What is inputIndex in NestedFieldReferenceExpression?
>> > > >
>> > > >
>> > > > I haven't looked at it before. Do you mean, given that it is now
>> only
>> > > used
>> > > > to push filters it won't be subsequently used in further
>> > > > planning/optimization and therefore it is not required at this time?
>> > > >
>> > > > So if NestedFieldReferenceExpression doesn't need inputIndex, is
>> there
>> > > > > a need to introduce a base class `ReferenceExpression`?
>> > > >
>> > > > For SupportsFilterPushDown itself, *ReferenceExpression* base class
>> is
>> > > not
>> > > > needed. But there were discussions around cleaning up and
>> standardizing
>> > > the
>> > > > API for Supports*PushDown. SupportsProjectionPushDown currently
>> pushes
>> > > the
>> > > > projects as a 2-d array, instead it would be better to use the
>> standard
>> > > API
>> > > > which seems to be the *ResolvedExpression*. For
>> > > SupportsProjectionPushDown
>> > > > either FieldReferenceExpression (top level fields) or
>> > > > NestedFieldReferenceExpression (nested fields) is enough, in order
>> to
>> > > > provide a single API that handles both top level and nested

Re: [DISCUSS] FLIP-356: Support Nested Fields Filter Pushdown

2023-08-29 Thread Venkatakrishnan Sowrirajan
Sure, will reference this discussion to resume where we started as part of
the flip to refactor SupportsProjectionPushDown.

On Tue, Aug 29, 2023, 7:22 PM Jark Wu  wrote:

> I'm fine with this. `ReferenceExpression` and `SupportsProjectionPushDown`
> can be another FLIP. However, could you summarize the design of this part
> in the future part of the FLIP? This can be easier to get started with in
> the future.
>
>
> Best,
> Jark
>
>
> On Wed, 30 Aug 2023 at 02:45, Venkatakrishnan Sowrirajan  >
> wrote:
>
> > Thanks Jark. Sounds good.
> >
> > One more thing, earlier in my summary I mentioned,
> >
> > Introduce a new *ReferenceExpression* (or *BaseReferenceExpression*)
> > > abstract class which will be extended by both
> *FieldReferenceExpression*
> > >  and *NestedFieldReferenceExpression* (to be introduced as part of this
> > > FLIP)
> >
> > This can be punted for now and can be handled as part of refactoring
> > SupportsProjectionPushDown.
> >
> > Also I made minor changes to the *NestedFieldReferenceExpression,
> *instead
> > of *fieldIndexArray* we can just do away with *fieldNames *array that
> > includes fieldName at every level for the nested field.
> >
> > Updated the FLIP-356
> > <
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-356%3A+Support+Nested+Fields+Filter+Pushdown
> > >
> > wiki as well.
> >
> > Regards
> > Venkata krishnan
> >
> >
> > On Tue, Aug 29, 2023 at 5:21 AM Jark Wu  wrote:
> >
> > > Hi Venkata,
> > >
> > > Your summary looks good to me. +1 to start a vote.
> > >
> > > I think we don't need "inputIndex" in NestedFieldReferenceExpression.
> > > Actually, I think it is also not needed in FieldReferenceExpression,
> > > and we should try to remove it (another topic). The RexInputRef in
> > Calcite
> > > also doesn't require an inputIndex because the field index should
> > represent
> > > index of the field in the underlying row type. Field references
> shouldn't
> > > be
> > >  aware of the number of inputs.
> > >
> > > Best,
> > > Jark
> > >
> > >
> > > On Tue, 29 Aug 2023 at 02:24, Venkatakrishnan Sowrirajan <
> > vsowr...@asu.edu
> > > >
> > > wrote:
> > >
> > > > Hi Jinsong,
> > > >
> > > > Thanks for your comments.
> > > >
> > > > What is inputIndex in NestedFieldReferenceExpression?
> > > >
> > > >
> > > > I haven't looked at it before. Do you mean, given that it is now only
> > > used
> > > > to push filters it won't be subsequently used in further
> > > > planning/optimization and therefore it is not required at this time?
> > > >
> > > > So if NestedFieldReferenceExpression doesn't need inputIndex, is
> there
> > > > > a need to introduce a base class `ReferenceExpression`?
> > > >
> > > > For SupportsFilterPushDown itself, *ReferenceExpression* base class
> is
> > > not
> > > > needed. But there were discussions around cleaning up and
> standardizing
> > > the
> > > > API for Supports*PushDown. SupportsProjectionPushDown currently
> pushes
> > > the
> > > > projects as a 2-d array, instead it would be better to use the
> standard
> > > API
> > > > which seems to be the *ResolvedExpression*. For
> > > SupportsProjectionPushDown
> > > > either FieldReferenceExpression (top level fields) or
> > > > NestedFieldReferenceExpression (nested fields) is enough, in order to
> > > > provide a single API that handles both top level and nested fields,
> > > > ReferenceExpression will be introduced as a base class.
> > > >
> > > > Eventually, *SupportsProjectionPushDown#applyProjections* would
> evolve
> > as
> > > > applyProjection(List<ReferenceExpression> projectedFields) and nested
> > > > fields would be pushed only if *supportsNestedProjections* returns
> > true.
> > > >
> > > > Regards
> > > > Venkata krishnan
> > > >
> > > >
> > > > On Sun, Aug 27, 2023 at 11:12 PM Jingsong Li  >
> > > > wrote:
> > > >
> > > > > So if NestedFieldReferenceExpression doesn't need inputIndex

Re: [DISCUSS] FLIP-356: Support Nested Fields Filter Pushdown

2023-08-29 Thread Venkatakrishnan Sowrirajan
Thanks Jark. Sounds good.

One more thing, earlier in my summary I mentioned,

Introduce a new *ReferenceExpression* (or *BaseReferenceExpression*)
> abstract class which will be extended by both *FieldReferenceExpression*
>  and *NestedFieldReferenceExpression* (to be introduced as part of this
> FLIP)

This can be punted for now and can be handled as part of refactoring
SupportsProjectionPushDown.

Also I made minor changes to the *NestedFieldReferenceExpression, *instead
of *fieldIndexArray* we can just do away with *fieldNames *array that
includes fieldName at every level for the nested field.
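For comparison with the names-only variant, a toy sketch (assumed helper classes, not Flink code) of what the index-based contract buys: a source can resolve the referenced leaf purely by position, without ever seeing the column names assigned in Flink's DDL:

```java
import java.util.List;

// Minimal stand-in for a nested row schema.
class ToyField {
    final String name;
    final List<ToyField> children; // empty for leaf fields

    ToyField(String name, List<ToyField> children) {
        this.name = name;
        this.children = children;
    }
}

class SchemaWalker {
    /**
     * Follows one index per nesting level, e.g. {1, 1} means: second
     * top-level field, then its second child.
     */
    static ToyField resolve(List<ToyField> topLevelFields, int[] fieldIndices) {
        ToyField current = topLevelFields.get(fieldIndices[0]);
        for (int level = 1; level < fieldIndices.length; level++) {
            current = current.children.get(fieldIndices[level]);
        }
        return current;
    }
}
```

A names-only reference would instead require the source to know the DDL-assigned names, which is the ambiguity discussed above.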

Updated the FLIP-356
<https://cwiki.apache.org/confluence/display/FLINK/FLIP-356%3A+Support+Nested+Fields+Filter+Pushdown>
wiki as well.

Regards
Venkata krishnan


On Tue, Aug 29, 2023 at 5:21 AM Jark Wu  wrote:

> Hi Venkata,
>
> Your summary looks good to me. +1 to start a vote.
>
> I think we don't need "inputIndex" in NestedFieldReferenceExpression.
> Actually, I think it is also not needed in FieldReferenceExpression,
> and we should try to remove it (another topic). The RexInputRef in Calcite
> also doesn't require an inputIndex because the field index should represent
> index of the field in the underlying row type. Field references shouldn't
> be
>  aware of the number of inputs.
>
> Best,
> Jark
>
>
> On Tue, 29 Aug 2023 at 02:24, Venkatakrishnan Sowrirajan  >
> wrote:
>
> > Hi Jinsong,
> >
> > Thanks for your comments.
> >
> > What is inputIndex in NestedFieldReferenceExpression?
> >
> >
> > I haven't looked at it before. Do you mean, given that it is now only
> used
> > to push filters it won't be subsequently used in further
> > planning/optimization and therefore it is not required at this time?
> >
> > So if NestedFieldReferenceExpression doesn't need inputIndex, is there
> > > a need to introduce a base class `ReferenceExpression`?
> >
> > For SupportsFilterPushDown itself, *ReferenceExpression* base class is
> not
> > needed. But there were discussions around cleaning up and standardizing
> the
> > API for Supports*PushDown. SupportsProjectionPushDown currently pushes
> the
> > projects as a 2-d array, instead it would be better to use the standard
> API
> > which seems to be the *ResolvedExpression*. For
> SupportsProjectionPushDown
> > either FieldReferenceExpression (top level fields) or
> > NestedFieldReferenceExpression (nested fields) is enough, in order to
> > provide a single API that handles both top level and nested fields,
> > ReferenceExpression will be introduced as a base class.
> >
> > Eventually, *SupportsProjectionPushDown#applyProjections* would evolve as
> > applyProjection(List<ReferenceExpression> projectedFields) and nested
> > fields would be pushed only if *supportsNestedProjections* returns true.
> >
> > Regards
> > Venkata krishnan
> >
> >
> > On Sun, Aug 27, 2023 at 11:12 PM Jingsong Li 
> > wrote:
> >
> > > So if NestedFieldReferenceExpression doesn't need inputIndex, is there
> > > a need to introduce a base class `ReferenceExpression`?
> > >
> > > Best,
> > > Jingsong
> > >
> > > On Mon, Aug 28, 2023 at 2:09 PM Jingsong Li 
> > > wrote:
> > > >
> > > > Hi thanks all for your discussion.
> > > >
> > > > What is inputIndex in NestedFieldReferenceExpression?
> > > >
> > > > I know inputIndex has special usage in FieldReferenceExpression, but
> > > > it is only for Join operators, and it is only for SQL optimization.
> It
> > > > looks like there is no requirement for Nested.
> > > >
> > > > Best,
> > > > Jingsong
> > > >
> > > > On Mon, Aug 28, 2023 at 1:13 PM Venkatakrishnan Sowrirajan
> > > >  wrote:
> > > > >
> > > > > Thanks for all the feedback and discussion everyone. Looks like we
> > have
> > > > > reached a consensus here.
> > > > >
> > > > > Just to summarize:
> > > > >
> > > > > 1. Introduce a new *ReferenceExpression* (or
> > *BaseReferenceExpression*)
> > > > > abstract class which will be extended by both
> > > *FieldReferenceExpression*
> > > > > and *NestedFieldReferenceExpression* (to be introduced as part of
> > this
> > > FLIP)
> > > > > 2. No need of *supportsNestedFilters *check as the current
> > > > > *SupportsFilterPushDown* should already ignore unknown expressions
> (
> > > > > *NestedFieldReferenceExpression*

[VOTE] FLIP-356: Support Nested Fields Filter Pushdown

2023-08-29 Thread Venkatakrishnan Sowrirajan
Hi everyone,

Thank you all for your feedback on FLIP-356. I'd like to start a vote.

Discussion thread:
https://lists.apache.org/thread/686bhgwrrb4xmbfzlk60szwxos4z64t7
FLIP:
https://cwiki.apache.org/confluence/display/FLINK/FLIP-356%3A+Support+Nested+Fields+Filter+Pushdown

Regards
Venkata krishnan


Re: [DISCUSS] FLIP-356: Support Nested Fields Filter Pushdown

2023-08-28 Thread Venkatakrishnan Sowrirajan
Hi Jinsong,

Thanks for your comments.

What is inputIndex in NestedFieldReferenceExpression?


I haven't looked at it before. Do you mean, given that it is now only used
to push filters it won't be subsequently used in further
planning/optimization and therefore it is not required at this time?

So if NestedFieldReferenceExpression doesn't need inputIndex, is there
> a need to introduce a base class `ReferenceExpression`?

For SupportsFilterPushDown itself, *ReferenceExpression* base class is not
needed. But there were discussions around cleaning up and standardizing the
API for Supports*PushDown. SupportsProjectionPushDown currently pushes the
projects as a 2-d array, instead it would be better to use the standard API
which seems to be the *ResolvedExpression*. For SupportsProjectionPushDown
either FieldReferenceExpression (top level fields) or
NestedFieldReferenceExpression (nested fields) is enough, in order to
provide a single API that handles both top level and nested fields,
ReferenceExpression will be introduced as a base class.

Eventually, *SupportsProjectionPushDown#applyProjections* would evolve as
applyProjection(List<ReferenceExpression> projectedFields) and nested
fields would be pushed only if *supportsNestedProjections* returns true.
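A minimal sketch of how such an evolved signature might look; ReferenceExpression, the two reference classes, and the toy source below are assumed names for illustration only, not Flink's committed API:

```java
import java.util.ArrayList;
import java.util.List;

// Common base type so one applyProjection call can carry both
// top-level and nested field references.
interface ReferenceExpression {
    String asSummaryString();
}

class FieldRef implements ReferenceExpression {
    final String name;
    final int index; // position in the top-level row

    FieldRef(String name, int index) {
        this.name = name;
        this.index = index;
    }

    public String asSummaryString() {
        return name;
    }
}

class NestedFieldRef implements ReferenceExpression {
    final String[] names;
    final int[] indices; // position at every nesting level

    NestedFieldRef(String[] names, int[] indices) {
        this.names = names;
        this.indices = indices;
    }

    public String asSummaryString() {
        return String.join(".", names);
    }
}

// Toy source that records which columns the planner pushed down.
class ExampleTableSource {
    final List<String> projectedColumns = new ArrayList<>();

    boolean supportsNestedProjection() {
        return true;
    }

    void applyProjection(List<ReferenceExpression> projectedFields) {
        for (ReferenceExpression ref : projectedFields) {
            projectedColumns.add(ref.asSummaryString());
        }
    }
}
```

Compared with today's 2-d int array, a single expression list keeps SupportsProjectionPushDown consistent with SupportsFilterPushDown, which already speaks in ResolvedExpressions.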

Regards
Venkata krishnan


On Sun, Aug 27, 2023 at 11:12 PM Jingsong Li  wrote:

> So if NestedFieldReferenceExpression doesn't need inputIndex, is there
> a need to introduce a base class `ReferenceExpression`?
>
> Best,
> Jingsong
>
> On Mon, Aug 28, 2023 at 2:09 PM Jingsong Li 
> wrote:
> >
> > Hi thanks all for your discussion.
> >
> > What is inputIndex in NestedFieldReferenceExpression?
> >
> > I know inputIndex has special usage in FieldReferenceExpression, but
> > it is only for Join operators, and it is only for SQL optimization. It
> > looks like there is no requirement for Nested.
> >
> > Best,
> > Jingsong
> >
> > On Mon, Aug 28, 2023 at 1:13 PM Venkatakrishnan Sowrirajan
> >  wrote:
> > >
> > > Thanks for all the feedback and discussion everyone. Looks like we have
> > > reached a consensus here.
> > >
> > > Just to summarize:
> > >
> > > 1. Introduce a new *ReferenceExpression* (or *BaseReferenceExpression*)
> > > abstract class which will be extended by both
> *FieldReferenceExpression*
> > > and *NestedFieldReferenceExpression* (to be introduced as part of this
> FLIP)
> > > 2. No need for a *supportsNestedFilters* check as the current
> > > *SupportsFilterPushDown* should already ignore unknown expressions
> > > (*NestedFieldReferenceExpression*, for example) and return them as
> > > *remainingFilters*. Maybe this should be clarified explicitly in the
> > > Javadoc of *SupportsFilterPushDown*. I will file a separate JIRA to fix
> > > the documentation.
> > > 3. Refactor *SupportsProjectionPushDown* to use *ReferenceExpression*
> > > instead of existing 2-d arrays to consolidate and be consistent with
> > > other Supports*PushDown APIs - *outside the scope of this FLIP*
> > > 4. Similarly *SupportsAggregatePushDown* should also be evolved whenever
> > > nested fields support is added to use the *ReferenceExpression* -
> > > *outside the scope of this FLIP*
> > >
> > > Does this sound good? Please let me know if I have missed anything
> here. If
> > > there are no concerns, I will start a vote tomorrow. I will also get
> the
> > > FLIP-356 wiki updated. Thanks everyone once again!
> > >
> > > Regards
> > > Venkata krishnan
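Point 2 of the summary above — a connector should ignore expressions it does not recognize and return them as remaining filters — can be illustrated with a small, self-contained sketch (all names hypothetical, not the real Flink interfaces):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class FilterPushDownSketch {
    interface Expression {}
    static class FieldRef implements Expression {}       // understood by this connector
    static class NestedFieldRef implements Expression {} // unknown to this connector

    // Mirrors SupportsFilterPushDown.Result: accepted vs. remaining filters.
    static class Result {
        final List<Expression> accepted = new ArrayList<>();
        final List<Expression> remaining = new ArrayList<>();
    }

    static Result applyFilters(List<Expression> filters) {
        Result r = new Result();
        for (Expression f : filters) {
            if (f instanceof FieldRef) {
                r.accepted.add(f);   // pushed down to the source
            } else {
                r.remaining.add(f);  // unknown -> evaluated by the planner
            }
        }
        return r;
    }

    public static void main(String[] args) {
        Result r = applyFilters(Arrays.asList(new FieldRef(), new NestedFieldRef()));
        System.out.println(r.accepted.size() + " pushed, " + r.remaining.size() + " remaining");
    }
}
```

With this contract, an old connector that predates NestedFieldReferenceExpression keeps working unchanged: nested filters simply flow back to the planner.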
> > >
> > >
> > > On Thu, Aug 24, 2023 at 8:19 PM Becket Qin 
> wrote:
> > >
> > > > Hi Jark,
> > > >
> > > > How about having a separate NestedFieldReferenceExpression, and
> > > > > abstracting a common base class "ReferenceExpression" for
> > > > > NestedFieldReferenceExpression and FieldReferenceExpression? This
> makes
> > > > > unifying expressions in
> > > > >
> "SupportsProjectionPushdown#applyProjections(List<ReferenceExpression>
> > > > > ...)"
> > > > > possible.
> > > >
> > > >
> > > > I'd be fine with this. It at least provides a consistent API style /
> > > > formality.
> > > >
> > > >  Re: Yunhong,
> > > >
> > > > 3. Finally, I think we need to look at the costs and benefits of
> unifying
> > > > > the SupportsFilterPushDown and SupportsProjectionPushDown (or
> others)
> > > > from
> > > > > the perspective of interface 

Re: [DISCUSS] FLIP-356: Support Nested Fields Filter Pushdown

2023-08-27 Thread Venkatakrishnan Sowrirajan
> > 3. Finally, I think we need to look at the costs and benefits of unifying
> > the SupportsFilterPushDown and SupportsProjectionPushDown (or others)
> from
> > the perspective of interface implementers. A stable API can reduce user
> > development and change costs, if the current API can fully meet the
> > functional requirements at the framework level, I personal suggest
> reducing
> > the impact on connector developers.
> >
> > Regards,
> > Yunhong Zheng (Swuferhong)
> >
> >
> > Venkatakrishnan Sowrirajan  于2023年8月25日周五 01:25写道:
> >
> > > To keep it backwards compatible, introduce another API *applyAggregates*
> > > with *List<ReferenceExpression>* when nested field support is added and
> > > deprecate the current API. This will by default throw an exception. The
> > > Flink planner will call *applyAggregates* with nested fields and, if it
> > > throws an exception, call *applyAggregates* without nested fields.
> > >
> > > Regards
> > > Venkata krishnan
> > >
> > >
> > > On Thu, Aug 24, 2023 at 10:13 AM Venkatakrishnan Sowrirajan <
> > > vsowr...@asu.edu> wrote:
> > >
> > > > Jark,
> > > >
> > > > How about having a separate NestedFieldReferenceExpression, and
> > > >> abstracting a common base class "ReferenceExpression" for
> > > >> NestedFieldReferenceExpression and FieldReferenceExpression? This
> > makes
> > > >> unifying expressions in
> > > >>
> "SupportsProjectionPushdown#applyProjections(List<ReferenceExpression>
> > > >> ...)"
> > > >> possible.
> > > >
> > > > This should be fine for *SupportsProjectionPushDown* and
> > > > *SupportsFilterPushDown*. One concern in the case of
> > > > *SupportsAggregatePushDown* with nested fields support (to be added
> in
> > > > the future), with this proposal, the API will become backwards
> > > incompatible
> > > > as the *args* for the aggregate function is
> > > > *List<FieldReferenceExpression>* that needs to change to
> > > > *List<ReferenceExpression>*.
> > > >
> > > > Regards
> > > > Venkata krishnan
> > > >
> > > >
> > > > On Thu, Aug 24, 2023 at 1:18 AM Jark Wu  wrote:
> > > >
> > > >> Hi Becket,
> > > >>
> > > >> I think it is the second case, that a FieldReferenceExpression is
> > > >> constructed
> > > >> by the framework and passed to the connector (interfaces listed by
> > > >> Venkata[1]
> > > >> and Catalog#listPartitionsByFilter). Besides, understanding the
> nested
> > > >> field
> > > >> is optional for users/connectors (just treat it as an unknown
> > expression
> > > >> if
> > > >> the
> > > >> connector doesn't want to support it).
> > > >>
> > > >> If we extend FieldReferenceExpression, in the case of "where
> > col.nested
> > > >
> > > >> 10",
> > > >> for the connectors already supported filter/delete pushdown, they
> may
> > > >> wrongly
> > > >> pushdown "col > 10" instead of "nested > 10" because they still
> treat
> > > >> FieldReferenceExpression as a top-level column. This problem can be
> > > >> resolved
> > > >> by introducing an additional "supportedNestedPushdown" for each
> > > interface,
> > > >> but that method is not elegant and is hard to remove in the future,
> > and
> > > >> this could
> > > >> be avoided if we have a separate NestedFieldReferenceExpression.
> > > >>
> > > >> If we want to extend FieldReferenceExpression, we have to add
> > > protections
> > > >> for every related API in one shot. Besides, FieldReferenceExpression
> > is
> > > a
> > > >> fundamental class in the planner, we have to go through all the code
> > > that
> > > >> is using it to make sure it properly handling it if it is a nested
> > field
> > > >> which
> > > >> is a big effort for the community.
> > > >>
> > > >> If we were designing this API on day 1, I fully support merging them
> > in
> > > a
> > > >> FieldReferenceExpression. But in this case, I'm thinking about how
> to
> > > >> provide

Re: [Discuss] FLIP-355: Add parent dir of files to classpath using yarn.provided.lib.dirs

2023-08-24 Thread Venkatakrishnan Sowrirajan
Thanks Yang Wang.

In that case, whenever you get a chance could you please review the PR?


On Wed, Aug 23, 2023, 8:15 PM Yang Wang  wrote:

> +1 for this FLIP
>
> Maybe a FLIP is an overkill for this enhancement.
>
>
> Best,
> Yang
>
> Venkatakrishnan Sowrirajan  于2023年8月23日周三 01:44写道:
>
> > Thanks for the FLIP, Archit.
> >
> > This is definitely quite a useful addition to the
> *yarn.provided.lib.dirs*
> > . +1.
> >
> > IMO, except for the fact that *yarn.provided.lib.dirs* (platform specific
> > jars can be cached) takes only directories whereas *yarn.ship-files*
> (user
> > files) takes both files and dirs, the overall logic in terms of
> > constructing the classpath in both the cases should be roughly the same.
> >
> > Referencing the PR (
> https://urldefense.com/v3/__https://github.com/apache/flink/pull/23164__;!!IKRxdwAv5BmarQ!cgTpodngoQAdd-qu3CvbQeAwENiu1nf0eahTH-v1NhUsSn4Y7M7sVGQYSnBjB2XgaOlyzGe7XEiU3-cAOoy84Kw$
> ) with the
> > initial implementation you created as well here.
> >
> > Regards
> > Venkata krishnan
> >
> >
> > On Tue, Aug 22, 2023 at 10:09 AM Archit Goyal
>  > >
> > wrote:
> >
> > > Hi all,
> > >
> > > Gentle ping if I can get a review on the FLIP.
> > >
> > > Thanks,
> > > Archit Goyal
> > >
> > > From: Archit Goyal 
> > > Date: Thursday, August 17, 2023 at 5:51 PM
> > > To: dev@flink.apache.org 
> > > Subject: [Discuss] FLIP-355: Add parent dir of files to classpath using
> > > yarn.provided.lib.dirs
> > > Hi All,
> > >
> > > I am opening this thread to discuss the proposal to add parent
> > directories
> > > of files to classpath when using yarn.provided.lib.dirs. This is
> > documented
> > > in FLIP-355 <
> > >
> >
> https://urldefense.com/v3/__https://cwiki.apache.org/confluence/display/FLINK/FLIP*355*3A*Add*parent*dir*of*files*to*classpath*using*yarn.provided.lib.dirs__;KyUrKysrKysrKys!!IKRxdwAv5BmarQ!fFlyBKWuWwYcWfOcpLflTTi36tyHPiENIUry9J0ygaZY0VURnIs0glu5yafV0w0tfSsnOb9ZxDD9Cjw2TApTekVU$
> > > >.
> > >
> > > This FLIP mentions about enhancing YARN's classpath configuration to
> > > include parent directories of files in yarn.provided.lib.dirs.
> > >
> > > Please feel free to reply to this email thread and share your opinions.
> > >
> > > Thanks,
> > > Archit Goyal
> > >
> >
>


Re: [DISCUSS] FLIP-356: Support Nested Fields Filter Pushdown

2023-08-24 Thread Venkatakrishnan Sowrirajan
To keep it backwards compatible, introduce another API *applyAggregates* with
*List<ReferenceExpression>* when nested field support is added and
deprecate the current API. This will by default throw an exception. The
Flink planner will call *applyAggregates* with nested fields and, if it
throws an exception, call *applyAggregates* without nested fields.
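The try-then-fall-back contract described above can be sketched as follows (a hypothetical, simplified model — the names and signatures are assumptions, not the actual SupportsAggregatePushDown API):

```java
import java.util.Collections;
import java.util.List;

public class AggregatePushDownSketch {
    interface ReferenceExpression {}

    interface SupportsAggregatePushDown {
        // Existing API (top-level fields only).
        boolean applyAggregates(List<int[]> groupingSets);

        // New API carrying ReferenceExpression args; default throws so old
        // connectors keep compiling and working unchanged.
        default boolean applyAggregates(List<int[]> groupingSets,
                                        List<ReferenceExpression> aggArgs) {
            throw new UnsupportedOperationException("nested aggregates not supported");
        }
    }

    // Planner side: try the new overload first, fall back on exception.
    static String pushAggregates(SupportsAggregatePushDown source) {
        try {
            source.applyAggregates(Collections.emptyList(), Collections.emptyList());
            return "nested";
        } catch (UnsupportedOperationException e) {
            source.applyAggregates(Collections.emptyList());
            return "top-level";
        }
    }

    public static void main(String[] args) {
        // A legacy connector implements only the old abstract method.
        SupportsAggregatePushDown legacy = groupingSets -> true;
        System.out.println(pushAggregates(legacy)); // prints "top-level"
    }
}
```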

Regards
Venkata krishnan


On Thu, Aug 24, 2023 at 10:13 AM Venkatakrishnan Sowrirajan <
vsowr...@asu.edu> wrote:

> Jark,
>
> How about having a separate NestedFieldReferenceExpression, and
>> abstracting a common base class "ReferenceExpression" for
>> NestedFieldReferenceExpression and FieldReferenceExpression? This makes
>> unifying expressions in
>> "SupportsProjectionPushdown#applyProjections(List<ReferenceExpression>
>> ...)"
>> possible.
>
> This should be fine for *SupportsProjectionPushDown* and
> *SupportsFilterPushDown*. One concern in the case of
> *SupportsAggregatePushDown* with nested fields support (to be added in
> the future), with this proposal, the API will become backwards incompatible
> as the *args* for the aggregate function is *List<FieldReferenceExpression>*
> that needs to change to *List<ReferenceExpression>*.
>
> Regards
> Venkata krishnan
>
>
> On Thu, Aug 24, 2023 at 1:18 AM Jark Wu  wrote:
>
>> Hi Becket,
>>
>> I think it is the second case, that a FieldReferenceExpression is
>> constructed
>> by the framework and passed to the connector (interfaces listed by
>> Venkata[1]
>> and Catalog#listPartitionsByFilter). Besides, understanding the nested
>> field
>> is optional for users/connectors (just treat it as an unknown expression
>> if
>> the
>> connector doesn't want to support it).
>>
>> If we extend FieldReferenceExpression, in the case of "where col.nested >
>> 10",
>> for the connectors already supported filter/delete pushdown, they may
>> wrongly
>> pushdown "col > 10" instead of "nested > 10" because they still treat
>> FieldReferenceExpression as a top-level column. This problem can be
>> resolved
>> by introducing an additional "supportedNestedPushdown" for each interface,
>> but that method is not elegant and is hard to remove in the future, and
>> this could
>> be avoided if we have a separate NestedFieldReferenceExpression.
>>
>> If we want to extend FieldReferenceExpression, we have to add protections
>> for every related API in one shot. Besides, FieldReferenceExpression is a
>> fundamental class in the planner, we have to go through all the code that
>> is using it to make sure it properly handling it if it is a nested field
>> which
>> is a big effort for the community.
>>
>> If we were designing this API on day 1, I fully support merging them in a
>> FieldReferenceExpression. But in this case, I'm thinking about how to
>> provide
>> users with a smooth migration path, and allow the community to gradually
>> put efforts into evolving the API, and not block the "Nested Fields Filter
>> Pushdown"
>> requirement.
>>
>> How about having a separate NestedFieldReferenceExpression, and
>> abstracting a common base class "ReferenceExpression" for
>> NestedFieldReferenceExpression and FieldReferenceExpression? This makes
>> unifying expressions in
>> "SupportsProjectionPushdown#applyProjections(List<ReferenceExpression>
>> ...)"
>> possible.
>>
>> Best,
>> Jark
>>
>> On Thu, 24 Aug 2023 at 07:00, Venkatakrishnan Sowrirajan <
>> vsowr...@asu.edu>
>> wrote:
>>
>> > Becket and Jark,
>> >
>> >  Deprecate all the other
>> > > methods except tryApplyFilters() and tryApplyProjections().
>> >
>> > For *SupportsProjectionPushDown*, we still need a
>> > *supportsNestedProjections* API on the table source as some of the table
>> > sources might not be able to handle nested fields and therefore the
>> Flink
>> > planner should not push down the nested projections or else the
>> > *applyProjection
>> > *API has to be appropriately changed to return
>> > *unconvertibleProjections *similar
>> > to *SupportsFilterPushDown*.
>> >
>> > Or we have to introduce two different applyProjections()
>> > > methods for FieldReferenceExpression / NestedFieldReferenceExpression
>> > > respectively.
>> >
>> > Agree this is not preferred. Given that *supportsNestedProjections* cannot
>> > be deprecated/removed based on the current API form, extending
>> > *FieldReferenceExpression* to support nested fields should be okay.
>> >

Re: [DISCUSS] FLIP-356: Support Nested Fields Filter Pushdown

2023-08-24 Thread Venkatakrishnan Sowrirajan
Jark,

How about having a separate NestedFieldReferenceExpression, and
> abstracting a common base class "ReferenceExpression" for
> NestedFieldReferenceExpression and FieldReferenceExpression? This makes
> unifying expressions in
> "SupportsProjectionPushdown#applyProjections(List<ReferenceExpression>
> ...)"
> possible.

This should be fine for *SupportsProjectionPushDown* and
*SupportsFilterPushDown*. One concern in the case of
*SupportsAggregatePushDown* with nested fields support (to be added in the
future), with this proposal, the API will become backwards incompatible as
the *args* for the aggregate function is *List<FieldReferenceExpression>* that
needs to change to *List<ReferenceExpression>*.

Regards
Venkata krishnan


On Thu, Aug 24, 2023 at 1:18 AM Jark Wu  wrote:

> Hi Becket,
>
> I think it is the second case, that a FieldReferenceExpression is
> constructed
> by the framework and passed to the connector (interfaces listed by
> Venkata[1]
> and Catalog#listPartitionsByFilter). Besides, understanding the nested
> field
> is optional for users/connectors (just treat it as an unknown expression if
> the
> connector doesn't want to support it).
>
> If we extend FieldReferenceExpression, in the case of "where col.nested >
> 10",
> for the connectors already supported filter/delete pushdown, they may
> wrongly
> pushdown "col > 10" instead of "nested > 10" because they still treat
> FieldReferenceExpression as a top-level column. This problem can be
> resolved
> by introducing an additional "supportedNestedPushdown" for each interface,
> but that method is not elegant and is hard to remove in the future, and
> this could
> be avoided if we have a separate NestedFieldReferenceExpression.
>
> If we want to extend FieldReferenceExpression, we have to add protections
> for every related API in one shot. Besides, FieldReferenceExpression is a
> fundamental class in the planner, we have to go through all the code that
> is using it to make sure it properly handling it if it is a nested field
> which
> is a big effort for the community.
>
> If we were designing this API on day 1, I fully support merging them in a
> FieldReferenceExpression. But in this case, I'm thinking about how to
> provide
> users with a smooth migration path, and allow the community to gradually
> put efforts into evolving the API, and not block the "Nested Fields Filter
> Pushdown"
> requirement.
>
> How about having a separate NestedFieldReferenceExpression, and
> abstracting a common base class "ReferenceExpression" for
> NestedFieldReferenceExpression and FieldReferenceExpression? This makes
> unifying expressions in
> "SupportsProjectionPushdown#applyProjections(List<ReferenceExpression>
> ...)"
> possible.
>
> Best,
> Jark
>
> On Thu, 24 Aug 2023 at 07:00, Venkatakrishnan Sowrirajan  >
> wrote:
>
> > Becket and Jark,
> >
> >  Deprecate all the other
> > > methods except tryApplyFilters() and tryApplyProjections().
> >
> > For *SupportsProjectionPushDown*, we still need a
> > *supportsNestedProjections* API on the table source as some of the table
> > sources might not be able to handle nested fields and therefore the Flink
> > planner should not push down the nested projections or else the
> > *applyProjection
> > *API has to be appropriately changed to return
> > *unconvertibleProjections *similar
> > to *SupportsFilterPushDown*.
> >
> > Or we have to introduce two different applyProjections()
> > > methods for FieldReferenceExpression / NestedFieldReferenceExpression
> > > respectively.
> >
> > Agree this is not preferred. Given that *supportsNestedProjections* cannot
> > be deprecated/removed based on the current API form, extending
> > *FieldReferenceExpression* to support nested fields should be okay.
> >
> > Another alternative could be to change *applyProjections *to take
> > List<ReferenceExpression> and on the connector side they choose to handle
> > *FieldReferenceExpression* and *NestedFieldReferenceExpression *as
> > applicable and return the remainingProjections. In the case of nested
> field
> > projections not supported, it should return them back but only projecting
> > the top level fields. IMO, this is also *not preferred*.
> >
> > *SupportsAggregatePushDown*
> >
> > *AggregateExpression *currently takes in a list of
> > *FieldReferenceExpression* as args for the aggregate function, if in
> future
> > *SupportsAggregatePushDown* adds support for aggregate pushdown on nested
> > fields then the AggregateExpression API also has to change if a new
> > NestedFieldReferenceExpression is introduced for nested fields.

Re: [DISCUSS] FLIP-356: Support Nested Fields Filter Pushdown

2023-08-23 Thread Venkatakrishnan Sowrirajan
ration path.
> > The FieldReferenceExpression is widely used in many public APIs,
> > not only in the SupportsFilterPushDown. Yes, we can change every
> > methods in 2-steps, but is it good to change API back and forth for this?
> > Personally, I'm fine with a separate NestedFieldReferenceExpression
> class.
> > TBH, I prefer the separated way because it makes the reference expression
> > more clear and concise.
> >
> > Best,
> > Jark
> >
> >
> > On Tue, 22 Aug 2023 at 16:53, Becket Qin  wrote:
> >
> > > Thanks for the reply, Jark.
> > >
> > > I think it will be helpful to understand the final state we want to
> > > eventually achieve first, then we can discuss the steps towards that
> > final
> > > state.
> > >
> > > It looks like there are two proposed end states now:
> > >
> > > 1. Have a separate NestedFieldReferenceExpression class; keep
> > > SupportsFilterPushDown and SupportsProjectionPushDown the same. It is
> > just
> > > a one step change.
> > >- Regarding the supportsNestedFilterPushDown() method, if our
> contract
> > > with the connector developer today is "The implementation should ignore
> > > unrecognized expressions by putting them into the remaining filters,
> > > instead of throwing exceptions". Then there is no need for this
> method. I
> > > am not sure about the current contract. We should probably make it
> clear
> > in
> > > the interface Java doc.
> > >
> > > 2. Extend the existing FieldReferenceExpression class to support nested
> > > fields; SupportsFilterPushDown only has one method of
> > > applyFilters(List<ResolvedExpression>); SupportsProjectionPushDown only
> > > has one method of applyProjections(List<FieldReferenceExpression>,
> > > DataType).
> > > It could just be two steps if we are not too obsessed with the exact
> > names
> > > of "applyFilters" and "applyProjections". More specifically, it takes
> two
> > > steps to achieve this final state:
> > > a. introduce a new method tryApplyFilters(List<ResolvedExpression>) to
> > > SupportsFilterPushDown, which may have FieldReferenceExpression with
> > nested
> > > fields. The default implementation throws an exception. The runtime
> will
> > > first call tryApplyFilters() with nested fields. In case of exception,
> it
> > > calls the existing applyFilters() without including the nested filters.
> > > Similarly, in SupportsProjectionPushDown, introduce a
> > > tryApplyProjections method returning a Result.
> > > The Result also contains the accepted and unapplicable projections. The
> > > default implementation also throws an exception. Deprecate all the
> other
> > > methods except tryApplyFilters() and tryApplyProjections().
> > > b. remove the deprecated methods in the next major version bump.
> > >
> > > Now the question is putting the migration steps aside, which end state
> do
> > > we prefer? While the first end state is acceptable for me, personally,
> I
> > > prefer the latter if we are designing from scratch. It is clean,
> > consistent
> > > and intuitive. Given the size of Flink, keeping APIs in the same style
> > over
> > > time is important. The migration is also not that complicated.
> > >
> > > Thanks,
> > >
> > > Jiangjie (Becket) Qin
> > >
> > >
> > > On Tue, Aug 22, 2023 at 2:23 PM Jark Wu  wrote:
> > >
> > > > Hi Venkat,
> > > >
> > > > Thanks for the proposal.
> > > >
> > > > I have some minor comments about the FLIP.
> > > >
> > > > 1. I think we don't need to
> > > > add SupportsFilterPushDown#supportsNestedFilters() method,
> > > > because connectors can skip nested filters by putting them in
> > > > Result#remainingFilters().
> > > > And this is backward-compatible because unknown expressions were
> added
> > to
> > > > the remaining filters.
> > > > Planner should push predicate expressions as more as possible. If we
> > add
> > > a
> > > > flag for each new filter,
> > > > the interface will be filled with lots of flags (e.g.,
> supportsBetween,
> > > > supportsIN).
> > > >
> > > > 2. NestedFieldReferenceExpression#nestedFieldName should be an array
> of
> > > > field names?
> > > > Each string represents a field name part of the field p

Re: [Discuss] FLIP-355: Add parent dir of files to classpath using yarn.provided.lib.dirs

2023-08-22 Thread Venkatakrishnan Sowrirajan
Thanks for the FLIP, Archit.

This is definitely quite a useful addition to the *yarn.provided.lib.dirs*
. +1.

IMO, except for the fact that *yarn.provided.lib.dirs* (platform specific
jars can be cached) takes only directories whereas *yarn.ship-files* (user
files) takes both files and dirs, the overall logic in terms of
constructing the classpath in both the cases should be roughly the same.

Referencing the PR (https://github.com/apache/flink/pull/23164) with the
initial implementation you created as well here.

Regards
Venkata krishnan


On Tue, Aug 22, 2023 at 10:09 AM Archit Goyal 
wrote:

> Hi all,
>
> Gentle ping if I can get a review on the FLIP.
>
> Thanks,
> Archit Goyal
>
> From: Archit Goyal 
> Date: Thursday, August 17, 2023 at 5:51 PM
> To: dev@flink.apache.org 
> Subject: [Discuss] FLIP-355: Add parent dir of files to classpath using
> yarn.provided.lib.dirs
> Hi All,
>
> I am opening this thread to discuss the proposal to add parent directories
> of files to classpath when using yarn.provided.lib.dirs. This is documented
> in FLIP-355 <
> https://urldefense.com/v3/__https://cwiki.apache.org/confluence/display/FLINK/FLIP*355*3A*Add*parent*dir*of*files*to*classpath*using*yarn.provided.lib.dirs__;KyUrKysrKysrKys!!IKRxdwAv5BmarQ!fFlyBKWuWwYcWfOcpLflTTi36tyHPiENIUry9J0ygaZY0VURnIs0glu5yafV0w0tfSsnOb9ZxDD9Cjw2TApTekVU$
> >.
>
> This FLIP mentions about enhancing YARN's classpath configuration to
> include parent directories of files in yarn.provided.lib.dirs.
>
> Please feel free to reply to this email thread and share your opinions.
>
> Thanks,
> Archit Goyal
>


Re: [DISCUSS] FLIP-356: Support Nested Fields Filter Pushdown

2023-08-21 Thread Venkatakrishnan Sowrirajan
Sounds like a great suggestion, Becket. +1. Agree with cleaning up the APIs
and making it consistent in all the pushdown APIs.

Your suggested approach seems fine to me, unless anyone else has any other
concerns. Just have couple of clarifying questions:

1. Do you think we should standardize the APIs across all the pushdown
supports like SupportsPartitionPushdown, SupportsDynamicFiltering etc in
the end state?

The current proposal works if we do not want to migrate
> SupportsFilterPushdown to also use NestedFieldReferenceExpression in the
> long term.
>
Did you mean *FieldReferenceExpression* instead of
*NestedFieldReferenceExpression*?

> 2. Extend the FieldReferenceExpression to support nested fields.
> - Change the index field type from int to int[].
> - Add a new method int[] getFieldIndexArray().
> - Deprecate the int getFieldIndex() method, the code will be removed in
> the next major version bump.

I assume getFieldIndex would return fieldIndexArray[0], right?
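That deprecation step could be sketched roughly like this (a hypothetical simplification — the real FieldReferenceExpression carries more state such as the data type and input index), with the deprecated getFieldIndex() returning the top-level component for compatibility:

```java
public class FieldRefSketch {
    static class FieldReferenceExpression {
        private final String name;
        private final int[] fieldIndexArray; // path into a nested row, e.g. {2, 0}

        FieldReferenceExpression(String name, int... fieldIndexArray) {
            this.name = name;
            this.fieldIndexArray = fieldIndexArray;
        }

        int[] getFieldIndexArray() { return fieldIndexArray; }

        /** @deprecated kept for compatibility; returns the top-level index. */
        @Deprecated
        int getFieldIndex() { return fieldIndexArray[0]; }
    }

    public static void main(String[] args) {
        FieldReferenceExpression nested =
                new FieldReferenceExpression("user.address", 2, 0);
        System.out.println(nested.getFieldIndex());             // top-level index
        System.out.println(nested.getFieldIndexArray().length); // depth of the path
    }
}
```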

Thanks
Venkat

On Fri, Aug 18, 2023 at 4:47 PM Becket Qin  wrote:

> Thanks for the proposal, Venkata.
>
> The current proposal works if we do not want to migrate
> SupportsFilterPushdown to also use NestedFieldReferenceExpression in the
> long term.
>
Did you mean *FieldReferenceExpression* instead of
*NestedFieldReferenceExpression*?

>
> Otherwise, the alternative solution briefly mentioned in the rejected
> alternatives would be the following:
> Phase 1:
> 1. Introduce a supportsNestedFilters() method to the SupportsFilterPushdown
> interface. (same as current proposal).
> 2. Extend the FieldReferenceExpression to support nested fields.
> - Change the index field type from int to int[].

> - Add a new method int[] getFieldIndexArray().
> - Deprecate the int getFieldIndex() method, the code will be removed in
> the next major version bump.
> 3. In the SupportsProjectionPushDown interface
> - add a new method applyProjection(List<FieldReferenceExpression>,
> DataType), with default implementation invoking applyProjection(int[][],
> DataType)
> - deprecate the current applyProjection(int[][], DataType) method
>
> Phase 2 (in the next major version bump)
> 1. remove the deprecated methods.
>
> Phase 3 (optional)
> 1. deprecate and remove the supportsNestedFilters() /
> supportsNestedProjection() methods from the SupportsFilterPushDown /
> SupportsProjectionPushDown interfaces.
>
> Personally I prefer this alternative. It takes longer to finish the work,
> but the API eventually becomes clean and consistent. But I can live with
> the current proposal.
>
> Thanks,
>
> Jiangjie (Becket) Qin
>
> On Sat, Aug 19, 2023 at 12:09 AM Venkatakrishnan Sowrirajan <
> vsowr...@asu.edu> wrote:
>
> > Gentle ping for reviews/feedback.
> >
> > On Tue, Aug 15, 2023, 5:37 PM Venkatakrishnan Sowrirajan <
> vsowr...@asu.edu
> > >
> > wrote:
> >
> > > Hi All,
> > >
> > > I am opening this thread to discuss FLIP-356: Support Nested Fields
> > > Filter Pushdown. The FLIP can be found at
> > >
> >
> https://urldefense.com/v3/__https://cwiki.apache.org/confluence/display/FLINK/FLIP-356*3A*Support*Nested*Fields*Filter*Pushdown__;JSsrKysr!!IKRxdwAv5BmarQ!clxXJwshKpn559SAkQiieqgGe0ZduXCzUKCmYLtFIbQLmrmEEgdmuEIM8ZM1M3O_uGqOploU4ailqGpukAg$
> > >
> > > This FLIP adds support for pushing down nested fields filters to the
> > > underlying TableSource. In our data lake, we find a lot of datasets
> have
> > > nested fields and also user queries with filters defined on the nested
> > > fields. This would drastically improve the performance for those sets
> of
> > > queries.
> > >
> > > Appreciate any comments or feedback you may have on this proposal.
> > >
> > > Regards
> > > Venkata krishnan
> > >
> >
>


Re: [DISCUSS] FLIP-356: Support Nested Fields Filter Pushdown

2023-08-18 Thread Venkatakrishnan Sowrirajan
Gentle ping for reviews/feedback.

On Tue, Aug 15, 2023, 5:37 PM Venkatakrishnan Sowrirajan 
wrote:

> Hi All,
>
> I am opening this thread to discuss FLIP-356: Support Nested Fields
> Filter Pushdown. The FLIP can be found at
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-356%3A+Support+Nested+Fields+Filter+Pushdown
>
> This FLIP adds support for pushing down nested fields filters to the
> underlying TableSource. In our data lake, we find a lot of datasets have
> nested fields and also user queries with filters defined on the nested
> fields. This would drastically improve the performance for those sets of
> queries.
>
> Appreciate any comments or feedback you may have on this proposal.
>
> Regards
> Venkata krishnan
>


Re: FLINK-20767 - Support for nested fields filter push down

2023-08-18 Thread Venkatakrishnan Sowrirajan
Gentle ping

On Wed, Aug 16, 2023, 11:56 AM Venkatakrishnan Sowrirajan 
wrote:

> Forgot to share the link -
> https://lists.apache.org/thread/686bhgwrrb4xmbfzlk60szwxos4z64t7 in the
> last email.
>
> Regards
> Venkata krishnan
>
>
> On Wed, Aug 16, 2023 at 11:55 AM Venkatakrishnan Sowrirajan <
> vsowr...@asu.edu> wrote:
>
>> Btw, this is the FLIP proposal discussion thread. Please share your
>> thoughts. Thanks.
>>
>> Regards
>> Venkata krishnan
>>
>>
>> On Sun, Aug 13, 2023 at 6:35 AM liu ron  wrote:
>>
>>> Hi, Venkata krishnan
>>>
>>> Thanks for driving this work, look forward to your FLIP.
>>>
>>> Best,
>>> Ron
>>>
>>> Venkatakrishnan Sowrirajan  于2023年8月13日周日 14:34写道:
>>>
>>> > Thanks Yunhong. That's correct. I am able to make it work locally.
>>> > Currently, in the process of writing a FLIP for the necessary changes
>>> to
>>> > the SupportsFilterPushDown API to support nested fields filter push
>>> down.
>>> >
>>> > Regards
>>> > Venkata krishnan
>>> >
>>> >
>>> > On Mon, Aug 7, 2023 at 8:28 PM yh z  wrote:
>>> >
>>> > > Hi Venkatakrishnan,
>>> > > Sorry for the late reply. I have looked at the code and feel like you
>>> > need
>>> > > to modify the logic of the
>>> > > ExpressionConverter.visit(FieldReferenceExpression expression)
>>> method to
>>> > > support nested types,
>>> > > which are not currently supported in currently code.
>>> > >
>>> > > Regards,
>>> > > Yunhong Zheng (Swuferhong)
>>> > >
>>> > > Venkatakrishnan Sowrirajan  于2023年8月7日周一 13:30写道:
>>> > >
>>> > > > (Sorry, I pressed send too early)
>>> > > >
>>> > > > Thanks for the help @zhengyunhon...@gmail.com.
>>> > > >
>>> > > > Agree on not changing the API as much as possible as well as wrt
>>> > > > simplifying Projection pushdown with nested fields eventually as
>>> well.
>>> > > >
>>> > > > In terms of the code itself, currently I am trying to leverage the
>>> > > > FieldReferenceExpression to also handle nested fields for filter
>>> push
>>> > > down.
>>> > > > But where I am currently struggling to make progress is, once the
>>> > filters
>>> > > > are pushed to the table source itself, in
>>> > > >
>>> >
>>> PushFilterIntoSourceScanRuleBase#resolveFiltersAndCreateTableSourceTable
>>> > > > there is a conversion from List>> > > > FieldReferenceExpression) to the List itself.
>>> > > >
>>> > > > If you have some pointers for that, please let me know. Thanks.
>>> > > >
>>> > > > Regards
>>> > > > Venkata krishnan
>>> > > >
>>> > > >
>>> > > > On Sun, Aug 6, 2023 at 10:23 PM Venkatakrishnan Sowrirajan <
>>> > > > vsowr...@asu.edu>
>>> > > > wrote:
>>> > > >
>>> > > > > Thanks @zhengyunhon...@gmail.com
>>> > > > > Regards
>>> > > > > Venkata krishnan
>>> > > > >
>>> > > > >
>>> > > > > On Sun, Aug 6, 2023 at 6:16 PM yh z 
>>> > wrote:
>>> > > > >
>>> > > > >> Hi, Venkatakrishnan,
>>> > > > >> I think this is a very useful feature. I have been focusing on
>>> the
>>> > > > >> development of the flink-table-planner module recently, so if
>>> you
>>> > need
>>> > > > >> some
>>> > > > >> help, I can assist you in completing the development of some
>>> > sub-tasks
>>> > > > or
>>> > > > >> code review.
>>> > > > >>
>>> > > > >> Returning to the design itself, I think it's necessary to modify
>>> > > > >> FieldReferenceExpression or re-implement a
>>> > > > NestedFieldReferenceExpression.
>>> > > > >> As for modifying the interface of SupportsProjectionPushDown, I
>>> > think
>>> > > we
>>> > > > >> ne

Re: FLINK-20767 - Support for nested fields filter push down

2023-08-16 Thread Venkatakrishnan Sowrirajan
Forgot to share the link -
https://lists.apache.org/thread/686bhgwrrb4xmbfzlk60szwxos4z64t7 in the
last email.

Regards
Venkata krishnan


On Wed, Aug 16, 2023 at 11:55 AM Venkatakrishnan Sowrirajan <
vsowr...@asu.edu> wrote:

> Btw, this is the FLIP proposal discussion thread. Please share your
> thoughts. Thanks.
>
> Regards
> Venkata krishnan
>
>
> On Sun, Aug 13, 2023 at 6:35 AM liu ron  wrote:
>
>> Hi, Venkata krishnan
>>
>> Thanks for driving this work, look forward to your FLIP.
>>
>> Best,
>> Ron
>>
>> Venkatakrishnan Sowrirajan  于2023年8月13日周日 14:34写道:
>>
>> > Thanks Yunhong. That's correct. I am able to make it work locally.
>> > Currently, in the process of writing a FLIP for the necessary changes to
>> > the SupportsFilterPushDown API to support nested fields filter push
>> down.
>> >
>> > Regards
>> > Venkata krishnan
>> >
>> >
>> > On Mon, Aug 7, 2023 at 8:28 PM yh z  wrote:
>> >
>> > > Hi Venkatakrishnan,
>> > > Sorry for the late reply. I have looked at the code and feel like you
>> > need
>> > > to modify the logic of the
>> > > ExpressionConverter.visit(FieldReferenceExpression expression) method
>> to
>> > > support nested types,
>> > > which are not currently supported in the current code.
>> > >
>> > > Regards,
>> > > Yunhong Zheng (Swuferhong)
>> > >
>> > > Venkatakrishnan Sowrirajan  于2023年8月7日周一 13:30写道:
>> > >
>> > > > (Sorry, I pressed send too early)
>> > > >
>> > > > Thanks for the help @zhengyunhon...@gmail.com.
>> > > >
>> > > > Agree on not changing the API as much as possible as well as wrt
>> > > > simplifying Projection pushdown with nested fields eventually as
>> well.
>> > > >
>> > > > In terms of the code itself, currently I am trying to leverage the
>> > > > FieldReferenceExpression to also handle nested fields for filter
>> push
>> > > down.
>> > > > But where I am currently struggling to make progress is, once the
>> > filters
>> > > > are pushed to the table source itself, in
>> > > >
>> > PushFilterIntoSourceScanRuleBase#resolveFiltersAndCreateTableSourceTable
>> > > > there is a conversion from List<ResolvedExpression> (i.e.
>> > > > FieldReferenceExpression) to the List<RexNode> itself.
>> > > >
>> > > > If you have some pointers for that, please let me know. Thanks.
>> > > >
>> > > > Regards
>> > > > Venkata krishnan
>> > > >
>> > > >
>> > > > On Sun, Aug 6, 2023 at 10:23 PM Venkatakrishnan Sowrirajan <
>> > > > vsowr...@asu.edu>
>> > > > wrote:
>> > > >
>> > > > > Thanks @zhengyunhon...@gmail.com
>> > > > > Regards
>> > > > > Venkata krishnan
>> > > > >
>> > > > >
>> > > > > On Sun, Aug 6, 2023 at 6:16 PM yh z 
>> > wrote:
>> > > > >
>> > > > >> Hi, Venkatakrishnan,
>> > > > >> I think this is a very useful feature. I have been focusing on
>> the
>> > > > >> development of the flink-table-planner module recently, so if you
>> > need
>> > > > >> some
>> > > > >> help, I can assist you in completing the development of some
>> > sub-tasks
>> > > > or
>> > > > >> code review.
>> > > > >>
>> > > > >> Returning to the design itself, I think it's necessary to modify
>> > > > >> FieldReferenceExpression or re-implement a
>> > > > NestedFieldReferenceExpression.
>> > > > >> As for modifying the interface of SupportsProjectionPushDown, I
>> > think
>> > > we
>> > > > >> need to make some trade-offs. As a connector developer, the
>> > stability
>> > > of
>> > > > >> the interface is very important. If there are no unresolved
>> bugs, I
>> > > > >> personally do not recommend modifying the interface. However,
>> when I
>> > > > first
>> > > > >> read the code of SupportsProjectionPushDown, the design of
>> int[][]
>> > was
>> > > > >> very
>> > > > >> confusing for me, and it took me a long time to understand it by
>> > > > >> running specific UT tests. Therefore, in terms of the design of this
>> > > > >> interface and the consistency between different interfaces, there is
>> > > > >> indeed room for improvement.

Re: FLINK-20767 - Support for nested fields filter push down

2023-08-16 Thread Venkatakrishnan Sowrirajan
Btw, this is the FLIP proposal discussion thread. Please share your
thoughts. Thanks.

Regards
Venkata krishnan


On Sun, Aug 13, 2023 at 6:35 AM liu ron  wrote:

> Hi, Venkata krishnan
>
> Thanks for driving this work, look forward to your FLIP.
>
> Best,
> Ron
>
> Venkatakrishnan Sowrirajan  于2023年8月13日周日 14:34写道:
>
> > Thanks Yunhong. That's correct. I am able to make it work locally.
> > Currently, in the process of writing a FLIP for the necessary changes to
> > the SupportsFilterPushDown API to support nested fields filter push down.
> >
> > Regards
> > Venkata krishnan
> >
> >
> > On Mon, Aug 7, 2023 at 8:28 PM yh z  wrote:
> >
> > > Hi Venkatakrishnan,
> > > Sorry for the late reply. I have looked at the code and feel like you
> > need
> > > to modify the logic of the
> > > ExpressionConverter.visit(FieldReferenceExpression expression) method
> to
> > > support nested types,
> > > which are not currently supported in the current code.
> > >
> > > Regards,
> > > Yunhong Zheng (Swuferhong)
> > >
> > > Venkatakrishnan Sowrirajan  于2023年8月7日周一 13:30写道:
> > >
> > > > (Sorry, I pressed send too early)
> > > >
> > > > Thanks for the help @zhengyunhon...@gmail.com.
> > > >
> > > > Agree on not changing the API as much as possible as well as wrt
> > > > simplifying Projection pushdown with nested fields eventually as
> well.
> > > >
> > > > In terms of the code itself, currently I am trying to leverage the
> > > > FieldReferenceExpression to also handle nested fields for filter push
> > > down.
> > > > But where I am currently struggling to make progress is, once the
> > filters
> > > > are pushed to the table source itself, in
> > > >
> > PushFilterIntoSourceScanRuleBase#resolveFiltersAndCreateTableSourceTable
> > > > there is a conversion from List<ResolvedExpression> (i.e.
> > > > FieldReferenceExpression) to the List<RexNode> itself.
> > > >
> > > > If you have some pointers for that, please let me know. Thanks.
> > > >
> > > > Regards
> > > > Venkata krishnan
> > > >
> > > >
> > > > On Sun, Aug 6, 2023 at 10:23 PM Venkatakrishnan Sowrirajan <
> > > > vsowr...@asu.edu>
> > > > wrote:
> > > >
> > > > > Thanks @zhengyunhon...@gmail.com
> > > > > Regards
> > > > > Venkata krishnan
> > > > >
> > > > >
> > > > > On Sun, Aug 6, 2023 at 6:16 PM yh z 
> > wrote:
> > > > >
> > > > >> Hi, Venkatakrishnan,
> > > > >> I think this is a very useful feature. I have been focusing on the
> > > > >> development of the flink-table-planner module recently, so if you
> > need
> > > > >> some
> > > > >> help, I can assist you in completing the development of some
> > sub-tasks
> > > > or
> > > > >> code review.
> > > > >>
> > > > >> Returning to the design itself, I think it's necessary to modify
> > > > >> FieldReferenceExpression or re-implement a
> > > > NestedFieldReferenceExpression.
> > > > >> As for modifying the interface of SupportsProjectionPushDown, I
> > think
> > > we
> > > > >> need to make some trade-offs. As a connector developer, the
> > stability
> > > of
> > > > >> the interface is very important. If there are no unresolved bugs,
> I
> > > > >> personally do not recommend modifying the interface. However,
> when I
> > > > first
> > > > >> read the code of SupportsProjectionPushDown, the design of int[][]
> > was
> > > > >> very
> > > > >> confusing for me, and it took me a long time to understand it by
> > > running
> > > > >> specific UT tests. Therefore, in terms of the design of this
> > interface
> > > > and
> > > > >> the consistency between different interfaces, there is indeed room
> > for
> > > > >> improvement.
> > > > >>
> > > > >> Thanks,
> > > > >> Yunhong Zheng (Swuferhong)
> > > > >>
> > > > >>
> > > > >>
> > > > >>
> > > > >> Becket Qin  于2023年8月3日周四 07:44写道:
> > > > >>

[DISCUSS] FLIP-356: Support Nested Fields Filter Pushdown

2023-08-15 Thread Venkatakrishnan Sowrirajan
Hi All,

I am opening this thread to discuss FLIP-356: Support Nested Fields Filter
Pushdown. The FLIP can be found at
https://cwiki.apache.org/confluence/display/FLINK/FLIP-356%3A+Support+Nested+Fields+Filter+Pushdown

This FLIP adds support for pushing down nested fields filters to the
underlying TableSource. In our data lake, we find a lot of datasets have
nested fields and also user queries with filters defined on the nested
fields. This would drastically improve the performance for those sets of
queries.

Appreciate any comments or feedback you may have on this proposal.

Regards
Venkata krishnan


Re: FLINK-20767 - Support for nested fields filter push down

2023-08-13 Thread Venkatakrishnan Sowrirajan
Thanks Yunhong. That's correct. I am able to make it work locally.
Currently, in the process of writing a FLIP for the necessary changes to
the SupportsFilterPushDown API to support nested fields filter push down.

Regards
Venkata krishnan


On Mon, Aug 7, 2023 at 8:28 PM yh z  wrote:

> Hi Venkatakrishnan,
> Sorry for the late reply. I have looked at the code and feel like you need
> to modify the logic of the
> ExpressionConverter.visit(FieldReferenceExpression expression) method to
> support nested types,
> which are not currently supported in the current code.
>
> Regards,
> Yunhong Zheng (Swuferhong)
>
> Venkatakrishnan Sowrirajan  于2023年8月7日周一 13:30写道:
>
> > (Sorry, I pressed send too early)
> >
> > Thanks for the help @zhengyunhon...@gmail.com.
> >
> > Agree on not changing the API as much as possible as well as wrt
> > simplifying Projection pushdown with nested fields eventually as well.
> >
> > In terms of the code itself, currently I am trying to leverage the
> > FieldReferenceExpression to also handle nested fields for filter push
> down.
> > But where I am currently struggling to make progress is, once the filters
> > are pushed to the table source itself, in
> > PushFilterIntoSourceScanRuleBase#resolveFiltersAndCreateTableSourceTable
> > there is a conversion from List<ResolvedExpression> (i.e.
> > FieldReferenceExpression) to the List<RexNode> itself.
> >
> > If you have some pointers for that, please let me know. Thanks.
> >
> > Regards
> > Venkata krishnan
> >
> >
> > On Sun, Aug 6, 2023 at 10:23 PM Venkatakrishnan Sowrirajan <
> > vsowr...@asu.edu>
> > wrote:
> >
> > > Thanks @zhengyunhon...@gmail.com
> > > Regards
> > > Venkata krishnan
> > >
> > >
> > > On Sun, Aug 6, 2023 at 6:16 PM yh z  wrote:
> > >
> > >> Hi, Venkatakrishnan,
> > >> I think this is a very useful feature. I have been focusing on the
> > >> development of the flink-table-planner module recently, so if you need
> > >> some
> > >> help, I can assist you in completing the development of some sub-tasks
> > or
> > >> code review.
> > >>
> > >> Returning to the design itself, I think it's necessary to modify
> > >> FieldReferenceExpression or re-implement a
> > NestedFieldReferenceExpression.
> > >> As for modifying the interface of SupportsProjectionPushDown, I think
> we
> > >> need to make some trade-offs. As a connector developer, the stability
> of
> > >> the interface is very important. If there are no unresolved bugs, I
> > >> personally do not recommend modifying the interface. However, when I
> > first
> > >> read the code of SupportsProjectionPushDown, the design of int[][] was
> > >> very
> > >> confusing for me, and it took me a long time to understand it by
> running
> > >> specific UT tests. Therefore, in terms of the design of this interface
> > and
> > >> the consistency between different interfaces, there is indeed room for
> > >> improvement.
> > >>
> > >> Thanks,
> > >> Yunhong Zheng (Swuferhong)
> > >>
> > >>
> > >>
> > >>
> > >> Becket Qin  于2023年8月3日周四 07:44写道:
> > >>
> > >> > Hi Jark,
> > >> >
> > >> > If the FieldReferenceExpression contains an int[] to support a
> nested
> > >> field
> > >> > reference, List<FieldReferenceExpression> (or
> > >> FieldReferenceExpression[])
> > >> > and int[][] are actually equivalent. If we are designing this from
> > >> scratch,
> > >> > personally I prefer using List<FieldReferenceExpression> for
> > >> consistency,
> > >> > i.e. always resolving everything to expressions for users.
> Projection
> > >> is a
> > >> > simpler case, but should not be a special case. This avoids doing
> the
> > >> same
> > >> > thing in different ways which is also a confusion to the users. To
> me,
> > >> the
> > >> > int[][] format would become kind of a technical debt after we extend
> > the
> > >> > FieldReferenceExpression. Although we don't have to address it right
> > >> away
> > >> > in the same FLIP, this kind of debt accumulates over time and makes
> > the
> > >> > project harder to learn and maintain. So, personally I prefer to
> > address
> > >> > these technical debts as soon as possible.

Re: [DISCUSS] FLIP-323: Support Attached Execution on Flink Application Completion for Batch Jobs

2023-08-08 Thread Venkatakrishnan Sowrirajan
This is definitely a useful feature especially for the flink batch
execution workloads using flow orchestrators like Airflow, Azkaban, Oozie
etc. Thanks for reviving this issue and starting a FLIP.

Regards
Venkata krishnan


On Mon, Aug 7, 2023 at 4:09 PM Allison Chang 
wrote:

> Hi all,
>
> I am opening this thread to discuss this proposal to support attached
> execution on Flink Application Completion for Batch Jobs. The link to the
> FLIP proposal is here:
> https://urldefense.com/v3/__https://cwiki.apache.org/confluence/display/FLINK/FLIP-323*3A*Support*Attached*Execution*on*Flink*Application*Completion*for*Batch*Jobs__;JSsrKysrKysrKys!!IKRxdwAv5BmarQ!friFO6bJub5FKSLhPIzA6kv-7uffv-zXlv9ZLMKqj_xMcmZl62HhsgvwDXSCS5hfSeyHZgoAVSFg3fk7ChaAFNKi$
>
> This FLIP proposes adding back attached execution for Application Mode. In
> the past attached execution was supported for the per-job mode, which will
> be deprecated and we want to include this feature back into Application
> mode.
>
> Please reply to this email thread and share your thoughts/opinions.
>
> Thank you!
>
> Allison Chang
>


Re: FLINK-20767 - Support for nested fields filter push down

2023-08-06 Thread Venkatakrishnan Sowrirajan
(Sorry, I pressed send too early)

Thanks for the help @zhengyunhon...@gmail.com.

Agree on not changing the API as much as possible as well as wrt
simplifying Projection pushdown with nested fields eventually as well.

In terms of the code itself, currently I am trying to leverage the
FieldReferenceExpression to also handle nested fields for filter push down.
But where I am currently struggling to make progress is, once the filters
are pushed to the table source itself, in
PushFilterIntoSourceScanRuleBase#resolveFiltersAndCreateTableSourceTable
there is a conversion from List<ResolvedExpression> (i.e.
FieldReferenceExpression) to the List<RexNode> itself.

If you have some pointers for that, please let me know. Thanks.

Regards
Venkata krishnan


On Sun, Aug 6, 2023 at 10:23 PM Venkatakrishnan Sowrirajan 
wrote:

> Thanks @zhengyunhon...@gmail.com
> Regards
> Venkata krishnan
>
>
> On Sun, Aug 6, 2023 at 6:16 PM yh z  wrote:
>
>> Hi, Venkatakrishnan,
>> I think this is a very useful feature. I have been focusing on the
>> development of the flink-table-planner module recently, so if you need
>> some
>> help, I can assist you in completing the development of some sub-tasks or
>> code review.
>>
>> Returning to the design itself, I think it's necessary to modify
>> FieldReferenceExpression or re-implement a NestedFieldReferenceExpression.
>> As for modifying the interface of SupportsProjectionPushDown, I think we
>> need to make some trade-offs. As a connector developer, the stability of
>> the interface is very important. If there are no unresolved bugs, I
>> personally do not recommend modifying the interface. However, when I first
>> read the code of SupportsProjectionPushDown, the design of int[][] was
>> very
>> confusing for me, and it took me a long time to understand it by running
>> specific UT tests. Therefore, in terms of the design of this interface and
>> the consistency between different interfaces, there is indeed room for
>> improvement.
>>
>> Thanks,
>> Yunhong Zheng (Swuferhong)
>>
>>
>>
>>
>> Becket Qin  于2023年8月3日周四 07:44写道:
>>
>> > Hi Jark,
>> >
>> > If the FieldReferenceExpression contains an int[] to support a nested
>> field
>> > reference, List<FieldReferenceExpression> (or
>> FieldReferenceExpression[])
>> > and int[][] are actually equivalent. If we are designing this from
>> scratch,
>> > personally I prefer using List<FieldReferenceExpression> for
>> consistency,
>> > i.e. always resolving everything to expressions for users. Projection
>> is a
>> > simpler case, but should not be a special case. This avoids doing the
>> same
>> > thing in different ways which is also a confusion to the users. To me,
>> the
>> > int[][] format would become kind of a technical debt after we extend the
>> > FieldReferenceExpression. Although we don't have to address it right
>> away
>> > in the same FLIP, this kind of debt accumulates over time and makes the
>> > project harder to learn and maintain. So, personally I prefer to address
>> > these technical debts as soon as possible.
>> >
>> > Thanks,
>> >
>> > Jiangjie (Becket) Qin
>> >
>> > On Wed, Aug 2, 2023 at 8:19 PM Jark Wu  wrote:
>> >
>> > > Hi,
>> > >
>> > > I agree with Becket that we may need to extend
>> FieldReferenceExpression
>> > to
>> > > support nested field access (or maybe a new
>> > > NestedFieldReferenceExpression).
>> > > But I have some concerns about evolving the
>> > > SupportsProjectionPushDown.applyProjection.
>> > > A projection is much simpler than Filter Expression which only needs
>> to
>> > > represent the field indexes.
>> > > If we evolve `applyProjection` to accept
>> `List<FieldReferenceExpression>
>> > > projectedFields`,
>> > > users have to convert the `List<FieldReferenceExpression>` back to
>> > int[][]
>> > > which is an overhead for users.
>> > > Field indexes (int[][]) is required to project schemas with the
>> > > utility org.apache.flink.table.connector.Projection.
>> > >
>> > >
>> > > Best,
>> > > Jark
>> > >
>> > >
>> > >
>> > > On Wed, 2 Aug 2023 at 07:40, Venkatakrishnan Sowrirajan <
>> > vsowr...@asu.edu>
>> > > wrote:
>> > >
>> > > > Thanks Becket for the suggestion. That makes sense. Let me try it
>> out
>> > and
>> > > > get back to you.
>> > > >
>> > > > Regards
>> > > > Venkata krishnan
>>

Re: FLINK-20767 - Support for nested fields filter push down

2023-08-06 Thread Venkatakrishnan Sowrirajan
Thanks @zhengyunhon...@gmail.com
Regards
Venkata krishnan


On Sun, Aug 6, 2023 at 6:16 PM yh z  wrote:

> Hi, Venkatakrishnan,
> I think this is a very useful feature. I have been focusing on the
> development of the flink-table-planner module recently, so if you need some
> help, I can assist you in completing the development of some sub-tasks or
> code review.
>
> Returning to the design itself, I think it's necessary to modify
> FieldReferenceExpression or re-implement a NestedFieldReferenceExpression.
> As for modifying the interface of SupportsProjectionPushDown, I think we
> need to make some trade-offs. As a connector developer, the stability of
> the interface is very important. If there are no unresolved bugs, I
> personally do not recommend modifying the interface. However, when I first
> read the code of SupportsProjectionPushDown, the design of int[][] was very
> confusing for me, and it took me a long time to understand it by running
> specific UT tests. Therefore, in terms of the design of this interface and
> the consistency between different interfaces, there is indeed room for
> improvement.
>
> Thanks,
> Yunhong Zheng (Swuferhong)
>
>
>
>
> Becket Qin  于2023年8月3日周四 07:44写道:
>
> > Hi Jark,
> >
> > If the FieldReferenceExpression contains an int[] to support a nested
> field
> > reference, List<FieldReferenceExpression> (or FieldReferenceExpression[])
> > and int[][] are actually equivalent. If we are designing this from
> scratch,
> > personally I prefer using List<FieldReferenceExpression> for consistency,
> > i.e. always resolving everything to expressions for users. Projection is
> a
> > simpler case, but should not be a special case. This avoids doing the
> same
> > thing in different ways which is also a confusion to the users. To me,
> the
> > int[][] format would become kind of a technical debt after we extend the
> > FieldReferenceExpression. Although we don't have to address it right away
> > in the same FLIP, this kind of debt accumulates over time and makes the
> > project harder to learn and maintain. So, personally I prefer to address
> > these technical debts as soon as possible.
> >
> > Thanks,
> >
> > Jiangjie (Becket) Qin
> >
> > On Wed, Aug 2, 2023 at 8:19 PM Jark Wu  wrote:
> >
> > > Hi,
> > >
> > > I agree with Becket that we may need to extend FieldReferenceExpression
> > to
> > > support nested field access (or maybe a new
> > > NestedFieldReferenceExpression).
> > > But I have some concerns about evolving the
> > > SupportsProjectionPushDown.applyProjection.
> > > A projection is much simpler than Filter Expression which only needs to
> > > represent the field indexes.
> > > If we evolve `applyProjection` to accept
> `List<FieldReferenceExpression>
> > > projectedFields`,
> > > users have to convert the `List<FieldReferenceExpression>` back to
> > int[][]
> > > which is an overhead for users.
> > > Field indexes (int[][]) is required to project schemas with the
> > > utility org.apache.flink.table.connector.Projection.
> > >
> > >
> > > Best,
> > > Jark
> > >
> > >
> > >
> > > On Wed, 2 Aug 2023 at 07:40, Venkatakrishnan Sowrirajan <
> > vsowr...@asu.edu>
> > > wrote:
> > >
> > > > Thanks Becket for the suggestion. That makes sense. Let me try it out
> > and
> > > > get back to you.
> > > >
> > > > Regards
> > > > Venkata krishnan
> > > >
> > > >
> > > > On Tue, Aug 1, 2023 at 9:04 AM Becket Qin 
> > wrote:
> > > >
> > > > > This is a very useful feature in practice.
> > > > >
> > > > > It looks to me that the key issue here is that Flink
> > ResolvedExpression
> > > > > does not have necessary abstraction for nested field access. So the
> > > > Calcite
> > > > > RexFieldAccess does not have a counterpart in the
> ResolvedExpression.
> > > The
> > > > > FieldReferenceExpression only supports direct access to the fields,
> > not
> > > > > nested access.
> > > > >
> > > > > Theoretically speaking, this nested field reference is also
> required
> > by
> > > > > projection pushdown. However, we addressed that by using an int[][]
> > in
> > > > the
> > > > > SupportsProjectionPushDown interface. Maybe we can do the
> following:
> > > > >
> > > > > 1. Extend the FieldReferenceExpression to include an int[] for
> nested
> > >
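[Editor's note: the int[][] format debated in the quoted messages encodes one index path per projected (possibly nested) field. The sketch below illustrates only those semantics; the row layout and class names are invented for illustration and are not Flink's actual classes.]

```java
import java.util.List;

// Sketch of the int[][] projection semantics: each inner array is one index
// path, e.g. {1, 1} means "nested field 1 inside top-level field 1".
// Rows are modeled as List<Object> purely for this demo.
public class ProjectionPathDemo {

    // Resolve one index path against a nested row, where a nested row
    // is itself a List<Object>.
    @SuppressWarnings("unchecked")
    static Object resolve(List<Object> row, int[] path) {
        Object current = row;
        for (int idx : path) {
            current = ((List<Object>) current).get(idx);
        }
        return current;
    }

    public static void main(String[] args) {
        // row schema: (id INT, user ROW(name STRING, age INT))
        List<Object> row = List.of(42, List.of("alice", 30));
        int[][] projectedFields = {{0}, {1, 1}}; // project id and user.age
        System.out.println(resolve(row, projectedFields[0])); // 42
        System.out.println(resolve(row, projectedFields[1])); // 30
    }
}
```

This also shows why the format reads awkwardly at first: the nesting information lives entirely in the positions of the inner arrays, with no field names attached.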

Re: FLINK-20767 - Support for nested fields filter push down

2023-08-01 Thread Venkatakrishnan Sowrirajan
Thanks Becket for the suggestion. That makes sense. Let me try it out and
get back to you.

Regards
Venkata krishnan


On Tue, Aug 1, 2023 at 9:04 AM Becket Qin  wrote:

> This is a very useful feature in practice.
>
> It looks to me that the key issue here is that Flink ResolvedExpression
> does not have necessary abstraction for nested field access. So the Calcite
> RexFieldAccess does not have a counterpart in the ResolvedExpression. The
> FieldReferenceExpression only supports direct access to the fields, not
> nested access.
>
> Theoretically speaking, this nested field reference is also required by
> projection pushdown. However, we addressed that by using an int[][] in the
> SupportsProjectionPushDown interface. Maybe we can do the following:
>
> 1. Extend the FieldReferenceExpression to include an int[] for nested field
> access,
> 2. By doing (1),
> SupportsFilterPushDown#applyFilters(List<ResolvedExpression>) can support
> nested field access.
> 3. Evolve the SupportsProjectionPushDown.applyProjection(int[][]
> projectedFields, DataType producedDataType) to
> applyProjection(List<FieldReferenceExpression> projectedFields, DataType
> producedDataType)
>
> This will need a FLIP.
>
> Thanks,
>
> Jiangjie (Becket) Qin
>
> On Tue, Aug 1, 2023 at 11:42 PM Venkatakrishnan Sowrirajan <
> vsowr...@asu.edu>
> wrote:
>
> > Thanks for the response. Looking forward to your pointers. In the
> > meanwhile, let me figure out how we can implement it. Will keep you
> posted.
> >
> > On Mon, Jul 31, 2023, 11:43 PM liu ron  wrote:
> >
> > > Hi, Venkata
> > >
> > > Thanks for reporting this issue. Currently, Flink doesn't support
> nested
> > > filter pushdown. I also think that this optimization would be useful,
> > > especially for jobs, which may need to read a lot of data from the
> > parquet
> > > or orc file. We didn't move forward with this for some priority
> reasons.
> > >
> > > Regarding your three questions, I will respond to you later after my
> > > on-call is finished because I need to dive into the source code. About
> > your
> > > commit, I don't think it's the right solution because
> > > FieldReferenceExpression doesn't currently support nested field filter
> > > pushdown, maybe we need to extend it.
> > >
> > > You can also look further into reasonable solutions, which we'll
> discuss
> > > further later on.
> > >
> > > Best,
> > > Ron
> > >
> > >
> > > Venkatakrishnan Sowrirajan  于2023年7月29日周六 03:31写道:
> > >
> > > > Hi all,
> > > >
> > > > Currently, I am working on adding support for nested fields filter
> push
> > > > down. In our use case running Flink on Batch, we found nested fields
> > > filter
> > > > push down is key - without it, it is significantly slow. Note: Spark
> > SQL
> > > > supports nested fields filter push down.
> > > >
> > > > While debugging the code using IcebergTableSource as the table
> source,
> > > > narrowed down the issue to missing support for
> > > > RexNodeExtractor#RexNodeToExpressionConverter#visitFieldAccess.
> > > > As part of fixing it, I made changes by returning an
> > > > Option(FieldReferenceExpression)
> > > > with appropriate reference to the parent index and the child index
> for
> > > the
> > > > nested field with the data type info.
> > > >
> > > > But this new ResolvedExpression cannot be converted to RexNode which
> > > > happens in PushFilterIntoSourceScanRuleBase
> > > > <
> > > >
> > >
> >
> https://urldefense.com/v3/__https://github.com/apache/flink/blob/3f63e03e83144e9857834f8db1895637d2aa218a/flink-table/flink-table-planner/src/main/java/org/apache/flink/table/planner/plan/rules/logical/PushFilterIntoSourceScanRuleBase.java*L104__;Iw!!IKRxdwAv5BmarQ!fNgxcul8ZGwkNE9ygOeVGlWlU6m_MLMXf4A3S3oQu9LBzYTPF90pZ7uXSGMr-5dFmzRn37-e9Q5cMnVs$
> > > > >
> > > > .
> > > >
> > > > Few questions
> > > >
> > > > 1. Does FieldReferenceExpression support nested fields currently or
> > > should
> > > > it be extended to support nested fields? I couldn't figure this out
> > from
> > > > the PushProjectIntoTableScanRule that supports nested column
> projection
> > > > push down.
> > > > 2. ExpressionConverter
> > > > <
> > > >
> > >
> >
> https://urldefense.com/v3/__htt
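[Editor's note: the three-step proposal quoted above can be sketched as hypothetical API shapes. The names below mirror the thread's wording, not actual Flink signatures, and the conversion helper illustrates Jark's concern that expression-based projections still need an int[][] for index-based schema utilities.]

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical shapes only -- not Flink's real classes.
public class NestedPushdownSketch {

    // Step 1 of the proposal: a field reference carrying an index path
    // so that nested access can be expressed.
    static final class FieldRef {
        final String name;
        final int[] fieldIndices; // {1, 1} = nested field 1 inside field 1

        FieldRef(String name, int... fieldIndices) {
            this.name = name;
            this.fieldIndices = fieldIndices;
        }
    }

    // Step 3 of the proposal: an expression-based applyProjection
    // replacing the int[][] variant.
    interface SupportsProjectionPushDown {
        void applyProjection(List<FieldRef> projectedFields);
    }

    // Jark's point: connector code would convert back to int[][] to reuse
    // index-based projection utilities.
    static int[][] toIndexPaths(List<FieldRef> refs) {
        return refs.stream().map(r -> r.fieldIndices).toArray(int[][]::new);
    }

    public static void main(String[] args) {
        List<FieldRef> refs =
                List.of(new FieldRef("id", 0), new FieldRef("user.age", 1, 1));
        System.out.println(Arrays.deepToString(toIndexPaths(refs)));
        // [[0], [1, 1]]
    }
}
```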

Re: FLINK-20767 - Support for nested fields filter push down

2023-08-01 Thread Venkatakrishnan Sowrirajan
Thanks for the response. Looking forward to your pointers. In the
meanwhile, let me figure out how we can implement it. Will keep you posted.

On Mon, Jul 31, 2023, 11:43 PM liu ron  wrote:

> Hi, Venkata
>
> Thanks for reporting this issue. Currently, Flink doesn't support nested
> filter pushdown. I also think that this optimization would be useful,
> especially for jobs, which may need to read a lot of data from the parquet
> or orc file. We didn't move forward with this for some priority reasons.
>
> Regarding your three questions, I will respond to you later after my
> on-call is finished because I need to dive into the source code. About your
> commit, I don't think it's the right solution because
> FieldReferenceExpression doesn't currently support nested field filter
> pushdown, maybe we need to extend it.
>
> You can also look further into reasonable solutions, which we'll discuss
> further later on.
>
> Best,
> Ron
>
>
> Venkatakrishnan Sowrirajan  于2023年7月29日周六 03:31写道:
>
> > Hi all,
> >
> > Currently, I am working on adding support for nested fields filter push
> > down. In our use case running Flink on Batch, we found nested fields
> filter
> > push down is key - without it, it is significantly slow. Note: Spark SQL
> > supports nested fields filter push down.
> >
> > While debugging the code using IcebergTableSource as the table source,
> > narrowed down the issue to missing support for
> > RexNodeExtractor#RexNodeToExpressionConverter#visitFieldAccess.
> > As part of fixing it, I made changes by returning an
> > Option(FieldReferenceExpression)
> > with appropriate reference to the parent index and the child index for
> the
> > nested field with the data type info.
> >
> > But this new ResolvedExpression cannot be converted to RexNode which
> > happens in PushFilterIntoSourceScanRuleBase
> > <
> >
> https://urldefense.com/v3/__https://github.com/apache/flink/blob/3f63e03e83144e9857834f8db1895637d2aa218a/flink-table/flink-table-planner/src/main/java/org/apache/flink/table/planner/plan/rules/logical/PushFilterIntoSourceScanRuleBase.java*L104__;Iw!!IKRxdwAv5BmarQ!fNgxcul8ZGwkNE9ygOeVGlWlU6m_MLMXf4A3S3oQu9LBzYTPF90pZ7uXSGMr-5dFmzRn37-e9Q5cMnVs$
> > >
> > .
> >
> > Few questions
> >
> > 1. Does FieldReferenceExpression support nested fields currently or
> should
> > it be extended to support nested fields? I couldn't figure this out from
> > the PushProjectIntoTableScanRule that supports nested column projection
> > push down.
> > 2. ExpressionConverter
> > <
> >
> https://urldefense.com/v3/__https://github.com/apache/flink/blob/3f63e03e83144e9857834f8db1895637d2aa218a/flink-table/flink-table-planner/src/main/java/org/apache/flink/table/planner/expressions/converter/ExpressionConverter.java*L197__;Iw!!IKRxdwAv5BmarQ!fNgxcul8ZGwkNE9ygOeVGlWlU6m_MLMXf4A3S3oQu9LBzYTPF90pZ7uXSGMr-5dFmzRn37-e9Z6jnkJm$
> > >
> > converts ResolvedExpression -> RexNode but the new
> FieldReferenceExpression
> > with the nested field cannot be converted to RexNode. This is why the
> > answer to the 1st question is key.
> > 3. Anything else that I'm missing here? or is there an even easier way to
> > add support for nested fields filter push down?
> >
> > Partially working changes - Commit
> > <
> >
> https://urldefense.com/v3/__https://github.com/venkata91/flink/commit/00cdf34ecf9be3ba669a97baaed4b69b85cd26f9__;!!IKRxdwAv5BmarQ!fNgxcul8ZGwkNE9ygOeVGlWlU6m_MLMXf4A3S3oQu9LBzYTPF90pZ7uXSGMr-5dFmzRn37-e9XeOjJ_a$
> > >
> > Please
> > feel free to leave a comment directly in the commit.
> >
> > Any pointers here would be much appreciated! Thanks in advance.
> >
> > Disclaimer: Relatively new to Flink code base especially Table planner
> :-).
> >
> > Regards
> > Venkata krishnan
> >
>


FLINK-20767 - Support for nested fields filter push down

2023-07-28 Thread Venkatakrishnan Sowrirajan
Hi all,

Currently, I am working on adding support for nested fields filter push
down. In our use case running Flink on Batch, we found nested fields filter
push down is key - without it, it is significantly slow. Note: Spark SQL
supports nested fields filter push down.

While debugging the code using IcebergTableSource as the table source,
narrowed down the issue to missing support for
RexNodeExtractor#RexNodeToExpressionConverter#visitFieldAccess.
As part of fixing it, I made changes by returning an
Option(FieldReferenceExpression)
with appropriate reference to the parent index and the child index for the
nested field with the data type info.

But this new ResolvedExpression cannot be converted to RexNode which
happens in PushFilterIntoSourceScanRuleBase.

Few questions

1. Does FieldReferenceExpression support nested fields currently or should
it be extended to support nested fields? I couldn't figure this out from
the PushProjectIntoTableScanRule that supports nested column projection
push down.
2. ExpressionConverter converts ResolvedExpression -> RexNode but the new
FieldReferenceExpression
with the nested field cannot be converted to RexNode. This is why the
answer to the 1st question is key.
3. Anything else that I'm missing here? or is there an even easier way to
add support for nested fields filter push down?

Partially working changes - Commit. Please feel free to leave a comment
directly in the commit.

Any pointers here would be much appreciated! Thanks in advance.

Disclaimer: Relatively new to Flink code base especially Table planner :-).

Regards
Venkata krishnan
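[Editor's note: the visitFieldAccess fix described in the message above boils down to walking a chain of field-access nodes down to the input reference and collecting the full index path. The toy sketch below is Flink-free; the node types are invented for the demo and are not Calcite's RexFieldAccess/RexInputRef.]

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;

// Toy model of converting a nested field access into an index path.
public class FieldAccessDemo {
    interface Node {}
    record InputRef(int index) implements Node {}                // top-level field
    record FieldAccess(Node child, int index) implements Node {} // one nested step

    // e.g. access of nested field 2 inside nested field 0 of top-level
    // field 1 collapses to the path {1, 0, 2}.
    static int[] toIndexPath(Node node) {
        Deque<Integer> path = new ArrayDeque<>();
        while (node instanceof FieldAccess fa) {
            path.addFirst(fa.index());
            node = fa.child();
        }
        path.addFirst(((InputRef) node).index());
        return path.stream().mapToInt(Integer::intValue).toArray();
    }

    public static void main(String[] args) {
        Node nested = new FieldAccess(new FieldAccess(new InputRef(1), 0), 2);
        System.out.println(Arrays.toString(toIndexPath(nested))); // [1, 0, 2]
    }
}
```

In the real planner the result would additionally carry the resolved data type, but the path-collection step is the part the thread identifies as missing from RexNodeToExpressionConverter#visitFieldAccess.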


Re: [VOTE] FLIP-312: Add Yarn ACLs to Flink Containers

2023-06-05 Thread Venkatakrishnan Sowrirajan
Thanks for starting the vote on this one, Archit.

+1 (non-binding)

Regards
Venkata krishnan


On Mon, Jun 5, 2023 at 9:55 AM Archit Goyal 
wrote:

> Hi everyone,
>
> Thanks for all the feedback for FLIP-312: Add Yarn ACLs to Flink
> Containers<
> https://urldefense.com/v3/__https://cwiki.apache.org/confluence/display/FLINK/FLIP*312*3A*Add*Yarn*ACLs*to*Flink*Containers__;KyUrKysrKys!!IKRxdwAv5BmarQ!aWkLc7eHAWyHz5kwEq8kKzEAgbtKtlMmi9ifOy_1GNbiO93taxiMcwdHfENc4inLU_cxZIKPDMwBP97Z4oibXUIM$
> >.
> Following is the discussion thread : Link<
> https://urldefense.com/v3/__https://lists.apache.org/thread/xj3ytkwj9lsl3hpjdb4n8pmy7lk3l8tv__;!!IKRxdwAv5BmarQ!aWkLc7eHAWyHz5kwEq8kKzEAgbtKtlMmi9ifOy_1GNbiO93taxiMcwdHfENc4inLU_cxZIKPDMwBP97Z4u3tNMqI$
> >
>
> I'd like to start a vote for it. The vote will be open for at least 72
> hours (until June 9th, 12:00AM GMT) unless there is an objection or an
> insufficient number of votes.
>
> Thanks,
> Archit Goyal
>


Re: [DISCUSS] FLIP-312: Add Yarn ACLs to Flink Containers

2023-05-12 Thread Venkatakrishnan Sowrirajan
Thanks for the FLIP, Archit.

+1 from me as well. This would be very useful for us and others in the
community given the same issue was raised earlier as well.

Regards
Venkata krishnan


On Fri, May 12, 2023 at 4:03 PM Becket Qin  wrote:

> Thanks for the FLIP, Archit.
>
> The motivation sounds reasonable and it looks like a straightforward
> proposal. +1 from me.
>
> Thanks,
>
> Jiangjie (Becket) Qin
>
> On Fri, May 12, 2023 at 1:30 AM Archit Goyal  >
> wrote:
>
> > Hi all,
> >
> > I am opening this thread to discuss the proposal to support Yarn ACLs to
> > Flink containers which has been documented in FLIP-312 <
> >
> https://urldefense.com/v3/__https://cwiki.apache.org/confluence/display/FLINK/FLIP*312*3A*Add*Yarn*ACLs*to*Flink*Containers__;KyUrKysrKys!!IKRxdwAv5BmarQ!bQiA3GX9bFf-w6A9M4Aez7RSMYLdvFtjZnlrOSf6N2nQUFuDdnoJ20uujW8RPY1VbLS9P4AfpnqPmkZZOuQ$
> > >.
> >
> > This FLIP mentions about providing Yarn application ACL mechanism on
> Flink
> > containers to be able to provide specific rights to users other than the
> > one running the Flink application job. This will restrict other users in
> > two ways:
> >
> >   *   view logs through the Resource Manager job history
> >   *   kill the application
> >
> > Please feel free to reply to this email thread and share your opinions.
> >
> > Thanks,
> > Archit Goyal
> >
> >
>