Re: [ANNOUNCE] New Apache Flink Committer - Hangxiang Yu

2023-08-14 Thread Roman Khachatryan
Congratulations, Hangxiang!

Regards,
Roman


On Wed, Aug 9, 2023 at 12:49 PM Benchao Li  wrote:

> Congrats, Hangxiang!
>
> Jing Ge wrote on Tue, Aug 8, 2023 at 17:44:
>
> > Congrats, Hangxiang!
> >
> > Best regards,
> > Jing
> >
> > On Tue, Aug 8, 2023 at 3:04 PM Yangze Guo  wrote:
> >
> > > Congrats, Hangxiang!
> > >
> > > Best,
> > > Yangze Guo
> > >
> > > On Tue, Aug 8, 2023 at 11:28 AM yh z  wrote:
> > > >
> > > > Congratulations, Hangxiang !
> > > >
> > > >
> > > > Best,
> > > > Yunhong Zheng (Swuferhong)
> > > >
> > > > yuxia wrote on Tue, Aug 8, 2023 at 09:20:
> > > >
> > > > > Congratulations, Hangxiang !
> > > > >
> > > > > Best regards,
> > > > > Yuxia
> > > > >
> > > > > ----- Original Message -----
> > > > > From: "Wencong Liu" 
> > > > > To: "dev" 
> > > > > Sent: Monday, Aug 7, 2023, 11:55:24 PM
> > > > > Subject: Re: [ANNOUNCE] New Apache Flink Committer - Hangxiang Yu
> > > > >
> > > > > Congratulations, Hangxiang !
> > > > >
> > > > >
> > > > > Best,
> > > > > Wencong
> > > > >
> > > > >
> > > > > At 2023-08-07 14:57:49, "Yuan Mei"  wrote:
> > > > > >On behalf of the PMC, I'm happy to announce Hangxiang Yu as a new
> > > Flink
> > > > > >Committer.
> > > > > >
> > > > > >Hangxiang has been active in the Flink community for more than 1.5
> > > > > >years and has played an important role in developing and maintaining
> > > > > >State- and Checkpoint-related features/components, including Generic
> > > > > >Incremental Checkpoints (taking great effort to make the feature
> > > > > >prod-ready). Hangxiang is also the main driver of FLIP-263:
> > > > > >Resolving schema compatibility.
> > > > > >
> > > > > >Hangxiang is passionate about the Flink community. Besides the
> > > > > >technical contributions above, he is also actively promoting Flink,
> > > > > >giving talks about Generic Incremental Checkpoints at Flink Forward
> > > > > >and meetups. Hangxiang has also spent a good amount of time
> > > > > >supporting users, participating in Jira/mailing list discussions,
> > > > > >and reviewing code.
> > > > > >
> > > > > >Please join me in congratulating Hangxiang for becoming a Flink
> > > Committer!
> > > > > >
> > > > > >Thanks,
> > > > > >Yuan Mei (on behalf of the Flink PMC)
> > > > >
> > >
> >
>
>
> --
>
> Best,
> Benchao Li
>


Re: [ANNOUNCE] New Apache Flink Committer - Yanfei Lei

2023-08-14 Thread Roman Khachatryan
Congratulations, Yanfei!

Regards,
Roman


On Wed, Aug 9, 2023 at 12:49 PM Benchao Li  wrote:

> Congrats, YanFei!
>
> Jing Ge wrote on Tue, Aug 8, 2023 at 17:41:
>
> > Congrats, YanFei!
> >
> > Best regards,
> > Jing
> >
> > On Tue, Aug 8, 2023 at 3:04 PM Yangze Guo  wrote:
> >
> > > Congrats, Yanfei!
> > >
> > > Best,
> > > Yangze Guo
> > >
> > > On Tue, Aug 8, 2023 at 9:20 AM yuxia 
> > wrote:
> > > >
> > > > Congratulations, Yanfei!
> > > >
> > > > Best regards,
> > > > Yuxia
> > > >
> > > > ----- Original Message -----
> > > > From: "ron9 liu" 
> > > > To: "dev" 
> > > > Sent: Monday, Aug 7, 2023, 11:44:23 PM
> > > > Subject: Re: [ANNOUNCE] New Apache Flink Committer - Yanfei Lei
> > > >
> > > > Congratulations Yanfei!
> > > >
> > > > Best,
> > > > Ron
> > > >
> > > > Zakelly Lan wrote on Mon, Aug 7, 2023 at 23:15:
> > > >
> > > > > Congratulations, Yanfei!
> > > > >
> > > > > Best regards,
> > > > > Zakelly
> > > > >
> > > > > On Mon, Aug 7, 2023 at 9:04 PM Lincoln Lee  >
> > > wrote:
> > > > > >
> > > > > > Congratulations, Yanfei!
> > > > > >
> > > > > > Best,
> > > > > > Lincoln Lee
> > > > > >
> > > > > >
> > > > > > Weihua Hu wrote on Mon, Aug 7, 2023 at 20:43:
> > > > > >
> > > > > > > Congratulations Yanfei!
> > > > > > >
> > > > > > > Best,
> > > > > > > Weihua
> > > > > > >
> > > > > > >
> > > > > > > On Mon, Aug 7, 2023 at 8:08 PM Feifan Wang  >
> > > wrote:
> > > > > > >
> > > > > > > > Congratulations Yanfei! :)
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > ——
> > > > > > > > Name: Feifan Wang
> > > > > > > > Email: zoltar9...@163.com
> > > > > > > >
> > > > > > > >
> > > > > > > >  Replied Message 
> > > > > > > > | From | Matt Wang |
> > > > > > > > | Date | 08/7/2023 19:40 |
> > > > > > > > | To | dev@flink.apache.org |
> > > > > > > > | Subject | Re: [ANNOUNCE] New Apache Flink Committer -
> Yanfei
> > > Lei |
> > > > > > > > Congratulations Yanfei!
> > > > > > > >
> > > > > > > >
> > > > > > > > --
> > > > > > > >
> > > > > > > > Best,
> > > > > > > > Matt Wang
> > > > > > > >
> > > > > > > >
> > > > > > > >  Replied Message 
> > > > > > > > | From | Mang Zhang |
> > > > > > > > | Date | 08/7/2023 18:56 |
> > > > > > > > | To |  |
> > > > > > > > | Subject | Re:Re: [ANNOUNCE] New Apache Flink Committer -
> > Yanfei
> > > > > Lei |
> > > > > > > > Congratulations--
> > > > > > > >
> > > > > > > > Best regards,
> > > > > > > > Mang Zhang
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > At 2023-08-07 18:17:58, "Yuxin Tan" wrote:
> > > > > > > > Congrats, Yanfei!
> > > > > > > >
> > > > > > > > Best,
> > > > > > > > Yuxin
> > > > > > > >
> > > > > > > >
> > > > > > > > weijie guo wrote on Mon, Aug 7, 2023 at 17:59:
> > > > > > > >
> > > > > > > > Congrats, Yanfei!
> > > > > > > >
> > > > > > > > Best regards,
> > > > > > > >
> > > > > > > > Weijie
> > > > > > > >
> > > > > > > >
> > > > > > > > Biao Geng wrote on Mon, Aug 7, 2023 at 17:03:
> > > > > > > >
> > > > > > > > Congrats, Yanfei!
> > > > > > > > Best,
> > > > > > > > Biao Geng
> > > > > > > >
> > > > > > > > Sent from Outlook for iOS
> > > > > > > > 
> > > > > > > > From: Qingsheng Ren 
> > > > > > > > Sent: Monday, August 7, 2023 4:23:52 PM
> > > > > > > > To: dev@flink.apache.org 
> > > > > > > > Subject: Re: [ANNOUNCE] New Apache Flink Committer - Yanfei Lei
> > > > > > > >
> > > > > > > > Congratulations and welcome, Yanfei!
> > > > > > > >
> > > > > > > > Best,
> > > > > > > > Qingsheng
> > > > > > > >
> > > > > > > > On Mon, Aug 7, 2023 at 4:19 PM Matthias Pohl <
> > > matthias.p...@aiven.io
> > > > > > > > .invalid>
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > Congratulations, Yanfei! :)
> > > > > > > >
> > > > > > > > On Mon, Aug 7, 2023 at 10:00 AM Junrui Lee <
> > jrlee@gmail.com>
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > Congratulations Yanfei!
> > > > > > > >
> > > > > > > > Best,
> > > > > > > > Junrui
> > > > > > > >
> > > > > > > > Yun Tang wrote on Mon, Aug 7, 2023 at 15:19:
> > > > > > > >
> > > > > > > > Congratulations, Yanfei!
> > > > > > > >
> > > > > > > > Best
> > > > > > > > Yun Tang
> > > > > > > > 
> > > > > > > > From: Danny Cranmer 
> > > > > > > > Sent: Monday, August 7, 2023 15:10
> > > > > > > > To: dev 
> > > > > > > > Subject: Re: [ANNOUNCE] New Apache Flink Committer - Yanfei
> Lei
> > > > > > > >
> > > > > > > > Congrats Yanfei! Welcome to the team.
> > > > > > > >
> > > > > > > > Danny
> > > > > > > >
> > > > > > > > On Mon, 7 Aug 2023, 08:03 Rui Fan, <1996fan...@gmail.com>
> > wrote:
> > > > > > > >
> > > > > > > > Congratulations Yanfei!
> > > > > > > >
> > > > > > > > Best,
> > > > > > > > Rui
> > > > > > > >
> > > > > > > > On Mon, Aug 7, 2023 at 2:56 PM Yuan Mei <
> > yuanmei.w...@gmail.com>
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > On behalf of the PMC, I'm happy to announce Yanfei Lei as a
> new
> > > > > > > > Flink
> > > > > > > > Committer.
> > > > > > > >
> > > > > > > > Yan

Re: [DISCUSS] FLIP-384: Introduce TraceReporter and use it to create checkpointing and recovery traces

2023-11-16 Thread Roman Khachatryan
Thanks for the proposal,

Starting with the minimal functionality and expanding if necessary as the
FLIP describes makes a lot of sense to me.

Regards,
Roman

On Wed, Nov 15, 2023, 9:31 PM Jing Ge  wrote:

> Hi Piotr,
>
> Sorry for the late reply and thanks for the proposal, it looks awesome!
>
> In the discussion, you pointed out that it is difficult to build true
> distributed traces. afaiu from FLIP-384 and FLIP-385, the
> upcoming OpenTelemetry based TraceReporter will use the same Span
> implementation and will not support trace_id and span_id. Does it make
> sense to at least add the span_id into the current Span design? The default
> implementation could follow your suggestion: jobId#attemptId#checkpointId.
>
> Some other NIT questions:
> 1. The sample code shows that the scope of Span will be the CanonicalName
> of a class. If there are other cases that could be used as the scope too, a
> javadoc about Span scope would be helpful. If the CanonicalName of a class
> is the best practice, removing the scope from the builder constructor and
> adding setScope(Class) might ease the API usage. The Span.getScope() can
> still return String.
> 2. The sample code in the FLIP is not consistent. The first example used
> Span.builder(..) and the second example used new Span() with setters.
> 3. I guess the constructor of SpanBuilder class is a typo.
>
> Really a nice idea to introduce the trace report! Thanks again!
>
> Best regards,
> Jing
>
> On Tue, Nov 14, 2023 at 3:16 PM Piotr Nowojski 
> wrote:
>
> > Hi All,
> >
> > Thanks for the answers!
> >
> > Unless there are some objections or suggestions, I will open a voting
> > thread later this
> > week.
> >
> > > My original thought was to show how much time a sampled record is
> > processed
> > > within each operator in stream processing. By saying 'sampled', I mean
> we
> > > won't generate a trace for every record due to the high cost involved.
> > > Instead, we could only trace ONE record from source when the user
> > requests
> > > it (via REST API or Web UI) or when triggered periodically at a very
> low
> > > frequency.
> >
> > That would be useful, but another issue is that we cannot measure time
> > reliably at the granularity of a single record. The time to process a
> > single record through the whole operator chain is usually shorter than
> > the syscalls needed to measure it.
> >
> > So I think we are stuck with sample based profilers, like Flame Graphs
> > generated by
> > the Flink WebUI.
> >
> > Best, Piotrek
> >
> > czw., 9 lis 2023 o 05:32 Rui Fan <1996fan...@gmail.com> napisał(a):
> >
> > > Hi Piotr:
> > >
> > > Thanks for your reply!
> > >
> > > > About structured logging (basically events?) I vaguely remember some
> > > > discussions about that. It might be a much larger topic, so I would
> > > prefer
> > > > to leave it out of the scope of this FLIP.
> > >
> > > Makes sense to me!
> > >
> > > > I think those could be indeed useful. If you would like to contribute
> > to
> > > them
> > > > in the future, I would be happy to review the FLIP for it :)
> > >
> > > Thank you, after this FLIP, I or my colleagues can pick it up!
> > >
> > > Best,
> > > Rui
> > >
> > > On Thu, Nov 9, 2023 at 11:39 AM Zakelly Lan 
> > wrote:
> > >
> > > > Hi Piotr,
> > > >
> > > > Thanks for your detailed explanation! I could see the challenge of
> > > > implementing traces with multiple spans and agree to put it in the
> > future
> > > > work. I personally prefer the idea of generating multi span traces
> for
> > > > checkpoints on the JM only.
> > > >
> > > > > I'm not sure if I understand the proposal - I don't know how traces
> > > could
> > > > > be used for this purpose?
> > > > > Traces are perfect for one of events (like checkpointing, recovery,
> > > etc),
> > > > > not for continuous monitoring
> > > > > (like processing records). That's what metrics are. Creating trace
> > and
> > > > > span(s) per each record would
> > > > > be prohibitively expensive.
> > > >
> > > > My original thought was to show how much time a sampled record is
> > > processed
> > > > within each operator in stream processing. By saying 'sampled', I
> mean
> > we
> > > > won't generate a trace for every record due to the high cost
> involved.
> > > > Instead, we could only trace ONE record from source when the user
> > > requests
> > > > it (via REST API or Web UI) or when triggered periodically at a very
> > low
> > > > frequency. However after re-thinking my idea, I realized it's hard to
> > > > define the whole lifecycle of a record since it is transformed into
> > > > different forms among operators. We could discuss this in future
> after
> > > the
> > > > basic trace is implemented in Flink.
> > > >
> > > > > Unless you mean in batch/bounded jobs? Then yes, we could create a
> > > > bounded
> > > > > job trace, with spans
> > > > > for every stage/task/subtask.
> > > >
> > > > Oh yes, batch jobs could definitely leverage the trace.
> > > >
> > > > Best,
> > > > Zakelly
> 
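For readers following the Span API discussion in this thread, here is a small self-contained sketch of what the builder-style API under discussion could look like. The shapes below (a `Span.builder(...)` factory, a scope derived from a class's canonical name, a default span id of `jobId#attemptId#checkpointId`) mirror suggestions made in this thread, but they are illustrative assumptions, not the final Flink API.

```java
import java.util.HashMap;
import java.util.Map;

public class SpanSketch {
    /** Minimal immutable span; the scope is assumed to be a class canonical name. */
    static final class Span {
        final String scope;
        final String name;
        final long startTsMillis;
        final long endTsMillis;
        final Map<String, Object> attributes;

        Span(String scope, String name, long start, long end, Map<String, Object> attrs) {
            this.scope = scope;
            this.name = name;
            this.startTsMillis = start;
            this.endTsMillis = end;
            this.attributes = attrs;
        }

        /** Builder factory taking a Class, as suggested for easing API usage. */
        static SpanBuilder builder(Class<?> scope, String name) {
            return new SpanBuilder(scope, name);
        }
    }

    static final class SpanBuilder {
        private final String scope;
        private final String name;
        private long start;
        private long end;
        private final Map<String, Object> attributes = new HashMap<>();

        SpanBuilder(Class<?> scope, String name) {
            this.scope = scope.getCanonicalName();
            this.name = name;
        }

        SpanBuilder setStartTsMillis(long ts) { this.start = ts; return this; }
        SpanBuilder setEndTsMillis(long ts) { this.end = ts; return this; }
        SpanBuilder setAttribute(String key, Object value) { attributes.put(key, value); return this; }
        Span build() { return new Span(scope, name, start, end, attributes); }
    }

    /** Default span id as suggested in the thread: jobId#attemptId#checkpointId. */
    static String defaultSpanId(String jobId, int attemptId, long checkpointId) {
        return jobId + "#" + attemptId + "#" + checkpointId;
    }

    public static void main(String[] args) {
        Span span = Span.builder(SpanSketch.class, "Checkpoint")
                .setStartTsMillis(1_000L)
                .setEndTsMillis(1_250L)
                .setAttribute("checkpointId", 42L)
                .build();
        System.out.println(span.scope + " " + span.name + " took "
                + (span.endTsMillis - span.startTsMillis) + " ms, spanId="
                + defaultSpanId("job-1", 0, 42L));
    }
}
```

A shape like this would also address the inconsistency Jing pointed out between `Span.builder(..)` and `new Span()` with setters, by making the builder the single way to construct a span.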

Re: [DISCUSS] FLIP-385: Add OpenTelemetryTraceReporter and OpenTelemetryMetricReporter

2023-11-16 Thread Roman Khachatryan
Thanks Piotr, the proposal totally makes sense to me.

Does it depend on FLIP-384 for voting?
Otherwise, we could probably start the vote already, as there are no
counter-proposals or objections.

Regards,
Roman

On Tue, Nov 7, 2023, 1:19 PM Piotr Nowojski 
wrote:

> Hey, sorry for the misclick. Fixed.
>
> wt., 7 lis 2023 o 14:00 Timo Walther  napisał(a):
>
> > Thanks for the FLIP, Piotr.
> >
> > In order to follow the FLIP process[1], please prefix the email subject
> > with "[DISCUSS]".
> >
> > Also, some people might have added filters to their email clients to
> > highlight those discussions.
> >
> > Thanks,
> > Timo
> >
> > [1]
> >
> >
> https://cwiki.apache.org/confluence/display/Flink/Flink+Improvement+Proposals#FlinkImprovementProposals-Process
> >
> > On 07.11.23 09:35, Piotr Nowojski wrote:
> > > Hi all!
> > >
> > > I would like to start a discussion on a follow up of FLIP-384:
> Introduce
> > > TraceReporter and use it to create checkpointing and recovery traces
> [1]:
> > >
> > > *FLIP-385: Add OpenTelemetryTraceReporter and
> > OpenTelemetryMetricReporter[
> > > 2]*
> > >
> > > This FLIP proposes to add both MetricReporter and TraceReporter
> > integrating
> > > Flink with OpenTelemetry [4].
> > >
> > > There is also another follow up FLIP-386 [3], which improves recovery
> > > traces.
> > >
> > > Please let me know what you think!
> > >
> > > Best,
> > > Piotr Nowojski
> > >
> > > [1]
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-384%3A+Introduce+TraceReporter+and+use+it+to+create+checkpointing+and+recovery+traces
> > > [2]
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-385%3A+Add+OpenTelemetryTraceReporter+and+OpenTelemetryMetricReporter
> > > [3]
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-386%3A+Support+adding+custom+metrics+in+Recovery+Spans
> > > [4] https://opentelemetry.io/
> > >
> >
> >
>


Re: [DISCUSS] FLIP-386: Support adding custom metrics in Recovery Spans

2023-11-16 Thread Roman Khachatryan
Thanks for the proposal,

Can you please explain:
1. Why can't the existing MetricGroup interface be used? It already has
methods to add metrics and spans ...

2. IIUC, based on these numbers, we're going to report span(s). Shouldn't
the backend report them as spans?

3. How is the implementation supposed to infer that some metric is a part
of initialization (and make the corresponding RPC to JM?). Should the
interfaces be more explicit about that?

4. What do you think about using histogram or percentiles instead of
min/max/sum/avg? That would be more informative

I like the idea of introducing parameter objects for backend creation.

Regards,
Roman

On Tue, Nov 7, 2023, 1:20 PM Piotr Nowojski  wrote:

> (Fixing topic)
>
> wt., 7 lis 2023 o 09:40 Piotr Nowojski  napisał(a):
>
> > Hi all!
> >
> > I would like to start a discussion on a follow up of FLIP-384: Introduce
> > TraceReporter and use it to create checkpointing and recovery traces [1]:
> >
> > *FLIP-386: Support adding custom metrics in Recovery Spans [2]*
> >
> > This FLIP adds a functionality that will allow state backends to attach
> > custom metrics to the recovery/initialization traces. This requires
> changes
> > to the `@PublicEvolving` `StateBackend` API, and it will be initially
> used
> > in `RocksDBIncrementalRestoreOperation` to measure how long does it take
> to
> > download remote files and separately how long does it take to load those
> > files into the local RocksDB instance.
> >
> > Please let me know what you think!
> >
> > Best,
> > Piotr Nowojski
> >
> > [1]
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-384%3A+Introduce+TraceReporter+and+use+it+to+create+checkpointing+and+recovery+traces
> > [2]
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-386%3A+Support+adding+custom+metrics+in+Recovery+Spans
> >
> >
>
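To illustrate question 4 above (histograms/percentiles vs. min/max/sum/avg) with concrete numbers: under a skewed distribution, such as a few slow remote-file downloads during recovery, the average hides the tail that a percentile exposes. This is a plain-Java sketch using the nearest-rank percentile definition; no Flink APIs are involved.

```java
import java.util.Arrays;

public class TailLatency {
    /** Nearest-rank percentile over an already sorted array of millisecond samples. */
    static double percentile(long[] sortedMillis, double p) {
        int idx = (int) Math.ceil(p / 100.0 * sortedMillis.length) - 1;
        return sortedMillis[Math.max(0, idx)];
    }

    public static void main(String[] args) {
        // 95 fast downloads (50 ms) and 5 slow stragglers (5000 ms).
        long[] samples = new long[100];
        Arrays.fill(samples, 50L);
        for (int i = 95; i < 100; i++) {
            samples[i] = 5_000L;
        }
        Arrays.sort(samples);

        double avg = Arrays.stream(samples).average().orElse(0);
        System.out.printf("min=%d max=%d avg=%.1f p50=%.0f p99=%.0f%n",
                samples[0], samples[99], avg,
                percentile(samples, 50), percentile(samples, 99));
        // avg (297.5 ms) looks moderate; p99 (5000 ms) exposes the stragglers.
    }
}
```

The median and minimum look identical to the happy path, and the average blends the tail away; only a high percentile (or a full histogram) makes the stragglers visible.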


Re: [ANNOUNCE] New Apache Flink Committer - Anton Kalashnikov

2023-02-21 Thread Roman Khachatryan
Congratulations Anton, well deserved!

Regards,
Roman


On Tue, Feb 21, 2023 at 9:34 AM Martijn Visser 
wrote:

> Congratulations Anton!
>
> On Tue, Feb 21, 2023 at 8:08 AM Lincoln Lee 
> wrote:
>
> > Congratulations, Anton!
> >
> > Best,
> > Lincoln Lee
> >
> >
> > > Guowei Ma wrote on Tue, Feb 21, 2023 at 15:05:
> >
> > > Congratulations, Anton!
> > >
> > > Best,
> > > Guowei
> > >
> > >
> > > On Tue, Feb 21, 2023 at 1:52 PM Shammon FY  wrote:
> > >
> > > > Congratulations, Anton!
> > > >
> > > > Best,
> > > > Shammon
> > > >
> > > > On Tue, Feb 21, 2023 at 1:41 PM Sergey Nuyanzin  >
> > > > wrote:
> > > >
> > > > > Congratulations, Anton!
> > > > >
> > > > > On Tue, Feb 21, 2023 at 4:53 AM Weihua Hu 
> > > > wrote:
> > > > >
> > > > > > Congratulations, Anton!
> > > > > >
> > > > > > Best,
> > > > > > Weihua
> > > > > >
> > > > > >
> > > > > > On Tue, Feb 21, 2023 at 11:22 AM weijie guo <
> > > guoweijieres...@gmail.com
> > > > >
> > > > > > wrote:
> > > > > >
> > > > > > > Congratulations, Anton!
> > > > > > >
> > > > > > > Best regards,
> > > > > > >
> > > > > > > Weijie
> > > > > > >
> > > > > > >
> > > > > > > Leonard Xu wrote on Tue, Feb 21, 2023 at 11:02:
> > > > > > >
> > > > > > > > Congratulations, Anton!
> > > > > > > >
> > > > > > > > Best,
> > > > > > > > Leonard
> > > > > > > >
> > > > > > > >
> > > > > > > > > On Feb 21, 2023, at 10:02 AM, Rui Fan 
> > > wrote:
> > > > > > > > >
> > > > > > > > > Congratulations, Anton!
> > > > > > > > >
> > > > > > > > > Best,
> > > > > > > > > Rui Fan
> > > > > > > > >
> > > > > > > > > On Tue, Feb 21, 2023 at 9:23 AM yuxia <
> > > > luoyu...@alumni.sjtu.edu.cn
> > > > > >
> > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > >> Congrats Anton!
> > > > > > > > >>
> > > > > > > > >> Best regards,
> > > > > > > > >> Yuxia
> > > > > > > > >>
> > > > > > > > >> ----- Original Message -----
> > > > > > > > >> From: "Matthias Pohl" 
> > > > > > > > >> To: "dev" 
> > > > > > > > >> Sent: Tuesday, Feb 21, 2023, 12:52:40 AM
> > > > > > > > >> Subject: Re: [ANNOUNCE] New Apache Flink Committer - Anton
> > > > Kalashnikov
> > > > > > > > >>
> > > > > > > > >> Congratulations, Anton! :-)
> > > > > > > > >>
> > > > > > > > >> On Mon, Feb 20, 2023 at 5:09 PM Jing Ge
> > > > >  > > > > > >
> > > > > > > > >> wrote:
> > > > > > > > >>
> > > > > > > > >>> Congrats Anton!
> > > > > > > > >>>
> > > > > > > > >>> On Mon, Feb 20, 2023 at 5:02 PM Samrat Deb <
> > > > > decordea...@gmail.com>
> > > > > > > > >> wrote:
> > > > > > > > >>>
> > > > > > > >  congratulations Anton!
> > > > > > > > 
> > > > > > > >  Bests,
> > > > > > > >  Samrat
> > > > > > > > 
> > > > > > > >  On Mon, 20 Feb 2023 at 9:29 PM, John Roesler <
> > > > > vvcep...@apache.org
> > > > > > >
> > > > > > > > >>> wrote:
> > > > > > > > 
> > > > > > > > > Congratulations, Anton!
> > > > > > > > > -John
> > > > > > > > >
> > > > > > > > > On Mon, Feb 20, 2023, at 08:18, Piotr Nowojski wrote:
> > > > > > > > >> Hi, everyone
> > > > > > > > >>
> > > > > > > > >> On behalf of the PMC, I'm very happy to announce Anton
> > > > > > Kalashnikov
> > > > > > > > >>> as a
> > > > > > > > > new
> > > > > > > > >> Flink Committer.
> > > > > > > > >>
> > > > > > > > >> Anton has been very active for almost two years
> already,
> > > > > > authored
> > > > > > > > >> and
> > > > > > > > >> reviewed many PRs over this time. He is active in the
> > > > Flink's
> > > > > > > > >>> runtime,
> > > > > > > > >> being the main author of improvements like Buffer
> > > Debloating
> > > > > > > > >>> (FLIP-183)
> > > > > > > > >> [1], solved many bugs and fixed many test
> instabilities,
> > > > > > generally
> > > > > > > > > speaking
> > > > > > > > >> helping with the maintenance of runtime components.
> > > > > > > > >>
> > > > > > > > >> Please join me in congratulating Anton Kalashnikov for
> > > > > becoming
> > > > > > a
> > > > > > > > >>> Flink
> > > > > > > > >> Committer!
> > > > > > > > >>
> > > > > > > > >> Best,
> > > > > > > > >> Piotr Nowojski (on behalf of the Flink PMC)
> > > > > > > > >>
> > > > > > > > >> [1]
> > > > > > > > >>
> > > > > > > > >
> > > > > > > > 
> > > > > > > > >>>
> > > > > > > > >>
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-183%3A+Dynamic+buffer+size+adjustment
> > > > > > > > >
> > > > > > > > 
> > > > > > > > >>>
> > > > > > > > >>
> > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Best regards,
> > > > > Sergey
> > > > >
> > > >
> > >
> >
>


Re: [ANNOUNCE] New Apache Flink Committer - Rui Fan

2023-02-21 Thread Roman Khachatryan
Congratulations Rui!

Regards,
Roman


On Mon, Feb 20, 2023 at 5:58 PM Anton Kalashnikov 
wrote:

> Congrats Rui!
>
> --
> Best regards,
> Anton Kalashnikov
>
> On 20.02.23 17:53, Matthias Pohl wrote:
> > Congratulations, Rui :)
> >
> > On Mon, Feb 20, 2023 at 5:10 PM Jing Ge 
> wrote:
> >
> >> Congrats Rui!
> >>
> >> On Mon, Feb 20, 2023 at 3:19 PM Piotr Nowojski 
> >> wrote:
> >>
> >>> Hi, everyone
> >>>
> >>> On behalf of the PMC, I'm very happy to announce Rui Fan as a new Flink
> >>> Committer.
> >>>
> >>> Rui Fan has been active on a small scale since August 2019, and ramped
> up
> >>> his contributions in the 2nd half of 2021. He was mostly involved in
> >> quite
> >>> demanding performance related work around the network stack and
> >>> checkpointing, like re-using TCP connections [1], and many crucial
> >>> improvements to the unaligned checkpoints. Among others: FLIP-227:
> >> Support
> >>> overdraft buffer [2], Merge small ChannelState file for Unaligned
> >>> Checkpoint [3], Timeout aligned to unaligned checkpoint barrier in the
> >>> output buffers [4].
> >>>
> >>> Please join me in congratulating Rui Fan for becoming a Flink
> Committer!
> >>>
> >>> Best,
> >>> Piotr Nowojski (on behalf of the Flink PMC)
> >>>
> >>> [1] https://issues.apache.org/jira/browse/FLINK-22643
> >>> [2]
> >>>
> >>>
> >>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-227%3A+Support+overdraft+buffer
> >>> [3] https://issues.apache.org/jira/browse/FLINK-26803
> >>> [4] https://issues.apache.org/jira/browse/FLINK-27251
> >>>
>


Re: [DISCUSS] FLIP-291: Externalized Declarative Resource Management

2023-02-28 Thread Roman Khachatryan
Hi,

Thanks for the update, I think distinguishing the rescaling behaviour and
the desired parallelism declaration is important.

Having the ability to specify min parallelism might be useful in
environments with multiple jobs: Scheduler will then have an option to stop
the less suitable job.
In other setups, where the job should not be stopped at all, the user can
always set it to 0.

Regards,
Roman


On Tue, Feb 28, 2023 at 12:58 PM Maximilian Michels  wrote:

> Hi David,
>
> Thanks for the update! We consider using the new declarative resource
> API for autoscaling. Currently, we treat a scaling decision as a new
> deployment which means surrendering all resources to Kubernetes and
> subsequently reallocating them for the rescaled deployment. The
> declarative resource management API is a great step forward because it
> allows us to do faster and safer rescaling. Faster, because we can
> continue to run while resources are pre-allocated which minimizes
> downtime. Safer, because we can't get stuck when the desired resources
> are not available.
>
> An example with two vertices and their respective parallelisms:
>   v1: 50
>   v2: 10
> Let's assume slot sharing is disabled, so we need 60 task slots to run
> the vertices.
>
> If the autoscaler was to decide to scale up v1 and v2, it could do so
> in a safe way by using min/max configuration:
>   v1: [min: 50, max: 70]
>   v2: [min: 10, max: 20]
> This would then need 90 task slots to run at max capacity.
>
> I suppose we could further remove the min because it would always be
> safer to scale down if resources are not available than to not run at
> all [1]. In fact, I saw that the minimum bound is currently not used
> in the code you posted above [2]. Is that still planned?
>
> -Max
>
> PS: Note that originally we had assumed min == max but I think that
> would be a less safe scaling approach because we would get stuck
> waiting for resources when they are not available, e.g. k8s resource
> limits reached.
>
> [1] However, there might be costs involved with executing the
> rescaling, e.g. for using external storage like s3, especially without
> local recovery.
> [2]
> https://github.com/dmvk/flink/commit/5e7edcb77d8522c367bc6977f80173b14dc03ce9
>
> On Tue, Feb 28, 2023 at 9:33 AM David Morávek  wrote:
> >
> > Hi Everyone,
> >
> > We had some more talks about the pre-allocation of resources with @Max,
> and
> > here is the final state that we've converged to for now:
> >
> > The vital thing to note about the new API is that it's declarative,
> meaning
> > we're declaring the desired state to which we want our job to converge;
> If,
> > after the requirements update job no longer holds the desired resources
> > (fewer resources than the lower bound), it will be canceled and
> transition
> > back into the waiting for resources state.
> >
> > In some use cases, you might always want to rescale to the upper bound
> > (this goes along the lines of "preallocating resources" and minimizing
> the
> > number of rescales, which is especially useful with the large state).
> This
> > can be controlled by two knobs that already exist:
> >
> > 1) "jobmanager.adaptive-scheduler.min-parallelism-increase" - this
> affects
> > a minimal parallelism increase step of a running job; we'll slightly
> change
> > the semantics, and we'll trigger rescaling either once this condition is
> > met or when you hit the ceiling; setting this to the high number will
> > ensure that you always rescale to the upper bound
> >
> > 2) "jobmanager.adaptive-scheduler.resource-stabilization-timeout" - for
> new
> > and already restarting jobs, we'll always respect this timeout, which
> > allows you to wait for more resources even though you already have more
> > resources than defined in the lower bound; again, in the case we reach
> the
> > ceiling (the upper bound), we'll transition into the executing state.
> >
> >
> > We're still planning to dig deeper in this direction with other efforts,
> > but this is already good enough and should allow us to move the FLIP
> > forward.
> >
> > WDYT? Unless there are any objectives against the above, I'd like to
> > proceed to a vote.
> >
> > Best,
> > D.
> >
> > On Thu, Feb 23, 2023 at 5:39 PM David Morávek  wrote:
> >
> > > Hi Everyone,
> > >
> > > @John
> > >
> > > This is a problem that we've spent some time trying to crack; in the
> end,
> > > we've decided to go against doing any upgrades to JobGraphStore from
> > > JobMaster to avoid having multiple writers that are guarded by
> different
> > > leader election lock (Dispatcher and JobMaster might live in a
> different
> > > process). The contract we've decided to choose instead is leveraging
> the
> > > idempotency of the endpoint and having the user of the API retry in
> case
> > > we're unable to persist new requirements in the JobGraphStore [1]. We
> > > eventually need to move JobGraphStore out of the dispatcher, but
> that's way
> > > out of the scope of this FLIP. The solution is a deliberate

Re: [VOTE] FLIP-291: Externalized Declarative Resource Management

2023-03-01 Thread Roman Khachatryan
+1 (binding)

Thanks David, and everyone involved :)

Regards,
Roman


On Wed, Mar 1, 2023 at 8:01 AM Gyula Fóra  wrote:

> +1 (binding)
>
> Looking forward to this :)
>
> Gyula
>
> On Wed, 1 Mar 2023 at 04:02, feng xiangyu  wrote:
>
> > +1  (non-binding)
> >
> > > ConradJam wrote on Wed, Mar 1, 2023 at 10:37:
> >
> > > +1  (non-binding)
> > >
> > > > Zhanghao Chen wrote on Wed, Mar 1, 2023 at 10:18:
> > >
> > > > Thanks for driving this. +1 (non-binding)
> > > >
> > > > Best,
> > > > Zhanghao Chen
> > > > 
> > > > From: David Morávek 
> > > > Sent: Tuesday, February 28, 2023 21:46
> > > > To: dev 
> > > > Subject: [VOTE] FLIP-291: Externalized Declarative Resource
> Management
> > > >
> > > > Hi Everyone,
> > > >
> > > > I want to start the vote on FLIP-291: Externalized Declarative
> Resource
> > > > Management [1]. The FLIP was discussed in this thread [2].
> > > >
> > > > The goal of the FLIP is to enable external declaration of the
> resource
> > > > requirements of a running job.
> > > >
> > > > The vote will last for at least 72 hours (Friday, 3rd of March, 15:00
> > > CET)
> > > > unless
> > > > there is an objection or insufficient votes.
> > > >
> > > > [1] https://lists.apache.org/thread/b8fnj127jsl5ljg6p4w3c4wvq30cnybh
> > > > [2]
> > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-291%3A+Externalized+Declarative+Resource+Management
> > > >
> > > > Best,
> > > > D.
> > > >
> > >
> > >
> > > --
> > > Best
> > >
> > > ConradJam
> > >
> >
>


Re: [VOTE] FLIP-304: Pluggable Failure Enrichers

2023-04-20 Thread Roman Khachatryan
+1 (binding)

The FLIP LGTM, thanks Panos!

Regards,
Roman


On Thu, Apr 20, 2023 at 1:33 PM Hong Teoh  wrote:

> +1 (non-binding)
>
> Thank you for driving this effort, Panagiotis.
>
> Regards,
> Hong
>
>
> > On 20 Apr 2023, at 12:16, David Morávek  wrote:
> >
> > Thanks for the update!
> >
> > +1 (binding)
> >
> > Best,
> > D.
> >
> > On Thu, Apr 20, 2023 at 9:50 AM Piotr Nowojski 
> wrote:
> >
> >> Hi,
> >>
> >> I see that the FLIP has been updated, thanks Panos!
> >>
> >> +1 (binding)
> >>
> >> Best,
> >> Piotrek
> >>
> >> śr., 19 kwi 2023 o 13:49 Piotr Nowojski 
> >> napisał(a):
> >>
> >>> +1 to what David wrote. I think we need to update the FLIP and extend
> the
> >>> voting?
> >>>
> >>> Piotrek
> >>>
> >>> śr., 19 kwi 2023 o 09:06 David Morávek  napisał(a):
> >>>
>  Hi Panos,
> 
>  It seems that most recent discussions (e.g. changing the semantics of
> >> the
>  config option) are not reflected in the FLIP. Can you please
> >> double-check
>  that this is the correct version?
> 
>  Best,
>  D.
> 
> 
>  On Mon, Apr 17, 2023 at 9:24 AM Panagiotis Garefalakis <
> >> pga...@apache.org
> >
>  wrote:
> 
> > Hello everyone,
> >
> > I want to start the vote for FLIP-304: Pluggable Failure Enrichers
> [1]
>  --
> > discussed as part of [2].
> >
> > FLIP-304 introduces a pluggable interface allowing users to add
> custom
> > logic and enrich failures with custom metadata labels.
> >
> > The vote will last for at least 72 hours (Thursday, 20th of April
> >> 2023,
> > 12:30 PST) unless there is an objection or insufficient votes.
> >
> > [1]
> >
> >
> 
> >>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-304%3A+Pluggable+Failure+Enrichers
> > [2] https://lists.apache.org/thread/zs9n9p8d7tyvnq4yyxhc8zvq1k2c1hvs
> >
> >
> > Cheers,
> > Panagiotis
> >
> 
> >>>
> >>
>
>


Re: Flink 1.15 Stabilisation Sync

2022-04-09 Thread Roman Khachatryan
Hi everyone,

FLINK-26985 was just merged.
Sorry for the long delay.

Regards,
Roman


On Tue, Apr 5, 2022 at 11:52 AM Yuan Mei  wrote:
>
> FLINK-26985 was discovered just before last weekend.
>
> We will get it resolved first thing after the holiday (tomorrow).
>
> Best
> Yuan
>
> On Tue, Apr 5, 2022 at 5:37 PM Yun Gao  wrote:
>
> > Hi Robert,
> >
> > Very sorry for the long delay before the rc1 could be published.
> >
> > For the open critical issues, I previously checked with the owners of the
> > issues
> > that they are optional to the release, thus the only blocker issue is the
> > FLINK-26985.
> >
> > Besides the blocker issue, we are waiting for the release blog to be
> > done. Overall, if no new blocker issues come up, both of the current
> > blockers are expected to be finished within this week, and we'll create
> > rc1 immediately after that. We'll also put more effort into tracking and
> > addressing the remaining blockers.
> >
> > Best,
> > Yun Gao
> >
> >
> > --
> > From:Robert Metzger 
> > Send Time:2022 Apr. 5 (Tue.) 13:34
> > To:dev 
> > Subject:Re: Flink 1.15 Stabilisation Sync
> >
> > Thanks a lot for the update.
> >
> > From the burndown board [1] it seems that there's only one blocker left,
> > which has already a fix in review [2].
> > Are we also planning to address the open Critical tickets before opening
> > the first voting release candidate?
> > When do you expect 1.15 to be out?
> >
> > [1]
> >
> > https://issues.apache.org/jira/secure/RapidBoard.jspa?rapidView=505&view=detail&selectedIssue=FLINK-26985
> > [2] https://issues.apache.org/jira/browse/FLINK-26985
> >
> >
> > On Mon, Mar 28, 2022 at 11:18 PM Johannes Moser  wrote:
> >
> > > Dear Community,
> > >
> > > We are fairly on track with stabilising the 1.15 release, which is why
> > > Yun Gao and I think we don't need the sync anymore.
> > >
> > > So we will skip it for now, starting from tomorrow. If the situation
> > > changes, we will let you know.
> > >
> > > Best,
> > > Joe
> >
> >


Re: [DISCUSS] Introduce multi delete API to Flink's FileSystem class

2022-06-30 Thread Roman Khachatryan
Hi,

Thanks for the proposal Yun, I think that's a good idea and it could
solve the issue you mentioned (FLINK-26590) in many cases (though not all,
depending on deletion speed; but in practice it may be enough).

Having a separate interface (BulkDeletingFileSystem) would probably help in
incremental implementation of the feature (i.e. FS by FS, rather than all
at once), although the same can be achieved by adding supportsBulkDelete().

Regarding BulkFileDeleter, I think it's required in some form, because
grouping must be done before calling FS.delete(), even if it accepts a
collection.

Have you considered limiting the batch sizes for deletions?
For example, S3 has a limit of 1000 [1], but the SDK handles it
automatically, IIUC.
If we don't rely on this handling and implement our own, the batches could
also be deleted in parallel. This can be an initial step, from which all
the file systems would benefit, even those without bulk-delete support.

[1]
https://docs.aws.amazon.com/AmazonS3/latest/API/API_DeleteObjects.html
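To make the batching step concrete, here is a rough sketch (the class and
method names are made up for illustration, not a proposed Flink API): paths
are grouped into batches no larger than the backend's limit, and each batch
could then be issued as one bulk-delete call, possibly in parallel.

```java
import java.util.ArrayList;
import java.util.List;

public class BulkDeleteBatcher {

    // Split the paths into batches of at most maxBatchSize entries each,
    // so that every batch fits into a single bulk-delete request
    // (e.g. at most 1000 keys for S3's DeleteObjects).
    static <T> List<List<T>> partition(List<T> paths, int maxBatchSize) {
        List<List<T>> batches = new ArrayList<>();
        for (int i = 0; i < paths.size(); i += maxBatchSize) {
            batches.add(new ArrayList<>(
                    paths.subList(i, Math.min(i + maxBatchSize, paths.size()))));
        }
        return batches;
    }

    public static void main(String[] args) {
        List<String> paths = new ArrayList<>();
        for (int i = 0; i < 2500; i++) {
            paths.add("checkpoint/shared/file-" + i);
        }
        // 2500 paths with a limit of 1000 -> 3 batches (1000 + 1000 + 500);
        // each batch could then be deleted by a separate thread.
        System.out.println(partition(paths, 1000).size()); // prints 3
    }
}
```

Each batch is independent of the others, which is what would make the
parallel deletion mentioned above straightforward.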

Regards,
Roman


On Thu, Jun 30, 2022 at 5:10 PM Piotr Nowojski  wrote:

> Hi,
>
> Yes, I know that you can not use recursive deletes for
> incremental checkpoints and I didn't suggest it anywhere. I just pointed
> out that I would expect multi/bulk deletes to supersede the recursive
> deletes feature assuming good underlying implementation.
> Also I'm not surprised that multi deletes can be faster. I would
> expect/hope for that. I've just raised a point that they don't have to be.
> It depends on the underlying file system. However, in contrast to
> recursive deletes, I wouldn't expect multi deletes to be potentially
> slower.
>
> Re the Dawid's PoC. I'm not sure/I don't remember why he proposed
> `BulkDeletingFileSystem` over adding a default method to the FileSystem
> interface. But it seems to me like a minor point. The majority of Dawid's
> PR is about `BulkFileDeleter` interface, not `BulkDeletingFileSystem`, so
> about how to use the bulk deletes inside Flink, not how to implement it on
> the FileSystem side. Do you maybe have a concrete design proposal for this
> feature?
>
> Best,
> Piotrek
>
> czw., 30 cze 2022 o 15:12 Yun Tang  napisał(a):
>
> > Hi Piotr,
> >
> > As I said in the original email, you cannot delete folders recursively
> for
> > incremental checkpoints. And If you take a close look at the original
> > email, I have shared the experimental results, which proved 29x
> improvement:
> > "A simple experiment shows that deleting 1000 objects with each 5MB size,
> > will cost 39494ms with for-loop single delete operations, and the result
> > will drop to 1347ms if using multi-delete API in Tencent Cloud."
> >
> > I think I can leverage some ideas from Dawid's work. And as I said, I
> > would introduce the multi-delete API to the original FileSystem class
> > instead of introducing another BulkDeletingFileSystem, which makes the
> file
> > system abstraction closer to the modern cloud-based environment.
> >
> > Best
> > Yun Tang
> > 
> > From: Piotr Nowojski 
> > Sent: Thursday, June 30, 2022 18:25
> > To: dev ; Dawid Wysakowicz  >
> > Subject: Re: [DISCUSS] Introduce multi delete API to Flink's FileSystem
> > class
> >
> > Hi,
> >
> > I presume this would mostly supersede the recursive deletes [1]? I
> remember
> > an argument that the recursive deletes were not obviously better, even if
> > the underlying FS was supporting it. I'm not saying that this would have
> > been a counter argument against this effort, since every FileSystem could
> > decide on its own whether to use the multi delete call or not. But I
> think
> > at the very least it should be benchmarked/compared whether implementing
> it
> > for a particular FS makes sense or not.
> >
> > Also there seems to be some similar (abandoned?) effort from Dawid, with
> > named bulk deletes, with "BulkDeletingFileSystem"? [2] Isn't this
> basically
> > the same thing that you are proposing Yun Tang?
> >
> > Best,
> > Piotrek
> >
> > [1] https://issues.apache.org/jira/browse/FLINK-13856
> > [2]
> >
> >
> https://issues.apache.org/jira/browse/FLINK-13856?focusedCommentId=17481712&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17481712
> >
> > czw., 30 cze 2022 o 11:45 Zakelly Lan 
> napisał(a):
> >
> > > Hi Yun,
> > >
> > > Thanks for bringing this into discussion.
> > > I'm +1 to this idea.
> > > And IIUC, Flink implements the OSS and S3 filesystems on top of the
> > > Hadoop filesystem interface, which does not provide a multi-delete API,
> > > so it may take some effort to implement this.
> > >
> > > Best,
> > > Zakelly
> > >
> > > On Thu, Jun 30, 2022 at 5:36 PM Martijn Visser <
> martijnvis...@apache.org
> > >
> > > wrote:
> > >
> > > > Hi Yun Tang,
> > > >
> > > > +1 for addressing this problem and your approach.
> > > >
> > > > Best regards,
> > > >
> > > > Martijn
> > > >
> > > > Op do 30 jun. 2022 om 11:12 schreef Feifan W

Re: [DISCUSS] FLIP-386: Support adding custom metrics in Recovery Spans

2023-11-21 Thread Roman Khachatryan
Thanks for clarifying, I see your points (although reporting metrics as
spans still seems counterintuitive to me).

As for the aggregation, I'm concerned that it might be unnecessarily
ambiguous: where the aggregation is performed (JM/TM);  across what
(tasks/time); and which aggregation should be used.

How about dropping it from the API and always using min, max, sum, avg? I
think we're interested in these aggregations for all the metrics, and there
is no penalty for reporting all of them because it's only for
initialization.
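To make the "always use min, max, sum, avg" suggestion concrete, a minimal
sketch (the class and method names are illustrative, not an actual Flink
API): one recovery metric reported by each subtask is reduced to the four
fixed aggregations, so no aggregation choice needs to be exposed in the API.

```java
import java.util.Arrays;
import java.util.List;

public class RecoveryMetricAggregator {

    // Aggregate one recovery metric (e.g. state download time per subtask)
    // into the four fixed aggregations: min, max, sum, avg.
    static long[] aggregate(List<Long> perSubtaskValues) {
        long min = Long.MAX_VALUE;
        long max = Long.MIN_VALUE;
        long sum = 0;
        for (long v : perSubtaskValues) {
            min = Math.min(min, v);
            max = Math.max(max, v);
            sum += v;
        }
        return new long[] {min, max, sum, sum / perSubtaskValues.size()};
    }

    public static void main(String[] args) {
        // e.g. per-subtask state download times in milliseconds
        long[] stats = aggregate(List.of(120L, 80L, 200L, 100L));
        System.out.println(Arrays.toString(stats)); // prints [80, 200, 500, 125]
    }
}
```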

Regards,
Roman

On Mon, Nov 20, 2023, 8:42 AM Piotr Nowojski  wrote:

> Hi Roman!
>
> > 1. why the existing MetricGroup interface can't be used? It already had
> > methods to add metrics and spans ...
>
> That's because of the need to:
> a) associate the spans to specifically Job's initialisation
> b) we need to logically aggregate the span's attributes across subtasks.
>
> `MetricGroup` doesn't have such capabilities and it's too generic an
> interface to introduce things like that IMO.
>
> Additionally for metrics:
> c) reporting initialization measurements as metrics is a flawed concept, as
> described in FLIP-384's motivation
> Additionally for spans:
> d) as discussed in the FLIP-384 thread, we don't want to report separate
> spans on the TMs, at least not right now
>
> Also, having a specialized class dedicated to initialization metrics for
> collecting those numbers makes the interfaces leaner and more specialized.
>
> > 2. IIUC, based on these numbers, we're going to report span(s). Shouldn't
> > the backend report them as spans?
>
> As discussed in the FLIP-384 thread, initially we don't want to report
> spans on TMs. Later, optionally reporting individual subtasks'
> checkpoint/recovery spans on the JM looks like a logical follow-up.
>
> > 3. How is the implementation supposed to infer that some metric is a part
> > of initialization (and make the corresponding RPC to JM?). Should the
> > interfaces be more explicit about that?
>
> This FLIP proposes [1] to add
> `CustomInitializationMetrics
> KeyedStateBackendParameters#getCustomInitializationMetrics()`
> accessor to the `KeyedStateBackendParameters` argument that's passed to
> `createKeyedStateBackend(...)`
> method. That's pretty explicit I would say :)
>
> > 4. What do you think about using histogram or percentiles instead of
> > min/max/sum/avg? That would be more informative
>
> I would prefer to start with the simplest min/max/sum/avg, and let's see in
> which direction (if any) we need to evolve
> that. Alternative to percentiles is previously mentioned to report
> separately each subtask's initialisation/checkpointing span.
>
> Best,
> Piotrek
>
> [1]
>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-386%3A+Support+adding+custom+metrics+in+Recovery+Spans#FLIP386:SupportaddingcustommetricsinRecoverySpans-PublicInterfaces
>
> czw., 16 lis 2023 o 15:45 Roman Khachatryan  napisał(a):
>
> > Thanks for the proposal,
> >
> > Can you please explain:
> > 1. why the existing MetricGroup interface can't be used? It already had
> > methods to add metrics and spans ...
> >
> > 2. IIUC, based on these numbers, we're going to report span(s). Shouldn't
> > the backend report them as spans?
> >
> > 3. How is the implementation supposed to infer that some metric is a part
> > of initialization (and make the corresponding RPC to JM?). Should the
> > interfaces be more explicit about that?
> >
> > 4. What do you think about using histogram or percentiles instead of
> > min/max/sum/avg? That would be more informative
> >
> > I like the idea of introducing parameter objects for backend creation.
> >
> > Regards,
> > Roman
> >
> > On Tue, Nov 7, 2023, 1:20 PM Piotr Nowojski 
> wrote:
> >
> > > (Fixing topic)
> > >
> > > wt., 7 lis 2023 o 09:40 Piotr Nowojski 
> > napisał(a):
> > >
> > > > Hi all!
> > > >
> > > > I would like to start a discussion on a follow up of FLIP-384:
> > Introduce
> > > > TraceReporter and use it to create checkpointing and recovery traces
> > [1]:
> > > >
> > > > *FLIP-386: Support adding custom metrics in Recovery Spans [2]*
> > > >
> > > > This FLIP adds a functionality that will allow state backends to
> attach
> > > > custom metrics to the recovery/initialization traces. This requires
> > > changes
> > > > to the `@PublicEvolving` `StateBackend` A

Re: [VOTE] FLIP-390: Support System out and err to be redirected to LOG or discarded

2023-11-22 Thread Roman Khachatryan
+1 (binding)


Thanks for the proposal

Regards,
Roman

On Wed, Nov 22, 2023, 10:08 AM Piotr Nowojski 
wrote:

> Thanks Rui!
>
> +1 (binding)
>
> Best,
> Piotrek
>
> śr., 22 lis 2023 o 08:05 Hangxiang Yu  napisał(a):
>
> > +1 (binding)
> > Thanks for your efforts!
> >
> > On Mon, Nov 20, 2023 at 11:53 AM Rui Fan <1996fan...@gmail.com> wrote:
> >
> > > Hi everyone,
> > >
> > > Thank you to everyone for the feedback on FLIP-390: Support
> > > System out and err to be redirected to LOG or discarded[1]
> > > which has been discussed in this thread [2].
> > >
> > > I would like to start a vote for it. The vote will be open for at least
> > 72
> > > hours unless there is an objection or not enough votes.
> > >
> > > [1] https://cwiki.apache.org/confluence/x/4guZE
> > > [2] https://lists.apache.org/thread/47pdjggh0q0tdkq0cwt6y5o2o8wrl9jl
> > >
> > > Best,
> > > Rui
> > >
> >
> >
> > --
> > Best,
> > Hangxiang.
> >
>


Re: [VOTE] FLIP-384: Introduce TraceReporter and use it to create checkpointing and recovery traces

2023-11-22 Thread Roman Khachatryan
+1 (binding)

Regards,
Roman

On Wed, Nov 22, 2023, 7:08 AM Zakelly Lan  wrote:

> +1(non-binding)
>
> Best,
> Zakelly
>
> On Wed, Nov 22, 2023 at 3:04 PM Hangxiang Yu  wrote:
>
> > +1 (binding)
> > Thanks for driving this again!
> >
> > On Wed, Nov 22, 2023 at 10:30 AM Rui Fan <1996fan...@gmail.com> wrote:
> >
> > > +1(binding)
> > >
> > > Best,
> > > Rui
> > >
> > > On Wed, Nov 22, 2023 at 6:43 AM Jing Ge 
> > > wrote:
> > >
> > > > +1(binding) Thanks!
> > > >
> > > > Best regards,
> > > > Jing
> > > >
> > > > On Tue, Nov 21, 2023 at 6:17 PM Piotr Nowojski  >
> > > > wrote:
> > > >
> > > > > Hi All,
> > > > >
> > > > > I'd like to start a vote on the FLIP-384: Introduce TraceReporter
> and
> > > use
> > > > > it to create checkpointing and recovery traces [1]. The discussion
> > > thread
> > > > > is here [2].
> > > > >
> > > > > The vote will be open for at least 72 hours unless there is an
> > > objection
> > > > or
> > > > > not enough votes.
> > > > >
> > > > > [1] https://cwiki.apache.org/confluence/x/TguZE
> > > > > [2]
> https://lists.apache.org/thread/7lql5f5q1np68fw1wc9trq3d9l2ox8f4
> > > > >
> > > > >
> > > > > Best,
> > > > > Piotrek
> > > > >
> > > >
> > >
> >
> >
> > --
> > Best,
> > Hangxiang.
> >
>


Re: [VOTE] FLIP-385: Add OpenTelemetryTraceReporter and OpenTelemetryMetricReporter

2023-11-22 Thread Roman Khachatryan
+1 (binding)

Regards,
Roman

On Wed, Nov 22, 2023, 7:30 AM Hangxiang Yu  wrote:

> +1(binding)
>
> On Wed, Nov 22, 2023 at 10:29 AM Rui Fan <1996fan...@gmail.com> wrote:
>
> > +1(binding)
> >
> > Best,
> > Rui
> >
> > On Wed, Nov 22, 2023 at 1:20 AM Piotr Nowojski 
> > wrote:
> >
> > > Hi All,
> > >
> > > I'd like to start a vote on the FLIP-385: Add
> OpenTelemetryTraceReporter
> > > and OpenTelemetryMetricReporter [1]. The discussion thread is here [2].
> > >
> > > The vote will be open for at least 72 hours unless there is an
> objection
> > or
> > > not enough votes.
> > >
> > > [1] https://cwiki.apache.org/confluence/x/UAuZE
> > > [2] https://lists.apache.org/thread/1rqp8czz8wnplpzgn8m4qmzvf14lyx0k
> > >
> > >
> > > Best,
> > > Piotrek
> > >
> >
>
>
> --
> Best,
> Hangxiang.
>


Re: [VOTE] FLIP-386: Support adding custom metrics in Recovery Spans

2023-11-23 Thread Roman Khachatryan
+1 (binding)

Regards,
Roman


On Wed, Nov 22, 2023 at 12:55 PM Rui Fan <1996fan...@gmail.com> wrote:

> +1(binding)
>
> Thanks for driving this proposal!
>
> Best,
> Rui
>
> On Wed, Nov 22, 2023 at 7:44 PM Piotr Nowojski 
> wrote:
>
> > Hi All,
> >
> > I'd like to start a vote on the FLIP-386: Support adding custom metrics
> in
> > Recovery Spans [1]. The discussion thread is here [2].
> >
> > The vote will be open for at least 72 hours unless there is an objection
> or
> > not enough votes.
> >
> > [1] https://cwiki.apache.org/confluence/x/VAuZE
> > [2] https://lists.apache.org/thread/zt4ykyhv6cco83j9hjngn52b1oprj1tv
> >
>


Re: [VOTE] FLIP-424: Asynchronous State APIs

2024-03-30 Thread Roman Khachatryan
+1 (binding)

Regards,
Roman


On Fri, Mar 29, 2024 at 7:01 AM Xintong Song  wrote:

> +1 (binding)
>
> Best,
>
> Xintong
>
>
>
> On Fri, Mar 29, 2024 at 12:51 PM Yuepeng Pan 
> wrote:
>
> > +1(non-binding)
> >
> > Best,
> > Yuepeng Pan
> >
> >
> > On 2024/03/29 03:03:53 Yunfeng Zhou wrote:
> > > +1 (non-binding)
> > >
> > > Best,
> > > Yunfeng
> > >
> > > On Wed, Mar 27, 2024 at 6:23 PM Zakelly Lan 
> > wrote:
> > > >
> > > > Hi devs,
> > > >
> > > > I'd like to start a vote on the FLIP-424: Asynchronous State APIs
> [1].
> > The
> > > > discussion thread is here [2].
> > > >
> > > > The vote will be open for at least 72 hours unless there is an
> > objection or
> > > > insufficient votes.
> > > >
> > > > [1] https://cwiki.apache.org/confluence/x/SYp3EQ
> > > > [2] https://lists.apache.org/thread/nmd9qd0k8l94ygcfgllxms49wmtz1864
> > > >
> > > >
> > > > Best,
> > > > Zakelly
> > >
> >
>


Re: [VOTE] FLIP-425: Asynchronous Execution Model

2024-03-30 Thread Roman Khachatryan
+1 (binding)

Regards,
Roman


On Fri, Mar 29, 2024 at 8:08 AM yue ma  wrote:

> +1 (non-binding)
>
> Yanfei Lei  于2024年3月27日周三 18:28写道:
>
> > Hi everyone,
> >
> > Thanks for all the feedback about the FLIP-425: Asynchronous Execution
> > Model [1]. The discussion thread is here [2].
> >
> > The vote will be open for at least 72 hours unless there is an
> > objection or insufficient votes.
> >
> > [1] https://cwiki.apache.org/confluence/x/S4p3EQ
> > [2] https://lists.apache.org/thread/wxn1j848fnfkqsnrs947wh1wmj8n8z0h
> >
> > Best regards,
> > Yanfei
> >
>
>
> --
> Best,
> Yue
>


[DISCUSS] Allow sharing (RocksDB) memory between slots

2022-11-08 Thread Roman Khachatryan
Hi everyone,

I'd like to discuss sharing RocksDB memory across slots as proposed in
FLINK-29928 [1].

Since 1.10 / FLINK-7289 [2], it is possible to:
- share RocksDB memory objects (such as the block cache and write buffer
manager) among RocksDB instances of the same slot
- bound the total memory usage by all RocksDB instances of a TM

However, the memory is divided between the slots equally (unless using
fine-grained resource control). This is sub-optimal if some slots contain
more memory intensive tasks than the others.
Using fine-grained resource control is also often not an option because the
workload might not be known in advance.

The proposal is to widen the scope of memory sharing to the TM, so that
memory can be shared across all RocksDB instances of that TM. That would
reduce the overall memory consumption at the cost of resource isolation.

Please see FLINK-29928 [1] for more details.
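To illustrate the arithmetic behind the proposed configuration (a rough
sketch only; the "shared-fraction" option name follows the discussion in
FLINK-29928 and is not a committed API): a TM-wide shared pool would be
carved out of managed memory by the fraction, and the remainder would still
be split evenly across slots as today.

```java
public class SharedManagedMemoryMath {

    // Size of the hypothetical TM-wide shared pool carved out of managed
    // memory by taskmanager.memory.managed.shared-fraction.
    static long sharedPoolBytes(long managedBytes, double sharedFraction) {
        return (long) (managedBytes * sharedFraction);
    }

    // The remaining (exclusive) managed memory is still divided evenly
    // between slots, as it is today.
    static long exclusivePerSlotBytes(
            long managedBytes, double sharedFraction, int numSlots) {
        return (managedBytes - sharedPoolBytes(managedBytes, sharedFraction)) / numSlots;
    }

    public static void main(String[] args) {
        long managed = 3L << 30; // 3 GiB of managed memory
        double fraction = 0.25;  // hypothetical shared-fraction setting
        System.out.println(sharedPoolBytes(managed, fraction));
        System.out.println(exclusivePerSlotBytes(managed, fraction, 4));
    }
}
```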

Looking forward to feedback on that proposal.

[1]
https://issues.apache.org/jira/browse/FLINK-29928
[2]
https://issues.apache.org/jira/browse/FLINK-7289

Regards,
Roman


Re: [DISCUSS] Allow sharing (RocksDB) memory between slots

2022-11-09 Thread Roman Khachatryan
Hi Yanfei,

Thanks, good questions

> 1. Is shared-memory only for the state backend? If both
> "taskmanager.memory.managed.shared-fraction: >0" and
> "state.backend.rocksdb.memory.managed: false" are set at the same time,
> will the shared-memory be wasted?
Yes, shared memory is only for the state backend currently;
If no job uses it then it will be wasted.
Session cluster can not validate this configuration
because the job configuration is not known in advance.

> 2. It's said that "Jobs 4 and 5 will use the same 750Mb of unmanaged
memory
> and will compete with each other" in the example, how is the memory size
of
> unmanaged part calculated?
It's calculated the same way as managed memory size currently,
i.e. taskmanager.memory.managed.size *
taskmanager.memory.managed.shared-fraction
Separate parameters for unmanaged memory would be more clear.
However, I doubt that this configuration would ever be used (I listed it
just for completeness).
So I'm not sure whether adding them would be justified.
WDYT?

> 3. For fine-grained-resource-management, the control
> of cpuCores, taskHeapMemory can still work, right?
Yes, for other resources fine-grained-resource-management should work.

> And I am a little
> worried that too many memory-about configuration options are complicated
> for users to understand.
I'm also worried about having too many options, but I don't see any better
alternative.
The existing users definitely shouldn't be affected,
so there must be at least feature toggle ("shared-fraction").
"share-scope" could potentially be replaced by some inference logic,
but having it explicit seems less error-prone.

Regards,
Roman


On Wed, Nov 9, 2022 at 3:50 AM Yanfei Lei  wrote:

> Hi Roman,
> Thanks for the proposal, this allows State Backend to make better use of
> memory.
>
> After reading the ticket, I'm curious about some points:
>
> 1. Is shared-memory only for the state backend? If both
> "taskmanager.memory.managed.shared-fraction: >0" and
> "state.backend.rocksdb.memory.managed: false" are set at the same time,
> will the shared-memory be wasted?
> 2. It's said that "Jobs 4 and 5 will use the same 750Mb of unmanaged memory
> and will compete with each other" in the example, how is the memory size of
> unmanaged part calculated?
> 3. For fine-grained-resource-management, the control
> of cpuCores, taskHeapMemory can still work, right?  And I am a little
> worried that too many memory-about configuration options are complicated
> for users to understand.
>
> Regards,
> Yanfei
>
> Roman Khachatryan  于2022年11月8日周二 23:22写道:
>
> > Hi everyone,
> >
> > I'd like to discuss sharing RocksDB memory across slots as proposed in
> > FLINK-29928 [1].
> >
> > Since 1.10 / FLINK-7289 [2], it is possible to:
> > - share these objects among RocksDB instances of the same slot
> > - bound the total memory usage by all RocksDB instances of a TM
> >
> > However, the memory is divided between the slots equally (unless using
> > fine-grained resource control). This is sub-optimal if some slots contain
> > more memory intensive tasks than the others.
> > Using fine-grained resource control is also often not an option because
> the
> > workload might not be known in advance.
> >
> > The proposal is to widen the scope of sharing memory to TM, so that it
> can
> > be shared across all RocksDB instances of that TM. That would reduce the
> > overall memory consumption in exchange for resource isolation.
> >
> > Please see FLINK-29928 [1] for more details.
> >
> > Looking forward to feedback on that proposal.
> >
> > [1]
> > https://issues.apache.org/jira/browse/FLINK-29928
> > [2]
> > https://issues.apache.org/jira/browse/FLINK-7289
> >
> > Regards,
> > Roman
> >
>


Re: [DISCUSS] Allow sharing (RocksDB) memory between slots

2022-11-11 Thread Roman Khachatryan
Hi John, Yun,

Thank you for your feedback

@John

> It seems like operators would either choose isolation for the cluster’s
jobs
> or they would want to share the memory between jobs.
> I’m not sure I see the motivation to reserve only part of the memory for
sharing
> and allowing jobs to choose whether they will share or be isolated.

I see two related questions here:

1) Whether to allow mixed workloads within the same cluster.
I agree that most likely all the jobs will have the same "sharing"
requirement.
So we can drop "state.backend.memory.share-scope" from the proposal.

2) Whether to allow different memory consumers to use shared or exclusive
memory.
Currently, only RocksDB is proposed to use shared memory. For python, it's
non-trivial because it is job-specific.
So we have to partition managed memory into shared/exclusive and therefore
can NOT replace "taskmanager.memory.managed.shared-fraction" with some
boolean flag.

I think your question was about (1), just wanted to clarify why the
shared-fraction is needed.

@Yun

> I am just curious whether this could really bring benefits to our users
with such complex configuration logic.
I agree, and configuration complexity seems a common concern.
I hope that removing "state.backend.memory.share-scope" (as proposed above)
reduces the complexity.
Please share any ideas of how to simplify it further.

> Could you share some real experimental results?
I did an experiment to verify that the approach is feasible,
i.e. multilple jobs can share the same memory/block cache.
But I guess that's not what you mean here? Do you have any experiments in
mind?

> BTW, as talked before, I am not sure whether different lifecycles of
RocksDB state-backends
> would affect the memory usage of block cache & write buffer manager in
RocksDB.
> Currently, all instances would start and destroy nearly simultaneously,
> this would change after we introduce this feature with jobs running at
different scheduler times.
IIUC, the concern is that closing a RocksDB instance might close the
BlockCache.
I checked that manually and it seems to work as expected.
And I think that would contradict the sharing concept, as described in the
documentation [1].

[1]
https://github.com/facebook/rocksdb/wiki/Block-Cache

Regards,
Roman


On Wed, Nov 9, 2022 at 3:50 AM Yanfei Lei  wrote:

> Hi Roman,
> Thanks for the proposal, this allows State Backend to make better use of
> memory.
>
> After reading the ticket, I'm curious about some points:
>
> 1. Is shared-memory only for the state backend? If both
> "taskmanager.memory.managed.shared-fraction: >0" and
> "state.backend.rocksdb.memory.managed: false" are set at the same time,
> will the shared-memory be wasted?
> 2. It's said that "Jobs 4 and 5 will use the same 750Mb of unmanaged memory
> and will compete with each other" in the example, how is the memory size of
> unmanaged part calculated?
> 3. For fine-grained-resource-management, the control
> of cpuCores, taskHeapMemory can still work, right?  And I am a little
> worried that too many memory-about configuration options are complicated
> for users to understand.
>
> Regards,
> Yanfei
>
> Roman Khachatryan  于2022年11月8日周二 23:22写道:
>
> > Hi everyone,
> >
> > I'd like to discuss sharing RocksDB memory across slots as proposed in
> > FLINK-29928 [1].
> >
> > Since 1.10 / FLINK-7289 [2], it is possible to:
> > - share these objects among RocksDB instances of the same slot
> > - bound the total memory usage by all RocksDB instances of a TM
> >
> > However, the memory is divided between the slots equally (unless using
> > fine-grained resource control). This is sub-optimal if some slots contain
> > more memory intensive tasks than the others.
> > Using fine-grained resource control is also often not an option because
> the
> > workload might not be known in advance.
> >
> > The proposal is to widen the scope of sharing memory to TM, so that it
> can
> > be shared across all RocksDB instances of that TM. That would reduce the
> > overall memory consumption in exchange for resource isolation.
> >
> > Please see FLINK-29928 [1] for more details.
> >
> > Looking forward to feedback on that proposal.
> >
> > [1]
> > https://issues.apache.org/jira/browse/FLINK-29928
> > [2]
> > https://issues.apache.org/jira/browse/FLINK-7289
> >
> > Regards,
> > Roman
> >
>


Re: [DISCUSS] Allow sharing (RocksDB) memory between slots

2022-11-16 Thread Roman Khachatryan
> > will break the isolation. It's just a matter of whether it's managed
> > memory or not.
> > Do you see any reasons why unmanaged memory can be shared, and managed
> > memory can not?
> >
> > > 2. It should never be wasted (unless there's nothing in the job that
> > needs
> > > managed memory)
> > If I understand correctly, the managed memory can already be wasted
> because
> > it is divided evenly between slots, regardless of the existence of its
> > consumers in a particular slot.
> > And in general, even if every slot has RocksDB / python, it's not
> > guaranteed equal consumption.
> > So this property would rather be fixed in the current proposal.
> >
> > > In addition, it further complicates configuration / computation logics
> of
> > > managed memory.
> > I think having multiple options overriding each other increases the
> > complexity for the user. As for the computation, I think it's desirable
> to
> > let Flink do it rather than users.
> >
> > Both approaches need some help from TM for:
> > - storing the shared resources (static field in a class might be too
> > dangerous because if the backend is loaded by the user-class-loader then
> > memory will leak silently).
> > - reading the configuration
> >
> > Regards,
> > Roman
> >
> >
> > On Sun, Nov 13, 2022 at 11:24 AM Xintong Song 
> > wrote:
> >
> > > I like the idea of sharing RocksDB memory across slots. However, I'm
> > quite
> > > concerned by the current proposed approach.
> > >
> > > The proposed changes break several good properties that we designed for
> > > managed memory.
> > > 1. It's isolated across slots
> > > 2. It should never be wasted (unless there's nothing in the job that
> > needs
> > > managed memory)
> > > In addition, it further complicates configuration / computation logics
> of
> > > managed memory.
> > >
> > > As an alternative, I'd suggest introducing a variant of
> > > RocksDBStateBackend, that shares memory across slots and does not use
> > > managed memory. This basically means the shared memory is not
> considered
> > as
> > > part of managed memory. For users of this new feature, they would need
> to
> > > configure how much memory the variant state backend should use, and
> > > probably also a larger framework-off-heap / jvm-overhead memory. The
> > latter
> > > might require a bit extra user effort compared to the current approach,
> > but
> > > would avoid significant complexity in the managed memory configuration
> > and
> > > calculation logics which affects broader users.
> > >
> > >
> > > Best,
> > >
> > > Xintong
> > >
> > >
> > >
> > > On Sat, Nov 12, 2022 at 1:21 AM Roman Khachatryan 
> > > wrote:
> > >
> > > > Hi John, Yun,
> > > >
> > > > Thank you for your feedback
> > > >
> > > > @John
> > > >
> > > > > It seems like operators would either choose isolation for the
> > cluster’s
> > > > jobs
> > > > > or they would want to share the memory between jobs.
> > > > > I’m not sure I see the motivation to reserve only part of the
> memory
> > > for
> > > > sharing
> > > > > and allowing jobs to choose whether they will share or be isolated.
> > > >
> > > > I see two related questions here:
> > > >
> > > > 1) Whether to allow mixed workloads within the same cluster.
> > > > I agree that most likely all the jobs will have the same "sharing"
> > > > requirement.
> > > > So we can drop "state.backend.memory.share-scope" from the proposal.
> > > >
> > > > 2) Whether to allow different memory consumers to use shared or
> > exclusive
> > > > memory.
> > > > Currently, only RocksDB is proposed to use shared memory. For python,
> > > it's
> > > > non-trivial because it is job-specific.
> > > > So we have to partition managed memory into shared/exclusive and
> > > therefore
> > > > can NOT replace "taskmanager.memory.managed.shared-fraction" with
> some
> > > > boolean flag.
> > > >
> > > > I think your question was about (1), just wanted to clarify why the
> > > > shared-fraction is nee

Re: [DISCUSS] FLIP-266: Simplify network memory configurations for TaskManager

2022-12-27 Thread Roman Khachatryan
Hi everyone,

Thanks for the proposal and the discussion.

I couldn't find much detail on how exactly the values of
ExclusiveBuffersPerChannel and FloatingBuffersPerGate are calculated.
I guess that
- the threshold evaluation is done on JM
- floating buffers calculation is done on TM based on the current memory
available; so it is not taking into account any future tasks submitted for
that (or other) job
Is that correct?

If so, I see the following potential issues:

1. Each (sub)task might have different values because the actual
available memory might be different. E.g. some tasks might use exclusive
buffers and others only floating. That could lead to significant skew
in processing speed, and in turn to issues with checkpoints and watermarks.

2. Re-deployment of a task (e.g. on job failure) might lead to a completely
different memory configuration. That, coupled with different values per
subtask and operator, makes the performance analysis more difficult.

(Regardless of whether it's done on TM or JM):
3. Each gate requires at least one buffer [1]. So, when no memory is
available, the TM will throw an allocation-timeout exception instead of
immediately throwing an insufficient-buffers exception. A delay here (the
allocation timeout) seems like a regression.
Besides that, the regression depends on how much memory is actually
available and how much it is contended, doesn't it?
Should there still be a lower threshold of available memory, below which
the job (task) isn't accepted?
4. The same threshold for all types of shuffles will likely result in using
exclusive buffers
for point-wise connections and floating buffers for all-to-all ones. I'm
not sure if that's always optimal. It would be great to have experimental
results for jobs with different exchange types, WDYT?

[1]
https://issues.apache.org/jira/browse/FLINK-24035

Regards,
Roman


On Tue, Dec 27, 2022 at 4:12 AM Yuxin Tan  wrote:

> Hi, Weihua
>
> Thanks for your suggestions.
>
> > 1. How about reducing ExclusiveBuffersPerChannel to 1 first when the
> total buffer is not enough?
>
> I think it's a good idea. Will try and check the results in PoC. Before all
> read buffers use floating buffers, I will try to use
> (ExclusiveBuffersPerChannel - i)
> buffers per channel first. For example, if the user has configured
> ExclusiveBuffersPerChannel to 4, it will check whether all read buffers
> are sufficient from 4 to 1. Only when ExclusiveBuffersPerChannel of
> all channels is 1 and all read buffers are insufficient, all read buffers
> will use floating buffers.
> If the test results prove better, the FLIP will use this method.
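
The linear-search fallback described above could be sketched roughly as follows. This is a toy model, not Flink's actual buffer allocator; the method and parameter names are made up for illustration:

```java
// Hypothetical sketch of the linear-search fallback: try the configured
// exclusive buffers per channel, then (configured - 1), ... down to 1;
// if even 1 per channel does not fit, fall back to floating buffers only.
public class BufferFallbackSketch {

    /** Returns the exclusive buffers per channel to use, or 0 for all-floating. */
    static int chooseExclusivePerChannel(int configured, int numChannels,
                                         int floatingPerGate, int available) {
        for (int perChannel = configured; perChannel >= 1; perChannel--) {
            int required = perChannel * numChannels + floatingPerGate;
            if (required <= available) {
                return perChannel;
            }
        }
        return 0; // all read buffers use floating buffers
    }

    public static void main(String[] args) {
        // 100 channels, 8 floating buffers per gate, 220 buffers available:
        // 4*100+8=408 and 3*100+8=308 don't fit, 2*100+8=208 does.
        System.out.println(chooseExclusivePerChannel(4, 100, 8, 220)); // 2
    }
}
```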
>
> > 2. Do we really need to change the default value of
> 'taskmanager.memory.network.max'?
>
> Changing taskmanager.memory.network.max will indeed affect some
> users, but a user is only affected when all 3 conditions are fulfilled.
> 1) Flink total TM memory is larger than 10g (because the network memory
> ratio is 0.1).
> 2) taskmanager.memory.network.max was not initially configured.
> 3) Other memory, such as managed memory or heap memory, is insufficient.
> I think the number of jobs fulfilling the conditions is small because
> when TM uses such a large amount of memory, the network memory requirement
> may also be large. And when encountering the issue, the rollback method is
> very simple: configuring taskmanager.memory.network.max as 1g or other
> values.
> In addition, the reason for modifying the default value is to simplify
> the network configurations in most scenarios. This change does affect a
> few usage scenarios, but we should admit that setting the default to any
> value may not meet the requirements of all scenarios.
>
> Best,
> Yuxin
>
>
> On Mon, Dec 26, 2022 at 8:35 PM Weihua Hu  wrote:
>
> > Hi Yuxin,
> > Thanks for the proposal.
> >
> > "Insufficient number of network buffers" exceptions also bother us. It's
> > too hard for users to figure out
> > how much network buffer they really need. It relates to partitioner type,
> > parallelism, slots per taskmanager.
> >
> > Since streaming jobs are our primary scenario, I have some questions
> about
> > streaming jobs.
> >
> > 1. In this FLIP, all read buffers will use floating buffers when the
> total
> > buffer is more than
> > 'taskmanager.memory.network.read-required-buffer.max'. Competition in
> > buffer allocation leads to performance regression.
> > How about reducing ExclusiveBuffersPerChannel to 1 first when the total
> > buffer is not enough?
> > Will this reduce performance regression in streaming?
> >
> > 2. Changing taskmanager.memory.network.max will affect user migration
> from
> > the lower version.
> > IMO, network buffer size should not increase with total memory,
> especially
> > for streaming jobs with application mode.
> > For example, some ETL jobs with rescale partitioner only require a few
> > network buffers.
> > And we already have 'taskmanager.memory.network.read-required-buffer.max'
> > to control maximum read network buffer usage.
> > Do we really need to change the default value of
> > 'taskmanager.memory.network.max'?

Re: [DISCUSS] FLIP-266: Simplify network memory configurations for TaskManager

2022-12-28 Thread Roman Khachatryan
Thanks for your reply Yuxin,

> ExclusiveBuffersPerChannel and FloatingBuffersPerGate are obtained from
> configurations, which are not calculated. I have described them in the
FLIP
> motivation section.

The motivation section says about floating buffers:
> FloatingBuffersPerGate is within the range of
[numFloatingBufferThreashold, ExclusiveBuffersPerChannel * numChannels +
DefaultFloatingBuffersPerGate] ...
So my question is what value exactly in this range will it have and how and
where will it be computed?

As for the ExclusiveBuffersPerChannel, there was a proposal in the thread
to calculate it dynamically (by linear search
from taskmanager.network.memory.buffers-per-channel down to 0).

Also, if the two configuration options are still in use, why does the FLIP
propose to deprecate them?

Besides that, wouldn't it be more clear to separate motivation from the
proposed changes?

Regards,
Roman


On Wed, Dec 28, 2022 at 12:19 PM JasonLee <17610775...@163.com> wrote:

> Hi Yuxin
>
>
> Thanks for the proposal, big + 1 for this FLIP.
>
>
>
> It is difficult for users to calculate the size of network memory. If the
> setting is too small, the task cannot be started. If the setting is too
> large, there may be a waste of resources. As far as possible, the Flink
> framework should automatically set a reasonable value, but I have a small
> question: network memory is not only related to the parallelism of the
> task, but also to the complexity of the task DAG. The more complex a DAG
> is, the larger the buffers that shuffle write and shuffle read require.
> How can we determine how many RS and IG a DAG has?
>
>
>
> Best
> JasonLee
>
>
>  Replied Message 
> | From | Yuxin Tan |
> | Date | 12/28/2022 18:29 |
> | To |  |
> | Subject | Re: [DISCUSS] FLIP-266: Simplify network memory configurations
> for TaskManager |
> Hi, Roman
>
> Thanks for the reply.
>
> ExclusiveBuffersPerChannel and FloatingBuffersPerGate are obtained from
> configurations, which are not calculated. I have described them in the FLIP
> motivation section.
>
> 3. Each gate requires at least one buffer...
> The timeout exception occurs when the ExclusiveBuffersPerChannel
> cannot be requested from the NetworkBufferPool, which is not caused by
> the changes of this FLIP. In addition, we have set the
> ExclusiveBuffersPerChannel to 0 when using floating buffers, which can
> also decrease the probability of this exception.
>
> 4. It would be great to have experimental results for jobs with different
> exchange types.
> Thanks for the suggestion. I have run a test with different exchange
> types, forward and rescale, and the results show no differences from the
> all-to-all type, which is also understandable, because the network memory
> usage is calculated with numChannels, independent of the edge type.
>
> Best,
> Yuxin
>
>
On Wed, Dec 28, 2022 at 5:27 AM Roman Khachatryan  wrote:
>
> Hi everyone,
>
> Thanks for the proposal and the discussion.
>
> I couldn't find much details on how exactly the values of
> ExclusiveBuffersPerChannel and FloatingBuffersPerGate are calculated.
> I guess that
> - the threshold evaluation is done on JM
> - floating buffers calculation is done on TM based on the current memory
> available; so it is not taking into account any future tasks submitted for
> that (or other) job
> Is that correct?
>
> If so, I see the following potential issues:
>
> 1. Each (sub)task might have different values because the actual
> available memory might be different. E.g. some tasks might use exclusive
> buffers and others only floating. That could lead to significant skew
> in processing speed, and in turn to issues with checkpoints and watermarks.
>
> 2. Re-deployment of a task (e.g. on job failure) might lead to a completely
> different memory configuration. That, coupled with different values per
> subtask and operator, makes the performance analysis more difficult.
>
> (Regardless of whether it's done on TM or JM):
> 3. Each gate requires at least one buffer [1]. So, in case when no memory
> is available, TM will throw an Allocation timeout exception instead of
> Insufficient buffers exception immediately. A delay here (allocation
> timeout) seems like a regression.
> Besides that, the regression depends on how much memory is actually
> available and how much it is contended, doesn't it?
> Should there still be a lower threshold of available memory, below which
> the job (task) isn't accepted?
> 4. The same threshold for all types of shuffles will likely result in using
> exclusive buffers
> for point-wise connections and floating buffers for all-to-all ones. I'm
> not sure if that's always optimal. It would be great to have experimental
> results for jobs with different exchange types, WDYT?

Re: [DISCUSS] FLIP-447: Upgrade FRocksDB from 6.20.3 to 8.10.0

2024-04-22 Thread Roman Khachatryan
Hi,

Thanks for writing the proposal and preparing the upgrade.

FRocksDB definitely needs to be kept in sync with the upstream and the new
APIs are necessary for faster rescaling.
We're already using a similar version internally.

I reviewed the FLIP and it looks good to me (disclaimer: I took part in
some steps of this effort).


Regards,
Roman

On Mon, Apr 22, 2024, 08:11 yue ma  wrote:

> Hi Flink devs,
>
> I would like to start a discussion on FLIP-447: Upgrade FRocksDB from
> 6.20.3 to 8.10.0
>
>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-447%3A+Upgrade+FRocksDB+from+6.20.3++to+8.10.0
>
> This FLIP proposes upgrading the version of FRocksDB in the Flink Project
> from 6.20.3 to 8.10.0.
> The FLIP mainly introduces the main benefits of upgrading FRocksDB,
> including the use of IngestDB which can improve Rescaling performance by
> more than 10 times in certain scenarios, as well as other potential
> optimization points such as async_io, blob db, and tiered storage. The
> FLIP also presented test results based on RocksDB 8.10, including
> StateBenchmark and Nexmark tests.
> Overall, upgrading FRocksDB may result in a small regression of write
> performance (which is a very small part of the overall overhead), but it
> can bring many important performance benefits.
> So we hope to upgrade the version of FRocksDB through this FLIP.
>
> Looking forward to everyone's feedback and suggestions. Thank you!
> --
> Best regards,
> Yue
>


Re: [ANNOUNCE] New Apache Flink Committer - Zakelly Lan

2024-04-26 Thread Roman Khachatryan
Congrats, well deserved!

Regards,
Roman


On Thu, Apr 18, 2024 at 9:06 AM xiangyu feng  wrote:

> Congratulations, Zakelly!
>
>
> Regards,
> Xiangyu Feng
>
> > On Thu, Apr 18, 2024 at 2:27 PM yh z  wrote:
>
> > Congratulations Zakelly!
> >
> > Best regards,
> > Yunhong (swuferhong)
> >
> > > On Wed, Apr 17, 2024 at 9:26 PM gongzhongqiang  wrote:
> >
> > > Congratulations, Zakelly!
> > >
> > >
> > > Best,
> > > Zhongqiang Gong
> > >
> > > > On Mon, Apr 15, 2024 at 10:51 AM Yuan Mei  wrote:
> > >
> > > > Hi everyone,
> > > >
> > > > On behalf of the PMC, I'm happy to let you know that Zakelly Lan has
> > > become
> > > > a new Flink Committer!
> > > >
> > > > Zakelly has been continuously contributing to the Flink project since
> > > 2020,
> > > > with a focus area on Checkpointing, State as well as frocksdb (the
> > > default
> > > > on-disk state db).
> > > >
> > > > He leads several FLIPs to improve checkpoints and state APIs,
> including
> > > > File Merging for Checkpoints and configuration/API reorganizations.
> He
> > is
> > > > also one of the main contributors to the recent efforts of
> > "disaggregated
> > > > state management for Flink 2.0" and drives the entire discussion in
> the
> > > > mailing thread, demonstrating outstanding technical depth and breadth
> > of
> > > > knowledge.
> > > >
> > > > Beyond his technical contributions, Zakelly is passionate about
> helping
> > > the
> > > > community in numerous ways. He spent quite some time setting up the
> > Flink
> > > > Speed Center and rebuilding the benchmark pipeline after the original
> > one
> > > > was out of lease. He helps build frocksdb and tests for the upcoming
> > > > frocksdb release (bump rocksdb from 6.20.3->8.10).
> > > >
> > > > Please join me in congratulating Zakelly for becoming an Apache Flink
> > > > committer!
> > > >
> > > > Best,
> > > > Yuan (on behalf of the Flink PMC)
> > > >
> > >
> >
>


Re: [DISCUSS] FLIP-443: Interruptible watermark processing

2024-04-30 Thread Roman Khachatryan
Thanks for the proposal, I definitely see the need for this improvement, +1.

Regards,
Roman


On Tue, Apr 30, 2024 at 3:11 PM Piotr Nowojski  wrote:

> Hi Yanfei,
>
> Thanks for the feedback!
>
> > 1. Currently when AbstractStreamOperator or AbstractStreamOperatorV2
> > processes a watermark, the watermark will be sent to downstream, if
> > the `InternalTimerServiceImpl#advanceWatermark` is interrupted, when
> > is the watermark sent downstream?
>
> The watermark would be output by an operator only once all relevant
> timers have fired.
> In other words, if the firing of timers is interrupted, a continuation
> mail is created to continue firing the interrupted timers. The watermark
> will be emitted downstream at the end of that continuation mail.
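
A toy model of the continuation-mail behaviour described above (illustrative only; these are not Flink's actual mailbox or timer-service classes, and the batching "budget" is a made-up stand-in for the real interruption condition):

```java
import java.util.ArrayDeque;
import java.util.PriorityQueue;
import java.util.Queue;

// Timers are fired in batches inside mailbox "mails"; if interrupted, the
// mail re-enqueues itself as a continuation, and the watermark is forwarded
// only after the last timer at or below it has fired.
public class InterruptibleTimersSketch {
    final PriorityQueue<Long> timers = new PriorityQueue<>();
    final Queue<Runnable> mailbox = new ArrayDeque<>();
    long emittedWatermark = Long.MIN_VALUE;

    void advanceWatermark(long watermark, int budgetPerMail) {
        Runnable mail = new Runnable() {
            @Override public void run() {
                int fired = 0;
                while (!timers.isEmpty() && timers.peek() <= watermark) {
                    timers.poll(); // "fire" the timer
                    if (++fired == budgetPerMail && !timers.isEmpty()
                            && timers.peek() <= watermark) {
                        mailbox.add(this); // interrupted: continuation mail
                        return;
                    }
                }
                emittedWatermark = watermark; // all timers fired: emit downstream
            }
        };
        mailbox.add(mail);
    }

    void processMailbox() {
        while (!mailbox.isEmpty()) {
            mailbox.poll().run();
        }
    }

    public static void main(String[] args) {
        InterruptibleTimersSketch op = new InterruptibleTimersSketch();
        for (long t = 1; t <= 10; t++) op.timers.add(t);
        op.advanceWatermark(10, 3); // fire at most 3 timers per mail
        op.processMailbox();        // runs 4 mails (3+3+3+1 timers)
        System.out.println(op.emittedWatermark); // 10
    }
}
```

Between two continuation mails the mailbox can process other mails (e.g. a checkpoint barrier), which is the point of the FLIP.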
>
> > 2. IIUC, processing-timer's firing is also encapsulated into mail and
> > executed in mailbox. Is processing-timer allowed to be interrupted?
>
> Yes, firing both processing-time and event-time timers shares the same
> code and both will
> support interruptions in the same way. Actually I've renamed the FLIP from
>
> > Interruptible watermarks processing
>
> to:
>
> > Interruptible timers firing
>
> to make this more clear.
>
> Best,
> Piotrek
>
> On Tue, Apr 30, 2024 at 6:08 AM Yanfei Lei  wrote:
>
> > Hi Piotrek,
> >
> > Thanks for this proposal. It looks like it will shorten the checkpoint
> > duration, especially in the case of back pressure. +1 for it!  I'd
> > like to ask some questions to understand your thoughts more precisely.
> >
> > 1. Currently when AbstractStreamOperator or AbstractStreamOperatorV2
> > processes a watermark, the watermark will be sent to downstream, if
> > the `InternalTimerServiceImpl#advanceWatermark` is interrupted, when
> > is the watermark sent downstream?
> > 2. IIUC, processing-timer's firing is also encapsulated into mail and
> > executed in mailbox. Is processing-timer allowed to be interrupted?
> >
> > Best regards,
> > Yanfei
> >
> > On Mon, Apr 29, 2024 at 9:57 PM Piotr Nowojski  wrote:
> >
> > >
> > > Hi all,
> > >
> > > I would like to start a discussion on FLIP-443: Interruptible watermark
> > > processing.
> > >
> > > https://cwiki.apache.org/confluence/x/qgn9EQ
> > >
> > > This proposal tries to make Flink's subtask thread more responsive when
> > > processing watermarks/firing timers, and make those operations
> > > interruptible/break them apart into smaller steps. At the same time,
> the
> > > proposed solution could be potentially adopted in other places in the
> > code
> > > base as well, to solve similar problems with other flatMap-like
> operators
> > > (non windowed joins, aggregations, CepOperator, ...).
> > >
> > > I'm looking forward to your thoughts.
> > >
> > > Best,
> > > Piotrek
> >
>


Re: [DISCUSS] FLIP-444: Native file copy support

2024-04-30 Thread Roman Khachatryan
Hi Piotr,

+1 for the proposal, the recovery time improvements are significant IMO

Thanks for pushing this

Regards,
Roman


On Tue, Apr 30, 2024 at 3:15 PM Piotr Nowojski  wrote:

> Hi all!
>
> I would like to put under discussion:
>
> FLIP-444: Native file copy support
> https://cwiki.apache.org/confluence/x/rAn9EQ
>
> This proposal aims to speed up Flink recovery times, by speeding up state
> download times. However in the future, the same mechanism could be also
> used to speed up state uploading (checkpointing/savepointing).
>
> I'm curious to hear your thoughts.
>
> Best,
> Piotrek
>


Re: [VOTE] FLIP-447: Upgrade FRocksDB from 6.20.3 to 8.10.0

2024-05-06 Thread Roman Khachatryan
+1 (binding)

Regards,
Roman


On Mon, May 6, 2024 at 11:56 AM gongzhongqiang 
wrote:

> +1 (non-binding)
>
> Best,
> Zhongqiang Gong
>
> On Mon, May 6, 2024 at 10:54 AM yue ma  wrote:
>
> > Hi everyone,
> >
> > Thanks for all the feedback, I'd like to start a vote on the FLIP-447:
> > Upgrade FRocksDB from 6.20.3 to 8.10.0 [1]. The discussion thread is here
> > [2].
> >
> > The vote will be open for at least 72 hours unless there is an objection
> or
> > insufficient votes.
> >
> > [1]
> >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-447%3A+Upgrade+FRocksDB+from+6.20.3++to+8.10.0
> > [2] https://lists.apache.org/thread/lrxjfpjjwlq4sjzm1oolx58n1n8r48hw
> >
> > --
> > Best,
> > Yue
> >
>


Re: [VOTE] FLIP-444: Native file copy support

2024-06-26 Thread Roman Khachatryan
+1 (binding)

Thanks for pushing this and updating the FLIP

Regards,
Roman


On Wed, Jun 26, 2024 at 9:27 AM Piotr Nowojski  wrote:

> Thanks for pointing this out Zakelly. After the discussion on the dev
> mailing list, I have updated the `PathsCopyingFileSystem` to merge its
> functionalities with `DuplicatingFileSystem`, but I've just forgotten to
> mention that it will be removed/replaced with `PathsCopyingFileSystem`.
>
> Vote can be resumed.
>
> Best,
> Piotrek
>
> On Tue, Jun 25, 2024 at 6:57 PM Piotr Nowojski  wrote:
>
> > Ops, I must have forgotten to update the FLIP as we discussed. I will fix
> > it tomorrow and the vote period will be extended.
> >
> > Best,
> > Piotrek
> >
> > On Tue, Jun 25, 2024 at 1:56 PM Zakelly Lan  wrote:
> >
> >> Hi Piotrek,
> >>
> >> I don't see any statement about removing or renaming the
> >> `DuplicatingFileSystem` in the FLIP, shall we do that as mentioned in
> the
> >> discussion thread?
> >>
> >>
> >> Best,
> >> Zakelly
> >>
> >> On Tue, Jun 25, 2024 at 4:58 PM Piotr Nowojski 
> >> wrote:
> >>
> >> > Hi all,
> >> >
> >> > I would like to start a vote for the FLIP-444 [1]. The discussion
> >> thread is
> >> > here [2].
> >> >
> >> > The vote will be open for at least 72 hours.
> >> >
> >> > Best,
> >> > Piotrek
> >> >
> >> > [1] https://cwiki.apache.org/confluence/x/rAn9EQ
> >> > [2] https://lists.apache.org/thread/lkwmyjt2bnmvgx4qpp82rldwmtd4516c
> >> >
> >>
> >
>


Re: [VOTE] FLIP-471: Fixing watermark idleness timeout accounting

2024-08-02 Thread Roman Khachatryan
+1 (binding)

Regards,
Roman


On Wed, Jul 31, 2024 at 2:39 PM Rui Fan <1996fan...@gmail.com> wrote:

> +1(binding)
>
> Best,
> Rui
>
> On Wed, Jul 31, 2024 at 5:47 PM Timo Walther wrote:
>
> > +1 (binding)
> >
> > Thanks for fixing this critical bug.
> >
> > Regards,
> > Timo
> >
> > On 31.07.24 09:51, Stefan Richter wrote:
> > >
> > > +1 (binding)
> > >
> > > Best,
> > > Stefan
> > >
> > >
> > >
> > >> On 31. Jul 2024, at 04:56, Zakelly Lan  wrote:
> > >>
> > >> +1 (binding)
> > >>
> > >>
> > >> Best,
> > >> Zakelly
> > >>
> > >> On Wed, Jul 31, 2024 at 12:07 AM Piotr Nowojski  >
> > >> wrote:
> > >>
> > >>> Hi all!
> > >>>
> > >>> I would like to open the vote for FLIP-471 [1]. It has been discussed
> > here
> > >>> [2].
> > >>>
> > >>> The vote will remain open for at least 72 hours.
> > >>>
> > >>> Best,
> > >>> Piotrek
> > >>>
> > >>> [1]
> >
> https://cwiki.apache.org/confluence/x/oQvOEg
> > >>> [2]
> >
> https://lists.apache.org/thread/byj1l2236rfx3mcl3v4374rcbkq4rf85
> > >>>
> > >
> > >
> >
> >
>


[DISCUSS] JDBC exactly-once sink

2020-01-06 Thread Roman Khachatryan
Hi everyone,

I'm currently working on exactly-once JDBC sink implementation for Flink.
Any ideas and/or feedback are welcome.

I've considered the following options:
1. Two-phase commit. This is similar to Kafka sink.
XA or database-specific API can be used. In case of XA, each sink subtask
acts as a transaction manager, and each checkpoint-subtask pair corresponds
to an XA transaction (with a single branch)
2. Write-ahead log. This is similar to Cassandra sink.
Transactions metadata needs to be stored in the database along with data to
avoid adding duplicates after recovery.

For some scenarios, WAL might be better, but in general, XA seems to be a
better option.
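
A minimal model of option 1's lifecycle: one transaction per checkpoint-subtask pair, prepared at snapshot time and committed only once the checkpoint is confirmed. This is a toy state machine with no real database or XA driver; all names are illustrative, not an actual Flink API:

```java
import java.util.HashMap;
import java.util.Map;

// Toy two-phase-commit sink: writes go into the transaction of the current
// checkpoint; snapshotState() prepares it (analogous to XAResource.prepare),
// and notifyCheckpointComplete() commits it.
public class TwoPhaseSinkSketch {
    enum State { ACTIVE, PREPARED, COMMITTED }

    final Map<Long, State> txPerCheckpoint = new HashMap<>();
    long currentCheckpoint = 1;

    void write(String record) {
        txPerCheckpoint.putIfAbsent(currentCheckpoint, State.ACTIVE);
        // a real sink would INSERT within the open transaction branch here
    }

    void snapshotState(long checkpointId) {
        txPerCheckpoint.put(checkpointId, State.PREPARED); // first phase
        currentCheckpoint = checkpointId + 1;              // open a new branch
    }

    void notifyCheckpointComplete(long checkpointId) {
        txPerCheckpoint.replace(checkpointId, State.PREPARED, State.COMMITTED);
    }

    public static void main(String[] args) {
        TwoPhaseSinkSketch sink = new TwoPhaseSinkSketch();
        sink.write("a");
        sink.snapshotState(1);
        sink.write("b");                  // lands in the checkpoint-2 transaction
        sink.notifyCheckpointComplete(1); // checkpoint 1 -> COMMITTED, 2 -> ACTIVE
    }
}
```

With real XA, the two phases would map onto XAResource.prepare(xid) and XAResource.commit(xid, false) on the driver's XAResource.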

==
XA vs WAL comparison
==

1. Consistency: XA preferable
WAL: longer inconsistency windows when writing from several sink subtasks

2. Performance and efficiency: XA preferable (depends on the use case)
XA:
- long-running transactions may delay GC and may hold locks (depends on the
use case)
- databases/drivers may have XA implementation issues
WAL:
- double (de)serialization and IO (first to flink state, then to database)
- read-from-state and write-to-database spikes on checkpoint completion
both may have read spikes in consumer

3. Database support: XA preferable
XA: most popular RDBMS do support it (at least mysql, pgsql, mssql, oracle,
db2, sybase)
WAL: meta table DDL may differ

4. Operability: depends on the use case
XA:
- increased undo segment (db may need to maintain a view from the
transaction start)
- abandoned transactions cleanup (abandoned tx may cause starvation if for
example database blocks inserts of duplicates in different transactions)
- (jars aren't an issue - most drivers ship XA implementation)
WAL:
- increased intermediate flink state
- need to maintain meta table

5. Simplicity: about the same
XA: more corner cases
WAL: state and meta table management
Both wrap writes into transactions

6. Testing - WAL preferable
XA requires MVCC and proper XA support (no jars needed for Derby)

--
Regards,
Roman


[DISCUSS] FLIP-94 Rework 2-phase commit abstractions

2020-02-03 Thread Roman Khachatryan
Hi everyone,

I'd like to kick off the discussion on the redesign of
TwoPhaseCommitSinkFunction [1].

The primary motivation is to provide a solution that suits the needs of
both Kafka Sink and JDBC exactly once sink. Other possible scenarios
include File Sink, WAL and batch jobs.

Current abstraction doesn't support all of the requirements of the JDBC
exactly-once sink (e.g retries); besides that, it needs some (minor)
changes in the API.

FLIP-94 proposes more fine-grained abstractions and the use of composition
instead of inheritance (please see the diagram). This enables customization
of various aspects independently and eventually support of more use cases.

Any feedback welcome.

[1]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-94%3A+Rework+2-phase+commit+abstractions

-- 
Regards,
Roman


Re: [VOTE] [FLIP-76] Unaligned checkpoints

2020-03-11 Thread Roman Khachatryan
+1 (non-binding)

Regarding Yu's suggestion about *Roadmap* or *Future Work* section, I think
it's a good idea.
Currently, some MVP limitations are mentioned at the end of the document,
so we can extract and expand it.
As for the recovery speed it's not a priority currently, but we could also
mention it in this section.


On Wed, Mar 11, 2020 at 4:11 PM Zhijiang 
wrote:

> +1 (binding).
>
> As for David's concern about smaller buffers after recovery, I previously
> had a draft design [1] to solve this issue.
> You can take a look and leave comments if still have concerns. :)
>
> [1]
> https://docs.google.com/document/d/16_MOQymzxrKvUHXh6QFr2AAXIKt_2vPUf8vzKy4H_tU/edit
>
> Best,
> Zhijiang
>
>
> --
> From:Piotr Nowojski 
> Send Time:2020 Mar. 11 (Wed.) 21:19
> To:dev 
> Subject:Re: [VOTE] [FLIP-76] Unaligned checkpoints
>
> +1 (binding).
>
> Piotrek
>
> > On 11 Mar 2020, at 09:19, David Anderson  wrote:
> >
> > +1 I like where this is headed.
> >
> > One question: during restore, it could happen that a new task manager is
> > configured with fewer or smaller buffers than was previously the case.
> How
> > will this be handled?
> >
> > David
> >
> >
> > On Wed, Mar 11, 2020 at 8:31 AM Arvid Heise  wrote:
> >
> >> Hi Thomas,
> >>
> >> it's like you said. The first version will not support rescaling and
> mostly
> >> addresses the concerns about making little to no progress because of
> >> frequent crashes.
> >>
> >> The main reason is that we cannot guarantee the ordering of non-keyed
> data
> >> (and even keyed data in some weird cases) when rescaling currently. We
> have
> >> a general concept to address that, which would also enable dynamic
> >> rescaling in the future, but that would make the changes even bigger
> and we
> >> would not have any version ready for 1.11.
> >>
> >> The current plan, of course, is to continue improving unaligned
> checkpoints
> >> immediately after release, such that we have the full feature set for
> 1.12.
> >> Potentially, unaligned checkpoints (with timeouts) would even become the
> >> default option.
> >>
> >> On Tue, Mar 10, 2020 at 11:14 PM Thomas Weise  wrote:
> >>
> >>> +1
> >>>
> >>> Thanks for putting this together, looking forward to the experimental
> >>> support in the next release.
> >>>
> >>> One clarification: since the MVP won't support rescaling, does it imply
> >>> that savepoints will always use aligned checkpointing? If so, this
> would
> >>> still block the user from taking a savepoint and resume with increased
> >>> parallelism to resolve a prolonged/permanent backpressure condition?
> >>>
> >>> Thanks,
> >>> Thomas
> >>>
> >>>
> >>> On Tue, Mar 10, 2020 at 6:33 AM Arvid Heise 
> wrote:
> >>>
>  Hi all,
> 
>  I would like to start the vote for FLIP-76 [1], which is discussed and
>  reached a consensus in the discussion thread [2].
> 
>  The vote will be open until March. 13th (72h), unless there is an
> >>> objection
>  or not enough votes.
> 
>  Thanks,
>  Arvid
> 
>  [1]
> 
> 
> >>>
> >>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-76%3A+Unaligned+Checkpoints
>  [2]
> 
> 
> >>>
> >>
> http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-FLIP-76-Unaligned-checkpoints-td33651.html
> 
> >>>
> >>
>
>

-- 
Regards,
Roman


Re: [VOTE] FLIP-158: Generalized incremental checkpoints

2021-02-10 Thread Roman Khachatryan
+1 from myself (binding).

As there were no vetoes in more than 72hrs and three binding +1 the FLIP is
now accepted.

Thanks!

Regards,
Roman


On Wed, Feb 3, 2021 at 8:35 AM Arvid Heise  wrote:

> FLIP looks good to me. +1
>
> On Wed, Feb 3, 2021 at 8:00 AM Piotr Nowojski 
> wrote:
>
> > I'm carrying over my +1 from the discussion thread.
> >
> > Piotrek
> >
> > On Wed, Feb 3, 2021 at 5:55 AM Yuan Mei  wrote:
> >
> > > As aforementioned in the discussion thread, +1 on this Flip vote.
> > >
> > > On Wed, Feb 3, 2021 at 6:36 AM Khachatryan Roman <
> > > khachatryan.ro...@gmail.com> wrote:
> > >
> > > > Hi everyone,
> > > >
> > > > I'd like to start a vote on FLIP-158 [1] which was discussed in [2].
> > > >
> > > > The vote will be open for at least 72 hours. Unless there are any
> > > > objections,
> > > > I'll close it by February 5, 2021 if we have received sufficient
> votes.
> > > >
> > > > [1]
> > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-158%3A+Generalized+incremental+checkpoints
> > > >
> > > > [2]
> > > >
> > > >
> > >
> >
> http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-FLIP-158-Generalized-incremental-checkpoints-td47902.html
> > > >
> > > >
> > > >
> > > > Regards,
> > > > Roman
> > > >
> > >
> >
>


Re: [ANNOUNCE] Welcome Roman Khachatryan a new Apache Flink Committer

2021-02-10 Thread Roman Khachatryan
Many thanks to all of you!

Regards,
Roman


On Wed, Feb 10, 2021 at 7:12 PM Matthias Pohl 
wrote:

> Congratulations, Roman! :-)
>
> On Wed, Feb 10, 2021 at 3:23 PM Kezhu Wang  wrote:
>
>> Congratulations!
>>
>> Best,
>> Kezhu Wang
>>
>>
>> On February 10, 2021 at 21:53:52, Dawid Wysakowicz (
>> dwysakow...@apache.org)
>> wrote:
>>
>> Congratulations Roman! Glad to have you on board!
>>
>> Best,
>>
>> Dawid
>>
>> On 10/02/2021 14:44, Igal Shilman wrote:
>> > Welcome Roman!
>> > Top-notch stuff! :)
>> >
>> > All the best,
>> > Igal.
>> >
>> > On Wed, Feb 10, 2021 at 2:15 PM Kostas Kloudas 
>> wrote:
>> >
>> >> Congrats Roman!
>> >>
>> >> Kostas
>> >>
>> >> On Wed, Feb 10, 2021 at 2:08 PM Arvid Heise  wrote:
>> >>> Congrats! Well deserved.
>> >>>
>> >>> On Wed, Feb 10, 2021 at 1:54 PM Yun Gao > >
>> >>> wrote:
>> >>>
>> >>>> Congratulations Roman!
>> >>>>
>> >>>> Best,
>> >>>> Yun
>> >>>>
>> >>>>
>> >>>> --Original Mail --
>> >>>> Sender:Till Rohrmann 
>> >>>> Send Date:Wed Feb 10 20:53:21 2021
>> >>>> Recipients:dev 
>> >>>> CC:Khachatryan Roman , Roman
>> Khachatryan
>> >> <
>> >>>> ro...@apache.org>
>> >>>> Subject:Re: [ANNOUNCE] Welcome Roman Khachatryan a new Apache Flink
>> >>>> Committer
>> >>>> Congratulations Roman :-)
>> >>>>
>> >>>> Cheers,
>> >>>> Till
>> >>>>
>> >>>> On Wed, Feb 10, 2021 at 1:01 PM Konstantin Knauf 
>> >>>> wrote:
>> >>>>
>> >>>>> Congratulations Roman!
>> >>>>>
>> >>>>> On Wed, Feb 10, 2021 at 11:29 AM Piotr Nowojski <
>> >> pnowoj...@apache.org>
>> >>>>> wrote:
>> >>>>>
>> >>>>>> Hi everyone,
>> >>>>>>
>> >>>>>> I'm very happy to announce that Roman Khachatryan has accepted the
>> >>>>>> invitation to
>> >>>>>> become a Flink committer.
>> >>>>>>
>> >>>>>> Roman has been recently active in the runtime parts of the Flink.
>> >> He is
>> >>>>> one
>> >>>>>> of the main developers behind FLIP-76 Unaligned Checkpoints,
>> >> FLIP-151
>> >>>>>> Incremental Heap/FS State Backend [3] and providing a faster
>> >>>>> checkpointing
>> >>>>>> mechanism in FLIP-158.
>> >>>>>>
>> >>>>>> Please join me in congratulating Roman for becoming a Flink
>> >> committer!
>> >>>>>> Best,
>> >>>>>> Piotrek
>> >>>>>>
>> >>>>>> [1]
>> >>>>>>
>> >>>>>>
>> >>
>>
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-76%3A+Unaligned+Checkpoints
>> >>>>>> [2]
>> >>>>>>
>> >>>>>>
>> >>
>>
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-151%3A+Incremental+snapshots+for+heap-based+state+backend
>> >>>>>> [3]
>> >>>>>>
>> >>>>>>
>> >>
>>
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-158%3A+Generalized+incremental+checkpoints
>> >>>>>
>> >>>>> --
>> >>>>>
>> >>>>> Konstantin Knauf
>> >>>>>
>> >>>>> https://twitter.com/snntrable
>> >>>>>
>> >>>>> https://github.com/knaufk
>> >>>>>
>
>


Re: [DISCUSS] FLIP-151: Incremental snapshots for heap-based state backend

2021-02-15 Thread Roman Khachatryan
Thanks for your reply Stephan.

Yes, there is overlap between FLIP-151 and FLIP-158 as both
address incremental state updates. However, I think that FLIP-151 on top
of FLIP-158 increases efficiency by:

1. "Squashing" the changes made to the same key. For example, if some
counter was changed 10 times, then FLIP-151 will send only the last value
(this allows sending AND storing less data compared to FLIP-158).

2. Keeping in memory only the changed keys and not the values
(this allows reducing memory AND latency (caused by serialization +
copying on every update) compared to FLIP-158).

(1) can probably be implemented in FLIP-158, but not (2).
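
A sketch of points (1) and (2) using a simple dirty-key set (illustrative only; this is not the FLIP-151 implementation, and all names are made up):

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Only the *keys* changed since the last snapshot are tracked (point 2), and
// at snapshot time the *current* value per key is read once, so 10 updates
// to the same counter ship a single value (point 1).
public class IncrementalHeapSketch {
    final Map<String, Long> state = new HashMap<>();
    final Set<String> dirtyKeys = new HashSet<>(); // keys only, no values

    void update(String key, long value) {
        state.put(key, value);
        dirtyKeys.add(key); // repeated updates to the same key are squashed
    }

    Map<String, Long> snapshotDiff() {
        Map<String, Long> diff = new HashMap<>();
        for (String key : dirtyKeys) {
            diff.put(key, state.get(key)); // last value wins
        }
        dirtyKeys.clear();
        return diff;
    }

    public static void main(String[] args) {
        IncrementalHeapSketch backend = new IncrementalHeapSketch();
        for (long i = 1; i <= 10; i++) backend.update("counter", i);
        System.out.println(backend.snapshotDiff()); // {counter=10}: one entry
    }
}
```

A log-based approach in the FLIP-158 style would instead append all 10 intermediate values to the changelog.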

I don't think there will be a lot of follow-up efforts, and I hope
@Dawid Wysakowicz, @pnowojski, Yuan Mei and probably @Yu Li
will be able to join at different stages.

Regarding using only the confirmed checkpoints, you are right: JM can
abort non-confirmed checkpoints and discard the state. FLIP-158 has
the same problem because StateChangelog produces StateHandles that
can be discarded by the JM. Currently, potentially discarded
changes are re-uploaded in both FLIPs.

In FLIP-158 (or follow-up), I planned to improve this part by:
1. Limiting max-concurrent-checkpoints to 1, and
2. Sending the last confirmed checkpoint ID in RPCs and barriers
So at the time of checkpoint, backend knows exactly which changes can be
included.

Handling of removed keys is not related to the aborted checkpoints. They are
needed on recovery to actually remove data from the previous snapshot.
In FLIP-158 it is again similar: ChangelogStateBackend has to encode
removal operations and send them to StateChangelog (though no additional
data structure is required).

Regards,
Roman


On Thu, Feb 11, 2021 at 4:28 PM Stephan Ewen  wrote:

> Thanks, Roman for publishing this design.
>
> There seems to be quite a bit of overlap with FLIP-158 (generalized
> incremental checkpoints).
>
> I would go with +1 to the effort if it is a pretty self-contained and
> closed effort. Meaning we don't expect that this needs a ton of follow-ups,
> other than common maintenance and small bug fixes. If we expect that this
> requires a lot of follow-ups, then we end up splitting our work between
> this FLIP and FLIP-158, which seems a bit inefficient.
> What other committers would be involved to ensure the community can
> maintain this?
>
>
> The design looks fine, in general, with one question:
>
> When persisting changes, you persist all changes that have a newer version
> than the latest one confirmed by the JM.
>
> Can you explain why it is like that exactly? Alternatively, you could keep
> the latest checkpoint ID for which the state backend persisted the diff
> successfully to the checkpoint storage, and created a state handle. For
> each checkpoint, the state backend includes the state handles of all
> involved chunks. That would be similar to the log-based approach in
> FLIP-158.
>
> I have a suspicion that this is because the JM may have released the state
> handle (and discarded the diff) for a checkpoint that succeeded on the task
> but didn't succeed globally. So we cannot reference any state handle that
> has been handed over to the JobManager, but is not yet confirmed.
>
> This characteristic seems to be at the heart of much of the complexity,
> also the handling of removed keys seems to be caused by that.
> If we could change that assumption, the design would become simpler.
>
> (Side note: I am wondering if this also impacts the FLIP-158 DSTL design.)
>
> Best,
> Stephan
>
>
> On Sun, Nov 15, 2020 at 8:51 AM Khachatryan Roman <
> khachatryan.ro...@gmail.com> wrote:
>
> > Hi Stefan,
> >
> > Thanks for your reply. Very interesting ideas!
> > If I understand correctly, SharedStateRegistry will still be responsible
> > for pruning the old state; for that, it will maintain some (ordered)
> > mapping between StateMaps and their versions, per key group.
> > I think one modification to this approach is needed to support
> journaling:
> > for each entry, maintain a version when it was last fully snapshotted;
> and
> > use this version to find the minimum as you described above.
> > I'm considering a better state cleanup and optimization of removals as
> the
> > next step. Anyway, I will add it to the FLIP document.
> >
> > Thanks!
> >
> > Regards,
> > Roman
> >
> >
> > On Tue, Nov 10, 2020 at 12:04 AM Stefan Richter <
> stefanrichte...@gmail.com
> > >
> > wrote:
> >
> > > Hi,
> > >
> > > Very happy to see that the incremental checkpoint idea is finally
> > becoming
> > > a reality for the heap backend! Overall the proposal looks pretty good
> to
> > > me. Just wanted to point out one possible improvement from what I can
> > still
> > > remember from my ideas back then: I think you can avoid doing periodic
> > full
> > > snapshots for consolidation. Instead, my suggestion would be to track
> the
> > > version numbers you encounter while you iterate a snapshot for writing
> > it -
> > > and then you should b

[VOTE] Release 1.12.2, release candidate #1

2021-02-16 Thread Roman Khachatryan
Hi everyone,
Please review and vote on the release candidate #1 for the version 1.12.2,
as follows:
[ ] +1, Approve the release
[ ] -1, Do not approve the release (please provide specific comments)


The complete staging area is available for your review, which includes:
* JIRA release notes [1],
* the official Apache source release and binary convenience releases to be
deployed to dist.apache.org [2], which are signed with the key with
fingerprint 0D545F264D2DFDEBFD4E038F97B4625E2FCF517C [3],
* all artifacts to be deployed to the Maven Central Repository [4],
* source code tag "release-1.12.2-rc1" [5],
* website pull request listing the new release and adding announcement blog
post [6].

The vote will be open for at least 72 hours. It is adopted by majority
approval, with at least 3 PMC affirmative votes.


Regards,
Roman

[1]
https://issues.apache.org/jira/secure/ReleaseNote.jspa?version=12349502&projectId=12315522
[2] https://dist.apache.org/repos/dist/dev/flink/flink-1.12.2-rc1/
[3] https://dist.apache.org/repos/dist/release/flink/KEYS
[4] https://repository.apache.org/content/repositories/orgapacheflink-1413/
[5] https://github.com/apache/flink/releases/tag/release-1.12.2-rc1
[6] https://github.com/apache/flink-web/pull/418


Re: [DISCUSS] FLIP-151: Incremental snapshots for heap-based state backend

2021-02-16 Thread Roman Khachatryan
That's an interesting idea.

I guess we can decouple the actual state cleanup delegation from the
correctness issues. I don't see any reason why it can't be implemented
without changing notifications (for FLIP-158, however, we'll probably have
to ask "random" TMs because FLIP-158 adds state sharing also across
operators. Thus finding the right TM for state removal can become
difficult).

I'm not sure about the correctness issue though (FLINK-21351). What happens
if, after upscaling, the JM asks one TM to discard the state, but the second
TM receives the notification with a delay? IIUC, it can refer to a discarded
state.
I think we can prevent this by sending the earliest retained checkpoint ID
in trigger RPC/barriers (instead of notifications).

However, with multiple concurrent checkpoints, it still seems not enough,
because some other checkpoint can be completed and cause in-use checkpoint
to be subsumed. Delegation to TM doesn't solve the problem because of
rescaling (and state sharing across operators).
I think we can solve it either by limiting max-concurrent-checkpoint to 1;
or by "locking" all the checkpoints that can still be in use (i.e.
num-retained-checkpoints from the earliest pending checkpoint). For
example, with num-retained-checkpoints=3:
| completed| pending  |
| cp0 | cp1 | cp2 | cp3 | cp4 |
If cp3 and cp4 both specify cp0 as the earliest retained checkpoint; then
cp0 can not be subsumed until cp4 is completed - even if cp3 is.
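The "locking" rule above can be sketched as follows (illustrative Python; hypothetical names). A completed checkpoint may be subsumed only once no pending checkpoint still references it as its earliest retained checkpoint:

```python
def subsumable(completed_id, pending_earliest_refs):
    """pending_earliest_refs: the earliest-retained-checkpoint ID that each
    still-pending checkpoint received in its trigger RPC/barriers."""
    return all(completed_id < ref for ref in pending_earliest_refs)
```

With the num-retained-checkpoints=3 example above: while cp3 and cp4 are pending and both reference cp0, cp0 stays locked; it becomes subsumable only after cp4 completes.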

Aborted checkpoints are different in that they are removed from the "tail".
The TM can't infer from the earliest checkpoint ID whether some later changes
could be removed. The solution would be to also add
last-completed-checkpoint to notifications/trigger-RPC/barriers.

To limit FLIP-158 scope I'd implement the last change and
limit max-concurrent-checkpoint to 1.

Regards,
Roman


On Tue, Feb 16, 2021 at 10:00 AM Stephan Ewen  wrote:

> Thanks for clarifying.
>
> Concerning the JM aborted checkpoints and state handles: I was thinking
> about it the other day as well and was considering an approach like that:
>
> The core idea is to move the cleanup from JM to TM. That solves two issues:
>
> (1) The StateBackends / DSTL delete the artifacts themselves, meaning we
> don't have to make assumptions about the state on the JM. That sounds too
> fragile, with easy bugs as soon as some slight assumptions change (see also
> bug with incr. checkpoint / savepoint data loss,
> https://issues.apache.org/jira/browse/FLINK-21351)
>
> (2) We do not need to clean up from one node. In the past, doing the
> cleanup from one node (JM) has sometimes become a bottleneck.
>
> To achieve that, we would need to extend the "notifyCheckpointComplete()"
> RPC from the JM to the TM includes both the ID of the completed checkpoint,
> and the ID of the earliest retained checkpoint. Then the TM can clean up
> all artifacts from earlier checkpoints.
>
> There are two open questions to that design:
> (1) On restore, we need to communicate the state handles of the previous
> checkpoints to the TM as well, so the TM gets again the full picture of all
> state artifacts.
> (2) On rescaling, we need to clarify which TM is responsible for releasing
> a handle, if they are mapped to multiple TMs. Otherwise we get
> double-delete calls. That isn't per se a problem, it is just a bit less
> efficient.
>
>
> Maybe we could think in that direction for the DSTL work?
>
>
>
> On Mon, Feb 15, 2021 at 8:44 PM Roman Khachatryan 
> wrote:
>
>> Thanks for your reply Stephan.
>>
>> Yes, there is overlap between FLIP-151 and FLIP-158 as both
>> address incremental state updates. However, I think that FLIP-151 on top
>> of FLIP-158 increases efficiency by:
>>
>> 1. "Squashing" the changes made to the same key. For example, if some
>> counter was changed 10 times then FLIP-151 will send only the last value
>> (this allows to send AND store less data compared to FLIP-158)
>>
>> 2. Keeping in memory only the changed keys and not the values.
>> (this allows to reduce memory AND latency (caused by serialization +
>> copying on every update) compared to FLIP-158)
>>
>> (1) can probably be implemented in FLIP-158, but not (2).
>>
>> I don't think there will be a lot of follow-up efforts and I hope
>> @Dawid Wysakowicz , @pnowojski
>>  , Yuan Mei and probably
>> @Yu Li   will be able to join at different stages.
>>
>> Regarding using only the confirmed checkpoints, you are right: JM can
>> abort non-confirmed checkpoints and discard the state. FLIP-158 has
>> the same problem because StateChangelog produces StateHandles that
>>  can be discarded by the JM. C

Re: New Jdbc XA sink - state serialization error.

2021-02-22 Thread Roman Khachatryan
Hi,

Yes, please go ahead. Thanks!

Regards,
Roman


On Mon, Feb 22, 2021 at 12:18 PM Maciej Obuchowski <
obuchowski.mac...@gmail.com> wrote:

> Hey, while working with the new 1.13 JDBC XA sink I had state restoration
> errors connected to XaSinkStateSerializer with its implementation of
> SNAPSHOT using an anonymous inner class, which is not restorable due to not
> being public.
> When changed to implementation similar to
> CheckpointAndXidSimpleTypeSerializerSnapshot state restores without
> problem.
>
> Should I provide a Jira ticket and a patch for this error? Sorry if that's
> obvious.
>
> Thanks,
> Maciej
>


Re: [ANNOUNCE] New Apache Flink Committers - Wei Zhong and Xingbo Huang

2021-02-22 Thread Roman Khachatryan
Congratulations!

Regards,
Roman


On Mon, Feb 22, 2021 at 12:22 PM Yangze Guo  wrote:

> Congrats, Wei and Xingbo! Well deserved!
>
> Best,
> Yangze Guo
>
> On Mon, Feb 22, 2021 at 6:47 PM Yang Wang  wrote:
> >
> > Congratulations Wei & Xingbo!
> >
> > Best,
> > Yang
> >
> > > On Mon, Feb 22, 2021 at 6:23 PM, Rui Li wrote:
> >
> > > Congrats Wei & Xingbo!
> > >
> > > On Mon, Feb 22, 2021 at 4:24 PM Yuan Mei 
> wrote:
> > >
> > > > Congratulations Wei & Xingbo!
> > > >
> > > > Best,
> > > > Yuan
> > > >
> > > > On Mon, Feb 22, 2021 at 4:04 PM Yu Li  wrote:
> > > >
> > > > > Congratulations Wei and Xingbo!
> > > > >
> > > > > Best Regards,
> > > > > Yu
> > > > >
> > > > >
> > > > > On Mon, 22 Feb 2021 at 15:56, Till Rohrmann 
> > > > wrote:
> > > > >
> > > > > > Congratulations Wei & Xingbo. Great to have you as committers in
> the
> > > > > > community now.
> > > > > >
> > > > > > Cheers,
> > > > > > Till
> > > > > >
> > > > > > On Mon, Feb 22, 2021 at 5:08 AM Xintong Song <
> tonysong...@gmail.com>
> > > > > > wrote:
> > > > > >
> > > > > > > Congratulations, Wei & Xingbo~! Welcome aboard.
> > > > > > >
> > > > > > > Thank you~
> > > > > > >
> > > > > > > Xintong Song
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > On Mon, Feb 22, 2021 at 11:48 AM Dian Fu 
> > > wrote:
> > > > > > >
> > > > > > > > Hi all,
> > > > > > > >
> > > > > > > > On behalf of the PMC, I’m very happy to announce that Wei
> Zhong
> > > and
> > > > > > > Xingbo
> > > > > > > > Huang have accepted the invitation to become Flink
> committers.
> > > > > > > >
> > > > > > > > - Wei Zhong mainly works on PyFlink and has driven several
> > > > important
> > > > > > > > features in PyFlink, e.g. Python UDF dependency management
> > > > (FLIP-78),
> > > > > > > > Python UDF support in SQL (FLIP-106, FLIP-114), Python UDAF
> > > support
> > > > > > > > (FLIP-139), etc. He has contributed the first PR of PyFlink
> and
> > > > have
> > > > > > > > contributed 100+ commits since then.
> > > > > > > >
> > > > > > > > - Xingbo Huang's contribution is also mainly in PyFlink and
> has
> > > > > driven
> > > > > > > > several important features in PyFlink, e.g. performance
> > > optimizing
> > > > > for
> > > > > > > > Python UDF and Python UDAF (FLIP-121, FLINK-16747,
> FLINK-19236),
> > > > > Pandas
> > > > > > > > UDAF support (FLIP-137), Python UDTF support (FLINK-14500),
> > > > row-based
> > > > > > > > Operations support in Python Table API (FLINK-20479), etc.
> He is
> > > > also
> > > > > > > > actively helping on answering questions in the user mailing
> list,
> > > > > > helping
> > > > > > > > on the release check, monitoring the status of the azure
> > > pipeline,
> > > > > etc.
> > > > > > > >
> > > > > > > > Please join me in congratulating Wei Zhong and Xingbo Huang
> for
> > > > > > becoming
> > > > > > > > Flink committers!
> > > > > > > >
> > > > > > > > Regards,
> > > > > > > > Dian
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> > >
> > > --
> > > Best regards!
> > > Rui Li
> > >
>


Re: [VOTE] Release 1.12.2, release candidate #1

2021-02-23 Thread Roman Khachatryan
Thanks Piotr,

The RC-1 is therefore cancelled.
I'll remove the artifacts and create a new RC once the issue is resolved.

Regards,
Roman


On Tue, Feb 23, 2021 at 11:57 AM Piotr Nowojski 
wrote:

> Ops, sorry I've linked the wrong issue in the previous email. This is the
> correct one:
> https://issues.apache.org/jira/browse/FLINK-21453
>
> Piotrek
>
> On Tue, Feb 23, 2021 at 11:56, Piotr Nowojski
> wrote:
>
>> -1
>>
>> I need to unfortunately change my vote to -1 because of the bug that was
>> introduced in 1.12.2:
>> https://issues.apache.org/jira/browse/FLINK-19462
>>
>> Luckily this seems to be an easy fix.
>>
>> Piotrek
>>
>> On Sat, Feb 20, 2021 at 15:42, Yang Wang wrote:
>>
>>> +1 (non-binding)
>>>
>>> * build from source
>>> * Verified some Kubernetes related fixes and improvements
>>>   * FLINK-20944 - Launching application mode with ClusterIP
>>>   * FLINK-20417 - Handle "Too old resource version" exception in
>>> Kubernetes
>>> watch more gracefully
>>>   * FLINK-20359 - Support adding Owner Reference to Job Manager in native
>>> kubernetes setup
>>> * Start a standalone job cluster, check the webUI and logs
>>>
>>>
>>> Best,
>>> Yang
>>>
>>> On Sat, Feb 20, 2021 at 3:23 PM, Yu Li wrote:
>>>
>>> > +1 (binding)
>>> >
>>> > - Checked the diff between 1.12.1 and 1.12.2: OK (
>>> >
>>> https://github.com/apache/flink/compare/release-1.12.1...release-1.12.2-rc1
>>> > )
>>> >   - jackson version has been bumped to 2.10.5.1 through FLINK-21020
>>> and all
>>> > NOTICE files updated correctly
>>> >   - beanutils version has been bumped to 1.9.4 through FLINK-21123 and
>>> all
>>> > NOTICE files updated correctly
>>> >   - testcontainer version has been bumped to 1.15.1 through
>>> FLINK-21277 and
>>> > no NOTICE files impact
>>> >   - japicmp version has been bumped to 1.12.1 and no NOTICE files
>>> impact
>>> > - Checked release notes: OK
>>> > - Checked sums and signatures: OK
>>> > - Maven clean install from source: OK
>>> > - Checked the jars in the staging repo: OK
>>> > - Checked binary archive: OK
>>> > - Approved the website updates (left some notes in PR)
>>> >
>>> > Thanks a lot for putting the RC together!
>>> >
>>> > Best Regards,
>>> > Yu
>>> >
>>> >
>>> > On Sat, 20 Feb 2021 at 12:59, Xintong Song 
>>> wrote:
>>> >
>>> > > +1 (non-binding)
>>> > >
>>> > > - verified checksums & signatures
>>> > > - reviewed website PR
>>> > > - build from source
>>> > > - run example jobs
>>> > >   - standalone session & yarn per-job
>>> > >   - jobs work as expected
>>> > >   - ui & logs look good
>>> > >
>>> > > Thank you~
>>> > >
>>> > > Xintong Song
>>> > >
>>> > >
>>> > >
>>> > > On Fri, Feb 19, 2021 at 9:59 PM Piotr Nowojski
>>> > > wrote:
>>> > >
>>> > > > Hi,
>>> > > >
>>> > > > Thanks for preparing this release candidate.
>>> > > >
>>> > > > +1 from my side
>>> > > >
>>> > > > 1. I was monitoring recent builds for Unaligned Checkpoint issues
>>> and I
>>> > > can
>>> > > > confirm that it seems we don't have any failures in the last
>>> couple of
>>> > > > weeks.
>>> > > > 2. There are no tickets with fix version set to 1.12.2. There are
>>> two
>>> > > > blocker bugs for 1.12.3, but they are still quite far from being
>>> merged
>>> > > and
>>> > > > they shouldn't block the 1.12.2 release, as they have been present
>>> in
>>> > the
>>> > > > past release already.
>>> > > > 3. I double confirmed that there are no dependency changes between
>>> > 1.12.1
>>> > > > and 1.12.2-RC1, apart of the upgrade of `commons-beanutils` from
>>> 1.9.3
>>> > to
>>> > > > 1.9.4, which had and maintained Apache 2.0 license.
>>> > > >
>>

Re: [VOTE] Release 1.12.2, release candidate #1

2021-02-23 Thread Roman Khachatryan
The artifacts have been removed.
Thanks everybody for taking part in the verification.

Regards,
Roman


On Tue, Feb 23, 2021 at 12:21 PM Roman Khachatryan  wrote:

> Thanks Piotr,
>
> The RC-1 is therefore cancelled.
> I'll remove the artifacts and create a new RC once the issue is resolved.
>
> Regards,
> Roman
>
>
> On Tue, Feb 23, 2021 at 11:57 AM Piotr Nowojski 
> wrote:
>
>> Ops, sorry I've linked the wrong issue in the previous email. This is the
>> correct one:
>> https://issues.apache.org/jira/browse/FLINK-21453
>>
>> Piotrek
>>
>> On Tue, Feb 23, 2021 at 11:56, Piotr Nowojski
>> wrote:
>>
>>> -1
>>>
>>> I need to unfortunately change my vote to -1 because of the bug that was
>>> introduced in 1.12.2:
>>> https://issues.apache.org/jira/browse/FLINK-19462
>>>
>>> Luckily this seems to be an easy fix.
>>>
>>> Piotrek
>>>
>>> On Sat, Feb 20, 2021 at 15:42, Yang Wang wrote:
>>>
>>>> +1 (non-binding)
>>>>
>>>> * build from source
>>>> * Verified some Kubernetes related fixes and improvements
>>>>   * FLINK-20944 - Launching application mode with ClusterIP
>>>>   * FLINK-20417 - Handle "Too old resource version" exception in
>>>> Kubernetes
>>>> watch more gracefully
>>>>   * FLINK-20359 - Support adding Owner Reference to Job Manager in
>>>> native
>>>> kubernetes setup
>>>> * Start a standalone job cluster, check the webUI and logs
>>>>
>>>>
>>>> Best,
>>>> Yang
>>>>
>>>> On Sat, Feb 20, 2021 at 3:23 PM, Yu Li wrote:
>>>>
>>>> > +1 (binding)
>>>> >
>>>> > - Checked the diff between 1.12.1 and 1.12.2: OK (
>>>> >
>>>> https://github.com/apache/flink/compare/release-1.12.1...release-1.12.2-rc1
>>>> > )
>>>> >   - jackson version has been bumped to 2.10.5.1 through FLINK-21020
>>>> and all
>>>> > NOTICE files updated correctly
>>>> >   - beanutils version has been bumped to 1.9.4 through FLINK-21123
>>>> and all
>>>> > NOTICE files updated correctly
>>>> >   - testcontainer version has been bumped to 1.15.1 through
>>>> FLINK-21277 and
>>>> > no NOTICE files impact
>>>> >   - japicmp version has been bumped to 1.12.1 and no NOTICE files
>>>> impact
>>>> > - Checked release notes: OK
>>>> > - Checked sums and signatures: OK
>>>> > - Maven clean install from source: OK
>>>> > - Checked the jars in the staging repo: OK
>>>> > - Checked binary archive: OK
>>>> > - Approved the website updates (left some notes in PR)
>>>> >
>>>> > Thanks a lot for putting the RC together!
>>>> >
>>>> > Best Regards,
>>>> > Yu
>>>> >
>>>> >
>>>> > On Sat, 20 Feb 2021 at 12:59, Xintong Song 
>>>> wrote:
>>>> >
>>>> > > +1 (non-binding)
>>>> > >
>>>> > > - verified checksums & signatures
>>>> > > - reviewed website PR
>>>> > > - build from source
>>>> > > - run example jobs
>>>> > >   - standalone session & yarn per-job
>>>> > >   - jobs work as expected
>>>> > >   - ui & logs look good
>>>> > >
>>>> > > Thank you~
>>>> > >
>>>> > > Xintong Song
>>>> > >
>>>> > >
>>>> > >
>>>> > > On Fri, Feb 19, 2021 at 9:59 PM Piotr Nowojski <
>>>> pnowoj...@apache.org>
>>>> > > wrote:
>>>> > >
>>>> > > > Hi,
>>>> > > >
>>>> > > > Thanks for preparing this release candidate.
>>>> > > >
>>>> > > > +1 from my side
>>>> > > >
>>>> > > > 1. I was monitoring recent builds for Unaligned Checkpoint issues
>>>> and I
>>>> > > can
>>>> > > > confirm that it seems we don't have any failures in the last
>>>> couple of
>>>> > > > weeks.
>>>> > > > 2. There are no tickets with fix version set to 1.12.2. There are
>>>> two
>>>> > &

Re: [DISCUSS] Releasing Apache Flink 1.12.2

2021-02-25 Thread Roman Khachatryan
Hi everyone,

The issue which caused the previous RC cancellation is now resolved [1].

There are two more issues that we'd like to include in 1.12.2 [2][3].
Yuan and I are going to prepare the next RC (#2) once they are resolved.

Please let us know if there are any other issues blocking the RC.
Currently, the dashboard doesn't show any blockers [4].


[1] https://issues.apache.org/jira/browse/FLINK-21453
[2] https://issues.apache.org/jira/browse/FLINK-21490
[3] https://issues.apache.org/jira/browse/FLINK-21452
[4] https://s.apache.org/flink-1.12.2-blockers

Regards,
Roman


On Mon, Feb 15, 2021 at 5:10 PM Yuan Mei  wrote:

> Hey devs,
>
> I'm very glad to announce that all known blocking issues for release-1.12.2
> have been resolved.
>
> release-1.12.2 will freeze at:
> commit 1f8be1fd7b2b37a124e4d2b8080d08e259bdf095
>
> Roman and I are about to prepare the first release candidate and many
> thanks to Roman for agreeing to do the release together with me!
>
> We will start a separate voting thread as soon as RC1 is created.
>
> Best,
>
> Yuan
>
> On Thu, Feb 11, 2021 at 2:11 AM Matthias Pohl 
> wrote:
>
> > I created FLINK-21358 [1] (and a PR [2] for it) which we might want to
> > include in 1.12.2 as well.
> >
> > Best,
> > Matthias
> >
> > [1] https://issues.apache.org/jira/browse/FLINK-21358
> > [2] https://github.com/apache/flink/pull/14920
> >
> > On Tue, Feb 9, 2021 at 1:19 PM Yuan Mei  wrote:
> >
> >> Thanks very much for your replies!
> >>
> >> I've been reaching out to the owners of the issues/blockers mentioned
> >> above. By now, we have
> >>
> >> Blocker Issues:
> >>
> >>-
> >>
> >>https://issues.apache.org/jira/browse/FLINK-21013: fixed
> >>-
> >>
> >>https://issues.apache.org/jira/browse/FLINK-21030: estimate to
> finish
> >>by the end of this week, but should not be a release blocker
> >>
> >> Best to Include:
> >>
> >>-
> >>
> >>https://issues.apache.org/jira/browse/FLINK-20417: fixed
> >>-
> >>
> >>https://issues.apache.org/jira/browse/FLINK-20663: estimate to
> finish
> >>by the end of this week, but should not be a release blocker
> >>-
> >>
> >>https://github.com/apache/flink/pull/14848: under review
> >>
> >>
> >> Hence, let's target *a feature freeze on Feb. 15, next Monday*!
> >>
> >> If there is any concern on anything, please feel free to contact me.
> >>
> >> Best
> >> Yuan
> >>
> >>
> >> On Tue, Feb 9, 2021 at 4:38 AM Matthias Pohl 
> >> wrote:
> >>
> >>> Fabian and I were investigating strange behavior with
> stop-with-savepoint
> >>> not terminating when using the new fromSource to add a source to a job
> >>> definition. I created FLINK-21323 [1] to cover the issue. This might
> not
> >>> be
> >>> a blocker for 1.12.2 since this bug would have been already around
> since
> >>> 1.11, if I'm not mistaken? I wanted to bring this to your attention,
> >>> anyway. It would be good if someone more familiar with this part of the
> >>> source code could verify our findings and confirm the severity of the
> >>> issue.
> >>>
> >>> Best,
> >>> Matthias
> >>>
> >>> [1] https://issues.apache.org/jira/browse/FLINK-21323
> >>>
> >>> On Mon, Feb 8, 2021 at 7:25 PM Thomas Weise  wrote:
> >>>
> >>> > +1 for the 1.12.2 release
> >>> >
> >>> > On Mon, Feb 8, 2021 at 3:20 AM Matthias Pohl  >
> >>> > wrote:
> >>> >
> >>> > > Thanks Yuan for driving 1.12.2.
> >>> > > +1 for releasing 1.12.2
> >>> > >
> >>> > > One comment about FLINK-21030 [1]: I hope to fix it this week. But
> >>> there
> >>> > > are still some uncertainties. The underlying problem is older than
> >>> 1.12.
> >>> > > Hence, the suggestion is to not block the 1.12.2 release because of
> >>> > > FLINK-21030 [1]. I will leave it as a blocker issue, though, to
> >>> underline
> >>> > > that it should be fixed for 1.13.
> >>> > >
> >>> > > Best,
> >>> > > Matthias
> >>> > >
> >>> > > [1] https://issues.apache.org/jira/browse/FLINK-21030
> >>> > >
> >>> > > On Sun, Feb 7, 2021 at 4:10 AM Xintong Song  >
> >>> > wrote:
> >>> > >
> >>> > > > Thanks Yuan,
> >>> > > >
> >>> > > > +1 for releasing 1.12.2 and Yuan as the release manager.
> >>> > > >
> >>> > > > Thank you~
> >>> > > >
> >>> > > > Xintong Song
> >>> > > >
> >>> > > >
> >>> > > >
> >>> > > > On Sat, Feb 6, 2021 at 3:41 PM Yu Li  wrote:
> >>> > > >
> >>> > > > > +1 for releasing 1.12.2, and thanks for volunteering to be our
> >>> > release
> >>> > > > > manager Yuan.
> >>> > > > >
> >>> > > > > Besides the mentioned issues, I could see two more blockers
> with
> >>> > 1.12.2
> >>> > > > as
> >>> > > > > fix version [1] and need some tracking:
> >>> > > > > * FLINK-21013 <https://issues.apache.org/jira/browse/FLINK-21013>
> >>> > > > > Blink planner does not ingest timestamp into StreamRecord
> >>> > > > > * FLINK-21030 <https://issues.apache.org/jira/browse/FLINK-21030>
> >>> > > > > Broken job restart for job with disjoint graph
> >>> > > > >
> >>> > > > > Best Regards,
> >>> > > > >

Re: [DISCUSS] Releasing Apache Flink 1.12.2

2021-02-26 Thread Roman Khachatryan
Thanks for the update and for reporting the issue Matthias.

I see that FLINK-21030 is now resolved [1]
I can confirm that FLINK-21515 is indeed a test stability issue not affecting
production code (I've published a patch to fix it).

Therefore, I will proceed to building the next RC
from e9af362f0caa16e75e66aa1403c62666e77f98f0.

[1] https://issues.apache.org/jira/browse/FLINK-21030

Regards,
Roman


On Fri, Feb 26, 2021 at 1:43 PM Matthias Pohl 
wrote:

> There's a test instability which I assume is caused by FLINK-21028 [1]. I
> created FLINK-21515 [2] to cover this issue.
> Could somebody verify that? I validated already that this issue appears on
> master as well without the changes added in FLINK-21030 [3].
>
> Best,
> Matthias
>
> [1] https://issues.apache.org/jira/browse/FLINK-21028
> [2] https://issues.apache.org/jira/browse/FLINK-21515
> [3] https://issues.apache.org/jira/browse/FLINK-21030
>
> On Fri, Feb 26, 2021 at 12:25 PM Matthias Pohl 
> wrote:
>
>> FYI: I created the PRs for merging FLINK-21030 [1] into release-1.12 [2]
>> and release-1.11 [3].
>>
>> Best,
>> Matthias
>>
>> [1] https://issues.apache.org/jira/browse/FLINK-21030
>> [2] https://github.com/apache/flink/pull/15034
>> [3] https://github.com/apache/flink/pull/15035
>>
>> On Thu, Feb 25, 2021 at 5:18 PM Till Rohrmann 
>> wrote:
>>
>>> If time allows then we could also include FLINK-21030 [1]. We are about
>>> to
>>> merge this fix.
>>>
>>> [1] https://issues.apache.org/jira/browse/FLINK-21030
>>>
>>> Cheers,
>>> Till
>>>
>>> On Thu, Feb 25, 2021 at 4:13 PM Roman Khachatryan 
>>> wrote:
>>>
>>> > Hi everyone,
>>> >
>>> > The issue which caused the previous RC cancellation is now resolved
>>> [1].
>>> >
>>> > There are two more issues that we'd like to include in 1.12.2 [2][3].
>>> > Yuan and I are going to prepare the next RC (#2) once they are
>>> resolved.
>>> >
>>> > Please let us know if there are any other issues blocking the RC.
>>> > Currently, the dashboard doesn't show any blockers [4].
>>> >
>>> >
>>> > [1] https://issues.apache.org/jira/browse/FLINK-21453
>>> > [2] https://issues.apache.org/jira/browse/FLINK-21490
>>> > [3] https://issues.apache.org/jira/browse/FLINK-21452
>>> > [4] https://s.apache.org/flink-1.12.2-blockers
>>> >
>>> > Regards,
>>> > Roman
>>> >
>>> >
>>> > On Mon, Feb 15, 2021 at 5:10 PM Yuan Mei 
>>> wrote:
>>> >
>>> > > Hey devs,
>>> > >
>>> > > I'm very glad to announce that all known blocking issues for
>>> > release-1.12.2
>>> > > have been resolved.
>>> > >
>>> > > release-1.12.2 will freeze at:
>>> > > commit 1f8be1fd7b2b37a124e4d2b8080d08e259bdf095
>>> > >
>>> > > Roman and I are about to prepare the first release candidate and many
>>> > > thanks to Roman for agreeing to do the release together with me!
>>> > >
>>> > > We will start a separate voting thread as soon as RC1 is created.
>>> > >
>>> > > Best,
>>> > >
>>> > > Yuan
>>> > >
>>> > > On Thu, Feb 11, 2021 at 2:11 AM Matthias Pohl <
>>> matth...@ververica.com>
>>> > > wrote:
>>> > >
>>> > > > I created FLINK-21358 [1] (and a PR [2] for it) which we might
>>> want to
>>> > > > include in 1.12.2 as well.
>>> > > >
>>> > > > Best,
>>> > > > Matthias
>>> > > >
>>> > > > [1] https://issues.apache.org/jira/browse/FLINK-21358
>>> > > > [2] https://github.com/apache/flink/pull/14920
>>> > > >
>>> > > > On Tue, Feb 9, 2021 at 1:19 PM Yuan Mei 
>>> > wrote:
>>> > > >
>>> > > >> Thanks very much for your replies!
>>> > > >>
>>> > > >> I've been reaching out to the owners of the issues/blockers
>>> mentioned
>>> > > >> above. By now, we have
>>> > > >>
>>> > > >> Blocker Issues:
>>> > > >>
>>> > > >>-
>>> > > >>
>>> > > >>https://issues.apache.org/jir

[VOTE] Release 1.12.2, release candidate #2

2021-02-26 Thread Roman Khachatryan
Hi everyone,

Please review and vote on the release candidate #2 for the version 1.12.2,
as follows:
[ ] +1, Approve the release
[ ] -1, Do not approve the release (please provide specific comments)


The complete staging area is available for your review, which includes:
* JIRA release notes [1],
* the official Apache source release and binary convenience releases to be
deployed to dist.apache.org [2], which are signed with the key with
fingerprint 0D545F264D2DFDEBFD4E038F97B4625E2FCF517C [3],
* all artifacts to be deployed to the Maven Central Repository [4],
* source code tag "release-1.12.2-rc2" [5],
* website pull request listing the new release and adding announcement blog
post [6].

The vote will be open for at least 72 hours. It is adopted by majority
approval, with at least 3 PMC affirmative votes.

[1]
https://issues.apache.org/jira/secure/ReleaseNote.jspa?version=12349502&projectId=12315522
[2] https://dist.apache.org/repos/dist/dev/flink/flink-1.12.2-rc2/
[3] https://dist.apache.org/repos/dist/release/flink/KEYS
[4] https://repository.apache.org/content/repositories/orgapacheflink-1414/
[5] https://github.com/apache/flink/releases/tag/release-1.12.2-rc2
[6] https://github.com/apache/flink-web/pull/418

Regards,
Roman


[DISCUSS] Splitting User support mailing list

2021-03-01 Thread Roman Khachatryan
Hi everyone,

I'd like to propose to extract several "sub-lists" from our user mailing
list (u...@flink.apache.org).

For example,
- user-sql@f.a.o (SQL/Table API)
- user-statefun@f.a.o (StateFun)
- user-py@f.a.o (Python)
And u...@flink.apache.org will remain the main or "default" list.

That would improve the quality and speed of the answers and allow
developers to concentrate on the relevant topics.

On the downside, this would reduce list maintainers' exposure to the various
Flink areas.

What do you think?

Regards,
Roman


[VOTE] FLIP-151: Incremental snapshots for heap-based state backend

2021-03-01 Thread Roman Khachatryan
Hi everyone,

since the discussion [1] about FLIP-151 [2] seems to have reached a
consensus, I'd like to start a formal vote for the FLIP.

Please vote +1 to approve the FLIP, or -1 with a comment. The vote will be
open at least until Wednesday, Mar 3rd.

[1] https://s.apache.org/flip-151-discussion
[2]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-151%3A+Incremental+snapshots+for+heap-based+state+backend

Regards,
Roman


Re: [DISCUSS] Splitting User support mailing list

2021-03-01 Thread Roman Khachatryan
Thanks for your replies!

@Konstantin Knauf 
> Why do you think the quality and speed of answers would improve with
dedicated lists?
If there is a question on something that you are not an expert in, then you
either have to
- pull in someone who is more experienced in it (more time spent on
handoffs, esp. if the pulled-in person isn't available)
- or learn it and answer yourself (more time spent on learning, and still a
higher chance of missing something)

@Timo Walther  and @Dawid Wysakowicz

> I fear that we are creating potential silos where a team doesn't know
> what is going on in the other teams.
I think some specialization is unavoidable in a big project like Flink or
Linux (which also has separate lists).
And the user support ML doesn't seem to me to be the right tool to deal with it.

@Dawid Wysakowicz 
> Personally I don't find it problematic. I often find the subjects quite
> descriptive, they often include tags or mention which API they refer to.
Yes, but that only means that the sender would already know the "right"
list.

@Konstantin Knauf  and @j...@apache.org 

I agree that there are crosscutting areas, and also a chance of sending a
message to the wrong list.
But splitting doesn't change anything here: if a SQL question, for example,
is asked on the StateFun ML, then we still have the options above (plus an
option to redirect the user to the other list).

Regards,
Roman


On Mon, Mar 1, 2021 at 11:30 AM Dawid Wysakowicz 
wrote:

> As others I'd also rather be -1 on splitting (even splitting out the
> statefun).
>
> Personally I don't find it problematic. I often find the subjects quite
> descriptive, they often include tags or mention which API they refer to.
> If they don't I am quite sure having separate sub-lists would not help
> in those cases anyway. I agree with the others that splitting the list
> would make the cross communication harder and create knowledge silos.
>
> It would also incur more requirements on users which already often find
> ML counter intuitive (See e.g. the discussion about adding a Flink slack)
>
> Best,
>
> Dawid
>
> On 01/03/2021 11:20, Timo Walther wrote:
> > I would vote -0 here.
> >
> > I fear that we are creating potential silos where a team doesn't know
> > what is going on in the other teams.
> >
> > Regards,
> > Timo
> >
> >
> > On 01.03.21 10:47, Jark Wu wrote:
> >> I also have some concerns about splitting python and sql.
> >> Because I have seen some SQL questions users reported but is related to
> >> deployment or state backend.
> >>
> >> Best,
> >> Jark
> >>
> >> On Mon, 1 Mar 2021 at 17:15, Konstantin Knauf  >
> >> wrote:
> >>
> >>> Hi Roman,
> >>>
> >>> I slightly +1 for a list dedicated to Statefun users, but -1 for
> >>> splitting
> >>> up the rest. I think there are still a lot of crosscutting concerns
> >>> between
> >>> Python, DataStream, Table API and SQL where users of another API can
> >>> also
> >>> help out, too. It also requires users to think about which lists to
> >>> subscribe/write to, instead of simply subscribing to one list.
> >>>
> >>> Why do you think the quality and speed of answers would improve with
> >>> dedicated lists?
> >>>
> >>> Best,
> >>>
> >>> Konstantin
> >>>
> >>>
> >>>
> >>>
> >>>
> >>> On Mon, Mar 1, 2021 at 10:09 AM xiao...@ysstech.com
> >>> 
> >>> wrote:
> >>>
> >>>> Hi Roman,
> >>>>
> >>>> This is a very good idea. I will look forward to the official
> >>>> setting up
> >>>> "sub-lists" as soon as possible and sharing development experience and
> >>>> problems with friends in a certain field.
> >>>>
> >>>> Regards,
> >>>> yue
> >>>>
> >>>>
> >>>>
> >>>> xiao...@ysstech.com
> >>>>
> >>>> From: Roman Khachatryan
> >>>> Date: 2021-03-01 16:48
> >>>> To: dev
> >>>> Subject: [DISCUSS] Splitting User support mailing list
> >>>> Hi everyone,
> >>>>
> >>>> I'd like to propose to extract several "sub-lists" from our user
> >>>> mailing
> >>>> list (u...@flink.apache.org).
> >>>>
> >>>> For example,
> >>>> - user-sql@flink.a.o (Python)
> >>>> - us

Re: [DISCUSS] Splitting User support mailing list

2021-03-01 Thread Roman Khachatryan
@tzuli...@apache.org 
> instead of splitting into “sub-lists”, we should simply have dedicated
“sub-topic maintainers” assigned.
I think this could also work, but some mails may fall between the filters.

@i...@ververica.com 
I guess the previous decision about StateFun ML was made in a bit different
context: no other sub-lists and no data about the list.

Regards,
Roman


On Mon, Mar 1, 2021 at 2:59 PM Igal Shilman  wrote:

> Hi Roman,
>
> Regarding StateFun having a separate mailing list, I'm ok with it going
> either-way, however when we first contributed
> the project there was already a discussion about having a separate mailing
> list for StateFun [1] and the feedback was
> having StateFun be part of the regular mailing list.
>
>
> [1] https://www.mail-archive.com/dev@flink.apache.org/msg31464.html
>
> On Mon, Mar 1, 2021 at 12:25 PM Tzu-Li (Gordon) Tai 
> wrote:
>
>> Hi,
>>
>> I feel that the issues Roman has pointed out so far, is less a problem of
>> all topics (SQL / PyFlink / StateFun) being on the same list, and more a
>> problem that we are missing dedicated groups of “user support shepherds”
>> who are specifically responsible for individual topics on a day-to-day
>> basis.
>>
>> In the distant past, we used to assign shepherds for individual components
>> in Flink.
>> Perhaps something similar to that, but specifically for daily user mailing
>> lists support, is already sufficient to solve the mentioned problems.
>> So essentially, instead of splitting into “sub-lists”, we should simply
>> have dedicated “sub-topic maintainers” assigned.
>>
>> For example, for myself, I set a filter on my email client to look
>> specifically for “Stateful Functions / StateFun” mentions, and tag it
>> appropriately.
>> This already allows me to concentrate on StateFun questions, without
>> losing
>> the exposure to other things happening in the wider Flink project.
>> As far as I can tell, except for some more tricky questions, the
>> turnaround
>> time for StateFun user questions has been ok so far.
>>
>> What do you think?
>>
>> Cheers,
>> Gordon
>>
>> On Mon, Mar 1, 2021 at 6:56 PM Roman Khachatryan 
>> wrote:
>>
>> > Thanks for your replies!
>> >
>> > @Konstantin Knauf 
>> > > Why do you think the quality and speed of answers would improve with
>> > dedicated lists?
>> > If there is a question on something that you are not an expert in; then
>> you
>> > either have to
>> > - pull in someone who is more experienced in it (more time on hops,
>> esp. if
>> > the pulled in person isn't available)
>> > - or learn it and answer yourself (more time on learning and still
>> higher
>> > chance of missing something)
>> >
>> > @Timo Walther  and @Dawid Wysakowicz
>> > 
>> > > I fear that we are creating potential silos where a team doesn't know
>> > > what is going on in the other teams.
>> > I think some specialization is unavoidable in a big project like Flink
>> or
>> > Linux (which also has separate lists).
>> > And user support ML doesn't seem to me the right tool to deal with it.
>> >
>> > @Dawid Wysakowicz 
>> > > Personally I don't find it problematic. I often find the subjects
>> quite
>> > > descriptive, they often include tags or mention which API they refer
>> to.
>> > Yes, but that only means that the sender would already know the "right"
>> > list.
>> >
>> > @Konstantin Knauf  and @j...@apache.org <
>> > j...@apache.org>
>> >
>> > I agree that there are crosscutting areas; and also a chance of sending
>> a
>> > message to the wrong topic.
>> > But splitting doesn't change anything here: if a SQL question for
>> example
>> > is asked on StateFun ML then
>> > we still have the options above (plus an option to redirect user to the
>> > other list).
>> >
>> > Regards,
>> > Roman
>> >
>> >
>> > On Mon, Mar 1, 2021 at 11:30 AM Dawid Wysakowicz <
>> dwysakow...@apache.org>
>> > wrote:
>> >
>> > > As others I'd also rather be -1 on splitting (even splitting out the
>> > > statefun).
>> > >
>> > > Personally I don't find it problematic. I often find the subjects
>> quite
>> > > descriptive, they often include tags or mention which API they refer
>> to.
>> > > If they don't I 

Re: [DISCUSS] Apache Flink Jira Process

2021-03-01 Thread Roman Khachatryan
Hi,

Thanks for the proposal Konstantin,
I like the ideas expressed there.

I am a bit concerned about the new issue type "Technical Debt". In contrast
to other issue types, it doesn't imply that someone is likely to work on
it, so it can linger until the bot closes it.
Perhaps we need a rule requiring the person opening such a ticket to
intend to work on it in the near future?
Another approach would be to track such items in some wiki space.

As for the Trivial priority, I would remove it and use labels where
appropriate, as you suggested.

Regards,
Roman


On Mon, Mar 1, 2021 at 11:53 AM Konstantin Knauf 
wrote:

> Hi Dawid,
>
> Thanks for the feedback. Do you think we should simply get rid of the
> "Trivial" priority then and use the "starter" label more aggressively?
>
> Best,
>
> Konstantin
>
> On Mon, Mar 1, 2021 at 11:44 AM Dawid Wysakowicz 
> wrote:
>
> > Hi Konstantin,
> >
> > I also like the idea.
> >
> > Two comments:
> >
> > * you describe the "Trivial" priority as one that needs to be
> > implemented immediately. First of all it is not used to often, but I
> > think the way it works now is similar with a "starter" label. Tasks that
> > are not bugs, are easy to implement and we think they are fine to be
> > taken by newcomers. Therefore they do not fall in my mind into
> > "immediately" category.
> >
> > * I would still deprioritise test instabilities. I think there shouldn't
> > be a problem with that. We do post links to all failures therefore it
> > will automatically priortise the tasks according to failure frequencies.
> >
> > Best,
> >
> > Dawid
> >
> > On 01/03/2021 09:38, Konstantin Knauf wrote:
> > > Hi Xintong,
> > >
> > > yes, such labels would make a lot of sense. I added a sentence to the
> > > document.
> > >
> > > Thanks,
> > >
> > > Konstantin
> > >
> > > On Mon, Mar 1, 2021 at 8:51 AM Xintong Song 
> > wrote:
> > >
> > >> Thanks for driving this discussion, Konstantin.
> > >>
> > >> I like the idea of having a bot reminding reporter/assignee/watchers
> > about
> > >> inactive tickets and if needed downgrade/close them automatically.
> > >>
> > >> My two cents:
> > >> We may have labels like "downgraded-by-bot" / "closed-by-bot", so that
> > it's
> > >> easier to filter and review tickets updated by the bot.
> > >> We may want to review such tickets (e.g., monthly) in case a valid
> > ticket
> > >> failed to draw the attention of relevant committers and the reporter
> > >> doesn't know who to ping.
> > >>
> > >> Thank you~
> > >>
> > >> Xintong Song
> > >>
> > >>
> > >>
> > >> On Sat, Feb 27, 2021 at 1:37 AM Till Rohrmann 
> > >> wrote:
> > >>
> > >>> Thanks for starting this discussion Konstantin. I like your proposal
> > and
> > >>> also the idea of automating the tedious parts of it via a bot.
> > >>>
> > >>> Cheers,
> > >>> Till
> > >>>
> > >>> On Fri, Feb 26, 2021 at 4:17 PM Konstantin Knauf 
> > >>> wrote:
> > >>>
> >  Dear Flink Community,
> > 
> >  I would like to start a discussion on improving and to some extent
> > >> simply
> >  defining the way we work with Jira. Some aspects have been
> discussed a
> >  while back [1], but I would like to go a bit beyond that with the
> > >>> following
> >  goals in mind:
> > 
> > 
> > -
> > 
> > clearer communication and expectation management with the
> community
> > -
> > 
> >    a user or contributor should be able to judge the urgency of a
> > >>> ticket
> >    by its priority
> >    -
> > 
> >    if a ticket is assigned to someone the expectation that
> someone
> > >> is
> >    working on it should hold
> >    -
> > 
> > generally reduce noise in Jira
> > -
> > 
> > reduce overhead of committers to ask about status updates of
> > contributions or bug reports
> > -
> > 
> >    “Are you still working on this?”
> >    -
> > 
> >    “Are you still interested in this?”
> >    -
> > 
> >    “Does this still happen on Flink 1.x?”
> >    -
> > 
> >    “Are you still experiencing this issue?”
> >    -
> > 
> >    “What is the status of the implementation”?
> >    -
> > 
> > while still encouraging users to add new tickets and to leave
> > >> feedback
> > about existing tickets
> > 
> > 
> >  Please see the full proposal here:
> > 
> > 
> > >>
> >
> https://docs.google.com/document/d/19VmykDSn4BHgsCNTXtN89R7xea8e3cUIl-uivW8L6W8/edit#
> >  .
> > 
> >  The idea would be to discuss this proposal in this thread. If we
> come
> > >> to
> > >>> a
> >  conclusion, I'd document the proposal in the wiki [2] and we would
> > then
> >  vote on it (approval by "Lazy Majority").
> > 
> >  Cheers,
> > 
> >  Konstantin
> > 
> >  [1]
> > 
> > 
> > >>
> >
> https://lists.apache.org/thread.html/rd34fb695d371c2

Re: [VOTE] Release 1.12.2, release candidate #2

2021-03-02 Thread Roman Khachatryan
Hi everyone,

+1 (binding)

I've verified the checksums and the source distribution with maven verify.

Regards,
Roman


On Tue, Mar 2, 2021 at 7:29 AM Kurt Young  wrote:

> +1 (binding)
>
> - We mainly checked the patch of FLINK-20663 [1] and confirmed there is no
> OutOfManagedMemory error anymore.
>
> [1] https://issues.apache.org/jira/browse/FLINK-20663
>
> Best,
> Kurt
>
>
> On Tue, Mar 2, 2021 at 12:41 PM Yu Li  wrote:
>
> > +1 (binding)
> >
> > - Checked the diff between 1.12.1 and 1.12.2-rc2: OK (
> >
> https://github.com/apache/flink/compare/release-1.12.1...release-1.12.2-rc2
> > )
> >   - jackson version has been bumped to 2.10.5.1 through FLINK-21020 and
> all
> > NOTICE files updated correctly
> >   - beanutils version has been bumped to 1.9.4 through FLINK-21123 and
> all
> > NOTICE files updated correctly
> >   - testcontainer version has been bumped to 1.15.1 through FLINK-21277
> and
> > no NOTICE files impact
> >   - japicmp version has been bumped to 1.12.1 and no NOTICE files impact
> > - Checked release notes: OK
> > - Checked sums and signatures: OK
> > - Maven clean install from source: OK
> > - Checked the jars in the staging repo: OK
> > - Checked the website updates: OK (minor: corrected fix version of
> > FLINK-21515 <https://issues.apache.org/jira/browse/FLINK-21515> to make
> > sure the website PR consistent with release note)
> >
> > Note: there's a vulnerability suspicion against 1.12.2-rc2 reported in
> > user-zh mailing list [1] w/o enough evidence/information. Have asked the
> > reporter to do more testing to confirm and I don't think it's a blocker
> for
> > the release, but just a note here in case anyone has a different opinion.
> >
> > Thanks a lot for managing the new RC!
> >
> > Best Regards,
> > Yu
> >
> > [1]
> http://apache-flink.147419.n8.nabble.com/flink-1-12-2-rc2-td11023.html
> >
> > On Tue, 2 Mar 2021 at 01:51, Piotr Nowojski 
> wrote:
> >
> > > +1 (binding)
> > >
> > > For the RC2 I have additionally confirmed that "stop-with-savepoint",
> and
> > > "stop-with-savepoint --drain" seems to be working.
> > >
> > > Piotrek
> > >
> > > pon., 1 mar 2021 o 11:18 Matthias Pohl 
> > > napisał(a):
> > >
> > > > Thanks for managing release 1.12.2, Yuan & Roman.
> > > >
> > > > +1 (non-binding)
> > > >
> > > > - Verified checksums and GPG of artifacts in [1]
> > > > - Build the sources locally without errors
> > > > - Started a local standalone cluster and deployed WordCount without
> > > > problems (no suspicious logs identified)
> > > > - Verified FLINK-21030 [2] by running the example jobs from the
> > > > FLINK-21030-related SavepointITCase tests
> > > >
> > > > Best,
> > > > Matthias
> > > >
> > > > [1] https://dist.apache.org/repos/dist/dev/flink/flink-1.12.2-rc2/
> > > > [2] https://issues.apache.org/jira/browse/FLINK-21030
> > > >
> > > > On Sun, Feb 28, 2021 at 2:41 PM Yuan Mei 
> > wrote:
> > > >
> > > > > Hey Roman,
> > > > >
> > > > > Thank you very much for preparing RC2.
> > > > >
> > > > > +1 from my side.
> > > > >
> > > > > 1. Verified Checksums and GPG signatures.
> > > > > 2. Verified that the source archives do not contain any binaries.
> > > > > 3. Successfully Built the source with Maven.
> > > > > 4. Started a local Flink cluster, ran the streaming WordCount
> example
> > > > with
> > > > > WebUI,
> > > > > checked the output and JM/TM log, no suspicious output/log.
> > > > > 5. Repeat Step 4 with the binary release as well, no suspicious
> > > > output/log.
> > > > > 6. Checked for source and binary release to make sure both an
> Apache
> > > > > License file and a NOTICE file are included.
> > > > > 7. Manually verified that no pom file changes between 1.12.2-rc1
> and
> > > > > 1.12.2-rc2; no obvious license problem.
> > > > > 8. Review the release PR for RC2 updates, and double confirmed the
> > > > > change-list for 1.12.2.
> > > > >
> > > > > Best,
> > > > > Yuan
> > > > >
> > > > > On Sat, Feb 27, 2021 at 7

Re: [VOTE] Release 1.12.2, release candidate #2

2021-03-02 Thread Roman Khachatryan
Thank you all for helping to verify and test the RC!

The vote has lasted for more than 72 hours and has enough approvals.
I will finalize the vote result soon in a separate email.

Regards,
Roman


On Tue, Mar 2, 2021 at 10:33 AM Zhu Zhu  wrote:

> +1 (binding)
>  - checked signatures and checksums
>  - built from source
>  - ran medium scale streaming and batch jobs (parallelism=1000) on a YARN
> cluster. checked output and logs.
>  - the web PR looks good
>
> Thanks,
> Zhu
>
> Roman Khachatryan  于2021年3月2日周二 下午4:20写道:
>
> > Hi everyone,
> >
> > +1 (binding)
> >
> > I've verified the checksums and the source distribution with maven
> verify.
> >
> > Regards,
> > Roman
> >
> >
> > On Tue, Mar 2, 2021 at 7:29 AM Kurt Young  wrote:
> >
> > > +1 (binding)
> > >
> > > - We mainly checked the patch of FLINK-20663 [1] and confirmed there is
> > no
> > > OutOfManagedMemory error anymore.
> > >
> > > [1] https://issues.apache.org/jira/browse/FLINK-20663
> > >
> > > Best,
> > > Kurt
> > >
> > >
> > > On Tue, Mar 2, 2021 at 12:41 PM Yu Li  wrote:
> > >
> > > > +1 (binding)
> > > >
> > > > - Checked the diff between 1.12.1 and 1.12.2-rc2: OK (
> > > >
> > >
> >
> https://github.com/apache/flink/compare/release-1.12.1...release-1.12.2-rc2
> > > > )
> > > >   - jackson version has been bumped to 2.10.5.1 through FLINK-21020
> and
> > > all
> > > > NOTICE files updated correctly
> > > >   - beanutils version has been bumped to 1.9.4 through FLINK-21123
> and
> > > all
> > > > NOTICE files updated correctly
> > > >   - testcontainer version has been bumped to 1.15.1 through
> FLINK-21277
> > > and
> > > > no NOTICE files impact
> > > >   - japicmp version has been bumped to 1.12.1 and no NOTICE files
> > impact
> > > > - Checked release notes: OK
> > > > - Checked sums and signatures: OK
> > > > - Maven clean install from source: OK
> > > > - Checked the jars in the staging repo: OK
> > > > - Checked the website updates: OK (minor: corrected fix version of
> > > > FLINK-21515 <https://issues.apache.org/jira/browse/FLINK-21515> to
> > make
> > > > sure the website PR consistent with release note)
> > > >
> > > > Note: there's a vulnerability suspicion against 1.12.2-rc2 reported
> in
> > > > user-zh mailing list [1] w/o enough evidence/information. Have asked
> > the
> > > > reporter to do more testing to confirm and I don't think it's a
> blocker
> > > for
> > > > the release, but just a note here in case anyone has a different
> > opinion.
> > > >
> > > > Thanks a lot for managing the new RC!
> > > >
> > > > Best Regards,
> > > > Yu
> > > >
> > > > [1]
> > > http://apache-flink.147419.n8.nabble.com/flink-1-12-2-rc2-td11023.html
> > > >
> > > > On Tue, 2 Mar 2021 at 01:51, Piotr Nowojski 
> > > wrote:
> > > >
> > > > > +1 (binding)
> > > > >
> > > > > For the RC2 I have additionally confirmed that
> "stop-with-savepoint",
> > > and
> > > > > "stop-with-savepoint --drain" seems to be working.
> > > > >
> > > > > Piotrek
> > > > >
> > > > > pon., 1 mar 2021 o 11:18 Matthias Pohl 
> > > > > napisał(a):
> > > > >
> > > > > > Thanks for managing release 1.12.2, Yuan & Roman.
> > > > > >
> > > > > > +1 (non-binding)
> > > > > >
> > > > > > - Verified checksums and GPG of artifacts in [1]
> > > > > > - Build the sources locally without errors
> > > > > > - Started a local standalone cluster and deployed WordCount
> without
> > > > > > problems (no suspicious logs identified)
> > > > > > - Verified FLINK-21030 [2] by running the example jobs from the
> > > > > > FLINK-21030-related SavepointITCase tests
> > > > > >
> > > > > > Best,
> > > > > > Matthias
> > > > > >
> > > > > > [1]
> https://dist.apache.org/repos/dist/dev/flink/flink-1.12.2-rc2/
> > > > > > [2] https://is

[RESULT] [VOTE] Release 1.12.2, release candidate #2

2021-03-02 Thread Roman Khachatryan
I'm happy to announce that we have unanimously approved this release.

There are 7 approving votes, 5 of which are binding:
* Kurt Young
* Piotr Nowojski
* Roman Khachatryan
* Yu Li
* Zhu Zhu

There are no disapproving votes.

Thanks everyone!

Regards,
Roman


Re: [DISCUSS] Splitting User support mailing list

2021-03-02 Thread Roman Khachatryan
Thanks Robert,

That's a good idea, let's revisit it later.

Regards,
Roman


On Tue, Mar 2, 2021 at 3:40 PM Robert Metzger  wrote:

> Thanks a lot for bringing up this idea Roman!
>
> After reading the initial proposal, I quite liked the idea, because it
> makes our life easier: We can only monitor lists relevant for the topics we
> are working on (I have to admit that I usually skip all questions that seem
> to be related to SQL or Statefun).
> There are a few Apache projects which have followed a similar approach [1],
> most notable maybe the Hadoop project, which has a user@, as well as
> hdfs-user@, mapreduce-user@, ozone-user@ etc. There, it seems that
> sub-projects have separate lists. This would support the idea of splitting
> out statefun into a separate list.
>
> But the majority of people who have commented so far seem to have concerns
> regarding the proposal, which seem reasonable.
> I propose to revisit this proposal at a later point.
>
>
>
> [1] http://mail-archives.apache.org/mod_mbox/
>


Re: [RESULT] [VOTE] Release 1.12.2, release candidate #2

2021-03-02 Thread Roman Khachatryan
Yes, you are right.

There are 7 approving votes, 4 of which are binding:
* Kurt Young
* Piotr Nowojski
* Yu Li
* Zhu Zhu

There are no disapproving votes.

Thanks everyone!

Regards,
Roman


On Tue, Mar 2, 2021 at 9:29 PM Dawid Wysakowicz 
wrote:

> Yes, Henry is right. Binding votes on Apache releases must come from PMC
> members.
>
> Out of the votes 4 are binding:
>
> * Kurt Young
> * Piotr Nowojski
> * Yu Li
> * Zhu Zhu
>
> Best,
> Dawid
>
> On 02/03/2021 21:17, Henry Saputra wrote:
> > Roman, binding Votes only come from Apache Flink PMC members.
> >
> > From the list you had seems like only 3 binding Votes, right?
> >
> >
> >
> > On Tue, Mar 2, 2021, 7:08 AM Roman Khachatryan  wrote:
> >
> >> I'm happy to announce that we have unanimously approved this release.
> >>
> >> There are 7 approving votes, 5 of which are binding:
> >> * Kurt Young
> >> * Piotr Nowojski
> >> * Roman Khachatryan
> >> * Yu Li
> >> * Zhu Zhu
> >>
> >> There are no disapproving votes.
> >>
> >> Thanks everyone!
> >>
> >> Regards,
> >> Roman
> >>
>
>


[ANNOUNCE] Apache Flink 1.12.2 released

2021-03-05 Thread Roman Khachatryan
The Apache Flink community is very happy to announce the release of Apache
Flink 1.12.2, which is the second bugfix release for the Apache Flink 1.12
series.

Apache Flink® is an open-source stream processing framework for
distributed, high-performing, always-available, and accurate data streaming
applications.

The release is available for download at:
https://flink.apache.org/downloads.html

Please check out the release blog post for an overview of the improvements
for this bugfix release:
https://flink.apache.org/news/2021/03/03/release-1.12.2.html

The full release notes are available in Jira:
https://issues.apache.org/jira/secure/ReleaseNote.jspa?version=12349502&projectId=12315522

We would like to thank all contributors of the Apache Flink community who
made this release possible!

Special thanks to Yuan Mei for managing the release and PMC members Robert
Metzger, Chesnay Schepler and Piotr Nowojski.

Regards,
Roman


Re: [VOTE] FLIP-151: Incremental snapshots for heap-based state backend

2021-03-05 Thread Roman Khachatryan
+1 (binding)

Regards,
Roman


On Thu, Mar 4, 2021 at 7:00 PM Yu Li  wrote:
>
> +1 (binding)
>
> The latest FLIP document LGTM. Thanks for driving this Roman!
>
> Best Regards,
> Yu
>
>
> On Thu, 4 Mar 2021 at 18:24, David Anderson  wrote:
>
> > +1 (non-binding)
> >
> > On Mon, Mar 1, 2021 at 10:12 AM Roman Khachatryan 
> > wrote:
> >
> > > Hi everyone,
> > >
> > > since the discussion [1] about FLIP-151 [2] seems to have reached a
> > > consensus, I'd like to start a formal vote for the FLIP.
> > >
> > > Please vote +1 to approve the FLIP, or -1 with a comment. The vote will
> > be
> > > open at least until Wednesday, Mar 3rd.
> > >
> > > [1] https://s.apache.org/flip-151-discussion
> > > [2]
> > >
> > >
> > https://cwiki.apache.org/confluence/display/FLINK/FLIP-151%3A+Incremental+snapshots+for+heap-based+state+backend
> > >
> > > Regards,
> > > Roman
> > >
> >


[RESULT][VOTE] FLIP-151: Incremental snapshots for heap-based state backend

2021-03-05 Thread Roman Khachatryan
Hi everyone,

The voting time for FLIP-151: Incremental snapshots for heap-based
state backend [1] has passed. I'm closing the vote now.

There were 5 +1 votes, 4 of which are binding:

- Piotr Nowojski (binding)
- Zhijiang (binding)
- David Anderson (non-binding)
- Yu Li (binding)
- Roman Khachatryan (binding)

There were no -1 votes.

Thus, FLIP-151 has been accepted.

Thanks everyone for joining the discussion and giving feedback!

[1]
http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/VOTE-FLIP-151-Incremental-snapshots-for-heap-based-state-backend-td49060.html

Regards,
Roman


Re: [VOTE] FLIP-165: Operator's Flame Graphs

2021-03-05 Thread Roman Khachatryan
+1 (binding)

Regards,
Roman

On Fri, Mar 5, 2021 at 3:59 PM Till Rohrmann  wrote:
>
> +1 (binding)
>
> Cheers,
> Till
>
> On Fri, Mar 5, 2021 at 2:43 PM Alexander Fedulov 
> wrote:
>
> > Hi all,
> >
> > I would like to start a vote on FLIP-165 [1] which was discussed in [2] and
> > proposed to be moved to the voting stage.
> >
> > The vote will be open until March 11th, 2021 10:00 AM CET.
> >
> > [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-165
> > %3A+Operator%27s+Flame+Graphs
> >
> > [2]
> >
> > http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-FLIP-165-Operator-s-Flame-Graphs-td49097.html
> >
> > Best,
> >
> > --
> >
> > Alexander Fedulov | Solutions Architect
> >
> > 
> >
> > Follow us @VervericaData
> >


Re: [VOTE] Apache Flink Jira Process (& Bot)

2021-03-26 Thread Roman Khachatryan
+1,

thanks for this improvement Konstantin!

Regards,
Roman

On Fri, Mar 26, 2021 at 8:49 AM Arvid Heise  wrote:
>
> +1 from my side.
>  Thanks for proposing!
>
> On Fri, Mar 26, 2021 at 8:40 AM Robert Metzger  wrote:
>
> > +1 This is a very good improvement!
> >
> > Most likely we'll have to adjust the parameters a bit in the beginning,
> > once we've seen it in practice, but this will help keep the Jira clean.
> >
> > On Wed, Mar 24, 2021 at 11:15 AM Konstantin Knauf <
> > konstan...@ververica.com>
> > wrote:
> >
> > > > When/how are labels removed?
> > >
> > > Manually by the user who updates the ticket, I'd propose for now. Doing
> > > this automatically might be possible, but will make the bot quite a bit
> > > more complex, I think.  Let's say a ticket is labelled "stale-assigned"
> > by
> > > the bot. Then the assignee might update and remove the "stale-assigned"
> > > label. I would include this information in the comment that the bot posts
> > > like
> > >
> > > "This ticket has been marked "stale-assigned" as it has not received a
> > > public update for X days. Please provide an update in the next Y days and
> > > remove the "stale-assigned" label. Otherwise, the ticket will be
> > unassigned
> > > in Z days."
> > >
> > > On Wed, Mar 24, 2021 at 11:05 AM Chesnay Schepler 
> > > wrote:
> > >
> > > > When/how are labels removed?
> > > >
> > > > On 3/24/2021 10:06 AM, Matthias Pohl wrote:
> > > > > Thanks Konstantin for working on this. The ideas we collected in [1]
> > > > sound
> > > > > good. I'm looking forward to trying it out.
> > > > > +1 from my side.
> > > > >
> > > > > Best,
> > > > > Matthias
> > > > >
> > > > > [1]
> > > > >
> > > >
> > >
> > https://lists.apache.org/thread.html/re7affbb1357ce4986a7770b0052c39c9a26ebd7cd0df3f15ed320781%40%3Cdev.flink.apache.org%3E
> > > > >
> > > > > On Wed, Mar 24, 2021 at 9:59 AM Konstantin Knauf 
> > > > wrote:
> > > > >
> > > > >> Hi everyone,
> > > > >>
> > > > >> based on the discussion in [1], I would like to start a vote on the
> > > > >> proposal as documented in [2].
> > > > >>
> > > > >> The vote will last for at least 72 hours, and will be accepted by a
> > > > >> consensus of active committers.
> > > > >>
> > > > >> Thanks,
> > > > >>
> > > > >> Konstantin
> > > > >>
> > > > >> [1]
> > > > >>
> > > > >>
> > > >
> > >
> > https://lists.apache.org/thread.html/re7affbb1357ce4986a7770b0052c39c9a26ebd7cd0df3f15ed320781%40%3Cdev.flink.apache.org%3E
> > > > >> [2]
> > > > >>
> > > > >>
> > > >
> > >
> > https://docs.google.com/document/d/19VmykDSn4BHgsCNTXtN89R7xea8e3cUIl-uivW8L6W8/edit#
> > > > >>
> > > > >> --
> > > > >>
> > > > >> Konstantin Knauf
> > > > >>
> > > > >> https://twitter.com/snntrable
> > > > >>
> > > > >> https://github.com/knaufk
> > > >
> > > >
> > > >
> > >
> > > --
> > >
> > > Konstantin Knauf | Head of Product
> > >
> > > +49 160 91394525
> > >
> > >
> > > Follow us @VervericaData Ververica 
> > >
> > >
> > > --
> > >
> > > Join Flink Forward  - The Apache Flink
> > > Conference
> > >
> > > Stream Processing | Event Driven | Real Time
> > >
> > > --
> > >
> > > Ververica GmbH | Invalidenstrasse 115, 10115 Berlin, Germany
> > >
> > > --
> > > Ververica GmbH
> > > Registered at Amtsgericht Charlottenburg: HRB 158244 B
> > > Managing Directors: Yip Park Tung Jason, Jinwei (Kevin) Zhang, Karl Anton
> > > Wehner
> > >
> >


Re: [VOTE] Release 1.12.4, release candidate #1

2021-05-14 Thread Roman Khachatryan
Thanks for managing the release.

+1 (non-binding)

I've checked the artifacts:
- Verified checksums and GPG of artifacts in [2]
- Built the sources locally without errors
- Verified that the source archives do not contain any binaries
- Checked that all POM files point to the same version.
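(The checksum and signature checks in this kind of RC verification can be
sketched roughly as follows. This is an illustrative sketch, not the official
release tooling; the file layout — FILE, FILE.sha512, FILE.asc side by side,
as typically hosted on dist.apache.org — is an assumption.)

```shell
#!/usr/bin/env sh
# Sketch of verifying a release artifact's checksum and GPG signature.
# Assumes the artifact, its .sha512 file, and its .asc detached signature
# have already been downloaded into the same directory.

verify_checksum() {
  # FILE.sha512 contains "HASH  FILENAME"; sha512sum -c re-hashes the
  # file and compares against the recorded hash.
  file="$1"
  (cd "$(dirname "$file")" && sha512sum -c "$(basename "$file").sha512")
}

verify_signature() {
  # Requires the release manager's public key to be imported first,
  # e.g. from the project's KEYS file:  gpg --import KEYS
  gpg --verify "$1.asc" "$1"
}
```

For a real check, the artifacts and the KEYS file would first be fetched from
dist.apache.org, and all POM versions in the extracted source tree can be
spot-checked with a simple grep over `*/pom.xml`.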

Regards,
Roman

On Mon, May 10, 2021 at 11:34 PM Arvid Heise  wrote:
>
> Hi everyone,
>
> Please review and vote on the release candidate #1 for the version 1.12.4,
> as follows:
> [ ] +1, Approve the release
> [ ] -1, Do not approve the release (please provide specific comments)
>
> The complete staging area is available for your review, which includes:
> * JIRA release notes [1],
> * the official Apache source release and binary convenience releases to be
> deployed to dist.apache.org [2], which are signed with the key with
> fingerprint 476DAA5D1FF08189 [3],
> * all artifacts to be deployed to the Maven Central Repository [4],
> * source code tag "release-1.12.4-rc1" [5],
> * website pull request listing the new release and adding announcement blog
> post [6].
>
> The vote will be open for at least 72 hours. It is adopted by majority
> approval, with at least 3 PMC affirmative votes.
>
> Thanks,
> Your friendly release manager Arvid
>
> [1]
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522&version=12350110
> [2] https://dist.apache.org/repos/dist/dev/flink/flink-1.12.4-rc1/
> [3] https://dist.apache.org/repos/dist/release/flink/KEYS
> [4] https://repository.apache.org/content/repositories/orgapacheflink-1421
> [5] https://github.com/apache/flink/releases/tag/release-1.12.4-rc1
> [6] https://github.com/apache/flink-web/pull/446


Re: [DISCUSS] Component labels in PR/commit messages

2021-05-20 Thread Roman Khachatryan
Thanks for raising this issue.

I agree with the above points.
One simple argument against labels is that they consume space in the
commit messages.

+1 to make labels optional

Regards,
Roman

On Thu, May 20, 2021 at 9:31 AM Robert Metzger  wrote:
>
> +1 to Till's proposal to update the wording.
>
> Regarding c) The guide [1] actually mentions a good heuristic for coming up
> with a label that is also suitable for newcomers: The maven module name
> where most of the changes are.
>
> [1]
> https://flink.apache.org/contributing/code-style-and-quality-pull-requests.html#commit-naming-conventions
>
> On Wed, May 19, 2021 at 11:14 AM Till Rohrmann  wrote:
>
> > I think a big problem with the component labels is that there is
> >
> > a) no defined set of labels
> > b) no way to enforce the usage of them
> > c) no easy way to figure out which label to use
> >
> > Due to these problems they are used very inconsistently in the project.
> >
> > I do agree with Arvid's observation that they are less and less often used
> > in new commits. Given this, we could think about adjusting our guidelines
> > to better reflect reality and make them "optional"/"nice-to-have" for
> > example.
> >
> > Cheers,
> > Till
> >
> > On Wed, May 19, 2021 at 10:52 AM Chesnay Schepler 
> > wrote:
> >
> > > For commit messages the labels are useful mostly when scanning the
> > > commit history, like searching for some commit that could've caused
> > > something /without knowing where that change was made/, because it
> > > enables you to quickly filter out commits by their label instead of
> > > having to read the entire title.
> > >
> > > I think in particular there is value in labeling documentation/build
> > > system changes; it allows me to worry less about the phrasing because I
> > > can assume the reader to have some context. For example,
> > >
> > > "[FLINK-X] Remove deprecated methods" vs "[FLINK-X][docs] Remove
> > > deprecated methods".
> > >
> > > You could of course argue to use "[FLINK-X] Remove deprecated methods
> > > from docs", but that's just a worse version of labeling.
> > >
> > >
> > > On 5/19/2021 10:31 AM, Arvid Heise wrote:
> > > > Dear devs,
> > > >
> > > > In the last couple of weeks, I have noticed that we are slacking a bit
> > on
> > > > the components in PR/commit messages. I'd like to gather some feedback
> > if
> > > > we still want to include them and if so, how we can improve the process
> > > of
> > > > finding the correct label.
> > > >
> > > > My personal opinion: So far, I have usually added the component because
> > > > it's in the coding guidelines. I have not really understood the
> > benefit.
> > > It
> > > > might be coming from a time when git support in IDEs was lacking and it
> > > was
> > > > necessary to maintain an overview. I also have a hard time finding the
> > > > correct component at times; I often just repeat the component that I
> > find
> > > > in a blame. Nevertheless, I value consistency over personal taste and
> > > would
> > > > stick to the plan (and guide contributions towards it) if other devs
> > > > (especially committers) do it as well. But this has been causing some
> > > > friction in a couple of reviews for me.
> > > >
> > > > Could you please give your opinion on this matter? I think it's
> > important
> > > > to note that if long-term committers are not following it, it's really
> > > hard
> > > > for newer devs to follow that (git blame not helping in choosing the
> > > > component). Then we should remove it from the guidelines to make
> > > > contributions easier.
> > > >
> > > > Thanks
> > > >
> > > > Arvid
> > > >
> > >
> > >
> >


Re: [DISCUSS][Statebackend][Runtime] Changelog Statebackend Configuration Proposal

2021-05-31 Thread Roman Khachatryan
Hey Yuan, thanks for the proposal

I think Option 3 is the simplest to use and exposes fewer details than any other.
It's also consistent with the current way of configuring state
backends, as long as we treat change logging as a common feature
applicable to any state backend, like e.g.
state.backend.local-recovery.
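
To make the flag-based approach concrete, here is a minimal sketch of how a single boolean switch could toggle changelog wrapping independently of the inner backend choice. All class names and the config key below are assumptions for illustration, not the actual Flink API or the final key from the proposal:

```java
// Illustrative sketch only: a boolean flag (Option 3) wraps an arbitrary
// inner backend, analogous to state.backend.local-recovery.
// Class names and the config key are hypothetical, not Flink's actual API.
import java.util.Map;

interface StateBackend { String describe(); }

class HashMapBackend implements StateBackend {
    public String describe() { return "hashmap"; }
}

class ChangelogBackend implements StateBackend {
    private final StateBackend inner;
    ChangelogBackend(StateBackend inner) { this.inner = inner; }
    public String describe() { return "changelog(" + inner.describe() + ")"; }
}

public class BackendLoader {
    // Resolve the backend from flat config keys; the changelog flag is
    // orthogonal to the inner backend choice.
    static StateBackend fromConfig(Map<String, String> conf) {
        StateBackend inner = new HashMapBackend(); // only backend in this sketch
        boolean changelog = Boolean.parseBoolean(
                conf.getOrDefault("state.backend.changelog.enabled", "false"));
        return changelog ? new ChangelogBackend(inner) : inner;
    }

    public static void main(String[] args) {
        System.out.println(fromConfig(Map.of(
                "state.backend", "hashmap",
                "state.backend.changelog.enabled", "true")).describe());
        // prints "changelog(hashmap)"
    }
}
```

The point of the sketch: the user keeps configuring the inner backend exactly as today, and one extra flag decides whether it gets wrapped.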

Option 6 seems slightly less preferable as it exposes more details, but
I think it is the most viable alternative.

Regards,
Roman


On Mon, May 31, 2021 at 8:39 AM Yuan Mei  wrote:
>
> Hey all,
>
> We would like to start a discussion on how to enable/config Changelog
> Statebackend.
>
> As part of FLIP-158[1], the Changelog state backend wraps existing
> state backends (HashMapStateBackend, EmbeddedRocksDBStateBackend, and maybe
> more) and delegates state changes to the underlying state backends.
> This thread is to discuss the problem of how Changelog StateBackend should
> be enabled and configured.
>
> Proposed options to enable/config the state changelog are listed below:
>
> Option 1: Enable Changelog Statebackend through a Boolean Flag
>
> Option 2: Enable Changelog Statebackend through a Boolean Flag + a Special
> Case
>
> Option 3: Enable Changelog Statebackend through a Boolean Flag + W/O
> ChangelogStateBackend Exposed
>
> Option 4: Explicit Nested Configuration + “changelog.inner” prefix for
> inner backend
>
> Option 5: Explicit Nested Configuration + inner state backend configuration
> unchanged
>
> Option 6: Config Changelog and Inner Statebackend All-Together
>
> Details of each option can be found here:
> https://docs.google.com/document/d/13AaCf5fczYTDHZ4G1mgYL685FqbnoEhgo0cdwuJlZmw/edit?usp=sharing
>
> When considering these options, please consider these four dimensions:
> 1. Consistency
> API/config should follow a consistent model and should not have
> contradictory logic beneath
> 2. Simplicity
> API should be easy to use and not place too much burden on users
> 3. Explicitness
> API/config should not contain implicit assumptions and should be intuitive
> to users
> 4. Extensibility
> Whether the current setting can be easily extended in the foreseeable future
>
> Please let us know what you think and please keep the discussion in this
> mailing thread.
>
> [1]
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-158%3A+Generalized+incremental+checkpoints
>
> Best
> Yuan


Re: Kinesis Performance Issue (was [VOTE] Release 1.11.0, release candidate #4)

2020-08-07 Thread Roman Khachatryan
Hi Thomas,

Thanks for your reply!

I think you are right, we can remove this sleep and improve KinesisProducer.
Probably, its snapshotState can also be sped up by forcing record flushes
more often.
Do you see that 30s checkpointing duration is caused by KinesisProducer (or
maybe other operators)?

I'd also like to understand the reason behind this increase in checkpoint
frequency.
Can you please share these values:
 - execution.checkpointing.min-pause
 - execution.checkpointing.max-concurrent-checkpoints
 - execution.checkpointing.timeout

And what is the "new" observed checkpoint frequency (or how many
checkpoints are created) compared to older versions?


On Fri, Aug 7, 2020 at 4:49 AM Thomas Weise  wrote:

> Hi Roman,
>
> Indeed there are more frequent checkpoints with this change! The
> application was configured to checkpoint every 10s. With 1.10 ("good
> commit"), that leads to fewer completed checkpoints compared to 1.11 ("bad
> commit"). Just to be clear, the only difference between the two runs was
> the commit 355184d69a8519d29937725c8d85e8465d7e3a90
>
> Since the sync part of checkpoints with the Kinesis producer always takes
> ~30 seconds, the 10s configured checkpoint frequency really had no effect
> before 1.11. I confirmed that both commits perform comparably by setting
> the checkpoint frequency and min pause to 60s.
>
> I still have to verify with the final 1.11.0 release commit.
>
> It's probably good to take a look at the Kinesis producer. Is it really
> necessary to have 500ms sleep time? What's responsible for the ~30s
> duration in snapshotState?
>
> As things stand it doesn't make sense to use checkpoint intervals < 30s
> when using the Kinesis producer.
>
> Thanks,
> Thomas
>
> On Sat, Aug 1, 2020 at 2:53 PM Roman Khachatryan 
> wrote:
>
> > Hi Thomas,
> >
> > Thanks a lot for the analysis.
> >
> > The first thing that I'd check is whether checkpoints became more
> frequent
> > with this commit (as each of them adds at least 500ms if there is at
> least
> > one not sent record, according to FlinkKinesisProducer.snapshotState).
> >
> > Can you share checkpointing statistics (1.10 vs 1.11 or last "good" vs
> > first "bad" commits)?
> >
> > On Fri, Jul 31, 2020 at 5:29 AM Thomas Weise 
> > wrote:
> >
> > > I run git bisect and the first commit that shows the regression is:
> > >
> > >
> > >
> >
> https://github.com/apache/flink/commit/355184d69a8519d29937725c8d85e8465d7e3a90
> > >
> > >
> > > On Thu, Jul 23, 2020 at 6:46 PM Kurt Young  wrote:
> > >
> > > > From my experience, java profilers are sometimes not accurate enough
> to
> > > > find out the performance regression
> > > > root cause. In this case, I would suggest you try out intel vtune
> > > amplifier
> > > > to watch more detailed metrics.
> > > >
> > > > Best,
> > > > Kurt
> > > >
> > > >
> > > > On Fri, Jul 24, 2020 at 8:51 AM Thomas Weise  wrote:
> > > >
> > > > > The cause of the issue is anything but clear.
> > > > >
> > > > > Previously I had mentioned that there is no suspect change to the
> > > Kinesis
> > > > > connector and that I had reverted the AWS SDK change to no effect.
> > > > >
> > > > > https://issues.apache.org/jira/browse/FLINK-17496 actually fixed
> > > another
> > > > > regression in the previous release and is present before and after.
> > > > >
> > > > > I repeated the run with 1.11.0 core and downgraded the entire
> Kinesis
> > > > > connector to 1.10.1: Nothing changes, i.e. the regression is still
> > > > present.
> > > > > Therefore we will need to look elsewhere for the root cause.
> > > > >
> > > > > Regarding the time spent in snapshotState, repeat runs reveal a
> wide
> > > > range
> > > > > for both versions, 1.10 and 1.11. So again this is nothing pointing
> > to
> > > a
> > > > > root cause.
> > > > >
> > > > > At this point, I have no ideas remaining other than doing a bisect
> to
> > > > find
> > > > > the culprit. Any other suggestions?
> > > > >
> > > > > Thomas
> > > > >
> > > > >
> > > > > On Thu, Jul 16, 2020 at 9:19 PM Zhijiang <
> wangzhijiang...@aliyun.com
> > >

Re: Kinesis Performance Issue (was [VOTE] Release 1.11.0, release candidate #4)

2020-08-08 Thread Roman Khachatryan
Hi Thomas,

Thanks a lot for the detailed information.

I think the problem is in CheckpointCoordinator. It stores the last
checkpoint completion time after checking queued requests.
I've created a ticket to fix this:
https://issues.apache.org/jira/browse/FLINK-18856
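A toy model of that ordering issue may help: if queued checkpoint requests are examined before the completion time is recorded, a queued request is checked against the *previous* completion timestamp and can start with effectively zero pause. This is an illustrative sketch under that assumption, not the actual CheckpointCoordinator code:

```java
// Toy model of the min-pause ordering issue (FLINK-18856).
// Names and structure are illustrative, not Flink's actual coordinator code.
import java.util.ArrayDeque;
import java.util.Queue;

public class MinPauseOrdering {
    static final long MIN_PAUSE = 10_000; // execution.checkpointing.min-pause, ms

    // Buggy order: the queued request is checked against the stale timestamp,
    // then the completion time is recorded (too late to matter).
    static int completeBuggy(Queue<String> queued, long lastCompletion, long now) {
        int triggered = 0;
        if (!queued.isEmpty() && now - lastCompletion >= MIN_PAUSE) {
            queued.poll();
            triggered++; // fires with zero pause after the checkpoint completing now
        }
        lastCompletion = now; // recorded after the check above
        return triggered;
    }

    // Fixed order: record the completion time first, then check the queue.
    static int completeFixed(Queue<String> queued, long lastCompletion, long now) {
        lastCompletion = now;
        int triggered = 0;
        if (!queued.isEmpty() && now - lastCompletion >= MIN_PAUSE) {
            queued.poll();
            triggered++;
        }
        return triggered;
    }

    public static void main(String[] args) {
        Queue<String> q1 = new ArrayDeque<>(java.util.List.of("queued"));
        Queue<String> q2 = new ArrayDeque<>(java.util.List.of("queued"));
        // previous checkpoint finished at t=0, the current one finishes at t=12s
        System.out.println(completeBuggy(q1, 0, 12_000)); // 1: queued request fires immediately
        System.out.println(completeFixed(q2, 0, 12_000)); // 0: min-pause is respected
    }
}
```

This would match the observed symptom: with a 10s min-pause and a ~30s sync phase, every completion immediately releases a queued trigger, producing more checkpoints than older versions.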


On Sat, Aug 8, 2020 at 5:25 AM Thomas Weise  wrote:

> Just another update:
>
> The duration of snapshotState is capped by the Kinesis
> producer's "RecordTtl" setting (default 30s). The sleep time in flushSync
> does not contribute to the observed behavior.
>
> I guess the open question is why, with the same settings, is 1.11 since
> commit 355184d69a8519d29937725c8d85e8465d7e3a90 processing more checkpoints?
>
>
> On Fri, Aug 7, 2020 at 9:15 AM Thomas Weise  wrote:
>
>> Hi Roman,
>>
>> Here are the checkpoint summaries for both commits:
>>
>>
>> https://docs.google.com/presentation/d/159IVXQGXabjnYJk3oVm3UP2UW_5G-TGs_u9yzYb030I/edit#slide=id.g86d15b2fc7_0_0
>>
>> The config:
>>
>> CheckpointConfig checkpointConfig = env.getCheckpointConfig();
>> checkpointConfig.setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);
>> checkpointConfig.setCheckpointInterval(*10_000*);
>> checkpointConfig.setMinPauseBetweenCheckpoints(*10_000*);
>>
>> checkpointConfig.enableExternalizedCheckpoints(DELETE_ON_CANCELLATION);
>> checkpointConfig.setCheckpointTimeout(600_000);
>> checkpointConfig.setMaxConcurrentCheckpoints(1);
>> checkpointConfig.setFailOnCheckpointingErrors(true);
>>
>> The values marked bold when changed to *60_000* make the symptom
>> disappear. I meanwhile also verified that with the 1.11.0 release commit.
>>
>> I will take a look at the sleep time issue.
>>
>> Thanks,
>> Thomas
>>
>>
>> On Fri, Aug 7, 2020 at 1:44 AM Roman Khachatryan 
>> wrote:
>>
>>> Hi Thomas,
>>>
>>> Thanks for your reply!
>>>
>>> I think you are right, we can remove this sleep and improve
>>> KinesisProducer.
>>> Probably, its snapshotState can also be sped up by forcing record
>>> flushes more often.
>>> Do you see that 30s checkpointing duration is caused by KinesisProducer
>>> (or maybe other operators)?
>>>
>>> I'd also like to understand the reason behind this increase in
>>> checkpoint frequency.
>>> Can you please share these values:
>>>  - execution.checkpointing.min-pause
>>>  - execution.checkpointing.max-concurrent-checkpoints
>>>  - execution.checkpointing.timeout
>>>
>>> And what is the "new" observed checkpoint frequency (or how many
>>> checkpoints are created) compared to older versions?
>>>
>>>
>>> On Fri, Aug 7, 2020 at 4:49 AM Thomas Weise  wrote:
>>>
>>>> Hi Roman,
>>>>
>>>> Indeed there are more frequent checkpoints with this change! The
>>>> application was configured to checkpoint every 10s. With 1.10 ("good
>>>> commit"), that leads to fewer completed checkpoints compared to 1.11
>>>> ("bad
>>>> commit"). Just to be clear, the only difference between the two runs was
>>>> the commit 355184d69a8519d29937725c8d85e8465d7e3a90
>>>>
>>>> Since the sync part of checkpoints with the Kinesis producer always
>>>> takes
>>>> ~30 seconds, the 10s configured checkpoint frequency really had no
>>>> effect
>>>> before 1.11. I confirmed that both commits perform comparably by setting
>>>> the checkpoint frequency and min pause to 60s.
>>>>
>>>> I still have to verify with the final 1.11.0 release commit.
>>>>
>>>> It's probably good to take a look at the Kinesis producer. Is it really
>>>> necessary to have 500ms sleep time? What's responsible for the ~30s
>>>> duration in snapshotState?
>>>>
>>>> As things stand it doesn't make sense to use checkpoint intervals < 30s
>>>> when using the Kinesis producer.
>>>>
>>>> Thanks,
>>>> Thomas
>>>>
>>>> On Sat, Aug 1, 2020 at 2:53 PM Roman Khachatryan <
>>>> ro...@data-artisans.com>
>>>> wrote:
>>>>
>>>> > Hi Thomas,
>>>> >
>>>> > Thanks a lot for the analysis.
>>>> >
>>>> > The first thing that I'd check is whether checkpoints became more
>>>> frequent
> >>>> > with this commit (as each of them adds at least 500ms if there is at
> >>>> > least one not sent record, according to FlinkKinesisProducer.snapshotState).

Re: Kinesis Performance Issue (was [VOTE] Release 1.11.0, release candidate #4)

2020-08-13 Thread Roman Khachatryan
Hi Thomas,

The fix is now merged to master and to release-1.11.
So if you'd like you can check if it solves your problem (it would be
helpful for us too).

On Sat, Aug 8, 2020 at 9:26 AM Roman Khachatryan 
wrote:

> Hi Thomas,
>
> Thanks a lot for the detailed information.
>
> I think the problem is in CheckpointCoordinator. It stores the last
> checkpoint completion time after checking queued requests.
> I've created a ticket to fix this:
> https://issues.apache.org/jira/browse/FLINK-18856
>
>
> On Sat, Aug 8, 2020 at 5:25 AM Thomas Weise  wrote:
>
>> Just another update:
>>
>> The duration of snapshotState is capped by the Kinesis
>> producer's "RecordTtl" setting (default 30s). The sleep time in flushSync
>> does not contribute to the observed behavior.
>>
>> I guess the open question is why, with the same settings, is 1.11 since
>> commit 355184d69a8519d29937725c8d85e8465d7e3a90 processing more checkpoints?
>>
>>
>> On Fri, Aug 7, 2020 at 9:15 AM Thomas Weise  wrote:
>>
>>> Hi Roman,
>>>
>>> Here are the checkpoint summaries for both commits:
>>>
>>>
>>> https://docs.google.com/presentation/d/159IVXQGXabjnYJk3oVm3UP2UW_5G-TGs_u9yzYb030I/edit#slide=id.g86d15b2fc7_0_0
>>>
>>> The config:
>>>
>>> CheckpointConfig checkpointConfig = env.getCheckpointConfig();
>>>
>>> checkpointConfig.setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);
>>> checkpointConfig.setCheckpointInterval(*10_000*);
>>> checkpointConfig.setMinPauseBetweenCheckpoints(*10_000*);
>>>
>>> checkpointConfig.enableExternalizedCheckpoints(DELETE_ON_CANCELLATION);
>>> checkpointConfig.setCheckpointTimeout(600_000);
>>> checkpointConfig.setMaxConcurrentCheckpoints(1);
>>> checkpointConfig.setFailOnCheckpointingErrors(true);
>>>
>>> The values marked bold when changed to *60_000* make the symptom
>>> disappear. I meanwhile also verified that with the 1.11.0 release commit.
>>>
>>> I will take a look at the sleep time issue.
>>>
>>> Thanks,
>>> Thomas
>>>
>>>
>>> On Fri, Aug 7, 2020 at 1:44 AM Roman Khachatryan <
>>> ro...@data-artisans.com> wrote:
>>>
>>>> Hi Thomas,
>>>>
>>>> Thanks for your reply!
>>>>
>>>> I think you are right, we can remove this sleep and improve
>>>> KinesisProducer.
>>>> Probably, its snapshotState can also be sped up by forcing record
>>>> flushes more often.
>>>> Do you see that 30s checkpointing duration is caused by KinesisProducer
>>>> (or maybe other operators)?
>>>>
>>>> I'd also like to understand the reason behind this increase in
>>>> checkpoint frequency.
>>>> Can you please share these values:
>>>>  - execution.checkpointing.min-pause
>>>>  - execution.checkpointing.max-concurrent-checkpoints
>>>>  - execution.checkpointing.timeout
>>>>
>>>> And what is the "new" observed checkpoint frequency (or how many
>>>> checkpoints are created) compared to older versions?
>>>>
>>>>
>>>> On Fri, Aug 7, 2020 at 4:49 AM Thomas Weise  wrote:
>>>>
>>>>> Hi Roman,
>>>>>
>>>>> Indeed there are more frequent checkpoints with this change! The
>>>>> application was configured to checkpoint every 10s. With 1.10 ("good
>>>>> commit"), that leads to fewer completed checkpoints compared to 1.11
>>>>> ("bad
>>>>> commit"). Just to be clear, the only difference between the two runs
>>>>> was
>>>>> the commit 355184d69a8519d29937725c8d85e8465d7e3a90
>>>>>
>>>>> Since the sync part of checkpoints with the Kinesis producer always
>>>>> takes
>>>>> ~30 seconds, the 10s configured checkpoint frequency really had no
>>>>> effect
>>>>> before 1.11. I confirmed that both commits perform comparably by
>>>>> setting
>>>>> the checkpoint frequency and min pause to 60s.
>>>>>
>>>>> I still have to verify with the final 1.11.0 release commit.
>>>>>
>>>>> It's probably good to take a look at the Kinesis producer. Is it really
>>>>> necessary to have 500ms sleep time? What's responsible for the ~30s
> >>>>> duration in snapshotState?

Re: Re: [ANNOUNCE] New PMC member: Arvid Heise

2021-06-16 Thread Roman Khachatryan
Congratulations!

Regards,
Roman

On Thu, Jun 17, 2021 at 5:56 AM Xingbo Huang  wrote:
>
> Congratulations, Arvid!
>
> Best,
> Xingbo
>
> > Yun Tang  wrote on Thu, Jun 17, 2021, at 10:49 AM:
>
> > Congratulations, Arvid
> >
> > Best
> > Yun Tang
> > 
> > From: Yun Gao 
> > Sent: Thursday, June 17, 2021 10:46
> > To: Jingsong Li ; dev 
> > Subject: Re: Re: [ANNOUNCE] New PMC member: Arvid Heise
> >
> > Congratulations, Arvid!
> >
> > Best,
> > Yun
> >
> >
> > --
> > Sender:Jingsong Li
> > Date:2021/06/17 10:41:29
> > Recipient:dev
> > Theme:Re: [ANNOUNCE] New PMC member: Arvid Heise
> >
> > Congratulations, Arvid!
> >
> > Best,
> > Jingsong
> >
> > On Thu, Jun 17, 2021 at 6:52 AM Matthias J. Sax  wrote:
> >
> > > Congrats!
> > >
> > > On 6/16/21 6:06 AM, Leonard Xu wrote:
> > > > Congratulations, Arvid!
> > > >
> > > >
> > > >> On Jun 16, 2021, at 20:08, Till Rohrmann  wrote:
> > > >>
> > > >> Congratulations, Arvid!
> > > >>
> > > >> Cheers,
> > > >> Till
> > > >>
> > > >> On Wed, Jun 16, 2021 at 1:47 PM JING ZHANG 
> > > wrote:
> > > >>
> > > >>> Congratulations, Arvid!
> > > >>>
> > > >>> Nicholas Jiang  wrote on Wed, Jun 16, 2021, at 7:25 PM:
> > > >>>
> > >  Congratulations, Arvid!
> > > 
> > > 
> > > 
> > >  --
> > >  Sent from:
> > > >>> http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/
> > > 
> > > >>>
> > > >
> > >
> >
> >
> > --
> > Best, Jingsong Lee
> >
> >


Re: Re: [ANNOUNCE] New PMC member: Xintong Song

2021-06-18 Thread Roman Khachatryan
Congratulations!

Regards,
Roman

On Fri, Jun 18, 2021 at 4:40 AM Yu Li  wrote:
>
> Congratulations, Xintong!
>
> Best Regards,
> Yu
>
>
> On Thu, 17 Jun 2021 at 15:23, Yuan Mei  wrote:
>
> > Congratulations, Xintong :-)
> >
> > On Thu, Jun 17, 2021 at 11:57 AM Xingbo Huang  wrote:
> >
> > > Congratulations, Xintong!
> > >
> > > Best,
> > > Xingbo
> > >
> > > Yun Gao  wrote on Thu, Jun 17, 2021, at 10:46 AM:
> > >
> > > > Congratulations, Xintong!
> > > >
> > > > Best,
> > > > Yun
> > > >
> > > >
> > > > --
> > > > Sender:Jingsong Li
> > > > Date:2021/06/17 10:41:22
> > > > Recipient:dev
> > > > Theme:Re: [ANNOUNCE] New PMC member: Xintong Song
> > > >
> > > > Congratulations, Xintong!
> > > >
> > > > Best,
> > > > Jingsong
> > > >
> > > > On Thu, Jun 17, 2021 at 10:26 AM Yun Tang  wrote:
> > > >
> > > > > Congratulations, Xintong!
> > > > >
> > > > > Best
> > > > > Yun Tang
> > > > > 
> > > > > From: Leonard Xu 
> > > > > Sent: Wednesday, June 16, 2021 21:05
> > > > > To: dev (dev@flink.apache.org) 
> > > > > Subject: Re: [ANNOUNCE] New PMC member: Xintong Song
> > > > >
> > > > >
> > > > > Congratulations, Xintong!
> > > > >
> > > > >
> > > > > Best,
> > > > > Leonard
> > > > > > On Jun 16, 2021, at 20:07, Till Rohrmann  wrote:
> > > > > >
> > > > > > Congratulations, Xintong!
> > > > > >
> > > > > > Cheers,
> > > > > > Till
> > > > > >
> > > > > > On Wed, Jun 16, 2021 at 1:47 PM JING ZHANG 
> > > > wrote:
> > > > > >
> > > > > >> Congratulations, Xintong!
> > > > > >>
> > > > > >>
> > > > > >> Jiayi Liao  wrote on Wed, Jun 16, 2021, at 7:30 PM:
> > > > > >>
> > > > > 
> > > > >  <
> > http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/>
> > > > >  Congratulations Xintong!
> > > > > 
> > > > >  On Wed, Jun 16, 2021 at 7:24 PM Nicholas Jiang <
> > > programg...@163.com
> > > > >
> > > > >  wrote:
> > > > > 
> > > > > > Congratulations, Xintong!
> > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > Sent from:
> > > > > >
> > http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/
> > > > > 
> > > > > 
> > > > > >>>
> > > > > >>
> > > > >
> > > > >
> > > >
> > > > --
> > > > Best, Jingsong Lee
> > > >
> > > >
> > >
> >


Re: [VOTE] Migrating Test Framework to JUnit 5

2021-07-05 Thread Roman Khachatryan
+1 (binding)

Regards,
Roman

On Mon, Jul 5, 2021 at 9:47 AM Chesnay Schepler  wrote:
>
> +1 (binding)
>
> On 05/07/2021 09:45, Arvid Heise wrote:
> > +1 (binding)
> >
> > On Wed, Jun 30, 2021 at 5:56 PM Qingsheng Ren  wrote:
> >
> >> Dear devs,
> >>
> >>
> >> As discussed in the thread[1], I’d like to start a vote on migrating the
> >> test framework of Flink project to JUnit 5.
> >>
> >>
> >> JUnit 5[2] provides many exciting new features that can simplify the
> >> development of test cases and make our test structure cleaner, such as
> >> pluggable extension models (replacing rules such as
> >> TestLogger/MiniCluster), improved parameterized test, annotation support
> >> and nested/dynamic test.
> >>
> >>
> >> The migration path towards JUnit 5 would be:
> >>
> >>
> >> 1. Remove JUnit 4 dependency and introduce junit5-vintage-engine in the
> >> project
> >>
> >>
> >>  The vintage engine will keep all existing JUnit4-style cases still
> >> valid. Since classes of JUnit 4 and 5 are located under different packages,
> >> there won’t be conflict having JUnit 4 cases in the project.
> >>
> >>
> >> 2. Rewrite JUnit 4 rules in JUnit 5 extension style (~10 rules)
> >>
> >>
> >> 3. Migrate all existing tests to JUnit 5
> >>
> >>
> >>  This would be a giant commit similar to code reformatting and could be
> >> merged after cutting the 1.14 release branch. There are many migration
> >> examples and experiences to refer to, also Intellij IDEA provides tools for
> >> refactoring.
> >>
> >>
> >> 4. Ban JUnit 4 imports in CheckStyle
> >>
> >>
> >>  Some modules like Testcontainers still require JUnit 4 in the
> >> classpath, and JUnit 4 could still appear as a transitive dependency. Banning
> >> JUnit 4 imports can avoid developers mistakenly using JUnit 4 and split the
> >> project into 4 & 5 again.
> >>
> >>
> >> 5. Remove vintage runner and some cleanup
> >>
> >>
> >>
> >> This vote will last for at least 72 hours, following the consensus voting
> >> process.
> >>
> >>
> >> Thanks!
> >>
> >>
> >> [1]
> >>
> >> https://lists.apache.org/thread.html/r6c8047c7265b8a9f2cb3ef6d6153dd80b94d36ebb03daccf36ab4940%40%3Cdev.flink.apache.org%3E
> >>
> >> [2] https://junit.org/junit5
> >>
> >> --
> >> Best Regards,
> >>
> >> *Qingsheng Ren*
> >>
> >> Email: renqs...@gmail.com
> >>
>


Re: [ANNOUNCE] New Apache Flink Committer - Yang Wang

2021-07-07 Thread Roman Khachatryan
Congrats!

Regards,
Roman


On Wed, Jul 7, 2021 at 8:28 AM Qingsheng Ren  wrote:
>
> Congratulations Yang!
>
> --
> Best Regards,
>
> Qingsheng Ren
> Email: renqs...@gmail.com
> On Jul 7, 2021, 2:26 PM +0800, Rui Li , wrote:
> > Congratulations Yang ~
> >
> > On Wed, Jul 7, 2021 at 1:01 PM Benchao Li  wrote:
> >
> > > Congratulations!
> > >
> > > > Peter Huang  wrote on Wed, Jul 7, 2021, at 12:54 PM:
> > >
> > > > Congratulations, Yang.
> > > >
> > > > Best Regards
> > > > Peter Huang
> > > >
> > > > On Tue, Jul 6, 2021 at 9:48 PM Dian Fu  wrote:
> > > >
> > > > > Congratulations, Yang,
> > > > >
> > > > > Regards,
> > > > > Dian
> > > > >
> > > > > > On Jul 7, 2021, at 10:46 AM, Jary Zhen  wrote:
> > > > > >
> > > > > > Congratulations, Yang Wang.
> > > > > >
> > > > > > Best
> > > > > > Jary
> > > > > >
> > > > > > Yun Gao  wrote on Wed, Jul 7, 2021, at 10:38 AM:
> > > > > >
> > > > > > > Congratulations Yang!
> > > > > > >
> > > > > > > Best,
> > > > > > > Yun
> > > > > > >
> > > > > > >
> > > > > > > --
> > > > > > > Sender:Jark Wu
> > > > > > > Date:2021/07/07 10:20:27
> > > > > > > Recipient:dev
> > > > > > > Cc:Yang Wang; 
> > > > > > > Theme:Re: [ANNOUNCE] New Apache Flink Committer - Yang Wang
> > > > > > >
> > > > > > > Congratulations Yang Wang!
> > > > > > >
> > > > > > > Best,
> > > > > > > Jark
> > > > > > >
> > > > > > > On Wed, 7 Jul 2021 at 10:09, Xintong Song 
> > > > > wrote:
> > > > > > >
> > > > > > > > Hi everyone,
> > > > > > > >
> > > > > > > > On behalf of the PMC, I'm very happy to announce Yang Wang as a 
> > > > > > > > new
> > > > > Flink
> > > > > > > > committer.
> > > > > > > >
> > > > > > > > Yang has been a very active contributor for more than two years,
> > > > mainly
> > > > > > > > focusing on Flink's deployment components. He's a main 
> > > > > > > > contributor
> > > > and
> > > > > > > > maintainer of Flink's native Kubernetes deployment and native
> > > > > Kubernetes
> > > > > > > > HA. He's also very active on the mailing lists, participating in
> > > > > > > > discussions and helping with user questions.
> > > > > > > >
> > > > > > > > Please join me in congratulating Yang Wang for becoming a Flink
> > > > > > > committer!
> > > > > > > >
> > > > > > > > Thank you~
> > > > > > > >
> > > > > > > > Xintong Song
> > > > > > > >
> > > > > > >
> > > > > > >
> > > > >
> > > > >
> > > >
> > >
> > >
> > > --
> > >
> > > Best,
> > > Benchao Li
> > >
> >
> >
> > --
> > Best regards!
> > Rui Li


Re: [ANNOUNCE] New PMC member: Guowei Ma

2021-07-07 Thread Roman Khachatryan
Congratulations!

Regards,
Roman

On Wed, Jul 7, 2021 at 8:24 AM Rui Li  wrote:
>
> Congratulations Guowei!
>
> On Wed, Jul 7, 2021 at 1:01 PM Benchao Li  wrote:
>
> > Congratulations!
> >
> > Dian Fu  wrote on Wed, Jul 7, 2021, at 12:46 PM:
> >
> > > Congratulations, Guowei!
> > >
> > > Regards,
> > > Dian
> > >
> > > > On Jul 7, 2021, at 10:37 AM, Yun Gao  wrote:
> > > >
> > > > Congratulations Guowei!
> > > >
> > > >
> > > > Best,
> > > > Yun
> > > >
> > > >
> > > > --
> > > > Sender:JING ZHANG
> > > > Date:2021/07/07 10:33:51
> > > > Recipient:dev
> > > > Theme:Re: [ANNOUNCE] New PMC member: Guowei Ma
> > > >
> > > > Congratulations,  Guowei Ma!
> > > >
> > > > Best regards,
> > > > JING ZHANG
> > > >
> > > >> Zakelly Lan  wrote on Wed, Jul 7, 2021, at 10:30 AM:
> > > >
> > > >> Congratulations, Guowei!
> > > >>
> > > >> Best,
> > > >> Zakelly
> > > >>
> > > >> On Wed, Jul 7, 2021 at 10:24 AM tison  wrote:
> > > >>
> > > >>> Congrats! NB.
> > > >>>
> > > >>> Best,
> > > >>> tison.
> > > >>>
> > > >>>
> > > >>> Jark Wu  wrote on Wed, Jul 7, 2021, at 10:20 AM:
> > > >>>
> > >  Congratulations Guowei!
> > > 
> > >  Best,
> > >  Jark
> > > 
> > >  On Wed, 7 Jul 2021 at 09:54, XING JIN 
> > > wrote:
> > > 
> > > > Congratulations, Guowei~ !
> > > >
> > > > Best,
> > > > Jin
> > > >
> > > >> Xintong Song  wrote on Wed, Jul 7, 2021, at 9:37 AM:
> > > >
> > > >> Congratulations, Guowei~!
> > > >>
> > > >> Thank you~
> > > >>
> > > >> Xintong Song
> > > >>
> > > >>
> > > >>
> > > >> On Wed, Jul 7, 2021 at 9:31 AM Qingsheng Ren 
> > >  wrote:
> > > >>
> > > >>> Congratulations Guowei!
> > > >>>
> > > >>> --
> > > >>> Best Regards,
> > > >>>
> > > >>> Qingsheng Ren
> > > >>> Email: renqs...@gmail.com
> > > >>> On Jul 7, 2021, 09:30 +0800, Leonard Xu  wrote:
> > >  Congratulations! Guowei Ma
> > > 
> > >  Best,
> > >  Leonard
> > > 
> > > > > On Jul 6, 2021, at 21:56, Kurt Young  wrote:
> > > >
> > > > Hi all!
> > > >
> > > > I'm very happy to announce that Guowei Ma has joined the
> > > >> Flink
> > >  PMC!
> > > >
> > > > Congratulations and welcome Guowei!
> > > >
> > > > Best,
> > > > Kurt
> > > 
> > > >>>
> > > >>
> > > >
> > > 
> > > >>>
> > > >>
> > > >
> > >
> > >
> >
> > --
> >
> > Best,
> > Benchao Li
> >
>
>
> --
> Best regards!
> Rui Li


Re: [ANNOUNCE] New Apache Flink Committer - Yuan Mei

2021-07-08 Thread Roman Khachatryan
Congratulations Yuan!

Regards,
Roman

On Thu, Jul 8, 2021 at 6:02 AM Yang Wang  wrote:
>
> Congratulations Yuan!
>
> Best,
> Yang
>
> > XING JIN  wrote on Thu, Jul 8, 2021, at 11:46 AM:
>
> > Congratulations Yuan~!
> >
> > Roc Marshal  wrote on Thu, Jul 8, 2021, at 11:28 AM:
> >
> > > Congratulations, Yuan!
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > > At 2021-07-08 01:21:40, "Yu Li"  wrote:
> > > >Hi all,
> > > >
> > > >On behalf of the PMC, I’m very happy to announce Yuan Mei as a new Flink
> > > >committer.
> > > >
> > > >Yuan has been an active contributor for more than two years, with code
> > > >contributions on multiple components including kafka connectors,
> > > >checkpointing, state backends, etc. Besides, she has been actively
> > > involved
> > > >in community activities such as helping manage releases, discussing
> > > >questions on dev@list, supporting users and giving talks at
> > conferences.
> > > >
> > > >Please join me in congratulating Yuan for becoming a Flink committer!
> > > >
> > > >Cheers,
> > > >Yu
> > >
> >


Re: [DISCUSS] FLIP-191: Extend unified Sink interface to support small file compaction

2021-11-09 Thread Roman Khachatryan
Hi everyone,

Thanks for the proposal and the discussion, I have some remarks:
(I'm not very familiar with the new Sink API but I thought about the
same problem in the context of the changelog state backend)

1. Merging artifacts from multiple checkpoints would apparently
require multiple concurrent checkpoints (otherwise, a new checkpoint
won't be started before completing the previous one; and the previous
one can't be completed before durably storing the artifacts). However,
concurrent checkpoints are currently not supported with Unaligned
checkpoints (this is besides increasing e2e-latency).

2. Asynchronous merging in an aggregator would require some resolution
logic on recovery, so that a merged artifact can be used if the
original one was deleted. Otherwise, wouldn't recovery fail because
some artifacts are missing?
We could also defer deletion until the "compacted" checkpoint is
subsumed - but isn't it too late, as it will be deleted anyways once
subsumed?

3. Writing small files, then reading and merging them for *every*
checkpoint seems worse than only reading them on recovery. I guess I'm
missing some cases of reading, so to me it would make sense to mention
these cases explicitly in the FLIP motivation section.

4. One way to avoid write-read-merge is by wrapping SinkWriter with
another one, which would buffer input elements in a temporary storage
(e.g. local file) until a threshold is reached; after that, it would
invoke the original SinkWriter. And if a checkpoint barrier comes in
earlier, it would send written data to some aggregator. It will
increase checkpoint delay (async phase) compared to the current Flink;
but not compared to the write-read-merge solution, IIUC.
Then such "BufferingSinkWriters" could aggregate input elements from
each other, potentially recursively (I mean something like
https://cwiki.apache.org/confluence/download/attachments/173082889/DSTL-DFS-DAG.png
)
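
Idea (4) could be sketched roughly as follows. The interfaces here are simplified stand-ins for the Sink API, and the names are made up for illustration, not the actual FLIP-191 design:

```java
// Sketch of idea (4): a wrapper that buffers elements until a size threshold
// and ships undersized buffers to an aggregator on a checkpoint barrier,
// instead of writing them out as small files.
// Simplified stand-ins, not Flink's actual Sink interfaces.
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

public class BufferingSinkWriter<T> {
    private final Consumer<List<T>> delegate;   // the original SinkWriter
    private final Consumer<List<T>> aggregator; // receives undersized buffers
    private final int threshold;
    private final List<T> buffer = new ArrayList<>();

    public BufferingSinkWriter(Consumer<List<T>> delegate,
                               Consumer<List<T>> aggregator, int threshold) {
        this.delegate = delegate;
        this.aggregator = aggregator;
        this.threshold = threshold;
    }

    public void write(T element) {
        buffer.add(element);
        if (buffer.size() >= threshold) {       // enough data: write directly
            delegate.accept(new ArrayList<>(buffer));
            buffer.clear();
        }
    }

    // On a checkpoint barrier: don't produce a small file; ship the buffer to
    // an aggregator that can merge buffers from several subtasks (recursively).
    public void onCheckpointBarrier() {
        if (!buffer.isEmpty()) {
            aggregator.accept(new ArrayList<>(buffer));
            buffer.clear();
        }
    }

    public static void main(String[] args) {
        List<List<Integer>> written = new ArrayList<>();
        List<List<Integer>> aggregated = new ArrayList<>();
        BufferingSinkWriter<Integer> w =
                new BufferingSinkWriter<>(written::add, aggregated::add, 3);
        for (int i = 0; i < 5; i++) w.write(i);
        w.onCheckpointBarrier();
        System.out.println(written);    // [[0, 1, 2]]
        System.out.println(aggregated); // [[3, 4]]
    }
}
```

The trade-off this sketch makes explicit: large-enough batches never touch the aggregator, while barrier-cut remainders are merged elsewhere rather than read back and rewritten on every checkpoint.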

5. Reducing the number of files by reducing aggregator parallelism as
opposed to merging on reaching size threshold will likely be less
optimal and more difficult to configure. OTOH, thresholds might be more
difficult to implement and (with recursive merging) would incur higher
latency. Maybe that's also something to decide explicitly or at least
mention in the FLIP.



Regards,
Roman


On Tue, Nov 9, 2021 at 5:23 AM Reo Lei  wrote:
>
> Hi Fabian,
>
> Thanks for drafting the FLIP and trying to support small file compaction. I
> think this feature is very urgent and valuable for users(at least for me).
>
> Currently I am trying to support streaming rewrite (compact) for Iceberg on
> PR#3323. As Steven mentioned,
> Iceberg sinks and compacts data through the following steps:
> Step-1: Some parallel data writers (sinkers) write streaming data as files.
> Step-2: A single-parallelism data file committer commits the completed
> files as soon as possible to make them available.
> Step-3: Some parallel file rewriters (compactors) collect committed files
> from multiple checkpoints and rewrite (compact) them together once the
> total file size or number of files reaches the threshold.
> Step-4: A single-parallelism rewrite (compact) result committer commits
> the rewritten (compacted) files to replace the old files and make them
> available.
>
>
> If Flink wants to support small file compaction, some key points I think are
> necessary:
>
> 1. Compact files from multiple checkpoints.
> I totally agree with Jingsong, because the completed file size usually could
> not reach the threshold in a single checkpoint. Especially for a partitioned
> table, we need to compact the files of each partition, but usually the file
> size of each partition will differ and may not reach the merge
> threshold. If we compact these files within a single checkpoint, regardless of
> whether the total file size reaches the threshold, then the value of
> compacting is diminished and we will still get small files, because
> these compacted files do not reach the target size. So we need the
> compactor to collect committed files from multiple checkpoints and compact
> them once they reach the threshold.
>
> 2. Separate write phase and compact phase.
> Users usually hope the data becomes available as soon as possible, and the
> end-to-end latency is very important. I think we need to separate the
> write and compact phases. The write phase includes Step-1
> and Step-2: we sink data as files and commit them per checkpoint, regardless
> of their file size. That ensures the data will be
> available ASAP. The compact phase includes Step-3
> and Step-4: the compactor should collect committed files from multiple
> checkpoints and compact them asynchronously once they reach the threshold,
> and the compact committer will commit the compaction result in the next
> checkpoint. We compact the committed files asynchronously because we don't
> want the compaction to affect the dat

Re: [ANNOUNCE] New Apache Flink Committer - Fabian Paul

2021-11-15 Thread Roman Khachatryan
Congratulations Fabian!

Regards,
Roman

On Mon, Nov 15, 2021 at 3:26 PM Dawid Wysakowicz  wrote:
>
> Congratulations Fabian!
>
> On 15/11/2021 15:22, Marios Trivyzas wrote:
> > Congrats Fabian!
> >
> > On Mon, Nov 15, 2021 at 3:02 PM Francesco Guardiani 
> > 
> > wrote:
> >
> >> Congratulations Fabian!
> >>
> >> On Mon, Nov 15, 2021 at 2:29 PM Yun Gao 
> >> wrote:
> >>
> >>> Congratulations Fabian!
> >>>
> >>> Best,
> >>> Yun
> >>>  --Original Mail --
> >>> Sender:Ingo Bürk 
> >>> Send Date:Mon Nov 15 21:27:36 2021
> >>> Recipients:dev 
> >>> CC: , 
> >>> Subject:Re: [ANNOUNCE] New Apache Flink Committer - Fabian Paul
> >>> Congratulations, Fabian!
> >>>
> >>> On Mon, Nov 15, 2021 at 2:17 PM Arvid Heise  wrote:
> >>>
>  Hi everyone,
> 
>  On behalf of the PMC, I'm very happy to announce Fabian Paul as a new
> >>> Flink
>  committer.
> 
>  Fabian Paul has been actively improving the connector ecosystem by
>  migrating Kafka and ElasticSearch to the Sink interface and is
> >> currently
>  driving FLIP-191 [1] to tackle the sink compaction issue. While he is
>  active on the project (authored 70 PRs and reviewed 60), it's also
> >> worth
>  highlighting that he has also been guiding external efforts, such as
> >> the
>  DeltaLake Flink connector or the Pinot sink in Bahir.
> 
>  Please join me in congratulating Fabian for becoming a Flink committer!
> 
>  Best,
> 
>  Arvid
> 
>  [1]
> 
> 
> >> https://cwiki.apache.org/confluence/display/FLINK/FLIP-191%3A+Extend+unified+Sink+interface+to+support+small+file+compaction
> >
>


Re: [DISCUSS] FLIP-193: Snapshots ownership

2021-11-22 Thread Roman Khachatryan
Hi,

Thanks for the proposal Dawid, I have some questions and remarks:

1. How will stop-with-savepoint be handled?
Shouldn't side effects be enforced in this case? (i.e. send
notifyCheckpointComplete)

2. Instead of forcing re-upload, can we "inverse control" in no-claim mode?
Anyway, any external tool will have to poll the Flink API waiting for the
next (full) checkpoint before deleting the retained checkpoint,
right?
Instead, we can provide an API which tells whether the 1st checkpoint
is still in use (and not force re-uploading it).

Under the hood, it can work like this:
- for the checkpoint Flink recovers from, remember all shared state
handles it is adding
- when unregistering shared state handles, remove them from the set above
- when the set becomes empty the 1st checkpoint can be deleted externally

Besides not requiring re-upload, it seems much simpler and less invasive.
On the downside, state deletion can be delayed; but I think this is a
reasonable trade-off.
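A minimal sketch of that bookkeeping (hypothetical Python, not actual SharedStateRegistry code; names are invented):

```python
class InitialCheckpointTracker:
    """Tracks the shared state handles added by the checkpoint Flink
    recovered from. Once all of them have been unregistered, the
    retained checkpoint can be deleted externally."""

    def __init__(self, initial_handles):
        # handles the restored checkpoint contributed to the registry
        self.in_use = set(initial_handles)

    def unregister(self, handle):
        # called whenever the shared state registry drops a handle;
        # handles from newer checkpoints are simply not in the set
        self.in_use.discard(handle)

    def initial_checkpoint_deletable(self):
        return not self.in_use
```

An API endpoint would then just expose `initial_checkpoint_deletable()` for external tooling to poll.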

3. Alternatively, re-upload not necessarily on 1st checkpoint, but
after a configured number of checkpoints?
There is a high chance that after some more checkpoints, initial state
will not be used (because of compaction),
so backends won't have to re-upload anything (or small part).

4. Re-uploaded artifacts must not be deleted on checkpoint abortion.
This should be addressed in https://issues.apache.org/jira/browse/FLINK-24611.
If not, I think the FLIP should consider this case.

5. Enforcing re-upload by a single task and Changelog state backend
With Changelog state backend, a file can be shared by multiple operators.
Therefore, getIntersection() is irrelevant here, because operators
might not be sharing any key groups.
(so we'll have to analyze "raw" file usage I think).

6. Enforcing re-upload by a single task and skew
If we use some greedy logic, like subtask 0 always re-uploading, then it
might be overloaded.
So we'll have to obtain a full list of subtasks first (then probably
choose randomly or round-robin).
However, that requires rebuilding the Task snapshot, which is doable but
not trivial (which I think supports the "reverse API" option).

7. I think it would be helpful to list file systems / object stores
that support "fast" copy (ideally with latency numbers).

Regards,
Roman

On Mon, Nov 22, 2021 at 9:24 AM Yun Gao  wrote:
>
> Hi,
>
> Very thanks Dawid for proposing the FLIP to clarify the ownership for the
> states. +1 for the overall changes since it makes the behavior clear and
> provide users a determined method to finally cleanup savepoints / retained 
> checkpoints.
>
> Regarding the changes to the public interface, it seems currently the changes 
> are all bound
> to the savepoint, but from the FLIP it seems perhaps we might also need to 
> support the claim declaration
> for retained checkpoints like in the cli side[1] ? If so, then might it be 
> better to change the option name
> from `execution.savepoint.restore-mode` to something like 
> `execution.restore-mode`?
>
> Best,
> Yun
>
>
> [1] 
> https://nightlies.apache.org/flink/flink-docs-master/docs/ops/state/checkpoints/#resuming-from-a-retained-checkpoint
>
>
> --
> From:Konstantin Knauf 
> Send Time:2021 Nov. 19 (Fri.) 16:00
> To:dev 
> Subject:Re: [DISCUSS] FLIP-193: Snapshots ownership
>
> Hi Dawid,
>
> Thanks for working on this FLIP. Clarifying the differences and
> guarantees around savepoints and checkpoints will make it easier and safer
> for users and downstream projects and platforms to work with them.
>
> +1 to the changing the current (undefined) behavior when recovering from
> retained checkpoints. Users can now choose between claiming and not
> claiming, which I think will make the current mixed behavior obsolete.
>
> Cheers,
>
> Konstantin
>
> On Fri, Nov 19, 2021 at 8:19 AM Dawid Wysakowicz 
> wrote:
>
> > Hi devs,
> >
> > I'd like to bring up for a discussion a proposal to clean up ownership
> > of snapshots, both checkpoints and savepoints.
> >
> > The goal here is to make it clear who is responsible for deleting
> > checkpoint/savepoint files and when that can be done in a safe manner.
> >
> > Looking forward to your feedback!
> >
> > Best,
> >
> > Dawid
> >
> > [1] https://cwiki.apache.org/confluence/x/bIyqCw
> >
> >
> >
>
> --
>
> Konstantin Knauf
>
> https://twitter.com/snntrable
>
> https://github.com/knaufk
>


Re: [DISCUSS] FLIP-193: Snapshots ownership

2021-11-22 Thread Roman Khachatryan
Thanks Dawid,

Regarding clarity,
I think that all proposals require waiting for some event: re-upload /
checkpoint completion / API response.
But with the current one, there is an assumption: "initial checkpoint
can be deleted once a new one completes" (instead of just "initial
checkpoint can be deleted once the API says it can be deleted").
So I think it's actually clearer to offer this explicit API and rely on it.

Regarding delaying the deletion,
I agree that it can delay deletion, but how important is it?
Checkpoints are usually stored on relatively cheap storage like S3, so
some delay shouldn't be an issue (especially taking rounding into
account); it can even be cheaper or comparable to paying for
re-upload/duplicate calls.

Infinite delay can be an issue though, I agree.
Maybe @Yun can clarify the likelihood of never deleting some SST files
by RocksDB?
For the changelog backend, old files won't be used once
materialization succeeds.

Yes, my concern is checkpointing time, but also added complexity:
> It would be a bit invasive though, as we would have to somehow keep track 
> which files should not be reused on TMs.
I think we need this anyway if we choose to re-upload files once the
job is running.
The new checkpoint must be formed by re-uploaded old artifacts AND
uploaded new artifacts.


Regards,
Roman


On Mon, Nov 22, 2021 at 12:42 PM Dawid Wysakowicz
 wrote:
>
> @Yun
>
> I think it is a good comment, with which I agree in principle. However, we use 
> --fromSavepoint (cli), savepointPath (REST API), SavepointRestoreSettings for 
> both restoring from a savepoint and an externalized checkpoint already. I 
> wanted to voice that concern. Nevertheless I am fine with changing it to 
> execution.restore-mode, if there are no other comments on that matter, I will 
> change it.
>
> @Roman:
>
> Re 1. Correct, stop-with-savepoint should commit side-effects. Will add that 
> to the doc.
>
> Re.2 What I don't like about this counter proposal is that it still has no 
> clearly defined point in time when it is safe to delete the original 
> checkpoint. Users would have a hard time reasoning about it and debugging. 
> Even worse, I think worst case it might never happen that all the original 
> files are no longer in use (I am not too familiar with RocksDB compaction, 
> but what happens if there are key ranges that are never accessed again?) I 
> agree it is unlikely, but possible, isn't it? Definitely it can take a 
> significant time and many checkpoints to do so.
>
> Re. 3 I believe where you are coming from is that you'd like to keep the 
> checkpointing time minimal and reuploading files may increase it. The 
> proposal so far builds on the assumption we could in most cases use a cheap 
> duplicate API instead of re-upload. I could see this as a follow-up if it 
> becomes a bottleneck. It would be a bit invasive though, as we would have to 
> somehow keep track which files should not be reused on TMs.
>
> Re. 2 & 3 Neither of the counter proposals work well for taking incremental 
> savepoints. We were thinking of building incremental savepoints on the same 
> concept. I think delaying the completion of an independent savepoint to a 
> closer undefined future is not a nice property of savepoints.
>
> Re 4. Good point. We should make sure the first completed checkpoint has the 
> independent/full checkpoint property rather than just the first triggered.
>
> Re. 5 & 6 I need a bit more time to look into it.
>
> Best,
>
> Dawid
>
> On 22/11/2021 11:40, Roman Khachatryan wrote:
>
> Hi,
>
> Thanks for the proposal Dawid, I have some questions and remarks:
>
> 1. How will stop-with-savepoint be handled?
> Shouldn't side effects be enforced in this case? (i.e. send
> notifyCheckpointComplete)
>
> 2. Instead of forcing re-upload, can we "inverse control" in no-claim mode?
> Anyways, any external tool will have to poll Flink API waiting for the
> next (full) checkpoint, before deleting the retained checkpoint,
> right?
> Instead, we can provide an API which tells whether the 1st checkpoint
> is still in use (and not force re-upload it).
>
> Under the hood, it can work like this:
> - for the checkpoint Flink recovers from, remember all shared state
> handles it is adding
> - when unregistering shared state handles, remove them from the set above
> - when the set becomes empty the 1st checkpoint can be deleted externally
>
> Besides not requiring re-upload, it seems much simpler and less invasive.
> On the downside, state deletion can be delayed; but I think this is a
> reasonable trade-off.
>
> 3. Alternatively, re-upload not necessarily on 1st checkpoint, but
> after a configured number of checkpoints?
> There 

Re: [DISCUSS] FLIP-193: Snapshots ownership

2021-11-22 Thread Roman Khachatryan
> If you assume the 1st checkpoint needs to be "full" you know you are not 
> allowed to use any shared files.
> It's true you should know about the shared files of the previous checkpoint, 
> but e.g. RocksDB already tracks that.

I mean that the design described by the FLIP implies the following (please correct me if I'm wrong):
1. treat SST files from the initial checkpoint specially: re-upload or
send placeholder - depending on those attributes in state handle
2. (SST files from newer checkpoints are re-uploaded depending on
confirmation currently; so yes there is tracking, but it's different)
3. SharedStateRegistry must allow replacing state under the existing
key; otherwise, if a new key is used then other parallel subtasks
should learn somehow this key and use it; However, allowing
replacement must be limited to this scenario, otherwise it can lead to
previous checkpoint corruption in normal cases

Forcing a full checkpoint after completing N checkpoints instead of
immediately would only require enabling (1) after N checkpoints.
And with the "poll API until checkpoint released" approach, those
changes aren't necessary.

> There is one more fundamental issue with either of your two proposals that've 
> just came to my mind.
> What happens if you have externalized checkpoints and the job fails before 
> the initial checkpoint can be safely removed?

You start the job from the latest created checkpoint and wait for it
to be allowed for deletion. Then you can delete it and all previous
checkpoints (or am I missing something?).
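Conceptually, an external tool would then just poll that (hypothetical) API before cleaning up; a minimal sketch, with every name an assumption:

```python
import time

def wait_until_deletable(is_in_use, poll_interval=1.0, timeout=600.0):
    """Poll a hypothetical 'is the restored checkpoint still in use' API.
    Returns True once deletion is safe, False if the timeout elapses.
    `is_in_use` stands in for the actual API call."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if not is_in_use():
            return True          # safe to delete it and all earlier checkpoints
        time.sleep(poll_interval)
    return False
```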

> With tracking the shared files on JM you can not say if you can clear the 
> files after couple of checkpoints or 10s, 100s or 1000s,
> which translates into minutes/hours/days/weeks of processing.
This doesn't necessarily translate into higher cost (because of saved
RPC etc., as I mentioned above).
However, I do agree that an infinite or arbitrary high delay is unacceptable.

The added complexity above doesn't seem negligible to me (especially
in SharedStateHandle); and should therefore be weighted against those
operational disadvantages (given that the number of checkpoints to
wait is bounded in practice).

Regards,
Roman




On Mon, Nov 22, 2021 at 5:05 PM Dawid Wysakowicz  wrote:
>
> There is one more fundamental issue with either of your two proposals
> that's just come to my mind. What happens if you have externalized
> checkpoints and the job fails before the initial checkpoint can be
> safely removed? You have a situation where you have a retained
> checkpoint that was built on top of the original one. Basically ending
> in a situation we have right now that you never know when it is safe to
> delete a retained checkpoint.
>
> BTW, the intention for the "claim" mode was to support cases when users
> are concerned with the performance of the first checkpoint. In those
> cases they can claim the checkpoint and not pay the additional cost of
> the first checkpoint.
>
> Best,
>
> Dawid
>
> On 22/11/2021 14:09, Roman Khachatryan wrote:
> > Thanks Dawid,
> >
> > Regarding clarity,
> > I think that all proposals require waiting for some event: re-upload /
> > checkpoint completion / api response.
> > But with the current one, there is an assumption: "initial checkpoint
> > can be deleted once a new one completes" (instead of just "initial
> > checkpoint can be deleted once the API says it can be deleted").
> > So I think it's actually more clear to offer this explicit API and rely on 
> > it.
> >
> > Regarding delaying the deletion,
> > I agree that it can delay deletion, but how important is it?
> > Checkpoints are usually stored on relatively cheap storage like S3, so
> > some delay shouldn't be an issue (especially taking rounding into
> > account); it can even be cheaper or comparable to paying for
> > re-upload/duplicate calls.
> >
> > Infinite delay can be an issue though, I agree.
> > Maybe @Yun can clarify the likelihood of never deleting some SST files
> > by RocksDB?
> > For the changelog backend, old files won't be used once
> > materialization succeeds.
> >
> > Yes, my concern is checkpointing time, but also added complexity:
> >> It would be a bit invasive though, as we would have to somehow keep track 
> >> which files should not be reused on TMs.
> > I think we need this anyway if we choose to re-upload files once the
> > job is running.
> > The new checkpoint must be formed by re-uploaded old artifacts AND
> > uploaded new artifacts.
> >
> >
> > Regards,
> > Roman
> >
> >
> > On Mon, Nov 22, 2021 at 12:42 PM Dawid Wysakowicz
> >  wrote:
> >> @Yu

Re: [DISCUSS] FLIP-193: Snapshots ownership

2021-11-23 Thread Roman Khachatryan
l users explicitly if choose to make Flink manage the checkpoints 
> by default.
>
> Best
> Yun Tang
>
>
> On 2021/11/22 19:49:11 Dawid Wysakowicz wrote:
>
> There is one more fundamental issue with either of your two
> proposals that's just come to my mind.
> What happens if you have externalized checkpoints and the job fails
> before the initial checkpoint can be safely removed?
>
> You start the job from the latest created checkpoint and wait for it
> to be allowed for deletion. Then you can delete it, and all previous
> checkpoints (or am I missing something?)
>
>
> Let me clarify it with an example. You start with chk-42, Flink takes
> e.g. three checkpoints chk-43, chk-44, chk-45 all still reference chk-42
> files. After that it fails. We have externalized checkpoints enabled,
> therefore we have retained all checkpoints. The user starts a new program
> from, let's say, chk-45. At this point your proposal does not give the
> user any help in regard to when chk-42 can be safely removed. (This is
> also how Flink works right now).
>
> To make it even harder you can arbitrarily complicate it: 1) start a job
> from chk-44, 2) start a job from a chk-47 which depends on chk-45, 3)
> never start a job from chk-44; it is not claimed by any job, thus it is
> never deleted, and users must remember by themselves that chk-44 originated
> from chk-42, etc. Users would be forced to build a lineage system for
> checkpoints to track which checkpoints depend on each other.
>
> I mean that the design described by FLIP implies the following (PCIIW):
> 1. treat SST files from the initial checkpoint specially: re-upload or
> send placeholder - depending on those attributes in state handle
> 2. (SST files from newer checkpoints are re-uploaded depending on
> confirmation currently; so yes there is tracking, but it's different)
> 3. SharedStateRegistry must allow replacing state under the existing
> key; otherwise, if a new key is used then other parallel subtasks
> should learn somehow this key and use it; However, allowing
> replacement must be limited to this scenario, otherwise it can lead to
> previous checkpoint corruption in normal cases
>
> I might not understand your points, but I don't think FLIP implies any
> of this. The FLIP suggests to send along with the CheckpointBarrier a
> flag "force full checkpoint". Then the state backend should respect it
> and should not use any of the previous shared handles. Now let me
> explain how that would work for RocksDB incremental checkpoints.
>
>  1. Simplest approach: upload all local RocksDB files. This works
> exactly the same as the first incremental checkpoint for a fresh start.
>  2. Improvement on 1) we already do know which files were uploaded for
> the initial checkpoint. Therefore instead of uploading the local
> files that are same with files uploaded for the initial checkpoint
> we call duplicate for those files and upload just the diff.
>
> It does not require any changes to the SharedStateRegistry nor to state
> handles, at least for RocksDB.
>
> Best,
>
> Dawid
>
>
> On 22/11/2021 19:33, Roman Khachatryan wrote:
>
> If you assume the 1st checkpoint needs to be "full" you know you are not 
> allowed to use any shared files.
> It's true you should know about the shared files of the previous checkpoint, 
> but e.g. RocksDB already tracks that.
>
> I mean that the design described by FLIP implies the following (PCIIW):
> 1. treat SST files from the initial checkpoint specially: re-upload or
> send placeholder - depending on those attributes in state handle
> 2. (SST files from newer checkpoints are re-uploaded depending on
> confirmation currently; so yes there is tracking, but it's different)
> 3. SharedStateRegistry must allow replacing state under the existing
> key; otherwise, if a new key is used then other parallel subtasks
> should learn somehow this key and use it; However, allowing
> replacement must be limited to this scenario, otherwise it can lead to
> previous checkpoint corruption in normal cases
>
> Forcing a full checkpoint after completing N checkpoints instead of
> immediately would only require enabling (1) after N checkpoints.
> And with the "poll API until checkpoint released" approach, those
> changes aren't necessary.
>
>
> There is one more fundamental issue with either of your two proposals that've 
> just came to my mind.
> What happens if you have externalized checkpoints and the job fails before 
> the initial checkpoint can be safely removed?
>
> You start the job from the latest created checkpoint and wait for it
>

Re: [DISCUSS] Deprecate Java 8 support

2021-11-25 Thread Roman Khachatryan
The situation is probably a bit different now compared to the previous
upgrade: some users might be using Amazon Corretto (or other builds)
which have longer support.

Still +1 for deprecation to trigger migration, and thanks for bringing this up!

Regards,
Roman

On Thu, Nov 25, 2021 at 10:09 AM Arvid Heise  wrote:
>
> +1 to deprecate Java 8, so we can hopefully incorporate the module concept
> in Flink.
>
> On Thu, Nov 25, 2021 at 9:49 AM Chesnay Schepler  wrote:
>
> > Users can already use APIs from Java 8/11.
> >
> > On 25/11/2021 09:35, Francesco Guardiani wrote:
> > > +1 with what both Ingo and Matthias said; personally, I cannot wait to
> > start using some of
> > > the APIs introduced in Java 9. And I'm pretty sure that's the same for
> > our users as well.
> > >
> > > On Tuesday, 23 November 2021 13:35:07 CET Ingo Bürk wrote:
> > >> Hi everyone,
> > >>
> > >> continued support for Java 8 can also create project risks, e.g. if a
> > >> vulnerability arises in Flink's dependencies and we cannot upgrade them
> > >> because they no longer support Java 8. Some projects already started
> > >> deprecating support as well, like Kafka, and other projects will likely
> > >> follow.
> > >> Let's also keep in mind that the proposal here is not to drop support
> > right
> > >> away, but to deprecate it, send the message, and motivate users to start
> > >> migrating. Delaying this process could ironically mean users have less
> > time
> > >> to prepare for it.
> > >>
> > >>
> > >> Ingo
> > >>
> > >> On Tue, Nov 23, 2021 at 8:54 AM Matthias Pohl 
> > >>
> > >> wrote:
> > >>> Thanks for constantly driving these maintenance topics, Chesnay. +1
> > from
> > >>> my
> > >>> side for deprecating Java 8. I see the point Jingsong is raising. But I
> > >>> agree with what David already said here. Deprecating the Java version
> > is a
> > >>> tool to make users aware of it (same as starting this discussion
> > thread).
> > >>> If there's no major opposition against deprecating it in the community
> > we
> > >>> should move forward in this regard to make the users who do not
> > >>> regularly browse the mailing list aware of it. That said, deprecating
> > Java
> > >>> 8 in 1.15 does not necessarily mean that it is dropped in 1.16.
> > >>>
> > >>> Best,
> > >>> Matthias
> > >>>
> > >>> On Tue, Nov 23, 2021 at 8:46 AM David Morávek  wrote:
> >  Thank you Chesnay for starting the discussion! This will generate bit
> > of
> > >>> a
> > >>>
> >  work for some users, but it's a good thing to keep moving the project
> >  forward. Big +1 for this.
> > 
> >  Jingsong:
> > 
> >  Receiving this signal, the user may be unhappy because his application
> > 
> > > may be all on Java 8. Upgrading is a big job, after all, many systems
> > > have not been upgraded yet. (Like you said, HBase and Hive)
> >  The whole point of deprecation is to raise awareness, that this will
> > be
> >  happening eventually and users should take some steps to address this
> > in
> >  medium-term. If I understand Chesnay correctly, we'd still keep Java 8
> >  around for quite some time to give users enough time to upgrade, but
> >  without raising awareness we'd fight the very same argument later in
> > >>> time.
> > >>>
> >  All of the prerequisites from 3rd party projects for both HBase [1]
> > and
> >  Hive [2] to fully support Java 11 have been completed, so the ball is
> > on
> >  their side and there doesn't seem to be much activity. Generating bit
> > >>> more
> > >>>
> >  pressure on these efforts might be a good thing.
> > 
> >  It would be great to identify some of these users and learn bit more
> > >>> about
> > >>>
> >  their situation. Are they keeping up with latest Flink developments or
> > >>> are
> > >>>
> >  they lagging behind (this would also give them way more time for
> >  eventual
> >  upgrade)?
> > 
> >  [1] https://issues.apache.org/jira/browse/HBASE-22972
> >  [2] https://issues.apache.org/jira/browse/HIVE-22415
> > 
> >  Best,
> >  D.
> > 
> >  On Tue, Nov 23, 2021 at 3:08 AM Jingsong Li 
> > 
> >  wrote:
> > > Hi Chesnay,
> > >
> > > Thanks for bringing this for discussion.
> > >
> > > We should dig deeper into the current Java version of Flink users. At
> > > least make sure Java 8 is not a mainstream version.
> > >
> > > Receiving this signal, the user may be unhappy because his
> > application
> > > may be all on Java 8. Upgrading is a big job, after all, many systems
> > > have not been upgraded yet. (Like you said, HBase and Hive)
> > >
> > > In my opinion, it is too early to deprecate support for Java 8. We
> > > should wait for a safer point in time.
> > >
> > > On Mon, Nov 22, 2021 at 11:45 PM Ingo Bürk 
> > wrote:
> > >> Hi,
> > >>
> > >> also a +1 from me because of everything Chesnay already said.
> > >>
> > >>>

Re: [DISCUSS] FLIP-193: Snapshots ownership

2021-11-26 Thread Roman Khachatryan
Hi,

Thanks for updating the FLIP Dawid

There seems to be a consensus in the discussion; however, I couldn't
find stop-with-savepoint in the document.

A few minor things:
- maybe include "checkpoint" in mode names, i.e. --no-claim-checkpoint?
- add an explicit option to preserve the current behavior (no claim
and no duplicate)?
And I still think it would be nice to list object stores which support
duplicate operation.

Regards,
Roman


On Fri, Nov 26, 2021 at 10:37 AM Konstantin Knauf  wrote:
>
> Hi Dawid,
>
> sounds good, specifically 2., too.
>
> Best,
>
> Konstantin
>
> On Fri, Nov 26, 2021 at 9:25 AM Dawid Wysakowicz 
> wrote:
>
> > Hi all,
> >
> > I updated the FLIP with a few clarifications:
> >
> >1. I added a description how would we trigger a "full snapshot" in the
> >changelog state backend
> >   - (We would go for this option in the 1st version). Trigger a
> >   snapshot of the base state backend in the 1st checkpoint, which 
> > induces
> >   materializing the changelog. In this approach we could duplicate SST 
> > files,
> >   but we would not duplicate the diff files.
> >- Add a hook for logic for computing which task should duplicate the
> >   diff files. We would have to do a pass over all states after the state
> >   assignment in StateAssignmentOperation
> >   2. I clarified that the "no-claim" mode requires a
> >completed/successful checkpoint before we can remove the one we are
> >restoring from. Also added a note that we can assume a checkpoint is
> >completed if it is confirmed by Flink's API for checkpointing stats or by
> >checking an entry in HA services. A checkpoint can not be assumed 
> > completed
> >by just looking at the checkpoint files.
> >
> > I suggest going on with the proposal for "no-claim" as suggested so far,
> > as it is easier to understand by users. They can reliably tell when they
> > can expect the checkpoint to be deletable. If we see that the time to take
> > the 1st checkpoint becomes a problem we can extend the set of restore
> > methods and e.g. add a "claim-temporarily" method.
> >
> > I hope we can reach a consensus and start a vote, some time early next
> > week.
> >
> > Best,
> >
> > Dawid
> >
> > On 23/11/2021 22:39, Roman Khachatryan wrote:
> >
> > I also referred to the "no-claim" mode and I still think neither of them 
> > works in that mode, as you'd have to keep lineage of checkpoints externally 
> > to be able to delete any checkpoint.
> >
> > I think the lineage is needed in all approaches with arbitrary
> > histories; the difference is whether a running Flink is required or
> > not. Is that what you mean?
> > (If not, could you please explain how the scenario you mentioned above
> > with multiple jobs branching from the same checkpoint is handled?)
> >
> >
> > BTW, the state key for RocksDB is actually: backend UID + key group range + 
> > SST file name, so the key would be different (the key group range is 
> > different for two tasks) and we would've two separate counters for the same 
> > file.
> >
> > You're right. But there is also a collision between old and new entries.
> >
> >
> > To be on the same page here. It is not a problem so far in RocksDB, because 
> > we do not reuse any shared files in case of rescaling.
> >
> > As I mentioned above, collision happens not only because of rescaling;
> > and AFAIK, there are some ideas to reuse files on rescaling (probably
> > Yuan could clarify). Anyways, I think it makes sense to not bake in
> > this assumption unless it's hard to implement (or at least state it
> > explicitly in FLIP).
> >
> >
> > It is not suggested as an optimization. It is suggested as a must for state 
> > backends that need it. I did not elaborate on it, because it could affected 
> > only the changelog state backend at the moment, which I don't have much 
> > insights. I agree it might make sense to look a bit how we could force full 
> > snapshots in the changelog state backend. I will spend some extra time on 
> > that.
> >
> > I see. For the Changelog state backend, the easiest way would be to
> > obtain a full snapshot from the underlying backend in snapshot(),
> > ignoring all non-materialized changes. This will effectively
> > materialize all the changes, so only new non-materialized state will
> > be used in subsequent checkpoints.
> >
> >
> &

Re: [VOTE] FLIP-193: Snapshots ownership

2021-12-02 Thread Roman Khachatryan
+1

Thanks for driving this effort Dawid

Regards,
Roman


On Wed, Dec 1, 2021 at 2:04 PM Konstantin Knauf  wrote:
>
> Thanks, Dawid.
>
> +1
>
> On Wed, Dec 1, 2021 at 1:23 PM Dawid Wysakowicz 
> wrote:
>
> > Dear devs,
> >
> > I'd like to open a vote on FLIP-193: Snapshots ownership [1] which was
> > discussed in this thread [2].
> > The vote will be open for at least 72 hours unless there is an objection or
> > not enough votes.
> >
> > Best,
> >
> > Dawid
> >
> > [1] https://cwiki.apache.org/confluence/x/bIyqCw
> >
> > [2] https://lists.apache.org/thread/zw2crf0c7t7t4cb5cwcwjpvsb3r1ovz2
> >
> >
> >
>
> --
>
> Konstantin Knauf
>
> https://twitter.com/snntrable
>
> https://github.com/knaufk


Re: [DISCUSS] FLIP-191: Extend unified Sink interface to support small file compaction

2021-12-02 Thread Roman Khachatryan
Thanks for clarifying (I was initially confused by merging state files
rather than output files).

> At some point, Flink will definitely have some WAL adapter that can turn any 
> sink into an exactly-once sink (with some caveats). For now, we keep that as 
> an orthogonal solution as it has a rather high price (bursty workload with 
> high latency). Ideally, we can keep the compaction asynchronously...

Yes, that would be something like a WAL. I agree that it would have a
different set of trade-offs.
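For concreteness, the SinkWriter-wrapping idea quoted below could be sketched roughly like this. BufferingWriter, its flush threshold, and the Consumer delegate are illustrative stand-ins, not Flink's actual Sink interfaces:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical sketch of the wrapping idea: buffer elements locally and only
// hand them to the real writer once a threshold is reached; on a checkpoint
// barrier, drain the remainder towards an aggregator instead.
// BufferingWriter and the Consumer delegate are illustrative stand-ins,
// not Flink's actual Sink V2 interfaces.
public class BufferingWriter<T> {
    private final List<T> buffer = new ArrayList<>();
    private final int flushThreshold;
    private final Consumer<List<T>> delegate; // stands in for the original SinkWriter

    public BufferingWriter(int flushThreshold, Consumer<List<T>> delegate) {
        this.flushThreshold = flushThreshold;
        this.delegate = delegate;
    }

    public void write(T element) {
        buffer.add(element);
        if (buffer.size() >= flushThreshold) { // enough data for a reasonably sized file
            delegate.accept(new ArrayList<>(buffer));
            buffer.clear();
        }
    }

    /** On a checkpoint barrier: ship whatever is still buffered to the aggregator. */
    public List<T> drainForCheckpoint() {
        List<T> pending = new ArrayList<>(buffer);
        buffer.clear();
        return pending;
    }

    public static void main(String[] args) {
        List<List<Integer>> files = new ArrayList<>();
        BufferingWriter<Integer> writer = new BufferingWriter<>(3, files::add);
        for (int i = 0; i < 7; i++) {
            writer.write(i);
        }
        System.out.println(files.size() + " flushed batches, "
                + writer.drainForCheckpoint().size() + " element(s) left for the aggregator");
    }
}
```

In this sketch only the under-threshold remainder crosses the checkpoint, which is what keeps it cheaper than a full WAL that persists every element.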


Regards,
Roman

On Mon, Nov 29, 2021 at 3:33 PM Arvid Heise  wrote:
>>
>> > One way to avoid write-read-merge is by wrapping SinkWriter with
>> > another one, which would buffer input elements in a temporary storage
>> > (e.g. local file) until a threshold is reached; after that, it would
>> > invoke the original SinkWriter. And if a checkpoint barrier comes in
>> > earlier, it would send written data to some aggregator.
>>
>> I think perhaps this seems to be a kind of WAL method? Namely we first
>> write the elements to some WAL logs and persist them on checkpoint
>> (in snapshot or remote FS), or we directly write WAL logs to the remote
>> FS eagerly.
>>
> At some point, Flink will definitely have some WAL adapter that can turn any 
> sink into an exactly-once sink (with some caveats). For now, we keep that as 
> an orthogonal solution as it has a rather high price (bursty workload with 
> high latency). Ideally, we can keep the compaction asynchronously...
>
> On Mon, Nov 29, 2021 at 8:52 AM Yun Gao  wrote:
>>
>> Hi,
>>
>> @Roman very sorry for the late response for a long time,
>>
>> > Merging artifacts from multiple checkpoints would apparently
>> require multiple concurrent checkpoints
>>
>> I think it might not need concurrent checkpoints: suppose some
>> operator (like the committer aggregator in option 2) maintains
>> the list of files to merge. It could store that list
>> in its state; then, after several checkpoints are done and we have
>> enough files, we could merge all the files in the list.
>>
>> > Asynchronous merging in an aggregator would require some resolution
>> > logic on recovery, so that a merged artifact can be used if the
>> > original one was deleted. Otherwise, wouldn't recovery fail because
>> > some artifacts are missing?
>> > We could also defer deletion until the "compacted" checkpoint is
>> > subsumed - but isn't it too late, as it will be deleted anyways once
>> > subsumed?
>>
>> I think logically, in all the options, we could delete the original files once
>> the "compacted" checkpoint (which finishes merging the compacted files and
>> records the result in the checkpoint) is completed. If there is a failover
>> before that, we could restart the merging; and if there is a failover after
>> it, the merged files are already recorded in the checkpoint.
>>
>> > One way to avoid write-read-merge is by wrapping SinkWriter with
>> > another one, which would buffer input elements in a temporary storage
>> > (e.g. local file) until a threshold is reached; after that, it would
>> > invoke the original SinkWriter. And if a checkpoint barrier comes in
>> > earlier, it would send written data to some aggregator.
>>
>> I think perhaps this seems to be a kind of WAL method? Namely we first
>> write the elements to some WAL logs and persist them on checkpoint
>> (in snapshot or remote FS), or we directly write WAL logs to the remote
>> FS eagerly.
>>
>> Sorry if I do not understand correctly somewhere.
>>
>> Best,
>> Yun
>>
>>
>> --
>> From:Roman Khachatryan 
>> Send Time:2021 Nov. 9 (Tue.) 22:03
>> To:dev 
>> Subject:Re: [DISCUSS] FLIP-191: Extend unified Sink interface to support 
>> small file compaction
>>
>> Hi everyone,
>>
>> Thanks for the proposal and the discussion, I have some remarks:
>> (I'm not very familiar with the new Sink API but I thought about the
>> same problem in context of the changelog state backend)
>>
>> 1. Merging artifacts from multiple checkpoints would apparently
>> require multiple concurrent checkpoints (otherwise, a new checkpoint
>> won't be started before completing the previous one; and the previous
>> one can't be completed before durably storing the artifacts). However,
>> concurrent checkpoints are currently not supported with Unaligned
>> checkpoints (this is besides increasing e2e-latency).
>>
>> 2. Asynchronous merging in an aggregator would require some resolution
>> logic on recovery, so that a merged artifact can be used if the
>> original one was deleted. Otherwise, wouldn't recovery fail because
>> some artifacts are missing?
>> We could also defer deletion until the "compacted" checkpoint is
>> subsumed - but isn't it too late, as it will be deleted anyways once
>> subsumed?
>>
>> 3. Writing small files, then reading and merging them for *every*
>> checkpoint seems worse than only reading them on recovery. I guess I'm
>> missing some cases of reading, s

Re: [ANNOUNCE] New Apache Flink Committer - Matthias Pohl

2021-12-06 Thread Roman Khachatryan
Congratulations, Matthias!

Regards,
Roman


On Mon, Dec 6, 2021 at 11:04 AM Yang Wang  wrote:
>
> Congratulations, Matthias!
>
> Best,
> Yang
>
> Sergey Nuyanzin  于2021年12月6日周一 下午3:35写道:
>
> > Congratulations, Matthias!
> >
> > On Mon, Dec 6, 2021 at 7:33 AM Leonard Xu  wrote:
> >
> > > Congratulations Matthias!
> > >
> > > Best,
> > > Leonard
> > > > 2021年12月3日 下午11:23,Matthias Pohl  写道:
> > > >
> > > > Thank you! I'm looking forward to continue working with you.
> > > >
> > > > On Fri, Dec 3, 2021 at 7:29 AM Jingsong Li 
> > > wrote:
> > > >
> > > >> Congratulations, Matthias!
> > > >>
> > > >> On Fri, Dec 3, 2021 at 2:13 PM Yuepeng Pan  wrote:
> > > >>>
> > > >>> Congratulations Matthias!
> > > >>>
> > > >>> Best,Yuepeng Pan.
> > > >>> 在 2021-12-03 13:47:20,"Yun Gao"  写道:
> > >  Congratulations Matthias!
> > > 
> > >  Best,
> > >  Yun
> > > 
> > > 
> > >  --
> > >  From:Jing Zhang 
> > >  Send Time:2021 Dec. 3 (Fri.) 13:45
> > >  To:dev 
> > >  Cc:Matthias Pohl 
> > >  Subject:Re: [ANNOUNCE] New Apache Flink Committer - Matthias Pohl
> > > 
> > >  Congratulations, Matthias!
> > > 
> > >  刘建刚  于2021年12月3日周五 11:51写道:
> > > 
> > > > Congratulations!
> > > >
> > > > Best,
> > > > Liu Jiangang
> > > >
> > > > Till Rohrmann  于2021年12月2日周四 下午11:28写道:
> > > >
> > > >> Hi everyone,
> > > >>
> > > >> On behalf of the PMC, I'm very happy to announce Matthias Pohl as
> > a
> > > >> new
> > > >> Flink committer.
> > > >>
> > > >> Matthias has worked on Flink since August last year. He helped
> > > >> review a
> > > > ton
> > > >> of PRs. He worked on a variety of things but most notably the
> > > >> tracking
> > > > and
> > > >> reporting of concurrent exceptions, fixing HA bugs and deprecating
> > > >> and
> > > >> removing our Mesos support. He actively reports issues helping
> > > >> Flink to
> > > >> improve and he is actively engaged in Flink's MLs.
> > > >>
> > > >> Please join me in congratulating Matthias for becoming a Flink
> > > >> committer!
> > > >>
> > > >> Cheers,
> > > >> Till
> > > >>
> > > >
> > > >>
> > > >>
> > > >>
> > > >> --
> > > >> Best, Jingsong Lee
> > >
> > >
> >
> > --
> > Best regards,
> > Sergey
> >


Re: [ANNOUNCE] New Apache Flink Committer - Ingo Bürk

2021-12-06 Thread Roman Khachatryan
Congratulations, Ingo!

Regards,
Roman


On Mon, Dec 6, 2021 at 11:05 AM Yang Wang  wrote:
>
> Congratulations, Ingo!
>
> Best,
> Yang
>
> Sergey Nuyanzin  于2021年12月6日周一 下午3:35写道:
>
> > Congratulations, Ingo!
> >
> > On Mon, Dec 6, 2021 at 7:32 AM Leonard Xu  wrote:
> >
> > > Congratulations, Ingo! Well Deserved.
> > >
> > > Best,
> > > Leonard
> > >
> > > > 2021年12月3日 下午11:24,Ingo Bürk  写道:
> > > >
> > > > Thank you everyone for the warm welcome!
> > > >
> > > >
> > > > Best
> > > > Ingo
> > > >
> > > > On Fri, Dec 3, 2021 at 11:47 AM Ryan Skraba
> >  > > >
> > > > wrote:
> > > >
> > > >> Congratulations Ingo!
> > > >>
> > > >> On Fri, Dec 3, 2021 at 8:17 AM Yun Tang  wrote:
> > > >>
> > > >>> Congratulations, Ingo!
> > > >>>
> > > >>> Best
> > > >>> Yun Tang
> > > >>> 
> > > >>> From: Yuepeng Pan 
> > > >>> Sent: Friday, December 3, 2021 14:14
> > > >>> To: dev@flink.apache.org 
> > > >>> Cc: Ingo Bürk 
> > > >>> Subject: Re:Re: [ANNOUNCE] New Apache Flink Committer - Ingo Bürk
> > > >>>
> > > >>>
> > > >>>
> > > >>>
> > > >>> Congratulations, Ingo!
> > > >>>
> > > >>>
> > > >>> Best,
> > > >>> Yuepeng Pan
> > > >>>
> > > >>>
> > > >>>
> > > >>>
> > > >>>
> > > >>> At 2021-12-03 13:47:38, "Yun Gao" 
> > > wrote:
> > >  Congratulations Ingo!
> > > 
> > >  Best,
> > >  Yun
> > > 
> > > 
> > >  --
> > >  From:刘建刚 
> > >  Send Time:2021 Dec. 3 (Fri.) 11:52
> > >  To:dev 
> > >  Cc:"Ingo Bürk" 
> > >  Subject:Re: [ANNOUNCE] New Apache Flink Committer - Ingo Bürk
> > > 
> > >  Congratulations!
> > > 
> > >  Best,
> > >  Liu Jiangang
> > > 
> > >  Till Rohrmann  于2021年12月2日周四 下午11:24写道:
> > > 
> > > > Hi everyone,
> > > >
> > > > On behalf of the PMC, I'm very happy to announce Ingo Bürk as a new
> > > >>> Flink
> > > > committer.
> > > >
> > > > Ingo has started contributing to Flink since the beginning of this
> > > >>> year. He
> > > > worked mostly on SQL components. He has authored many PRs and
> > helped
> > > >>> review
> > > > a lot of other PRs in this area. He actively reported issues and
> > > >> helped
> > > >>> our
> > > > users on the MLs. His most notable contributions were Support SQL
> > > 2016
> > > >>> JSON
> > > > functions in Flink SQL (FLIP-90), Register sources/sinks in Table
> > API
> > > > (FLIP-129) and various other contributions in the SQL area.
> > Moreover,
> > > >>> he is
> > > > one of the few people in our community who actually understands
> > > >> Flink's
> > > > frontend.
> > > >
> > > > Please join me in congratulating Ingo for becoming a Flink
> > committer!
> > > >
> > > > Cheers,
> > > > Till
> > > >
> > > >>>
> > > >>
> > >
> > >
> >
> > --
> > Best regards,
> > Sergey
> >


Re: Could not find any factory for identifier 'jdbc'

2022-01-12 Thread Roman Khachatryan
Hi,

I think Chesnay's suggestion to double-check the bundle makes sense.
Additionally, I'd try flink-connector-jdbc_2.12 instead of
flink-connector-jdbc_2.11.
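As a quick way to double-check the bundle (as suggested below), one can look for the factory's `.class` entry directly inside the jar. A hypothetical, generic sketch; the jar written in `main()` is a stand-in for the user's bundled job jar:

```java
import java.io.File;
import java.io.FileOutputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipOutputStream;

// Generic check that a class really ended up inside a bundled jar -- the
// programmatic equivalent of `jar tf job.jar | grep JdbcDynamicTableFactory`.
// The jar created in main() is a stand-in for the user's bundled job jar.
public class JarContainsClass {

    public static boolean contains(File jar, String className) throws Exception {
        String entry = className.replace('.', '/') + ".class";
        try (ZipFile zf = new ZipFile(jar)) { // jars are plain zip archives
            return zf.getEntry(entry) != null;
        }
    }

    public static void main(String[] args) throws Exception {
        File jar = File.createTempFile("bundle", ".jar");
        jar.deleteOnExit();
        try (ZipOutputStream out = new ZipOutputStream(new FileOutputStream(jar))) {
            out.putNextEntry(new ZipEntry(
                    "org/apache/flink/connector/jdbc/table/JdbcDynamicTableFactory.class"));
            out.closeEntry();
        }
        System.out.println(contains(jar,
                "org.apache.flink.connector.jdbc.table.JdbcDynamicTableFactory")); // true
    }
}
```

Note that even when the class is bundled, the Table API also discovers factories via `META-INF/services` files, which shading can merge incorrectly; checking for those entries the same way may help.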

Regards,
Roman

On Wed, Jan 12, 2022 at 12:23 PM Chesnay Schepler  wrote:
>
> I would try double-checking whether the jdbc connector was truly bundled
> in your jar, specifically whether
> org.apache.flink.connector.jdbc.table.JdbcDynamicTableFactory is.
>
> I can't think of a reason why this shouldn't work for the JDBC connector.
>
> On 12/01/2022 06:34, Ronak Beejawat (rbeejawa) wrote:
> > Hi Chesnay,
> >
> > How do you ensure that the connector is actually available at runtime?
> >
> > We are providing the below-mentioned dependency inside pom.xml with scope 
> > compile, so it will be available on the classpath, and it was there in my Flink 
> > job's bundled jar. We are doing the same for other connectors, e.g. Kafka, and 
> > it worked for those.
> >
> > <dependency>
> >    <groupId>org.apache.flink</groupId>
> >    <artifactId>flink-connector-jdbc_2.11</artifactId>
> >    <version>1.14.2</version>
> > </dependency>
> > <dependency>
> >    <groupId>mysql</groupId>
> >    <artifactId>mysql-connector-java</artifactId>
> >    <version>5.1.41</version>
> > </dependency>
> >
> > Are you bundling it in a jar or putting it into Flinks lib directory?
> > Yes, we are building a jar and the connector is bundled in it, but we still 
> > saw this error. So we tried the workaround mentioned in an article, putting 
> > the connector jar inside the Flink lib directory, and then it worked: 
> > https://blog.csdn.net/weixin_44056920/article/details/118110949 . But this 
> > is extra work we have to do to make it function, plus a restart of the cluster.
> >
> > But the question is: how did it work for Kafka and not for JDBC? I didn't put 
> > the Kafka jar explicitly in the Flink lib folder.
> >
> > Note: I am using the Flink 1.14 release for all my job execution and 
> > implementation, which is a stable version, I guess.
> >
> > Thanks
> > Ronak Beejawat
> > From: Chesnay Schepler mailto:ches...@apache.org>>
> > Date: Tuesday, 11 January 2022 at 7:45 PM
> > To: Ronak Beejawat (rbeejawa) 
> > mailto:rbeej...@cisco.com.INVALID>>, 
> > u...@flink.apache.org 
> > mailto:u...@flink.apache.org>>
> > Cc: Hang Ruan mailto:ruanhang1...@gmail.com>>, 
> > Shrinath Shenoy K (sshenoyk) 
> > mailto:sshen...@cisco.com>>, Karthikeyan Muthusamy 
> > (karmuthu) mailto:karmu...@cisco.com>>, Krishna 
> > Singitam (ksingita) mailto:ksing...@cisco.com>>, Arun 
> > Yadav (aruny) mailto:ar...@cisco.com>>, Jayaprakash 
> > Kuravatti (jkuravat) mailto:jkura...@cisco.com>>, Avi 
> > Sanwal (asanwal) mailto:asan...@cisco.com>>
> > Subject: Re: Could not find any factory for identifier 'jdbc'
> > How do you ensure that the connector is actually available at runtime?
> > Are you bundling it in a jar or putting it into Flinks lib directory?
> >
> > On 11/01/2022 14:14, Ronak Beejawat (rbeejawa) wrote:
> >> Correcting subject -> Could not find any factory for identifier 'jdbc'
> >>
> >> From: Ronak Beejawat (rbeejawa)
> >> Sent: Tuesday, January 11, 2022 6:43 PM
> >> To: 'dev@flink.apache.org' 
> >> mailto:dev@flink.apache.org>>; 
> >> 'commun...@flink.apache.org' 
> >> mailto:commun...@flink.apache.org>>; 
> >> 'u...@flink.apache.org' 
> >> mailto:u...@flink.apache.org>>
> >> Cc: 'Hang Ruan' mailto:ruanhang1...@gmail.com>>; 
> >> Shrinath Shenoy K (sshenoyk) 
> >> mailto:sshen...@cisco.com>>; Karthikeyan Muthusamy 
> >> (karmuthu) mailto:karmu...@cisco.com>>; Krishna 
> >> Singitam (ksingita) mailto:ksing...@cisco.com>>; Arun 
> >> Yadav (aruny) mailto:ar...@cisco.com>>; Jayaprakash 
> >> Kuravatti (jkuravat) mailto:jkura...@cisco.com>>; Avi 
> >> Sanwal (asanwal) mailto:asan...@cisco.com>>
> >> Subject: what is efficient way to write Left join in flink
> >>
> >> Hi Team,
> >>
> >> Getting below exception while using jdbc connector :
> >>
> >> Caused by: org.apache.flink.table.api.ValidationException: Could not find 
> >> any factory for identifier 'jdbc' that implements 
> >> 'org.apache.flink.table.factories.DynamicTableFactory' in the classpath.
> >>
> >> Available factory identifiers are:
> >>
> >> blackhole
> >> datagen
> >> filesystem
> >> kafka
> >> print
> >> upsert-kafka
> >>
> >>
> >> I have already added dependency for jdbc connector in pom.xml as mentioned 
> >> below:
> >>
> >> <dependency>
> >>   <groupId>org.apache.flink</groupId>
> >>   <artifactId>flink-connector-jdbc_2.11</artifactId>
> >>   <version>1.14.2</version>
> >> </dependency>
> >> <dependency>
> >>   <groupId>mysql</groupId>
> >>   <artifactId>mysql-connector-java</artifactId>
> >>   <version>5.1.41</version>
> >> </dependency>
> >>
> >> Referred release doc link for the same 
> >> https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/connectors/table/jdbc/
> >>
> >>
> >>
> >> Please help me on this and provide the solution for it !!!
> >>
> >>
> >> Thanks
> >> Ronak Beejawat
>
>


Re: OutOfMemoryError: Java heap space while implmentating flink sql api

2022-01-12 Thread Roman Khachatryan
Hi Ronak,

You shared a screenshot of JM. Do you mean that exception also happens
on JM? (I'd rather assume TM).

Could you explain the join clause: left join ccmversionsumapTable cvsm
ON (cdr.version = cvsm.ccmversion)
"version" doesn't sound very selective, so maybe you end up with
(almost) Cartesian product?
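A toy illustration of why a low-selectivity join key blows up memory: if (almost) every cdr row carries the same `version` value as the lookup table, the join output approaches |cdr| x |matching rows|, i.e. a near-Cartesian product. The sizes below are made up for illustration:

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy illustration of the selectivity concern: if (almost) every cdr row joins
// the same `version` value, the join output approaches |cdr| x |matching rows|,
// a near-Cartesian product. Sizes below are made up for illustration.
public class JoinBlowup {

    /** Number of rows an equi-join on these key columns would produce. */
    public static long joinedRows(List<String> leftKeys, List<String> rightKeys) {
        Map<String, Long> rightCount = new HashMap<>();
        for (String k : rightKeys) {
            rightCount.merge(k, 1L, Long::sum);
        }
        long out = 0;
        for (String k : leftKeys) {
            out += rightCount.getOrDefault(k, 0L);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> cdrVersions = Collections.nCopies(2_500_000, "12.5"); // all rows share one version
        List<String> suMapVersions = Collections.nCopies(23, "12.5");      // all 23 lookup rows match
        System.out.println(joinedRows(cdrVersions, suMapVersions));        // 2.5M * 23 joined rows
    }
}
```

With a selective key the output stays close to the input size; with one shared key it multiplies, which would match heap exhaustion while the tumbling window buffers the join input.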

Regards,
Roman

On Wed, Jan 12, 2022 at 11:06 AM Ronak Beejawat (rbeejawa)
 wrote:
>
> Hi Team,
>
> I was trying to implement a Flink SQL API join with 2 tables and it is throwing 
> OutOfMemoryError: Java heap space. PFB a screenshot of the Flink cluster 
> memory details.
> [Flink Memory Model][1]
>
>
>   [1]: https://i.stack.imgur.com/AOnQI.png
>
> **PFB below code snippet which I was trying:**
> ```
> EnvironmentSettings settings = 
> EnvironmentSettings.newInstance().inStreamingMode().build();
> TableEnvironment tableEnv = TableEnvironment.create(settings);
>
>
> tableEnv.getConfig().getConfiguration().setString("table.optimizer.agg-phase-strategy",
>  "TWO_PHASE");
> tableEnv.getConfig().getConfiguration().setString("table.optimizer.join-reorder-enabled",
>  "true");
> tableEnv.getConfig().getConfiguration().setString("table.exec.resource.default-parallelism",
>  "16");
>
> tableEnv.executeSql("CREATE TEMPORARY TABLE ccmversionsumapTable (\r\n"
>  + "  suname STRING\r\n"
>  + "  ,ccmversion STRING\r\n"
>  + "   )\r\n"
>  + "   WITH (\r\n"
>  + "   'connector' = 
> 'jdbc'\r\n"
>  + "   ,'url' = 
> 'jdbc:mysql://:3306/ccucdb'\r\n"
>  + "   ,'table-name' = 
> 'ccmversionsumap'\r\n"
>  + "   ,'username' = 
> '*'\r\n"
>  + "   ,'password' = 
> ''\r\n"
>  + "   )");
>
> tableEnv.executeSql("CREATE TEMPORARY TABLE cdrTable (\r\n"
>+ "   org_id STRING\r\n"
>+ "   ,cluster_id STRING\r\n"
>+ "   ,cluster_name STRING\r\n"
>+ "   ,version STRING\r\n"
>+ "   ,ip_address STRING\r\n"
>+ "   ,pkid STRING\r\n"
>+ "   ,globalcallid_callid INT\r\n"
>   ... --- multiple columns can be added
>+ "   )\r\n"
>+ "   WITH (\r\n"
>+ "   'connector' = 'kafka'\r\n"
>+ "   ,'topic' = 'cdr'\r\n"
>+ "   ,'properties.bootstrap.servers' = 
> ':9092'\r\n"
>+ "   ,'scan.startup.mode' = 
> 'earliest-offset'\r\n"
>//+ ",'value.fields-include' = 
> 'EXCEPT_KEY'\r\n"
>+ "   ,'format' = 'json'\r\n"
>+ "   )");
>
>
> String sql = "SELECT cdr.org_id orgid,\r\n"
>   + " 
> cdr.cluster_name clustername,\r\n"
>   + " 
> cdr.cluster_id clusterid,\r\n"
>   + " 
> cdr.ip_address clusteripaddr,\r\n"
>   + " 
> cdr.version clusterversion,\r\n"
>   + " 
> cvsm.suname clustersuname,\r\n"
>   + " 
> cdr.pkid cdrpkid,\r\n"
>   ... --- 
> multiple columns can be added
>   + " from 
> cdrTable cdr\r\n"
>   + " left join 
> ccmversionsumapTable cvsm ON (cdr.version = cvsm.ccmversion) group by 
> TUMBLE(PROCTIME(), INTERVAL '1' MINUTE), cdr.org_id, cdr.cluster_name, 
> cdr.cluster_id, cdr.ip_address, cdr.version, cdr.pkid, 
> cdr.globalcallid_callid, ..."
>
> Table order20 = tableEnv.sqlQuery(sql);
> order20.executeInsert("outputCdrTable");
> ```
>
> **scenario / use case :**
>
> We are pushing 2.5 million JSON records into a Kafka topic and reading them via 
> the Kafka connector as the temporary cdrTable shown in the above code; we are 
> reading 23 records from a static/reference table via the JDBC connector as the 
> temporary ccmversionsumapTable shown in the above code, and doing a left join 
> over a 1-minute tumbling window.
>
> So while doing a

Re: [DISCUSS] Release Flink 1.14.4

2022-02-12 Thread Roman Khachatryan
Hi,

+1 for the proposal.

Thanks for volunteering Konstantin!

Regards,
Roman



On Fri, Feb 11, 2022 at 3:00 PM Till Rohrmann  wrote:
>
> +1 for a 1.14.4 release. The release-1.14 branch already includes 36 fixed
> issues, some of them quite severe.
>
> Cheers,
> Till
>
> On Fri, Feb 11, 2022 at 1:23 PM Martijn Visser 
> wrote:
>
> > Hi Konstantin,
> >
> > Thanks for opening the discussion. I think FLINK-25732 does indeed warrant
> > a speedy Flink 1.14.4 release. I would indeed also like to include
> > FLINK-26018.
> >
> > Best regards,
> >
> > Martijn
> >
> > Op vr 11 feb. 2022 om 10:29 schreef Konstantin Knauf 
> >
> > > Hi everyone,
> > >
> > > what do you think about a timely Flink 1.14.4 in order to release the fix
> > > for https://issues.apache.org/jira/browse/FLINK-25732. Currently, the
> > > landing page of Flink Web User Interface is not working when there are
> > > standby Jobmanagers.
> > >
> > > In addition, I propose to wait for
> > > https://issues.apache.org/jira/browse/FLINK-26018 to be resolved.
> > >
> > > I can volunteer as release manager.
> > >
> > > Cheers,
> > >
> > > Konstantin
> > >
> > > --
> > >
> > > Konstantin Knauf
> > >
> > > https://twitter.com/snntrable
> > >
> > > https://github.com/knaufk
> > >
> > --
> >
> > Martijn Visser | Product Manager
> >
> > mart...@ververica.com
> >
> > 
> >
> >
> > Follow us @VervericaData
> >
> > --
> >
> > Join Flink Forward  - The Apache Flink
> > Conference
> >
> > Stream Processing | Event Driven | Real Time
> >


Re: [ANNOUNCE] New Flink PMC members: Igal Shilman, Konstantin Knauf and Yun Gao

2022-02-16 Thread Roman Khachatryan
Congratulations!

Regards,
Roman

On Wed, Feb 16, 2022 at 5:22 PM Matthias Pohl  wrote:
>
> Congratulations to all of you :)
>
> On Wed, Feb 16, 2022 at 4:04 PM Till Rohrmann  wrote:
>
> > Congratulations! Well deserved!
> >
> > Cheers,
> > Till
> >
> > On Wed, Feb 16, 2022 at 2:51 PM Martijn Visser 
> > wrote:
> >
> > > Congratulations to all of you!
> > >
> > > Best regards,
> > >
> > > Martijn Visser
> > > https://twitter.com/MartijnVisser82
> > >
> > >
> > > On Wed, 16 Feb 2022 at 14:29, Fabian Paul  wrote:
> > >
> > > > Congrats to all three of you, well deserved.
> > > >
> > > > Best,
> > > > Fabian
> > > >
> > > > On Wed, Feb 16, 2022 at 2:23 PM Robert Metzger 
> > > > wrote:
> > > > >
> > > > > Hi all,
> > > > >
> > > > > I would like to formally announce a few new Flink PMC members on the
> > > dev@
> > > > > list. The PMC has not done a good job of always announcing new PMC
> > > > members
> > > > > (and committers) recently. I'll try to keep an eye on this in the
> > > future
> > > > to
> > > > > improve the situation.
> > > > >
> > > > > Nevertheless, I'm very happy to announce some very active community
> > > > members
> > > > > as new PMC members:
> > > > >
> > > > > - Igal Shilman, added to the PMC in October 2021
> > > > > - Konstantin Knauf, added to the PMC in January 2022
> > > > > - Yun Gao, added to the PMC in February 2022
> > > > >
> > > > > Please join me in welcoming them to the Flink PMC!
> > > > >
> > > > > Best,
> > > > > Robert
> > > >
> > >


Re: [ANNOUNCE] New Apache Flink Committer - Martijn Visser

2022-03-03 Thread Roman Khachatryan
Congratulations Martijn!

Regards,
Roman

On Fri, Mar 4, 2022 at 8:09 AM Sergey Nuyanzin  wrote:
>
> Congratulations Martijn!
>
> On Fri, Mar 4, 2022 at 8:07 AM David Morávek  wrote:
>
> > Congratulations Martijn, well deserved!
> >
> > Best,
> > D.
> >
> > On Fri, Mar 4, 2022 at 7:25 AM Jiangang Liu 
> > wrote:
> >
> > > Congratulations Martijn!
> > >
> > > Best
> > > Liu Jiangang
> > >
> > > Lijie Wang  于2022年3月4日周五 14:00写道:
> > >
> > > > Congratulations Martijn!
> > > >
> > > > Best,
> > > > Lijie
> > > >
> > > > Jingsong Li  于2022年3月4日周五 13:42写道:
> > > >
> > > > > Congratulations Martijn!
> > > > >
> > > > > Best,
> > > > > Jingsong
> > > > >
> > > > > On Fri, Mar 4, 2022 at 1:09 PM Yang Wang 
> > > wrote:
> > > > > >
> > > > > > Congratulations Martijn!
> > > > > >
> > > > > > Best,
> > > > > > Yang
> > > > > >
> > > > > > Yangze Guo  于2022年3月4日周五 11:33写道:
> > > > > >
> > > > > > > Congratulations!
> > > > > > >
> > > > > > > Best,
> > > > > > > Yangze Guo
> > > > > > >
> > > > > > > On Fri, Mar 4, 2022 at 11:23 AM Lincoln Lee <
> > > lincoln.8...@gmail.com>
> > > > > > > wrote:
> > > > > > > >
> > > > > > > > Congratulations Martijn!
> > > > > > > >
> > > > > > > > Best,
> > > > > > > > Lincoln Lee
> > > > > > > >
> > > > > > > >
> > > > > > > > Yu Li  于2022年3月4日周五 11:09写道:
> > > > > > > >
> > > > > > > > > Congratulations!
> > > > > > > > >
> > > > > > > > > Best Regards,
> > > > > > > > > Yu
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > On Fri, 4 Mar 2022 at 10:31, Zhipeng Zhang <
> > > > > zhangzhipe...@gmail.com>
> > > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Congratulations Martijn!
> > > > > > > > > >
> > > > > > > > > > Qingsheng Ren  于2022年3月4日周五 10:14写道:
> > > > > > > > > >
> > > > > > > > > > > Congratulations Martijn!
> > > > > > > > > > >
> > > > > > > > > > > Best regards,
> > > > > > > > > > >
> > > > > > > > > > > Qingsheng Ren
> > > > > > > > > > >
> > > > > > > > > > > > On Mar 4, 2022, at 9:56 AM, Leonard Xu <
> > > xbjt...@gmail.com>
> > > > > > > wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > Congratulations and well deserved Martjin !
> > > > > > > > > > > >
> > > > > > > > > > > > Best,
> > > > > > > > > > > > Leonard
> > > > > > > > > > > >
> > > > > > > > > > > >> 2022年3月4日 上午7:55,Austin Cawley-Edwards <
> > > > > austin.caw...@gmail.com
> > > > > > > >
> > > > > > > > > 写道:
> > > > > > > > > > > >>
> > > > > > > > > > > >> Congrats Martijn!
> > > > > > > > > > > >>
> > > > > > > > > > > >> On Thu, Mar 3, 2022 at 10:50 AM Robert Metzger <
> > > > > > > rmetz...@apache.org
> > > > > > > > > >
> > > > > > > > > > > wrote:
> > > > > > > > > > > >>
> > > > > > > > > > > >>> Hi everyone,
> > > > > > > > > > > >>>
> > > > > > > > > > > >>> On behalf of the PMC, I'm very happy to announce
> > > Martijn
> > > > > > > Visser as
> > > > > > > > > a
> > > > > > > > > > > new
> > > > > > > > > > > >>> Flink committer.
> > > > > > > > > > > >>>
> > > > > > > > > > > >>> Martijn is a very active Flink community member,
> > > driving
> > > > a
> > > > > lot
> > > > > > > of
> > > > > > > > > > > efforts
> > > > > > > > > > > >>> on the dev@flink mailing list. He also pushes
> > projects
> > > > > such as
> > > > > > > > > > > replacing
> > > > > > > > > > > >>> Google Analytics with Matomo, so that we can generate
> > > our
> > > > > web
> > > > > > > > > > analytics
> > > > > > > > > > > >>> within the Apache Software Foundation.
> > > > > > > > > > > >>>
> > > > > > > > > > > >>> Please join me in congratulating Martijn for
> > becoming a
> > > > > Flink
> > > > > > > > > > > committer!
> > > > > > > > > > > >>>
> > > > > > > > > > > >>> Cheers,
> > > > > > > > > > > >>> Robert
> > > > > > > > > > > >>>
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > --
> > > > > > > > > > best,
> > > > > > > > > > Zhipeng
> > > > > > > > > >
> > > > > > > > >
> > > > > > >
> > > > >
> > > >
> > >
> >
>
>
> --
> Best regards,
> Sergey


Re: [ANNOUNCE] New Apache Flink Committer - David Morávek

2022-03-05 Thread Roman Khachatryan
Congratulations, David!

Regards,
Roman

On Fri, Mar 4, 2022 at 7:54 PM Austin Cawley-Edwards
 wrote:
>
> Congrats David!
>
> On Fri, Mar 4, 2022 at 12:18 PM Zhilong Hong  wrote:
>
> > Congratulations, David!
> >
> > Best,
> > Zhilong
> >
> > On Sat, Mar 5, 2022 at 1:09 AM Piotr Nowojski 
> > wrote:
> >
> > > Congratulations :)
> > >
> > > pt., 4 mar 2022 o 16:04 Aitozi  napisał(a):
> > >
> > > > Congratulations David!
> > > >
> > > > Ingo Bürk  于2022年3月4日周五 22:56写道:
> > > >
> > > > > Congrats, David!
> > > > >
> > > > > On 04.03.22 12:34, Robert Metzger wrote:
> > > > > > Hi everyone,
> > > > > >
> > > > > > On behalf of the PMC, I'm very happy to announce David Morávek as a
> > > new
> > > > > > Flink committer.
> > > > > >
> > > > > > His first contributions to Flink date back to 2019. He has been
> > > > > > increasingly active with reviews and driving major initiatives in
> > the
> > > > > > community. David brings valuable experience from being a committer
> > in
> > > > the
> > > > > > Apache Beam project to Flink.
> > > > > >
> > > > > >
> > > > > > Please join me in congratulating David for becoming a Flink
> > > committer!
> > > > > >
> > > > > > Cheers,
> > > > > > Robert
> > > > > >
> > > > >
> > > >
> > >
> >


Re: Kinesis Performance Issue (was [VOTE] Release 1.11.0, release candidate #4)

2020-08-01 Thread Roman Khachatryan
Hi Thomas,

Thanks a lot for the analysis.

The first thing that I'd check is whether checkpoints became more frequent
with this commit (as each of them adds at least 500ms if there is at least
one not sent record, according to FlinkKinesisProducer.snapshotState).

Can you share checkpointing statistics (1.10 vs 1.11 or last "good" vs
first "bad" commits)?

On Fri, Jul 31, 2020 at 5:29 AM Thomas Weise  wrote:

> I run git bisect and the first commit that shows the regression is:
>
>
> https://github.com/apache/flink/commit/355184d69a8519d29937725c8d85e8465d7e3a90
>
>
> On Thu, Jul 23, 2020 at 6:46 PM Kurt Young  wrote:
>
> > From my experience, Java profilers are sometimes not accurate enough to
> > find out the root cause of a performance regression. In this case, I would
> > suggest you try out Intel VTune Amplifier to watch more detailed metrics.
> >
> > Best,
> > Kurt
> >
> >
> > On Fri, Jul 24, 2020 at 8:51 AM Thomas Weise  wrote:
> >
> > > The cause of the issue is anything but clear.
> > >
> > > Previously I had mentioned that there is no suspect change to the
> Kinesis
> > > connector and that I had reverted the AWS SDK change to no effect.
> > >
> > > https://issues.apache.org/jira/browse/FLINK-17496 actually fixed
> another
> > > regression in the previous release and is present before and after.
> > >
> > > I repeated the run with 1.11.0 core and downgraded the entire Kinesis
> > > connector to 1.10.1: Nothing changes, i.e. the regression is still
> > present.
> > > Therefore we will need to look elsewhere for the root cause.
> > >
> > > Regarding the time spent in snapshotState, repeat runs reveal a wide
> > range
> > > for both versions, 1.10 and 1.11. So again this is nothing pointing to
> a
> > > root cause.
> > >
> > > At this point, I have no ideas remaining other than doing a bisect to
> > find
> > > the culprit. Any other suggestions?
> > >
> > > Thomas
> > >
> > >
> > > On Thu, Jul 16, 2020 at 9:19 PM Zhijiang  > > .invalid>
> > > wrote:
> > >
> > > > Hi Thomas,
> > > >
> > > > Thanks for your further profiling information, and glad to see we have
> > > > already pinpointed the location causing the regression.
> > > > Actually, I was also suspicious of #snapshotState in previous
> > > > discussions, since it can indeed cost much time and block normal operator
> > > > processing.
> > > >
> > > > Based on your below feedback, the sleep time during #snapshotState
> > might
> > > > be the main concern, and I also dug into the implementation of
> > > > FlinkKinesisProducer#snapshotState.
> > > > while (producer.getOutstandingRecordsCount() > 0) {
> > > >     producer.flush();
> > > >     try {
> > > >         Thread.sleep(500);
> > > >     } catch (InterruptedException e) {
> > > >         LOG.warn("Flushing was interrupted.");
> > > >         break;
> > > >     }
> > > > }
> > > > It seems that the sleep time is mainly affected by the internal
> > > operations
> > > > inside KinesisProducer implementation provided by amazonaws, which I
> am
> > > not
> > > > quite familiar with.
> > > > But I noticed there were two upgrades related to it in
> release-1.11.0.
> > > One
> > > > is for upgrading amazon-kinesis-producer to 0.14.0 [1] and another is
> > for
> > > > upgrading aws-sdk-version to 1.11.754 [2].
> > > > You mentioned that you already reverted the SDK upgrade to verify no
> > > > changes. Did you also revert the [1] to verify?
> > > > [1] https://issues.apache.org/jira/browse/FLINK-17496
> > > > [2] https://issues.apache.org/jira/browse/FLINK-14881
> > > >
> > > > Best,
> > > > Zhijiang
> > > > --
> > > > From:Thomas Weise 
> > > > Send Time:2020年7月17日(星期五) 05:29
> > > > To:dev 
> > > > Cc:Zhijiang ; Stephan Ewen <
> > se...@apache.org
> > > >;
> > > > Arvid Heise ; Aljoscha Krettek <
> > aljos...@apache.org
> > > >
> > > > Subject:Re: Kinesis Performance Issue (was [VOTE] Release 1.11.0,
> > release
> > > > candidate #4)
> > > >
> > > > Sorry for the delay.
> > > >
> > > > I confirmed that the regression is due to the sink (unsurprising,
> since
> > > > another job with the same consumer, but not the producer, runs as
> > > > expected).
> > > >
> > > > As promised I did CPU profiling on the problematic application, which
> > > gives
> > > > more insight into the regression [1]
> > > >
> > > > The screenshots show that the average time for snapshotState
> increases
> > > from
> > > > ~9s to ~28s. The data also shows the increase in sleep time during
> > > > snapshotState.
> > > >
> > > > Does anyone, based on changes made in 1.11, have a theory why?
> > > >
> > > > I had previously looked at the changes to the Kinesis connector and
> > also
> > > > reverted the SDK upgrade, which did not change the situation.
> > > >
> > > > It will likely be necessary to drill into the sink / checkpointing
> > > details
> > > > to understand the cause of the problem.
> > > >
> > > > Let me know if anyone has specific questions that I can an
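The flush-and-sleep loop quoted at the top of this thread can be sketched as follows. This is an illustrative stand-in only: the class, field, and method names are hypothetical and not the actual FlinkKinesisProducer/KPL code; it just shows why time sleeping inside the drain loop surfaces as snapshotState time in the profile.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch, not Flink's FlinkKinesisProducer: a sink that drains
// outstanding records before a snapshot can complete.
public class FlushSketch {
    static final AtomicInteger outstanding = new AtomicInteger(3);

    static void flushSync() {
        // Mirrors the quoted loop: flush and sleep until no records remain,
        // or break if interrupted.
        while (outstanding.get() > 0) {
            outstanding.decrementAndGet(); // stand-in for producer.flush() draining one record
            try {
                // Back-off between flush attempts; with a slow producer this
                // sleep dominates the time spent in snapshotState.
                Thread.sleep(10);
            } catch (InterruptedException e) {
                System.out.println("Flushing was interrupted.");
                Thread.currentThread().interrupt();
                break;
            }
        }
    }

    public static void main(String[] args) {
        long start = System.nanoTime();
        flushSync();
        System.out.println("drained in ~" + (System.nanoTime() - start) / 1_000_000 + " ms");
    }
}
```

If the KPL drains records more slowly after the upgrade, the sleep count in this loop grows accordingly, which would match the observed ~9s to ~28s increase.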

[jira] [Created] (FLINK-14783) Implement a benchmark for `ContinuousFileReaderOperator`

2019-11-14 Thread Roman Khachatryan (Jira)
Roman Khachatryan created FLINK-14783:
-

 Summary: Implement a benchmark for `ContinuousFileReaderOperator`
 Key: FLINK-14783
 URL: https://issues.apache.org/jira/browse/FLINK-14783
 Project: Flink
  Issue Type: Test
  Components: Runtime / Task, Tests
Reporter: Roman Khachatryan
 Fix For: 1.10.0






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-14857) Deprecate checkpoint lock from the Operators API

2019-11-19 Thread Roman Khachatryan (Jira)
Roman Khachatryan created FLINK-14857:
-

 Summary: Deprecate checkpoint lock from the Operators API
 Key: FLINK-14857
 URL: https://issues.apache.org/jira/browse/FLINK-14857
 Project: Flink
  Issue Type: Improvement
  Components: Runtime / Task
Affects Versions: 1.9.1
Reporter: Roman Khachatryan


Drop `this.checkpointLock = getContainingTask().getCheckpointLock();` access 
(used for example in `ContinuousFileReaderOperator`).

Redirect users to the mailbox (enqueuing/yielding).





[jira] [Created] (FLINK-15109) InternalTimerServiceImpl references restored state after use, taking up resources unnecessarily

2019-12-06 Thread Roman Khachatryan (Jira)
Roman Khachatryan created FLINK-15109:
-

 Summary: InternalTimerServiceImpl references restored state after 
use, taking up resources unnecessarily
 Key: FLINK-15109
 URL: https://issues.apache.org/jira/browse/FLINK-15109
 Project: Flink
  Issue Type: Improvement
  Components: Runtime / Task
Affects Versions: 1.9.1
Reporter: Roman Khachatryan


E.g. 
org.apache.flink.streaming.api.operators.InternalTimerServiceImpl#restoredTimersSnapshot:
 # written in restoreTimersForKeyGroup()
 # used in startTimerService()
 # and then never used again.
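The fix pattern described above can be sketched as follows. All names here are illustrative stand-ins (simplified from InternalTimerServiceImpl), assuming the restored snapshot is only needed until the timer service starts.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical mirror of the restoredTimersSnapshot lifecycle, not Flink code.
public class TimerRestoreSketch {
    // Written during restore, read once at startup, then dead weight.
    static Map<Integer, String> restoredTimersSnapshot = new HashMap<>();

    static void restoreTimersForKeyGroup(int keyGroup, String snapshot) {
        restoredTimersSnapshot.put(keyGroup, snapshot);
    }

    static void startTimerService() {
        // ... re-register timers from the snapshot here ...
        // Drop the reference after use so the restored state can be GC'd
        // instead of being retained for the operator's whole lifetime.
        restoredTimersSnapshot = null;
    }

    public static void main(String[] args) {
        restoreTimersForKeyGroup(0, "timers-kg0");
        startTimerService();
        System.out.println("released=" + (restoredTimersSnapshot == null));
    }
}
```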





[jira] [Created] (FLINK-33442) UnsupportedOperationException thrown from RocksDBIncrementalRestoreOperation

2023-11-02 Thread Roman Khachatryan (Jira)
Roman Khachatryan created FLINK-33442:
-

 Summary: UnsupportedOperationException thrown from 
RocksDBIncrementalRestoreOperation
 Key: FLINK-33442
 URL: https://issues.apache.org/jira/browse/FLINK-33442
 Project: Flink
  Issue Type: Bug
  Components: Runtime / State Backends
Affects Versions: 1.17.1
Reporter: Roman Khachatryan
Assignee: Roman Khachatryan
 Fix For: 1.17.2


When using the new rescaling API, it's possible to get
{code:java}
2023-10-31 18:25:05,179 ERROR 
org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackendBuilder [] - 
Caught unexpected exception.
java.lang.UnsupportedOperationException: null
at java.util.Collections$1.remove(Collections.java:4714) ~[?:?]
at java.util.AbstractCollection.remove(AbstractCollection.java:299) 
~[?:?]
at 
org.apache.flink.runtime.checkpoint.StateObjectCollection.remove(StateObjectCollection.java:105)
 ~[flink-runtime-1.17.1-143.jar:1.17.1-143]
at 
org.apache.flink.contrib.streaming.state.restore.RocksDBIncrementalRestoreOperation.restoreWithRescaling(RocksDBIncrementalRestoreOperation.java:294)
 ~[flink-dist-1.17.1-143.jar:1.17.1-143]
at 
org.apache.flink.contrib.streaming.state.restore.RocksDBIncrementalRestoreOperation.restore(RocksDBIncrementalRestoreOperation.java:167)
 ~[flink-dist-1.17.1-143.jar:1.17.1-143]
at 
org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackendBuilder.build(RocksDBKeyedStateBackendBuilder.java:327)
 ~[flink-dist-1.17.1-143.jar:1.17.1-143]
at 
org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend.createKeyedStateBackend(EmbeddedRocksDBStateBackend.java:512)
 ~[flink-dist-1.17.1-143.jar:1.17.1-143]
at 
org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend.createKeyedStateBackend(EmbeddedRocksDBStateBackend.java:99)
 ~[flink-dist-1.17.1-143.jar:1.17.1-143]
at 
org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.lambda$keyedStatedBackend$1(StreamTaskStateInitializerImpl.java:338)
 ~[flink-dist-1.17.1-143.jar:1.17.1-143]
at 
org.apache.flink.streaming.api.operators.BackendRestorerProcedure.attemptCreateAndRestore(BackendRestorerProcedure.java:168)
 ~[flink-dist-1.17.1-143.jar:1.17.1-143]
at 
org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:135)
 ~[flink-dist-1.17.1-143.jar:1.17.1-143]
at 
org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.keyedStatedBackend(StreamTaskStateInitializerImpl.java:355)
 ~[flink-dist-1.17.1-143.jar:1.17.1-143]
at 
org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:166)
 ~[flink-dist-1.17.1-143.jar:1.17.1-143]
at 
org.apache.flink.streaming.api.operators.AbstractStreamOperator.initializeState(AbstractStreamOperator.java:256)
 ~[flink-dist-1.17.1-143.jar:1.17.1-143]
at 
org.apache.flink.streaming.runtime.tasks.RegularOperatorChain.initializeStateAndOpenOperators(RegularOperatorChain.java:106)
 ~[flink-dist-1.17.1-143.jar:1.17.1-143]
at 
org.apache.flink.streaming.runtime.tasks.StreamTask.restoreGates(StreamTask.java:735)
 ~[flink-dist-1.17.1-143.jar:1.17.1-143]
at 
org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$1.call(StreamTaskActionExecutor.java:55)
 ~[flink-dist-1.17.1-143.jar:1.17.1-143]
at 
org.apache.flink.streaming.runtime.tasks.StreamTask.restoreInternal(StreamTask.java:710)
 ~[flink-dist-1.17.1-143.jar:1.17.1-143]
at 
org.apache.flink.streaming.runtime.tasks.StreamTask.restore(StreamTask.java:676)
 ~[flink-dist-1.17.1-143.jar:1.17.1-143]
at 
org.apache.flink.runtime.taskmanager.Task.runWithSystemExitMonitoring(Task.java:952)
 [flink-runtime-1.17.1-143.jar:1.17.1-143]
at 
org.apache.flink.runtime.taskmanager.Task.restoreAndInvoke(Task.java:921) 
[flink-runtime-1.17.1-143.jar:1.17.1-143]
at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:745) 
[flink-runtime-1.17.1-143.jar:1.17.1-143]
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:562) 
[flink-runtime-1.17.1-143.jar:1.17.1-143]
at java.lang.Thread.run(Thread.java:829) [?:?]
2023-10-31 18:25:05,182 WARN  
org.apache.flink.streaming.api.operators.BackendRestorerProcedure [] - 
Exception while restoring keyed state backend for 
KeyedProcessOperator_353a6b34b8b7f1c1d0fb4616d911049c_(1/2) from alternative 
(1/2), will retry while more alternatives are available.
org.apache.flink.runtime.state.BackendBuildingException: Caught unexpected 
exception.
at 
org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackendBuilder.build(RocksDBKeyedStateBackendBuilder.java:407)
 ~[flink-dist-1.17.1-143.jar:1.17.1-143]
 

[jira] [Created] (FLINK-33590) CheckpointStatsTracker.totalNumberOfSubTasks not updated

2023-11-17 Thread Roman Khachatryan (Jira)
Roman Khachatryan created FLINK-33590:
-

 Summary: CheckpointStatsTracker.totalNumberOfSubTasks not updated
 Key: FLINK-33590
 URL: https://issues.apache.org/jira/browse/FLINK-33590
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Checkpointing, Runtime / Coordination
Affects Versions: 1.18.0
Reporter: Roman Khachatryan
Assignee: Roman Khachatryan
 Fix For: 1.18.1


On rescaling, the DoP is obtained from the JobGraph. 
However, JobGraph vertices are not updated once created. This results in 
missing traces on rescaling (isComplete returns false).

Instead, it should be obtained from the DoP store.





[jira] [Created] (FLINK-31261) Make AdaptiveScheduler aware of the (local) state size

2023-02-28 Thread Roman Khachatryan (Jira)
Roman Khachatryan created FLINK-31261:
-

 Summary: Make AdaptiveScheduler aware of the (local) state size
 Key: FLINK-31261
 URL: https://issues.apache.org/jira/browse/FLINK-31261
 Project: Flink
  Issue Type: Improvement
  Components: Runtime / Coordination
Affects Versions: 1.18.0
Reporter: Roman Khachatryan
Assignee: Roman Khachatryan
 Fix For: 1.18.0


FLINK-21450 makes the Adaptive Scheduler aware of Local Recovery.

Each slot-group pair is assigned a score based on a keyGroupRange size.

That score isn't always optimal; it could be improved by computing the score 
based on the actual state size on disk.





[jira] [Created] (FLINK-31601) While waiting for resources, resources check might be scheduled unlimited number of times (Adaptive Scheduler)

2023-03-23 Thread Roman Khachatryan (Jira)
Roman Khachatryan created FLINK-31601:
-

 Summary: While waiting for resources, resources check might be 
scheduled unlimited number of times (Adaptive Scheduler)
 Key: FLINK-31601
 URL: https://issues.apache.org/jira/browse/FLINK-31601
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Coordination
Affects Versions: 1.17.0
Reporter: Roman Khachatryan
 Fix For: 1.17.1


See [https://github.com/apache/flink/pull/22169#discussion_r1136395017]
{quote}when {{resourceStabilizationDeadline}} is not null, should we skip 
scheduling {{checkDesiredOrSufficientResourcesAvailable}} (on [line 
166|https://github.com/apache/flink/blob/a64781b1ef8f129021bdcddd3b07548e6caa4a72/flink-runtime/src/main/java/org/apache/flink/runtime/scheduler/adaptive/WaitingForResources.java#L166])?
Otherwise, we schedule as many checks as there are changes in resources.
{quote}
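The guard suggested in the quoted review comment amounts to not scheduling a new check while one is already pending. A minimal sketch of that debounce (hypothetical names, not the actual WaitingForResources code):

```java
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative only: at most one resource check is scheduled at a time,
// no matter how many resource-change notifications arrive.
public class ResourceCheckDebounce {
    static final AtomicInteger scheduledChecks = new AtomicInteger();
    static boolean checkPending = false;

    static void onResourcesChanged() {
        if (checkPending) {
            return; // a check is already scheduled; don't pile up another one
        }
        checkPending = true;
        scheduledChecks.incrementAndGet(); // stand-in for actually scheduling the check
    }

    static void runScheduledCheck() {
        checkPending = false;
        // ... checkDesiredOrSufficientResourcesAvailable() would run here ...
    }

    public static void main(String[] args) {
        for (int i = 0; i < 5; i++) {
            onResourcesChanged(); // five resource changes
        }
        System.out.println("scheduled=" + scheduledChecks.get()); // one check, not five
    }
}
```

Without the `checkPending` guard, every resource change schedules its own check, which is the unbounded-scheduling behavior this ticket describes.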




