Re: [VOTE] Release 1.19.1, release candidate #1

2024-06-11 Thread Matthias Pohl
+1 (binding)

* Downloaded all artifacts
* Extracted sources and ran compilation on sources
* Diff of git tag checkout with downloaded sources
* Verified SHA512 & GPG checksums
* Checked that all POMs have the right expected version
* Generated diffs to compare pom file changes with NOTICE files
* Verified WordCount in batch mode and streaming mode with a standalone
session cluster to verify the logs: no suspicious behavior observed

Best,
Matthias
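For readers unfamiliar with the checksum step listed above, here is a minimal, illustrative sketch. File names follow the Flink release layout, but a stand-in artifact is generated locally so the commands run anywhere; for a real vote you would download the .tgz and its .sha512 from dist.apache.org instead.

```shell
# Illustrative sketch of the SHA512 verification step; the artifact here
# is a locally generated stand-in, not the real release tarball.
set -eu
workdir="$(mktemp -d)"
cd "$workdir"

printf 'stand-in for the source tarball' > flink-1.19.1-src.tgz
sha512sum flink-1.19.1-src.tgz > flink-1.19.1-src.tgz.sha512

# The actual check performed during release verification:
sha512sum -c flink-1.19.1-src.tgz.sha512

# The GPG step (not runnable here without the release KEYS file) would be:
#   gpg --import KEYS
#   gpg --verify flink-1.19.1-src.tgz.asc flink-1.19.1-src.tgz
```

On success, `sha512sum -c` reports the file name followed by "OK" and exits 0.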

On Mon, Jun 10, 2024 at 12:54 PM Hong Liang  wrote:

> Thanks for testing the release candidate, everyone. Nice to see coverage on
> different types of testing being done.
>
> I've addressed the comments on the web PR - thanks Rui Fan for good
> comments, and for the reminder from Ahmed :)
>
> We have <24 hours on the vote wait time, and still waiting on 1 more
> binding vote!
>
> Regards,
> Hong
>
> On Sat, Jun 8, 2024 at 11:33 PM Ahmed Hamdy  wrote:
>
> > Hi Hong,
> > Thanks for driving
> >
> > +1 (non-binding)
> >
> > - Verified signatures and hashes
> > - Checked github release tag
> > - Verified licenses
> > - Checked that the source code does not contain binaries
> > - Reviewed Web PR, nit: Could we address the comment of adding
> FLINK-34633
> > in the release
> >
> >
> > Best Regards
> > Ahmed Hamdy
> >
> >
> > On Sat, 8 Jun 2024 at 22:22, Jeyhun Karimov 
> wrote:
> >
> > > Hi Hong,
> > >
> > > Thanks for driving the release.
> > > +1 (non-binding)
> > >
> > > - Verified gpg signature
> > > - Reviewed the PR
> > > - Verified sha512
> > > - Checked github release tag
> > > - Checked that the source code does not contain binaries
> > >
> > > Regards,
> > > Jeyhun
> > >
> > > On Sat, Jun 8, 2024 at 1:52 PM weijie guo 
> > > wrote:
> > >
> > > > Thanks Hong!
> > > >
> > > > +1(binding)
> > > >
> > > > - Verified gpg signature
> > > > - Verified sha512 hash
> > > > - Checked gh release tag
> > > > - Checked all artifacts deployed to maven repo
> > > > - Ran a simple wordcount job on local standalone cluster
> > > > - Compiled from source code with JDK 1.8.0_291.
> > > >
> > > > Best regards,
> > > >
> > > > Weijie
> > > >
> > > >
> > > > Xiqian YU wrote on Fri, Jun 7, 2024 at 18:23:
> > > >
> > > > > +1 (non-binding)
> > > > >
> > > > >
> > > > >   *   Checked download links & release tags
> > > > >   *   Verified that package checksums matched
> > > > >   *   Compiled Flink from source code with JDK 8 / 11
> > > > >   *   Ran E2e data integration test jobs on local cluster
> > > > >
> > > > > Regards,
> > > > > yux
> > > > >
> > > > > From: Rui Fan <1996fan...@gmail.com>
> > > > > Date: Friday, June 7, 2024 at 17:14
> > > > > To: dev@flink.apache.org
> > > > > Subject: Re: [VOTE] Release 1.19.1, release candidate #1
> > > > > +1(binding)
> > > > >
> > > > > - Reviewed the flink-web PR (Left some comments)
> > > > > - Checked Github release tag
> > > > > - Verified signatures
> > > > > - Verified sha512 (hashsums)
> > > > > - The source archives do not contain any binaries
> > > > > - Build the source with Maven 3 and java8 (Checked the license as
> > well)
> > > > > - Start the cluster locally with jdk8, and run the
> > StateMachineExample
> > > > job,
> > > > > it works fine.
> > > > >
> > > > > Best,
> > > > > Rui
> > > > >
> > > > > On Thu, Jun 6, 2024 at 11:39 PM Hong Liang 
> wrote:
> > > > >
> > > > > > Hi everyone,
> > > > > > Please review and vote on the release candidate #1 for the flink
> > > > v1.19.1,
> > > > > > as follows:
> > > > > > [ ] +1, Approve the release
> > > > > > [ ] -1, Do not approve the release (please provide specific
> > comments)
> > > > > >
> > > > > >
> > > > > > The complete staging area is available for your review, which
> > > includes:
> > > > > > * JIRA release notes [1],
> > > > > > * the official Apache source release and binary convenience
> > releases
> > > to
> > > > > be
> > > > > > deployed to dist.apache.org [2], which are signed with the key
> > with
> > > > > > fingerprint B78A5EA1 [3],
> > > > > > * all artifacts to be deployed to the Maven Central Repository
> [4],
> > > > > > * source code tag "release-1.19.1-rc1" [5],
> > > > > > * website pull request listing the new release and adding
> > > announcement
> > > > > blog
> > > > > > post [6].
> > > > > >
> > > > > > The vote will be open for at least 72 hours. It is adopted by
> > > majority
> > > > > > approval, with at least 3 PMC affirmative votes.
> > > > > >
> > > > > > Thanks,
> > > > > > Hong
> > > > > >
> > > > > > [1]
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522&version=12354399
> > > > > > [2]
> https://dist.apache.org/repos/dist/dev/flink/flink-1.19.1-rc1/
> > > > > > [3] https://dist.apache.org/repos/dist/release/flink/KEYS
> > > > > > [4]
> > > > > >
> > > >
> > https://repository.apache.org/content/repositories/orgapacheflink-1736/
> > > > > > [5]
> > https://github.com/apache/flink/releases/tag/release-1.19.1-rc1
> > > > > > [6] https://github.com/apache/flink-web/pull/745
> > > 

Re: [DISCUSS] FLIP-461: Synchronize rescaling with checkpoint creation to minimize reprocessing

2024-06-07 Thread Matthias Pohl
Hi Zakelly,
good point. I updated the FLIP to use "scale-on-failed-checkpoints-count"
and "max-delay-for-scale-trigger".
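For illustration only: assuming the renamed options above land under the adaptive scheduler's existing configuration prefix, the result might look roughly like this in flink-conf.yaml. The values are made up for the example, not defaults from the FLIP.

```yaml
# Hypothetical sketch based on the option names discussed in this thread:
# rescale only after 2 consecutive failed checkpoints, but never wait
# longer than 5 minutes before triggering the rescale.
jobmanager.adaptive-scheduler.scale-on-failed-checkpoints-count: 2
jobmanager.adaptive-scheduler.max-delay-for-scale-trigger: 5 min
```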

On Fri, Jun 7, 2024 at 12:18 PM Zakelly Lan  wrote:

> Hi Matthias,
>
> Thanks for your reply!
>
> That's something that could be considered as another optimization. I would
> > consider this as a possible follow-up. My concern here is that we'd make
> > the rescaling configuration even more complicated by introducing yet
> > another parameter.
>
>
> I'd be fine with considering this as a follow-up.
>
> It might be worth renaming the internal interface into something that
> > indicates its internal usage to avoid confusion.
> >
>
> Agree with this.
>
> And another question:
> I noticed the existing options under 'jobmanager.adaptive-scheduler' are
> using the word 'scaling', e.g.
> 'jobmanager.adaptive-scheduler.scaling-interval.min'. While in this FLIP
> you choose 'rescale'. Would you mind unifying them?
>
>
> Best,
> Zakelly
>
>
> On Thu, Jun 6, 2024 at 10:57 PM David Morávek 
> wrote:
>
> > Thanks for the FLIP Matthias, I think it looks pretty solid!
> >
> > I also don't see a relation to unaligned checkpoints. From the AS
> > perspective, the checkpoint time doesn't matter.
> >
> > Is it possible a change event observed right after a complete checkpoint
> > > (or within a specific short time after a checkpoint) that triggers a
> > > rescale immediately? Sometimes the checkpoint interval is huge and it
> is
> > > better to rescale immediately.
> > >
> >
> > I had considered this initially too, but it feels like a possible
> follow-up
> > optimization.
> >
> > The primary objective of the proposed solution is to enhance overall
> > predictability. With a longer checkpointing interval, the current
> situation
> > worsens as we might have to reprocess a substantial backlog.
> >
> > I think in the future we might actually want to enhance this by
> triggering
> > some kind of specialized "rescaling" checkpoint that prepares the cluster
> > for rescaling (eg. by replicating state to new slots / pre-splitting the
> > db, ...), to make things faster.
> >
> > Best,
> > D.
> >
> > On Wed, Jun 5, 2024 at 4:34 PM Matthias Pohl  wrote:
> >
> > > Hi Zakelly,
> > > thanks for your reply. See my inlined responses below:
> > >
> > > On Wed, Jun 5, 2024 at 10:26 AM Zakelly Lan 
> > wrote:
> > >
> > > > Hi Matthias,
> > > >
> > > > Thanks for your proposal! I have a few questions:
> > > >
> > > > 1. Is it possible a change event observed right after a complete
> > > checkpoint
> > > > (or within a specific short time after a checkpoint) that triggers a
> > > > rescale immediately? Sometimes the checkpoint interval is huge and it
> > is
> > > > better to rescale immediately.
> > > >
> > >
> > > That's something that could be considered as another optimization. I
> > would
> > > consider this as a possible follow-up. My concern here is that we'd
> make
> > > the rescaling configuration even more complicated by introducing yet
> > > another parameter.
> > >
> > >
> > > > 2. Should we introduce `CheckpointLifecycleListener` instead of
> reusing
> > > > `CheckpointListener`? Is `CheckpointListener` enough for this
> scenario?
> > > >
> > >
> > > Good point, they are serving similar purposes. But I'm hesitant to use
> > > CheckpointListener (which is a public interface) for this internal
> quite
> > > narrowly scoped runtime-specific use case of FLIP-461.
> > >
> > > It might be worth renaming the internal interface into something that
> > > indicates its internal usage to avoid confusion.
> > >
> > >
> > > > Best,
> > > > Zakelly
> > > >
> > > > On Wed, Jun 5, 2024 at 3:02 PM Matthias Pohl 
> > wrote:
> > > >
> > > > > Hi ConradJam,
> > > > > thanks for your response.
> > > > >
> > > > > The CheckpointStatsTracker gets notified about the checkpoint
> > > completion
> > > > > after the checkpoint is finalized, i.e. all its data is persisted
> and
> > > the
> > > > > metadata is written to the CompletedCheckpointStore. At this
> moment,
> > > the
> > > > > checkpoint is considered for restoring a job and, therefore, becomes
> > > > > available for restarts.

[jira] [Created] (FLINK-35553) Integrate newly added trigger interface with checkpointing

2024-06-07 Thread Matthias Pohl (Jira)
Matthias Pohl created FLINK-35553:
-

 Summary: Integrate newly added trigger interface with checkpointing
 Key: FLINK-35553
 URL: https://issues.apache.org/jira/browse/FLINK-35553
 Project: Flink
  Issue Type: Sub-task
  Components: Runtime / Checkpointing, Runtime / Coordination
Reporter: Matthias Pohl


This connects the newly introduced trigger logic (FLINK-35551) with the newly 
added checkpoint lifecycle listening feature (FLINK-35552).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-35552) Move CheckpointStatsTracker out of ExecutionGraph into Scheduler

2024-06-07 Thread Matthias Pohl (Jira)
Matthias Pohl created FLINK-35552:
-

 Summary: Move CheckpointStatsTracker out of ExecutionGraph into 
Scheduler
 Key: FLINK-35552
 URL: https://issues.apache.org/jira/browse/FLINK-35552
 Project: Flink
  Issue Type: Sub-task
  Components: Runtime / Checkpointing, Runtime / Coordination
Reporter: Matthias Pohl


The scheduler needs to know about the CheckpointStatsTracker to allow listening 
to checkpoint failures and completion.





[jira] [Created] (FLINK-35551) Introduces RescaleManager#onTrigger endpoint

2024-06-07 Thread Matthias Pohl (Jira)
Matthias Pohl created FLINK-35551:
-

 Summary: Introduces RescaleManager#onTrigger endpoint
 Key: FLINK-35551
 URL: https://issues.apache.org/jira/browse/FLINK-35551
 Project: Flink
  Issue Type: Sub-task
Reporter: Matthias Pohl


The new endpoint would allow us to separate observing change events from 
actually triggering the rescale operation.





[jira] [Created] (FLINK-35550) Introduce new component RescaleManager

2024-06-07 Thread Matthias Pohl (Jira)
Matthias Pohl created FLINK-35550:
-

 Summary: Introduce new component RescaleManager
 Key: FLINK-35550
 URL: https://issues.apache.org/jira/browse/FLINK-35550
 Project: Flink
  Issue Type: Sub-task
  Components: Runtime / Coordination
Reporter: Matthias Pohl


The goal here is to collect the rescaling logic in a single component to 
improve testability.





[jira] [Created] (FLINK-35549) FLIP-461: Synchronize rescaling with checkpoint creation to minimize reprocessing for the AdaptiveScheduler

2024-06-07 Thread Matthias Pohl (Jira)
Matthias Pohl created FLINK-35549:
-

 Summary: FLIP-461: Synchronize rescaling with checkpoint creation 
to minimize reprocessing for the AdaptiveScheduler
 Key: FLINK-35549
 URL: https://issues.apache.org/jira/browse/FLINK-35549
 Project: Flink
  Issue Type: Improvement
  Components: Runtime / Checkpointing, Runtime / Coordination
Affects Versions: 1.20.0
Reporter: Matthias Pohl


This is the umbrella issue for implementing 
[FLIP-461|https://cwiki.apache.org/confluence/display/FLINK/FLIP-461%3A+Synchronize+rescaling+with+checkpoint+creation+to+minimize+reprocessing+for+the+AdaptiveScheduler]





Re: Savepoints not considered during failover

2024-06-07 Thread Matthias Pohl
One reason could be that the savepoints are self-contained, owned by the
user rather than Flink and, therefore, could be moved. Flink wouldn't have
a proper reference in that case anymore.

I don't have a link to a discussion, though.

Best,
Matthias

On Fri, Jun 7, 2024 at 8:47 AM Gyula Fóra  wrote:

> Hey Devs!
>
> What is the reason / rationale for savepoints being ignored during failover
> scenarios?
>
> I see they are not even recorded as the last valid checkpoint in the HA
> metadata (only the checkpoint id counter is bumped) so if the JM fails
> after a manually triggered savepoint the job will still fall back to the
> previous checkpoint instead.
>
> I am sure there must have been some discussion around it but I can't find
> it.
>
> Thank you!
> Gyula
>


Re: [DISCUSS] FLIP-461: Synchronize rescaling with checkpoint creation to minimize reprocessing

2024-06-05 Thread Matthias Pohl
Hi Zakelly,
thanks for your reply. See my inlined responses below:

On Wed, Jun 5, 2024 at 10:26 AM Zakelly Lan  wrote:

> Hi Matthias,
>
> Thanks for your proposal! I have a few questions:
>
> 1. Is it possible a change event observed right after a complete checkpoint
> (or within a specific short time after a checkpoint) that triggers a
> rescale immediately? Sometimes the checkpoint interval is huge and it is
> better to rescale immediately.
>

That's something that could be considered as another optimization. I would
consider this as a possible follow-up. My concern here is that we'd make
the rescaling configuration even more complicated by introducing yet
another parameter.


> 2. Should we introduce `CheckpointLifecycleListener` instead of reusing
> `CheckpointListener`? Is `CheckpointListener` enough for this scenario?
>

Good point, they are serving similar purposes. But I'm hesitant to use
CheckpointListener (which is a public interface) for this internal quite
narrowly scoped runtime-specific use case of FLIP-461.

It might be worth renaming the internal interface into something that
indicates its internal usage to avoid confusion.


> Best,
> Zakelly
>
> On Wed, Jun 5, 2024 at 3:02 PM Matthias Pohl  wrote:
>
> > Hi ConradJam,
> > thanks for your response.
> >
> > The CheckpointStatsTracker gets notified about the checkpoint completion
> > after the checkpoint is finalized, i.e. all its data is persisted and the
> > metadata is written to the CompletedCheckpointStore. At this moment, the
> > checkpoint is considered for restoring a job and, therefore, becomes
> > available for restarts. This workflow also applies to unaligned
> > checkpoints. But I see how this context might be helpful for
> understanding
> > the change. I will add it to the FLIP. So far, I don't see a reason
> > to disable the feature for unaligned checkpoints. Do you see other issues
> > that might make it necessary to disable this feature for this type of
> > checkpoints?
> >
> > Can you elaborate a bit more what you mean by "checkpoints that do not
> > check it"? I do not fully understand what you are referring to with "it"
> > here.
> >
> > Best,
> > Matthias
> >
> > On Wed, Jun 5, 2024 at 4:46 AM ConradJam  wrote:
> >
> > > I have a few questions:
> > > Unaligned checkpoints Do we need to enable this feature? Whether this
> > > feature should be disabled for checkpoints that do not check it
> > >
> > > Matthias Pohl wrote on Tue, Jun 4, 2024 at 18:03:
> > >
> > > > Hi everyone,
> > > > I'd like to discuss FLIP-461 [1]. The FLIP proposes the
> synchronization
> > > of
> > > > rescaling and the completion of checkpoints. The idea is to reduce
> the
> > > > amount of data that needs to be processed after rescaling happened. A
> > > more
> > > > detailed motivation can be found in FLIP-461.
> > > >
> > > > I'm looking forward to feedback and suggestions.
> > > >
> > > > Best,
> > > > Matthias
> > > >
> > > > [1]
> > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-461%3A+Synchronize+rescaling+with+checkpoint+creation+to+minimize+reprocessing
> > > >
> > >
> > >
> > > --
> > > Best
> > >
> > > ConradJam
> > >
> >
>


Re: [DISCUSS] FLIP-461: Synchronize rescaling with checkpoint creation to minimize reprocessing

2024-06-05 Thread Matthias Pohl
Thanks Rui for your reply. Find my answers inlined below:

[1]
https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#jobmanager-adaptive-scheduler-resource-stabilization-timeout

On Wed, Jun 5, 2024 at 10:16 AM Rui Fan <1996fan...@gmail.com> wrote:

> Thanks Matthias for driving this proposal!
>
> This proposal can reduce the amount of data that is processed repeatedly
> after rescaling, so this proposal makes sense to me.
>
> I have some questions:
> 1. The public change only includes the "New Configuration Parameters"
> part, right?
>

Correct. I updated the section to make this a bit clearer.


> 2. jobmanager.adaptive-scheduler.rescale-on-failed-checkpoints-count is
> obviously a config option for users. But I'm not sure whether
> jobmanager.adaptive-scheduler.max-delay-for-rescale-trigger is a config
> option or an internal logic? I saw it's computed by
> rescale-on-failed-checkpoints-count.
>

That's a fair point. I wanted the user to be able to go back to the old
implementation even if checkpointing is enabled. One could argue that there
is no need for the parameter being expressed through a Duration. The only
motivation of delaying the rescaling might be waiting for consecutive
change events (which is similar to what we already have with
resource-stabilization-timeout that is utilized in the WaitingForResource
state [1]). Maybe, let's wait for other feedback here.


> 3. I'm not sure if the default value of rescale-on-failed-checkpoints-count
> should be 1 or is greater than 1 better?
>If 1 as the default value, when the checkpoint fails occasionally, and
> rescale happens, flink job will process a series of repeated data as well.
>If 2 as the default value, when the checkpoint fails occasionally, and
> the next checkpoint succeeds, the flink job won't process repeated data.
>

You're right. Using 2 as a default value sounds reasonable to work around
occasional "hiccups". My main motivation to set it to 1 was to be as close
as possible to the current (pre-FLIP-461) behavior where the rescale
happens immediately.

But I start to lean towards following your proposal here. I won't update
the FLIP in this regard for now to see what others have to say.

4. The description of rescale-on-failed-checkpoints-count is
>   "The number of subsequent failed checkpoints that will initiate
> rescaling."
>   IIUC, the "consecutive" is more accurate than subsequent here. WDYT?
>

Good idea. I will update the FLIP accordingly.


> 5. Proposed Changes part is specific implementation, I'm not sure whether
>all internal interfaces are best for the current version. So I cannot
> give any suggestion or feedback for now. But I'm happy to review them when
> your PR is ready if I have time.
>   Feel free to cc me (I'm interested in Adaptive Scheduler)
>

Will do.


> 6. This proposal aims to improve one logic inside of Adaptive Scheduler.
>Would you mind mentioning Adaptive Scheduler in the FLIP title? It will
>be useful for users to understand which component this proposal belongs
> to.
>

Good point. I updated the FLIP's title.


>
> Also, I don't understand why this proposal needs to care about whether the
> checkpoint is an unaligned checkpoint or an aligned checkpoint.
>
> Please correct me if anything is wrong, thanks.
>
> Best,
> Rui

On Wed, Jun 5, 2024 at 3:01 PM Matthias Pohl  wrote:
>
> > Hi ConradJam,
> > thanks for your response.
> >
> > The CheckpointStatsTracker gets notified about the checkpoint completion
> > after the checkpoint is finalized, i.e. all its data is persisted and the
> > metadata is written to the CompletedCheckpointStore. At this moment, the
> > checkpoint is considered for restoring a job and, therefore, becomes
> > available for restarts. This workflow also applies to unaligned
> > checkpoints. But I see how this context might be helpful for
> understanding
> > the change. I will add it to the FLIP. So far, I don't see a reason
> > to disable the feature for unaligned checkpoints. Do you see other issues
> > that might make it necessary to disable this feature for this type of
> > checkpoints?
> >
> > Can you elaborate a bit more what you mean by "checkpoints that do not
> > check it"? I do not fully understand what you are referring to with "it"
> > here.
> >
> > Best,
> > Matthias
> >
> > On Wed, Jun 5, 2024 at 4:46 AM ConradJam  wrote:
> >
> > > I have a few questions:
> > > Unaligned checkpoints Do we need to enable this feature? Whether this
> > > feature should be disabled for checkpoints that do not check it
> > >
> > > Matthias Pohl wrote on Tue, Jun 4, 2024 at 18:03:

Re: [DISCUSS] FLIP-461: Synchronize rescaling with checkpoint creation to minimize reprocessing

2024-06-05 Thread Matthias Pohl
Hi ConradJam,
thanks for your response.

The CheckpointStatsTracker gets notified about the checkpoint completion
after the checkpoint is finalized, i.e. all its data is persisted and the
metadata is written to the CompletedCheckpointStore. At this moment, the
checkpoint is considered for restoring a job and, therefore, becomes
available for restarts. This workflow also applies to unaligned
checkpoints. But I see how this context might be helpful for understanding
the change. I will add it to the FLIP. So far, I don't see a reason
to disable the feature for unaligned checkpoints. Do you see other issues
that might make it necessary to disable this feature for this type of
checkpoints?

Can you elaborate a bit more what you mean by "checkpoints that do not
check it"? I do not fully understand what you are referring to with "it"
here.

Best,
Matthias

On Wed, Jun 5, 2024 at 4:46 AM ConradJam  wrote:

> I have a few questions:
> Unaligned checkpoints Do we need to enable this feature? Whether this
> feature should be disabled for checkpoints that do not check it
>
> Matthias Pohl wrote on Tue, Jun 4, 2024 at 18:03:
>
> > Hi everyone,
> > I'd like to discuss FLIP-461 [1]. The FLIP proposes the synchronization
> of
> > rescaling and the completion of checkpoints. The idea is to reduce the
> > amount of data that needs to be processed after rescaling happened. A
> more
> > detailed motivation can be found in FLIP-461.
> >
> > I'm looking forward to feedback and suggestions.
> >
> > Best,
> > Matthias
> >
> > [1]
> >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-461%3A+Synchronize+rescaling+with+checkpoint+creation+to+minimize+reprocessing
> >
>
>
> --
> Best
>
> ConradJam
>


[DISCUSS] FLIP-461: Synchronize rescaling with checkpoint creation to minimize reprocessing

2024-06-04 Thread Matthias Pohl
Hi everyone,
I'd like to discuss FLIP-461 [1]. The FLIP proposes the synchronization of
rescaling and the completion of checkpoints. The idea is to reduce the
amount of data that needs to be processed after rescaling happened. A more
detailed motivation can be found in FLIP-461.

I'm looking forward to feedback and suggestions.

Best,
Matthias

[1]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-461%3A+Synchronize+rescaling+with+checkpoint+creation+to+minimize+reprocessing


Re: [ANNOUNCE] New Apache Flink PMC Member - Weijie Guo

2024-06-04 Thread Matthias Pohl
Congratulations, Weijie!

Matthias

On Tue, Jun 4, 2024 at 11:12 AM Guowei Ma  wrote:

> Congratulations!
>
> Best,
> Guowei
>
>
> On Tue, Jun 4, 2024 at 4:55 PM gongzhongqiang 
> wrote:
>
> > Congratulations Weijie!
> >
> > Best,
> > Zhongqiang Gong
> >
> > Xintong Song wrote on Tue, Jun 4, 2024 at 14:46:
> >
> > > Hi everyone,
> > >
> > > On behalf of the PMC, I'm very happy to announce that Weijie Guo has
> > joined
> > > the Flink PMC!
> > >
> > > Weijie has been an active member of the Apache Flink community for many
> > > years. He has made significant contributions in many components,
> > including
> > > runtime, shuffle, sdk, connectors, etc. He has driven / participated in
> > > many FLIPs, authored and reviewed hundreds of PRs, been consistently
> > active
> > > on mailing lists, and also helped with release management of 1.20 and
> > > several other bugfix releases.
> > >
> > > Congratulations and welcome Weijie!
> > >
> > > Best,
> > >
> > > Xintong (on behalf of the Flink PMC)
> > >
> >
>


Re: [DISCUSS] Proposing an LTS Release for the 1.x Line

2024-05-27 Thread Matthias Pohl
that maintainers will have to spend time on two release
> > > > versions.
> > > > > > As
> > > > > > > > > > the codebases diverge more and more, this will just
> become
> > > > > > > > > > increasingly more complex.
> > > > > > > > > >
> > > > > > > > > > With that being said, I do think that it makes sense to
> > also
> > > > > > > formalize
> > > > > > > > > > the result of this discussion in a FLIP. That's just
> easier
> > > to
> > > > > > point
> > > > > > > > > > users towards at a later stage.
> > > > > > > > > >
> > > > > > > > > > Best regards,
> > > > > > > > > >
> > > > > > > > > > Martijn
> > > > > > > > > >
> > > > > > > > > > On Mon, Dec 4, 2023 at 9:55 PM Alexander Fedulov
> > > > > > > > > >  wrote:
> > > > > > > > > > >
> > > > > > > > > > > Hi everyone,
> > > > > > > > > > >
> > > > > > > > > > > As we progress with the 1.19 release, which might
> > > potentially
> > > > > > > > (although
> > > > > > > > > > not
> > > > > > > > > > > likely) be the last in the 1.x line, I'd like to revive
> > our
> > > > > > > > discussion on
> > > > > > > > > > > the
> > > > > > > > > > > LTS support matter. There is a general consensus that
> due
> > > to
> > > > > > > > breaking API
> > > > > > > > > > > changes in 2.0, extending bug fixes support by
> > designating
> > > an
> > > > > LTS
> > > > > > > > release
> > > > > > > > > > > is
> > > > > > > > > > > something we want to do.
> > > > > > > > > > >
> > > > > > > > > > > To summarize, the approaches we've considered are:
> > > > > > > > > > >
> > > > > > > > > > > Time-based: The last release of the 1.x line gets a
> clear
> > > > > > > end-of-life
> > > > > > > > > > date
> > > > > > > > > > > (2 years).
> > > > > > > > > > > Release-based: The last release of the 1.x line gets
> > > support
> > > > > for
> > > > > > 4
> > > > > > > > minor
> > > > > > > > > > > releases in the 2.x line. The exact time is unknown,
> but
> > we
> > > > > > assume
> > > > > > > > it to
> > > > > > > > > > be
> > > > > > > > > > > roughly 2 years.
> > > > > > > > > > > LTS-to-LTS release: The last release of the 1.x line is
> > > > > supported
> > > > > > > > until
> > > > > > > > > > the
> > > > > > > > > > > last release in the 2.x line is designated as LTS.
> > > > > > > > > > >
> > > > > > > > > > > We need to strike a balance between being user-friendly
> > and
> > > > > > nudging
> > > > > > > > > > people
> > > > > > > > > > > to
> > > > > > > > > > > upgrade. From that perspective, option 1 is my
> favorite -
> > > we
> > > > > all
> > > > > > > know
> > > > > > > > > > that
> > > > > > > > > > > having a clear deadline works wonders in motivating
> > action.
> > > > At
> > > > > > the
> > > > > > > > same
> > > > > > > > > > > time,
> > > > > > > > > > > I appreciate that we might not want to introduce new
> > kinds
> > > of
> > > > > > > > procedures,
> > > > > > > > > > > so
> 

[FYI] The Azure CI for PRs is currently not triggered

2024-04-04 Thread Matthias Pohl
Hi everyone,
just for your information: The Azure CI for PRs is currently not working.
This started to happen on Tuesday (April 2 at around 7pm (CEST)).
FLINK-34999 [1] covers the issue.

We're expecting the issue to be gone by today. But in the meantime, these
are the things you can do:
1. Wait for FLINK-34999 to be fixed before merging your PR.
2. Check the GHA workflow run for the PR and commit in your fork (and share
the link in your PR for documentation).
3. Azure Pipelines CI is still triggered for pushes to master and the
release branches [2], i.e. if you decide to merge, monitor these builds
closely.

That said, option (1), i.e. waiting for FLINK-34999 to be fixed, is still
the preferred way considering that we have Azure Pipelines defined as our
ground of truth for now and that the issue is going to be fixed today,
hopefully. Additionally, merging a change without a CI run isn't the best
option, either. But I still want to be transparent about your options.

Matthias

[1] https://issues.apache.org/jira/browse/FLINK-34999
[2]
https://dev.azure.com/apache-flink/apache-flink/_build?definitionId=1&_a=summary

-- 

*Matthias Pohl*
Opensource Software Engineer, *Aiven*
matthias.p...@aiven.io  |  +49 170 9869525
aiven.io <https://www.aiven.io>

*Aiven Deutschland GmbH*
Alexanderufer 3-7, 10117 Berlin
Geschäftsführer: Oskari Saarenmaa & Hannu Valtonen
Amtsgericht Charlottenburg, HRB 209739 B

[jira] [Created] (FLINK-35000) PullRequest template doesn't use the correct format to refer to the testing code convention

2024-04-03 Thread Matthias Pohl (Jira)
Matthias Pohl created FLINK-35000:
-

 Summary: PullRequest template doesn't use the correct format to 
refer to the testing code convention
 Key: FLINK-35000
 URL: https://issues.apache.org/jira/browse/FLINK-35000
 Project: Flink
  Issue Type: Bug
  Components: Build System / CI, Project Website
Affects Versions: 1.18.1, 1.19.0, 1.20.0
Reporter: Matthias Pohl


The PR template refers to 
https://flink.apache.org/contributing/code-style-and-quality-common.html#testing
 rather than 
https://flink.apache.org/how-to-contribute/code-style-and-quality-common/#7-testing





[jira] [Created] (FLINK-34999) PR CI stopped operating

2024-04-03 Thread Matthias Pohl (Jira)
Matthias Pohl created FLINK-34999:
-

 Summary: PR CI stopped operating
 Key: FLINK-34999
 URL: https://issues.apache.org/jira/browse/FLINK-34999
 Project: Flink
  Issue Type: Bug
  Components: Build System / CI
Affects Versions: 1.18.1, 1.19.0, 1.20.0
Reporter: Matthias Pohl


There are no [new PR CI 
runs|https://dev.azure.com/apache-flink/apache-flink/_build?definitionId=2] 
being picked up anymore. [Recently updated 
PRs|https://github.com/apache/flink/pulls?q=sort%3Aupdated-desc] are not picked 
up by the @flinkbot.

In the meantime there was a notification sent from GitHub that the password of 
the @flinkbot was reset for security reasons. It's quite likely that these two 
events are related.





[jira] [Created] (FLINK-34989) Apache Infra requests to reduce the runner usage for a project

2024-04-02 Thread Matthias Pohl (Jira)
Matthias Pohl created FLINK-34989:
-

 Summary: Apache Infra requests to reduce the runner usage for a 
project
 Key: FLINK-34989
 URL: https://issues.apache.org/jira/browse/FLINK-34989
 Project: Flink
  Issue Type: Sub-task
  Components: Build System / CI
Affects Versions: 1.18.1, 1.19.0, 1.20.0
Reporter: Matthias Pohl


The GitHub Actions CI utilizes runners that are hosted by Apache Infra right 
now. These runners are limited. The runner usage can be monitored via the 
following links:
* [Flink-specific 
report|https://infra-reports.apache.org/#ghactions=flink=168] 
(needs ASF committer rights). This project-specific report can only be adjusted 
through the HTTP GET parameters of the URL.
* [Global report|https://infra-reports.apache.org/#ghactions] (needs ASF 
membership)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-34988) Class loading issues in JDK17 and JDK21

2024-04-02 Thread Matthias Pohl (Jira)
Matthias Pohl created FLINK-34988:
-

 Summary: Class loading issues in JDK17 and JDK21
 Key: FLINK-34988
 URL: https://issues.apache.org/jira/browse/FLINK-34988
 Project: Flink
  Issue Type: Bug
  Components: API / DataStream
Affects Versions: 1.20.0
Reporter: Matthias Pohl


* JDK 17 (core; NoClassDefFoundError caused by ExceptionInInitializerError): 
https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=58676=logs=675bf62c-8558-587e-2555-dcad13acefb5=5878eed3-cc1e-5b12-1ed0-9e7139ce0992=12942
* JDK 17 (misc; ExceptionInInitializerError): 
https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=58676=logs=d871f0ce-7328-5d00-023b-e7391f5801c8=77cbea27-feb9-5cf5-53f7-3267f9f9c6b6=22548
* JDK 21 (core; same as above): 
https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=58676=logs=d06b80b4-9e88-5d40-12a2-18072cf60528=609ecd5a-3f6e-5d0c-2239-2096b155a4d0=12963
* JDK 21 (misc; same as above): 
https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=58676=logs=59a2b95a-736b-5c46-b3e0-cee6e587fd86=c301da75-e699-5c06-735f-778207c16f50=22506



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-34961) GitHub Actions statistics can be monitored per workflow name

2024-03-28 Thread Matthias Pohl (Jira)
Matthias Pohl created FLINK-34961:
-

 Summary: GitHub Actions statistics can be monitored per workflow 
name
 Key: FLINK-34961
 URL: https://issues.apache.org/jira/browse/FLINK-34961
 Project: Flink
  Issue Type: Improvement
  Components: Build System / CI
Reporter: Matthias Pohl


Apache Infra allows monitoring runner usage per workflow (see [report 
for 
Flink|https://infra-reports.apache.org/#ghactions=flink=168=10];
 only accessible with Apache committer rights). The data is accumulated by 
workflow name. The Flink space has multiple repositories that use the generic 
workflow name {{CI}}, which makes differentiating them in the report harder.

This Jira issue is about identifying all Flink-related repositories with a 
generic CI workflow name (the Kubernetes operator and the JDBC connector were 
identified, for instance) and giving them more distinct names.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-34940) LeaderContender implementations handle invalid state

2024-03-26 Thread Matthias Pohl (Jira)
Matthias Pohl created FLINK-34940:
-

 Summary: LeaderContender implementations handle invalid state
 Key: FLINK-34940
 URL: https://issues.apache.org/jira/browse/FLINK-34940
 Project: Flink
  Issue Type: Technical Debt
  Components: Runtime / Coordination
Reporter: Matthias Pohl


Currently, LeaderContender implementations (e.g. see 
[ResourceManagerServiceImplTest#grantLeadership_withExistingLeader_waitTerminationOfExistingLeader|https://github.com/apache/flink/blob/master/flink-runtime/src/test/java/org/apache/flink/runtime/resourcemanager/ResourceManagerServiceImplTest.java#L219])
 allow handling two leader events of the same type in a row, which shouldn't 
be possible.

Two subsequent leadership grants indicate that the instance which received the 
second grant missed the leadership revocation event in between, leaving the 
overall deployment in an invalid state (i.e. a split-brain scenario). We 
should fail fatally in these scenarios rather than handle them.
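The fail-fast handling described above can be sketched as follows. The class and method names here are hypothetical stand-ins for illustration, not Flink's actual LeaderContender API:

```java
import java.util.UUID;

// Minimal sketch (hypothetical names, not Flink's actual API): a contender
// that treats a second leadership grant without an intermediate revocation
// as a fatal error instead of handling it silently.
class StrictLeaderContender {
    private UUID currentSessionId; // null while not leading

    void grantLeadership(UUID sessionId) {
        if (currentSessionId != null) {
            // A repeated grant means the revocation event was missed:
            // potential split brain, so fail fatally.
            throw new IllegalStateException(
                    "Duplicate leadership grant; revocation for "
                            + currentSessionId + " was never received");
        }
        currentSessionId = sessionId;
    }

    void revokeLeadership() {
        if (currentSessionId == null) {
            throw new IllegalStateException("Revocation without a grant");
        }
        currentSessionId = null;
    }

    boolean isLeading() {
        return currentSessionId != null;
    }
}
```

The point is that the second grant surfaces as a hard failure instead of being absorbed, which is what makes a missed revocation (and hence a possible split brain) visible.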



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-34939) Harden TestingLeaderElection

2024-03-26 Thread Matthias Pohl (Jira)
Matthias Pohl created FLINK-34939:
-

 Summary: Harden TestingLeaderElection
 Key: FLINK-34939
 URL: https://issues.apache.org/jira/browse/FLINK-34939
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Coordination
Affects Versions: 1.18.1, 1.19.0, 1.20.0
Reporter: Matthias Pohl


The {{TestingLeaderElection}} implementation does not follow the interface 
contract of {{LeaderElection}} in all of its facets (e.g. leadership 
acquisition and revocation events should alternate).

This issue is about hardening the {{LeaderElection}} contract in the test 
implementation.
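Hardening a test double along these lines could look like the following sketch; the names are illustrative, not Flink's actual test classes:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch (hypothetical names): a testing double that records leadership
// events and enforces the contract that grants and revocations alternate,
// failing the test early on a contract violation.
class ContractCheckingLeaderElection {
    private final List<String> events = new ArrayList<>();

    void grant() { record("GRANT"); }

    void revoke() { record("REVOKE"); }

    private void record(String event) {
        // Two identical events in a row violate the alternation contract.
        if (!events.isEmpty() && events.get(events.size() - 1).equals(event)) {
            throw new IllegalStateException(
                    "Contract violation: two subsequent " + event + " events");
        }
        events.add(event);
    }

    List<String> recordedEvents() { return events; }
}
```

A test double that enforces the contract eagerly turns a subtle contract violation in production code into an immediate, attributable test failure.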



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-34937) Apache Infra GHA policy update

2024-03-26 Thread Matthias Pohl (Jira)
Matthias Pohl created FLINK-34937:
-

 Summary: Apache Infra GHA policy update
 Key: FLINK-34937
 URL: https://issues.apache.org/jira/browse/FLINK-34937
 Project: Flink
  Issue Type: Bug
  Components: Build System / CI
Affects Versions: 1.18.1, 1.19.0, 1.20.0
Reporter: Matthias Pohl


There is a policy update [announced on the infra 
ML|https://lists.apache.org/thread/6qw21x44q88rc3mhkn42jgjjw94rsvb1] asking 
Apache projects to limit the number of runners per job. Additionally, it 
references the [GHA policy|https://infra.apache.org/github-actions-policy.html], 
which I wasn't aware of when working on the action workflows.

This issue is about applying the policy to the Flink GHA workflows.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-34933) JobMasterServiceLeadershipRunnerTest#testResultFutureCompletionOfOutdatedLeaderIsIgnored isn't implemented properly

2024-03-25 Thread Matthias Pohl (Jira)
Matthias Pohl created FLINK-34933:
-

 Summary: 
JobMasterServiceLeadershipRunnerTest#testResultFutureCompletionOfOutdatedLeaderIsIgnored
 isn't implemented properly
 Key: FLINK-34933
 URL: https://issues.apache.org/jira/browse/FLINK-34933
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Coordination
Affects Versions: 1.18.1, 1.19.0, 1.17.2, 1.20.0
Reporter: Matthias Pohl


{{testResultFutureCompletionOfOutdatedLeaderIsIgnored}} doesn't test the 
desired behavior: the {{TestingJobMasterService#closeAsync()}} callback throws 
an {{UnsupportedOperationException}} by default, which prevents the test from 
properly finalizing the leadership revocation.

The test still passes because it implicitly checks for this error. Instead, 
we should verify that the runner's resultFuture doesn't complete until the 
runner is closed.
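The intended assertion can be sketched as below; the runner class here is a hypothetical stand-in, not Flink's actual JobMasterServiceLeadershipRunner:

```java
import java.util.concurrent.CompletableFuture;

// Sketch (hypothetical stand-in): the result future must stay incomplete
// until the runner is closed.
class ClosableRunner {
    private final CompletableFuture<Void> resultFuture = new CompletableFuture<>();

    CompletableFuture<Void> getResultFuture() {
        return resultFuture;
    }

    void close() {
        // Only closing the runner completes the result future.
        resultFuture.complete(null);
    }
}
```

A test against such a runner would assert `!getResultFuture().isDone()` before `close()` and `isDone()` afterwards, instead of relying on an incidental exception.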



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-34921) SystemProcessingTimeServiceTest fails due to missing output

2024-03-22 Thread Matthias Pohl (Jira)
Matthias Pohl created FLINK-34921:
-

 Summary: SystemProcessingTimeServiceTest fails due to missing 
output
 Key: FLINK-34921
 URL: https://issues.apache.org/jira/browse/FLINK-34921
 Project: Flink
  Issue Type: Bug
  Components: API / DataStream
Affects Versions: 1.20.0
Reporter: Matthias Pohl


This PR CI build with {{AdaptiveScheduler}} enabled failed:
https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=58476=logs=0da23115-68bb-5dcd-192c-bd4c8adebde1=24c3384f-1bcb-57b3-224f-51bf973bbee8=11224

{code}
"ForkJoinPool-61-worker-25" #863 daemon prio=5 os_prio=0 tid=0x7f8c19eba000 
nid=0x60a5 waiting on condition [0x7f8bc2cf9000]
Mar 21 17:19:42java.lang.Thread.State: WAITING (parking)
Mar 21 17:19:42 at sun.misc.Unsafe.park(Native Method)
Mar 21 17:19:42 - parking to wait for  <0xd81959b8> (a 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask)
Mar 21 17:19:42 at 
java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
Mar 21 17:19:42 at 
java.util.concurrent.FutureTask.awaitDone(FutureTask.java:429)
Mar 21 17:19:42 at 
java.util.concurrent.FutureTask.get(FutureTask.java:191)
Mar 21 17:19:42 at 
org.apache.flink.streaming.runtime.tasks.SystemProcessingTimeServiceTest$$Lambda$1443/1477662666.call(Unknown
 Source)
Mar 21 17:19:42 at 
org.assertj.core.api.ThrowableAssert.catchThrowable(ThrowableAssert.java:63)
Mar 21 17:19:42 at 
org.assertj.core.api.AssertionsForClassTypes.catchThrowable(AssertionsForClassTypes.java:892)
Mar 21 17:19:42 at 
org.assertj.core.api.Assertions.catchThrowable(Assertions.java:1366)
Mar 21 17:19:42 at 
org.assertj.core.api.Assertions.assertThatThrownBy(Assertions.java:1210)
Mar 21 17:19:42 at 
org.apache.flink.streaming.runtime.tasks.SystemProcessingTimeServiceTest.testQuiesceAndAwaitingCancelsScheduledAtFixRateFuture(SystemProcessingTimeServiceTest.java:92)
{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-34897) JobMasterServiceLeadershipRunnerTest#testJobMasterServiceLeadershipRunnerCloseWhenElectionServiceGrantLeaderShip needs to be enabled again

2024-03-20 Thread Matthias Pohl (Jira)
Matthias Pohl created FLINK-34897:
-

 Summary: 
JobMasterServiceLeadershipRunnerTest#testJobMasterServiceLeadershipRunnerCloseWhenElectionServiceGrantLeaderShip
 needs to be enabled again
 Key: FLINK-34897
 URL: https://issues.apache.org/jira/browse/FLINK-34897
 Project: Flink
  Issue Type: Technical Debt
  Components: Runtime / Coordination
Affects Versions: 1.18.1, 1.19.0, 1.17.2, 1.20.0
Reporter: Matthias Pohl


While working on FLINK-34672 I noticed that 
{{JobMasterServiceLeadershipRunnerTest#testJobMasterServiceLeadershipRunnerCloseWhenElectionServiceGrantLeaderShip}}
 is disabled without a reason.

It looks like I disabled it accidentally as part of FLINK-31783.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-34695) Move Flink's CI docker container into a public repo

2024-03-15 Thread Matthias Pohl (Jira)
Matthias Pohl created FLINK-34695:
-

 Summary: Move Flink's CI docker container into a public repo
 Key: FLINK-34695
 URL: https://issues.apache.org/jira/browse/FLINK-34695
 Project: Flink
  Issue Type: Improvement
  Components: Build System / CI
Affects Versions: 1.18.1, 1.19.0, 1.20.0
Reporter: Matthias Pohl


Currently, Flink's CI (GitHub Actions and Azure Pipelines) uses a container to 
run the logic. The intention behind it is to have a way to mimic the CI setup 
locally as well.

The current Docker image is maintained in the 
[zentol/flink-ci-docker|https://github.com/zentol/flink-ci-docker] fork (owned 
by [~chesnay]) of 
[flink-ci/flink-ci-docker|https://github.com/flink-ci/flink-ci-docker] (owned 
by Ververica), which is not ideal. We should move this repo into an 
Apache-owned repository.

Additionally, there's no workflow pushing the image automatically to a 
registry from where it can be used. Instead, the images were pushed to personal 
Docker Hub repos in the past (rmetzger, chesnay, mapohl). This is also not 
ideal. We should push the image to a public repo via a GHA workflow.

Questions to answer here:
# Where shall the Docker image code be located?
# Which Docker registry should be used?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [DISCUSS] Removing documentation on Azure Pipelines for Flink forks

2024-03-14 Thread Matthias Pohl
Good point. I guess marking it as deprecated and pointing to GitHub
Actions as the new approach would be better than removing it
entirely for now.

On Thu, Mar 14, 2024 at 11:47 AM Sergey Nuyanzin 
wrote:

> Hi Matthias,
>
> thanks for driving this
> agree, GHA seems to be working ok
>
> however to be on the safe side what if we mark it for removal or deprecated
> first
> and then remove together with dropping support of 1.17 where GHA is not
> supported IIUC?
>
> On Thu, Mar 14, 2024 at 11:42 AM Matthias Pohl
>  wrote:
>
> > Hi everyone,
> > I'm wondering whether anyone has objections against removing the Azure
> > Pipelines Tutorial to "set up CI for a fork of the Flink repository" in
> the
> > Flink wiki. Flink's GitHub Actions workflow seems to work fine for forks
> > (at least for 1.18+ changes). No need to guide contributors to the
> > flink-mirror repository to create draft PRs. And it's not used that
> often,
> > anyway [2].
> >
> > Best,
> > Matthias
> >
> > [1]
> >
> >
> https://cwiki.apache.org/confluence/display/FLINK/Azure+Pipelines#AzurePipelines-Tutorial:SettingupAzurePipelinesforaforkoftheFlinkrepository
> > [2] https://github.com/flink-ci/flink-mirror/pulls?q=is%3Apr
> >
>
>
> --
> Best regards,
> Sergey
>


[DISCUSS] Removing documentation on Azure Pipelines for Flink forks

2024-03-14 Thread Matthias Pohl
Hi everyone,
I'm wondering whether anyone has objections against removing the Azure
Pipelines tutorial on setting up CI for a fork of the Flink repository in the
Flink wiki [1]. Flink's GitHub Actions workflow seems to work fine for forks
(at least for 1.18+ changes). There's no need to guide contributors to the
flink-mirror repository to create draft PRs. And it's not used that often,
anyway [2].

Best,
Matthias

[1]
https://cwiki.apache.org/confluence/display/FLINK/Azure+Pipelines#AzurePipelines-Tutorial:SettingupAzurePipelinesforaforkoftheFlinkrepository
[2] https://github.com/flink-ci/flink-mirror/pulls?q=is%3Apr


Re: [VOTE] FLIP-402: Extend ZooKeeper Curator configurations

2024-03-14 Thread Matthias Pohl
Nothing to add from my side. Thanks, Alex.

+1 (binding)

On Thu, Mar 7, 2024 at 4:09 PM Alex Nitavsky  wrote:

> Hi everyone,
>
> I'd like to start a vote on FLIP-402 [1]. It introduces new configuration
> options for Apache Flink's ZooKeeper integration for high availability by
> reflecting existing Apache Curator configuration options. It has been
> discussed in this thread [2].
>
> I would like to start a vote.  The vote will be open for at least 72 hours
> (until March 10th 18:00 GMT) unless there is an objection or
> insufficient votes.
>
> [1]
>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-402%3A+Extend+ZooKeeper+Curator+configurations
> [2] https://lists.apache.org/thread/gqgs2jlq6bmg211gqtgdn8q5hp5v9l1z
>
> Thanks
> Alex
>


Re: [VOTE] Release 1.19.0, release candidate #2

2024-03-14 Thread Matthias Pohl
Update on FLINK-34227 [1], which I mentioned above: Chesnay helped identify
a concurrency issue in the JobMaster shutdown logic that seems to have been
in the code for quite some time. I created a PR fixing the issue, hoping that
it also resolves the test instability.

The concurrency issue doesn't really explain why it only started to appear
recently in a specific CI setup (GHA with AdaptiveScheduler). There is no
hint in the git history indicating that it's caused by some newly
introduced change. That is why I wouldn't make FLINK-34227 a reason to
cancel rc2. Instead, the fix can be provided in subsequent patch releases.

Matthias

[1] https://issues.apache.org/jira/browse/FLINK-34227

On Thu, Mar 14, 2024 at 8:49 AM Jane Chan  wrote:

> Hi Yun, Jing, Martijn and Lincoln,
>
> I'm seeking guidance on whether merging the bugfix[1][2] at this stage is
> appropriate. I want to ensure that the actions align with the current
> release process and do not disrupt the ongoing preparations.
>
> [1] https://issues.apache.org/jira/browse/FLINK-29114
> [2] https://github.com/apache/flink/pull/24492
>
> Best,
> Jane
>
> On Thu, Mar 14, 2024 at 1:33 PM Yun Tang  wrote:
>
> > +1 (non-binding)
> >
> >
> >   *
> > Verified the signature and checksum.
> >   *
> > Reviewed the release note PR
> >   *
> > Reviewed the web announcement PR
> >   *
> > Start a standalone cluster to submit the state machine example, which
> > works well.
> >   *
> > Checked the pre-built jars are generated via JDK8
> >   *
> > Verified the process profiler works well after setting
> > rest.profiling.enabled: true
> >
> > Best
> > Yun Tang
> >
> > 
> > From: Qingsheng Ren 
> > Sent: Wednesday, March 13, 2024 12:45
> > To: dev@flink.apache.org 
> > Subject: Re: [VOTE] Release 1.19.0, release candidate #2
> >
> > +1 (binding)
> >
> > - Verified signature and checksum
> > - Verified no binary in source
> > - Built from source
> > - Tested reading and writing Kafka with SQL client and Kafka connector
> > 3.1.0
> > - Verified source code tag
> > - Reviewed release note
> > - Reviewed web PR
> >
> > Thanks to all release managers and contributors for the awesome work!
> >
> > Best,
> > Qingsheng
> >
> > On Wed, Mar 13, 2024 at 1:23 AM Matthias Pohl
> >  wrote:
> >
> > > I want to share an update on FLINK-34227 [1]: It's still not clear
> what's
> > > causing the test instability. So far, we agreed in today's release sync
> > [2]
> > > that it's not considered a blocker because it is observed in 1.18
> nightly
> > > builds and it only appears in the GitHub Actions workflow. But I still
> > have
> > > a bit of a concern that this is something that was introduced in 1.19
> and
> > > backported to 1.18 after the 1.18.1 release (because the test
> instability
> > > started to appear more regularly in March; with one occurrence in
> > January).
> > > Additionally, I have no reason to believe, yet, that the instability is
> > > caused by some GHA-related infrastructure issue.
> > >
> > > So, if someone else has some capacity to help looking into it; that
> would
> > > be appreciated. I will continue my investigation tomorrow.
> > >
> > > Best,
> > > Matthias
> > >
> > > [1] https://issues.apache.org/jira/browse/FLINK-34227
> > > [2]
> > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/1.19+Release#id-1.19Release-03/12/2024
> > >
> > > On Tue, Mar 12, 2024 at 12:50 PM Benchao Li 
> > wrote:
> > >
> > > > +1 (non-binding)
> > > >
> > > > - checked signature and checksum: OK
> > > > - checkout copyright year in notice file: OK
> > > > - diffed source distribution with tag, make sure there is no
> > > > unexpected files: OK
> > > > - build from source : OK
> > > > - start a local cluster, played with jdbc connector: OK
> > > >
> > > > weijie guo  于2024年3月12日周二 16:55写道:
> > > > >
> > > > > +1 (non-binding)
> > > > >
> > > > > - Verified signature and checksum
> > > > > - Verified source distribution does not contains binaries
> > > > > - Build from source code and submit a word-count job successfully
> > > > >
> > > > >
> > > > > Best regards,
> > > > >
> > > > > Weijie
>

Re: [VOTE] Release 1.19.0, release candidate #2

2024-03-12 Thread Matthias Pohl
I want to share an update on FLINK-34227 [1]: It's still not clear what's
causing the test instability. So far, we agreed in today's release sync [2]
that it's not considered a blocker because it is observed in 1.18 nightly
builds and it only appears in the GitHub Actions workflow. But I still have
a bit of a concern that this is something that was introduced in 1.19 and
backported to 1.18 after the 1.18.1 release (because the test instability
started to appear more regularly in March; with one occurrence in January).
Additionally, I have no reason to believe, yet, that the instability is
caused by some GHA-related infrastructure issue.

So, if someone else has some capacity to help looking into it; that would
be appreciated. I will continue my investigation tomorrow.

Best,
Matthias

[1] https://issues.apache.org/jira/browse/FLINK-34227
[2]
https://cwiki.apache.org/confluence/display/FLINK/1.19+Release#id-1.19Release-03/12/2024

On Tue, Mar 12, 2024 at 12:50 PM Benchao Li  wrote:

> +1 (non-binding)
>
> - checked signature and checksum: OK
> - checkout copyright year in notice file: OK
> - diffed source distribution with tag, make sure there is no
> unexpected files: OK
> - build from source : OK
> - start a local cluster, played with jdbc connector: OK
>
> weijie guo  于2024年3月12日周二 16:55写道:
> >
> > +1 (non-binding)
> >
> > - Verified signature and checksum
> > - Verified source distribution does not contains binaries
> > - Build from source code and submit a word-count job successfully
> >
> >
> > Best regards,
> >
> > Weijie
> >
> >
> > Jane Chan  于2024年3月12日周二 16:38写道:
> >
> > > +1 (non-binding)
> > >
> > > - Verify that the source distributions do not contain any binaries;
> > > - Build the source distribution to ensure all source files have Apache
> > > headers;
> > > - Verify checksum and GPG signatures;
> > >
> > > Best,
> > > Jane
> > >
> > > On Tue, Mar 12, 2024 at 4:08 PM Xuannan Su 
> wrote:
> > >
> > > > +1 (non-binding)
> > > >
> > > > - Verified signature and checksum
> > > > - Verified that source distribution does not contain binaries
> > > > - Built from source code successfully
> > > > - Reviewed the release announcement PR
> > > >
> > > > Best regards,
> > > > Xuannan
> > > >
> > > > On Tue, Mar 12, 2024 at 2:18 PM Hang Ruan 
> > > wrote:
> > > > >
> > > > > +1 (non-binding)
> > > > >
> > > > > - Verified signatures and checksums
> > > > > - Verified that source does not contain binaries
> > > > > - Build source code successfully
> > > > > - Reviewed the release note and left a comment
> > > > >
> > > > > Best,
> > > > > Hang
> > > > >
> > > > > Feng Jin  于2024年3月12日周二 11:23写道:
> > > > >
> > > > > > +1 (non-binding)
> > > > > >
> > > > > > - Verified signatures and checksums
> > > > > > - Verified that source does not contain binaries
> > > > > > - Build source code successfully
> > > > > > - Run a simple sql query successfully
> > > > > >
> > > > > > Best,
> > > > > > Feng Jin
> > > > > >
> > > > > >
> > > > > > On Tue, Mar 12, 2024 at 11:09 AM Ron liu 
> wrote:
> > > > > >
> > > > > > > +1 (non binding)
> > > > > > >
> > > > > > > quickly verified:
> > > > > > > - verified that source distribution does not contain binaries
> > > > > > > - verified checksums
> > > > > > > - built source code successfully
> > > > > > >
> > > > > > >
> > > > > > > Best,
> > > > > > > Ron
> > > > > > >
> > > > > > > Jeyhun Karimov  于2024年3月12日周二 01:00写道:
> > > > > > >
> > > > > > > > +1 (non binding)
> > > > > > > >
> > > > > > > > - verified that source distribution does not contain binaries
> > > > > > > > - verified signatures and checksums
> > > > > > > > - built source code successfully
> > > > > > > >
> > > > > > > > Regards,
> > > > > > > > Jeyhun
> > > > > > > >
> > > > > > > >
> > > > > > > > On Mon, Mar 11, 2024 at 3:08 PM Samrat Deb <
> > > decordea...@gmail.com>
> > > > > > > wrote:
> > > > > > > >
> > > > > > > > > +1 (non binding)
> > > > > > > > >
> > > > > > > > > - verified signatures and checksums
> > > > > > > > > - ASF headers are present in all expected file
> > > > > > > > > - No unexpected binaries files found in the source
> > > > > > > > > - Build successful locally
> > > > > > > > > - tested basic word count example
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > Bests,
> > > > > > > > > Samrat
> > > > > > > > >
> > > > > > > > > On Mon, 11 Mar 2024 at 7:33 PM, Ahmed Hamdy <
> > > > hamdy10...@gmail.com>
> > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Hi Lincoln
> > > > > > > > > > +1 (non-binding) from me
> > > > > > > > > >
> > > > > > > > > > - Verified Checksums & Signatures
> > > > > > > > > > - Verified Source dists don't contain binaries
> > > > > > > > > > - Built source successfully
> > > > > > > > > > - reviewed web PR
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Best Regards
> > > > > > > > > > Ahmed Hamdy
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > On Mon, 11 Mar 2024 at 15:18, 

[jira] [Created] (FLINK-34646) AggregateITCase.testDistinctWithRetract timed out

2024-03-11 Thread Matthias Pohl (Jira)
Matthias Pohl created FLINK-34646:
-

 Summary: AggregateITCase.testDistinctWithRetract timed out
 Key: FLINK-34646
 URL: https://issues.apache.org/jira/browse/FLINK-34646
 Project: Flink
  Issue Type: Bug
  Components: Table SQL / Runtime
Affects Versions: 1.18.1
Reporter: Matthias Pohl


https://github.com/apache/flink/actions/runs/8211401561/job/22460442229#step:10:17161
{code}
"main" #1 prio=5 os_prio=0 tid=0x7f70abeb7000 nid=0x4cff3 waiting on 
condition [0x7f70ac3f6000]
   java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for  <0xcd24c690> (a 
java.util.concurrent.CompletableFuture$Signaller)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at 
java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1707)
at 
java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323)
at 
java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1742)
at 
java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1908)
at 
org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.execute(StreamExecutionEnvironment.java:2131)
at 
org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.execute(StreamExecutionEnvironment.java:2099)
at 
org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.execute(StreamExecutionEnvironment.java:2077)
at 
org.apache.flink.streaming.api.scala.StreamExecutionEnvironment.execute(StreamExecutionEnvironment.scala:876)
at 
org.apache.flink.table.planner.runtime.stream.sql.AggregateITCase.testDistinctWithRetract(AggregateITCase.scala:345)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
[...]
{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-34645) StreamArrowPythonGroupWindowAggregateFunctionOperatorTest.testFinishBundleTriggeredByCount fails

2024-03-11 Thread Matthias Pohl (Jira)
Matthias Pohl created FLINK-34645:
-

 Summary: 
StreamArrowPythonGroupWindowAggregateFunctionOperatorTest.testFinishBundleTriggeredByCount
 fails
 Key: FLINK-34645
 URL: https://issues.apache.org/jira/browse/FLINK-34645
 Project: Flink
  Issue Type: Bug
  Components: Table SQL / Runtime
Affects Versions: 1.18.1
Reporter: Matthias Pohl


{code}
Error: 02:27:17 02:27:17.025 [ERROR] Tests run: 3, Failures: 1, Errors: 0, 
Skipped: 0, Time elapsed: 0.658 s <<< FAILURE! - in 
org.apache.flink.table.runtime.operators.python.aggregate.arrow.stream.StreamArrowPythonGroupWindowAggregateFunctionOperatorTest
Error: 02:27:17 02:27:17.025 [ERROR] 
org.apache.flink.table.runtime.operators.python.aggregate.arrow.stream.StreamArrowPythonGroupWindowAggregateFunctionOperatorTest.testFinishBundleTriggeredByCount
  Time elapsed: 0.3 s  <<< FAILURE!
Mar 09 02:27:17 java.lang.AssertionError: 
Mar 09 02:27:17 
Mar 09 02:27:17 Expected size: 8 but was: 6 in:
Mar 09 02:27:17 [Record @ (undef) : 
+I(c1,0,1969-12-31T23:59:55,1970-01-01T00:00:05),
Mar 09 02:27:17 Record @ (undef) : 
+I(c2,3,1969-12-31T23:59:55,1970-01-01T00:00:05),
Mar 09 02:27:17 Record @ (undef) : 
+I(c2,3,1970-01-01T00:00,1970-01-01T00:00:10),
Mar 09 02:27:17 Record @ (undef) : 
+I(c1,0,1970-01-01T00:00,1970-01-01T00:00:10),
Mar 09 02:27:17 Watermark @ 1,
Mar 09 02:27:17 Watermark @ 2]
Mar 09 02:27:17 at 
org.apache.flink.table.runtime.util.RowDataHarnessAssertor.assertOutputEquals(RowDataHarnessAssertor.java:110)
Mar 09 02:27:17 at 
org.apache.flink.table.runtime.util.RowDataHarnessAssertor.assertOutputEquals(RowDataHarnessAssertor.java:70)
Mar 09 02:27:17 at 
org.apache.flink.table.runtime.operators.python.aggregate.arrow.ArrowPythonAggregateFunctionOperatorTestBase.assertOutputEquals(ArrowPythonAggregateFunctionOperatorTestBase.java:62)
Mar 09 02:27:17 at 
org.apache.flink.table.runtime.operators.python.aggregate.arrow.stream.StreamArrowPythonGroupWindowAggregateFunctionOperatorTest.testFinishBundleTriggeredByCount(StreamArrowPythonGroupWindowAggregateFunctionOperatorTest.java:326)
Mar 09 02:27:17 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native 
Method)
[...]
{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-34644) RestServerEndpointITCase.testShouldWaitForHandlersWhenClosing failed with ConnectionClosedException

2024-03-11 Thread Matthias Pohl (Jira)
Matthias Pohl created FLINK-34644:
-

 Summary: 
RestServerEndpointITCase.testShouldWaitForHandlersWhenClosing failed with 
ConnectionClosedException
 Key: FLINK-34644
 URL: https://issues.apache.org/jira/browse/FLINK-34644
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Coordination
Affects Versions: 1.20.0
Reporter: Matthias Pohl


https://github.com/apache/flink/actions/runs/8189958608/job/22396362238#step:10:9215

{code}
Error: 15:13:33 15:13:33.779 [ERROR] Tests run: 68, Failures: 0, Errors: 1, 
Skipped: 4, Time elapsed: 17.81 s <<< FAILURE! -- in 
org.apache.flink.runtime.rest.RestServerEndpointITCase
Error: 15:13:33 15:13:33.779 [ERROR] 
org.apache.flink.runtime.rest.RestServerEndpointITCase.testShouldWaitForHandlersWhenClosing
 -- Time elapsed: 0.329 s <<< ERROR!
Mar 07 15:13:33 java.util.concurrent.ExecutionException: 
org.apache.flink.runtime.rest.ConnectionClosedException: Channel became 
inactive.
Mar 07 15:13:33 at 
java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357)
Mar 07 15:13:33 at 
java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1928)
Mar 07 15:13:33 at 
org.apache.flink.runtime.rest.RestServerEndpointITCase.testShouldWaitForHandlersWhenClosing(RestServerEndpointITCase.java:592)
Mar 07 15:13:33 at java.lang.reflect.Method.invoke(Method.java:498)
Mar 07 15:13:33 at 
java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:183)
Mar 07 15:13:33 at 
java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
Mar 07 15:13:33 at 
java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175)
Mar 07 15:13:33 at 
java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
Mar 07 15:13:33 at 
java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:183)
Mar 07 15:13:33 at 
java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
Mar 07 15:13:33 at 
java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
Mar 07 15:13:33 at 
java.util.Spliterators$ArraySpliterator.forEachRemaining(Spliterators.java:948)
Mar 07 15:13:33 at 
java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
Mar 07 15:13:33 at 
java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
Mar 07 15:13:33 at 
java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:150)
Mar 07 15:13:33 at 
java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:173)
Mar 07 15:13:33 at 
java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
Mar 07 15:13:33 at 
java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:485)
Mar 07 15:13:33 at 
java.util.stream.ReferencePipeline$7$1.accept(ReferencePipeline.java:272)
Mar 07 15:13:33 at 
java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1384)
Mar 07 15:13:33 at 
java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
Mar 07 15:13:33 at 
java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
Mar 07 15:13:33 at 
java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:150)
Mar 07 15:13:33 at 
java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:173)
Mar 07 15:13:33 at 
java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
Mar 07 15:13:33 at 
java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:485)
Mar 07 15:13:33 at 
java.util.concurrent.RecursiveAction.exec(RecursiveAction.java:189)
Mar 07 15:13:33 at 
java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
Mar 07 15:13:33 at 
java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056)
Mar 07 15:13:33 at 
java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692)
Mar 07 15:13:33 at 
java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:175)
Mar 07 15:13:33 Caused by: 
org.apache.flink.runtime.rest.ConnectionClosedException: Channel became 
inactive.
Mar 07 15:13:33 at 
org.apache.flink.runtime.rest.RestClient$ClientHandler.channelInactive(RestClient.java:749)
Mar 07 15:13:33 at 
org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:305)
Mar 07 15:13:33 at 
org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:281)
Mar 07 15:13:33 at 
org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:27

[jira] [Created] (FLINK-34643) JobIDLoggingITCase failed

2024-03-11 Thread Matthias Pohl (Jira)
Matthias Pohl created FLINK-34643:
-

 Summary: JobIDLoggingITCase failed
 Key: FLINK-34643
 URL: https://issues.apache.org/jira/browse/FLINK-34643
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Coordination
Affects Versions: 1.20.0
Reporter: Matthias Pohl


https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=58187&view=logs&j=8fd9202e-fd17-5b26-353c-ac1ff76c8f28&t=ea7cf968-e585-52cb-e0fc-f48de023a7ca&l=7897

{code}
Mar 09 01:24:23 01:24:23.498 [ERROR] Tests run: 1, Failures: 0, Errors: 1, 
Skipped: 0, Time elapsed: 4.209 s <<< FAILURE! -- in 
org.apache.flink.test.misc.JobIDLoggingITCase
Mar 09 01:24:23 01:24:23.498 [ERROR] 
org.apache.flink.test.misc.JobIDLoggingITCase.testJobIDLogging(ClusterClient) 
-- Time elapsed: 1.459 s <<< ERROR!
Mar 09 01:24:23 java.lang.IllegalStateException: Too few log events recorded 
for org.apache.flink.runtime.jobmaster.JobMaster (12) - this must be a bug in 
the test code
Mar 09 01:24:23 at 
org.apache.flink.util.Preconditions.checkState(Preconditions.java:215)
Mar 09 01:24:23 at 
org.apache.flink.test.misc.JobIDLoggingITCase.assertJobIDPresent(JobIDLoggingITCase.java:148)
Mar 09 01:24:23 at 
org.apache.flink.test.misc.JobIDLoggingITCase.testJobIDLogging(JobIDLoggingITCase.java:132)
Mar 09 01:24:23 at java.lang.reflect.Method.invoke(Method.java:498)
Mar 09 01:24:23 at 
java.util.concurrent.RecursiveAction.exec(RecursiveAction.java:189)
Mar 09 01:24:23 at 
java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
Mar 09 01:24:23 at 
java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056)
Mar 09 01:24:23 at 
java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692)
Mar 09 01:24:23 at 
java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:175)
Mar 09 01:24:23 
{code}
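The failing assertion is Flink's {{Preconditions.checkState}}, which formats its message template via {{String.format}}. A minimal self-contained sketch that reproduces the message above (the required event count of 13 is assumed purely for illustration; the real threshold lives in the test):

```java
public class CheckStateDemo {
    // Minimal sketch of the checkState(boolean, template, args...) variant
    // used by the test; Flink's real Preconditions behaves equivalently
    // for %s placeholders.
    static void checkState(boolean condition, String template, Object... args) {
        if (!condition) {
            throw new IllegalStateException(String.format(template, args));
        }
    }

    public static void main(String[] args) {
        int recorded = 12;
        int required = 13; // assumed threshold, for illustration only
        try {
            checkState(recorded >= required,
                    "Too few log events recorded for %s (%s) - this must be a bug in the test code",
                    "org.apache.flink.runtime.jobmaster.JobMaster", recorded);
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Running this prints exactly the message seen in the log above.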



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-34589) FineGrainedSlotManager doesn't handle errors in the resource reconciliation step

2024-03-06 Thread Matthias Pohl (Jira)
Matthias Pohl created FLINK-34589:
-

 Summary: FineGrainedSlotManager doesn't handle errors in the 
resource reconciliation step
 Key: FLINK-34589
 URL: https://issues.apache.org/jira/browse/FLINK-34589
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Coordination
Affects Versions: 1.18.1, 1.19.0, 1.20.0
Reporter: Matthias Pohl


I noticed during my work on FLINK-34427 that the reconciliation is scheduled 
periodically when starting the {{SlotManager}}. But errors in this step are not 
handled. I see two options here:
1. Fail fatally because such an error might indicate a major issue with the RM 
backend.
2. Log the failure and continue the scheduled task even in case of an error.

My understanding is that we're just temporarily unable to recreate 
TaskManagers, which should be a transient issue that could be resolved in the 
backend (YARN, k8s). That's why I would lean towards option 2.

[~xtsong] WDYT?
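Option 2 can be sketched as follows (hypothetical helper names; Flink's actual SlotManager scheduling code differs). One detail worth noting: plain {{ScheduledExecutorService.scheduleAtFixedRate}} suppresses all subsequent executions once one run throws, so the catch has to sit inside the task itself:

```java
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicInteger;

public class ResilientReconciliation {
    // Option 2: log the failure and keep the periodic schedule alive.
    // scheduleAtFixedRate cancels all later runs once a task throws,
    // which is why the error is caught inside the task.
    static ScheduledFuture<?> schedulePeriodic(
            ScheduledExecutorService executor, Runnable reconcile, long periodMillis) {
        return executor.scheduleAtFixedRate(() -> {
            try {
                reconcile.run();
            } catch (Throwable t) {
                // option 1 would instead escalate to fatal error handling here
                System.err.println("Reconciliation failed, retrying next period: " + t.getMessage());
            }
        }, 0, periodMillis, TimeUnit.MILLISECONDS);
    }

    public static void main(String[] args) throws Exception {
        ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor();
        AtomicInteger attempts = new AtomicInteger();
        CountDownLatch recovered = new CountDownLatch(1);
        schedulePeriodic(executor, () -> {
            if (attempts.incrementAndGet() < 3) {
                throw new IllegalStateException("transient backend error"); // e.g. YARN/k8s hiccup
            }
            recovered.countDown(); // backend recovered, reconciliation succeeded
        }, 20);
        boolean ok = recovered.await(10, TimeUnit.SECONDS);
        executor.shutdownNow();
        System.out.println("recovered=" + ok);
    }
}
```

The schedule survives the first two failing attempts and succeeds on the third, which matches the "transient backend issue" assumption behind option 2.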



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-34588) FineGrainedSlotManager checks whether resources need to reconcile but doesn't act on the result

2024-03-06 Thread Matthias Pohl (Jira)
Matthias Pohl created FLINK-34588:
-

 Summary: FineGrainedSlotManager checks whether resources need to 
reconcile but doesn't act on the result
 Key: FLINK-34588
 URL: https://issues.apache.org/jira/browse/FLINK-34588
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Coordination
Affects Versions: 1.18.1, 1.19.0, 1.20.0
Reporter: Matthias Pohl


There are a few locations in {{FineGrainedSlotManager}} where we check whether 
resources can/need to be reconciled but don't care about the result and just 
trigger the resource update (e.g. in 
[FineGrainedSlotManager:620|https://github.com/apache/flink/blob/c0d3e495f4c2316a80f251de77b05b943b5be1f8/flink-runtime/src/main/java/org/apache/flink/runtime/resourcemanager/slotmanager/FineGrainedSlotManager.java#L620]
 and 
[FineGrainedSlotManager:676|https://github.com/apache/flink/blob/c0d3e495f4c2316a80f251de77b05b943b5be1f8/flink-runtime/src/main/java/org/apache/flink/runtime/resourcemanager/slotmanager/FineGrainedSlotManager.java#L676]).
 Looks like we could reduce the calls to the backend here.
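A minimal sketch of the suggested fix, with hypothetical method names standing in for the FineGrainedSlotManager internals: gate the backend call on the result of the check instead of triggering the update unconditionally.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class GuardedReconcile {
    static final AtomicInteger backendCalls = new AtomicInteger();

    // Hypothetical stand-ins for the slot manager's check and update steps.
    static boolean resourcesNeedReconcile(int declared, int actual) {
        return declared != actual;
    }

    static void triggerResourceUpdate() {
        backendCalls.incrementAndGet(); // models a call to the RM backend
    }

    static void reconcile(int declared, int actual) {
        // Use the check result to skip redundant backend calls.
        if (resourcesNeedReconcile(declared, actual)) {
            triggerResourceUpdate();
        }
    }

    public static void main(String[] args) {
        reconcile(4, 4); // already in sync: no backend call
        reconcile(4, 2); // drifted: exactly one backend call
        System.out.println("backendCalls=" + backendCalls.get());
    }
}
```

With the guard in place, only the drifted case reaches the backend, which is the reduction in calls the ticket suggests.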



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [DISCUSS] Add "Special Thanks" Page on the Flink Website

2024-03-06 Thread Matthias Pohl
Thanks for starting this discussion. I see the value of such a page if we
want to encourage companies to sponsor CI infrastructure in case we need
this infrastructure (as Yun Tang pointed out). The question is, though: Do
we need more VMs? The number of commits to master has been constantly decreasing
since its peak in 2019/2020 [1]. Did we observe a shortage of CI runners in
the past years? What do we do if we have enough VMs? Do we still allow
companies to add more VMs to the pool even though it's not adding any
value? Then it becomes a marketing tool for companies. The community lacks
the openly accessible tools to monitor the VM usage independently as far as
I know (the Azure Pipelines project is owned by Ververica right now). My
concern is (which goes towards what Max is saying) that this can be a
source of friction in the community (even if it's not about individuals but
companies). I'm not sure whether the need for additional infrastructure
outweighs the risk of friction.

On another note: After monitoring the GitHub Action workflows (FLIP-396
[2]) for the past weeks, I figured that there could be a chance for us to
rely on Apache-provided infrastructure entirely with our current workload
when switching over from Azure Pipelines. But that might be a premature
judgement because the monitoring started after the feature freeze of Flink
1.19. We should wait with a final conclusion till the end of the 1.20
release cycle. Apache Infra has increased the number of VMs they are offering
since 2018 (when the Apache Flink community decided to go for Azure
Pipelines and custom VMs as far as I know). That's based on a conversation
I had with the Apache Infra folks at one of their roundtable meetings [3].
This and the fact that the number of commits has been decreasing in recent years
[1] (which correlates with the number of CI runs) could be indications that
additional VMs are not necessary (and with that, the need to have a Thank
You page as well).

But I acknowledge that Alibaba and Ververica would like to be recognized
for their financial contributions to the community in the past. Therefore,
I am fine with creating a Thank You page to acknowledge the financial
contributions from Alibaba and Ververica in the past (since Apache allows
historical donations) considering that the contributions of the two
companies go way back in time and are quite significant in my opinion. I
suggest focusing on the past for now because of the option to migrate to
Apache infrastructure midterm.

Best,
Matthias

[1] https://github.com/apache/flink/graphs/contributors
[2]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-396%3A+Trial+to+test+GitHub+Actions+as+an+alternative+for+Flink%27s+current+Azure+CI+infrastructure
[3]
https://cwiki.apache.org/confluence/display/INFRA/Infra+Roundtable+2023-12-06%2C+17%3A00+UTC

On Wed, Mar 6, 2024 at 7:06 AM tison  wrote:

> > a rare way different than
> > individuals (few individuals can donate such resources)
>
> Theoretically, if an individual donates so, we would list him/her as well.
>
> I've seen such donations in The Perl Foundation like [1]. But since a
> PMC doesn't have a fundraising office, we may not accept raw money
> anyway; it's already out of the thread :D
>
> Best,
> tison.
>
> [1] https://news.perlfoundation.org/post/announcement_of_the_ian_hague
>
> Yun Tang  于2024年3月6日周三 13:58写道:
> >
> > Thanks for Jark's proposal, and I'm +1 for adding such a page.
> >
> > The CI infrastructure helps the Apache Flink project to run well. I
> cannot imagine how insufficient CI machines would impact the development
> progress, especially when the feature freeze date is close. And I believe
> that most guys who contributed to the community would not know Alibaba and
> Ververica had ever donated several machines to make the community work
> smoothly for years.
> >
> >
> > Best
> > Yun Tang
> > 
> > From: Jark Wu 
> > Sent: Wednesday, March 6, 2024 11:35
> > To: dev@flink.apache.org 
> > Subject: Re: [DISCUSS] Add "Special Thanks" Page on the Flink Website
> >
> > Hi Max,
> >
> > Thank you for your input.
> >
> > According to ASF policy[1], the Thank Page is intended to thank third
> > parties
> > that provide physical resources like machines, services, and software
> that
> > the committers
> >  or the project truly needs. I agree with Tison, such donation is
> countable
> > and that's why
> > I started this discussion to collect the full list. The thank Page is not
> > intended to thank working
> > hours or contributions from individual volunteers which I think
> > is recognized in other ways
> > (e.g., credit of committer and PMC member).
> >
> > Best,
> > Jark
> >
> > [1]: https://www.apache.org/foundation/marks/linking#projectthanks
> >
> > On Wed, 6 Mar 2024 at 01:14, tison  wrote:
> >
> > > Hi Max,
> > >
> > > Thanks for sharing your concerns :D
> > >
> > > I'd elaborate a bit on this topic with an example, that Apache Airflow
> > > has a small section for its 

[jira] [Created] (FLINK-34571) SortMergeResultPartitionReadSchedulerTest.testOnReadBufferRequestError failed due to an assertion

2024-03-03 Thread Matthias Pohl (Jira)
Matthias Pohl created FLINK-34571:
-

 Summary: 
SortMergeResultPartitionReadSchedulerTest.testOnReadBufferRequestError failed 
due to an assertion
 Key: FLINK-34571
 URL: https://issues.apache.org/jira/browse/FLINK-34571
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Network
Affects Versions: 1.19.0, 1.20.0
Reporter: Matthias Pohl


https://github.com/apache/flink/actions/runs/8134965216/job/8875618#step:10:8586
{code}
Error: 02:39:36 02:39:36.688 [ERROR] Tests run: 9, Failures: 1, Errors: 0, 
Skipped: 0, Time elapsed: 13.68 s <<< FAILURE! -- in 
org.apache.flink.runtime.io.network.partition.SortMergeResultPartitionReadSchedulerTest
Error: 02:39:36 02:39:36.689 [ERROR] 
org.apache.flink.runtime.io.network.partition.SortMergeResultPartitionReadSchedulerTest.testOnReadBufferRequestError
 -- Time elapsed: 0.174 s <<< FAILURE!
Mar 04 02:39:36 org.opentest4j.AssertionFailedError: 
Mar 04 02:39:36 
Mar 04 02:39:36 Expecting value to be true but was false
Mar 04 02:39:36 at 
sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
Mar 04 02:39:36 at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
Mar 04 02:39:36 at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
Mar 04 02:39:36 at 
org.apache.flink.runtime.io.network.partition.SortMergeResultPartitionReadSchedulerTest.testOnReadBufferRequestError(SortMergeResultPartitionReadSchedulerTest.java:225)
Mar 04 02:39:36 at java.lang.reflect.Method.invoke(Method.java:498)
Mar 04 02:39:36 at 
java.util.concurrent.RecursiveAction.exec(RecursiveAction.java:189)
Mar 04 02:39:36 at 
java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
Mar 04 02:39:36 at 
java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056)
Mar 04 02:39:36 at 
java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692)
Mar 04 02:39:36 at 
java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:175)
{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-34570) JoinITCase.testLeftJoinWithEqualPk times out

2024-03-03 Thread Matthias Pohl (Jira)
Matthias Pohl created FLINK-34570:
-

 Summary: JoinITCase.testLeftJoinWithEqualPk times out
 Key: FLINK-34570
 URL: https://issues.apache.org/jira/browse/FLINK-34570
 Project: Flink
  Issue Type: Bug
  Components: Table SQL / Planner
Affects Versions: 1.18.1
Reporter: Matthias Pohl


https://github.com/apache/flink/actions/runs/8127069912/job/22211928085#step:10:14479

{code}
"main" #1 prio=5 os_prio=0 tid=0x7ff4ae2b7000 nid=0x2168b waiting on 
condition [0x7ff4affdc000]
   java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for  <0xab096950> (a 
java.util.concurrent.CompletableFuture$Signaller)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at 
java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1707)
at 
java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323)
at 
java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1742)
at 
java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1908)
at 
org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.execute(StreamExecutionEnvironment.java:2131)
at 
org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.execute(StreamExecutionEnvironment.java:2099)
at 
org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.execute(StreamExecutionEnvironment.java:2077)
at 
org.apache.flink.streaming.api.scala.StreamExecutionEnvironment.execute(StreamExecutionEnvironment.scala:876)
at 
org.apache.flink.table.planner.runtime.stream.sql.JoinITCase.testLeftJoinWithEqualPk(JoinITCase.scala:705)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
[...]
{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-34569) 'Streaming File Sink s3 end-to-end test' failed

2024-03-03 Thread Matthias Pohl (Jira)
Matthias Pohl created FLINK-34569:
-

 Summary: 'Streaming File Sink s3 end-to-end test' failed
 Key: FLINK-34569
 URL: https://issues.apache.org/jira/browse/FLINK-34569
 Project: Flink
  Issue Type: Bug
  Components: Tests
Affects Versions: 1.19.0
Reporter: Matthias Pohl


https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=58026&view=logs&j=af184cdd-c6d8-5084-0b69-7e9c67b35f7a&t=0f3adb59-eefa-51c6-2858-3654d9e0749d&l=3957

{code}
Mar 02 04:12:57 Waiting until all values have been produced
Unable to find image 'stedolan/jq:latest' locally
Error: No such container: 
docker: Error response from daemon: Get "https://registry-1.docker.io/v2/": 
read tcp 10.1.0.97:42214->54.236.113.205:443: read: connection reset by peer.
See 'docker run --help'.
Mar 02 04:12:58 Number of produced values 0/6
Error: No such container: 
Unable to find image 'stedolan/jq:latest' locally
latest: Pulling from stedolan/jq
[DEPRECATION NOTICE] Docker Image Format v1, and Docker Image manifest version 
2, schema 1 support will be removed in an upcoming release. Suggest the author 
of docker.io/stedolan/jq:latest to upgrade the image to the OCI Format, or 
Docker Image manifest v2, schema 2. More information at 
https://docs.docker.com/go/deprecated-image-specs/
237d5fcd25cf: Pulling fs layer
[...]
4dae4fd48813: Pull complete
Digest: sha256:a61ed0bca213081b64be94c5e1b402ea58bc549f457c2682a86704dd55231e09
Status: Downloaded newer image for stedolan/jq:latest
parse error: Invalid numeric literal at line 1, column 6
Error: No such container: 
parse error: Invalid numeric literal at line 1, column 6
Error: No such container: 
parse error: Invalid numeric literal at line 1, column 6
[...]
{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-34568) YarnFileStageTest.destroyHDFS timed out

2024-03-03 Thread Matthias Pohl (Jira)
Matthias Pohl created FLINK-34568:
-

 Summary: YarnFileStageTest.destroyHDFS timed out
 Key: FLINK-34568
 URL: https://issues.apache.org/jira/browse/FLINK-34568
 Project: Flink
  Issue Type: Bug
  Components: Connectors / Hadoop Compatibility
Affects Versions: 1.17.2
Reporter: Matthias Pohl


https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=58024&view=logs&j=5cae8624-c7eb-5c51-92d3-4d2dacedd221&t=5acec1b4-945b-59ca-34f8-168928ce5199&l=26698

{code}
Mar 02 07:28:56 "Listener at localhost/33933" #25 daemon prio=5 os_prio=0 
tid=0x7f08490be000 nid=0x12cae runnable [0x7f082ebfc000]
Mar 02 07:28:56java.lang.Thread.State: RUNNABLE
Mar 02 07:28:56 at 
org.mortbay.io.nio.SelectorManager$SelectSet.stop(SelectorManager.java:879)
Mar 02 07:28:56 - locked <0xd7ae0030> (a 
org.mortbay.io.nio.SelectorManager$SelectSet)
[...]
Mar 02 07:28:56 at 
org.apache.hadoop.hdfs.MiniDFSCluster.stopAndJoinNameNode(MiniDFSCluster.java:2123)
Mar 02 07:28:56 at 
org.apache.hadoop.hdfs.MiniDFSCluster.shutdown(MiniDFSCluster.java:2060)
Mar 02 07:28:56 at 
org.apache.hadoop.hdfs.MiniDFSCluster.shutdown(MiniDFSCluster.java:2031)
Mar 02 07:28:56 at 
org.apache.hadoop.hdfs.MiniDFSCluster.shutdown(MiniDFSCluster.java:2024)
Mar 02 07:28:56 at 
org.apache.flink.yarn.YarnFileStageTest.destroyHDFS(YarnFileStageTest.java:90)
[...]
{code}

Looks like an HDFS issue during shutdown? This will most likely also affect 
newer versions because not much has changed in the YARN space since 1.17 
(Hadoop was bumped in 1.17 itself; FLINK-29710).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-34560) JoinITCase seems to fail on a broader scale (MiniCluster issue?)

2024-03-01 Thread Matthias Pohl (Jira)
Matthias Pohl created FLINK-34560:
-

 Summary: JoinITCase seems to fail on a broader scale (MiniCluster 
issue?)
 Key: FLINK-34560
 URL: https://issues.apache.org/jira/browse/FLINK-34560
 Project: Flink
  Issue Type: Bug
  Components: Tests
Affects Versions: 1.18.1
Reporter: Matthias Pohl


https://github.com/apache/flink/actions/runs/8105495458/job/22154140154#step:10:11906

The actual cause still needs to be investigated.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-34551) Align retry mechanisms of FutureUtils

2024-02-29 Thread Matthias Pohl (Jira)
Matthias Pohl created FLINK-34551:
-

 Summary: Align retry mechanisms of FutureUtils
 Key: FLINK-34551
 URL: https://issues.apache.org/jira/browse/FLINK-34551
 Project: Flink
  Issue Type: Technical Debt
  Components: API / Core
Affects Versions: 1.20.0
Reporter: Matthias Pohl


The retry mechanisms of FutureUtils include quite a bit of redundant code, 
which makes them hard to understand and to extend. The logic should be aligned 
properly.
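One way to align the mechanisms is to have a single generic combinator that the specialized variants (fixed retry count, retry-on-predicate, retry-with-delay) all delegate to. A minimal sketch under that assumption (hypothetical API, not Flink's actual FutureUtils):

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Predicate;
import java.util.function.Supplier;

public class RetryFutures {
    /** Retries op until it succeeds, retries are exhausted, or the error is not retryable. */
    static <T> CompletableFuture<T> retry(
            Supplier<CompletableFuture<T>> op,
            int maxRetries,
            Predicate<Throwable> retryable) {
        CompletableFuture<T> result = new CompletableFuture<>();
        run(op, maxRetries, retryable, result);
        return result;
    }

    private static <T> void run(Supplier<CompletableFuture<T>> op, int retriesLeft,
                                Predicate<Throwable> retryable, CompletableFuture<T> result) {
        op.get().whenComplete((value, error) -> {
            if (error == null) {
                result.complete(value);
            } else if (retriesLeft > 0 && retryable.test(error)) {
                run(op, retriesLeft - 1, retryable, result); // one shared retry loop
            } else {
                result.completeExceptionally(error);
            }
        });
    }

    public static void main(String[] args) throws Exception {
        AtomicInteger calls = new AtomicInteger();
        Supplier<CompletableFuture<String>> op = () -> {
            if (calls.incrementAndGet() < 3) {
                CompletableFuture<String> f = new CompletableFuture<>();
                f.completeExceptionally(new RuntimeException("transient backend error"));
                return f;
            }
            return CompletableFuture.completedFuture("ok");
        };
        String v = retry(op, 5, t -> true).get();
        System.out.println(v + " after " + calls.get() + " calls");
    }
}
```

Delaying variants would only need to wrap the recursive call in a scheduler instead of duplicating the whole loop.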



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-34527) Deprecate Time classes also in PyFlink

2024-02-27 Thread Matthias Pohl (Jira)
Matthias Pohl created FLINK-34527:
-

 Summary: Deprecate Time classes also in PyFlink
 Key: FLINK-34527
 URL: https://issues.apache.org/jira/browse/FLINK-34527
 Project: Flink
  Issue Type: Bug
  Components: API / Python
Affects Versions: 1.20.0
Reporter: Matthias Pohl


FLINK-32570 deprecated the Time classes, but we missed the PyFlink-related 
APIs.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-34514) e2e (1) times out because of an error that's most likely caused by a networking issue

2024-02-25 Thread Matthias Pohl (Jira)
Matthias Pohl created FLINK-34514:
-

 Summary: e2e (1) times out because of an error that's most likely 
caused by a networking issue
 Key: FLINK-34514
 URL: https://issues.apache.org/jira/browse/FLINK-34514
 Project: Flink
  Issue Type: Bug
  Components: Test Infrastructure
Affects Versions: 1.20.0
Reporter: Matthias Pohl


https://github.com/apache/flink/actions/runs/8027473891/job/21931649433

{code}
Sat, 24 Feb 2024 03:35:54 GMT
ERROR: failed to solve: process "/bin/sh -c set -ex;   wget -nv -O 
/usr/local/bin/gosu 
\"https://github.com/tianon/gosu/releases/download/$GOSU_VERSION/gosu-$(dpkg 
--print-architecture)\";   wget -nv -O /usr/local/bin/gosu.asc 
\"https://github.com/tianon/gosu/releases/download/$GOSU_VERSION/gosu-$(dpkg 
--print-architecture).asc\";   export GNUPGHOME=\"$(mktemp -d)\";   for server 
in ha.pool.sks-keyservers.net $(shuf -e   
hkp://p80.pool.sks-keyservers.net:80   
keyserver.ubuntu.com   hkp://keyserver.ubuntu.com:80
   pgp.mit.edu) ; do   gpg --batch --keyserver 
\"$server\" --recv-keys B42F6819007F00F88E364FD4036A9C25BF357DD4 && break || : 
;   done &&   gpg --batch --verify /usr/local/bin/gosu.asc /usr/local/bin/gosu; 
  gpgconf --kill all;   rm -rf \"$GNUPGHOME\" /usr/local/bin/gosu.asc;   chmod 
+x /usr/local/bin/gosu;   gosu nobody true" did not complete successfully: exit 
code: 4
Sat, 24 Feb 2024 07:10:28 GMT
==
Sat, 24 Feb 2024 07:10:28 GMT
=== WARNING: This task took already 95% of the available time budget of 299 
minutes ===
Sat, 24 Feb 2024 07:10:28 GMT
==
{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-34513) GroupAggregateRestoreTest.testRestore fails

2024-02-25 Thread Matthias Pohl (Jira)
Matthias Pohl created FLINK-34513:
-

 Summary: GroupAggregateRestoreTest.testRestore fails
 Key: FLINK-34513
 URL: https://issues.apache.org/jira/browse/FLINK-34513
 Project: Flink
  Issue Type: Bug
  Components: Table SQL / Planner
Affects Versions: 1.20.0
Reporter: Matthias Pohl


https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=57828&view=logs&j=26b84117-e436-5720-913e-3e280ce55cae&t=77cc7e77-39a0-5007-6d65-4137ac13a471&l=10881

{code}
Feb 24 01:12:01 01:12:01.384 [ERROR] Tests run: 10, Failures: 1, Errors: 0, 
Skipped: 1, Time elapsed: 2.957 s <<< FAILURE! -- in 
org.apache.flink.table.planner.plan.nodes.exec.stream.GroupAggregateRestoreTest
Feb 24 01:12:01 01:12:01.384 [ERROR] 
org.apache.flink.table.planner.plan.nodes.exec.stream.GroupAggregateRestoreTest.testRestore(TableTestProgram,
 ExecNodeMetadata)[4] -- Time elapsed: 0.653 s <<< FAILURE!
Feb 24 01:12:01 java.lang.AssertionError: 
Feb 24 01:12:01 
Feb 24 01:12:01 Expecting actual:
Feb 24 01:12:01   ["+I[3, 1, 2, 8, 31, 10.0, 3]",
Feb 24 01:12:01 "+I[2, 1, 4, 14, 42, 7.0, 6]",
Feb 24 01:12:01 "+I[1, 1, 4, 12, 24, 6.0, 4]",
Feb 24 01:12:01 "+U[2, 1, 4, 14, 57, 8.0, 7]",
Feb 24 01:12:01 "+U[1, 1, 4, 12, 32, 6.0, 5]",
Feb 24 01:12:01 "+I[7, 0, 1, 7, 7, 7.0, 1]",
Feb 24 01:12:01 "+U[2, 1, 4, 14, 57, 7.0, 7]",
Feb 24 01:12:01 "+U[1, 1, 4, 12, 32, 5.0, 5]",
Feb 24 01:12:01 "+U[3, 1, 2, 8, 31, 9.0, 3]",
Feb 24 01:12:01 "+U[7, 0, 1, 7, 7, 7.0, 2]"]
Feb 24 01:12:01 to contain exactly in any order:
Feb 24 01:12:01   ["+I[3, 1, 2, 8, 31, 10.0, 3]",
Feb 24 01:12:01 "+I[2, 1, 4, 14, 42, 7.0, 6]",
Feb 24 01:12:01 "+I[1, 1, 4, 12, 24, 6.0, 4]",
Feb 24 01:12:01 "+U[2, 1, 4, 14, 57, 8.0, 7]",
Feb 24 01:12:01 "+U[1, 1, 4, 12, 32, 6.0, 5]",
Feb 24 01:12:01 "+U[3, 1, 2, 8, 31, 9.0, 3]",
Feb 24 01:12:01 "+U[2, 1, 4, 14, 57, 7.0, 7]",
Feb 24 01:12:01 "+I[7, 0, 1, 7, 7, 7.0, 2]",
Feb 24 01:12:01 "+U[1, 1, 4, 12, 32, 5.0, 5]"]
Feb 24 01:12:01 elements not found:
Feb 24 01:12:01   ["+I[7, 0, 1, 7, 7, 7.0, 2]"]
Feb 24 01:12:01 and elements not expected:
Feb 24 01:12:01   ["+I[7, 0, 1, 7, 7, 7.0, 1]", "+U[7, 0, 1, 7, 7, 7.0, 2]"]
Feb 24 01:12:01 
Feb 24 01:12:01 at 
org.apache.flink.table.planner.plan.nodes.exec.testutils.RestoreTestBase.testRestore(RestoreTestBase.java:313)
Feb 24 01:12:01 at 
java.base/java.lang.reflect.Method.invoke(Method.java:580)
[...]
{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-34508) Migrate S3-related ITCases and e2e tests to Minio

2024-02-23 Thread Matthias Pohl (Jira)
Matthias Pohl created FLINK-34508:
-

 Summary: Migrate S3-related ITCases and e2e tests to Minio 
 Key: FLINK-34508
 URL: https://issues.apache.org/jira/browse/FLINK-34508
 Project: Flink
  Issue Type: Sub-task
  Components: Build System / CI
Affects Versions: 1.18.1, 1.19.0, 1.20.0
Reporter: Matthias Pohl


Anything that uses {{org.apache.flink.testutils.s3.S3TestCredentials}}.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-34495) Resuming Savepoint (rocks, scale up, heap timers) end-to-end test failure

2024-02-21 Thread Matthias Pohl (Jira)
Matthias Pohl created FLINK-34495:
-

 Summary: Resuming Savepoint (rocks, scale up, heap timers) 
end-to-end test failure
 Key: FLINK-34495
 URL: https://issues.apache.org/jira/browse/FLINK-34495
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Coordination
Affects Versions: 1.20.0
Reporter: Matthias Pohl


https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=57760&view=logs&j=e9d3d34f-3d15-59f4-0e3e-35067d100dfe&t=5d91035e-8022-55f2-2d4f-ab121508bf7e&l=2010

I guess the failure occurred due to a checkpoint failure:
{code}
Feb 22 00:49:16 2024-02-22 00:49:04,305 WARN  
org.apache.flink.runtime.checkpoint.CheckpointFailureManager [] - Failed to 
trigger or complete checkpoint 12 for job 3c9ffc670ead2cb3c4118410cbef3b72. (0 
consecutive failed attempts so far)
Feb 22 00:49:16 org.apache.flink.runtime.checkpoint.CheckpointException: 
Checkpoint Coordinator is suspending.
Feb 22 00:49:16 at 
org.apache.flink.runtime.checkpoint.CheckpointCoordinator.stopCheckpointScheduler(CheckpointCoordinator.java:2056)
 ~[flink-dist-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
Feb 22 00:49:16 at 
org.apache.flink.runtime.scheduler.SchedulerBase.stopCheckpointScheduler(SchedulerBase.java:960)
 ~[flink-dist-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
Feb 22 00:49:16 at 
org.apache.flink.runtime.scheduler.SchedulerBase.stopWithSavepoint(SchedulerBase.java:1030)
 ~[flink-dist-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
Feb 22 00:49:16 at 
org.apache.flink.runtime.jobmaster.JobMaster.stopWithSavepoint(JobMaster.java:901)
 ~[flink-dist-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
Feb 22 00:49:16 at 
jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:?]
Feb 22 00:49:16 at 
jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
 ~[?:?]
Feb 22 00:49:16 at 
jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 ~[?:?]
Feb 22 00:49:16 at java.lang.reflect.Method.invoke(Method.java:566) 
~[?:?]
Feb 22 00:49:16 at 
org.apache.flink.runtime.rpc.pekko.PekkoRpcActor.lambda$handleRpcInvocation$1(PekkoRpcActor.java:309)
 ~[flink-rpc-akkad6c8f388-439d-487d-ab4d-9a34a56cbc0d.jar:1.20-SNAPSHOT]
Feb 22 00:49:16 at 
org.apache.flink.runtime.concurrent.ClassLoadingUtils.runWithContextClassLoader(ClassLoadingUtils.java:83)
 ~[flink-dist-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
Feb 22 00:49:16 at 
org.apache.flink.runtime.rpc.pekko.PekkoRpcActor.handleRpcInvocation(PekkoRpcActor.java:307)
 ~[flink-rpc-akkad6c8f388-439d-487d-ab4d-9a34a56cbc0d.jar:1.20-SNAPSHOT]
Feb 22 00:49:16 at 
org.apache.flink.runtime.rpc.pekko.PekkoRpcActor.handleRpcMessage(PekkoRpcActor.java:222)
 ~[flink-rpc-akkad6c8f388-439d-487d-ab4d-9a34a56cbc0d.jar:1.20-SNAPSHOT]
Feb 22 00:49:16 at 
org.apache.flink.runtime.rpc.pekko.FencedPekkoRpcActor.handleRpcMessage(FencedPekkoRpcActor.java:85)
 ~[flink-rpc-akkad6c8f388-439d-487d-ab4d-9a34a56cbc0d.jar:1.20-SNAPSHOT]
Feb 22 00:49:16 at 
org.apache.flink.runtime.rpc.pekko.PekkoRpcActor.handleMessage(PekkoRpcActor.java:168)
 ~[flink-rpc-akkad6c8f388-439d-487d-ab4d-9a34a56cbc0d.jar:1.20-SNAPSHOT]
Feb 22 00:49:16 at 
org.apache.pekko.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:33) 
[flink-rpc-akkad6c8f388-439d-487d-ab4d-9a34a56cbc0d.jar:1.20-SNAPSHOT]
Feb 22 00:49:16 at 
org.apache.pekko.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:29) 
[flink-rpc-akkad6c8f388-439d-487d-ab4d-9a34a56cbc0d.jar:1.20-SNAPSHOT]
Feb 22 00:49:16 at 
scala.PartialFunction.applyOrElse(PartialFunction.scala:127) 
[flink-rpc-akkad6c8f388-439d-487d-ab4d-9a34a56cbc0d.jar:1.20-SNAPSHOT]
Feb 22 00:49:16 at 
scala.PartialFunction.applyOrElse$(PartialFunction.scala:126) 
[flink-rpc-akkad6c8f388-439d-487d-ab4d-9a34a56cbc0d.jar:1.20-SNAPSHOT]
Feb 22 00:49:16 at 
org.apache.pekko.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:29) 
[flink-rpc-akkad6c8f388-439d-487d-ab4d-9a34a56cbc0d.jar:1.20-SNAPSHOT]
Feb 22 00:49:16 at 
scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:175) 
[flink-rpc-akkad6c8f388-439d-487d-ab4d-9a34a56cbc0d.jar:1.20-SNAPSHOT]
Feb 22 00:49:16 at 
scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:176) 
[flink-rpc-akkad6c8f388-439d-487d-ab4d-9a34a56cbc0d.jar:1.20-SNAPSHOT]
Feb 22 00:49:16 at 
scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:176) 
[flink-rpc-akkad6c8f388-439d-487d-ab4d-9a34a56cbc0d.jar:1.20-SNAPSHOT]
Feb 22 00:49:16 at 
org.apache.pekko.actor.Actor.aroundReceive(Actor.scala:547) 
[flink-rpc-akkad6c8f388-439d-487d-ab4d-9a34a56cbc0d.jar:1.20-SNAPSHOT]
Feb 22 00:49:16 at 
org.apache.pekko.actor.Actor.aroundReceive$(Actor.scala:545) 
[flink-rpc-akkad6c8f388-439d-487d-ab4d-9a34a56cbc0d.jar:1.20

[jira] [Created] (FLINK-34489) New File Sink end-to-end test timed out

2024-02-21 Thread Matthias Pohl (Jira)
Matthias Pohl created FLINK-34489:
-

 Summary: New File Sink end-to-end test timed out
 Key: FLINK-34489
 URL: https://issues.apache.org/jira/browse/FLINK-34489
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Coordination
Affects Versions: 1.19.0, 1.20.0
Reporter: Matthias Pohl


https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=57707&view=logs&j=af184cdd-c6d8-5084-0b69-7e9c67b35f7a&t=0f3adb59-eefa-51c6-2858-3654d9e0749d&l=3726

{code}
Feb 21 07:26:03 Number of produced values 10770/6
Feb 21 07:39:50 Test (pid: 151375) did not finish after 900 seconds.
Feb 21 07:39:50 Printing Flink logs and killing it:
[...]
{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-34488) Integrate snapshot deployment into GHA nightly workflow

2024-02-21 Thread Matthias Pohl (Jira)
Matthias Pohl created FLINK-34488:
-

 Summary: Integrate snapshot deployment into GHA nightly workflow
 Key: FLINK-34488
 URL: https://issues.apache.org/jira/browse/FLINK-34488
 Project: Flink
  Issue Type: Sub-task
  Components: Build System / CI
Affects Versions: 1.18.1, 1.19.0, 1.20.0
Reporter: Matthias Pohl


Analogously to the [Azure Pipelines nightly 
config|https://github.com/apache/flink/blob/e923d4060b6dabe650a8950774d176d3e92437c2/tools/azure-pipelines/build-apache-repo.yml#L103]
 we want to deploy the snapshot artifacts in the GHA nightly workflow as well.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-34487) Integrate tools/azure-pipelines/build-python-wheels.yml into GHA nightly workflow

2024-02-21 Thread Matthias Pohl (Jira)
Matthias Pohl created FLINK-34487:
-

 Summary: Integrate tools/azure-pipelines/build-python-wheels.yml 
into GHA nightly workflow
 Key: FLINK-34487
 URL: https://issues.apache.org/jira/browse/FLINK-34487
 Project: Flink
  Issue Type: Sub-task
  Components: Build System / CI
Affects Versions: 1.18.1, 1.19.0, 1.20.0
Reporter: Matthias Pohl


Analogously to the [Azure Pipelines nightly 
config|https://github.com/apache/flink/blob/e923d4060b6dabe650a8950774d176d3e92437c2/tools/azure-pipelines/build-apache-repo.yml#L183]
 we want to generate the wheels artifacts in the GHA nightly workflow as well.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-34486) Add documentation on how to add the shared utils as a submodule to the connector repo

2024-02-21 Thread Matthias Pohl (Jira)
Matthias Pohl created FLINK-34486:
-

 Summary: Add documentation on how to add the shared utils as a 
submodule to the connector repo
 Key: FLINK-34486
 URL: https://issues.apache.org/jira/browse/FLINK-34486
 Project: Flink
  Issue Type: Improvement
  Components: Connectors / Common
Affects Versions: connector-parent-1.1.0
Reporter: Matthias Pohl


[apache/flink-connector-shared-utils:README.md|https://github.com/apache/flink-connector-shared-utils/blob/release_utils/README.md]
 doesn't state how the shared utils should be added as a submodule to a 
connector repository. But this is expected by the [connector release 
documentation|https://cwiki.apache.org/confluence/display/FLINK/Creating+a+flink-connector+release#Creatingaflinkconnectorrelease-Buildareleasecandidate]:
{quote}
The following sections assume that the release_utils branch from 
flink-connector-shared-utils is mounted as a git submodule under 
tools/releasing/shared. If it hasn't been mounted in the connector repository 
yet, you need to add flink-connector-shared-utils as a submodule under 
tools/releasing/shared. You can update the submodule by running git submodule 
update --remote (or git submodule update --init --recursive if the submodule 
wasn't initialized yet) to use the latest release utils. See the README for 
details.
{quote}

Let's update the README accordingly and add a link to the {{README}} in the 
connector release documentation.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-34475) ZooKeeperLeaderElectionDriverTest failed with exit code 2

2024-02-20 Thread Matthias Pohl (Jira)
Matthias Pohl created FLINK-34475:
-

 Summary: ZooKeeperLeaderElectionDriverTest failed with exit code 2
 Key: FLINK-34475
 URL: https://issues.apache.org/jira/browse/FLINK-34475
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Coordination
Affects Versions: 1.18.1
Reporter: Matthias Pohl


[https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=57649=logs=0e7be18f-84f2-53f0-a32d-4a5e4a174679=7c1d86e3-35bd-5fd5-3b7c-30c126a78702=8746]
{code:java}
Feb 20 01:20:02 01:20:02.369 [ERROR] Process Exit Code: 2
Feb 20 01:20:02 01:20:02.369 [ERROR] Crashed tests:
Feb 20 01:20:02 01:20:02.369 [ERROR] 
org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionDriverTest
Feb 20 01:20:02 01:20:02.369 [ERROR]at 
org.apache.maven.plugin.surefire.booterclient.ForkStarter.fork(ForkStarter.java:748)
{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [VOTE] Release flink-connector-jdbc, release candidate #3

2024-02-20 Thread Matthias Pohl
+1 (binding)

* Downloaded artifacts
* Extracted sources and compiled them
* Checked diff of git tag checkout with downloaded sources
* Verified SHA512 & GPG checksums
* Checked that all POMs have the expected version
* Generated diffs to compare pom file changes with NOTICE files

On Tue, Feb 20, 2024 at 11:23 AM Leonard Xu  wrote:

> Thanks Sergey for driving this release.
>
> +1 (binding)
>
> - verified signatures
> - verified hashsums
> - built from source code with Maven 3.8.1 and Scala 2.12 succeeded
> - checked Github release tag
> - checked release notes
> - reviewed that all jira tickets have been resolved
> - reviewed the web PR and left one minor comment about backporting bugfix
> to main branch
> **Note** The release date in jira[1] needs to be updated
>
> Best,
> Leonard
> [1] https://issues.apache.org/jira/projects/FLINK/versions/12354088
>
>
> > 2024年2月20日 下午5:15,Sergey Nuyanzin  写道:
> >
> > +1 (non-binding)
> >
> > - Validated checksum hash
> > - Verified signature from another machine
> > - Checked that tag is present in Github
> > - Built the source
> >
> > On Tue, Feb 20, 2024 at 10:13 AM Sergey Nuyanzin 
> > wrote:
> >
> >> Hi David
> >> thanks for checking and sorry for the late reply
> >>
> >> yep, that's ok this just means that you haven't signed my key which is
> ok
> >> (usually it could happen during virtual key signing parties)
> >>
> >> For release checking it is ok to check that the key which was used to
> sign
> >> the artifacts is included into Flink release KEYS file [1]
> >>
> >> [1] https://dist.apache.org/repos/dist/release/flink/KEYS
> >>
> >> On Thu, Feb 8, 2024 at 3:50 PM David Radley 
> >> wrote:
> >>
> >>> Thanks Sergey,
> >>>
> >>> It looks better now.
> >>>
> >>> gpg --verify flink-connector-jdbc-3.1.2-1.18.jar.asc
> >>>
> >>> gpg: assuming signed data in 'flink-connector-jdbc-3.1.2-1.18.jar'
> >>>
> >>> gpg: Signature made Thu  1 Feb 10:54:45 2024 GMT
> >>>
> >>> gpg: using RSA key F7529FAE24811A5C0DF3CA741596BBF0726835D8
> >>>
> >>> gpg: Good signature from "Sergey Nuyanzin (CODE SIGNING KEY)
> >>> snuyan...@apache.org" [unknown]
> >>>
> >>> gpg: aka "Sergey Nuyanzin (CODE SIGNING KEY)
> >>> snuyan...@gmail.com" [unknown]
> >>>
> >>> gpg: aka "Sergey Nuyanzin <snuyan...@gmail.com>" [unknown]
> >>>
> >>> gpg: WARNING: This key is not certified with a trusted signature!
> >>>
> >>> gpg:  There is no indication that the signature belongs to the
> >>> owner.
> >>>
> >>> I assume the warning is ok,
> >>>  Kind regards, David.
> >>>
> >>> From: Sergey Nuyanzin 
> >>> Date: Thursday, 8 February 2024 at 14:39
> >>> To: dev@flink.apache.org 
> >>> Subject: [EXTERNAL] Re: FW: RE: [VOTE] Release flink-connector-jdbc,
> >>> release candidate #3
> >>> Hi David
> >>>
> >>> it looks like in your case you don't specify the jar itself and
> probably
> >>> it
> >>> is not in current dir
> >>> so it should be something like that (assuming that both asc and jar
> file
> >>> are downloaded and are in current folder)
> >>> gpg --verify flink-connector-jdbc-3.1.2-1.16.jar.asc
> >>> flink-connector-jdbc-3.1.2-1.16.jar
> >>>
> >>> Here it is a more complete guide how to do it for Apache projects [1]
> >>>
> >>> [1] https://www.apache.org/info/verification.html#CheckingSignatures
> >>>
> >>> On Thu, Feb 8, 2024 at 12:38 PM David Radley 
> >>> wrote:
> >>>
>  Hi,
>  I was looking more at the asc files. I imported the keys and tried.
> 
> 
>  gpg --verify flink-connector-jdbc-3.1.2-1.16.jar.asc
> 
>  gpg: no signed data
> 
>  gpg: can't hash datafile: No data
> 
>  This seems to be the same for all the asc files. It does not look
> right; am I doing something incorrect?
>    Kind regards, David.
> 
> 
>  From: David Radley 
>  Date: Thursday, 8 February 2024 at 10:46
>  To: dev@flink.apache.org 
>  Subject: [EXTERNAL] RE: [VOTE] Release flink-connector-jdbc, release
>  candidate #3
>  +1 (non-binding)
> 
>  I assume that https://github.com/apache/flink-web/pull/707 can be
>  completed after the release is out.
> 
>  From: Martijn Visser 
>  Date: Friday, 2 February 2024 at 08:38
>  To: dev@flink.apache.org 
>  Subject: [EXTERNAL] Re: [VOTE] Release flink-connector-jdbc, release
>  candidate #3
>  +1 (binding)
> 
>  - Validated hashes
>  - Verified signature
>  - Verified that no binaries exist in the source archive
>  - Build the source with Maven
>  - Verified licenses
>  - Verified web PRs
> 
>  On Fri, Feb 2, 2024 at 9:31 AM Yanquan Lv 
> wrote:
> 
> > +1 (non-binding)
> >
> > - Validated checksum hash
> > - Verified signature
> > - Build the source with Maven and jdk8/11/17
> > - Check that the jar is built by jdk8
> > - Verified that no binaries exist in the 

Re: [DISCUSS] FLIP-402: Extend ZooKeeper Curator configurations

2024-02-20 Thread Matthias Pohl
Thanks for your reply Zhu Zhu. I guess, if we don't see any value in
aligning the parameter names (I don't have a strong argument either aside
from "it looks nicer"), there wouldn't be a need to add it to the
guidelines as well.

Sorry for not responding right away. I did a bit of research on the
AuthInfo configuration parameter (server side authorization [1], SO thread
on utilizing curator's authorization API [2]). It looks like using
String#getBytes() is a valid approach to configure this. So, in this regard,
I don't have anything to add to this FLIP proposal.

+1 LGTM

[1]
https://cwiki.apache.org/confluence/display/ZOOKEEPER/Client-Server+mutual+authentication
[2] https://stackoverflow.com/questions/40427700/using-acl-with-curator
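As an illustration of the Map -> List<AuthInfo> conversion discussed in this thread (not actual Flink or FLIP code — a stand-in AuthInfo class is defined so the sketch compiles on its own; the real class is Curator's org.apache.curator.framework.AuthInfo with its (String, byte[]) constructor):

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class Main {
    // Stand-in for org.apache.curator.framework.AuthInfo, included only to
    // keep this sketch self-contained.
    static final class AuthInfo {
        final String scheme;
        final byte[] auth;
        AuthInfo(String scheme, byte[] auth) {
            this.scheme = scheme;
            this.auth = auth;
        }
    }

    // Convert configured auth entries (scheme -> auth string) into the
    // List<AuthInfo> shape expected by Curator's authorization(List) call,
    // using String#getBytes as discussed above.
    static List<AuthInfo> toAuthInfos(Map<String, String> entries) {
        List<AuthInfo> result = new ArrayList<>();
        for (Map.Entry<String, String> e : entries.entrySet()) {
            result.add(new AuthInfo(e.getKey(), e.getValue().getBytes(StandardCharsets.UTF_8)));
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> config = new LinkedHashMap<>();
        config.put("digest", "user:password");
        List<AuthInfo> infos = toAuthInfos(config);
        System.out.println(infos.get(0).scheme + ":"
                + new String(infos.get(0).auth, StandardCharsets.UTF_8));
    }
}
```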

On Wed, Jan 24, 2024 at 12:35 PM Zhu Zhu  wrote:

> @Matthias
> Thanks for raising the question.
>
> AFAIK, there is no guide of this naming convention yet. But +1 to add this
> naming
> convention in Flink code style guide. So that new configuration names can
> follow
> the guide.
>
> However, I tend to not force the configuration name alignment in Flink 2.0.
> It does not bring obvious benefits to users but will increase the
> migration cost.
> And the feature freeze of 1.19 is coming soon. I think we can add aligned
> key
> names for those exceptional config options in 1.20, but remove the old
> keys in
> later major versions.
>
> Thanks,
> Zhu
>
> Matthias Pohl  于2024年1月23日周二 19:45写道:
>
>> - Regarding names: sure it totally makes sense to follow the kebab case
>>> and Flip has reflected the change.
>>> Regarding the convention, Flink has this widely used configuration
>>> storageDir, which doesn't follow the kebab rule and creates some confusion.
>>> IMHO it would be valuable to add a clear guide.
>>
>>
>> Ah true, I should have checked the HA-related parameters as well.
>> Initially, I just briefly skimmed over a few ConfigOptions names.
>>
>> @Zhu Zhu Is the alignment of the configuration parameter names also part
>> of the 2.0 efforts that touch the Flink configuration? Is there a guideline
>> we can follow here which is future-proof in terms of parameter naming?
>>
>> - I am considering calling the next method from the Curator framework:
>>> authorization(List<AuthInfo>) [2]. I have added necessary details regarding
>>> Map<String, String> -> List<AuthInfo> conversion, taking into account that
>>> AuthInfo has a constructor with String, byte[] parameters.
>>>
>>
>> The update in the FLIP looks good to me.
>>
>> - Good point. Please let me know if I am missing something, but it seems
>>> that we already can influence ACLProvider for Curator in Flink with
>>> high-availability.zookeeper.client.acl [2] . The way it is done currently
>>> is translation of the predefined constant to some predefined ACL Provider
>>> [3]. I do not see if we can add something to the current FLIP. I suppose
>>> that eventual extension of the supported ACLProvider would be
>>> straightforward and could be done outside of the current Flip as soon
>>> as concrete use-case requirements arise.
>>
>>
>> Thanks for the pointer. My concern is just that, we might have to
>> consider certain formats for the AuthInfo to be aligned with ACLProviders.
>>
>> @Marton: I know that ZooKeeper is probably a bit unrelated to FLIP-211
>> [1] but since you worked on the Kerberos delegation token provider: Is
>> there something to consider for the ZK Kerberos integration? Maybe, you can
>> help us out.
>>
>> Matthias
>>
>> [1]
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-211%3A+Kerberos+delegation+token+framework
>>
>> On Mon, Jan 22, 2024 at 6:55 PM Alex Nitavsky 
>> wrote:
>>
>>> Hello Matthias,
>>>
>>> Thanks a lot for the feedback.
>>>
>>> - Regarding names: sure it totally makes sense to follow the kebab case
>>> and Flip has reflected the change.
>>> Regarding the convention, Flink has this widely used configuration
>>> storageDir, which doesn't follow the kebab rule and creates some confusion.
>>> IMHO it would be valuable to add a clear guide.
>>>
>>> - I am considering calling the next method from the Curator framework:
>>> authorization(List<AuthInfo>) [2]. I have added necessary details regarding
>>> Map<String, String> -> List<AuthInfo> conversion, taking into account that
>>> AuthInfo has a constructor with String, byte[] parameters.
>>>
>>> - Good point. Please let me know if I am missing something, but it seems
>>> that we already can influence

Re: [ANNOUNCE] New Apache Flink Committer - Jiabao Sun

2024-02-19 Thread Matthias Pohl
Congratulations, Jiabao!

On Mon, Feb 19, 2024 at 12:21 PM He Wang  wrote:

> Congrats, Jiabao!
>
> On Mon, Feb 19, 2024 at 7:19 PM Benchao Li  wrote:
>
> > Congrats, Jiabao!
> >
> > Zhanghao Chen  于2024年2月19日周一 18:42写道:
> > >
> > > Congrats, Jiabao!
> > >
> > > Best,
> > > Zhanghao Chen
> > > 
> > > From: Qingsheng Ren 
> > > Sent: Monday, February 19, 2024 17:53
> > > To: dev ; jiabao...@apache.org <
> > jiabao...@apache.org>
> > > Subject: [ANNOUNCE] New Apache Flink Committer - Jiabao Sun
> > >
> > > Hi everyone,
> > >
> > > On behalf of the PMC, I'm happy to announce Jiabao Sun as a new Flink
> > > Committer.
> > >
> > > Jiabao began contributing in August 2022 and has contributed 60+
> commits
> > > for Flink main repo and various connectors. His most notable
> contribution
> > > is being the core author and maintainer of MongoDB connector, which is
> > > fully functional in DataStream and Table/SQL APIs. Jiabao is also the
> > > author of FLIP-377 and the main contributor of JUnit 5 migration in
> > runtime
> > > and table planner modules.
> > >
> > > Beyond his technical contributions, Jiabao is an active member of our
> > > community, participating in the mailing list and consistently
> > volunteering
> > > for release verifications and code reviews with enthusiasm.
> > >
> > > Please join me in congratulating Jiabao for becoming an Apache Flink
> > > committer!
> > >
> > > Best,
> > > Qingsheng (on behalf of the Flink PMC)
> >
> >
> >
> > --
> >
> > Best,
> > Benchao Li
> >
>


[jira] [Created] (FLINK-34464) actions/cache@v4 times out

2024-02-19 Thread Matthias Pohl (Jira)
Matthias Pohl created FLINK-34464:
-

 Summary: actions/cache@v4 times out
 Key: FLINK-34464
 URL: https://issues.apache.org/jira/browse/FLINK-34464
 Project: Flink
  Issue Type: Bug
  Components: Build System / CI, Test Infrastructure
Reporter: Matthias Pohl


[https://github.com/apache/flink/actions/runs/7953599167/job/21710058433#step:4:125]

Pulling the docker image stalled. This should be a temporary issue:
{code:java}
/usr/bin/docker exec  
601a5a6e68acf3ba38940ec7a07e08d7c57e763ca0364070124f71bc2f708bc3 sh -c "cat 
/etc/*release | grep ^ID"
Received 260046848 of 1429155280 (18.2%), 248.0 MBs/sec
Received 545259520 of 1429155280 (38.2%), 260.0 MBs/sec
[...]
Received 914358272 of 1429155280 (64.0%), 0.0 MBs/sec
Received 914358272 of 1429155280 (64.0%), 0.0 MBs/sec
Error: The operation was canceled. {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-34450) TwoInputStreamTaskTest.testWatermarkAndWatermarkStatusForwarding failed

2024-02-16 Thread Matthias Pohl (Jira)
Matthias Pohl created FLINK-34450:
-

 Summary: 
TwoInputStreamTaskTest.testWatermarkAndWatermarkStatusForwarding failed
 Key: FLINK-34450
 URL: https://issues.apache.org/jira/browse/FLINK-34450
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Task
Affects Versions: 1.20.0
Reporter: Matthias Pohl


https://github.com/XComp/flink/actions/runs/7927275243/job/21643615491#step:10:9880

{code}
Error: 07:48:06 07:48:06.643 [ERROR] Tests run: 11, Failures: 1, Errors: 0, 
Skipped: 0, Time elapsed: 0.309 s <<< FAILURE! -- in 
org.apache.flink.streaming.runtime.tasks.TwoInputStreamTaskTest
Error: 07:48:06 07:48:06.646 [ERROR] 
org.apache.flink.streaming.runtime.tasks.TwoInputStreamTaskTest.testWatermarkAndWatermarkStatusForwarding
 -- Time elapsed: 0.036 s <<< FAILURE!
Feb 16 07:48:06 Output was not correct.: array lengths differed, 
expected.length=8 actual.length=7; arrays first differed at element [6]; 
expected: but was:
Feb 16 07:48:06 at 
org.junit.internal.ComparisonCriteria.arrayEquals(ComparisonCriteria.java:78)
Feb 16 07:48:06 at 
org.junit.internal.ComparisonCriteria.arrayEquals(ComparisonCriteria.java:28)
Feb 16 07:48:06 at org.junit.Assert.internalArrayEquals(Assert.java:534)
Feb 16 07:48:06 at org.junit.Assert.assertArrayEquals(Assert.java:285)
Feb 16 07:48:06 at 
org.apache.flink.streaming.util.TestHarnessUtil.assertOutputEquals(TestHarnessUtil.java:59)
Feb 16 07:48:06 at 
org.apache.flink.streaming.runtime.tasks.TwoInputStreamTaskTest.testWatermarkAndWatermarkStatusForwarding(TwoInputStreamTaskTest.java:248)
Feb 16 07:48:06 at java.lang.reflect.Method.invoke(Method.java:498)
Feb 16 07:48:06 Caused by: java.lang.AssertionError: expected: 
but was:
Feb 16 07:48:06 at org.junit.Assert.fail(Assert.java:89)
Feb 16 07:48:06 at org.junit.Assert.failNotEquals(Assert.java:835)
Feb 16 07:48:06 at org.junit.Assert.assertEquals(Assert.java:120)
Feb 16 07:48:06 at org.junit.Assert.assertEquals(Assert.java:146)
Feb 16 07:48:06 at 
org.junit.internal.ExactComparisonCriteria.assertElementsEqual(ExactComparisonCriteria.java:8)
Feb 16 07:48:06 at 
org.junit.internal.ComparisonCriteria.arrayEquals(ComparisonCriteria.java:76)
Feb 16 07:48:06 ... 6 more
{code}

I couldn't reproduce it locally with 2 runs.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-34449) Flink build took too long

2024-02-16 Thread Matthias Pohl (Jira)
Matthias Pohl created FLINK-34449:
-

 Summary: Flink build took too long
 Key: FLINK-34449
 URL: https://issues.apache.org/jira/browse/FLINK-34449
 Project: Flink
  Issue Type: Bug
  Components: Build System / CI, Test Infrastructure
Reporter: Matthias Pohl


We saw a timeout when building Flink in e2e1 stage. No logs are available to 
investigate the issue:
https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=57551=logs=bbb1e2a2-a43c-55c8-fb48-5cfe7a8a0ca6

{code}
Nothing to show. Final logs are missing. This can happen when the job is 
cancelled or times out.
{code}

I'd consider this an infrastructure issue but created the Jira issue for 
documentation purposes. Let's see whether that pops up again.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-34448) ChangelogLocalRecoveryITCase failed fatally with 127 exit code

2024-02-15 Thread Matthias Pohl (Jira)
Matthias Pohl created FLINK-34448:
-

 Summary: ChangelogLocalRecoveryITCase failed fatally with 127 exit 
code
 Key: FLINK-34448
 URL: https://issues.apache.org/jira/browse/FLINK-34448
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Coordination
Affects Versions: 1.20.0
Reporter: Matthias Pohl


https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=57550=logs=2c3cbe13-dee0-5837-cf47-3053da9a8a78=b78d9d30-509a-5cea-1fef-db7abaa325ae=8897
{code}
Feb 16 02:43:47 02:43:47.142 [ERROR] Failed to execute goal 
org.apache.maven.plugins:maven-surefire-plugin:3.2.2:test (integration-tests) 
on project flink-tests: 
Feb 16 02:43:47 02:43:47.142 [ERROR] 
Feb 16 02:43:47 02:43:47.142 [ERROR] Please refer to 
/__w/1/s/flink-tests/target/surefire-reports for the individual test results.
Feb 16 02:43:47 02:43:47.142 [ERROR] Please refer to dump files (if any exist) 
[date].dump, [date]-jvmRun[N].dump and [date].dumpstream.
Feb 16 02:43:47 02:43:47.142 [ERROR] ExecutionException The forked VM 
terminated without properly saying goodbye. VM crash or System.exit called?
Feb 16 02:43:47 02:43:47.142 [ERROR] Command was /bin/sh -c cd 
'/__w/1/s/flink-tests' && '/usr/lib/jvm/jdk-11.0.19+7/bin/java' '-XX:+UseG1GC' 
'-Xms256m' '-XX:+IgnoreUnrecognizedVMOptions' 
'--add-opens=java.base/java.util=ALL-UNNAMED' 
'--add-opens=java.base/java.io=ALL-UNNAMED' '-Xmx1536m' '-jar' 
'/__w/1/s/flink-tests/target/surefire/surefirebooter-20240216015747138_560.jar' 
'/__w/1/s/flink-tests/target/surefire' '2024-02-16T01-57-43_286-jvmRun4' 
'surefire-20240216015747138_558tmp' 'surefire_185-20240216015747138_559tmp'
Feb 16 02:43:47 02:43:47.142 [ERROR] Error occurred in starting fork, check 
output in log
Feb 16 02:43:47 02:43:47.142 [ERROR] Process Exit Code: 127
Feb 16 02:43:47 02:43:47.142 [ERROR] Crashed tests:
Feb 16 02:43:47 02:43:47.142 [ERROR] 
org.apache.flink.test.checkpointing.ChangelogLocalRecoveryITCase
Feb 16 02:43:47 02:43:47.142 [ERROR] 
org.apache.maven.surefire.booter.SurefireBooterForkException: 
ExecutionException The forked VM terminated without properly saying goodbye. VM 
crash or System.exit called?
Feb 16 02:43:47 02:43:47.142 [ERROR] Command was /bin/sh -c cd 
'/__w/1/s/flink-tests' && '/usr/lib/jvm/jdk-11.0.19+7/bin/java' '-XX:+UseG1GC' 
'-Xms256m' '-XX:+IgnoreUnrecognizedVMOptions' 
'--add-opens=java.base/java.util=ALL-UNNAMED' 
'--add-opens=java.base/java.io=ALL-UNNAMED' '-Xmx1536m' '-jar' 
'/__w/1/s/flink-tests/target/surefire/surefirebooter-20240216015747138_560.jar' 
'/__w/1/s/flink-tests/target/surefire' '2024-02-16T01-57-43_286-jvmRun4' 
'surefire-20240216015747138_558tmp' 'surefire_185-20240216015747138_559tmp'
Feb 16 02:43:47 02:43:47.142 [ERROR] Error occurred in starting fork, check 
output in log
Feb 16 02:43:47 02:43:47.142 [ERROR] Process Exit Code: 127
Feb 16 02:43:47 02:43:47.142 [ERROR] Crashed tests:
Feb 16 02:43:47 02:43:47.142 [ERROR] 
org.apache.flink.test.checkpointing.ChangelogLocalRecoveryITCase
Feb 16 02:43:47 02:43:47.142 [ERROR]at 
org.apache.maven.plugin.surefire.booterclient.ForkStarter.awaitResultsDone(ForkStarter.java:456)

{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-34447) ActiveResourceManagerTest#testWorkerRegistrationTimeoutNotCountingAllocationTime still fails on slow machines

2024-02-15 Thread Matthias Pohl (Jira)
Matthias Pohl created FLINK-34447:
-

 Summary: 
ActiveResourceManagerTest#testWorkerRegistrationTimeoutNotCountingAllocationTime
 still fails on slow machines
 Key: FLINK-34447
 URL: https://issues.apache.org/jira/browse/FLINK-34447
 Project: Flink
  Issue Type: Technical Debt
  Components: Runtime / Coordination
Affects Versions: 1.18.1, 1.19.0, 1.20.0
Reporter: Matthias Pohl


This appeared in this [PR CI 
run|https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=57529=logs=0da23115-68bb-5dcd-192c-bd4c8adebde1=24c3384f-1bcb-57b3-224f-51bf973bbee8=7997]
 of FLINK-34427.
{code}
Feb 14 18:50:01 18:50:01.283 [ERROR] Tests run: 18, Failures: 1, Errors: 0, 
Skipped: 0, Time elapsed: 0.665 s <<< FAILURE! -- in 
org.apache.flink.runtime.resourcemanager.active.ActiveResourceManagerTest
Feb 14 18:50:01 18:50:01.283 [ERROR] 
org.apache.flink.runtime.resourcemanager.active.ActiveResourceManagerTest.testWorkerRegistrationTimeoutNotCountingAllocationTime
 -- Time elapsed: 0.197 s <<< FAILURE!
Feb 14 18:50:01 java.lang.AssertionError: 
Feb 14 18:50:01 
Feb 14 18:50:01 Expecting
Feb 14 18:50:01   
Feb 14 18:50:01 not to be done.
Feb 14 18:50:01 Be aware that the state of the future in this message might not 
reflect the one at the time when the assertion was performed as it is evaluated 
later on
Feb 14 18:50:01 at 
org.apache.flink.runtime.resourcemanager.active.ActiveResourceManagerTest$15.lambda$new$3(ActiveResourceManagerTest.java:982)
Feb 14 18:50:01 at 
org.apache.flink.runtime.resourcemanager.active.ActiveResourceManagerTest$Context.runTest(ActiveResourceManagerTest.java:1133)
Feb 14 18:50:01 at 
org.apache.flink.runtime.resourcemanager.active.ActiveResourceManagerTest$15.<init>(ActiveResourceManagerTest.java:963)
Feb 14 18:50:01 at 
org.apache.flink.runtime.resourcemanager.active.ActiveResourceManagerTest.testWorkerRegistrationTimeoutNotCountingAllocationTime(ActiveResourceManagerTest.java:946)
Feb 14 18:50:01 at java.lang.reflect.Method.invoke(Method.java:498)
Feb 14 18:50:01 at 
java.util.concurrent.RecursiveAction.exec(RecursiveAction.java:189)
Feb 14 18:50:01 at 
java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
Feb 14 18:50:01 at 
java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056)
Feb 14 18:50:01 at 
java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692)
Feb 14 18:50:01 at 
java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:175)
{code}

But I was able to reproduce it locally as well.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-34443) YARNFileReplicationITCase.testPerJobModeWithCustomizedFileReplication failed when deploying job cluster

2024-02-14 Thread Matthias Pohl (Jira)
Matthias Pohl created FLINK-34443:
-

 Summary: 
YARNFileReplicationITCase.testPerJobModeWithCustomizedFileReplication failed 
when deploying job cluster
 Key: FLINK-34443
 URL: https://issues.apache.org/jira/browse/FLINK-34443
 Project: Flink
  Issue Type: Bug
  Components: Build System / CI, Runtime / Coordination, Test 
Infrastructure
Affects Versions: 1.19.0, 1.20.0
Reporter: Matthias Pohl


https://github.com/apache/flink/actions/runs/7895502206/job/21548246199#step:10:28804

{code}
Error: 03:04:05 03:04:05.066 [ERROR] Tests run: 2, Failures: 0, Errors: 1, 
Skipped: 0, Time elapsed: 68.10 s <<< FAILURE! -- in 
org.apache.flink.yarn.YARNFileReplicationITCase
Error: 03:04:05 03:04:05.067 [ERROR] 
org.apache.flink.yarn.YARNFileReplicationITCase.testPerJobModeWithCustomizedFileReplication
 -- Time elapsed: 1.982 s <<< ERROR!
Feb 14 03:04:05 org.apache.flink.client.deployment.ClusterDeploymentException: 
Could not deploy Yarn job cluster.
Feb 14 03:04:05 at 
org.apache.flink.yarn.YarnClusterDescriptor.deployJobCluster(YarnClusterDescriptor.java:566)
Feb 14 03:04:05 at 
org.apache.flink.yarn.YARNFileReplicationITCase.deployPerJob(YARNFileReplicationITCase.java:109)
Feb 14 03:04:05 at 
org.apache.flink.yarn.YARNFileReplicationITCase.lambda$testPerJobModeWithCustomizedFileReplication$0(YARNFileReplicationITCase.java:73)
Feb 14 03:04:05 at 
org.apache.flink.yarn.YarnTestBase.runTest(YarnTestBase.java:303)
Feb 14 03:04:05 at 
org.apache.flink.yarn.YARNFileReplicationITCase.testPerJobModeWithCustomizedFileReplication(YARNFileReplicationITCase.java:73)
Feb 14 03:04:05 at java.lang.reflect.Method.invoke(Method.java:498)
Feb 14 03:04:05 at 
java.util.concurrent.RecursiveAction.exec(RecursiveAction.java:189)
Feb 14 03:04:05 at 
java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
Feb 14 03:04:05 at 
java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056)
Feb 14 03:04:05 at 
java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692)
Feb 14 03:04:05 at 
java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:175)
Feb 14 03:04:05 Caused by: 
org.apache.hadoop.ipc.RemoteException(java.io.IOException): File 
/user/root/.flink/application_1707879779446_0002/log4j-api-2.17.1.jar could 
only be written to 0 of the 1 minReplication nodes. There are 2 datanode(s) 
running and 2 node(s) are excluded in this operation.
Feb 14 03:04:05 at 
org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:2260)
Feb 14 03:04:05 at 
org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.chooseTargetForNewBlock(FSDirWriteFileOp.java:294)
Feb 14 03:04:05 at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2813)
Feb 14 03:04:05 at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:908)
Feb 14 03:04:05 at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:577)
Feb 14 03:04:05 at 
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
Feb 14 03:04:05 at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:549)
Feb 14 03:04:05 at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:518)
Feb 14 03:04:05 at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1086)
Feb 14 03:04:05 at 
org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1029)
Feb 14 03:04:05 at 
org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:957)
Feb 14 03:04:05 at java.security.AccessController.doPrivileged(Native 
Method)
Feb 14 03:04:05 at javax.security.auth.Subject.doAs(Subject.java:422)
Feb 14 03:04:05 at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1762)
Feb 14 03:04:05 at 
org.apache.hadoop.ipc.Server$Handler.run(Server.java:2957)
Feb 14 03:04:05 
Feb 14 03:04:05 at 
org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1579)
Feb 14 03:04:05 at org.apache.hadoop.ipc.Client.call(Client.java:1525)
Feb 14 03:04:05 at org.apache.hadoop.ipc.Client.call(Client.java:1422)
Feb 14 03:04:05 at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:231)
Feb 14 03:04:05 at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118)
Feb 14 03:04:05 at com.sun.proxy.$Proxy113.addBlock(Unknown Sourc

[jira] [Created] (FLINK-34434) DefaultSlotStatusSyncer doesn't complete the returned future

2024-02-13 Thread Matthias Pohl (Jira)
Matthias Pohl created FLINK-34434:
-

 Summary: DefaultSlotStatusSyncer doesn't complete the returned 
future
 Key: FLINK-34434
 URL: https://issues.apache.org/jira/browse/FLINK-34434
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Coordination
Affects Versions: 1.18.1, 1.17.2, 1.19.0, 1.20.0
Reporter: Matthias Pohl


When looking into FLINK-34427 (unrelated), I noticed an odd line in 
[DefaultSlotStatusSyncer:155|https://github.com/apache/flink/blob/15fe1653acec45d7c7bac17071e9773a4aa690a4/flink-runtime/src/main/java/org/apache/flink/runtime/resourcemanager/slotmanager/DefaultSlotStatusSyncer.java#L155]
 where we complete a future that should already be completed (because the 
callback is triggered after the {{requestFuture}} has already been completed in 
some way). Shouldn't we complete the {{returnedFuture}} instead?

I'm keeping the priority at {{Major}} because it doesn't seem to have been an 
issue in the past.
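As a sketch of the suspected pattern (not the actual Flink code): once {{requestFuture}} is done, a callback registered on it runs immediately, so completing {{requestFuture}} inside that callback is a no-op, and it is the separate future handed back to the caller that needs completing:

```java
import java.util.concurrent.CompletableFuture;

public class Main {
    public static void main(String[] args) {
        // requestFuture is already completed when the callback is registered,
        // so the callback fires right away.
        CompletableFuture<Void> requestFuture = CompletableFuture.completedFuture(null);
        CompletableFuture<Void> returnedFuture = new CompletableFuture<>();

        requestFuture.whenComplete((ignored, throwable) -> {
            // Completing requestFuture here would be a no-op (it is done);
            // the future handed back to the caller is the one to complete.
            if (throwable != null) {
                returnedFuture.completeExceptionally(throwable);
            } else {
                returnedFuture.complete(null);
            }
        });

        System.out.println(returnedFuture.isDone());
    }
}
```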



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-34433) CollectionFunctionsITCase.test failed due to job restart

2024-02-13 Thread Matthias Pohl (Jira)
Matthias Pohl created FLINK-34433:
-

 Summary: CollectionFunctionsITCase.test failed due to job restart
 Key: FLINK-34433
 URL: https://issues.apache.org/jira/browse/FLINK-34433
 Project: Flink
  Issue Type: Bug
  Components: Table SQL / Planner
Affects Versions: 1.19.0, 1.20.0
Reporter: Matthias Pohl


https://github.com/apache/flink/actions/runs/7880739697/job/21503460772#step:10:11312

{code}
Error: 02:33:24 02:33:24.955 [ERROR] Tests run: 439, Failures: 0, Errors: 1, 
Skipped: 0, Time elapsed: 56.57 s <<< FAILURE! -- in 
org.apache.flink.table.planner.functions.CollectionFunctionsITCase
Error: 02:33:24 02:33:24.956 [ERROR] 
org.apache.flink.table.planner.functions.CollectionFunctionsITCase.test(TestCase)[81]
 -- Time elapsed: 1.141 s <<< ERROR!
Feb 13 02:33:24 java.lang.RuntimeException: Job restarted
Feb 13 02:33:24 at 
org.apache.flink.streaming.api.operators.collect.UncheckpointedCollectResultBuffer.sinkRestarted(UncheckpointedCollectResultBuffer.java:42)
Feb 13 02:33:24 at 
org.apache.flink.streaming.api.operators.collect.AbstractCollectResultBuffer.dealWithResponse(AbstractCollectResultBuffer.java:87)
Feb 13 02:33:24 at 
org.apache.flink.streaming.api.operators.collect.CollectResultFetcher.next(CollectResultFetcher.java:124)
Feb 13 02:33:24 at 
org.apache.flink.streaming.api.operators.collect.CollectResultIterator.nextResultFromFetcher(CollectResultIterator.java:126)
Feb 13 02:33:24 at 
org.apache.flink.streaming.api.operators.collect.CollectResultIterator.hasNext(CollectResultIterator.java:100)
Feb 13 02:33:24 at 
org.apache.flink.table.planner.connectors.CollectDynamicSink$CloseableRowIteratorWrapper.hasNext(CollectDynamicSink.java:247)
Feb 13 02:33:24 at 
org.assertj.core.internal.Iterators.assertHasNext(Iterators.java:49)
Feb 13 02:33:24 at 
org.assertj.core.api.AbstractIteratorAssert.hasNext(AbstractIteratorAssert.java:60)
Feb 13 02:33:24 at 
org.apache.flink.table.planner.functions.BuiltInFunctionTestBase$ResultTestItem.test(BuiltInFunctionTestBase.java:383)
Feb 13 02:33:24 at 
org.apache.flink.table.planner.functions.BuiltInFunctionTestBase$TestSetSpec.lambda$getTestCase$4(BuiltInFunctionTestBase.java:341)
Feb 13 02:33:24 at 
org.apache.flink.table.planner.functions.BuiltInFunctionTestBase$TestCase.execute(BuiltInFunctionTestBase.java:119)
Feb 13 02:33:24 at 
org.apache.flink.table.planner.functions.BuiltInFunctionTestBase.test(BuiltInFunctionTestBase.java:99)
Feb 13 02:33:24 at java.lang.reflect.Method.invoke(Method.java:498)
Feb 13 02:33:24 at 
java.util.concurrent.RecursiveAction.exec(RecursiveAction.java:189)
Feb 13 02:33:24 at 
java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
Feb 13 02:33:24 at 
java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056)
Feb 13 02:33:24 at 
java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692)
Feb 13 02:33:24 at 
java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:175)
{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-34428) WindowAggregateITCase#testEventTimeHopWindow_GroupingSets times out

2024-02-12 Thread Matthias Pohl (Jira)
Matthias Pohl created FLINK-34428:
-

 Summary: WindowAggregateITCase#testEventTimeHopWindow_GroupingSets 
times out
 Key: FLINK-34428
 URL: https://issues.apache.org/jira/browse/FLINK-34428
 Project: Flink
  Issue Type: Bug
  Components: Table SQL / API
Affects Versions: 1.18.1
Reporter: Matthias Pohl


https://github.com/apache/flink/actions/runs/7866453368/job/21460921339#step:10:15127

{code}
"main" #1 prio=5 os_prio=0 tid=0x7f1770cb7000 nid=0x4ad4d waiting on 
condition [0x7f17711f6000]
   java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for  <0xab48e3a0> (a 
java.util.concurrent.CompletableFuture$Signaller)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at 
java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1707)
at 
java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323)
at 
java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1742)
at 
java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1908)
at 
org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.execute(StreamExecutionEnvironment.java:2131)
at 
org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.execute(StreamExecutionEnvironment.java:2099)
at 
org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.execute(StreamExecutionEnvironment.java:2077)
at 
org.apache.flink.streaming.api.scala.StreamExecutionEnvironment.execute(StreamExecutionEnvironment.scala:876)
at 
org.apache.flink.table.planner.runtime.stream.sql.WindowAggregateITCase.testTumbleWindowWithoutOutputWindowColumns(WindowAggregateITCase.scala:477)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
[...]
{code}





[jira] [Created] (FLINK-34427) ResourceManagerTaskExecutorTest fails fatally (exit code 239)

2024-02-12 Thread Matthias Pohl (Jira)
Matthias Pohl created FLINK-34427:
-

 Summary: ResourceManagerTaskExecutorTest fails fatally (exit code 
239)
 Key: FLINK-34427
 URL: https://issues.apache.org/jira/browse/FLINK-34427
 Project: Flink
  Issue Type: Bug
Reporter: Matthias Pohl


https://github.com/apache/flink/actions/runs/7866453350/job/21460921911#step:10:8959

{code}
Error: 02:28:53 02:28:53.220 [ERROR] Process Exit Code: 239
Error: 02:28:53 02:28:53.220 [ERROR] Crashed tests:
Error: 02:28:53 02:28:53.220 [ERROR] 
org.apache.flink.runtime.resourcemanager.ResourceManagerTaskExecutorTest
Error: 02:28:53 02:28:53.220 [ERROR] 
org.apache.maven.surefire.booter.SurefireBooterForkException: 
ExecutionException The forked VM terminated without properly saying goodbye. VM 
crash or System.exit called?
Error: 02:28:53 02:28:53.220 [ERROR] Command was /bin/sh -c cd 
'/root/flink/flink-runtime' && '/usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java' 
'-XX:+UseG1GC' '-Xms256m' '-XX:+IgnoreUnrecognizedVMOptions' 
'--add-opens=java.base/java.util=ALL-UNNAMED' 
'--add-opens=java.base/java.lang=ALL-UNNAMED' 
'--add-opens=java.base/java.net=ALL-UNNAMED' 
'--add-opens=java.base/java.io=ALL-UNNAMED' 
'--add-opens=java.base/java.util.concurrent=ALL-UNNAMED' '-Xmx768m' '-jar' 
'/root/flink/flink-runtime/target/surefire/surefirebooter-20240212022332296_94.jar'
 '/root/flink/flink-runtime/target/surefire' '2024-02-12T02-21-39_495-jvmRun3' 
'surefire-20240212022332296_88tmp' 'surefire_26-20240212022332296_91tmp'
Error: 02:28:53 02:28:53.220 [ERROR] Error occurred in starting fork, check 
output in log
Error: 02:28:53 02:28:53.220 [ERROR] Process Exit Code: 239
Error: 02:28:53 02:28:53.220 [ERROR] Crashed tests:
Error: 02:28:53 02:28:53.221 [ERROR] 
org.apache.flink.runtime.resourcemanager.ResourceManagerTaskExecutorTest
Error: 02:28:53 02:28:53.221 [ERROR]at 
org.apache.maven.plugin.surefire.booterclient.ForkStarter.awaitResultsDone(ForkStarter.java:456)
[...]
{code}





[jira] [Created] (FLINK-34426) HybridShuffleITCase.testHybridSelectiveExchangesRestart times out

2024-02-12 Thread Matthias Pohl (Jira)
Matthias Pohl created FLINK-34426:
-

 Summary: HybridShuffleITCase.testHybridSelectiveExchangesRestart 
times out
 Key: FLINK-34426
 URL: https://issues.apache.org/jira/browse/FLINK-34426
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Network
Affects Versions: 1.18.1
Reporter: Matthias Pohl


https://github.com/apache/flink/actions/runs/7851900779/job/21429781783#step:10:9052

{code}
"ForkJoinPool-1-worker-3" #16 daemon prio=5 os_prio=0 cpu=3397.79ms 
elapsed=11462.88s tid=0x7f48966b3800 nid=0x7a303 waiting on condition  
[0x7f486e97a000]
   java.lang.Thread.State: WAITING (parking)
at jdk.internal.misc.Unsafe.park(java.base@11.0.19/Native Method)
- parking to wait for  <0xa2faa230> (a 
java.util.concurrent.CompletableFuture$Signaller)
at 
java.util.concurrent.locks.LockSupport.park(java.base@11.0.19/LockSupport.java:194)
at 
java.util.concurrent.CompletableFuture$Signaller.block(java.base@11.0.19/CompletableFuture.java:1796)
at 
java.util.concurrent.ForkJoinPool.managedBlock(java.base@11.0.19/ForkJoinPool.java:3118)
at 
java.util.concurrent.CompletableFuture.waitingGet(java.base@11.0.19/CompletableFuture.java:1823)
at 
java.util.concurrent.CompletableFuture.get(java.base@11.0.19/CompletableFuture.java:1998)
at 
org.apache.flink.util.AutoCloseableAsync.close(AutoCloseableAsync.java:36)
at 
org.apache.flink.test.runtime.JobGraphRunningUtil.execute(JobGraphRunningUtil.java:61)
at 
org.apache.flink.test.runtime.BatchShuffleITCaseBase.executeJob(BatchShuffleITCaseBase.java:117)
at 
org.apache.flink.test.runtime.HybridShuffleITCase.testHybridSelectiveExchangesRestart(HybridShuffleITCase.java:79)
at 
jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(java.base@11.0.19/Native 
Method)
[...]
{code}





[jira] [Created] (FLINK-34425) TaskManagerRunnerITCase#testNondeterministicWorkingDirIsDeletedInCaseOfProcessFailure times out

2024-02-12 Thread Matthias Pohl (Jira)
Matthias Pohl created FLINK-34425:
-

 Summary: 
TaskManagerRunnerITCase#testNondeterministicWorkingDirIsDeletedInCaseOfProcessFailure
 times out
 Key: FLINK-34425
 URL: https://issues.apache.org/jira/browse/FLINK-34425
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Coordination
Affects Versions: 1.19.0, 1.20.0
Reporter: Matthias Pohl


https://github.com/apache/flink/actions/runs/7851900616/job/21429757962#step:10:8844

{code}
Feb 10 03:21:45 "main" #1 [498632] prio=5 os_prio=0 cpu=619.91ms 
elapsed=1653.40s tid=0x7fbd29695000 nid=498632 waiting on condition  
[0x7fbd2b9f3000]
Feb 10 03:21:45java.lang.Thread.State: WAITING (parking)
Feb 10 03:21:45 at 
jdk.internal.misc.Unsafe.park(java.base@21.0.1/Native Method)
Feb 10 03:21:45 - parking to wait for  <0xae6199f0> (a 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
Feb 10 03:21:45 at 
java.util.concurrent.locks.LockSupport.park(java.base@21.0.1/LockSupport.java:371)
Feb 10 03:21:45 at 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionNode.block(java.base@21.0.1/AbstractQueuedSynchronizer.java:519)
Feb 10 03:21:45 at 
java.util.concurrent.ForkJoinPool.unmanagedBlock(java.base@21.0.1/ForkJoinPool.java:3780)
Feb 10 03:21:45 at 
java.util.concurrent.ForkJoinPool.managedBlock(java.base@21.0.1/ForkJoinPool.java:3725)
Feb 10 03:21:45 at 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(java.base@21.0.1/AbstractQueuedSynchronizer.java:1707)
Feb 10 03:21:45 at 
java.lang.ProcessImpl.waitFor(java.base@21.0.1/ProcessImpl.java:425)
Feb 10 03:21:45 at 
org.apache.flink.test.recovery.TaskManagerRunnerITCase.testNondeterministicWorkingDirIsDeletedInCaseOfProcessFailure(TaskManagerRunnerITCase.java:126)
Feb 10 03:21:45 at 
java.lang.invoke.LambdaForm$DMH/0x7fbccb1b8000.invokeVirtual(java.base@21.0.1/LambdaForm$DMH)
Feb 10 03:21:45 at 
java.lang.invoke.LambdaForm$MH/0x7fbccb1b8800.invoke(java.base@21.0.1/LambdaForm$MH)
[...]
{code}





[jira] [Created] (FLINK-34424) BoundedBlockingSubpartitionWriteReadTest#testRead10ConsumersConcurrent times out

2024-02-11 Thread Matthias Pohl (Jira)
Matthias Pohl created FLINK-34424:
-

 Summary: 
BoundedBlockingSubpartitionWriteReadTest#testRead10ConsumersConcurrent times out
 Key: FLINK-34424
 URL: https://issues.apache.org/jira/browse/FLINK-34424
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Network
Affects Versions: 1.19.0, 1.20.0
Reporter: Matthias Pohl


https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=57446=logs=0da23115-68bb-5dcd-192c-bd4c8adebde1=24c3384f-1bcb-57b3-224f-51bf973bbee8=9151

{code}
Feb 11 13:55:29 "ForkJoinPool-50-worker-25" #414 daemon prio=5 os_prio=0 
tid=0x7f19503af800 nid=0x284c in Object.wait() [0x7f191b6db000]
Feb 11 13:55:29java.lang.Thread.State: WAITING (on object monitor)
Feb 11 13:55:29 at java.lang.Object.wait(Native Method)
Feb 11 13:55:29 at java.lang.Thread.join(Thread.java:1252)
Feb 11 13:55:29 - locked <0xe2e019a8> (a 
org.apache.flink.runtime.io.network.partition.BoundedBlockingSubpartitionWriteReadTest$LongReader)
Feb 11 13:55:29 at 
org.apache.flink.core.testutils.CheckedThread.trySync(CheckedThread.java:104)
Feb 11 13:55:29 at 
org.apache.flink.core.testutils.CheckedThread.sync(CheckedThread.java:92)
Feb 11 13:55:29 at 
org.apache.flink.core.testutils.CheckedThread.sync(CheckedThread.java:81)
Feb 11 13:55:29 at 
org.apache.flink.runtime.io.network.partition.BoundedBlockingSubpartitionWriteReadTest.testRead10ConsumersConcurrent(BoundedBlockingSubpartitionWriteReadTest.java:177)
Feb 11 13:55:29 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native 
Method)
[...]
{code}





[jira] [Created] (FLINK-34423) Make tool/ci/compile_ci.sh not necessarily rely on clean phase

2024-02-11 Thread Matthias Pohl (Jira)
Matthias Pohl created FLINK-34423:
-

 Summary: Make tool/ci/compile_ci.sh not necessarily rely on clean 
phase
 Key: FLINK-34423
 URL: https://issues.apache.org/jira/browse/FLINK-34423
 Project: Flink
  Issue Type: Sub-task
  Components: Build System / CI
Affects Versions: 1.18.1, 1.19.0, 1.20.0
Reporter: Matthias Pohl


The GHA {{Test packaging/licensing}} job runs 
[.github/workflows/template.flink-ci.yml:169|https://github.com/apache/flink/blob/85edd784fc72c1784849e2b122cbf3215f89817c/.github/workflows/template.flink-ci.yml#L169]
 which enables Maven's {{clean}} phase. This causes redundant work: instead of 
reusing the build artifacts of the preceding {{Compile}} job, the {{Test 
packaging/licensing}} job reruns {{test-compile}} once more.

Disabling {{clean}} should improve the runtime of the {{Test 
packaging/licensing}} job.
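The proposed change can be sketched as the difference between the two Maven invocations below. This is illustrative only; the actual goals and flags used by tool/ci/compile_ci.sh are not reproduced here:

```shell
# Illustrative only: the actual goals/flags used by the CI scripts may differ.
mvn clean test-compile   # current: 'clean' wipes target/, forcing a full recompile
mvn test-compile         # proposed: reuse the artifacts produced by the Compile job
```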





[jira] [Created] (FLINK-34419) flink-docker's .github/workflows/snapshot.yml doesn't support JDK 17 and 21

2024-02-09 Thread Matthias Pohl (Jira)
Matthias Pohl created FLINK-34419:
-

 Summary: flink-docker's .github/workflows/snapshot.yml doesn't 
support JDK 17 and 21
 Key: FLINK-34419
 URL: https://issues.apache.org/jira/browse/FLINK-34419
 Project: Flink
  Issue Type: Technical Debt
  Components: Build System / CI
Reporter: Matthias Pohl


[.github/workflows/snapshot.yml|https://github.com/apache/flink-docker/blob/master/.github/workflows/snapshot.yml#L40]
 needs to be updated: JDK 17 support was added in 1.18 (FLINK-15736), and JDK 21 
support was added in 1.19 (FLINK-33163).
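A fix would presumably extend the workflow's build matrix along these lines. This is a hedged sketch: the key names ({{branch}}, {{java_version}}) and the branch list are assumptions for illustration, not the actual schema of snapshot.yml:

```yaml
# Hypothetical sketch of an extended JDK build matrix for snapshot.yml.
# Key names and branch values are assumptions, not the file's actual contents.
strategy:
  matrix:
    branch: [dev-1.18, dev-1.19, dev-master]
    java_version: [8, 11, 17, 21]
    exclude:
      # JDK 21 support only landed in 1.19 (FLINK-33163),
      # so skip that combination for the 1.18 branch.
      - branch: dev-1.18
        java_version: 21
```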





[jira] [Created] (FLINK-34418) YARNSessionCapacitySchedulerITCase.testVCoresAreSetCorrectlyAndJobManagerHostnameAreShownInWebInterfaceAndDynamicPropertiesAndYarnApplicationNameAndTaskManagerSlots fail

2024-02-09 Thread Matthias Pohl (Jira)
Matthias Pohl created FLINK-34418:
-

 Summary: 
YARNSessionCapacitySchedulerITCase.testVCoresAreSetCorrectlyAndJobManagerHostnameAreShownInWebInterfaceAndDynamicPropertiesAndYarnApplicationNameAndTaskManagerSlots
 failed due to disk space
 Key: FLINK-34418
 URL: https://issues.apache.org/jira/browse/FLINK-34418
 Project: Flink
  Issue Type: Bug
  Components: Test Infrastructure
Affects Versions: 1.18.1, 1.19.0, 1.20.0
Reporter: Matthias Pohl


[https://github.com/apache/flink/actions/runs/7838691874/job/21390739806#step:10:27746]
{code:java}
[...]
Feb 09 03:00:13 Caused by: java.io.IOException: No space left on device
27608Feb 09 03:00:13at java.io.FileOutputStream.writeBytes(Native Method)
27609Feb 09 03:00:13at 
java.io.FileOutputStream.write(FileOutputStream.java:326)
27610Feb 09 03:00:13at 
org.apache.logging.log4j.core.appender.OutputStreamManager.writeToDestination(OutputStreamManager.java:250)
27611Feb 09 03:00:13... 39 more
[...] {code}





[jira] [Created] (FLINK-34416) "Local recovery and sticky scheduling end-to-end test" still doesn't work with AdaptiveScheduler

2024-02-08 Thread Matthias Pohl (Jira)
Matthias Pohl created FLINK-34416:
-

 Summary: "Local recovery and sticky scheduling end-to-end test" 
still doesn't work with AdaptiveScheduler
 Key: FLINK-34416
 URL: https://issues.apache.org/jira/browse/FLINK-34416
 Project: Flink
  Issue Type: Technical Debt
  Components: Runtime / Coordination
Affects Versions: 1.18.1, 1.19.0, 1.20.0
Reporter: Matthias Pohl


We tried to enable all {{AdaptiveScheduler}}-related tests in FLINK-34409 
because it appeared that all referenced Jira issues had been resolved. That's 
not the case for the {{"Local recovery and sticky scheduling end-to-end 
test"}}, though.

With the {{AdaptiveScheduler}} enabled, the test runs forever because a 
{{NullPointerException}} continuously triggers a failure:
{code}
Feb 07 19:02:59 2024-02-07 19:02:21,706 INFO  
org.apache.flink.runtime.executiongraph.ExecutionGraph   [] - Flat Map -> 
Sink: Unnamed (3/4) 
(54075d3d22edb729e5f396726f777860_20ba6b65f97481d5570070de90e4e791_2_16292) 
switched from INITIALIZING to FAILED on localhost:40893-09ff7>
Feb 07 19:02:59 java.lang.NullPointerException: Expected to find info here.
Feb 07 19:02:59 at 
org.apache.flink.util.Preconditions.checkNotNull(Preconditions.java:76) 
~[flink-dist-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
Feb 07 19:02:59 at 
org.apache.flink.streaming.tests.StickyAllocationAndLocalRecoveryTestJob$StateCreatingFlatMap.initializeState(StickyAllocationAndLocalRecoveryTestJob.java:340)
 ~[?:?]
Feb 07 19:02:59 at 
org.apache.flink.streaming.util.functions.StreamingFunctionUtils.tryRestoreFunction(StreamingFunctionUtils.java:187)
 ~[flink-dist-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
Feb 07 19:02:59 at 
org.apache.flink.streaming.util.functions.StreamingFunctionUtils.restoreFunctionState(StreamingFunctionUtils.java:169)
 ~[flink-dist-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
Feb 07 19:02:59 at 
org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator.initializeState(AbstractUdfStreamOperator.java:96)
 ~[flink-dist-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
Feb 07 19:02:59 at 
org.apache.flink.streaming.api.operators.StreamOperatorStateHandler.initializeOperatorState(StreamOperatorStateHandler.java:134)
 ~[flink-dist-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
Feb 07 19:02:59 at 
org.apache.flink.streaming.api.operators.AbstractStreamOperator.initializeState(AbstractStreamOperator.java:285)
 ~[flink-dist-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
Feb 07 19:02:59 at 
org.apache.flink.streaming.runtime.tasks.RegularOperatorChain.initializeStateAndOpenOperators(RegularOperatorChain.java:106)
 ~[flink-dist-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
Feb 07 19:02:59 at 
org.apache.flink.streaming.runtime.tasks.StreamTask.restoreStateAndGates(StreamTask.java:799)
 ~[flink-dist-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
Feb 07 19:02:59 at 
org.apache.flink.streaming.runtime.tasks.StreamTask.lambda$restoreInternal$3(StreamTask.java:753)
 ~[flink-dist-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
Feb 07 19:02:59 at 
org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$1.call(StreamTaskActionExecutor.java:55)
 ~[flink-dist-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
Feb 07 19:02:59 at 
org.apache.flink.streaming.runtime.tasks.StreamTask.restoreInternal(StreamTask.java:753)
 ~[flink-dist-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
Feb 07 19:02:59 at 
org.apache.flink.streaming.runtime.tasks.StreamTask.restore(StreamTask.java:712)
 ~[flink-dist-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
Feb 07 19:02:59 at 
org.apache.flink.runtime.taskmanager.Task.runWithSystemExitMonitoring(Task.java:958)
 ~[flink-dist-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
Feb 07 19:02:59 at 
org.apache.flink.runtime.taskmanager.Task.restoreAndInvoke(Task.java:927) 
~[flink-dist-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
Feb 07 19:02:59 at 
org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:751) 
~[flink-dist-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
Feb 07 19:02:59 at 
org.apache.flink.runtime.taskmanager.Task.run(Task.java:566) 
~[flink-dist-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
Feb 07 19:02:59 at java.lang.Thread.run(Thread.java:750) ~[?:1.8.0_402]
{code}

This error is caused by a precondition check in 
[StickyAllocationAndLocalRecoveryTestJob:340|https://github.com/apache/flink/blob/0f3470db83c1fddba9ac9a7299b1e61baab4ff12/flink-end-to-end-tests/flink-local-recovery-and-allocation-test/src/main/java/org/apache/flink/streaming/tests/StickyAllocationAndLocalRecoveryTestJob.java#L340]
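The failing pattern can be illustrated with a minimal, self-contained stand-in for {{Preconditions.checkNotNull}}. The helper below is a hypothetical sketch that mirrors the behavior of {{org.apache.flink.util.Preconditions#checkNotNull}}, not the actual Flink class:

```java
// Minimal sketch of the failing pattern (hypothetical stand-in, not Flink's
// actual Preconditions class): checkNotNull returns the reference unchanged
// when it is non-null and throws a NullPointerException carrying the given
// message otherwise, producing the "Expected to find info here." seen above.
public class PreconditionSketch {

    static <T> T checkNotNull(T reference, String errorMessage) {
        if (reference == null) {
            throw new NullPointerException(errorMessage);
        }
        return reference;
    }

    public static void main(String[] args) {
        // Non-null state passes through unchanged.
        System.out.println(checkNotNull("prev-allocation-id", "Expected to find info here."));

        // A null value (e.g. state missing after a restart) reproduces
        // the exception that keeps re-triggering the task failure.
        try {
            checkNotNull(null, "Expected to find info here.");
        } catch (NullPointerException e) {
            System.out.println("failed with: " + e.getMessage());
        }
    }
}
```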





[jira] [Created] (FLINK-34412) ResultPartitionDeploymentDescriptorTest fails due to fatal error (239 exit code)

2024-02-08 Thread Matthias Pohl (Jira)
Matthias Pohl created FLINK-34412:
-

 Summary: ResultPartitionDeploymentDescriptorTest fails due to 
fatal error (239 exit code)
 Key: FLINK-34412
 URL: https://issues.apache.org/jira/browse/FLINK-34412
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Coordination
Affects Versions: 1.17.2
Reporter: Matthias Pohl


https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=57388=logs=77a9d8e1-d610-59b3-fc2a-4766541e0e33=125e07e7-8de0-5c6c-a541-a567415af3ef=8323

{code}
Feb 08 04:56:31 [ERROR] 
org.apache.flink.runtime.deployment.ResultPartitionDeploymentDescriptorTest
Feb 08 04:56:31 [ERROR] 
org.apache.maven.surefire.booter.SurefireBooterForkException: 
ExecutionException The forked VM terminated without properly saying goodbye. VM 
crash or System.exit called?
Feb 08 04:56:31 [ERROR] Command was /bin/sh -c cd /__w/1/s/flink-runtime && 
/usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java -XX:+UseG1GC -Xms256m -Xmx768m 
-jar 
/__w/1/s/flink-runtime/target/surefire/surefirebooter6684124987290515696.jar 
/__w/1/s/flink-runtime/target/surefire 2024-02-08T04-45-49_396-jvmRun4 
surefire6142105262662423760tmp surefire_245661504424247139476tmp
Feb 08 04:56:31 [ERROR] Error occurred in starting fork, check output in log
Feb 08 04:56:31 [ERROR] Process Exit Code: 239
Feb 08 04:56:31 [ERROR] Crashed tests:
Feb 08 04:56:31 [ERROR] 
org.apache.flink.runtime.deployment.ResultPartitionDeploymentDescriptorTest
Feb 08 04:56:31 [ERROR] at 
org.apache.maven.plugin.surefire.booterclient.ForkStarter.awaitResultsDone(ForkStarter.java:532)
Feb 08 04:56:31 [ERROR] at 
org.apache.maven.plugin.surefire.booterclient.ForkStarter.runSuitesForkOnceMultiple(ForkStarter.java:405)
Feb 08 04:56:31 [ERROR] at 
org.apache.maven.plugin.surefire.booterclient.ForkStarter.run(ForkStarter.java:321)
Feb 08 04:56:31 [ERROR] at 
org.apache.maven.plugin.surefire.booterclient.ForkStarter.run(ForkStarter.java:266)
Feb 08 04:56:31 [ERROR] at 
org.apache.maven.plugin.surefire.AbstractSurefireMojo.executeProvider(AbstractSurefireMojo.java:1314)
Feb 08 04:56:31 [ERROR] at 
org.apache.maven.plugin.surefire.AbstractSurefireMojo.executeAfterPreconditionsChecked(AbstractSurefireMojo.java:1159)
Feb 08 04:56:31 [ERROR] at 
org.apache.maven.plugin.surefire.AbstractSurefireMojo.execute(AbstractSurefireMojo.java:932)
{code}





[jira] [Created] (FLINK-34411) "Wordcount on Docker test (custom fs plugin)" timed out with some strange issue while setting the test up

2024-02-08 Thread Matthias Pohl (Jira)
Matthias Pohl created FLINK-34411:
-

 Summary: "Wordcount on Docker test (custom fs plugin)" timed out 
with some strange issue while setting the test up
 Key: FLINK-34411
 URL: https://issues.apache.org/jira/browse/FLINK-34411
 Project: Flink
  Issue Type: Bug
  Components: Test Infrastructure
Affects Versions: 1.19.0, 1.20.0
Reporter: Matthias Pohl


https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=57380=logs=bea52777-eaf8-5663-8482-18fbc3630e81=43ba8ce7-ebbf-57cd-9163-444305d74117=5802

{code}
Feb 07 15:22:39 
==
Feb 07 15:22:39 Running 'Wordcount on Docker test (custom fs plugin)'
Feb 07 15:22:39 
==
Feb 07 15:22:39 TEST_DATA_DIR: 
/home/vsts/work/1/s/flink-end-to-end-tests/test-scripts/temp-test-directory-39516987853
Feb 07 15:22:40 Flink dist directory: 
/home/vsts/work/1/s/flink-dist/target/flink-1.19-SNAPSHOT-bin/flink-1.19-SNAPSHOT
Feb 07 15:22:40 Flink dist directory: 
/home/vsts/work/1/s/flink-dist/target/flink-1.19-SNAPSHOT-bin/flink-1.19-SNAPSHOT
Feb 07 15:22:41 Docker version 24.0.7, build afdd53b
Feb 07 15:22:44 docker-compose version 1.29.2, build 5becea4c
Feb 07 15:22:44 Starting fileserver for Flink distribution
Feb 07 15:22:44 ~/work/1/s/flink-dist/target/flink-1.19-SNAPSHOT-bin ~/work/1/s
Feb 07 15:23:07 ~/work/1/s
Feb 07 15:23:07 
~/work/1/s/flink-end-to-end-tests/test-scripts/temp-test-directory-39516987853 
~/work/1/s
Feb 07 15:23:07 Preparing Dockeriles
Feb 07 15:23:07 Executing command: git clone 
https://github.com/apache/flink-docker.git --branch dev-1.19 --single-branch
Cloning into 'flink-docker'...
/home/vsts/work/1/s/flink-end-to-end-tests/test-scripts/common_docker.sh: line 
65: ./add-custom.sh: No such file or directory
Feb 07 15:23:07 Building images
ERROR: unable to prepare context: path "dev/test_docker_embedded_job-ubuntu" 
not found
Feb 07 15:23:09 ~/work/1/s
Feb 07 15:23:09 Command: build_image test_docker_embedded_job failed. 
Retrying...
Feb 07 15:23:14 Starting fileserver for Flink distribution
Feb 07 15:23:14 ~/work/1/s/flink-dist/target/flink-1.19-SNAPSHOT-bin ~/work/1/s
Feb 07 15:23:36 ~/work/1/s
Feb 07 15:23:36 
~/work/1/s/flink-end-to-end-tests/test-scripts/temp-test-directory-39516987853 
~/work/1/s
Feb 07 15:23:36 Preparing Dockeriles
Feb 07 15:23:36 Executing command: git clone 
https://github.com/apache/flink-docker.git --branch dev-1.19 --single-branch
fatal: destination path 'flink-docker' already exists and is not an empty 
directory.
Feb 07 15:23:36 Retry 1/5 exited 128, retrying in 1 seconds...
Traceback (most recent call last):
  File 
"/home/vsts/work/1/s/flink-end-to-end-tests/test-scripts/python3_fileserver.py",
 line 26, in <module>
httpd = socketserver.TCPServer(("", ), handler)
  File "/usr/lib/python3.8/socketserver.py", line 452, in __init__
self.server_bind()
  File "/usr/lib/python3.8/socketserver.py", line 466, in server_bind
self.socket.bind(self.server_address)
OSError: [Errno 98] Address already in use
[...]
{code}





[jira] [Created] (FLINK-34410) Disable nightly trigger in forks

2024-02-07 Thread Matthias Pohl (Jira)
Matthias Pohl created FLINK-34410:
-

 Summary: Disable nightly trigger in forks
 Key: FLINK-34410
 URL: https://issues.apache.org/jira/browse/FLINK-34410
 Project: Flink
  Issue Type: Technical Debt
  Components: Build System / CI
Affects Versions: 1.20.0
Reporter: Matthias Pohl


We can disable the automatic triggering of the nightly trigger workflow in 
forks (see the [GHA 
docs|https://docs.github.com/en/actions/using-workflows/workflow-syntax-for-github-actions]):
{code}
if: github.repository == 'octo-org/octo-repo-prod'
{code}





Re: [DISCUSS] Alternative way of posting FLIPs

2024-02-07 Thread Matthias Pohl
+1 for option 1 since it's a reasonable temporary workaround

Moving to GitHub discussions would either mean moving the current FLIP
collection or having the FLIPs in two locations. Both options do not seem
to be optimal. Another concern I had was that GitHub Discussions wouldn't
allow integrating diagrams that easily. But it looks like they support
Mermaid [1] for diagrams.

One flaw of the Google Docs approach, though, is that diagrams have to be
provided as PNG/JPG/SVG rather than as draw.io diagrams. draw.io is more
tightly integrated with the Confluence wiki, which allows editing/updating
diagrams in the wiki itself rather than through some external tool.
Google Draw is also not that convenient to use in my opinion. Anyway,
that's a minor issue, I guess.

Matthias

[1]
https://docs.github.com/en/get-started/writing-on-github/working-with-advanced-formatting/creating-diagrams

On Wed, Feb 7, 2024 at 3:30 PM Lincoln Lee  wrote:

> Thanks Martijn for moving this forward!
>
> +1 for the first solution: for now it looks like a reasonable temporary
> workaround while we wait for the improvement by ASF Infra. Once access is
> restored for contributors, we can go back to the current workflow.
>
> For solution 2, one visible downside is that it becomes inconvenient to
> look for FLIPs (unless we permanently switch to GitHub Discussions).
>
> Looking forward to hearing more thoughts.
>
> Best,
> Lincoln Lee
>
>
> Martijn Visser  于2024年2月7日周三 21:51写道:
>
> > Hi all,
> >
> > ASF Infra has confirmed to me that only ASF committers can access the
> > ASF Confluence site since a recent change. One of the results of this
> > decision is that users can't signup and access Confluence, so only
> > committers+ can create FLIPs.
> >
> > ASF Infra hopes to improve this situation when they move to the Cloud
> > shortly (as in: some months), but they haven't committed on an actual
> > date. The idea would be that we find a temporary solution until anyone
> > can request access to Confluence.
> >
> > There are a couple of ways we could resolve this situation:
> > 1. Contributors create a Google Doc and make that view-only, and post
> > that Google Doc to the mailing list for a discussion thread. When the
> > discussions have been resolved, the contributor ask on the Dev mailing
> > list to a committer/PMC to copy the contents from the Google Doc, and
> > create a FLIP number for them. The contributor can then use that FLIP
> > to actually have a VOTE thread.
> > 2. We could consider moving FLIPs to "Discussions" on Github, like
> > Airflow does at https://github.com/apache/airflow/discussions
> > 3. Perhaps someone else has another good idea.
> >
> > Looking forward to your thoughts.
> >
> > Best regards,
> >
> > Martijn
> >
>


[jira] [Created] (FLINK-34409) Increase test coverage for AdaptiveScheduler

2024-02-07 Thread Matthias Pohl (Jira)
Matthias Pohl created FLINK-34409:
-

 Summary: Increase test coverage for AdaptiveScheduler
 Key: FLINK-34409
 URL: https://issues.apache.org/jira/browse/FLINK-34409
 Project: Flink
  Issue Type: Technical Debt
  Components: Runtime / Coordination
Affects Versions: 1.18.1, 1.19.0, 1.20.0
Reporter: Matthias Pohl


There are still several tests disabled for the {{AdaptiveScheduler}} that we 
can enable now; all the related issues appear to have been fixed.

We can even remove the {{@FailsWithAdaptiveScheduler}} annotation, as it's no 
longer needed.





[jira] [Created] (FLINK-34408) VeryBigPbProtoToRowTest#testSimple fails with OOM

2024-02-07 Thread Matthias Pohl (Jira)
Matthias Pohl created FLINK-34408:
-

 Summary: VeryBigPbProtoToRowTest#testSimple fails with OOM
 Key: FLINK-34408
 URL: https://issues.apache.org/jira/browse/FLINK-34408
 Project: Flink
  Issue Type: Bug
  Components: Formats (JSON, Avro, Parquet, ORC, SequenceFile)
Affects Versions: 1.20.0
Reporter: Matthias Pohl


https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=57371=logs=fc5181b0-e452-5c8f-68de-1097947f6483=995c650b-6573-581c-9ce6-7ad4cc038461=23861

{code}
Feb 07 09:40:16 09:40:16.314 [ERROR] Tests run: 1, Failures: 0, Errors: 1, 
Skipped: 0, Time elapsed: 29.58 s <<< FAILURE! -- in 
org.apache.flink.formats.protobuf.VeryBigPbProtoToRowTest
Feb 07 09:40:16 09:40:16.314 [ERROR] 
org.apache.flink.formats.protobuf.VeryBigPbProtoToRowTest.testSimple -- Time 
elapsed: 29.57 s <<< ERROR!
Feb 07 09:40:16 org.apache.flink.util.FlinkRuntimeException: Error in 
serialization.
Feb 07 09:40:16 at 
org.apache.flink.streaming.api.graph.StreamingJobGraphGenerator.createJobGraph(StreamingJobGraphGenerator.java:327)
Feb 07 09:40:16 at 
org.apache.flink.streaming.api.graph.StreamingJobGraphGenerator.createJobGraph(StreamingJobGraphGenerator.java:162)
Feb 07 09:40:16 at 
org.apache.flink.streaming.api.graph.StreamGraph.getJobGraph(StreamGraph.java:1007)
Feb 07 09:40:16 at 
org.apache.flink.client.StreamGraphTranslator.translateToJobGraph(StreamGraphTranslator.java:56)
Feb 07 09:40:16 at 
org.apache.flink.client.FlinkPipelineTranslationUtil.getJobGraph(FlinkPipelineTranslationUtil.java:45)
Feb 07 09:40:16 at 
org.apache.flink.client.deployment.executors.PipelineExecutorUtils.getJobGraph(PipelineExecutorUtils.java:61)
Feb 07 09:40:16 at 
org.apache.flink.client.deployment.executors.LocalExecutor.getJobGraph(LocalExecutor.java:104)
Feb 07 09:40:16 at 
org.apache.flink.client.deployment.executors.LocalExecutor.execute(LocalExecutor.java:81)
Feb 07 09:40:16 at 
org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.executeAsync(StreamExecutionEnvironment.java:2440)
Feb 07 09:40:16 at 
org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.executeAsync(StreamExecutionEnvironment.java:2421)
Feb 07 09:40:16 at 
org.apache.flink.streaming.api.datastream.DataStream.executeAndCollectWithClient(DataStream.java:1495)
Feb 07 09:40:16 at 
org.apache.flink.streaming.api.datastream.DataStream.executeAndCollect(DataStream.java:1382)
Feb 07 09:40:16 at 
org.apache.flink.streaming.api.datastream.DataStream.executeAndCollect(DataStream.java:1367)
Feb 07 09:40:16 at 
org.apache.flink.formats.protobuf.ProtobufTestHelper.validateRow(ProtobufTestHelper.java:66)
Feb 07 09:40:16 at 
org.apache.flink.formats.protobuf.ProtobufTestHelper.pbBytesToRow(ProtobufTestHelper.java:121)
Feb 07 09:40:16 at 
org.apache.flink.formats.protobuf.ProtobufTestHelper.pbBytesToRow(ProtobufTestHelper.java:103)
Feb 07 09:40:16 at 
org.apache.flink.formats.protobuf.ProtobufTestHelper.pbBytesToRow(ProtobufTestHelper.java:98)
Feb 07 09:40:16 at 
org.apache.flink.formats.protobuf.VeryBigPbProtoToRowTest.testSimple(VeryBigPbProtoToRowTest.java:36)
Feb 07 09:40:16 at java.lang.reflect.Method.invoke(Method.java:498)
Feb 07 09:40:16 Caused by: java.util.concurrent.ExecutionException: 
java.lang.OutOfMemoryError: Java heap space
Feb 07 09:40:16 at 
java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357)
Feb 07 09:40:16 at 
java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1908)
Feb 07 09:40:16 at 
org.apache.flink.streaming.api.graph.StreamingJobGraphGenerator.createJobGraph(StreamingJobGraphGenerator.java:323)
Feb 07 09:40:16 ... 18 more
Feb 07 09:40:16 Caused by: java.lang.OutOfMemoryError: Java heap space
Feb 07 09:40:16 at java.util.Arrays.copyOf(Arrays.java:3236)
Feb 07 09:40:16 at 
java.io.ByteArrayOutputStream.toByteArray(ByteArrayOutputStream.java:191)
Feb 07 09:40:16 at 
org.apache.flink.util.InstantiationUtil.serializeObject(InstantiationUtil.java:555)
Feb 07 09:40:16 at 
org.apache.flink.util.InstantiationUtil.writeObjectToConfig(InstantiationUtil.java:486)
Feb 07 09:40:16 at 
org.apache.flink.streaming.api.graph.StreamConfig.lambda$triggerSerializationAndReturnFuture$0(StreamConfig.java:182)
Feb 07 09:40:16 at 
org.apache.flink.streaming.api.graph.StreamConfig$$Lambda$1582/1961611609.accept(Unknown
 Source)
Feb 07 09:40:16 at 
java.util.concurrent.CompletableFuture.uniAccept(CompletableFuture.java:670)
Feb 07 09:40:16 at 
java.util.concurrent.CompletableFuture$UniAccept.tryFire(CompletableFuture.java:646)
Feb 07 09:40:16 at 
java.util.concurrent.CompletableFuture$Completion.run(Comp

[jira] [Created] (FLINK-34405) RightOuterJoinTaskTest#testCancelOuterJoinTaskWhileSort2 fails

2024-02-07 Thread Matthias Pohl (Jira)
Matthias Pohl created FLINK-34405:
-

 Summary: RightOuterJoinTaskTest#testCancelOuterJoinTaskWhileSort2 
fails
 Key: FLINK-34405
 URL: https://issues.apache.org/jira/browse/FLINK-34405
 Project: Flink
  Issue Type: Bug
  Components: API / Core
Affects Versions: 1.19.0, 1.20.0
Reporter: Matthias Pohl


https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=57357&view=logs&j=d89de3df-4600-5585-dadc-9bbc9a5e661c&t=be5a4b15-4b23-56b1-7582-795f58a645a2&l=9027

{code}
Feb 07 03:20:16 03:20:16.223 [ERROR] Failures: 
Feb 07 03:20:16 03:20:16.223 [ERROR] 
org.apache.flink.runtime.operators.RightOuterJoinTaskTest.testCancelOuterJoinTaskWhileSort2
Feb 07 03:20:16 03:20:16.223 [ERROR]   Run 1: 
RightOuterJoinTaskTest>AbstractOuterJoinTaskTest.testCancelOuterJoinTaskWhileSort2:435
 
Feb 07 03:20:16 expected: 
Feb 07 03:20:16   null
Feb 07 03:20:16  but was: 
Feb 07 03:20:16   java.lang.Exception: The data preparation caused an error: 
Interrupted
Feb 07 03:20:16 at 
org.apache.flink.runtime.operators.testutils.BinaryOperatorTestBase.testDriverInternal(BinaryOperatorTestBase.java:209)
Feb 07 03:20:16 at 
org.apache.flink.runtime.operators.testutils.BinaryOperatorTestBase.testDriver(BinaryOperatorTestBase.java:189)
Feb 07 03:20:16 at 
org.apache.flink.runtime.operators.AbstractOuterJoinTaskTest.access$100(AbstractOuterJoinTaskTest.java:48)
Feb 07 03:20:16 ...(1 remaining lines not displayed - this can be 
changed with Assertions.setMaxStackTraceElementsDisplayed)
{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-34404) RestoreTestBase#testRestore times out

2024-02-07 Thread Matthias Pohl (Jira)
Matthias Pohl created FLINK-34404:
-

 Summary: RestoreTestBase#testRestore times out
 Key: FLINK-34404
 URL: https://issues.apache.org/jira/browse/FLINK-34404
 Project: Flink
  Issue Type: Bug
  Components: Table SQL / Planner
Affects Versions: 1.19.0, 1.20.0
Reporter: Matthias Pohl


https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=57357&view=logs&j=32715a4c-21b8-59a3-4171-744e5ab107eb&t=ff64056b-5320-5afe-c22c-6fa339e59586&l=11603

{code}
Feb 07 02:17:40 "ForkJoinPool-74-worker-1" #382 daemon prio=5 os_prio=0 
cpu=282.22ms elapsed=961.78s tid=0x7f880a485c00 nid=0x6745 waiting on 
condition  [0x7f878a6f9000]
Feb 07 02:17:40java.lang.Thread.State: WAITING (parking)
Feb 07 02:17:40 at 
jdk.internal.misc.Unsafe.park(java.base@17.0.7/Native Method)
Feb 07 02:17:40 - parking to wait for  <0xff73d060> (a 
java.util.concurrent.CompletableFuture$Signaller)
Feb 07 02:17:40 at 
java.util.concurrent.locks.LockSupport.park(java.base@17.0.7/LockSupport.java:211)
Feb 07 02:17:40 at 
java.util.concurrent.CompletableFuture$Signaller.block(java.base@17.0.7/CompletableFuture.java:1864)
Feb 07 02:17:40 at 
java.util.concurrent.ForkJoinPool.compensatedBlock(java.base@17.0.7/ForkJoinPool.java:3449)
Feb 07 02:17:40 at 
java.util.concurrent.ForkJoinPool.managedBlock(java.base@17.0.7/ForkJoinPool.java:3432)
Feb 07 02:17:40 at 
java.util.concurrent.CompletableFuture.waitingGet(java.base@17.0.7/CompletableFuture.java:1898)
Feb 07 02:17:40 at 
java.util.concurrent.CompletableFuture.get(java.base@17.0.7/CompletableFuture.java:2072)
Feb 07 02:17:40 at 
org.apache.flink.table.planner.plan.nodes.exec.testutils.RestoreTestBase.testRestore(RestoreTestBase.java:292)
Feb 07 02:17:40 at 
jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(java.base@17.0.7/Native 
Method)
Feb 07 02:17:40 at 
jdk.internal.reflect.NativeMethodAccessorImpl.invoke(java.base@17.0.7/NativeMethodAccessorImpl.java:77)
Feb 07 02:17:40 at 
jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(java.base@17.0.7/DelegatingMethodAccessorImpl.java:43)
Feb 07 02:17:40 at 
java.lang.reflect.Method.invoke(java.base@17.0.7/Method.java:568)
Feb 07 02:17:40 at 
org.junit.platform.commons.util.ReflectionUtils.invokeMethod(ReflectionUtils.java:728)
[...]
{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: Frequent Flink JM restarts due to Kube API server errors.

2024-02-05 Thread Matthias Pohl
That's stated in the Jira issue. I didn't have the time to investigate it
further.

On Mon, Feb 5, 2024 at 1:55 PM Lavkesh Lahngir  wrote:

> Hi Matthias,
> Thanks for the suggestion. Do we know which part of code caused this issue
> and how it was fixed?
>
> Thanks!
>
> On Mon, 5 Feb 2024 at 18:06, Matthias Pohl  .invalid>
> wrote:
>
> > Hi Lavkesh,
> > FLINK-33998 [1] sounds quite similar to what you describe.
> >
> > The solution was to upgrade to Flink version 1.14.6. I didn't have the
> > capacity to look into the details considering that the mentioned Flink
> > version 1.14 is not officially supported by the community anymore and a
> fix
> > seems to have been provided with a newer version.
> >
> > Matthias
> >
> > [1] https://issues.apache.org/jira/browse/FLINK-33998
> >
> > On Mon, Feb 5, 2024 at 6:18 AM Lavkesh Lahngir 
> wrote:
> >
> > > Hii, Few more details:
> > > We are running GKE version 1.27.7-gke.1121002.
> > > and using flink version 1.14.3.
> > >
> > > Thanks!
> > >
> > > On Mon, 5 Feb 2024 at 12:05, Lavkesh Lahngir 
> wrote:
> > >
> > > > Hii All,
> > > >
> > > > We run a Flink operator on GKE, deploying one Flink job per job
> > manager.
> > > > We utilize
> > > >
> > org.apache.flink.kubernetes.highavailability.KubernetesHaServicesFactory
> > > > for high availability. The JobManager employs config maps for
> > > checkpointing
> > > > and leader election. If, at any point, the Kube API server returns an
> > > error
> > > > (5xx or 4xx), the JM pod is restarted. This occurrence is sporadic,
> > > > happening every 1-2 days for some jobs among the 400 running in the
> > same
> > > > cluster, each with its JobManager pod.
> > > >
> > > > What might be causing these errors from the Kube? One possibility is
> > that
> > > > when the JM writes the config map and attempts to retrieve it
> > immediately
> > > > after, it could result in a 404 error.
> > > > Are there any configurations to increase heartbeat or timeouts that
> > might
> > > > be causing temporary disconnections from the Kube API server?
> > > >
> > > > Thank you!
> > > >
> > >
> >
>
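
Regarding the heartbeat/timeout question above: Flink's Kubernetes HA services expose leader-election timing and retry options that can make the JM more tolerant of transient API-server errors. A sketch of the relevant flink-conf.yaml entries (option names as documented for the Kubernetes HA services; the values here are illustrative only, not recommendations):

```yaml
# Illustrative values only; see the Flink Kubernetes HA configuration docs.
high-availability.kubernetes.leader-election.lease-duration: 30 s
high-availability.kubernetes.leader-election.renew-deadline: 30 s
high-availability.kubernetes.leader-election.retry-period: 10 s
# Retries for ConfigMap read-modify-write operations against the API server.
kubernetes.transactional-operation.max-retries: 10
```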


Re: Frequent Flink JM restarts due to Kube API server errors.

2024-02-05 Thread Matthias Pohl
Hi Lavkesh,
FLINK-33998 [1] sounds quite similar to what you describe.

The solution was to upgrade to Flink version 1.14.6. I didn't have the
capacity to look into the details considering that the mentioned Flink
version 1.14 is not officially supported by the community anymore and a fix
seems to have been provided with a newer version.

Matthias

[1] https://issues.apache.org/jira/browse/FLINK-33998

On Mon, Feb 5, 2024 at 6:18 AM Lavkesh Lahngir  wrote:

> Hii, Few more details:
> We are running GKE version 1.27.7-gke.1121002.
> and using flink version 1.14.3.
>
> Thanks!
>
> On Mon, 5 Feb 2024 at 12:05, Lavkesh Lahngir  wrote:
>
> > Hii All,
> >
> > We run a Flink operator on GKE, deploying one Flink job per job manager.
> > We utilize
> > org.apache.flink.kubernetes.highavailability.KubernetesHaServicesFactory
> > for high availability. The JobManager employs config maps for
> checkpointing
> > and leader election. If, at any point, the Kube API server returns an
> error
> > (5xx or 4xx), the JM pod is restarted. This occurrence is sporadic,
> > happening every 1-2 days for some jobs among the 400 running in the same
> > cluster, each with its JobManager pod.
> >
> > What might be causing these errors from the Kube? One possibility is that
> > when the JM writes the config map and attempts to retrieve it immediately
> > after, it could result in a 404 error.
> > Are there any configurations to increase heartbeat or timeouts that might
> > be causing temporary disconnections from the Kube API server?
> >
> > Thank you!
> >
>


[jira] [Created] (FLINK-34361) PyFlink end-to-end test fails in GHA

2024-02-05 Thread Matthias Pohl (Jira)
Matthias Pohl created FLINK-34361:
-

 Summary: PyFlink end-to-end test fails in GHA
 Key: FLINK-34361
 URL: https://issues.apache.org/jira/browse/FLINK-34361
 Project: Flink
  Issue Type: Bug
  Components: API / Python
Affects Versions: 1.19.0
Reporter: Matthias Pohl


"PyFlink end-to-end test" fails:
https://github.com/apache/flink/actions/runs/7778642859/job/21208811659#step:14:7420

The only error I could identify is:
{code}
ERROR: pip's dependency resolver does not currently take into account all the 
packages that are installed. This behaviour is the source of the following 
dependency conflicts.
conda 23.5.2 requires ruamel-yaml<0.18,>=0.11.14, but you have ruamel-yaml 
0.18.5 which is incompatible.
Feb 05 03:31:54 Successfully installed apache-beam-2.48.0 avro-python3-1.10.2 
cloudpickle-2.2.1 crcmod-1.7 cython-3.0.8 dill-0.3.1.1 dnspython-2.5.0 
docopt-0.6.2 exceptiongroup-1.2.0 fastavro-1.9.3 fasteners-0.19 
find-libpython-0.3.1 grpcio-1.50.0 grpcio-tools-1.50.0 hdfs-2.7.3 
httplib2-0.22.0 iniconfig-2.0.0 numpy-1.24.4 objsize-0.6.1 orjson-3.9.13 
pandas-2.2.0 pemja-0.4.1 proto-plus-1.23.0 protobuf-4.23.4 py4j-0.10.9.7 
pyarrow-11.0.0 pydot-1.4.2 pymongo-4.6.1 pyparsing-3.1.1 pytest-7.4.4 
python-dateutil-2.8.2 pytz-2024.1 regex-2023.12.25 ruamel.yaml-0.18.5 
ruamel.yaml.clib-0.2.8 tomli-2.0.1 typing-extensions-4.9.0 tzdata-2023.4
/home/runner/work/flink/flink/flink-python/dev/.conda/lib/python3.10/site-packages/Cython/Compiler/Main.py:381:
 FutureWarning: Cython directive 'language_level' not set, using '3str' for now 
(Py3). This has changed from earlier releases! File: 
/home/runner/work/flink/flink/flink-python/pyflink/fn_execution/table/window_aggregate_fast.pxd
  tree = Parsing.p_module(s, pxd, full_module_name)
{code}
Not sure whether that's the actual cause.
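
For reference, the conda constraint in the log above can be checked mechanically. This is a small illustrative sketch using plain tuple comparison (not pip's real resolver), mirroring the reported conflict between conda's `ruamel-yaml<0.18,>=0.11.14` requirement and the installed 0.18.5:

```python
# Illustrative sketch only: check conda's "ruamel-yaml<0.18,>=0.11.14"
# constraint against the installed 0.18.5 via plain tuple comparison.
# pip's real resolver is more involved; this just mirrors the log message.

def parse(version: str) -> tuple:
    """Turn '0.18.5' into (0, 18, 5) for lexicographic comparison."""
    return tuple(int(part) for part in version.split("."))

installed = "0.18.5"
lower, upper = "0.11.14", "0.18"

compatible = parse(lower) <= parse(installed) < parse(upper)
print(compatible)  # False: 0.18.5 violates the "<0.18" upper bound
```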




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-34360) GHA e2e test failure due to no space left on device error

2024-02-05 Thread Matthias Pohl (Jira)
Matthias Pohl created FLINK-34360:
-

 Summary: GHA e2e test failure due to no space left on device error
 Key: FLINK-34360
 URL: https://issues.apache.org/jira/browse/FLINK-34360
 Project: Flink
  Issue Type: Bug
  Components: Tests
Reporter: Matthias Pohl


https://github.com/apache/flink/actions/runs/7763815214

{code}
AdaptiveScheduler / E2E (group 2)
Process completed with exit code 1.
AdaptiveScheduler / E2E (group 2)
You are running out of disk space. The runner will stop working when the 
machine runs out of disk space. Free space left: 35 MB
{code}
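
One cheap mitigation (a suggestion of mine, not something the Flink CI setup necessarily does) is a pre-flight step that fails fast when the runner is already low on disk, instead of dying mid-e2e-run with 35 MB left:

```python
# Hypothetical CI pre-flight check: abort early when free disk space is
# below a threshold. MIN_FREE_MB is an assumed value; tune per workflow.
import shutil
import sys

MIN_FREE_MB = 2048


def free_mb(path: str = "/") -> int:
    """Free space on the filesystem containing `path`, in megabytes."""
    return shutil.disk_usage(path).free // (1024 * 1024)


if __name__ == "__main__":
    available = free_mb()
    if available < MIN_FREE_MB:
        sys.exit(f"Only {available} MB left on disk; aborting before e2e tests")
    print(f"disk check passed: {available} MB available")
```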



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-34359) "Kerberized YARN per-job on Docker test (default input)" failed due to IllegalStateException

2024-02-04 Thread Matthias Pohl (Jira)
Matthias Pohl created FLINK-34359:
-

 Summary: "Kerberized YARN per-job on Docker test (default input)" 
failed due to IllegalStateException
 Key: FLINK-34359
 URL: https://issues.apache.org/jira/browse/FLINK-34359
 Project: Flink
  Issue Type: Bug
  Components: Deployment / YARN
Affects Versions: 1.18.1
Reporter: Matthias Pohl


This looks similar to FLINK-34357 because it's also due to some YARN issue. But 
the e2e test "Kerberized YARN per-job on Docker test (default input)" is 
causing the failure:
{code}
[...]
Exception in thread "Thread-4" java.lang.IllegalStateException: Trying to 
access closed classloader. Please check if you store classloaders directly or 
indirectly in static fields. If the stacktrace suggests that the leak occurs in 
a third party library and cannot be fixed immediately, you can disable this 
check with the configuration 'classloader.check-leaked-classloader'.
at 
org.apache.flink.util.FlinkUserCodeClassLoaders$SafetyNetWrapperClassLoader.ensureInner(FlinkUserCodeClassLoaders.java:184)
at 
org.apache.flink.util.FlinkUserCodeClassLoaders$SafetyNetWrapperClassLoader.getResource(FlinkUserCodeClassLoaders.java:208)
at 
org.apache.hadoop.conf.Configuration.getResource(Configuration.java:2570)
at 
org.apache.hadoop.conf.Configuration.loadResource(Configuration.java:2801)
at 
org.apache.hadoop.conf.Configuration.loadResources(Configuration.java:2776)
at 
org.apache.hadoop.conf.Configuration.loadProps(Configuration.java:2654)
at 
org.apache.hadoop.conf.Configuration.getProps(Configuration.java:2636)
at org.apache.hadoop.conf.Configuration.get(Configuration.java:1100)
at 
org.apache.hadoop.conf.Configuration.getTimeDuration(Configuration.java:1707)
at 
org.apache.hadoop.conf.Configuration.getTimeDuration(Configuration.java:1688)
at 
org.apache.hadoop.util.ShutdownHookManager.getShutdownTimeout(ShutdownHookManager.java:183)
at 
org.apache.hadoop.util.ShutdownHookManager.shutdownExecutor(ShutdownHookManager.java:145)
at 
org.apache.hadoop.util.ShutdownHookManager.access$300(ShutdownHookManager.java:65)
at 
org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:102)
{code}

https://github.com/apache/flink/actions/runs/7770984519/job/21191905887#step:14:11720



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-34357) IllegalAnnotationsException causes "PyFlink YARN per-job on Docker test" e2e test to fail

2024-02-04 Thread Matthias Pohl (Jira)
Matthias Pohl created FLINK-34357:
-

 Summary: IllegalAnnotationsException causes "PyFlink YARN per-job 
on Docker test" e2e test to fail
 Key: FLINK-34357
 URL: https://issues.apache.org/jira/browse/FLINK-34357
 Project: Flink
  Issue Type: Bug
  Components: Deployment / YARN
Affects Versions: 1.18.1
Reporter: Matthias Pohl


https://github.com/apache/flink/actions/runs/7763815214/job/21176570116#step:14:10009

{code}
Feb 03 03:29:04 SEVERE: Failed to generate the schema for the JAX-B elements
Feb 03 03:29:04 javax.xml.bind.JAXBException
Feb 03 03:29:04  - with linked exception:
Feb 03 03:29:04 [java.lang.reflect.InvocationTargetException]
Feb 03 03:29:04 at 
javax.xml.bind.ContextFinder.newInstance(ContextFinder.java:262)
Feb 03 03:29:04 at 
javax.xml.bind.ContextFinder.newInstance(ContextFinder.java:234)
[...]
Feb 03 03:29:04 at 
org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)
Feb 03 03:29:04 Caused by: java.lang.reflect.InvocationTargetException
Feb 03 03:29:04 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native 
Method)
Feb 03 03:29:04 at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
Feb 03 03:29:04 at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
Feb 03 03:29:04 at java.lang.reflect.Method.invoke(Method.java:498)
Feb 03 03:29:04 at 
org.apache.hadoop.yarn.server.applicationhistoryservice.webapp.ContextFactory.createContext(ContextFactory.java:44)
Feb 03 03:29:04 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native 
Method)
Feb 03 03:29:04 at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
Feb 03 03:29:04 at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
Feb 03 03:29:04 at java.lang.reflect.Method.invoke(Method.java:498)
Feb 03 03:29:04 at 
javax.xml.bind.ContextFinder.newInstance(ContextFinder.java:247)
Feb 03 03:29:04 ... 57 more
Feb 03 03:29:04 Caused by: 
com.sun.xml.internal.bind.v2.runtime.IllegalAnnotationsException: 1 counts of 
IllegalAnnotationExceptions
Feb 03 03:29:04 java.util.Set is an interface, and JAXB can't handle interfaces.
Feb 03 03:29:04 this problem is related to the following location:
Feb 03 03:29:04 at java.util.Set
Feb 03 03:29:04 at public java.util.HashMap 
org.apache.hadoop.yarn.api.records.timeline.TimelineEntity.getPrimaryFiltersJAXB()
Feb 03 03:29:04 at 
org.apache.hadoop.yarn.api.records.timeline.TimelineEntity
Feb 03 03:29:04 at public java.util.List 
org.apache.hadoop.yarn.api.records.timeline.TimelineEntities.getEntities()
Feb 03 03:29:04 at 
org.apache.hadoop.yarn.api.records.timeline.TimelineEntities
Feb 03 03:29:04 
Feb 03 03:29:04 at 
com.sun.xml.internal.bind.v2.runtime.IllegalAnnotationsException$Builder.check(IllegalAnnotationsException.java:91)
Feb 03 03:29:04 at 
com.sun.xml.internal.bind.v2.runtime.JAXBContextImpl.getTypeInfoSet(JAXBContextImpl.java:445)
Feb 03 03:29:04 at 
com.sun.xml.internal.bind.v2.runtime.JAXBContextImpl.(JAXBContextImpl.java:277)
Feb 03 03:29:04 at 
com.sun.xml.internal.bind.v2.runtime.JAXBContextImpl.(JAXBContextImpl.java:124)
Feb 03 03:29:04 at 
com.sun.xml.internal.bind.v2.runtime.JAXBContextImpl$JAXBContextBuilder.build(JAXBContextImpl.java:1123)
Feb 03 03:29:04 at 
com.sun.xml.internal.bind.v2.ContextFactory.createContext(ContextFactory.java:147)
Feb 03 03:29:04 ... 67 more
{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-34343) ResourceManager registration is not completed when registering the JobMaster

2024-02-02 Thread Matthias Pohl (Jira)
Matthias Pohl created FLINK-34343:
-

 Summary: ResourceManager registration is not completed when 
registering the JobMaster
 Key: FLINK-34343
 URL: https://issues.apache.org/jira/browse/FLINK-34343
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Coordination, Runtime / RPC
Affects Versions: 1.19.0
Reporter: Matthias Pohl


https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=57203&view=logs&j=64debf87-ecdb-5aef-788d-8720d341b5cb&t=2302fb98-0839-5df2-3354-bbae636f81a7&l=8066

The test run failed due to a NullPointerException:
{code}
Feb 02 01:11:55 2024-02-02 01:11:47,791 INFO  
org.apache.flink.runtime.rpc.pekko.FencedPekkoRpcActor   [] - The rpc 
endpoint org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager 
has not been started yet. Discarding message 
LocalFencedMessage(000
0, 
LocalRpcInvocation(ResourceManagerGateway.registerJobMaster(JobMasterId, 
ResourceID, String, JobID, Time))) until processing is started.
Feb 02 01:11:55 2024-02-02 01:11:47,797 WARN  
org.apache.flink.runtime.rpc.pekko.SupervisorActor   [] - RpcActor 
pekko://flink/user/rpc/resourcemanager_2 has failed. Shutting it down now.
Feb 02 01:11:55 java.lang.NullPointerException: Cannot invoke 
"org.apache.flink.runtime.rpc.RpcServer.getAddress()" because "this.rpcServer" 
is null
Feb 02 01:11:55 at 
org.apache.flink.runtime.rpc.RpcEndpoint.getAddress(RpcEndpoint.java:322) 
~[flink-dist-1.19-SNAPSHOT.jar:1.19-SNAPSHOT]
Feb 02 01:11:55 at 
org.apache.flink.runtime.rpc.pekko.PekkoRpcActor.handleMessage(PekkoRpcActor.java:182)
 ~[flink-rpc-akka06a9bb81-2e68-483a-b236-a283d0b1d097.jar:1.19-SNAPSHOT]
Feb 02 01:11:55 at 
org.apache.pekko.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:33) 
~[flink-rpc-akka06a9bb81-2e68-483a-b236-a283d0b1d097.jar:1.19-SNAPSHOT]
Feb 02 01:11:55 at 
org.apache.pekko.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:29) 
~[flink-rpc-akka06a9bb81-2e68-483a-b236-a283d0b1d097.jar:1.19-SNAPSHOT]
Feb 02 01:11:55 at 
scala.PartialFunction.applyOrElse(PartialFunction.scala:127) 
~[flink-rpc-akka06a9bb81-2e68-483a-b236-a283d0b1d097.jar:1.19-SNAPSHOT]
Feb 02 01:11:55 at 
scala.PartialFunction.applyOrElse$(PartialFunction.scala:126) 
~[flink-rpc-akka06a9bb81-2e68-483a-b236-a283d0b1d097.jar:1.19-SNAPSHOT]
Feb 02 01:11:55 at 
org.apache.pekko.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:29) 
~[flink-rpc-akka06a9bb81-2e68-483a-b236-a283d0b1d097.jar:1.19-SNAPSHOT]
Feb 02 01:11:55 at 
scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:175) 
~[flink-rpc-akka06a9bb81-2e68-483a-b236-a283d0b1d097.jar:1.19-SNAPSHOT]
Feb 02 01:11:55 at 
scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:176) 
~[flink-rpc-akka06a9bb81-2e68-483a-b236-a283d0b1d097.jar:1.19-SNAPSHOT]
Feb 02 01:11:55 at 
scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:176) 
~[flink-rpc-akka06a9bb81-2e68-483a-b236-a283d0b1d097.jar:1.19-SNAPSHOT]
Feb 02 01:11:55 at 
org.apache.pekko.actor.Actor.aroundReceive(Actor.scala:547) 
~[flink-rpc-akka06a9bb81-2e68-483a-b236-a283d0b1d097.jar:1.19-SNAPSHOT]
Feb 02 01:11:55 at 
org.apache.pekko.actor.Actor.aroundReceive$(Actor.scala:545) 
~[flink-rpc-akka06a9bb81-2e68-483a-b236-a283d0b1d097.jar:1.19-SNAPSHOT]
Feb 02 01:11:55 at 
org.apache.pekko.actor.AbstractActor.aroundReceive(AbstractActor.scala:229) 
~[flink-rpc-akka06a9bb81-2e68-483a-b236-a283d0b1d097.jar:1.19-SNAPSHOT]
Feb 02 01:11:55 at 
org.apache.pekko.actor.ActorCell.receiveMessage(ActorCell.scala:590) 
~[flink-rpc-akka06a9bb81-2e68-483a-b236-a283d0b1d097.jar:1.19-SNAPSHOT]
Feb 02 01:11:55 at 
org.apache.pekko.actor.ActorCell.invoke(ActorCell.scala:557) 
~[flink-rpc-akka06a9bb81-2e68-483a-b236-a283d0b1d097.jar:1.19-SNAPSHOT]
Feb 02 01:11:55 at 
org.apache.pekko.dispatch.Mailbox.processMailbox(Mailbox.scala:280) 
~[flink-rpc-akka06a9bb81-2e68-483a-b236-a283d0b1d097.jar:1.19-SNAPSHOT]
Feb 02 01:11:55 at 
org.apache.pekko.dispatch.Mailbox.run(Mailbox.scala:241) 
~[flink-rpc-akka06a9bb81-2e68-483a-b236-a283d0b1d097.jar:1.19-SNAPSHOT]
Feb 02 01:11:55 at 
org.apache.pekko.dispatch.Mailbox.exec(Mailbox.scala:253) 
~[flink-rpc-akka06a9bb81-2e68-483a-b236-a283d0b1d097.jar:1.19-SNAPSHOT]
Feb 02 01:11:55 at java.util.concurrent.ForkJoinTask.doExec(Unknown 
Source) ~[?:?]
Feb 02 01:11:55 at 
java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(Unknown Source) ~[?:?]
Feb 02 01:11:55 at java.util.concurrent.ForkJoinPool.scan(Unknown 
Source) ~[?:?]
Feb 02 01:11:55 at java.util.concurrent.ForkJoinPool.runWorker(Unknown 
Source) ~[?:?]
Feb 02 01:11:55 at 
java.util.concurrent.ForkJoinWorkerThread.run(Unknown Source) ~[?:?]
{code}

[jira] [Created] (FLINK-34333) Fix FLINK-34007 LeaderElector bug in 1.18

2024-02-01 Thread Matthias Pohl (Jira)
Matthias Pohl created FLINK-34333:
-

 Summary: Fix FLINK-34007 LeaderElector bug in 1.18
 Key: FLINK-34333
 URL: https://issues.apache.org/jira/browse/FLINK-34333
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Coordination
Affects Versions: 1.18.1
Reporter: Matthias Pohl


FLINK-34007 revealed a bug in the k8s client v6.6.2, which we have been using 
since Flink 1.18. The issue was fixed for Flink 1.19 in FLINK-34007, which 
required an update of the k8s client to v6.9.0.

This Jira issue is about finding a solution in Flink 1.18 for the very same 
problem FLINK-34007 covered. It's a dedicated Jira issue because we want to 
unblock the release of 1.19 by resolving FLINK-34007.





--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-34332) Investigate the permissions

2024-02-01 Thread Matthias Pohl (Jira)
Matthias Pohl created FLINK-34332:
-

 Summary: Investigate the permissions
 Key: FLINK-34332
 URL: https://issues.apache.org/jira/browse/FLINK-34332
 Project: Flink
  Issue Type: Sub-task
  Components: Build System / CI
Affects Versions: 1.18.1, 1.19.0
Reporter: Matthias Pohl


We're currently using {{read-all}} for our workflows. We might want to limit 
the scope and document why certain reads are needed (see [GHA 
docs|https://docs.github.com/en/actions/using-jobs/assigning-permissions-to-jobs]).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-34331) Enable Apache INFRA runners for nightly builds

2024-02-01 Thread Matthias Pohl (Jira)
Matthias Pohl created FLINK-34331:
-

 Summary: Enable Apache INFRA runners for nightly builds
 Key: FLINK-34331
 URL: https://issues.apache.org/jira/browse/FLINK-34331
 Project: Flink
  Issue Type: Sub-task
  Components: Build System / CI
Affects Versions: 1.18.1, 1.19.0
Reporter: Matthias Pohl


The nightly CI is currently still utilizing the GitHub runners. We want to 
switch to Apache INFRA runners or ephemeral runners.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-34330) Specify code owners for .github/workflows folder

2024-02-01 Thread Matthias Pohl (Jira)
Matthias Pohl created FLINK-34330:
-

 Summary: Specify code owners for .github/workflows folder
 Key: FLINK-34330
 URL: https://issues.apache.org/jira/browse/FLINK-34330
 Project: Flink
  Issue Type: Bug
Affects Versions: 1.18.1, 1.19.0
Reporter: Matthias Pohl


Currently, the workflow files can be modified by any committer. We have to 
discuss whether we want to limit access to the PMC (or a subset of it) here. 
That might be a means to protect self-hosted runners.

See the [codeowner 
documentation|https://docs.github.com/en/repositories/managing-your-repositorys-settings-and-features/customizing-your-repository/about-code-owners]
 for further details.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-34326) Add Slack integration

2024-01-31 Thread Matthias Pohl (Jira)
Matthias Pohl created FLINK-34326:
-

 Summary: Add Slack integration
 Key: FLINK-34326
 URL: https://issues.apache.org/jira/browse/FLINK-34326
 Project: Flink
  Issue Type: Sub-task
Reporter: Matthias Pohl






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-34324) s3_setup is called in test_file_sink.sh even if the common_s3.sh is not sourced

2024-01-31 Thread Matthias Pohl (Jira)
Matthias Pohl created FLINK-34324:
-

 Summary: s3_setup is called in test_file_sink.sh even if the 
common_s3.sh is not sourced
 Key: FLINK-34324
 URL: https://issues.apache.org/jira/browse/FLINK-34324
 Project: Flink
  Issue Type: Bug
  Components: Connectors / Hadoop Compatibility, Tests
Affects Versions: 1.18.1, 1.17.2, 1.19.0
Reporter: Matthias Pohl






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[ANNOUNCE] FLIP-396: BETA GitHub Actions workflows in master and release-1.18

2024-01-31 Thread Matthias Pohl
Hi everyone,
I merged the changes related to FLIP-396 [1] into master and release-1.18.
This enables the Flink CI and nightly builds [2] for these two versions
(and any branches that are based on these changes). The GitHub Actions
(GHA) workflows are still in beta stage, i.e. we should still base our PR
merge decisions on Azure Pipelines and the Flink CI bot.

I want to underline that this was a group effort: Appreciations for
supporting this work should go to Chesnay and Nico Weidner. Feel free to
join and improve the current state of the workflows.

The GHA workflows are not triggered by PRs, yet (i.e. they are not
integrated into the PR UI). Only push events and manual triggering (i.e.
workflow_dispatch events) are enabled for now. This allows us to monitor
the load on the Apache INFRA runners for the next few days/weeks. We might
be required to do minor adjustments (e.g. timeouts).

Contributors who have the most-recent changes of master or release-1.18
included in their branch can also benefit from the new workflows: For
forks, you would rely on the runners provided by GitHub itself. The current
limit for runners in the GitHub free plan is 20 [4].

Nightly builds can be easily triggered through the GHA UI for your branch.
The general observation (at least for the GitHub-provided runners) is that
the CI runs are more sensitive to test instabilities. But that can be
considered a good thing, I guess. :)

Matthias

[1]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-396%3A+Trial+to+test+GitHub+Actions+as+an+alternative+for+Flink's+current+Azure+CI+infrastructure
[2] https://github.com/apache/flink/actions
[3] https://issues.apache.org/jira/browse/FLINK-33924
[4]
https://docs.github.com/en/actions/learn-github-actions/usage-limits-billing-and-administration#usage-limits

-- 

*Matthias Pohl*
Opensource Software Engineer, *Aiven*
matthias.p...@aiven.io  |  +49 170 9869525
aiven.io <https://www.aiven.io>
*Aiven Deutschland GmbH*
Alexanderufer 3-7, 10117 Berlin
Geschäftsführer: Oskari Saarenmaa & Hannu Valtonen
Amtsgericht Charlottenburg, HRB 209739 B

[jira] [Created] (FLINK-34322) Make secrets work in GitHub Action workflows

2024-01-31 Thread Matthias Pohl (Jira)
Matthias Pohl created FLINK-34322:
-

 Summary: Make secrets work in GitHub Action workflows
 Key: FLINK-34322
 URL: https://issues.apache.org/jira/browse/FLINK-34322
 Project: Flink
  Issue Type: Sub-task
  Components: Build System / CI
Affects Versions: 1.18.1, 1.19.0
Reporter: Matthias Pohl


The secrets need to be handed over to Apache Infra to make them accessible in 
the nightly runs. We might have to adapt the workflows as well because this 
wasn't tested in the previous stages of FLIP-396.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-34321) Make nightly trigger select the release branch automatically

2024-01-31 Thread Matthias Pohl (Jira)
Matthias Pohl created FLINK-34321:
-

 Summary: Make nightly trigger select the release branch 
automatically
 Key: FLINK-34321
 URL: https://issues.apache.org/jira/browse/FLINK-34321
 Project: Flink
  Issue Type: Sub-task
  Components: Build System / CI
Reporter: Matthias Pohl


Currently, GHA CI only works with master (i.e. 1.19) and {{release-1.18}}. 
After the release of 1.19, we could switch to automatically selecting the 
release branches analogously to what is done in the 
[flink-ci/git-repo-sync|https://github.com/flink-ci/git-repo-sync/blob/master/sync_repo.sh#L28]
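
A hypothetical sketch of the branch-selection idea (the names and logic below are assumptions for illustration; the actual flink-ci sync_repo.sh may differ): given a list of branch names, pick the most recent release-X.Y branches so the nightly trigger can target them automatically.

```python
# Hypothetical sketch: pick the two most recent release-X.Y branches from a
# `git branch -r` style listing, so a nightly trigger can target them
# automatically. The real flink-ci sync_repo.sh may do this differently.

def latest_release_branches(branches, count=2):
    """Return the `count` highest release-X.Y branches, oldest first."""
    releases = []
    for name in branches:
        if name.startswith("release-"):
            major, minor = name[len("release-"):].split(".")
            releases.append(((int(major), int(minor)), name))
    return [name for _, name in sorted(releases)[-count:]]


branches = ["master", "release-1.17", "release-1.18", "release-1.16"]
print(latest_release_branches(branches))  # ['release-1.17', 'release-1.18']
```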



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [ANNOUNCE] Flink 1.19 feature freeze & sync summary on 01/30/2024

2024-01-30 Thread Matthias Pohl
Thanks for the update, Lincoln.

fyi: I merged FLINK-32684 (deprecating AkkaOptions) [1] since we agreed in
today's meeting that this change is still ok to go in.

The beta version of the GitHub Actions workflows (FLIP-396 [2]) are also
finalized (see related PRs for basic CI [3], nightly master [4] and nightly
scheduling [5]). I'd like to merge the changes before creating the
release-1.19 branch. That would enable us to see whether we miss anything
in the GHA workflows setup when creating a new release branch.

The changes are limited to a few CI scripts that are also used for Azure
Pipelines (see [3]). The majority of the changes are GHA-specific and
shouldn't affect the Azure Pipelines CI setup.

Therefore, I'm requesting approval from the 1.19 release managers to go
ahead with merging the mentioned PRs [3, 4, 5].

Matthias


[1] https://issues.apache.org/jira/browse/FLINK-32684
[2]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-396%3A+Trial+to+test+GitHub+Actions+as+an+alternative+for+Flink%27s+current+Azure+CI+infrastructure
[3] https://github.com/apache/flink/pull/23970
[4] https://github.com/apache/flink/pull/23971
[5] https://github.com/apache/flink/pull/23972

On Tue, Jan 30, 2024 at 1:51 PM Lincoln Lee  wrote:

> Hi everyone,
>
> (Since feature freeze and release sync are on the same day, we merged the
> announcement and sync summary together)
>
>
> *- Feature freeze*
> The feature freeze of 1.19 has started now. That means that no new features
> or improvements should be merged into the master branch unless you have
> asked the release managers first, which has already been done for PRs that
> are currently pending on CI to pass. Bug fixes and documentation PRs can
> still be merged.
>
>
> *- Cutting release branch*
> Currently we have three blocker issues[1][2][3], and will try to close
> them this Friday.
> We are planning to cut the release branch next Monday (Feb 6th) if no new
> test instabilities appear,
> and we'll make another announcement in the dev mailing list then.
>
>
> *- Cross-team testing*
> Release testing is expected to start next week as soon as we cut the
> release branch.
> As a prerequisite, before we start testing, please make sure:
> 1. whether the feature needs cross-team testing;
> 2. if yes, that its documentation is completed.
> There's an umbrella ticket[4] for tracking the 1.19 testing, RM will
> create all tickets for completed features listed on the 1.19 wiki page[5]
> and assign to the feature's Responsible Contributor,
> also contributors are encouraged to create tickets following the steps in
> the umbrella ticket if there are other ones that need to be cross-team
> tested.
>
> *- Release notes*
>
> All new features and behavior changes require authors to fill out the
> 'Release Note' field in the Jira ticket (click the Edit button and scroll
> to the middle of the page),
> especially since 1.19 involves a lot of deprecation, which is important
> for users and will be part of the release announcement.
>
> - *Sync meeting* (https://meet.google.com/vcx-arzs-trv)
>
> We've already switched to weekly release sync, so the next release sync
> will be on Feb 6th, 2024. Feel free to join us!
>
> [1] https://issues.apache.org/jira/browse/FLINK-34148
> [2] https://issues.apache.org/jira/browse/FLINK-34007
> [3] https://issues.apache.org/jira/browse/FLINK-34259
> [4] https://issues.apache.org/jira/browse/FLINK-34285
> [5] https://cwiki.apache.org/confluence/display/FLINK/1.19+Release
>
> Best,
> Yun, Jing, Martijn and Lincoln
>

