[jira] [Created] (FLINK-34272) AdaptiveSchedulerClusterITCase failure due to MiniCluster not running
Matthias Pohl created FLINK-34272:

Summary: AdaptiveSchedulerClusterITCase failure due to MiniCluster not running
Key: FLINK-34272
URL: https://issues.apache.org/jira/browse/FLINK-34272
Project: Flink
Issue Type: Bug
Components: Runtime / Coordination
Affects Versions: 1.19.0
Reporter: Matthias Pohl

[https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=57073=logs=0da23115-68bb-5dcd-192c-bd4c8adebde1=24c3384f-1bcb-57b3-224f-51bf973bbee8=9543]

{code:java}
Jan 29 17:21:29 17:21:29.465 [ERROR] Tests run: 3, Failures: 0, Errors: 2, Skipped: 0, Time elapsed: 12.48 s <<< FAILURE! -- in org.apache.flink.runtime.scheduler.adaptive.AdaptiveSchedulerClusterITCase
Jan 29 17:21:29 17:21:29.465 [ERROR] org.apache.flink.runtime.scheduler.adaptive.AdaptiveSchedulerClusterITCase.testAutomaticScaleUp -- Time elapsed: 8.599 s <<< ERROR!
Jan 29 17:21:29 java.lang.IllegalStateException: MiniCluster is not yet running or has already been shut down.
Jan 29 17:21:29     at org.apache.flink.util.Preconditions.checkState(Preconditions.java:193)
Jan 29 17:21:29     at org.apache.flink.runtime.minicluster.MiniCluster.getDispatcherGatewayFuture(MiniCluster.java:1118)
Jan 29 17:21:29     at org.apache.flink.runtime.minicluster.MiniCluster.runDispatcherCommand(MiniCluster.java:991)
Jan 29 17:21:29     at org.apache.flink.runtime.minicluster.MiniCluster.getArchivedExecutionGraph(MiniCluster.java:840)
Jan 29 17:21:29     at org.apache.flink.runtime.scheduler.adaptive.AdaptiveSchedulerClusterITCase.lambda$waitUntilParallelismForVertexReached$3(AdaptiveSchedulerClusterITCase.java:270)
Jan 29 17:21:29     at org.apache.flink.runtime.testutils.CommonTestUtils.waitUntilCondition(CommonTestUtils.java:151)
Jan 29 17:21:29     at org.apache.flink.runtime.testutils.CommonTestUtils.waitUntilCondition(CommonTestUtils.java:145)
Jan 29 17:21:29     at org.apache.flink.runtime.scheduler.adaptive.AdaptiveSchedulerClusterITCase.waitUntilParallelismForVertexReached(AdaptiveSchedulerClusterITCase.java:265)
Jan 29 17:21:29     at org.apache.flink.runtime.scheduler.adaptive.AdaptiveSchedulerClusterITCase.testAutomaticScaleUp(AdaptiveSchedulerClusterITCase.java:146)
Jan 29 17:21:29     at java.lang.reflect.Method.invoke(Method.java:498)
Jan 29 17:21:29     at org.apache.flink.util.TestNameProvider$1.evaluate(TestNameProvider.java:45)
Jan 29 17:21:29
Jan 29 17:21:29 17:21:29.466 [ERROR] org.apache.flink.runtime.scheduler.adaptive.AdaptiveSchedulerClusterITCase.testCheckpointStatsPersistedAcrossRescale -- Time elapsed: 2.036 s <<< ERROR!
Jan 29 17:21:29 java.lang.IllegalStateException: MiniCluster is not yet running or has already been shut down.
Jan 29 17:21:29     at org.apache.flink.util.Preconditions.checkState(Preconditions.java:193)
Jan 29 17:21:29     at org.apache.flink.runtime.minicluster.MiniCluster.getDispatcherGatewayFuture(MiniCluster.java:1118)
Jan 29 17:21:29     at org.apache.flink.runtime.minicluster.MiniCluster.runDispatcherCommand(MiniCluster.java:991)
Jan 29 17:21:29     at org.apache.flink.runtime.minicluster.MiniCluster.getExecutionGraph(MiniCluster.java:969)
Jan 29 17:21:29     at org.apache.flink.runtime.scheduler.adaptive.AdaptiveSchedulerClusterITCase.lambda$testCheckpointStatsPersistedAcrossRescale$1(AdaptiveSchedulerClusterITCase.java:183)
Jan 29 17:21:29     at org.apache.flink.runtime.testutils.CommonTestUtils.waitUntilCondition(CommonTestUtils.java:151)
Jan 29 17:21:29     at org.apache.flink.runtime.testutils.CommonTestUtils.waitUntilCondition(CommonTestUtils.java:145)
Jan 29 17:21:29     at org.apache.flink.runtime.scheduler.adaptive.AdaptiveSchedulerClusterITCase.testCheckpointStatsPersistedAcrossRescale(AdaptiveSchedulerClusterITCase.java:180)
Jan 29 17:21:29     at java.lang.reflect.Method.invoke(Method.java:498)
Jan 29 17:21:29     at org.apache.flink.util.TestNameProvider$1.evaluate(TestNameProvider.java:45)
{code}

-- This message was sent by Atlassian Jira (v8.20.10#820010)
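Both failures surface inside `CommonTestUtils.waitUntilCondition`, which keeps polling the MiniCluster even after it has shut down. For readers unfamiliar with that utility, here is a minimal, self-contained sketch of such a poll-until-condition helper (class and method names are illustrative, not Flink's actual implementation):

```java
import java.util.function.Supplier;

class WaitUntil {
    /** Polls the condition until it returns true or the timeout elapses. */
    static void waitUntilCondition(Supplier<Boolean> condition, long timeoutMs)
            throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMs;
        while (!condition.get()) {
            if (System.currentTimeMillis() > deadline) {
                throw new IllegalStateException(
                        "condition not met within " + timeoutMs + " ms");
            }
            Thread.sleep(10); // retry interval between polls
        }
    }

    public static void main(String[] args) throws InterruptedException {
        // The condition flips to true on the third poll.
        int[] polls = {0};
        waitUntilCondition(() -> ++polls[0] >= 3, 5_000);
        System.out.println("condition met after " + polls[0] + " polls");
    }
}
```

In the failing tests, the polled condition is where `MiniCluster.getExecutionGraph`/`getArchivedExecutionGraph` are called; if the cluster shuts down between polls, the `IllegalStateException` propagates out of the loop exactly as in the log above.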
Re: [VOTE] Release flink-connector-kafka v3.1.0, release candidate #1
+1 (non-binding)

* Verified LICENSE and NOTICE files (this RC has a NOTICE file that points to 2023 that has since been updated on the main branch by Hang)
* Verified hashes and signatures
* Verified no binaries
* Verified poms point to 3.1.0
* Reviewed web PR
* Built from source
* Verified git tag

In the same vein as the web PR, do we want to prepare the PR to update the shortcode in the connector docs now [1]? Same for the Chinese version. I wonder if that should be included in the connector release instructions.

[1] https://github.com/apache/flink-connector-kafka/blob/d89a082180232bb79e3c764228c4e7dbb9eb6b8b/docs/content/docs/connectors/datastream/kafka.md#L39

Best,
Mason

On Sun, Jan 28, 2024 at 11:41 PM Hang Ruan wrote:
> +1 (non-binding)
>
> - Validated checksum hash
> - Verified signature
> - Verified that no binaries exist in the source archive
> - Build the source with Maven and jdk11
> - Verified web PR
> - Check that the jar is built by jdk8
>
> Best,
> Hang
>
> Martijn Visser 于2024年1月26日周五 21:05写道:
> > Hi everyone,
> > Please review and vote on the release candidate #1 for the Flink Kafka
> > connector version 3.1.0, as follows:
> > [ ] +1, Approve the release
> > [ ] -1, Do not approve the release (please provide specific comments)
> >
> > This release is compatible with Flink 1.17.* and Flink 1.18.*
> >
> > The complete staging area is available for your review, which includes:
> > * JIRA release notes [1],
> > * the official Apache source release to be deployed to dist.apache.org [2],
> > which are signed with the key with fingerprint
> > A5F3BCE4CBE993573EC5966A65321B8382B219AF [3],
> > * all artifacts to be deployed to the Maven Central Repository [4],
> > * source code tag v3.1.0-rc1 [5],
> > * website pull request listing the new release [6].
> >
> > The vote will be open for at least 72 hours. It is adopted by majority
> > approval, with at least 3 PMC affirmative votes.
> >
> > Thanks,
> > Release Manager
> >
> > [1] https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522=12353135
> > [2] https://dist.apache.org/repos/dist/dev/flink/flink-connector-kafka-3.1.0-rc1
> > [3] https://dist.apache.org/repos/dist/release/flink/KEYS
> > [4] https://repository.apache.org/content/repositories/orgapacheflink-1700
> > [5] https://github.com/apache/flink-connector-kafka/releases/tag/v3.1.0-rc1
> > [6] https://github.com/apache/flink-web/pull/718
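For anyone following along with the "Validated checksum hash / Verified signature" steps above, the checks boil down to two commands. A sketch using a dummy file in place of the real artifact (the real one is downloaded from dist.apache.org, and the signature step additionally needs the KEYS file imported first):

```shell
# Stand-in for the downloaded source artifact (placeholder content only).
echo "demo artifact" > flink-connector-kafka-3.1.0-src.tgz
sha512sum flink-connector-kafka-3.1.0-src.tgz > flink-connector-kafka-3.1.0-src.tgz.sha512

# 1. Checksum validation: prints "<file>: OK" when the hash matches.
sha512sum -c flink-connector-kafka-3.1.0-src.tgz.sha512

# 2. Signature verification (against the real detached .asc file,
#    after importing the release KEYS file):
#    gpg --import KEYS
#    gpg --verify flink-connector-kafka-3.1.0-src.tgz.asc flink-connector-kafka-3.1.0-src.tgz
```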
Re: Re: [DISCUSS] FLIP-331: Support EndOfStreamWindows and isOutputOnEOF operator attribute to optimize task deployment
Thanks Dong and Xuannan, The updated FLIP LGTM. Best regards, Weijie Xuannan Su 于2024年1月30日周二 11:10写道: > Hi all, > > Thanks for the comments and suggestions. If there are no further > comments, I will open the voting thread tomorrow. > > Best regards, > Xuannan > On Fri, Jan 26, 2024 at 3:16 PM Dong Lin wrote: > > > > Thanks Xuannan for the update! > > > > +1 (binding) > > > > On Wed, Jan 10, 2024 at 5:54 PM Xuannan Su > wrote: > > > > > Hi all, > > > > > > After several rounds of offline discussions with Xingtong and Jinhao, > > > we have decided to narrow the scope of the FLIP. It will now focus on > > > introducing OperatorAttributes that indicate whether an operator emits > > > records only after inputs have ended. We will also use the attribute > > > to optimize task scheduling for better resource utilization. Setting > > > the backlog status and optimizing the operator implementation during > > > the backlog will be deferred to future work. > > > > > > In addition to the change above, we also make the following changes to > > > the FLIP to address the problems mentioned by Dong: > > > - Public interfaces are updated to reuse the GlobalWindows. > > > - Instead of making all outputs of the upstream operators of the > > > "isOutputOnlyAfterEndOfStream=true" operator blocking, we only make > > > the output of the operator with "isOutputOnlyAfterEndOfStream=true" > > > blocking. This can prevent the second problem Dong mentioned. In the > > > future, we may introduce an extra OperatorAttributes to indicate if an > > > operator has any side output. > > > > > > I would greatly appreciate any comment or feedback you may have on the > > > updated FLIP. > > > > > > Best regards, > > > Xuannan > > > > > > On Tue, Sep 26, 2023 at 11:24 AM Dong Lin wrote: > > > > > > > > Hi all, > > > > > > > > Thanks for the review! > > > > > > > > Becket and I discussed this FLIP offline and we agreed on several > things > > > > that need to be improved with this FLIP. 
I will summarize our > discussion > > > > with the problems and TODOs. We will update the FLIP and let you know > > > once > > > > the FLIP is ready for review again. > > > > > > > > 1) Investigate whether it is possible to update the existing > > > GlobalWindows > > > > in a backward-compatible way and re-use it for the same purpose > > > > as EndOfStreamWindows, without introducing EndOfStreamWindows as a > new > > > > class. > > > > > > > > Note that GlobalWindows#getDefaultTrigger returns a NeverTrigger > instance > > > > which will not trigger window's computation even on end-of-inputs. We > > > will > > > > need to investigate its existing usage and see if we can re-use it > in a > > > > backward-compatible way. > > > > > > > > 2) Let JM know whether any operator in the upstream of the operator > with > > > > "isOutputOnEOF=true" will emit output via any side channel. The FLIP > > > should > > > > update the execution mode of those operators *only if* all outputs > from > > > > those operators are emitted only at the end of input. > > > > > > > > More specifically, the upstream operator might involve a user-defined > > > > operator that might emit output directly to an external service, > where > > > the > > > > emission operation is not explicitly expressed as an operator's > output > > > edge > > > > and thus not visible to JM. Similarly, it is also possible for the > > > > user-defined operator to register a timer > > > > via InternalTimerService#registerEventTimeTimer and emit output to an > > > > external service inside Triggerable#onEventTime. There is a chance > that > > > > users still need related logic to output data in real-time, even if > the > > > > downstream operators have isOutputOnEOF=true. > > > > > > > > One possible solution to address this problem is to add an extra > > > > OperatorAttribute to specify whether this operator might output > records > > > in > > > > such a way that does not go through operator's output (e.g. side > output). 
> > > > Then the JM can safely enable the runtime optimization currently > > > described > > > > in the FLIP when there is no such operator. > > > > > > > > 3) Create a follow-up FLIP that allows users to specify whether a > source > > > > with Boundedness=bounded should have isProcessingBacklog=true. > > > > > > > > This capability would effectively introduce a 3rd strategy to set > backlog > > > > status (in addition to FLIP-309 and FLIP-328). It might be useful to > note > > > > that, even though the data in bounded sources are backlog data in > most > > > > practical use-cases, it is not necessarily true. For example, users > might > > > > want to start a Flink job to consume real-time data from a Kafka > topic > > > and > > > > specify that the job stops after 24 hours, which means the source is > > > > technically bounded while the data is fresh/real-time. > > > > > > > > This capability is more generic and can cover more use-case than > > > > EndOfStreamWindows. On the other hand,
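The narrowed FLIP scope described earlier — an `OperatorAttributes` flag indicating that an operator emits records only after its inputs have ended — can be made concrete with a toy model (hypothetical class names, no Flink dependency): all state is buffered while input runs, and output is produced solely in `endInput`, which is what would let the scheduler treat the operator's output edge as blocking.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy stand-in for an operator with isOutputOnlyAfterEndOfStream = true. */
class EndOfInputAggregator {
    private long sum = 0;
    private long count = 0;

    /** Called per record; deliberately produces no output. */
    void processRecord(long value) {
        sum += value;
        count++;
    }

    /** All output is produced here, once the input has ended. */
    List<String> endInput() {
        List<String> out = new ArrayList<>();
        out.add("sum=" + sum + ", count=" + count);
        return out;
    }

    public static void main(String[] args) {
        EndOfInputAggregator op = new EndOfInputAggregator();
        for (long v = 1; v <= 4; v++) {
            op.processRecord(v);
        }
        System.out.println(op.endInput()); // [sum=10, count=4]
    }
}
```

The side-output caveat discussed above is precisely what this model cannot capture: if `processRecord` wrote to an external service, the "no output before end of input" assumption would silently break, which is why the extra attribute for side outputs is proposed.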
Re: [DISCUSS] FLIP-410: Config, Context and Processing Timer Service of DataStream API V2
Hi Wencong, > Q1. In the "Configuration" section, it is mentioned that configurations can be set continuously using the withXXX methods. Are these configuration options the same as those provided by DataStream V1, or might there be different options compared to V1? I haven't considered options that don't exist in V1 yet, but we may have some new options as we continue to develop. > Q2. The FLIP describes the interface for handling processing timers (ProcessingTimeManager), but it does not mention how to delete or update an existing timer. The V1 API provides a TimerService that can delete a timer. Does this mean that once a timer is registered, it cannot be changed? I think we do need to introduce a method to delete the timer, but I'm kind of curious why we need to update the timer instead of registering a new one. Anyway, I have updated the FLIP to support deleting the timer. Best regards, Weijie weijie guo 于2024年1月30日周二 14:35写道: > Hi Xuannan, > > > 1. +1 to only use XXXParititionStream if users only need to use the > configurable PartitionStream. If there are use cases for both, > perhaps we could use `ProcessConfigurableNonKeyedPartitionStream` or > `ConfigurableNonKeyedPartitionStream` for simplicity. > > As for why we need both, you can refer to my reply to Yunfeng's first > question. As for the name, I can accept > ProcessConfigurableNonKeyedPartitionStream or keep the status quo. But I > don't want to change it to ConfigurableNonKeyedPartitionStream, the reason > is the same, because the configuration is applied to the Process rather > than the Stream. > > > Should we allow users to set custom configurations through the > `ProcessConfigurable` interface and access these configurations in the > `ProcessFunction` via `RuntimeContext`? I believe it would be useful > for process function developers to be able to define custom > configurations. > > If I understand you correctly, you want to set custom properties for > processing. 
The current configurations are mostly for the runtime engine, > such as determining the underlying operator's parallelism and SSG. But I'm > not aware of the need to pass in a custom value (independent of the > framework itself) and then get it at runtime from RuntimeContext. Could > you give some examples? > > > How can users define custom metrics within the `ProcessFunction`? > Will there be a method like `getMetricGroup` available in the > `RuntimeContext`? > > I think this is a reasonable request. For extensibility, I have added the > getMetricManager instead of getMetricGroup to RuntimeContext, we can use > it to get the MetricGroup. > > > Best regards, > > Weijie > > > weijie guo 于2024年1月30日周二 13:45写道: > >> Thanks Yunfeng, >> >> Let me try to answer your question :) >> >> > 1. Would it be better to have all XXXPartitionStream classes implement >> ProcessConfigurable, instead of defining both XXXPartitionStream and >> ProcessConfigurableAndXXXPartitionStream? I wonder whether users would >> need to operate on a non-configurable PartitionStream. >> >> I thought about this for a while and decided to separate DataStream from >> ProcessConfigurable. At the core of this is that streams and >> configurations are completely orthogonal concepts, and configuration is >> only responsible for the `Process`, not the `Stream`. This is why only >> the `process/connectAndProcess` returns configurable stream, but >> partitioning like `KeyBy` returns a pure DataStream. This may also answer >> your second question in passing. >> >> >> > Apart from the detailed withConfigFoo(foo)/withConfigBar(bar) >> methods, would it be better to also add a general >> withConfig(configKey, configValue) method to the ProcessConfigurable >> interface? Adding a method for each configuration might harm the >> readability and compatibility of configurations. >> >> Sorry, I may not fully understand this question. 
ProcessConfigurable >> simply refers to the configuration of the Process, which can have the name, >> parallelism, etc. of the process. It's not actually the >> Configuration (contains >> a lot of ConfigOptions) that we usually talk about, but more like >> `SingleOutputStreamOperator` in DataStream V1. >> >> Best regards, >> >> Weijie >> >> >> Xuannan Su 于2024年1月29日周一 18:45写道: >> >>> Hi Weijie, >>> >>> Thanks for the FLIP! I have a few questions regarding the FLIP. >>> >>> 1. +1 to only use XXXParititionStream if users only need to use the >>> configurable PartitionStream. If there are use cases for both, >>> perhaps we could use `ProcessConfigurableNonKeyedPartitionStream` or >>> `ConfigurableNonKeyedPartitionStream` for simplicity. >>> >>> 2. Should we allow users to set custom configurations through the >>> `ProcessConfigurable` interface and access these configurations in the >>> `ProcessFunction` via `RuntimeContext`? I believe it would be useful >>> for process function developers to be able to define custom >>> configurations. >>> >>> 3. How can users define custom metrics within the
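Earlier in this thread Weijie mentions updating the FLIP so that timers can be deleted. The idea can be sketched with a toy timer manager; the class and method names below are illustrative, not the actual FLIP-410 `ProcessingTimeManager` interface:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

/** Illustrative processing-time timer service with registration and deletion. */
class ToyProcessingTimeManager {
    // Pending timer timestamps, kept sorted so due timers are cheap to find.
    private final TreeSet<Long> timers = new TreeSet<>();

    void registerProcessingTimer(long timestamp) {
        timers.add(timestamp);
    }

    /** A deleted timer is simply removed before it can fire. */
    boolean deleteProcessingTimer(long timestamp) {
        return timers.remove(timestamp);
    }

    /** Returns (and clears) all timers due at or before {@code now}. */
    List<Long> advanceTo(long now) {
        List<Long> fired = new ArrayList<>(timers.headSet(now, true));
        fired.forEach(timers::remove);
        return fired;
    }
}
```

Under this model, "updating" a timer reduces to delete-plus-register, which matches the point above that registering a new timer is usually sufficient.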
Re: [DISCUSS] FLIP-409: DataStream V2 Building Blocks: DataStream, Partitioning and ProcessFunction
Hi Xintong, Thanks for your reply. > Does this mean if we want to support (KeyedStream, BroadcastStream) -> (KeyedStream), we must make sure that no data can be output upon processing records from the input BroadcastStream? That's probably a reasonable limitation. I think so, this is the restriction that has to be imposed in order to avoid re-partitioning (i.e. shuffle). If one just wants to get a keyed stream and doesn't care about the data distribution, then explicit KeyBy partitioning works as expected. > The problem is would this limitation be too implicit for the users to understand. Since we can't check for this limitation at compile time, if we were to add support for this case, we would have to introduce additional runtime checks to ensure program correctness. For now, I'm inclined not to support it, as it's hard for users to understand this restriction unless we have something better. And we can always add it later if we do realize there's a strong demand for it. > 1. I'd suggest renaming the method with timestamp to something like `collectAndOverwriteTimestamp`. That might help users understand that they don't always need to call this method, unless they explicitly want to overwrite the timestamp. Makes sense. I have updated the FLIP to use this new method name. > 2. While this method provides a way to set timestamps, how would users read timestamps from the records? Ah, good point. I will introduce a new method to get the timestamp of the current record in RuntimeContext. Best regards, Weijie Xintong Song 于2024年1月30日周二 14:04写道: > Just trying to understand. > > > Is there a particular reason we do not support a > > `TwoInputProcessFunction` to combine a KeyedStream with a > > BroadcastStream to result in a KeyedStream? There seems to be a valid > > use case where a KeyedStream is enriched with a BroadcastStream and > > returns a Stream that is partitioned in the same way. 
> > > > The key point here is that if the returned stream is a KeyedStream, we > > require that the partition of input and output be the same. As for the > > data on the broadcast edge, it will be broadcast to all parallelism, we > > cannot keep the data partition consistent. For example, if a specific > > record is sent to both SubTask1 and SubTask2, after processing, the > > partition index calculated by the new KeySelector is `1`, then the data > > distribution of SubTask2 has obviously changed. > > > Does this mean if we want to support (KeyedStream, BroadcastStream) -> > (KeyedStream), we must make sure that no data can be output upon processing > records from the input BroadcastStream? That's probably a reasonable > limitation. The problem is would this limitation be too implicit for the > users to understand. > > > > I noticed that there are two `collect` methods in the Collector, > > > > one with a timestamp and one without. Could you elaborate on the > > differences between them? Additionally, in what use case would one use > > the method that includes the timestamp? > > > > > > That's a good question, and it's mostly used with time-related operators > > such as Window. First, we want to give the process function the ability > to > > reset timestamps, which makes it more flexible than the original > > API. Second, we don't want to take the timestamp extraction > > operator/function as a base primitive, it's more like a high-level > > extension. Therefore, the framework must provide this functionality. > > > > > 1. I'd suggest renaming the method with timestamp to something like > `collectAndOverwriteTimestamp`. That might help users understand that they > don't always need to call this method, unless they explicitly want to > overwrite the timestamp. > > 2. While this method provides a way to set timestamps, how would users read > timestamps from the records? 
> > > Best, > > Xintong > > > > On Tue, Jan 30, 2024 at 12:45 PM weijie guo > wrote: > > > Hi Xuannan, > > > > Thank you for your attention. > > > > > In the partitioning section, it says that "broadcast can only be > > used as a side-input of other Inputs." Could you clarify what is meant > > by "side-input"? If I understand correctly, it refer to one of the > > inputs of the `TwoInputStreamProcessFunction`. If that is the case, > > the term "side-input" may not be accurate. > > > > Yes, you got it right! I have rewrote this sentence to avoid > > misunderstanding. > > > > > Is there a particular reason we do not support a > > `TwoInputProcessFunction` to combine a KeyedStream with a > > BroadcastStream to result in a KeyedStream? There seems to be a valid > > use case where a KeyedStream is enriched with a BroadcastStream and > > returns a Stream that is partitioned in the same way. > > > > The key point here is that if the returned stream is a KeyedStream, we > > require that the partition of input and output be the same. As for the > > data on the broadcast edge, it will be broadcast to all parallelism, we > > cannot keep the data partition
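The partition-consistency argument above — a broadcast record reaching both SubTask1 and SubTask2 while its key maps to only one of them — can be illustrated with key-to-subtask hashing. The modulo scheme below is a deliberate simplification of Flink's key-group assignment, for illustration only:

```java
/** Shows why output driven by a broadcast input cannot stay key-partitioned:
 *  a broadcast record is processed on every subtask, but its key hashes to one. */
class BroadcastPartitionDemo {
    static int subtaskFor(String key, int parallelism) {
        // Simplified stand-in for Flink's key-group-based assignment.
        return Math.floorMod(key.hashCode(), parallelism);
    }

    public static void main(String[] args) {
        int parallelism = 2;
        String key = "user-42";
        int owner = subtaskFor(key, parallelism);
        for (int subtask = 0; subtask < parallelism; subtask++) {
            // Only the owning subtask could emit this key without changing
            // the downstream key partitioning; every other subtask would break it.
            System.out.println("subtask " + subtask + ": "
                    + (subtask == owner ? "consistent" : "would break key partitioning"));
        }
    }
}
```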
Re: [VOTE] Release flink-connector-mongodb 1.1.0, release candidate #1
Thanks Hang for the review! We should use jdk8 to compile the jar, I will cancel the rc1 and prepare rc2 soon. Best, Leonard > 2024年1月30日 下午2:39,Hang Ruan 写道: > > Hi, Leonard. > > I find that META-INF/MANIFEST.MF in > flink-sql-connector-mongodb-1.1.0-1.18.jar shows as follow. > > Manifest-Version: 1.0 > Archiver-Version: Plexus Archiver > Created-By: Apache Maven 3.8.1 > Built-By: bangjiangxu > Build-Jdk: 11.0.11 > Specification-Title: Flink : Connectors : SQL : MongoDB > Specification-Version: 1.1.0-1.18 > Specification-Vendor: The Apache Software Foundation > Implementation-Title: Flink : Connectors : SQL : MongoDB > Implementation-Version: 1.1.0-1.18 > Implementation-Vendor-Id: org.apache.flink > Implementation-Vendor: The Apache Software Foundation > > Maybe we should build mongodb connector with jdk8. > > Best, > Hang > > Jiabao Sun 于2024年1月29日周一 21:51写道: > >> Thanks Leonard for driving this. >> >> +1(non-binding) >> >> - Release notes look good >> - Tag is present in Github >> - Validated checksum hash >> - Verified signature >> - Build the source with Maven by jdk8,11,17,21 >> - Verified web PR and left minor comments >> - Run a filter push down test by sql-client on Flink 1.18.1 and it works >> well >> >> Best, >> Jiabao >> >> >> On 2024/01/29 12:33:23 Leonard Xu wrote: >>> Hey all, >>> >>> Please help review and vote on the release candidate #1 for the version >> 1.1.0 of the >>> Apache Flink MongoDB Connector as follows: >>> >>> [ ] +1, Approve the release >>> [ ] -1, Do not approve the release (please provide specific comments) >>> >>> The complete staging area is available for your review, which includes: >>> * JIRA release notes [1], >>> * The official Apache source release to be deployed to dist.apache.org >> [2], >>> which are signed with the key with fingerprint >>> 5B2F6608732389AEB67331F5B197E1F1108998AD [3], >>> * All artifacts to be deployed to the Maven Central Repository [4], >>> * Source code tag v.1.0-rc1 [5], >>> * Website pull request 
listing the new release [6]. >>> >>> The vote will be open for at least 72 hours. It is adopted by majority >>> approval, with at least 3 PMC affirmative votes. >>> >>> >>> Best, >>> Leonard >>> [1] >> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522=12353483 >>> [2] >> https://dist.apache.org/repos/dist/dev/flink/flink-connector-mongodb-1.1.0-rc1/ >>> [3] https://dist.apache.org/repos/dist/release/flink/KEYS >>> [4] >> https://repository.apache.org/content/repositories/orgapacheflink-1702/ >>> [5] https://github.com/apache/flink-connector-mongodb/tree/v1.1.0-rc1 >>> [6] https://github.com/apache/flink-web/pull/719
Re: [VOTE] Release flink-connector-mongodb 1.1.0, release candidate #1
Hi, Leonard.

I find that META-INF/MANIFEST.MF in flink-sql-connector-mongodb-1.1.0-1.18.jar shows as follows.

Manifest-Version: 1.0
Archiver-Version: Plexus Archiver
Created-By: Apache Maven 3.8.1
Built-By: bangjiangxu
Build-Jdk: 11.0.11
Specification-Title: Flink : Connectors : SQL : MongoDB
Specification-Version: 1.1.0-1.18
Specification-Vendor: The Apache Software Foundation
Implementation-Title: Flink : Connectors : SQL : MongoDB
Implementation-Version: 1.1.0-1.18
Implementation-Vendor-Id: org.apache.flink
Implementation-Vendor: The Apache Software Foundation

Maybe we should build the mongodb connector with jdk8.

Best,
Hang

Jiabao Sun 于2024年1月29日周一 21:51写道:
> Thanks Leonard for driving this.
>
> +1 (non-binding)
>
> - Release notes look good
> - Tag is present in Github
> - Validated checksum hash
> - Verified signature
> - Build the source with Maven by jdk8,11,17,21
> - Verified web PR and left minor comments
> - Run a filter push down test by sql-client on Flink 1.18.1 and it works well
>
> Best,
> Jiabao
>
> On 2024/01/29 12:33:23 Leonard Xu wrote:
> > Hey all,
> >
> > Please help review and vote on the release candidate #1 for the version 1.1.0 of the
> > Apache Flink MongoDB Connector as follows:
> >
> > [ ] +1, Approve the release
> > [ ] -1, Do not approve the release (please provide specific comments)
> >
> > The complete staging area is available for your review, which includes:
> > * JIRA release notes [1],
> > * The official Apache source release to be deployed to dist.apache.org [2],
> > which are signed with the key with fingerprint
> > 5B2F6608732389AEB67331F5B197E1F1108998AD [3],
> > * All artifacts to be deployed to the Maven Central Repository [4],
> > * Source code tag v1.1.0-rc1 [5],
> > * Website pull request listing the new release [6].
> >
> > The vote will be open for at least 72 hours. It is adopted by majority
> > approval, with at least 3 PMC affirmative votes.
> >
> > Best,
> > Leonard
> > [1] https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522=12353483
> > [2] https://dist.apache.org/repos/dist/dev/flink/flink-connector-mongodb-1.1.0-rc1/
> > [3] https://dist.apache.org/repos/dist/release/flink/KEYS
> > [4] https://repository.apache.org/content/repositories/orgapacheflink-1702/
> > [5] https://github.com/apache/flink-connector-mongodb/tree/v1.1.0-rc1
> > [6] https://github.com/apache/flink-web/pull/719
Re: [DISCUSS] FLIP-410: Config, Context and Processing Timer Service of DataStream API V2
Hi Xuannan, > 1. +1 to only use XXXParititionStream if users only need to use the configurable PartitionStream. If there are use cases for both, perhaps we could use `ProcessConfigurableNonKeyedPartitionStream` or `ConfigurableNonKeyedPartitionStream` for simplicity. As for why we need both, you can refer to my reply to Yunfeng's first question. As for the name, I can accept ProcessConfigurableNonKeyedPartitionStream or keep the status quo. But I don't want to change it to ConfigurableNonKeyedPartitionStream, the reason is the same, because the configuration is applied to the Process rather than the Stream. > Should we allow users to set custom configurations through the `ProcessConfigurable` interface and access these configurations in the `ProcessFunction` via `RuntimeContext`? I believe it would be useful for process function developers to be able to define custom configurations. If I understand you correctly, you want to set custom properties for processing. The current configurations are mostly for the runtime engine, such as determining the underlying operator's parallelism and SSG. But I'm not aware of the need to pass in a custom value (independent of the framework itself) and then get it at runtime from RuntimeContext. Could you give some examples? > How can users define custom metrics within the `ProcessFunction`? Will there be a method like `getMetricGroup` available in the `RuntimeContext`? I think this is a reasonable request. For extensibility, I have added the getMetricManager instead of getMetricGroup to RuntimeContext, we can use it to get the MetricGroup. Best regards, Weijie weijie guo 于2024年1月30日周二 13:45写道: > Thanks Yunfeng, > > Let me try to answer your question :) > > > 1. Would it be better to have all XXXPartitionStream classes implement > ProcessConfigurable, instead of defining both XXXPartitionStream and > ProcessConfigurableAndXXXPartitionStream? I wonder whether users would > need to operate on a non-configurable PartitionStream. 
> > I thought about this for a while and decided to separate DataStream from > ProcessConfigurable. At the core of this is that streams and configurations > are completely orthogonal concepts, and configuration is only > responsible for the `Process`, not the `Stream`. This is why only the > `process/connectAndProcess` returns configurable stream, but partitioning > like `KeyBy` returns a pure DataStream. This may also answer your second > question in passing. > > > > Apart from the detailed withConfigFoo(foo)/withConfigBar(bar) > methods, would it be better to also add a general > withConfig(configKey, configValue) method to the ProcessConfigurable > interface? Adding a method for each configuration might harm the > readability and compatibility of configurations. > > Sorry, I may not fully understand this question. ProcessConfigurable > simply refers to the configuration of the Process, which can have the name, > parallelism, etc. of the process. It's not actually the Configuration (contains > a lot of ConfigOptions) that we usually talk about, but more like > `SingleOutputStreamOperator` in DataStream V1. > > Best regards, > > Weijie > > > Xuannan Su 于2024年1月29日周一 18:45写道: > >> Hi Weijie, >> >> Thanks for the FLIP! I have a few questions regarding the FLIP. >> >> 1. +1 to only use XXXParititionStream if users only need to use the >> configurable PartitionStream. If there are use cases for both, >> perhaps we could use `ProcessConfigurableNonKeyedPartitionStream` or >> `ConfigurableNonKeyedPartitionStream` for simplicity. >> >> 2. Should we allow users to set custom configurations through the >> `ProcessConfigurable` interface and access these configurations in the >> `ProcessFunction` via `RuntimeContext`? I believe it would be useful >> for process function developers to be able to define custom >> configurations. >> >> 3. How can users define custom metrics within the `ProcessFunction`? 
>> Will there be a method like `getMetricGroup` available in the >> `RuntimeContext`? >> >> Best, >> Xuannan >> >> >> On Fri, Jan 26, 2024 at 2:38 PM Yunfeng Zhou >> wrote: >> > >> > Hi Weijie, >> > >> > Thanks for introducing this FLIP! I have a few questions about the >> > designs proposed. >> > >> > 1. Would it be better to have all XXXPartitionStream classes implement >> > ProcessConfigurable, instead of defining both XXXPartitionStream and >> > ProcessConfigurableAndXXXPartitionStream? I wonder whether users would >> > need to operate on a non-configurable PartitionStream. >> > >> > 2. The name "ProcessConfigurable" seems a little ambiguous to me. Will >> > there be classes other than XXXPartitionStream that implement this >> > interface? Will "Process" be accurate enough to describe >> > PartitionStream and those classes? >> > >> > 3. Apart from the detailed withConfigFoo(foo)/withConfigBar(bar) >> > methods, would it be better to also add a general >> > withConfig(configKey, configValue) method to the ProcessConfigurable >> > interface?
Re: [VOTE] Release flink-connector-jdbc, release candidate #2
+1 (non-binding) - Validated checksum hash - Verified signature - Verified that no binaries exist in the source archive - Build the source with Maven and jdk11 - Verified web PR - Check that the jar is built by jdk8 Best, Hang Jiabao Sun 于2024年1月30日周二 10:52写道: > +1(non-binding) > > - Release notes look good > - Tag is present in Github > - Validated checksum hash > - Verified signature > - Verified web PR and left minor comments > > Best, > Jiabao > > > On 2024/01/30 00:17:54 Sergey Nuyanzin wrote: > > Hi everyone, > > Please review and vote on the release candidate #2 for the version > > 3.1.2, as follows: > > [ ] +1, Approve the release > > [ ] -1, Do not approve the release (please provide specific comments) > > > > This version is compatible with Flink 1.16.x, 1.17.x and 1.18.x. > > > > The complete staging area is available for your review, which includes: > > * JIRA release notes [1], > > * the official Apache source release to be deployed to dist.apache.org > > [2], which are signed with the key with fingerprint > > 1596BBF0726835D8 [3], > > * all artifacts to be deployed to the Maven Central Repository [4], > > * source code tag v3.1.2-rc2 [5], > > * website pull request listing the new release [6]. > > > > The vote will be open for at least 72 hours. It is adopted by majority > > approval, with at least 3 PMC affirmative votes. > > > > Thanks, > > Release Manager > > > > [1] > > > https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522=12354088 > > [2] > > > https://dist.apache.org/repos/dist/dev/flink/flink-connector-jdbc-3.1.2-rc2 > > [3] https://dist.apache.org/repos/dist/release/flink/KEYS > > [4] > https://repository.apache.org/content/repositories/orgapacheflink-1704/ > > [5] > https://github.com/apache/flink-connector-jdbc/releases/tag/v3.1.2-rc2 > > [6] https://github.com/apache/flink-web/pull/707 > >
Re: [DISCUSS] FLIP-409: DataStream V2 Building Blocks: DataStream, Partitioning and ProcessFunction
Just trying to understand. > Is there a particular reason we do not support a > `TwoInputProcessFunction` to combine a KeyedStream with a > BroadcastStream to result in a KeyedStream? There seems to be a valid > use case where a KeyedStream is enriched with a BroadcastStream and > returns a Stream that is partitioned in the same way. > The key point here is that if the returned stream is a KeyedStream, we > require that the partition of input and output be the same. As for the > data on the broadcast edge, it will be broadcast to all parallel subtasks, so we > cannot keep the data partition consistent. For example, if a specific > record is sent to both SubTask1 and SubTask2, after processing, the > partition index calculated by the new KeySelector is `1`, then the data > distribution of SubTask2 has obviously changed. Does this mean that if we want to support (KeyedStream, BroadcastStream) -> (KeyedStream), we must make sure that no data can be output upon processing records from the input BroadcastStream? That's probably a reasonable limitation. The question is whether this limitation would be too implicit for users to understand. > I noticed that there are two `collect` methods in the Collector, > > one with a timestamp and one without. Could you elaborate on the > differences between them? Additionally, in what use case would one use > the method that includes the timestamp? > > > That's a good question, and it's mostly used with time-related operators > such as Window. First, we want to give the process function the ability to > reset timestamps, which makes it more flexible than the original > API. Second, we don't want to take the timestamp extraction > operator/function as a base primitive, it's more like a high-level > extension. Therefore, the framework must provide this functionality. > > 1. I'd suggest renaming the method with timestamp to something like `collectAndOverwriteTimestamp`. 
That might help users understand that they don't always need to call this method, unless they explicitly want to overwrite the timestamp. 2. While this method provides a way to set timestamps, how would users read timestamps from the records? Best, Xintong On Tue, Jan 30, 2024 at 12:45 PM weijie guo wrote: > Hi Xuannan, > > Thank you for your attention. > > > In the partitioning section, it says that "broadcast can only be > used as a side-input of other Inputs." Could you clarify what is meant > by "side-input"? If I understand correctly, it refer to one of the > inputs of the `TwoInputStreamProcessFunction`. If that is the case, > the term "side-input" may not be accurate. > > Yes, you got it right! I have rewrote this sentence to avoid > misunderstanding. > > > Is there a particular reason we do not support a > `TwoInputProcessFunction` to combine a KeyedStream with a > BroadcastStream to result in a KeyedStream? There seems to be a valid > use case where a KeyedStream is enriched with a BroadcastStream and > returns a Stream that is partitioned in the same way. > > The key point here is that if the returned stream is a KeyedStream, we > require that the partition of input and output be the same. As for the > data on the broadcast edge, it will be broadcast to all parallelism, we > cannot keep the data partition consistent. For example, if a specific > record is sent to both SubTask1 and SubTask2, after processing, the > partition index calculated by the new KeySelector is `1`, then the data > distribution of SubTask2 has obviously changed. > > > 3. There appears to be a typo in the example code. The > `SingleStreamProcessFunction` should probably be > `OneInputStreamProcessFunction`. > > Yes, good catch. I have updated this FLIP. > > > 4. How do we set the global configuration for the > ExecutionEnvironment? Currently, we have the > StreamExecutionEnvironment.getExecutionEnvironment(Configuration) > method to provide the global configuration in the API. 
> > This is because we don't want to allow set config programmatically in the > new API, everything best comes from configuration files. However, this may > be too ideal, and the specific details need to be considered and discussed > in more detail, and I propose to devote a new sub-FLIP to this issue later. > We can easily provide the `getExecutionEnvironment(Configuration)` or > `withConfiguration(Configuration)` method later. > > > I noticed that there are two `collect` methods in the Collector, > one with a timestamp and one without. Could you elaborate on the > differences between them? Additionally, in what use case would one use > the method that includes the timestamp? > > That's a good question, and it's mostly used with time-related operators > such as Window. First, we want to give the process function the ability to > reset timestamps, which makes it more flexible than the original > API. Second, we don't want to take the timestamp extraction > operator/function as a base primitive, it's more like a high-level > extension. Therefore, the framework
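The two `collect` variants discussed in this thread can be mocked up as follows. This is an illustrative sketch only, not the actual FLIP-409 API: the `Collector` interface shape, the `collectAndOverwriteTimestamp` name (Xintong's suggestion above), and the internal `StreamRecord` type are all assumptions drawn from the discussion.

```java
import java.util.ArrayList;
import java.util.List;

// Mock of the Collector discussed above; names and signatures are assumptions
// drawn from the thread, not the actual FLIP-409 API.
public class CollectorSketch {

    interface Collector<T> {
        void collect(T record);                               // keep the input record's timestamp
        void collectAndOverwriteTimestamp(T record, long ts); // explicitly reset the timestamp
    }

    // How the framework might represent a timestamped record internally.
    record StreamRecord<T>(T value, long timestamp) {}

    // Minimal collector: records emitted via plain collect() inherit the
    // timestamp of the record currently being processed.
    static class BufferingCollector<T> implements Collector<T> {
        final List<StreamRecord<T>> out = new ArrayList<>();
        long currentInputTimestamp;

        public void collect(T record) {
            out.add(new StreamRecord<>(record, currentInputTimestamp));
        }

        public void collectAndOverwriteTimestamp(T record, long ts) {
            out.add(new StreamRecord<>(record, ts));
        }
    }

    public static void main(String[] args) {
        BufferingCollector<String> c = new BufferingCollector<>();
        c.currentInputTimestamp = 100L;
        c.collect("a");                            // timestamp stays 100
        c.collectAndOverwriteTimestamp("b", 250L); // timestamp overwritten to 250
        System.out.println(c.out);
    }
}
```

The split keeps the common case (no timestamp manipulation) free of any timestamp argument, while still letting time-related operators such as windows reset timestamps explicitly.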
[jira] [Created] (FLINK-34271) Fix the unstable test about GroupAggregateRestoreTest#AGG_WITH_STATE_TTL_HINT
xuyang created FLINK-34271: -- Summary: Fix the unstable test about GroupAggregateRestoreTest#AGG_WITH_STATE_TTL_HINT Key: FLINK-34271 URL: https://issues.apache.org/jira/browse/FLINK-34271 Project: Flink Issue Type: Bug Components: Table SQL / Planner Reporter: xuyang -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (FLINK-34270) Update connector developer-facing doc
Zhanghao Chen created FLINK-34270: - Summary: Update connector developer-facing doc Key: FLINK-34270 URL: https://issues.apache.org/jira/browse/FLINK-34270 Project: Flink Issue Type: Sub-task Reporter: Zhanghao Chen -- This message was sent by Atlassian Jira (v8.20.10#820010)
Re: [DISCUSS] FLIP-410: Config, Context and Processing Timer Service of DataStream API V2
Thanks Yunfeng. Let me try to answer your questions :) > 1. Would it be better to have all XXXPartitionStream classes implement ProcessConfigurable, instead of defining both XXXPartitionStream and ProcessConfigurableAndXXXPartitionStream? I wonder whether users would need to operate on a non-configurable PartitionStream. I thought about this for a while and decided to separate DataStream from ProcessConfigurable. At the core of this is that streams and configurations are completely orthogonal concepts, and configuration is only responsible for the `Process`, not the `Stream`. This is why only `process/connectAndProcess` returns a configurable stream, while partitioning operations like `keyBy` return a pure DataStream. This may also answer your second question in passing. > Apart from the detailed withConfigFoo(foo)/withConfigBar(bar) methods, would it be better to also add a general withConfig(configKey, configValue) method to the ProcessConfigurable interface? Adding a method for each configuration might harm the readability and compatibility of configurations. Sorry, I may not fully understand this question. ProcessConfigurable simply refers to the configuration of the Process, which can have the name, parallelism, etc. of the process. It's not actually the Configuration (containing a lot of ConfigOptions) that we usually talk about, but more like `SingleOutputStreamOperator` in DataStream V1. Best regards, Weijie Xuannan Su 于2024年1月29日周一 18:45写道: > Hi Weijie, > > Thanks for the FLIP! I have a few questions regarding the FLIP. > > 1. +1 to only use XXXPartitionStream if users only need to use the > configurable PartitionStream. If there are use cases for both, > perhaps we could use `ProcessConfigurableNonKeyedPartitionStream` or > `ConfigurableNonKeyedPartitionStream` for simplicity. > > 2. Should we allow users to set custom configurations through the > `ProcessConfigurable` interface and access these configurations in the > `ProcessFunction` via `RuntimeContext`? 
I believe it would be useful > for process function developers to be able to define custom > configurations. > > 3. How can users define custom metrics within the `ProcessFunction`? > Will there be a method like `getMetricGroup` available in the > `RuntimeContext`? > > Best, > Xuannan > > > On Fri, Jan 26, 2024 at 2:38 PM Yunfeng Zhou > wrote: > > > > Hi Weijie, > > > > Thanks for introducing this FLIP! I have a few questions about the > > designs proposed. > > > > 1. Would it be better to have all XXXPartitionStream classes implement > > ProcessConfigurable, instead of defining both XXXPartitionStream and > > ProcessConfigurableAndXXXPartitionStream? I wonder whether users would > > need to operate on a non-configurable PartitionStream. > > > > 2. The name "ProcessConfigurable" seems a little ambiguous to me. Will > > there be classes other than XXXPartitionStream that implement this > > interface? Will "Process" be accurate enough to describe > > PartitionStream and those classes? > > > > 3. Apart from the detailed withConfigFoo(foo)/withConfigBar(bar) > > methods, would it be better to also add a general > > withConfig(configKey, configValue) method to the ProcessConfigurable > > interface? Adding a method for each configuration might harm the > > readability and compatibility of configurations. > > > > Looking forward to your response. > > > > Best regards, > > Yunfeng Zhou > > > > On Tue, Dec 26, 2023 at 2:47 PM weijie guo > wrote: > > > > > > Hi devs, > > > > > > > > > I'd like to start a discussion about FLIP-410: Config, Context and > > > Processing Timer Service of DataStream API V2 [1]. This is the second > > > sub-FLIP of DataStream API V2. > > > > > > > > > In FLIP-409 [2], we have defined the most basic primitive of > > > DataStream V2. On this basis, this FLIP will further answer several > > > important questions closely related to it: > > > > > >1. > > >How to configure the processing over the datastreams, such as > > > setting the parallelism. 
> > >2. > > >How to get access to the runtime contextual information and > > > services from inside the process functions. > > >3. How to work with processing-time timers. > > > > > > You can find more details in this FLIP. Its relationship with other > > > sub-FLIPs can be found in the umbrella FLIP > > > [3]. > > > > > > > > > Looking forward to hearing from you, thanks! > > > > > > > > > Best regards, > > > > > > Weijie > > > > > > > > > [1] > > > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-410%3A++Config%2C+Context+and+Processing+Timer+Service+of+DataStream+API+V2 > > > > > > [2] > > > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-409%3A+DataStream+V2+Building+Blocks%3A+DataStream%2C+Partitioning+and+ProcessFunction > > > > > > [3] > > > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-408%3A+%5BUmbrella%5D+Introduce+DataStream+API+V2 >
Re: [DISCUSS] FLIP-409: DataStream V2 Building Blocks: DataStream, Partitioning and ProcessFunction
Hi Xuannan, Thank you for your attention. > In the partitioning section, it says that "broadcast can only be used as a side-input of other Inputs." Could you clarify what is meant by "side-input"? If I understand correctly, it refers to one of the inputs of the `TwoInputStreamProcessFunction`. If that is the case, the term "side-input" may not be accurate. Yes, you got it right! I have rewritten this sentence to avoid misunderstanding. > Is there a particular reason we do not support a `TwoInputProcessFunction` to combine a KeyedStream with a BroadcastStream to result in a KeyedStream? There seems to be a valid use case where a KeyedStream is enriched with a BroadcastStream and returns a Stream that is partitioned in the same way. The key point here is that if the returned stream is a KeyedStream, we require that the partition of input and output be the same. As for the data on the broadcast edge, it will be broadcast to all parallel subtasks, so we cannot keep the data partition consistent. For example, if a specific record is sent to both SubTask1 and SubTask2, after processing, the partition index calculated by the new KeySelector is `1`, then the data distribution of SubTask2 has obviously changed. > 3. There appears to be a typo in the example code. The `SingleStreamProcessFunction` should probably be `OneInputStreamProcessFunction`. Yes, good catch. I have updated this FLIP. > 4. How do we set the global configuration for the ExecutionEnvironment? Currently, we have the StreamExecutionEnvironment.getExecutionEnvironment(Configuration) method to provide the global configuration in the API. This is because we don't want to allow setting config programmatically in the new API; ideally, everything comes from configuration files. However, this may be too ideal, and the specific details need to be considered and discussed in more detail, and I propose to devote a new sub-FLIP to this issue later. 
We can easily provide the `getExecutionEnvironment(Configuration)` or `withConfiguration(Configuration)` method later. > I noticed that there are two `collect` methods in the Collector, one with a timestamp and one without. Could you elaborate on the differences between them? Additionally, in what use case would one use the method that includes the timestamp? That's a good question, and it's mostly used with time-related operators such as Window. First, we want to give the process function the ability to reset timestamps, which makes it more flexible than the original API. Second, we don't want to take the timestamp extraction operator/function as a base primitive, it's more like a high-level extension. Therefore, the framework must provide this functionality. Best regards, Weijie weijie guo 于2024年1月30日周二 11:45写道: > Hi Yunfeng, > > Thank you for your attention > > > 1. Will we provide any API to support choosing which input to consume > between the two inputs of TwoInputStreamProcessFunction? It would be > helpful in online machine learning cases, where a process function > needs to receive the first machine learning model before it can start > predictions on input data. Similar requirements might also exist in > Flink CEP, where a rule set needs to be consumed by the process > function first before it can start matching the event stream against > CEP patterns. > > Good point! I think we can provide a `nextInputSelection()` method for > `TwoInputStreamProcessFunction`. It returns a ·First/Second· enum that > determines which Input the mailbox thread will read next. But I'm > considering putting it in the sub-FLIP related to Join, since features like > HashJoin have a more specific need for this. > > > A typo might exist in the current FLIP describing the API to > generate a global stream, as I can see either global() or coalesce() > in different places of the FLIP. These two methods might need to be > unified into one method. > > Good catch! 
I have updated this FLIP to fix this typo. > > > The order of parameters in the current ProcessFunction is (record, > context, output), while this FLIP proposes to change the order into > (record, output, context). Is there any reason to make this change? > > No, it's just the order we decide. But please note that there is no > relationship between the two ProcessFunction's anyway. I think it's okay to > use our own order of parameters in new API. > > 4. Why does this FLIP propose to use connectAndProcess() instead of > connect() (+ keyBy()) + process()? The latter looks simpler to me. > > > I actually also considered this way at first, but it would have to > introduce some concepts like ConnectedStreams. But we hope that streams > will be more clearly defined in the DataStream API, otherwise we will end > up going the same way as the original API, which you have to understand > `JoinedStreams/ConnectedStreams` and so on. > > > > Best regards, > > Weijie > > > weijie guo 于2024年1月30日周二 11:20写道: > >> Hi Wencong: >> >> Thank you for your attention >> >> > Q1. Other DataStream
Re: [DISCUSS] FLIP-409: DataStream V2 Building Blocks: DataStream, Partitioning and ProcessFunction
Hi Yunfeng, Thank you for your attention. > 1. Will we provide any API to support choosing which input to consume between the two inputs of TwoInputStreamProcessFunction? It would be helpful in online machine learning cases, where a process function needs to receive the first machine learning model before it can start predictions on input data. Similar requirements might also exist in Flink CEP, where a rule set needs to be consumed by the process function first before it can start matching the event stream against CEP patterns. Good point! I think we can provide a `nextInputSelection()` method for `TwoInputStreamProcessFunction`. It returns a `First/Second` enum that determines which input the mailbox thread will read next. But I'm considering putting it in the sub-FLIP related to Join, since features like HashJoin have a more specific need for this. > A typo might exist in the current FLIP describing the API to generate a global stream, as I can see either global() or coalesce() in different places of the FLIP. These two methods might need to be unified into one method. Good catch! I have updated this FLIP to fix this typo. > The order of parameters in the current ProcessFunction is (record, context, output), while this FLIP proposes to change the order into (record, output, context). Is there any reason to make this change? No, it's just the order we decided on. But please note that there is no relationship between the two ProcessFunctions anyway. I think it's okay to use our own order of parameters in the new API. > 4. Why does this FLIP propose to use connectAndProcess() instead of connect() (+ keyBy()) + process()? The latter looks simpler to me. I actually also considered this way at first, but it would have to introduce some concepts like ConnectedStreams. 
But we hope that streams will be more clearly defined in the DataStream API, otherwise we will end up going the same way as the original API, which you have to understand `JoinedStreams/ConnectedStreams` and so on. Best regards, Weijie weijie guo 于2024年1月30日周二 11:20写道: > Hi Wencong: > > Thank you for your attention > > > Q1. Other DataStream types are converted into > Non-Keyed DataStreams by using a "shuffle" operation > to convert Input into output. Does this "shuffle" include the > various repartition operations (rebalance/rescale/shuffle) > from DataStream V1? > > Yes, The name `shuffle` is used only to represent the transformation of an > arbitrary stream into a non-keyed partitioned stream and does not restrict > how the data is partitioned. > > > > Q2. Why is the design for TwoOutputStreamProcessFunction, > when dealing with a KeyedStream, only outputting combinations > of (Keyed + Keyed) and (Non-Keyed + Non-Keyed)? > > In theory, we could only provide functions that return Non-Keyed streams. > If you do want a KeyedStream, you explicitly convert it to a KeyedStream > via keyBy. However, because sometimes data is processed without changing > the partition, we choose to provide an additional KeyedStream counterpart > to reduce the shuffle overhead. We didn't introduce the non-keyed + keyed > combo here simply because it's not very common, and if we really see a lot > of users asking for it later on, it's easy to support it then. > > > Best regards, > > Weijie > > > Xuannan Su 于2024年1月29日周一 18:28写道: > >> Hi Weijie, >> >> Thank you for driving the design of the new DataStream API. I have a >> few questions regarding the FLIP: >> >> 1. In the partitioning section, it says that "broadcast can only be >> used as a side-input of other Inputs." Could you clarify what is meant >> by "side-input"? If I understand correctly, it refer to one of the >> inputs of the `TwoInputStreamProcessFunction`. If that is the case, >> the term "side-input" may not be accurate. 
>> >> 2. Is there a particular reason we do not support a >> `TwoInputProcessFunction` to combine a KeyedStream with a >> BroadcastStream to result in a KeyedStream? There seems to be a valid >> use case where a KeyedStream is enriched with a BroadcastStream and >> returns a Stream that is partitioned in the same way. >> >> 3. There appears to be a typo in the example code. The >> `SingleStreamProcessFunction` should probably be >> `OneInputStreamProcessFunction`. >> >> 4. How do we set the global configuration for the >> ExecutionEnvironment? Currently, we have the >> StreamExecutionEnvironment.getExecutionEnvironment(Configuration) >> method to provide the global configuration in the API. >> >> 5. I noticed that there are two `collect` methods in the Collector, >> one with a timestamp and one without. Could you elaborate on the >> differences between them? Additionally, in what use case would one use >> the method that includes the timestamp? >> >> Best regards, >> Xuannan >> >> >> >> On Fri, Jan 26, 2024 at 2:21 PM Yunfeng Zhou >> wrote: >> > >> > Hi Weijie, >> > >> > Thanks for raising discussions about the new DataStream API. I have a >> > few questions about the content of
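The `nextInputSelection()` idea from Weijie's reply (read the model input to completion before touching the data input) can be sketched like this. The enum and method names follow the discussion but are assumptions, not a committed Flink API.

```java
// Sketch of the two-input "read the model first" pattern discussed above.
// InputSelection and nextInputSelection() are assumptions based on the thread.
public class InputSelectionSketch {

    enum InputSelection { FIRST, SECOND }

    // Hypothetical two-input function: the framework would consult
    // nextInputSelection() to decide which input the mailbox thread reads next.
    interface TwoInputProcessFunction<IN1, IN2> {
        default InputSelection nextInputSelection() { return InputSelection.FIRST; }
        void processRecordFromFirst(IN1 record);
        void processRecordFromSecond(IN2 record);
    }

    // Online-ML style usage from the thread: consume the model (input 1)
    // before scoring any data records (input 2).
    static class ModelThenData implements TwoInputProcessFunction<Double, Double> {
        Double model;      // null until a model has arrived
        double lastScore;

        @Override
        public InputSelection nextInputSelection() {
            return model == null ? InputSelection.FIRST : InputSelection.SECOND;
        }

        @Override
        public void processRecordFromFirst(Double m) { model = m; }

        @Override
        public void processRecordFromSecond(Double x) { lastScore = model * x; }
    }

    public static void main(String[] args) {
        ModelThenData f = new ModelThenData();
        System.out.println(f.nextInputSelection()); // FIRST: must read the model first
        f.processRecordFromFirst(2.0);
        System.out.println(f.nextInputSelection()); // SECOND: model ready, read data
        f.processRecordFromSecond(3.0);
        System.out.println(f.lastScore);            // 6.0
    }
}
```

The same hook would cover the Flink CEP case mentioned above (consume the rule set before matching events), which is why the thread also ties it to the Join sub-FLIP.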
Re: [DISCUSS] FLIP-409: DataStream V2 Building Blocks: DataStream, Partitioning and ProcessFunction
Hi Wencong: Thank you for your attention > Q1. Other DataStream types are converted into Non-Keyed DataStreams by using a "shuffle" operation to convert Input into output. Does this "shuffle" include the various repartition operations (rebalance/rescale/shuffle) from DataStream V1? Yes, The name `shuffle` is used only to represent the transformation of an arbitrary stream into a non-keyed partitioned stream and does not restrict how the data is partitioned. > Q2. Why is the design for TwoOutputStreamProcessFunction, when dealing with a KeyedStream, only outputting combinations of (Keyed + Keyed) and (Non-Keyed + Non-Keyed)? In theory, we could only provide functions that return Non-Keyed streams. If you do want a KeyedStream, you explicitly convert it to a KeyedStream via keyBy. However, because sometimes data is processed without changing the partition, we choose to provide an additional KeyedStream counterpart to reduce the shuffle overhead. We didn't introduce the non-keyed + keyed combo here simply because it's not very common, and if we really see a lot of users asking for it later on, it's easy to support it then. Best regards, Weijie Xuannan Su 于2024年1月29日周一 18:28写道: > Hi Weijie, > > Thank you for driving the design of the new DataStream API. I have a > few questions regarding the FLIP: > > 1. In the partitioning section, it says that "broadcast can only be > used as a side-input of other Inputs." Could you clarify what is meant > by "side-input"? If I understand correctly, it refer to one of the > inputs of the `TwoInputStreamProcessFunction`. If that is the case, > the term "side-input" may not be accurate. > > 2. Is there a particular reason we do not support a > `TwoInputProcessFunction` to combine a KeyedStream with a > BroadcastStream to result in a KeyedStream? There seems to be a valid > use case where a KeyedStream is enriched with a BroadcastStream and > returns a Stream that is partitioned in the same way. > > 3. 
There appears to be a typo in the example code. The > `SingleStreamProcessFunction` should probably be > `OneInputStreamProcessFunction`. > > 4. How do we set the global configuration for the > ExecutionEnvironment? Currently, we have the > StreamExecutionEnvironment.getExecutionEnvironment(Configuration) > method to provide the global configuration in the API. > > 5. I noticed that there are two `collect` methods in the Collector, > one with a timestamp and one without. Could you elaborate on the > differences between them? Additionally, in what use case would one use > the method that includes the timestamp? > > Best regards, > Xuannan > > > > On Fri, Jan 26, 2024 at 2:21 PM Yunfeng Zhou > wrote: > > > > Hi Weijie, > > > > Thanks for raising discussions about the new DataStream API. I have a > > few questions about the content of the FLIP. > > > > 1. Will we provide any API to support choosing which input to consume > > between the two inputs of TwoInputStreamProcessFunction? It would be > > helpful in online machine learning cases, where a process function > > needs to receive the first machine learning model before it can start > > predictions on input data. Similar requirements might also exist in > > Flink CEP, where a rule set needs to be consumed by the process > > function first before it can start matching the event stream against > > CEP patterns. > > > > 2. A typo might exist in the current FLIP describing the API to > > generate a global stream, as I can see either global() or coalesce() > > in different places of the FLIP. These two methods might need to be > > unified into one method. > > > > 3. The order of parameters in the current ProcessFunction is (record, > > context, output), while this FLIP proposes to change the order into > > (record, output, context). Is there any reason to make this change? > > > > 4. Why does this FLIP propose to use connectAndProcess() instead of > > connect() (+ keyBy()) + process()? The latter looks simpler to me. 
> > > > Looking forward to discussing these questions with you. > > > > Best regards, > > Yunfeng Zhou > > > > On Tue, Dec 26, 2023 at 2:44 PM weijie guo > wrote: > > > > > > Hi devs, > > > > > > > > > I'd like to start a discussion about FLIP-409: DataStream V2 Building > > > Blocks: DataStream, Partitioning and ProcessFunction [1]. > > > > > > > > > As the first sub-FLIP for DataStream API V2, we'd like to discuss and > > > try to answer some of the most fundamental questions in stream > > > processing: > > > > > >1. What kinds of data streams do we have? > > >2. How to partition data over the streams? > > >3. How to define a processing on the data stream? > > > > > > The answer to these questions involve three core concepts: DataStream, > > > Partitioning and ProcessFunction. In this FLIP, we will discuss the > > > definitions and related API primitives of these concepts in detail. > > > > > > > > > You can find more details in FLIP-409 [1]. This sub-FLIP is at the > > > heart of the entire
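For reference, the repartitioning strategies that the "shuffle" umbrella covers in DataStream V1 differ only in how each record picks a downstream channel. A minimal sketch of the channel-selection logic under simplifying assumptions (Flink's real rebalance partitioner starts its round-robin from a random initial channel, and rescale round-robins over a local subset of channels; neither detail is modeled here):

```java
import java.util.concurrent.ThreadLocalRandom;

// Channel-selection sketch only; not Flink internals.
public class PartitionerSketch {

    interface ChannelSelector { int selectChannel(int numChannels); }

    // rebalance: strict round-robin over all downstream channels
    // (simplified: real Flink starts from a random initial channel)
    static ChannelSelector rebalance() {
        int[] next = {0};
        return n -> { int c = next[0]; next[0] = (c + 1) % n; return c; };
    }

    // shuffle: a uniformly random channel per record
    static ChannelSelector shuffle() {
        return n -> ThreadLocalRandom.current().nextInt(n);
    }

    public static void main(String[] args) {
        ChannelSelector rb = rebalance();
        for (int i = 0; i < 5; i++) {
            System.out.print(rb.selectChannel(3) + " "); // 0 1 2 0 1
        }
        System.out.println();
    }
}
```

All of these produce a non-keyed partitioned stream, which matches Weijie's point that the V2 `shuffle` name only asserts "arbitrary stream in, non-keyed partitioned stream out" without fixing the distribution strategy.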
Re: Re: [DISCUSS] FLIP-331: Support EndOfStreamWindows and isOutputOnEOF operator attribute to optimize task deployment
Hi all, Thanks for the comments and suggestions. If there are no further comments, I will open the voting thread tomorrow. Best regards, Xuannan On Fri, Jan 26, 2024 at 3:16 PM Dong Lin wrote: > > Thanks Xuannan for the update! > > +1 (binding) > > On Wed, Jan 10, 2024 at 5:54 PM Xuannan Su wrote: > > > Hi all, > > > > After several rounds of offline discussions with Xingtong and Jinhao, > > we have decided to narrow the scope of the FLIP. It will now focus on > > introducing OperatorAttributes that indicate whether an operator emits > > records only after inputs have ended. We will also use the attribute > > to optimize task scheduling for better resource utilization. Setting > > the backlog status and optimizing the operator implementation during > > the backlog will be deferred to future work. > > > > In addition to the change above, we also make the following changes to > > the FLIP to address the problems mentioned by Dong: > > - Public interfaces are updated to reuse the GlobalWindows. > > - Instead of making all outputs of the upstream operators of the > > "isOutputOnlyAfterEndOfStream=true" operator blocking, we only make > > the output of the operator with "isOutputOnlyAfterEndOfStream=true" > > blocking. This can prevent the second problem Dong mentioned. In the > > future, we may introduce an extra OperatorAttributes to indicate if an > > operator has any side output. > > > > I would greatly appreciate any comment or feedback you may have on the > > updated FLIP. > > > > Best regards, > > Xuannan > > > > On Tue, Sep 26, 2023 at 11:24 AM Dong Lin wrote: > > > > > > Hi all, > > > > > > Thanks for the review! > > > > > > Becket and I discussed this FLIP offline and we agreed on several things > > > that need to be improved with this FLIP. I will summarize our discussion > > > with the problems and TODOs. We will update the FLIP and let you know > > once > > > the FLIP is ready for review again. 
> > > > > > 1) Investigate whether it is possible to update the existing > > GlobalWindows > > > in a backward-compatible way and re-use it for the same purpose > > > as EndOfStreamWindows, without introducing EndOfStreamWindows as a new > > > class. > > > > > > Note that GlobalWindows#getDefaultTrigger returns a NeverTrigger instance > > > which will not trigger window's computation even on end-of-inputs. We > > will > > > need to investigate its existing usage and see if we can re-use it in a > > > backward-compatible way. > > > > > > 2) Let JM know whether any operator in the upstream of the operator with > > > "isOutputOnEOF=true" will emit output via any side channel. The FLIP > > should > > > update the execution mode of those operators *only if* all outputs from > > > those operators are emitted only at the end of input. > > > > > > More specifically, the upstream operator might involve a user-defined > > > operator that might emit output directly to an external service, where > > the > > > emission operation is not explicitly expressed as an operator's output > > edge > > > and thus not visible to JM. Similarly, it is also possible for the > > > user-defined operator to register a timer > > > via InternalTimerService#registerEventTimeTimer and emit output to an > > > external service inside Triggerable#onEventTime. There is a chance that > > > users still need related logic to output data in real-time, even if the > > > downstream operators have isOutputOnEOF=true. > > > > > > One possible solution to address this problem is to add an extra > > > OperatorAttribute to specify whether this operator might output records > > in > > > such a way that does not go through operator's output (e.g. side output). > > > Then the JM can safely enable the runtime optimization currently > > described > > > in the FLIP when there is no such operator. 
> > > > > > 3) Create a follow-up FLIP that allows users to specify whether a source > > > with Boundedness=bounded should have isProcessingBacklog=true. > > > > > > This capability would effectively introduce a 3rd strategy to set backlog > > > status (in addition to FLIP-309 and FLIP-328). It might be useful to note > > > that, even though the data in bounded sources are backlog data in most > > > practical use-cases, it is not necessarily true. For example, users might > > > want to start a Flink job to consume real-time data from a Kafka topic > > and > > > specify that the job stops after 24 hours, which means the source is > > > technically bounded while the data is fresh/real-time. > > > > > > This capability is more generic and can cover more use-case than > > > EndOfStreamWindows. On the other hand, EndOfStreamWindows will still be > > > useful in cases where users already need to specify this window assigner > > in > > > a DataStream program, without bothering users to decide whether it is > > safe > > > to treat data in a bounded source as backlog data. > > > > > > > > > Regards, > > > Dong > > > > > > > > > > > > > > > > > > > > > On Mon, Sep 18, 2023
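The narrowed FLIP-331 scope discussed above boils down to a single operator attribute that the scheduler consults when choosing edge types. A hedged sketch follows; the names approximate the discussion (`OperatorAttributes`, `isOutputOnlyAfterEndOfStream`) rather than the final API, and the real scheduler decision involves far more than one string.

```java
// Toy model of the FLIP-331 attribute; names approximate the discussion.
public class OperatorAttributesSketch {

    record OperatorAttributes(boolean outputOnlyAfterEndOfStream) {}

    // Per the updated FLIP: only the output edge of the attributed operator
    // itself becomes blocking; upstream edges are left pipelined (this avoids
    // breaking upstream operators that emit via side channels).
    static String outputEdgeType(OperatorAttributes attrs) {
        return attrs.outputOnlyAfterEndOfStream() ? "BLOCKING" : "PIPELINED";
    }

    public static void main(String[] args) {
        System.out.println(outputEdgeType(new OperatorAttributes(true)));  // BLOCKING
        System.out.println(outputEdgeType(new OperatorAttributes(false))); // PIPELINED
    }
}
```

A blocking edge lets the scheduler free the upstream operator's resources before deploying the downstream task, which is the resource-utilization benefit the thread is after.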
Re: [DISCUSS] FLIP-408: [Umbrella] Introduce DataStream API V2
Thanks for working on this, Weijie. The design flaws of the current DataStream API (i.e., V1) have been a pain for a long time. It's great to see efforts going on trying to resolve them. Significant changes to such an important and comprehensive set of public APIs deserve caution. From that perspective, the ideas of introducing a new set of APIs that gradually replace the current one, splitting the introduction of the new APIs into many separate FLIPs, and making intermediate APIs @Experimental until all of them are completed make great sense to me. Besides, the ideas of generalized watermarks and execution hints sound quite interesting. Looking forward to more detailed discussions in the corresponding sub-FLIPs. +1 for the roadmap. Best, Xintong On Tue, Jan 30, 2024 at 11:00 AM weijie guo wrote: > Hi Wencong: > > > The Processing TimerService is currently > defined as one of the basic primitives, partly because it's understood that > you have to choose between processing time and event time. > The other part of the reason is that it needs to work based on the task's > mailbox thread model to avoid concurrency issues. Could you clarify the > second > part of the reason? > > Since the processing logic of the operators takes place in the mailbox > thread, the processing timer's callback function must also be executed in > the mailbox to ensure thread safety. > If we do not define the Processing TimerService as primitive, there is no > way for the user to dispatch custom logic to the mailbox thread. > > > Best regards, > > Weijie > > > Xuannan Su 于2024年1月29日周一 17:12写道: > > > Hi Weijie, > > > > Thanks for driving the work! There are indeed many pain points in the > > current DataStream API, which are challenging to resolve with its > > existing design. It is a great opportunity to propose a new DataStream > > API that tackles these issues. I like the way we've divided the FLIP > > into multiple sub-FLIPs; the roadmap is clear and comprehensible. 
+1 > > for the umbrella FLIP. I am eager to see the sub-FLIPs! > > > > Best regards, > > Xuannan > > > > > > > > > > On Wed, Jan 24, 2024 at 8:55 PM Wencong Liu > wrote: > > > > > > Hi Weijie, > > > > > > > > > Thank you for the effort you've put into the DataStream API ! By > > reorganizing and > > > redesigning the DataStream API, as well as addressing some of the > > unreasonable > > > designs within it, we can enhance the efficiency of job development for > > developers. > > > It also allows developers to design more flexible Flink jobs to meet > > business requirements. > > > > > > > > > I have conducted a comprehensive review of the DataStream API design in > > versions > > > 1.18 and 1.19. I found quite a few functional defects in the DataStream > > API, such as the > > > lack of corresponding APIs in batch processing scenarios. In the > > upcoming 1.20 version, > > > I will further improve the DataStream API in batch computing scenarios. > > > > > > > > > The issues existing in the old DataStream API (which can be referred to > > as V1) can be > > > addressed from a design perspective in the initial version of V2. I > hope > > to also have the > > > opportunity to participate in the development of DataStream V2 and > make > > my contribution. > > > > > > > > > Regarding FLIP-408, I have a question: The Processing TimerService is > > currently > > > defined as one of the basic primitives, partly because it's understood > > that > > > you have to choose between processing time and event time. > > > The other part of the reason is that it needs to work based on the > task's > > > mailbox thread model to avoid concurrency issues. Could you clarify the > > second > > > part of the reason? 
> > > > > > Best, > > > Wencong Liu > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > At 2023-12-26 14:42:20, "weijie guo" > wrote: > > > >Hi devs, > > > > > > > > > > > >I'd like to start a discussion about FLIP-408: [Umbrella] Introduce > > > >DataStream API V2 [1]. > > > > > > > > > > > >The DataStream API is one of the two main APIs that Flink provides for > > > >writing data processing programs. As an API that was introduced > > > >practically since day-1 of the project and has been evolved for nearly > > > >a decade, we are observing more and more problems of it. Improvements > > > >on these problems require significant breaking changes, which makes > > > >in-place refactor impractical. Therefore, we propose to introduce a > > > >new set of APIs, the DataStream API V2, to gradually replace the > > > >original DataStream API. > > > > > > > > > > > >The proposal to introduce a whole set new API is complex and includes > > > >massive changes. We are planning to break it down into multiple > > > >sub-FLIPs for incremental discussion. This FLIP is only used as an > > > >umbrella, mainly focusing on motivation, goals, and overall planning. > > > >That is to say, more design and implementation details will be > > > >discussed in
Re: [DISCUSS] FLIP-408: [Umbrella] Introduce DataStream API V2
Hi Wencong: > The Processing TimerService is currently defined as one of the basic primitives, partly because it's understood that you have to choose between processing time and event time. The other part of the reason is that it needs to work based on the task's mailbox thread model to avoid concurrency issues. Could you clarify the second part of the reason? Since the processing logic of the operators takes place in the mailbox thread, the processing timer's callback function must also be executed in the mailbox to ensure thread safety. If we do not define the Processing TimerService as primitive, there is no way for the user to dispatch custom logic to the mailbox thread. Best regards, Weijie Xuannan Su 于2024年1月29日周一 17:12写道: > Hi Weijie, > > Thanks for driving the work! There are indeed many pain points in the > current DataStream API, which are challenging to resolve with its > existing design. It is a great opportunity to propose a new DataStream > API that tackles these issues. I like the way we've divided the FLIP > into multiple sub-FLIPs; the roadmap is clear and comprehensible. +1 > for the umbrella FLIP. I am eager to see the sub-FLIPs! > > Best regards, > Xuannan > > > > > On Wed, Jan 24, 2024 at 8:55 PM Wencong Liu wrote: > > > > Hi Weijie, > > > > > > Thank you for the effort you've put into the DataStream API ! By > reorganizing and > > redesigning the DataStream API, as well as addressing some of the > unreasonable > > designs within it, we can enhance the efficiency of job development for > developers. > > It also allows developers to design more flexible Flink jobs to meet > business requirements. > > > > > > I have conducted a comprehensive review of the DataStream API design in > versions > > 1.18 and 1.19. I found quite a few functional defects in the DataStream > API, such as the > > lack of corresponding APIs in batch processing scenarios. 
In the > upcoming 1.20 version, > > I will further improve the DataStream API in batch computing scenarios. > > > > > > The issues existing in the old DataStream API (which can be referred to > as V1) can be > > addressed from a design perspective in the initial version of V2. I hope > to also have the > > opportunity to participate in the development of DataStream V2 and make > my contribution. > > > > > > Regarding FLIP-408, I have a question: The Processing TimerService is > currently > > defined as one of the basic primitives, partly because it's understood > that > > you have to choose between processing time and event time. > > The other part of the reason is that it needs to work based on the task's > > mailbox thread model to avoid concurrency issues. Could you clarify the > second > > part of the reason? > > > > Best, > > Wencong Liu > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > At 2023-12-26 14:42:20, "weijie guo" wrote: > > >Hi devs, > > > > > > > > >I'd like to start a discussion about FLIP-408: [Umbrella] Introduce > > >DataStream API V2 [1]. > > > > > > > > >The DataStream API is one of the two main APIs that Flink provides for > > >writing data processing programs. As an API that was introduced > > >practically since day-1 of the project and has been evolved for nearly > > >a decade, we are observing more and more problems of it. Improvements > > >on these problems require significant breaking changes, which makes > > >in-place refactor impractical. Therefore, we propose to introduce a > > >new set of APIs, the DataStream API V2, to gradually replace the > > >original DataStream API. > > > > > > > > >The proposal to introduce a whole set new API is complex and includes > > >massive changes. We are planning to break it down into multiple > > >sub-FLIPs for incremental discussion. This FLIP is only used as an > > >umbrella, mainly focusing on motivation, goals, and overall planning. 
> > >That is to say, more design and implementation details will be > > >discussed in other FLIPs. > > > > > > > > >Given that it's hard to imagine the detailed design of the new API if > > >we're just talking about this umbrella FLIP, and we probably won't be > > >able to give an opinion on it. Therefore, I have prepared two > > >sub-FLIPs [2][3] at the same time, and the discussion of them will be > > >posted later in separate threads. > > > > > > > > >Looking forward to hearing from you, thanks! > > > > > > > > >Best regards, > > > > > >Weijie > > > > > > > > > > > >[1] > > > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-408%3A+%5BUmbrella%5D+Introduce+DataStream+API+V2 > > > > > >[2] > > > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-409%3A+DataStream+V2+Building+Blocks%3A+DataStream%2C+Partitioning+and+ProcessFunction > > > > > > > > >[3] > > > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-410%3A++Config%2C+Context+and+Processing+Timer+Service+of+DataStream+API+V2 >
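Weijie's point above (processing-timer callbacks must be dispatched to the task's mailbox thread to stay thread-safe) can be illustrated with a stdlib-only toy model, independent of Flink's actual internals. The class name `MailboxSketch` and the two-action setup are illustrative assumptions, not Flink code:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Toy analogue of the mailbox model: all user logic runs on one thread. */
public class MailboxSketch {

    public static List<String> run() throws InterruptedException {
        BlockingQueue<Runnable> mailbox = new LinkedBlockingQueue<>();
        // Only ever touched from the mailbox thread, so no synchronization needed.
        List<String> executedOn = new ArrayList<>();

        // A processing-time "timer" fires on its own scheduler thread, but it
        // must not run the user callback there: it only enqueues it into the mailbox.
        ScheduledExecutorService timerService = Executors.newSingleThreadScheduledExecutor();
        timerService.schedule(
                () -> mailbox.add(() -> executedOn.add(Thread.currentThread().getName())),
                10, TimeUnit.MILLISECONDS);

        // Regular record processing is enqueued the same way.
        mailbox.add(() -> executedOn.add(Thread.currentThread().getName()));

        // A single mailbox thread drains the queue and executes both actions.
        Thread mailboxThread = new Thread(() -> {
            try {
                for (int i = 0; i < 2; i++) {
                    mailbox.take().run();
                }
            } catch (InterruptedException ignored) {
            }
        }, "mailbox-thread");
        mailboxThread.start();
        mailboxThread.join();
        timerService.shutdownNow();
        return executedOn;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(run()); // [mailbox-thread, mailbox-thread]
    }
}
```

Both the record action and the timer callback report the same thread name, which is the property the mailbox model guarantees.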
RE: [VOTE] Release flink-connector-jdbc, release candidate #2
+1(non-binding) - Release notes look good - Tag is present in Github - Validated checksum hash - Verified signature - Verified web PR and left minor comments Best, Jiabao On 2024/01/30 00:17:54 Sergey Nuyanzin wrote: > Hi everyone, > Please review and vote on the release candidate #2 for the version > 3.1.2, as follows: > [ ] +1, Approve the release > [ ] -1, Do not approve the release (please provide specific comments) > > This version is compatible with Flink 1.16.x, 1.17.x and 1.18.x. > > The complete staging area is available for your review, which includes: > * JIRA release notes [1], > * the official Apache source release to be deployed to dist.apache.org > [2], which are signed with the key with fingerprint > 1596BBF0726835D8 [3], > * all artifacts to be deployed to the Maven Central Repository [4], > * source code tag v3.1.2-rc2 [5], > * website pull request listing the new release [6]. > > The vote will be open for at least 72 hours. It is adopted by majority > approval, with at least 3 PMC affirmative votes. > > Thanks, > Release Manager > > [1] > https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522=12354088 > [2] > https://dist.apache.org/repos/dist/dev/flink/flink-connector-jdbc-3.1.2-rc2 > [3] https://dist.apache.org/repos/dist/release/flink/KEYS > [4] https://repository.apache.org/content/repositories/orgapacheflink-1704/ > [5] https://github.com/apache/flink-connector-jdbc/releases/tag/v3.1.2-rc2 > [6] https://github.com/apache/flink-web/pull/707 >
[jira] [Created] (FLINK-34269) Sync maven-shade-plugin across modules in flink-shaded
Sergey Nuyanzin created FLINK-34269: --- Summary: Sync maven-shade-plugin across modules in flink-shaded Key: FLINK-34269 URL: https://issues.apache.org/jira/browse/FLINK-34269 Project: Flink Issue Type: Technical Debt Components: BuildSystem / Shaded Affects Versions: shaded-18.0 Reporter: Sergey Nuyanzin Assignee: Sergey Nuyanzin -- This message was sent by Atlassian Jira (v8.20.10#820010)
[VOTE] Release flink-connector-jdbc, release candidate #2
Hi everyone, Please review and vote on the release candidate #2 for the version 3.1.2, as follows: [ ] +1, Approve the release [ ] -1, Do not approve the release (please provide specific comments) This version is compatible with Flink 1.16.x, 1.17.x and 1.18.x. The complete staging area is available for your review, which includes: * JIRA release notes [1], * the official Apache source release to be deployed to dist.apache.org [2], which are signed with the key with fingerprint 1596BBF0726835D8 [3], * all artifacts to be deployed to the Maven Central Repository [4], * source code tag v3.1.2-rc2 [5], * website pull request listing the new release [6]. The vote will be open for at least 72 hours. It is adopted by majority approval, with at least 3 PMC affirmative votes. Thanks, Release Manager [1] https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522=12354088 [2] https://dist.apache.org/repos/dist/dev/flink/flink-connector-jdbc-3.1.2-rc2 [3] https://dist.apache.org/repos/dist/release/flink/KEYS [4] https://repository.apache.org/content/repositories/orgapacheflink-1704/ [5] https://github.com/apache/flink-connector-jdbc/releases/tag/v3.1.2-rc2 [6] https://github.com/apache/flink-web/pull/707
[jira] [Created] (FLINK-34268) Add a test to verify if restore test exists for ExecNode
Bonnie Varghese created FLINK-34268: --- Summary: Add a test to verify if restore test exists for ExecNode Key: FLINK-34268 URL: https://issues.apache.org/jira/browse/FLINK-34268 Project: Flink Issue Type: Sub-task Components: Table SQL / Planner Reporter: Bonnie Varghese Assignee: Bonnie Varghese -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (FLINK-34267) Python connector test fails when running on MacBook with m1 processor
Aleksandr Pilipenko created FLINK-34267: --- Summary: Python connector test fails when running on MacBook with m1 processor Key: FLINK-34267 URL: https://issues.apache.org/jira/browse/FLINK-34267 Project: Flink Issue Type: Bug Components: Connectors / Common Environment: m1 MacBook Pro MacOS 14.2.1 Reporter: Aleksandr Pilipenko Attempt to execute lint_python.sh on m1 macbook fails while trying to install miniconda environment {code} =installing environment= installing wget... install wget... [SUCCESS] installing miniconda... download miniconda... download miniconda... [SUCCESS] installing conda... tail: illegal offset -- +018838: Invalid argument tail: illegal offset -- +018838: Invalid argument /Users/apilipenko/Dev/flink-connector-aws/flink-python/dev/download/miniconda.sh: line 353: /Users/apilipenko/Dev/flink-connector-aws/flink-python/dev/.conda/preconda.tar.bz2: No such file or directory upgrade pip... ./dev/lint-python.sh: line 215: /Users/apilipenko/Dev/flink-connector-aws/flink-python/dev/.conda/bin/python: No such file or directory upgrade pip... [SUCCESS] install conda ... [SUCCESS] install miniconda... [SUCCESS] installing python environment... installing python3.7... ./dev/lint-python.sh: line 247: /Users/apilipenko/Dev/flink-connector-aws/flink-python/dev/.conda/bin/conda: No such file or directory conda install 3.7 retrying 1/3 ./dev/lint-python.sh: line 254: /Users/apilipenko/Dev/flink-connector-aws/flink-python/dev/.conda/bin/conda: No such file or directory conda install 3.7 retrying 2/3 ./dev/lint-python.sh: line 254: /Users/apilipenko/Dev/flink-connector-aws/flink-python/dev/.conda/bin/conda: No such file or directory conda install 3.7 retrying 3/3 ./dev/lint-python.sh: line 254: /Users/apilipenko/Dev/flink-connector-aws/flink-python/dev/.conda/bin/conda: No such file or directory conda install 3.7 failed after retrying 3 times.You can retry to execute the script again. {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
Re: [VOTE] Release flink-connector-parent, release candidate #1
- Inspected the source for licenses and corresponding headers - Checksums and signature OK +1 (binding) On Tue, Jan 23, 2024 at 4:08 PM Etienne Chauchot wrote: > > Hi everyone, > > Please review and vote on the release candidate #1 for the version > 1.1.0, as follows: > > [ ] +1, Approve the release > [ ] -1, Do not approve the release (please provide specific comments) > > The complete staging area is available for your review, which includes: > * JIRA release notes [1], > * the official Apache source release to be deployed to dist.apache.org > [2], which are signed with the key with fingerprint > D1A76BA19D6294DD0033F6843A019F0B8DD163EA [3], > * all artifacts to be deployed to the Maven Central Repository [4], > * source code tag v1.1.0-rc1 [5], > * website pull request listing the new release [6] > > * confluence wiki: connector parent upgrade to version 1.1.0 that will > be validated after the artifact is released (there is no PR mechanism on > the wiki) [7] > > The vote will be open for at least 72 hours. It is adopted by majority > approval, with at least 3 PMC affirmative votes. > > Thanks, > > Etienne > > [1] > https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522=12353442 > [2] > https://dist.apache.org/repos/dist/dev/flink/flink-connector-parent-1.1.0-rc1 > [3] https://dist.apache.org/repos/dist/release/flink/KEYS > [4] https://repository.apache.org/content/repositories/orgapacheflink-1698/ > [5] > https://github.com/apache/flink-connector-shared-utils/releases/tag/v1.1.0-rc1 > [6] https://github.com/apache/flink-web/pull/717 > > [7] > https://cwiki.apache.org/confluence/display/FLINK/Externalized+Connector+development
[jira] [Created] (FLINK-34266) Output ratios should be computed over the whole metric window instead of averaged
Gyula Fora created FLINK-34266: -- Summary: Output ratios should be computed over the whole metric window instead of averaged Key: FLINK-34266 URL: https://issues.apache.org/jira/browse/FLINK-34266 Project: Flink Issue Type: Improvement Components: Autoscaler Reporter: Gyula Fora Currently, output ratios are computed during metric collection based on the current in/out metrics and stored as part of the collected metrics. During evaluation, the output ratios previously computed are then averaged together in the metric window. This, however, leads to incorrect computation due to the nature of the computation and averaging. Example: Let's look at a window operator that simply sorts and re-emits events in windows. During the window collection phase, the output ratio will be computed and stored as 0. When the window fires, the output ratio will be last_input_rate / window_size. Depending on the last input rate observation, the averaged value can be off in either direction. -- This message was sent by Atlassian Jira (v8.20.10#820010)
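The window-operator example in the ticket can be made concrete with a small stdlib-only sketch. The class `OutputRatioSketch` and the sample numbers are illustrative assumptions, not the autoscaler's actual code: the whole-window ratio is the true 1.0, while the averaged per-interval ratios are skewed by the last input rate.

```java
/** Toy numbers illustrating why averaging per-interval output ratios is biased. */
public class OutputRatioSketch {

    /** Ratio computed over the whole metric window: total out / total in. */
    public static double windowRatio(double[] in, double[] out) {
        double sumIn = 0, sumOut = 0;
        for (int i = 0; i < in.length; i++) {
            sumIn += in[i];
            sumOut += out[i];
        }
        return sumOut / sumIn;
    }

    /** Current behavior per the ticket: per-interval ratios averaged together. */
    public static double averagedRatio(double[] in, double[] out) {
        double sum = 0;
        for (int i = 0; i < in.length; i++) {
            sum += out[i] / in[i];
        }
        return sum / in.length;
    }

    public static void main(String[] args) {
        // A window operator buffers input and re-emits everything when the
        // window fires in the last interval; the last input rate is low.
        double[] in = {100, 100, 10};
        double[] out = {0, 0, 210};
        System.out.println(windowRatio(in, out));   // 1.0 (210 records out / 210 in)
        System.out.println(averagedRatio(in, out)); // 7.0 (skewed by last input rate)
    }
}
```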
[jira] [Created] (FLINK-34265) Add the doc of named parameters
Feng Jin created FLINK-34265: Summary: Add the doc of named parameters Key: FLINK-34265 URL: https://issues.apache.org/jira/browse/FLINK-34265 Project: Flink Issue Type: Sub-task Components: Documentation, Table SQL / Planner Reporter: Feng Jin -- This message was sent by Atlassian Jira (v8.20.10#820010)
RE: [VOTE] Release flink-connector-mongodb 1.1.0, release candidate #1
Thanks Leonard for driving this. +1(non-binding) - Release notes look good - Tag is present in Github - Validated checksum hash - Verified signature - Build the source with Maven by jdk8,11,17,21 - Verified web PR and left minor comments - Run a filter push down test by sql-client on Flink 1.18.1 and it works well Best, Jiabao On 2024/01/29 12:33:23 Leonard Xu wrote: > Hey all, > > Please help review and vote on the release candidate #1 for the version 1.1.0 > of the > Apache Flink MongoDB Connector as follows: > > [ ] +1, Approve the release > [ ] -1, Do not approve the release (please provide specific comments) > > The complete staging area is available for your review, which includes: > * JIRA release notes [1], > * The official Apache source release to be deployed to dist.apache.org [2], > which are signed with the key with fingerprint > 5B2F6608732389AEB67331F5B197E1F1108998AD [3], > * All artifacts to be deployed to the Maven Central Repository [4], > * Source code tag v.1.0-rc1 [5], > * Website pull request listing the new release [6]. > > The vote will be open for at least 72 hours. It is adopted by majority > approval, with at least 3 PMC affirmative votes. > > > Best, > Leonard > [1] > https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522=12353483 > [2] > https://dist.apache.org/repos/dist/dev/flink/flink-connector-mongodb-1.1.0-rc1/ > [3] https://dist.apache.org/repos/dist/release/flink/KEYS > [4] https://repository.apache.org/content/repositories/orgapacheflink-1702/ > [5] https://github.com/apache/flink-connector-mongodb/tree/v1.1.0-rc1 > [6] https://github.com/apache/flink-web/pull/719
[jira] [Created] (FLINK-34264) Prometheuspushgateway's hosturl can use basicAuth certification login
Yangzhou Huang created FLINK-34264: -- Summary: Prometheuspushgateway's hosturl can use basicAuth certification login Key: FLINK-34264 URL: https://issues.apache.org/jira/browse/FLINK-34264 Project: Flink Issue Type: Improvement Components: Runtime / Metrics Affects Versions: 1.19.0 Reporter: Yangzhou Huang Fix For: 1.19.0 The scenario is as follows: the Pushgateway is protected by Basic auth, but the current code does not support Basic auth credentials. hostUrl e.g.: [https://username:password@localhost:9091|https://username:password@localhost:9091/] If this proposal is approved, I will propose a PR improvement. -- This message was sent by Atlassian Jira (v8.20.10#820010)
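For reference, a sketch of how credentials embedded in such a hostUrl could be turned into an `Authorization` header using only the JDK. `PushgatewayAuthSketch` is a hypothetical name and this is not the actual reporter code, just the general technique:

```java
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

/** Sketch: derive a Basic auth header from userinfo embedded in the host URL. */
public class PushgatewayAuthSketch {

    public static String basicAuthHeader(String hostUrl) throws Exception {
        URL url = new URL(hostUrl);
        String userInfo = url.getUserInfo(); // "username:password", or null if absent
        if (userInfo == null) {
            return null; // no credentials in the URL, no header needed
        }
        return "Basic "
                + Base64.getEncoder()
                        .encodeToString(userInfo.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws Exception {
        System.out.println(basicAuthHeader("https://username:password@localhost:9091"));
        // Basic dXNlcm5hbWU6cGFzc3dvcmQ=
    }
}
```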
Re: Re: [VOTE] FLIP-417: Expose JobManagerOperatorMetrics via REST API
+1 (binding) On Mon, Jan 29, 2024 at 5:45 AM Maximilian Michels wrote: > +1 (binding) > > On Fri, Jan 26, 2024 at 6:03 AM Rui Fan <1996fan...@gmail.com> wrote: > > > > +1(binding) > > > > Best, > > Rui > > > > On Fri, Jan 26, 2024 at 11:55 AM Xuyang wrote: > > > > > +1 (non-binding) > > > > > > > > > -- > > > > > > Best! > > > Xuyang > > > > > > > > > > > > > > > > > > 在 2024-01-26 10:12:34,"Hang Ruan" 写道: > > > >Thanks for the FLIP. > > > > > > > >+1 (non-binding) > > > > > > > >Best, > > > >Hang > > > > > > > >Mason Chen 于2024年1月26日周五 04:51写道: > > > > > > > >> Hi Devs, > > > >> > > > >> I would like to start a vote on FLIP-417: Expose > > > JobManagerOperatorMetrics > > > >> via REST API [1] which has been discussed in this thread [2]. > > > >> > > > >> The vote will be open for at least 72 hours unless there is an > > > objection or > > > >> not enough votes. > > > >> > > > >> [1] > > > >> > > > >> > > > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-417%3A+Expose+JobManagerOperatorMetrics+via+REST+API > > > >> [2] > https://lists.apache.org/thread/tt0hf6kf5lcxd7g62v9dhpn3z978pxw0 > > > >> > > > >> Best, > > > >> Mason > > > >> > > > >
[jira] [Created] (FLINK-34263) Converting double to decimal may fail
Caican Cai created FLINK-34263: -- Summary: Converting double to decimal may fail Key: FLINK-34263 URL: https://issues.apache.org/jira/browse/FLINK-34263 Project: Flink Issue Type: Bug Components: API / Core Affects Versions: 1.18.1 Reporter: Caican Cai Fix For: 1.18.1 Converting a double to decimal fails when the value is infinity: an error is reported when the cast is evaluated. Reproducer:
{code:java}
/*
 * Licensed to the Apache Software Foundation (ASF) under one
 * or more contributor license agreements. See the NOTICE file
 * distributed with this work for additional information
 * regarding copyright ownership. The ASF licenses this file
 * to you under the Apache License, Version 2.0 (the
 * "License"); you may not use this file except in compliance
 * with the License. You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package org.apache.flink.table.examples.java.basics;

import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

/** The famous word count example that shows a minimal Flink SQL job in batch execution mode. */
public final class WordCountSQLExample {

    public static void main(String[] args) throws Exception {
        // set up the Table API
        final EnvironmentSettings settings =
                EnvironmentSettings.newInstance().inBatchMode().build();
        final TableEnvironment tableEnv = TableEnvironment.create(settings);

        // execute a Flink SQL job and print the result locally
        tableEnv.executeSql(
                        // define the aggregation
                        "SELECT CAST(power(0,-3) AS DECIMAL), word, SUM(frequency) AS `count`\n"
                                // read from an artificial fixed-size table with rows and columns
                                + "FROM (\n"
                                + " VALUES ('Hello', 1), ('Ciao', 1), ('Hello', 2)\n"
                                + ")\n"
                                // name the table and its columns
                                + "AS WordTable(word, frequency)\n"
                                // group for aggregation
                                + "GROUP BY word")
                .print();
    }
}
{code}
-- This message was sent by Atlassian Jira (v8.20.10#820010)
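A plausible root cause, shown with a plain-JDK sketch (the class `DecimalCastSketch` is illustrative, not Flink code): `power(0, -3)` evaluates to positive infinity, and `java.math.BigDecimal` has no representation for infinite values, so constructing one throws `NumberFormatException`:

```java
import java.math.BigDecimal;

/** power(0, -3) evaluates to Infinity, which has no BigDecimal representation. */
public class DecimalCastSketch {

    public static String tryConvert(double value) {
        try {
            return new BigDecimal(value).toPlainString();
        } catch (NumberFormatException e) {
            return "failed: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        double v = Math.pow(0, -3); // positive infinity
        System.out.println(Double.isInfinite(v)); // true
        System.out.println(tryConvert(v));        // failed: <exception message>
        System.out.println(tryConvert(1.5));      // 1.5
    }
}
```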
[VOTE] Release flink-connector-mongodb 1.1.0, release candidate #1
Hey all, Please help review and vote on the release candidate #1 for the version 1.1.0 of the Apache Flink MongoDB Connector as follows: [ ] +1, Approve the release [ ] -1, Do not approve the release (please provide specific comments) The complete staging area is available for your review, which includes: * JIRA release notes [1], * The official Apache source release to be deployed to dist.apache.org [2], which are signed with the key with fingerprint 5B2F6608732389AEB67331F5B197E1F1108998AD [3], * All artifacts to be deployed to the Maven Central Repository [4], * Source code tag v1.1.0-rc1 [5], * Website pull request listing the new release [6]. The vote will be open for at least 72 hours. It is adopted by majority approval, with at least 3 PMC affirmative votes. Best, Leonard [1] https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522=12353483 [2] https://dist.apache.org/repos/dist/dev/flink/flink-connector-mongodb-1.1.0-rc1/ [3] https://dist.apache.org/repos/dist/release/flink/KEYS [4] https://repository.apache.org/content/repositories/orgapacheflink-1702/ [5] https://github.com/apache/flink-connector-mongodb/tree/v1.1.0-rc1 [6] https://github.com/apache/flink-web/pull/719
Re: [DISCUSS] FLIP-410: Config, Context and Processing Timer Service of DataStream API V2
Hi Weijie, Thanks for the FLIP! I have a few questions regarding the FLIP. 1. +1 to only use XXXParititionStream if users only need to use the configurable PartitionStream. If there are use cases for both, perhaps we could use `ProcessConfigurableNonKeyedPartitionStream` or `ConfigurableNonKeyedPartitionStream` for simplicity. 2. Should we allow users to set custom configurations through the `ProcessConfigurable` interface and access these configurations in the `ProcessFunction` via `RuntimeContext`? I believe it would be useful for process function developers to be able to define custom configurations. 3. How can users define custom metrics within the `ProcessFunction`? Will there be a method like `getMetricGroup` available in the `RuntimeContext`? Best, Xuannan On Fri, Jan 26, 2024 at 2:38 PM Yunfeng Zhou wrote: > > Hi Weijie, > > Thanks for introducing this FLIP! I have a few questions about the > designs proposed. > > 1. Would it be better to have all XXXPartitionStream classes implement > ProcessConfigurable, instead of defining both XXXPartitionStream and > ProcessConfigurableAndXXXPartitionStream? I wonder whether users would > need to operate on a non-configurable PartitionStream. > > 2. The name "ProcessConfigurable" seems a little ambiguous to me. Will > there be classes other than XXXPartitionStream that implement this > interface? Will "Process" be accurate enough to describe > PartitionStream and those classes? > > 3. Apart from the detailed withConfigFoo(foo)/withConfigBar(bar) > methods, would it be better to also add a general > withConfig(configKey, configValue) method to the ProcessConfigurable > interface? Adding a method for each configuration might harm the > readability and compatibility of configurations. > > Looking forward to your response. 
> > Best regards, > Yunfeng Zhou > > On Tue, Dec 26, 2023 at 2:47 PM weijie guo wrote: > > > > Hi devs, > > > > > > I'd like to start a discussion about FLIP-410: Config, Context and > > Processing Timer Service of DataStream API V2 [1]. This is the second > > sub-FLIP of DataStream API V2. > > > > > > In FLIP-409 [2], we have defined the most basic primitive of > > DataStream V2. On this basis, this FLIP will further answer several > > important questions closely related to it: > > > >1. > >How to configure the processing over the datastreams, such as > > setting the parallelism. > >2. > >How to get access to the runtime contextual information and > > services from inside the process functions. > >3. How to work with processing-time timers. > > > > You can find more details in this FLIP. Its relationship with other > > sub-FLIPs can be found in the umbrella FLIP > > [3]. > > > > > > Looking forward to hearing from you, thanks! > > > > > > Best regards, > > > > Weijie > > > > > > [1] > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-410%3A++Config%2C+Context+and+Processing+Timer+Service+of+DataStream+API+V2 > > > > [2] > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-409%3A+DataStream+V2+Building+Blocks%3A+DataStream%2C+Partitioning+and+ProcessFunction > > > > [3] > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-408%3A+%5BUmbrella%5D+Introduce+DataStream+API+V2
Re: Re: [VOTE] FLIP-417: Expose JobManagerOperatorMetrics via REST API
+1 (binding) On Fri, Jan 26, 2024 at 6:03 AM Rui Fan <1996fan...@gmail.com> wrote: > > +1(binding) > > Best, > Rui > > On Fri, Jan 26, 2024 at 11:55 AM Xuyang wrote: > > > +1 (non-binding) > > > > > > -- > > > > Best! > > Xuyang > > > > > > > > > > > > 在 2024-01-26 10:12:34,"Hang Ruan" 写道: > > >Thanks for the FLIP. > > > > > >+1 (non-binding) > > > > > >Best, > > >Hang > > > > > >Mason Chen 于2024年1月26日周五 04:51写道: > > > > > >> Hi Devs, > > >> > > >> I would like to start a vote on FLIP-417: Expose > > JobManagerOperatorMetrics > > >> via REST API [1] which has been discussed in this thread [2]. > > >> > > >> The vote will be open for at least 72 hours unless there is an > > objection or > > >> not enough votes. > > >> > > >> [1] > > >> > > >> > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-417%3A+Expose+JobManagerOperatorMetrics+via+REST+API > > >> [2] https://lists.apache.org/thread/tt0hf6kf5lcxd7g62v9dhpn3z978pxw0 > > >> > > >> Best, > > >> Mason > > >> > >
Re: [DISCUSS] FLIP-409: DataStream V2 Building Blocks: DataStream, Partitioning and ProcessFunction
Hi Weijie,

Thank you for driving the design of the new DataStream API. I have a few questions regarding the FLIP:

1. In the partitioning section, it says that "broadcast can only be used as a side-input of other Inputs." Could you clarify what is meant by "side-input"? If I understand correctly, it refers to one of the inputs of the `TwoInputStreamProcessFunction`. If that is the case, the term "side-input" may not be accurate.

2. Is there a particular reason we do not support a `TwoInputProcessFunction` that combines a KeyedStream with a BroadcastStream to produce a KeyedStream? There seems to be a valid use case where a KeyedStream is enriched with a BroadcastStream and returns a stream that is partitioned in the same way.

3. There appears to be a typo in the example code. The `SingleStreamProcessFunction` should probably be `OneInputStreamProcessFunction`.

4. How do we set the global configuration for the ExecutionEnvironment? Currently, we have the StreamExecutionEnvironment.getExecutionEnvironment(Configuration) method to provide the global configuration in the API.

5. I noticed that there are two `collect` methods in the Collector, one with a timestamp and one without. Could you elaborate on the differences between them? Additionally, in what use case would one use the method that includes the timestamp?

Best regards,
Xuannan

On Fri, Jan 26, 2024 at 2:21 PM Yunfeng Zhou wrote:
>
> Hi Weijie,
>
> Thanks for raising discussions about the new DataStream API. I have a
> few questions about the content of the FLIP.
>
> 1. Will we provide any API to support choosing which input to consume
> between the two inputs of TwoInputStreamProcessFunction? It would be
> helpful in online machine learning cases, where a process function
> needs to receive the first machine learning model before it can start
> predictions on input data. Similar requirements might also exist in
> Flink CEP, where a rule set needs to be consumed by the process
> function first before it can start matching the event stream against
> CEP patterns.
>
> 2. A typo might exist in the current FLIP describing the API to
> generate a global stream, as I can see either global() or coalesce()
> in different places of the FLIP. These two methods might need to be
> unified into one method.
>
> 3. The order of parameters in the current ProcessFunction is (record,
> context, output), while this FLIP proposes to change the order into
> (record, output, context). Is there any reason to make this change?
>
> 4. Why does this FLIP propose to use connectAndProcess() instead of
> connect() (+ keyBy()) + process()? The latter looks simpler to me.
>
> Looking forward to discussing these questions with you.
>
> Best regards,
> Yunfeng Zhou
>
> On Tue, Dec 26, 2023 at 2:44 PM weijie guo wrote:
> >
> > Hi devs,
> >
> > I'd like to start a discussion about FLIP-409: DataStream V2 Building
> > Blocks: DataStream, Partitioning and ProcessFunction [1].
> >
> > As the first sub-FLIP for DataStream API V2, we'd like to discuss and
> > try to answer some of the most fundamental questions in stream
> > processing:
> >
> > 1. What kinds of data streams do we have?
> > 2. How to partition data over the streams?
> > 3. How to define a processing on the data stream?
> >
> > The answers to these questions involve three core concepts: DataStream,
> > Partitioning and ProcessFunction. In this FLIP, we will discuss the
> > definitions and related API primitives of these concepts in detail.
> >
> > You can find more details in FLIP-409 [1]. This sub-FLIP is at the
> > heart of the entire DataStream API V2, and its relationship with other
> > sub-FLIPs can be found in the umbrella FLIP [2].
> >
> > Looking forward to hearing from you, thanks!
> >
> > Best regards,
> >
> > Weijie
> >
> > [1]
> > https://cwiki.apache.org/confluence/display/FLINK/FLIP-409%3A+DataStream+V2+Building+Blocks%3A+DataStream%2C+Partitioning+and+ProcessFunction
> >
> > [2]
> > https://cwiki.apache.org/confluence/display/FLINK/FLIP-408%3A+%5BUmbrella%5D+Introduce+DataStream+API+V2
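To make the API questions above concrete, here is a minimal self-contained sketch (not the real Flink API — all interface names and signatures are illustrative stand-ins modeled on the FLIP's terminology) of the proposed (record, output, context) parameter order, the two `collect` variants on the Collector, and a two-input function that enriches a keyed main input with a broadcast side input:

```java
import java.util.ArrayList;
import java.util.List;

public class ProcessFunctionSketch {

    /** Stand-in for the FLIP's Collector, with and without a timestamp. */
    interface Output<OUT> {
        void collect(OUT record);
        void collect(OUT record, long timestamp); // timestamped variant
    }

    /** Stand-in for the runtime context passed to each invocation. */
    interface PartitionedContext { }

    /** The FLIP's proposed ordering: (record, output, context). */
    interface OneInputStreamProcessFunction<IN, OUT> {
        void processRecord(IN record, Output<OUT> output, PartitionedContext ctx);
    }

    /** Two inputs, e.g. a keyed main input plus a broadcast side input. */
    interface TwoInputStreamProcessFunction<IN1, IN2, OUT> {
        void processRecordFromFirstInput(IN1 record, Output<OUT> output, PartitionedContext ctx);
        void processRecordFromSecondInput(IN2 record, Output<OUT> output, PartitionedContext ctx);
    }

    public static void main(String[] args) {
        List<String> sink = new ArrayList<>();
        Output<String> out = new Output<String>() {
            public void collect(String r) { sink.add(r); }
            public void collect(String r, long ts) { sink.add(r + "@" + ts); }
        };

        TwoInputStreamProcessFunction<String, String, String> enrich =
            new TwoInputStreamProcessFunction<String, String, String>() {
                String rule = "default";
                public void processRecordFromFirstInput(String record, Output<String> o, PartitionedContext c) {
                    o.collect(record + "/" + rule);  // main (keyed) input, enriched
                }
                public void processRecordFromSecondInput(String record, Output<String> o, PartitionedContext c) {
                    rule = record;                   // broadcast side input updates local state
                }
            };

        enrich.processRecordFromSecondInput("ruleV2", out, null);
        enrich.processRecordFromFirstInput("eventA", out, null);
        System.out.println(sink); // [eventA/ruleV2]
    }
}
```

The enrichment function never repartitions its main input, which is why question 2 above asks whether the output could keep the KeyedStream property.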
Re: [VOTE] FLIP-418: Show data skew score on Flink Dashboard
At 2024-01-29 18:09:10, "Kartoglu, Emre" wrote:
> Hello,
>
> I'd like to call votes on FLIP-418: Show data skew score on Flink Dashboard.
>
> FLIP:
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-418%3A+Show+data+skew+score+on+Flink+Dashboard
> Discussion: https://lists.apache.org/thread/m5ockoork0h2zr78h77dcrn71rbt35ql
>
> Kind regards,
> Emre
[VOTE] FLIP-418: Show data skew score on Flink Dashboard
Hello,

I'd like to call votes on FLIP-418: Show data skew score on Flink Dashboard.

FLIP: https://cwiki.apache.org/confluence/display/FLINK/FLIP-418%3A+Show+data+skew+score+on+Flink+Dashboard
Discussion: https://lists.apache.org/thread/m5ockoork0h2zr78h77dcrn71rbt35ql

Kind regards,
Emre
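For readers unfamiliar with the proposal: the exact skew-score formula is defined in FLIP-418 itself. As a rough illustration only, one plausible way to quantify skew from the per-subtask record counts the dashboard already exposes (this is a hypothetical metric, not necessarily the FLIP's formula) is the maximum deviation from the mean, expressed as a percentage of the mean:

```java
public class SkewScoreSketch {

    // Hypothetical skew metric: max deviation of per-subtask record counts
    // from their mean, as a percentage of the mean. 0 = perfectly balanced.
    static double skewScore(long[] recordsPerSubtask) {
        double mean = 0;
        for (long n : recordsPerSubtask) mean += n;
        mean /= recordsPerSubtask.length;
        if (mean == 0) return 0;
        double maxDev = 0;
        for (long n : recordsPerSubtask) maxDev = Math.max(maxDev, Math.abs(n - mean));
        return 100.0 * maxDev / mean;
    }

    public static void main(String[] args) {
        System.out.println(skewScore(new long[] {100, 100, 100, 100})); // balanced: 0.0
        System.out.println(skewScore(new long[] {400, 0, 0, 0}));       // skewed: 300.0
    }
}
```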
[jira] [Created] (FLINK-34262) Support to set user specified labels for Rest Service Object
Prabhu Joseph created FLINK-34262:
-------------------------------------

Summary: Support to set user specified labels for Rest Service Object
Key: FLINK-34262
URL: https://issues.apache.org/jira/browse/FLINK-34262
Project: Flink
Issue Type: Improvement
Components: Deployment / Kubernetes
Affects Versions: 1.18.1
Reporter: Prabhu Joseph

Flink allows users to label JM and TM pods; the rest service object also requires labeling.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
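For context, a sketch of what such a configuration might look like next to the existing pod-label options. The `kubernetes.rest-service.labels` key below is the proposed addition and its name is an assumption, not an existing option:

```yaml
# Existing options: label the JobManager and TaskManager pods.
kubernetes.jobmanager.labels: team:data-platform,env:prod
kubernetes.taskmanager.labels: team:data-platform,env:prod
# Hypothetical key for this improvement (name not final): label the
# rest Service object as well.
kubernetes.rest-service.labels: team:data-platform,env:prod
```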
[jira] [Created] (FLINK-34261) When slotmanager.redundant-taskmanager-num is set to a value greater than 1, redundant task managers may be repeatedly released and requested
junzhong qin created FLINK-34261:
------------------------------------

Summary: When slotmanager.redundant-taskmanager-num is set to a value greater than 1, redundant task managers may be repeatedly released and requested
Key: FLINK-34261
URL: https://issues.apache.org/jira/browse/FLINK-34261
Project: Flink
Issue Type: Bug
Reporter: junzhong qin
Attachments: image-2024-01-29-17-29-15-453.png

Redundant task managers are extra task managers started by Flink to speed up job recovery in case of failures due to task manager loss. But when we configure

{code:java}
slotmanager.redundant-taskmanager-num: 2 // any value greater than 1{code}

Flink will release and request redundant TMs repeatedly. We can reproduce this situation with the [Flink Kubernetes Operator (using minikube here)|https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/]. Here is an example yaml file:

{code:yaml}
# redundant-tm.yaml
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
#   http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

apiVersion: flink.apache.org/v1beta1
kind: FlinkDeployment
metadata:
  name: redundant-tm
spec:
  image: flink:1.18
  flinkVersion: v1_18
  flinkConfiguration:
    taskmanager.numberOfTaskSlots: "2"
    slotmanager.redundant-taskmanager-num: "2"
    cluster.fine-grained-resource-management.enabled: "false"
  serviceAccount: flink
  jobManager:
    resource:
      memory: "1024m"
      cpu: 1
  taskManager:
    resource:
      memory: "1024m"
      cpu: 1
  job:
    jarURI: local:///opt/flink/examples/streaming/StateMachineExample.jar
    parallelism: 3
    upgradeMode: stateless
  logConfiguration:
    log4j-console.properties: |+
      rootLogger.level = DEBUG
      rootLogger.appenderRef.file.ref = LogFile
      rootLogger.appenderRef.console.ref = LogConsole
      appender.file.name = LogFile
      appender.file.type = File
      appender.file.append = false
      appender.file.fileName = ${sys:log.file}
      appender.file.layout.type = PatternLayout
      appender.file.layout.pattern = %d{yyyy-MM-dd HH:mm:ss,SSS} %-5p %-60c %x - %m%n{code}

After executing:

{code:java}
kubectl create -f redundant-tm.yaml
kubectl port-forward svc/redundant-tm 8081{code}

we can see the redundant TM being repeatedly released and requested in the JM's log:

{code:java}
2024-01-29 09:26:25,033 INFO  org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - Registering TaskManager with ResourceID redundant-tm-taskmanager-1-4 (pekko.tcp://flink@10.244.1.196:6122/user/rpc/taskmanager_0) at ResourceManager
2024-01-29 09:26:25,060 DEBUG org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager [] - Registering task executor redundant-tm-taskmanager-1-4 under 44c649b2d84e87cdd5e6c53971f8b877 at the slot manager.
2024-01-29 09:26:25,061 INFO  org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - Worker redundant-tm-taskmanager-1-4 is registered.
2024-01-29 09:26:25,061 INFO  org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - Worker redundant-tm-taskmanager-1-4 with resource spec WorkerResourceSpec {cpuCores=1.0, taskHeapSize=25.600mb (26843542 bytes), taskOffHeapSize=0 bytes, networkMemSize=64.000mb (67108864 bytes), managedMemSize=230.400mb (241591914 bytes), numSlots=2} was requested in current attempt. Current pending count after registering: 1.
2024-01-29 09:26:25,196 DEBUG org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - Update resource declarations to [ResourceDeclaration{spec=WorkerResourceSpec {cpuCores=1.0, taskHeapSize=25.600mb (26843542 bytes), taskOffHeapSize=0 bytes, networkMemSize=64.000mb (67108864 bytes), managedMemSize=230.400mb (241591914 bytes), numSlots=2}, numNeeded=3, unwantedWorkers=[]}].
2024-01-29 09:26:25,196 INFO  org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - need release 1 workers, current worker number 4, declared worker number 3
2024-01-29 09:26:25,199 INFO
{code}
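A toy model of the kind of feedback loop the log shows ("current worker number 4, declared worker number 3"). This only illustrates one plausible mechanism and is not the confirmed root cause, which is tracked in FLINK-34261: if one code path sizes the redundant capacity in slots while another sizes it in whole TaskManagers, the two disagree whenever redundant-taskmanager-num > 1, so the manager keeps releasing and re-requesting a worker.

```java
public class RedundantTmOscillationSketch {

    // Hypothetical accounting #1: treat the redundancy setting as extra
    // slots and round up to whole TaskManagers.
    static int neededCountingRedundantSlots(int jobSlots, int redundant, int slotsPerTm) {
        return (int) Math.ceil((jobSlots + redundant) / (double) slotsPerTm);
    }

    // Hypothetical accounting #2: treat the redundancy setting as extra
    // whole TaskManagers on top of what the job needs.
    static int neededCountingRedundantTms(int jobSlots, int redundant, int slotsPerTm) {
        return (int) Math.ceil(jobSlots / (double) slotsPerTm) + redundant;
    }

    public static void main(String[] args) {
        // The example deployment: parallelism 3 (3 slots needed),
        // 2 slots per TM, redundant-taskmanager-num = 2.
        int jobSlots = 3, redundant = 2, slotsPerTm = 2;
        int declared = neededCountingRedundantSlots(jobSlots, redundant, slotsPerTm); // 3
        int target = neededCountingRedundantTms(jobSlots, redundant, slotsPerTm);     // 4
        // With declared=3 but target=4, one worker is released and then
        // immediately requested again, forever.
        System.out.println("declared=" + declared + " target=" + target);
    }
}
```

With redundant-taskmanager-num = 1 the two formulas happen to agree here, which would explain why the oscillation only appears for values greater than 1.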
Re: [DISCUSS] FLIP-408: [Umbrella] Introduce DataStream API V2
Hi Weijie,

Thanks for driving the work! There are indeed many pain points in the current DataStream API, which are challenging to resolve with its existing design. It is a great opportunity to propose a new DataStream API that tackles these issues. I like the way we've divided the FLIP into multiple sub-FLIPs; the roadmap is clear and comprehensible.

+1 for the umbrella FLIP. I am eager to see the sub-FLIPs!

Best regards,
Xuannan

On Wed, Jan 24, 2024 at 8:55 PM Wencong Liu wrote:
>
> Hi Weijie,
>
> Thank you for the effort you've put into the DataStream API! By
> reorganizing and redesigning the DataStream API, as well as addressing
> some of its unreasonable designs, we can enhance the efficiency of job
> development for developers. It also allows developers to design more
> flexible Flink jobs to meet business requirements.
>
> I have conducted a comprehensive review of the DataStream API design in
> versions 1.18 and 1.19. I found quite a few functional defects in the
> DataStream API, such as the lack of corresponding APIs in batch
> processing scenarios. In the upcoming 1.20 version, I will further
> improve the DataStream API in batch computing scenarios.
>
> The issues existing in the old DataStream API (which can be referred to
> as V1) can be addressed from a design perspective in the initial version
> of V2. I hope to also have the opportunity to participate in the
> development of DataStream V2 and make my contribution.
>
> Regarding FLIP-408, I have a question: The Processing TimerService is
> currently defined as one of the basic primitives, partly because it's
> understood that you have to choose between processing time and event
> time. The other part of the reason is that it needs to work based on the
> task's mailbox thread model to avoid concurrency issues. Could you
> clarify the second part of the reason?
>
> Best,
> Wencong Liu
>
> At 2023-12-26 14:42:20, "weijie guo" wrote:
> > Hi devs,
> >
> > I'd like to start a discussion about FLIP-408: [Umbrella] Introduce
> > DataStream API V2 [1].
> >
> > The DataStream API is one of the two main APIs that Flink provides for
> > writing data processing programs. As an API that has existed
> > practically since day 1 of the project and has evolved for nearly a
> > decade, we are observing more and more problems with it. Improvements
> > on these problems require significant breaking changes, which makes an
> > in-place refactor impractical. Therefore, we propose to introduce a
> > new set of APIs, the DataStream API V2, to gradually replace the
> > original DataStream API.
> >
> > The proposal to introduce a whole new set of APIs is complex and
> > includes massive changes. We are planning to break it down into
> > multiple sub-FLIPs for incremental discussion. This FLIP is only used
> > as an umbrella, mainly focusing on motivation, goals, and overall
> > planning. That is to say, more design and implementation details will
> > be discussed in other FLIPs.
> >
> > Given that it's hard to imagine the detailed design of the new API if
> > we're just talking about this umbrella FLIP, we probably won't be able
> > to form an opinion on it. Therefore, I have prepared two sub-FLIPs
> > [2][3] at the same time, and their discussions will be posted later in
> > separate threads.
> >
> > Looking forward to hearing from you, thanks!
> >
> > Best regards,
> >
> > Weijie
> >
> > [1]
> > https://cwiki.apache.org/confluence/display/FLINK/FLIP-408%3A+%5BUmbrella%5D+Introduce+DataStream+API+V2
> >
> > [2]
> > https://cwiki.apache.org/confluence/display/FLINK/FLIP-409%3A+DataStream+V2+Building+Blocks%3A+DataStream%2C+Partitioning+and+ProcessFunction
> >
> > [3]
> > https://cwiki.apache.org/confluence/display/FLINK/FLIP-410%3A++Config%2C+Context+and+Processing+Timer+Service+of+DataStream+API+V2
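On the mailbox question in the thread above: the core idea is that record processing and timer callbacks are both enqueued as actions and drained by a single task thread, so user code never sees them run concurrently. A minimal self-contained sketch of that idea (illustrative only — not Flink's actual runtime classes):

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class MailboxSketch {
    public static void main(String[] args) {
        BlockingQueue<Runnable> mailbox = new LinkedBlockingQueue<>();
        StringBuilder state = new StringBuilder(); // shared state, no locks needed

        // Records and timer firings are enqueued as "mail", possibly by
        // different producer threads (network threads, timer service)...
        mailbox.add(() -> state.append("record1;"));
        mailbox.add(() -> state.append("timer@100;")); // a timer callback
        mailbox.add(() -> state.append("record2;"));

        // ...but all mail is drained by the single task thread, so the
        // record handler and the timer callback can never run concurrently.
        Thread taskThread = new Thread(() -> {
            Runnable mail;
            while ((mail = mailbox.poll()) != null) {
                mail.run();
            }
        });
        taskThread.start();
        try {
            taskThread.join();
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        }

        System.out.println(state); // record1;timer@100;record2;
    }
}
```

This is presumably why the Processing TimerService is tied to the task's mailbox model: firing timers through the mailbox is what spares user functions from synchronizing timer callbacks against record processing.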
[DISCUSS] Drop support for HBase v1
Hi all,

While working on adding support for Flink 1.19 for HBase, we've run into a dependency convergence issue because HBase v1 relies on a really old version of Guava. HBase v2 has been available since May 2018, and there have been no new releases of HBase v1 since August 2022.

I would like to propose that the Flink HBase connector drops support for HBase v1 and only continues to support HBase v2 in the future. I don't think this requires a full FLIP and vote, but I do want to start a discussion thread for this.

Best regards,
Martijn
[jira] [Created] (FLINK-34260) Make flink-connector-aws compatible with SinkV2 changes
Martijn Visser created FLINK-34260:
--------------------------------------

Summary: Make flink-connector-aws compatible with SinkV2 changes
Key: FLINK-34260
URL: https://issues.apache.org/jira/browse/FLINK-34260
Project: Flink
Issue Type: Bug
Components: Connectors / AWS
Affects Versions: aws-connector-4.3.0
Reporter: Martijn Visser

https://github.com/apache/flink-connector-aws/actions/runs/7689300085/job/20951547366#step:9:798

{code:java}
Error: Failed to execute goal org.apache.maven.plugins:maven-compiler-plugin:3.8.0:testCompile (default-testCompile) on project flink-connector-dynamodb: Compilation failure
Error: /home/runner/work/flink-connector-aws/flink-connector-aws/flink-connector-aws/flink-connector-dynamodb/src/test/java/org/apache/flink/connector/dynamodb/sink/DynamoDbSinkWriterTest.java:[357,40] incompatible types: org.apache.flink.connector.base.sink.writer.TestSinkInitContext cannot be converted to org.apache.flink.api.connector.sink2.Sink.InitContext
{code}
[jira] [Created] (FLINK-34259) flink-connector-jdbc fails to compile with NPE on hasGenericTypesDisabled
Martijn Visser created FLINK-34259:
--------------------------------------

Summary: flink-connector-jdbc fails to compile with NPE on hasGenericTypesDisabled
Key: FLINK-34259
URL: https://issues.apache.org/jira/browse/FLINK-34259
Project: Flink
Issue Type: Bug
Components: Connectors / JDBC
Reporter: Martijn Visser

https://github.com/apache/flink-connector-jdbc/actions/runs/7682035724/job/20935884874#step:14:150

{code:java}
Error: Tests run: 10, Failures: 5, Errors: 4, Skipped: 0, Time elapsed: 7.909 s <<< FAILURE! - in org.apache.flink.connector.jdbc.JdbcRowOutputFormatTest
Error: org.apache.flink.connector.jdbc.JdbcRowOutputFormatTest.testInvalidConnectionInJdbcOutputFormat Time elapsed: 3.254 s <<< ERROR!
java.lang.NullPointerException: Cannot invoke "org.apache.flink.api.common.serialization.SerializerConfig.hasGenericTypesDisabled()" because "config" is null
	at org.apache.flink.api.java.typeutils.GenericTypeInfo.createSerializer(GenericTypeInfo.java:85)
	at org.apache.flink.api.java.typeutils.GenericTypeInfo.createSerializer(GenericTypeInfo.java:99)
	at org.apache.flink.connector.jdbc.JdbcTestBase.getSerializer(JdbcTestBase.java:70)
	at org.apache.flink.connector.jdbc.JdbcRowOutputFormatTest.testInvalidConnectionInJdbcOutputFormat(JdbcRowOutputFormatTest.java:336)
	at java.base/java.lang.reflect.Method.invoke(Method.java:568)
	at java.base/java.util.ArrayList.forEach(ArrayList.java:1511)
	at java.base/java.util.ArrayList.forEach(ArrayList.java:1511)
{code}

Seems to be caused by FLINK-34122