[jira] [Created] (FLINK-34272) AdaptiveSchedulerClusterITCase failure due to MiniCluster not running
Matthias Pohl created FLINK-34272:

Summary: AdaptiveSchedulerClusterITCase failure due to MiniCluster not running
Key: FLINK-34272
URL: https://issues.apache.org/jira/browse/FLINK-34272
Project: Flink
Issue Type: Bug
Components: Runtime / Coordination
Affects Versions: 1.19.0
Reporter: Matthias Pohl

[https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=57073=logs=0da23115-68bb-5dcd-192c-bd4c8adebde1=24c3384f-1bcb-57b3-224f-51bf973bbee8=9543]

{code:java}
Jan 29 17:21:29 17:21:29.465 [ERROR] Tests run: 3, Failures: 0, Errors: 2, Skipped: 0, Time elapsed: 12.48 s <<< FAILURE! -- in org.apache.flink.runtime.scheduler.adaptive.AdaptiveSchedulerClusterITCase
Jan 29 17:21:29 17:21:29.465 [ERROR] org.apache.flink.runtime.scheduler.adaptive.AdaptiveSchedulerClusterITCase.testAutomaticScaleUp -- Time elapsed: 8.599 s <<< ERROR!
Jan 29 17:21:29 java.lang.IllegalStateException: MiniCluster is not yet running or has already been shut down.
Jan 29 17:21:29     at org.apache.flink.util.Preconditions.checkState(Preconditions.java:193)
Jan 29 17:21:29     at org.apache.flink.runtime.minicluster.MiniCluster.getDispatcherGatewayFuture(MiniCluster.java:1118)
Jan 29 17:21:29     at org.apache.flink.runtime.minicluster.MiniCluster.runDispatcherCommand(MiniCluster.java:991)
Jan 29 17:21:29     at org.apache.flink.runtime.minicluster.MiniCluster.getArchivedExecutionGraph(MiniCluster.java:840)
Jan 29 17:21:29     at org.apache.flink.runtime.scheduler.adaptive.AdaptiveSchedulerClusterITCase.lambda$waitUntilParallelismForVertexReached$3(AdaptiveSchedulerClusterITCase.java:270)
Jan 29 17:21:29     at org.apache.flink.runtime.testutils.CommonTestUtils.waitUntilCondition(CommonTestUtils.java:151)
Jan 29 17:21:29     at org.apache.flink.runtime.testutils.CommonTestUtils.waitUntilCondition(CommonTestUtils.java:145)
Jan 29 17:21:29     at org.apache.flink.runtime.scheduler.adaptive.AdaptiveSchedulerClusterITCase.waitUntilParallelismForVertexReached(AdaptiveSchedulerClusterITCase.java:265)
Jan 29 17:21:29     at org.apache.flink.runtime.scheduler.adaptive.AdaptiveSchedulerClusterITCase.testAutomaticScaleUp(AdaptiveSchedulerClusterITCase.java:146)
Jan 29 17:21:29     at java.lang.reflect.Method.invoke(Method.java:498)
Jan 29 17:21:29     at org.apache.flink.util.TestNameProvider$1.evaluate(TestNameProvider.java:45)
Jan 29 17:21:29
Jan 29 17:21:29 17:21:29.466 [ERROR] org.apache.flink.runtime.scheduler.adaptive.AdaptiveSchedulerClusterITCase.testCheckpointStatsPersistedAcrossRescale -- Time elapsed: 2.036 s <<< ERROR!
Jan 29 17:21:29 java.lang.IllegalStateException: MiniCluster is not yet running or has already been shut down.
Jan 29 17:21:29     at org.apache.flink.util.Preconditions.checkState(Preconditions.java:193)
Jan 29 17:21:29     at org.apache.flink.runtime.minicluster.MiniCluster.getDispatcherGatewayFuture(MiniCluster.java:1118)
Jan 29 17:21:29     at org.apache.flink.runtime.minicluster.MiniCluster.runDispatcherCommand(MiniCluster.java:991)
Jan 29 17:21:29     at org.apache.flink.runtime.minicluster.MiniCluster.getExecutionGraph(MiniCluster.java:969)
Jan 29 17:21:29     at org.apache.flink.runtime.scheduler.adaptive.AdaptiveSchedulerClusterITCase.lambda$testCheckpointStatsPersistedAcrossRescale$1(AdaptiveSchedulerClusterITCase.java:183)
Jan 29 17:21:29     at org.apache.flink.runtime.testutils.CommonTestUtils.waitUntilCondition(CommonTestUtils.java:151)
Jan 29 17:21:29     at org.apache.flink.runtime.testutils.CommonTestUtils.waitUntilCondition(CommonTestUtils.java:145)
Jan 29 17:21:29     at org.apache.flink.runtime.scheduler.adaptive.AdaptiveSchedulerClusterITCase.testCheckpointStatsPersistedAcrossRescale(AdaptiveSchedulerClusterITCase.java:180)
Jan 29 17:21:29     at java.lang.reflect.Method.invoke(Method.java:498)
Jan 29 17:21:29     at org.apache.flink.util.TestNameProvider$1.evaluate(TestNameProvider.java:45)
{code}

-- This message was sent by Atlassian Jira (v8.20.10#820010)
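Both failures surface inside `CommonTestUtils.waitUntilCondition`, which keeps polling the MiniCluster even after it has shut down. For readers unfamiliar with that utility, here is a minimal, self-contained sketch of such a poll-until-condition helper (class and method names are illustrative, not Flink's actual implementation):

```java
import java.util.function.Supplier;

class WaitUntil {
    /** Polls the condition until it returns true or the timeout elapses. */
    static void waitUntilCondition(Supplier<Boolean> condition, long timeoutMs)
            throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMs;
        while (!condition.get()) {
            if (System.currentTimeMillis() > deadline) {
                throw new IllegalStateException(
                        "condition not met within " + timeoutMs + " ms");
            }
            Thread.sleep(10); // retry interval between polls
        }
    }

    public static void main(String[] args) throws InterruptedException {
        // The condition flips to true on the third poll.
        int[] polls = {0};
        waitUntilCondition(() -> ++polls[0] >= 3, 5_000);
        System.out.println("condition met after " + polls[0] + " polls");
    }
}
```

In the failing tests, the polled condition is where `MiniCluster.getExecutionGraph`/`getArchivedExecutionGraph` are called; if the cluster shuts down between polls, the `IllegalStateException` propagates out of the loop exactly as in the log above.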
Re: [VOTE] Release flink-connector-kafka v3.1.0, release candidate #1
+1 (non-binding)

* Verified LICENSE and NOTICE files (this RC has a NOTICE file that points to 2023 that has since been updated on the main branch by Hang)
* Verified hashes and signatures
* Verified no binaries
* Verified poms point to 3.1.0
* Reviewed web PR
* Built from source
* Verified git tag

In the same vein as the web PR, do we want to prepare the PR to update the shortcode in the connector docs now [1]? Same for the Chinese version. I wonder if that should be included in the connector release instructions.

[1] https://github.com/apache/flink-connector-kafka/blob/d89a082180232bb79e3c764228c4e7dbb9eb6b8b/docs/content/docs/connectors/datastream/kafka.md#L39

Best,
Mason

On Sun, Jan 28, 2024 at 11:41 PM Hang Ruan wrote:
> +1 (non-binding)
>
> - Validated checksum hash
> - Verified signature
> - Verified that no binaries exist in the source archive
> - Build the source with Maven and jdk11
> - Verified web PR
> - Check that the jar is built by jdk8
>
> Best,
> Hang
>
> Martijn Visser 于2024年1月26日周五 21:05写道:
> > Hi everyone,
> > Please review and vote on the release candidate #1 for the Flink Kafka
> > connector version 3.1.0, as follows:
> > [ ] +1, Approve the release
> > [ ] -1, Do not approve the release (please provide specific comments)
> >
> > This release is compatible with Flink 1.17.* and Flink 1.18.*
> >
> > The complete staging area is available for your review, which includes:
> > * JIRA release notes [1],
> > * the official Apache source release to be deployed to dist.apache.org [2],
> > which are signed with the key with fingerprint
> > A5F3BCE4CBE993573EC5966A65321B8382B219AF [3],
> > * all artifacts to be deployed to the Maven Central Repository [4],
> > * source code tag v3.1.0-rc1 [5],
> > * website pull request listing the new release [6].
> >
> > The vote will be open for at least 72 hours. It is adopted by majority
> > approval, with at least 3 PMC affirmative votes.
> >
> > Thanks,
> > Release Manager
> >
> > [1] https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522=12353135
> > [2] https://dist.apache.org/repos/dist/dev/flink/flink-connector-kafka-3.1.0-rc1
> > [3] https://dist.apache.org/repos/dist/release/flink/KEYS
> > [4] https://repository.apache.org/content/repositories/orgapacheflink-1700
> > [5] https://github.com/apache/flink-connector-kafka/releases/tag/v3.1.0-rc1
> > [6] https://github.com/apache/flink-web/pull/718
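For anyone following along with the "Validated checksum hash / Verified signature" steps above, the checks boil down to two commands. A sketch using a dummy file in place of the real artifact (the real one is downloaded from dist.apache.org, and the signature step additionally needs the KEYS file imported first):

```shell
# Stand-in for the downloaded source artifact (placeholder content only).
echo "demo artifact" > flink-connector-kafka-3.1.0-src.tgz
sha512sum flink-connector-kafka-3.1.0-src.tgz > flink-connector-kafka-3.1.0-src.tgz.sha512

# 1. Checksum validation: prints "<file>: OK" when the hash matches.
sha512sum -c flink-connector-kafka-3.1.0-src.tgz.sha512

# 2. Signature verification (against the real detached .asc file,
#    after importing the release KEYS file):
#    gpg --import KEYS
#    gpg --verify flink-connector-kafka-3.1.0-src.tgz.asc flink-connector-kafka-3.1.0-src.tgz
```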
Re: Re: [DISCUSS] FLIP-331: Support EndOfStreamWindows and isOutputOnEOF operator attribute to optimize task deployment
Thanks Dong and Xuannan, The updated FLIP LGTM. Best regards, Weijie Xuannan Su 于2024年1月30日周二 11:10写道: > Hi all, > > Thanks for the comments and suggestions. If there are no further > comments, I will open the voting thread tomorrow. > > Best regards, > Xuannan > On Fri, Jan 26, 2024 at 3:16 PM Dong Lin wrote: > > > > Thanks Xuannan for the update! > > > > +1 (binding) > > > > On Wed, Jan 10, 2024 at 5:54 PM Xuannan Su > wrote: > > > > > Hi all, > > > > > > After several rounds of offline discussions with Xingtong and Jinhao, > > > we have decided to narrow the scope of the FLIP. It will now focus on > > > introducing OperatorAttributes that indicate whether an operator emits > > > records only after inputs have ended. We will also use the attribute > > > to optimize task scheduling for better resource utilization. Setting > > > the backlog status and optimizing the operator implementation during > > > the backlog will be deferred to future work. > > > > > > In addition to the change above, we also make the following changes to > > > the FLIP to address the problems mentioned by Dong: > > > - Public interfaces are updated to reuse the GlobalWindows. > > > - Instead of making all outputs of the upstream operators of the > > > "isOutputOnlyAfterEndOfStream=true" operator blocking, we only make > > > the output of the operator with "isOutputOnlyAfterEndOfStream=true" > > > blocking. This can prevent the second problem Dong mentioned. In the > > > future, we may introduce an extra OperatorAttributes to indicate if an > > > operator has any side output. > > > > > > I would greatly appreciate any comment or feedback you may have on the > > > updated FLIP. > > > > > > Best regards, > > > Xuannan > > > > > > On Tue, Sep 26, 2023 at 11:24 AM Dong Lin wrote: > > > > > > > > Hi all, > > > > > > > > Thanks for the review! > > > > > > > > Becket and I discussed this FLIP offline and we agreed on several > things > > > > that need to be improved with this FLIP. 
I will summarize our > discussion > > > > with the problems and TODOs. We will update the FLIP and let you know > > > once > > > > the FLIP is ready for review again. > > > > > > > > 1) Investigate whether it is possible to update the existing > > > GlobalWindows > > > > in a backward-compatible way and re-use it for the same purpose > > > > as EndOfStreamWindows, without introducing EndOfStreamWindows as a > new > > > > class. > > > > > > > > Note that GlobalWindows#getDefaultTrigger returns a NeverTrigger > instance > > > > which will not trigger window's computation even on end-of-inputs. We > > > will > > > > need to investigate its existing usage and see if we can re-use it > in a > > > > backward-compatible way. > > > > > > > > 2) Let JM know whether any operator in the upstream of the operator > with > > > > "isOutputOnEOF=true" will emit output via any side channel. The FLIP > > > should > > > > update the execution mode of those operators *only if* all outputs > from > > > > those operators are emitted only at the end of input. > > > > > > > > More specifically, the upstream operator might involve a user-defined > > > > operator that might emit output directly to an external service, > where > > > the > > > > emission operation is not explicitly expressed as an operator's > output > > > edge > > > > and thus not visible to JM. Similarly, it is also possible for the > > > > user-defined operator to register a timer > > > > via InternalTimerService#registerEventTimeTimer and emit output to an > > > > external service inside Triggerable#onEventTime. There is a chance > that > > > > users still need related logic to output data in real-time, even if > the > > > > downstream operators have isOutputOnEOF=true. > > > > > > > > One possible solution to address this problem is to add an extra > > > > OperatorAttribute to specify whether this operator might output > records > > > in > > > > such a way that does not go through operator's output (e.g. side > output). 
> > > > Then the JM can safely enable the runtime optimization currently > > > described > > > > in the FLIP when there is no such operator. > > > > > > > > 3) Create a follow-up FLIP that allows users to specify whether a > source > > > > with Boundedness=bounded should have isProcessingBacklog=true. > > > > > > > > This capability would effectively introduce a 3rd strategy to set > backlog > > > > status (in addition to FLIP-309 and FLIP-328). It might be useful to > note > > > > that, even though the data in bounded sources are backlog data in > most > > > > practical use-cases, it is not necessarily true. For example, users > might > > > > want to start a Flink job to consume real-time data from a Kafka > topic > > > and > > > > specify that the job stops after 24 hours, which means the source is > > > > technically bounded while the data is fresh/real-time. > > > > > > > > This capability is more generic and can cover more use-case than > > > > EndOfStreamWindows. On the other hand,
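The narrowed FLIP scope described earlier — an `OperatorAttributes` flag indicating that an operator emits records only after its inputs have ended — can be made concrete with a toy model (hypothetical class names, no Flink dependency): all state is buffered while input runs, and output is produced solely in `endInput`, which is what would let the scheduler treat the operator's output edge as blocking.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy stand-in for an operator with isOutputOnlyAfterEndOfStream = true. */
class EndOfInputAggregator {
    private long sum = 0;
    private long count = 0;

    /** Called per record; deliberately produces no output. */
    void processRecord(long value) {
        sum += value;
        count++;
    }

    /** All output is produced here, once the input has ended. */
    List<String> endInput() {
        List<String> out = new ArrayList<>();
        out.add("sum=" + sum + ", count=" + count);
        return out;
    }

    public static void main(String[] args) {
        EndOfInputAggregator op = new EndOfInputAggregator();
        for (long v = 1; v <= 4; v++) {
            op.processRecord(v);
        }
        System.out.println(op.endInput()); // [sum=10, count=4]
    }
}
```

The side-output caveat discussed above is precisely what this model cannot capture: if `processRecord` wrote to an external service, the "no output before end of input" assumption would silently break, which is why the extra attribute for side outputs is proposed.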
Re: [DISCUSS] FLIP-410: Config, Context and Processing Timer Service of DataStream API V2
Hi Wencong, > Q1. In the "Configuration" section, it is mentioned that configurations can be set continuously using the withXXX methods. Are these configuration options the same as those provided by DataStream V1, or might there be different options compared to V1? I haven't considered options that don't exist in V1 yet, but we may have some new options as we continue to develop. > Q2. The FLIP describes the interface for handling processing timers (ProcessingTimeManager), but it does not mention how to delete or update an existing timer. The V1 API provides a TimerService that can delete a timer. Does this mean that once a timer is registered, it cannot be changed? I think we do need to introduce a method to delete the timer, but I'm kind of curious why we need to update the timer instead of registering a new one. Anyway, I have updated the FLIP to support deleting the timer. Best regards, Weijie weijie guo 于2024年1月30日周二 14:35写道: > Hi Xuannan, > > > 1. +1 to only use XXXParititionStream if users only need to use the > configurable PartitionStream. If there are use cases for both, > perhaps we could use `ProcessConfigurableNonKeyedPartitionStream` or > `ConfigurableNonKeyedPartitionStream` for simplicity. > > As for why we need both, you can refer to my reply to Yunfeng's first > question. As for the name, I can accept > ProcessConfigurableNonKeyedPartitionStream or keep the status quo. But I > don't want to change it to ConfigurableNonKeyedPartitionStream, the reason > is the same, because the configuration is applied to the Process rather > than the Stream. > > > Should we allow users to set custom configurations through the > `ProcessConfigurable` interface and access these configurations in the > `ProcessFunction` via `RuntimeContext`? I believe it would be useful > for process function developers to be able to define custom > configurations. > > If I understand you correctly, you want to set custom properties for > processing. 
The current configurations are mostly for the runtime engine, > such as determining the underlying operator's parallelism and SSG. But I'm > not aware of the need to pass in a custom value (independent of the > framework itself) and then get it at runtime from RuntimeContext. Could > you give some examples? > > > How can users define custom metrics within the `ProcessFunction`? > Will there be a method like `getMetricGroup` available in the > `RuntimeContext`? > > I think this is a reasonable request. For extensibility, I have added the > getMetricManager instead of getMetricGroup to RuntimeContext, we can use > it to get the MetricGroup. > > > Best regards, > > Weijie > > > weijie guo 于2024年1月30日周二 13:45写道: > >> Thanks Yunfeng, >> >> Let me try to answer your question :) >> >> > 1. Would it be better to have all XXXPartitionStream classes implement >> ProcessConfigurable, instead of defining both XXXPartitionStream and >> ProcessConfigurableAndXXXPartitionStream? I wonder whether users would >> need to operate on a non-configurable PartitionStream. >> >> I thought about this for a while and decided to separate DataStream from >> ProcessConfigurable. At the core of this is that streams and >> configurations are completely orthogonal concepts, and configuration is >> only responsible for the `Process`, not the `Stream`. This is why only >> the `process/connectAndProcess` returns configurable stream, but >> partitioning like `KeyBy` returns a pure DataStream. This may also answer >> your second question in passing. >> >> >> > Apart from the detailed withConfigFoo(foo)/withConfigBar(bar) >> methods, would it be better to also add a general >> withConfig(configKey, configValue) method to the ProcessConfigurable >> interface? Adding a method for each configuration might harm the >> readability and compatibility of configurations. >> >> Sorry, I may not fully understand this question. 
ProcessConfigurable >> simply refers to the configuration of the Process, which can have the name, >> parallelism, etc. of the process. It's not actually the >> Configuration (contains >> a lot of ConfigOptions) that we usually talk about, but more like >> `SingleOutputStreamOperator` in DataStream V1. >> >> Best regards, >> >> Weijie >> >> >> Xuannan Su 于2024年1月29日周一 18:45写道: >> >>> Hi Weijie, >>> >>> Thanks for the FLIP! I have a few questions regarding the FLIP. >>> >>> 1. +1 to only use XXXParititionStream if users only need to use the >>> configurable PartitionStream. If there are use cases for both, >>> perhaps we could use `ProcessConfigurableNonKeyedPartitionStream` or >>> `ConfigurableNonKeyedPartitionStream` for simplicity. >>> >>> 2. Should we allow users to set custom configurations through the >>> `ProcessConfigurable` interface and access these configurations in the >>> `ProcessFunction` via `RuntimeContext`? I believe it would be useful >>> for process function developers to be able to define custom >>> configurations. >>> >>> 3. How can users define custom metrics within the
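Earlier in this thread Weijie mentions updating the FLIP so that timers can be deleted. The idea can be sketched with a toy timer manager; the class and method names below are illustrative, not the actual FLIP-410 `ProcessingTimeManager` interface:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

/** Illustrative processing-time timer service with registration and deletion. */
class ToyProcessingTimeManager {
    // Pending timer timestamps, kept sorted so due timers are cheap to find.
    private final TreeSet<Long> timers = new TreeSet<>();

    void registerProcessingTimer(long timestamp) {
        timers.add(timestamp);
    }

    /** A deleted timer is simply removed before it can fire. */
    boolean deleteProcessingTimer(long timestamp) {
        return timers.remove(timestamp);
    }

    /** Returns (and clears) all timers due at or before {@code now}. */
    List<Long> advanceTo(long now) {
        List<Long> fired = new ArrayList<>(timers.headSet(now, true));
        fired.forEach(timers::remove);
        return fired;
    }
}
```

Under this model, "updating" a timer reduces to delete-plus-register, which matches the point above that registering a new timer is usually sufficient.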
Re: [DISCUSS] FLIP-409: DataStream V2 Building Blocks: DataStream, Partitioning and ProcessFunction
Hi Xintong, Thanks for your reply. > Does this mean if we want to support (KeyedStream, BroadcastStream) -> (KeyedStream), we must make sure that no data can be output upon processing records from the input BroadcastStream? That's probably a reasonable limitation. I think so, this is the restriction that has to be imposed in order to avoid re-partitioning (i.e. shuffle). If one just wants to get a keyed stream and doesn't care about the data distribution, then explicit KeyBy partitioning works as expected. > The problem is would this limitation be too implicit for the users to understand. Since we can't check for this limitation at compile time, if we were to add support for this case, we would have to introduce additional runtime checks to ensure program correctness. For now, I'm inclined not to support it, as it's hard for users to understand this restriction unless we have something better. And we can always add it later if we do realize there's a strong demand for it. > 1. I'd suggest renaming the method with timestamp to something like `collectAndOverwriteTimestamp`. That might help users understand that they don't always need to call this method, unless they explicitly want to overwrite the timestamp. Makes sense. I have updated the FLIP to use this new method name. > 2. While this method provides a way to set timestamps, how would users read timestamps from the records? Ah, good point. I will introduce a new method to get the timestamp of the current record in RuntimeContext. Best regards, Weijie Xintong Song 于2024年1月30日周二 14:04写道: > Just trying to understand. > > > Is there a particular reason we do not support a > > `TwoInputProcessFunction` to combine a KeyedStream with a > > BroadcastStream to result in a KeyedStream? There seems to be a valid > > use case where a KeyedStream is enriched with a BroadcastStream and > > returns a Stream that is partitioned in the same way. 
> > > > The key point here is that if the returned stream is a KeyedStream, we > > require that the partition of input and output be the same. As for the > > data on the broadcast edge, it will be broadcast to all parallelism, we > > cannot keep the data partition consistent. For example, if a specific > > record is sent to both SubTask1 and SubTask2, after processing, the > > partition index calculated by the new KeySelector is `1`, then the data > > distribution of SubTask2 has obviously changed. > > > Does this mean if we want to support (KeyedStream, BroadcastStream) -> > (KeyedStream), we must make sure that no data can be output upon processing > records from the input BroadcastStream? That's probably a reasonable > limitation. The problem is would this limitation be too implicit for the > users to understand. > > > > I noticed that there are two `collect` methods in the Collector, > > > > one with a timestamp and one without. Could you elaborate on the > > differences between them? Additionally, in what use case would one use > > the method that includes the timestamp? > > > > > > That's a good question, and it's mostly used with time-related operators > > such as Window. First, we want to give the process function the ability > to > > reset timestamps, which makes it more flexible than the original > > API. Second, we don't want to take the timestamp extraction > > operator/function as a base primitive, it's more like a high-level > > extension. Therefore, the framework must provide this functionality. > > > > > 1. I'd suggest renaming the method with timestamp to something like > `collectAndOverwriteTimestamp`. That might help users understand that they > don't always need to call this method, unless they explicitly want to > overwrite the timestamp. > > 2. While this method provides a way to set timestamps, how would users read > timestamps from the records? 
> > > Best, > > Xintong > > > > On Tue, Jan 30, 2024 at 12:45 PM weijie guo > wrote: > > > Hi Xuannan, > > > > Thank you for your attention. > > > > > In the partitioning section, it says that "broadcast can only be > > used as a side-input of other Inputs." Could you clarify what is meant > > by "side-input"? If I understand correctly, it refer to one of the > > inputs of the `TwoInputStreamProcessFunction`. If that is the case, > > the term "side-input" may not be accurate. > > > > Yes, you got it right! I have rewrote this sentence to avoid > > misunderstanding. > > > > > Is there a particular reason we do not support a > > `TwoInputProcessFunction` to combine a KeyedStream with a > > BroadcastStream to result in a KeyedStream? There seems to be a valid > > use case where a KeyedStream is enriched with a BroadcastStream and > > returns a Stream that is partitioned in the same way. > > > > The key point here is that if the returned stream is a KeyedStream, we > > require that the partition of input and output be the same. As for the > > data on the broadcast edge, it will be broadcast to all parallelism, we > > cannot keep the data partition
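The partition-consistency argument above — a broadcast record reaching both SubTask1 and SubTask2 while its key maps to only one of them — can be illustrated with key-to-subtask hashing. The modulo scheme below is a deliberate simplification of Flink's key-group assignment, for illustration only:

```java
/** Shows why output driven by a broadcast input cannot stay key-partitioned:
 *  a broadcast record is processed on every subtask, but its key hashes to one. */
class BroadcastPartitionDemo {
    static int subtaskFor(String key, int parallelism) {
        // Simplified stand-in for Flink's key-group-based assignment.
        return Math.floorMod(key.hashCode(), parallelism);
    }

    public static void main(String[] args) {
        int parallelism = 2;
        String key = "user-42";
        int owner = subtaskFor(key, parallelism);
        for (int subtask = 0; subtask < parallelism; subtask++) {
            // Only the owning subtask could emit this key without changing
            // the downstream key partitioning; every other subtask would break it.
            System.out.println("subtask " + subtask + ": "
                    + (subtask == owner ? "consistent" : "would break key partitioning"));
        }
    }
}
```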
Re: [VOTE] Release flink-connector-mongodb 1.1.0, release candidate #1
Thanks Hang for the review! We should use jdk8 to compile the jar, I will cancel the rc1 and prepare rc2 soon. Best, Leonard > 2024年1月30日 下午2:39,Hang Ruan 写道: > > Hi, Leonard. > > I find that META-INF/MANIFEST.MF in > flink-sql-connector-mongodb-1.1.0-1.18.jar shows as follow. > > Manifest-Version: 1.0 > Archiver-Version: Plexus Archiver > Created-By: Apache Maven 3.8.1 > Built-By: bangjiangxu > Build-Jdk: 11.0.11 > Specification-Title: Flink : Connectors : SQL : MongoDB > Specification-Version: 1.1.0-1.18 > Specification-Vendor: The Apache Software Foundation > Implementation-Title: Flink : Connectors : SQL : MongoDB > Implementation-Version: 1.1.0-1.18 > Implementation-Vendor-Id: org.apache.flink > Implementation-Vendor: The Apache Software Foundation > > Maybe we should build mongodb connector with jdk8. > > Best, > Hang > > Jiabao Sun 于2024年1月29日周一 21:51写道: > >> Thanks Leonard for driving this. >> >> +1(non-binding) >> >> - Release notes look good >> - Tag is present in Github >> - Validated checksum hash >> - Verified signature >> - Build the source with Maven by jdk8,11,17,21 >> - Verified web PR and left minor comments >> - Run a filter push down test by sql-client on Flink 1.18.1 and it works >> well >> >> Best, >> Jiabao >> >> >> On 2024/01/29 12:33:23 Leonard Xu wrote: >>> Hey all, >>> >>> Please help review and vote on the release candidate #1 for the version >> 1.1.0 of the >>> Apache Flink MongoDB Connector as follows: >>> >>> [ ] +1, Approve the release >>> [ ] -1, Do not approve the release (please provide specific comments) >>> >>> The complete staging area is available for your review, which includes: >>> * JIRA release notes [1], >>> * The official Apache source release to be deployed to dist.apache.org >> [2], >>> which are signed with the key with fingerprint >>> 5B2F6608732389AEB67331F5B197E1F1108998AD [3], >>> * All artifacts to be deployed to the Maven Central Repository [4], >>> * Source code tag v.1.0-rc1 [5], >>> * Website pull request 
listing the new release [6]. >>> >>> The vote will be open for at least 72 hours. It is adopted by majority >>> approval, with at least 3 PMC affirmative votes. >>> >>> >>> Best, >>> Leonard >>> [1] >> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522=12353483 >>> [2] >> https://dist.apache.org/repos/dist/dev/flink/flink-connector-mongodb-1.1.0-rc1/ >>> [3] https://dist.apache.org/repos/dist/release/flink/KEYS >>> [4] >> https://repository.apache.org/content/repositories/orgapacheflink-1702/ >>> [5] https://github.com/apache/flink-connector-mongodb/tree/v1.1.0-rc1 >>> [6] https://github.com/apache/flink-web/pull/719
Re: [VOTE] Release flink-connector-mongodb 1.1.0, release candidate #1
Hi, Leonard.

I find that META-INF/MANIFEST.MF in flink-sql-connector-mongodb-1.1.0-1.18.jar shows as follows.

Manifest-Version: 1.0
Archiver-Version: Plexus Archiver
Created-By: Apache Maven 3.8.1
Built-By: bangjiangxu
Build-Jdk: 11.0.11
Specification-Title: Flink : Connectors : SQL : MongoDB
Specification-Version: 1.1.0-1.18
Specification-Vendor: The Apache Software Foundation
Implementation-Title: Flink : Connectors : SQL : MongoDB
Implementation-Version: 1.1.0-1.18
Implementation-Vendor-Id: org.apache.flink
Implementation-Vendor: The Apache Software Foundation

Maybe we should build the mongodb connector with jdk8.

Best,
Hang

Jiabao Sun 于2024年1月29日周一 21:51写道:
> Thanks Leonard for driving this.
>
> +1 (non-binding)
>
> - Release notes look good
> - Tag is present in Github
> - Validated checksum hash
> - Verified signature
> - Build the source with Maven by jdk8,11,17,21
> - Verified web PR and left minor comments
> - Run a filter push down test by sql-client on Flink 1.18.1 and it works well
>
> Best,
> Jiabao
>
> On 2024/01/29 12:33:23 Leonard Xu wrote:
> > Hey all,
> >
> > Please help review and vote on the release candidate #1 for the version 1.1.0 of the
> > Apache Flink MongoDB Connector as follows:
> >
> > [ ] +1, Approve the release
> > [ ] -1, Do not approve the release (please provide specific comments)
> >
> > The complete staging area is available for your review, which includes:
> > * JIRA release notes [1],
> > * The official Apache source release to be deployed to dist.apache.org [2],
> > which are signed with the key with fingerprint
> > 5B2F6608732389AEB67331F5B197E1F1108998AD [3],
> > * All artifacts to be deployed to the Maven Central Repository [4],
> > * Source code tag v1.1.0-rc1 [5],
> > * Website pull request listing the new release [6].
> >
> > The vote will be open for at least 72 hours. It is adopted by majority
> > approval, with at least 3 PMC affirmative votes.
> >
> > Best,
> > Leonard
> > [1] https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522=12353483
> > [2] https://dist.apache.org/repos/dist/dev/flink/flink-connector-mongodb-1.1.0-rc1/
> > [3] https://dist.apache.org/repos/dist/release/flink/KEYS
> > [4] https://repository.apache.org/content/repositories/orgapacheflink-1702/
> > [5] https://github.com/apache/flink-connector-mongodb/tree/v1.1.0-rc1
> > [6] https://github.com/apache/flink-web/pull/719
Re: [DISCUSS] FLIP-410: Config, Context and Processing Timer Service of DataStream API V2
Hi Xuannan, > 1. +1 to only use XXXParititionStream if users only need to use the configurable PartitionStream. If there are use cases for both, perhaps we could use `ProcessConfigurableNonKeyedPartitionStream` or `ConfigurableNonKeyedPartitionStream` for simplicity. As for why we need both, you can refer to my reply to Yunfeng's first question. As for the name, I can accept ProcessConfigurableNonKeyedPartitionStream or keep the status quo. But I don't want to change it to ConfigurableNonKeyedPartitionStream, the reason is the same, because the configuration is applied to the Process rather than the Stream. > Should we allow users to set custom configurations through the `ProcessConfigurable` interface and access these configurations in the `ProcessFunction` via `RuntimeContext`? I believe it would be useful for process function developers to be able to define custom configurations. If I understand you correctly, you want to set custom properties for processing. The current configurations are mostly for the runtime engine, such as determining the underlying operator's parallelism and SSG. But I'm not aware of the need to pass in a custom value (independent of the framework itself) and then get it at runtime from RuntimeContext. Could you give some examples? > How can users define custom metrics within the `ProcessFunction`? Will there be a method like `getMetricGroup` available in the `RuntimeContext`? I think this is a reasonable request. For extensibility, I have added the getMetricManager instead of getMetricGroup to RuntimeContext, we can use it to get the MetricGroup. Best regards, Weijie weijie guo 于2024年1月30日周二 13:45写道: > Thanks Yunfeng, > > Let me try to answer your question :) > > > 1. Would it be better to have all XXXPartitionStream classes implement > ProcessConfigurable, instead of defining both XXXPartitionStream and > ProcessConfigurableAndXXXPartitionStream? I wonder whether users would > need to operate on a non-configurable PartitionStream. 
> > I thought about this for a while and decided to separate DataStream from > ProcessConfigurable. At the core of this is that streams and configurations > are completely orthogonal concepts, and configuration is only > responsible for the `Process`, not the `Stream`. This is why only the > `process/connectAndProcess` returns configurable stream, but partitioning > like `KeyBy` returns a pure DataStream. This may also answer your second > question in passing. > > > > Apart from the detailed withConfigFoo(foo)/withConfigBar(bar) > methods, would it be better to also add a general > withConfig(configKey, configValue) method to the ProcessConfigurable > interface? Adding a method for each configuration might harm the > readability and compatibility of configurations. > > Sorry, I may not fully understand this question. ProcessConfigurable > simply refers to the configuration of the Process, which can have the name, > parallelism, etc. of the process. It's not actually the Configuration (contains > a lot of ConfigOptions) that we usually talk about, but more like > `SingleOutputStreamOperator` in DataStream V1. > > Best regards, > > Weijie > > > Xuannan Su 于2024年1月29日周一 18:45写道: > >> Hi Weijie, >> >> Thanks for the FLIP! I have a few questions regarding the FLIP. >> >> 1. +1 to only use XXXParititionStream if users only need to use the >> configurable PartitionStream. If there are use cases for both, >> perhaps we could use `ProcessConfigurableNonKeyedPartitionStream` or >> `ConfigurableNonKeyedPartitionStream` for simplicity. >> >> 2. Should we allow users to set custom configurations through the >> `ProcessConfigurable` interface and access these configurations in the >> `ProcessFunction` via `RuntimeContext`? I believe it would be useful >> for process function developers to be able to define custom >> configurations. >> >> 3. How can users define custom metrics within the `ProcessFunction`? 
>> Will there be a method like `getMetricGroup` available in the >> `RuntimeContext`? >> >> Best, >> Xuannan >> >> >> On Fri, Jan 26, 2024 at 2:38 PM Yunfeng Zhou >> wrote: >> > >> > Hi Weijie, >> > >> > Thanks for introducing this FLIP! I have a few questions about the >> > designs proposed. >> > >> > 1. Would it be better to have all XXXPartitionStream classes implement >> > ProcessConfigurable, instead of defining both XXXPartitionStream and >> > ProcessConfigurableAndXXXPartitionStream? I wonder whether users would >> > need to operate on a non-configurable PartitionStream. >> > >> > 2. The name "ProcessConfigurable" seems a little ambiguous to me. Will >> > there be classes other than XXXPartitionStream that implement this >> > interface? Will "Process" be accurate enough to describe >> > PartitionStream and those classes? >> > >> > 3. Apart from the detailed withConfigFoo(foo)/withConfigBar(bar) >> > methods, would it be better to also add a general >> > withConfig(configKey, configValue) method to the ProcessConfigurable >> > interface?
Re: [VOTE] Release flink-connector-jdbc, release candidate #2
+1 (non-binding) - Validated checksum hash - Verified signature - Verified that no binaries exist in the source archive - Build the source with Maven and jdk11 - Verified web PR - Check that the jar is built by jdk8 Best, Hang Jiabao Sun 于2024年1月30日周二 10:52写道: > +1(non-binding) > > - Release notes look good > - Tag is present in Github > - Validated checksum hash > - Verified signature > - Verified web PR and left minor comments > > Best, > Jiabao > > > On 2024/01/30 00:17:54 Sergey Nuyanzin wrote: > > Hi everyone, > > Please review and vote on the release candidate #2 for the version > > 3.1.2, as follows: > > [ ] +1, Approve the release > > [ ] -1, Do not approve the release (please provide specific comments) > > > > This version is compatible with Flink 1.16.x, 1.17.x and 1.18.x. > > > > The complete staging area is available for your review, which includes: > > * JIRA release notes [1], > > * the official Apache source release to be deployed to dist.apache.org > > [2], which are signed with the key with fingerprint > > 1596BBF0726835D8 [3], > > * all artifacts to be deployed to the Maven Central Repository [4], > > * source code tag v3.1.2-rc2 [5], > > * website pull request listing the new release [6]. > > > > The vote will be open for at least 72 hours. It is adopted by majority > > approval, with at least 3 PMC affirmative votes. > > > > Thanks, > > Release Manager > > > > [1] > > > https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522=12354088 > > [2] > > > https://dist.apache.org/repos/dist/dev/flink/flink-connector-jdbc-3.1.2-rc2 > > [3] https://dist.apache.org/repos/dist/release/flink/KEYS > > [4] > https://repository.apache.org/content/repositories/orgapacheflink-1704/ > > [5] > https://github.com/apache/flink-connector-jdbc/releases/tag/v3.1.2-rc2 > > [6] https://github.com/apache/flink-web/pull/707 > >
Re: [DISCUSS] FLIP-409: DataStream V2 Building Blocks: DataStream, Partitioning and ProcessFunction
Just trying to understand. > Is there a particular reason we do not support a > `TwoInputProcessFunction` to combine a KeyedStream with a > BroadcastStream to result in a KeyedStream? There seems to be a valid > use case where a KeyedStream is enriched with a BroadcastStream and > returns a Stream that is partitioned in the same way. > The key point here is that if the returned stream is a KeyedStream, we > require that the partition of input and output be the same. As for the > data on the broadcast edge, it will be broadcast to all parallel subtasks, so we > cannot keep the data partition consistent. For example, if a specific > record is sent to both SubTask1 and SubTask2, after processing, the > partition index calculated by the new KeySelector is `1`, then the data > distribution of SubTask2 has obviously changed. Does this mean that if we want to support (KeyedStream, BroadcastStream) -> (KeyedStream), we must make sure that no data can be output upon processing records from the input BroadcastStream? That's probably a reasonable limitation. The question is whether this limitation would be too implicit for users to understand. > I noticed that there are two `collect` methods in the Collector, > > one with a timestamp and one without. Could you elaborate on the > differences between them? Additionally, in what use case would one use > the method that includes the timestamp? > > > That's a good question, and it's mostly used with time-related operators > such as Window. First, we want to give the process function the ability to > reset timestamps, which makes it more flexible than the original > API. Second, we don't want to take the timestamp extraction > operator/function as a base primitive, it's more like a high-level > extension. Therefore, the framework must provide this functionality. > > 1. I'd suggest renaming the method with timestamp to something like `collectAndOverwriteTimestamp`. 
That might help users understand that they don't always need to call this method, unless they explicitly want to overwrite the timestamp. 2. While this method provides a way to set timestamps, how would users read timestamps from the records? Best, Xintong On Tue, Jan 30, 2024 at 12:45 PM weijie guo wrote: > Hi Xuannan, > > Thank you for your attention. > > > In the partitioning section, it says that "broadcast can only be > used as a side-input of other Inputs." Could you clarify what is meant > by "side-input"? If I understand correctly, it refer to one of the > inputs of the `TwoInputStreamProcessFunction`. If that is the case, > the term "side-input" may not be accurate. > > Yes, you got it right! I have rewrote this sentence to avoid > misunderstanding. > > > Is there a particular reason we do not support a > `TwoInputProcessFunction` to combine a KeyedStream with a > BroadcastStream to result in a KeyedStream? There seems to be a valid > use case where a KeyedStream is enriched with a BroadcastStream and > returns a Stream that is partitioned in the same way. > > The key point here is that if the returned stream is a KeyedStream, we > require that the partition of input and output be the same. As for the > data on the broadcast edge, it will be broadcast to all parallelism, we > cannot keep the data partition consistent. For example, if a specific > record is sent to both SubTask1 and SubTask2, after processing, the > partition index calculated by the new KeySelector is `1`, then the data > distribution of SubTask2 has obviously changed. > > > 3. There appears to be a typo in the example code. The > `SingleStreamProcessFunction` should probably be > `OneInputStreamProcessFunction`. > > Yes, good catch. I have updated this FLIP. > > > 4. How do we set the global configuration for the > ExecutionEnvironment? Currently, we have the > StreamExecutionEnvironment.getExecutionEnvironment(Configuration) > method to provide the global configuration in the API. 
> > This is because we don't want to allow set config programmatically in the > new API, everything best comes from configuration files. However, this may > be too ideal, and the specific details need to be considered and discussed > in more detail, and I propose to devote a new sub-FLIP to this issue later. > We can easily provide the `getExecutionEnvironment(Configuration)` or > `withConfiguration(Configuration)` method later. > > > I noticed that there are two `collect` methods in the Collector, > one with a timestamp and one without. Could you elaborate on the > differences between them? Additionally, in what use case would one use > the method that includes the timestamp? > > That's a good question, and it's mostly used with time-related operators > such as Window. First, we want to give the process function the ability to > reset timestamps, which makes it more flexible than the original > API. Second, we don't want to take the timestamp extraction > operator/function as a base primitive, it's more like a high-level > extension. Therefore, the framework
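The two `collect` variants discussed in this thread can be mocked up as follows. This is an illustrative sketch only, not the actual FLIP-409 API: the `Collector` interface shape, the `collectAndOverwriteTimestamp` name (Xintong's suggestion above), and the internal `StreamRecord` type are all assumptions drawn from the discussion.

```java
import java.util.ArrayList;
import java.util.List;

// Mock of the Collector discussed above; names and signatures are assumptions
// drawn from the thread, not the actual FLIP-409 API.
public class CollectorSketch {

    interface Collector<T> {
        void collect(T record);                               // keep the input record's timestamp
        void collectAndOverwriteTimestamp(T record, long ts); // explicitly reset the timestamp
    }

    // How the framework might represent a timestamped record internally.
    record StreamRecord<T>(T value, long timestamp) {}

    // Minimal collector: records emitted via plain collect() inherit the
    // timestamp of the record currently being processed.
    static class BufferingCollector<T> implements Collector<T> {
        final List<StreamRecord<T>> out = new ArrayList<>();
        long currentInputTimestamp;

        public void collect(T record) {
            out.add(new StreamRecord<>(record, currentInputTimestamp));
        }

        public void collectAndOverwriteTimestamp(T record, long ts) {
            out.add(new StreamRecord<>(record, ts));
        }
    }

    public static void main(String[] args) {
        BufferingCollector<String> c = new BufferingCollector<>();
        c.currentInputTimestamp = 100L;
        c.collect("a");                            // timestamp stays 100
        c.collectAndOverwriteTimestamp("b", 250L); // timestamp overwritten to 250
        System.out.println(c.out);
    }
}
```

The split keeps the common case (no timestamp manipulation) free of any timestamp argument, while still letting time-related operators such as windows reset timestamps explicitly.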
[jira] [Created] (FLINK-34271) Fix the unstable test about GroupAggregateRestoreTest#AGG_WITH_STATE_TTL_HINT
xuyang created FLINK-34271: -- Summary: Fix the unstable test about GroupAggregateRestoreTest#AGG_WITH_STATE_TTL_HINT Key: FLINK-34271 URL: https://issues.apache.org/jira/browse/FLINK-34271 Project: Flink Issue Type: Bug Components: Table SQL / Planner Reporter: xuyang -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (FLINK-34270) Update connector developer-facing doc
Zhanghao Chen created FLINK-34270: - Summary: Update connector developer-facing doc Key: FLINK-34270 URL: https://issues.apache.org/jira/browse/FLINK-34270 Project: Flink Issue Type: Sub-task Reporter: Zhanghao Chen -- This message was sent by Atlassian Jira (v8.20.10#820010)
Re: [DISCUSS] FLIP-410: Config, Context and Processing Timer Service of DataStream API V2
Thanks Yunfeng. Let me try to answer your questions :) > 1. Would it be better to have all XXXPartitionStream classes implement ProcessConfigurable, instead of defining both XXXPartitionStream and ProcessConfigurableAndXXXPartitionStream? I wonder whether users would need to operate on a non-configurable PartitionStream. I thought about this for a while and decided to separate DataStream from ProcessConfigurable. At the core of this is that streams and configurations are completely orthogonal concepts, and configuration is only responsible for the `Process`, not the `Stream`. This is why only `process/connectAndProcess` returns a configurable stream, while partitioning operations like `keyBy` return a pure DataStream. This may also answer your second question in passing. > Apart from the detailed withConfigFoo(foo)/withConfigBar(bar) methods, would it be better to also add a general withConfig(configKey, configValue) method to the ProcessConfigurable interface? Adding a method for each configuration might harm the readability and compatibility of configurations. Sorry, I may not fully understand this question. ProcessConfigurable simply refers to the configuration of the Process, which can have the name, parallelism, etc. of the process. It's not actually the Configuration (containing a lot of ConfigOptions) that we usually talk about, but more like `SingleOutputStreamOperator` in DataStream V1. Best regards, Weijie Xuannan Su 于2024年1月29日周一 18:45写道: > Hi Weijie, > > Thanks for the FLIP! I have a few questions regarding the FLIP. > > 1. +1 to only use XXXPartitionStream if users only need to use the > configurable PartitionStream. If there are use cases for both, > perhaps we could use `ProcessConfigurableNonKeyedPartitionStream` or > `ConfigurableNonKeyedPartitionStream` for simplicity. > > 2. Should we allow users to set custom configurations through the > `ProcessConfigurable` interface and access these configurations in the > `ProcessFunction` via `RuntimeContext`? 
I believe it would be useful > for process function developers to be able to define custom > configurations. > > 3. How can users define custom metrics within the `ProcessFunction`? > Will there be a method like `getMetricGroup` available in the > `RuntimeContext`? > > Best, > Xuannan > > > On Fri, Jan 26, 2024 at 2:38 PM Yunfeng Zhou > wrote: > > > > Hi Weijie, > > > > Thanks for introducing this FLIP! I have a few questions about the > > designs proposed. > > > > 1. Would it be better to have all XXXPartitionStream classes implement > > ProcessConfigurable, instead of defining both XXXPartitionStream and > > ProcessConfigurableAndXXXPartitionStream? I wonder whether users would > > need to operate on a non-configurable PartitionStream. > > > > 2. The name "ProcessConfigurable" seems a little ambiguous to me. Will > > there be classes other than XXXPartitionStream that implement this > > interface? Will "Process" be accurate enough to describe > > PartitionStream and those classes? > > > > 3. Apart from the detailed withConfigFoo(foo)/withConfigBar(bar) > > methods, would it be better to also add a general > > withConfig(configKey, configValue) method to the ProcessConfigurable > > interface? Adding a method for each configuration might harm the > > readability and compatibility of configurations. > > > > Looking forward to your response. > > > > Best regards, > > Yunfeng Zhou > > > > On Tue, Dec 26, 2023 at 2:47 PM weijie guo > wrote: > > > > > > Hi devs, > > > > > > > > > I'd like to start a discussion about FLIP-410: Config, Context and > > > Processing Timer Service of DataStream API V2 [1]. This is the second > > > sub-FLIP of DataStream API V2. > > > > > > > > > In FLIP-409 [2], we have defined the most basic primitive of > > > DataStream V2. On this basis, this FLIP will further answer several > > > important questions closely related to it: > > > > > >1. > > >How to configure the processing over the datastreams, such as > > > setting the parallelism. 
> > >2. > > >How to get access to the runtime contextual information and > > > services from inside the process functions. > > >3. How to work with processing-time timers. > > > > > > You can find more details in this FLIP. Its relationship with other > > > sub-FLIPs can be found in the umbrella FLIP > > > [3]. > > > > > > > > > Looking forward to hearing from you, thanks! > > > > > > > > > Best regards, > > > > > > Weijie > > > > > > > > > [1] > > > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-410%3A++Config%2C+Context+and+Processing+Timer+Service+of+DataStream+API+V2 > > > > > > [2] > > > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-409%3A+DataStream+V2+Building+Blocks%3A+DataStream%2C+Partitioning+and+ProcessFunction > > > > > > [3] > > > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-408%3A+%5BUmbrella%5D+Introduce+DataStream+API+V2 >
Re: [DISCUSS] FLIP-409: DataStream V2 Building Blocks: DataStream, Partitioning and ProcessFunction
Hi Xuannan, Thank you for your attention. > In the partitioning section, it says that "broadcast can only be used as a side-input of other Inputs." Could you clarify what is meant by "side-input"? If I understand correctly, it refers to one of the inputs of the `TwoInputStreamProcessFunction`. If that is the case, the term "side-input" may not be accurate. Yes, you got it right! I have rewritten this sentence to avoid misunderstanding. > Is there a particular reason we do not support a `TwoInputProcessFunction` to combine a KeyedStream with a BroadcastStream to result in a KeyedStream? There seems to be a valid use case where a KeyedStream is enriched with a BroadcastStream and returns a Stream that is partitioned in the same way. The key point here is that if the returned stream is a KeyedStream, we require that the partition of input and output be the same. As for the data on the broadcast edge, it will be broadcast to all parallel subtasks, so we cannot keep the data partition consistent. For example, if a specific record is sent to both SubTask1 and SubTask2, after processing, the partition index calculated by the new KeySelector is `1`, then the data distribution of SubTask2 has obviously changed. > 3. There appears to be a typo in the example code. The `SingleStreamProcessFunction` should probably be `OneInputStreamProcessFunction`. Yes, good catch. I have updated this FLIP. > 4. How do we set the global configuration for the ExecutionEnvironment? Currently, we have the StreamExecutionEnvironment.getExecutionEnvironment(Configuration) method to provide the global configuration in the API. This is because we don't want to allow setting config programmatically in the new API; ideally, everything comes from configuration files. However, this may be too ideal, and the specific details need to be considered and discussed in more detail, and I propose to devote a new sub-FLIP to this issue later. 
We can easily provide the `getExecutionEnvironment(Configuration)` or `withConfiguration(Configuration)` method later. > I noticed that there are two `collect` methods in the Collector, one with a timestamp and one without. Could you elaborate on the differences between them? Additionally, in what use case would one use the method that includes the timestamp? That's a good question, and it's mostly used with time-related operators such as Window. First, we want to give the process function the ability to reset timestamps, which makes it more flexible than the original API. Second, we don't want to take the timestamp extraction operator/function as a base primitive, it's more like a high-level extension. Therefore, the framework must provide this functionality. Best regards, Weijie weijie guo 于2024年1月30日周二 11:45写道: > Hi Yunfeng, > > Thank you for your attention > > > 1. Will we provide any API to support choosing which input to consume > between the two inputs of TwoInputStreamProcessFunction? It would be > helpful in online machine learning cases, where a process function > needs to receive the first machine learning model before it can start > predictions on input data. Similar requirements might also exist in > Flink CEP, where a rule set needs to be consumed by the process > function first before it can start matching the event stream against > CEP patterns. > > Good point! I think we can provide a `nextInputSelection()` method for > `TwoInputStreamProcessFunction`. It returns a ·First/Second· enum that > determines which Input the mailbox thread will read next. But I'm > considering putting it in the sub-FLIP related to Join, since features like > HashJoin have a more specific need for this. > > > A typo might exist in the current FLIP describing the API to > generate a global stream, as I can see either global() or coalesce() > in different places of the FLIP. These two methods might need to be > unified into one method. > > Good catch! 
I have updated this FLIP to fix this typo. > > > The order of parameters in the current ProcessFunction is (record, > context, output), while this FLIP proposes to change the order into > (record, output, context). Is there any reason to make this change? > > No, it's just the order we decide. But please note that there is no > relationship between the two ProcessFunction's anyway. I think it's okay to > use our own order of parameters in new API. > > 4. Why does this FLIP propose to use connectAndProcess() instead of > connect() (+ keyBy()) + process()? The latter looks simpler to me. > > > I actually also considered this way at first, but it would have to > introduce some concepts like ConnectedStreams. But we hope that streams > will be more clearly defined in the DataStream API, otherwise we will end > up going the same way as the original API, which you have to understand > `JoinedStreams/ConnectedStreams` and so on. > > > > Best regards, > > Weijie > > > weijie guo 于2024年1月30日周二 11:20写道: > >> Hi Wencong: >> >> Thank you for your attention >> >> > Q1. Other DataStream
Re: [DISCUSS] FLIP-409: DataStream V2 Building Blocks: DataStream, Partitioning and ProcessFunction
Hi Yunfeng, Thank you for your attention. > 1. Will we provide any API to support choosing which input to consume between the two inputs of TwoInputStreamProcessFunction? It would be helpful in online machine learning cases, where a process function needs to receive the first machine learning model before it can start predictions on input data. Similar requirements might also exist in Flink CEP, where a rule set needs to be consumed by the process function first before it can start matching the event stream against CEP patterns. Good point! I think we can provide a `nextInputSelection()` method for `TwoInputStreamProcessFunction`. It returns a `First/Second` enum that determines which input the mailbox thread will read next. But I'm considering putting it in the sub-FLIP related to Join, since features like HashJoin have a more specific need for this. > A typo might exist in the current FLIP describing the API to generate a global stream, as I can see either global() or coalesce() in different places of the FLIP. These two methods might need to be unified into one method. Good catch! I have updated this FLIP to fix this typo. > The order of parameters in the current ProcessFunction is (record, context, output), while this FLIP proposes to change the order into (record, output, context). Is there any reason to make this change? No, it's just the order we decided on. But please note that there is no relationship between the two ProcessFunctions anyway. I think it's okay to use our own order of parameters in the new API. > 4. Why does this FLIP propose to use connectAndProcess() instead of connect() (+ keyBy()) + process()? The latter looks simpler to me. I actually also considered this way at first, but it would have to introduce some concepts like ConnectedStreams. 
But we hope that streams will be more clearly defined in the DataStream API, otherwise we will end up going the same way as the original API, which you have to understand `JoinedStreams/ConnectedStreams` and so on. Best regards, Weijie weijie guo 于2024年1月30日周二 11:20写道: > Hi Wencong: > > Thank you for your attention > > > Q1. Other DataStream types are converted into > Non-Keyed DataStreams by using a "shuffle" operation > to convert Input into output. Does this "shuffle" include the > various repartition operations (rebalance/rescale/shuffle) > from DataStream V1? > > Yes, The name `shuffle` is used only to represent the transformation of an > arbitrary stream into a non-keyed partitioned stream and does not restrict > how the data is partitioned. > > > > Q2. Why is the design for TwoOutputStreamProcessFunction, > when dealing with a KeyedStream, only outputting combinations > of (Keyed + Keyed) and (Non-Keyed + Non-Keyed)? > > In theory, we could only provide functions that return Non-Keyed streams. > If you do want a KeyedStream, you explicitly convert it to a KeyedStream > via keyBy. However, because sometimes data is processed without changing > the partition, we choose to provide an additional KeyedStream counterpart > to reduce the shuffle overhead. We didn't introduce the non-keyed + keyed > combo here simply because it's not very common, and if we really see a lot > of users asking for it later on, it's easy to support it then. > > > Best regards, > > Weijie > > > Xuannan Su 于2024年1月29日周一 18:28写道: > >> Hi Weijie, >> >> Thank you for driving the design of the new DataStream API. I have a >> few questions regarding the FLIP: >> >> 1. In the partitioning section, it says that "broadcast can only be >> used as a side-input of other Inputs." Could you clarify what is meant >> by "side-input"? If I understand correctly, it refer to one of the >> inputs of the `TwoInputStreamProcessFunction`. If that is the case, >> the term "side-input" may not be accurate. 
>> >> 2. Is there a particular reason we do not support a >> `TwoInputProcessFunction` to combine a KeyedStream with a >> BroadcastStream to result in a KeyedStream? There seems to be a valid >> use case where a KeyedStream is enriched with a BroadcastStream and >> returns a Stream that is partitioned in the same way. >> >> 3. There appears to be a typo in the example code. The >> `SingleStreamProcessFunction` should probably be >> `OneInputStreamProcessFunction`. >> >> 4. How do we set the global configuration for the >> ExecutionEnvironment? Currently, we have the >> StreamExecutionEnvironment.getExecutionEnvironment(Configuration) >> method to provide the global configuration in the API. >> >> 5. I noticed that there are two `collect` methods in the Collector, >> one with a timestamp and one without. Could you elaborate on the >> differences between them? Additionally, in what use case would one use >> the method that includes the timestamp? >> >> Best regards, >> Xuannan >> >> >> >> On Fri, Jan 26, 2024 at 2:21 PM Yunfeng Zhou >> wrote: >> > >> > Hi Weijie, >> > >> > Thanks for raising discussions about the new DataStream API. I have a >> > few questions about the content of
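The `nextInputSelection()` idea from Weijie's reply (read the model input to completion before touching the data input) can be sketched like this. The enum and method names follow the discussion but are assumptions, not a committed Flink API.

```java
// Sketch of the two-input "read the model first" pattern discussed above.
// InputSelection and nextInputSelection() are assumptions based on the thread.
public class InputSelectionSketch {

    enum InputSelection { FIRST, SECOND }

    // Hypothetical two-input function: the framework would consult
    // nextInputSelection() to decide which input the mailbox thread reads next.
    interface TwoInputProcessFunction<IN1, IN2> {
        default InputSelection nextInputSelection() { return InputSelection.FIRST; }
        void processRecordFromFirst(IN1 record);
        void processRecordFromSecond(IN2 record);
    }

    // Online-ML style usage from the thread: consume the model (input 1)
    // before scoring any data records (input 2).
    static class ModelThenData implements TwoInputProcessFunction<Double, Double> {
        Double model;      // null until a model has arrived
        double lastScore;

        @Override
        public InputSelection nextInputSelection() {
            return model == null ? InputSelection.FIRST : InputSelection.SECOND;
        }

        @Override
        public void processRecordFromFirst(Double m) { model = m; }

        @Override
        public void processRecordFromSecond(Double x) { lastScore = model * x; }
    }

    public static void main(String[] args) {
        ModelThenData f = new ModelThenData();
        System.out.println(f.nextInputSelection()); // FIRST: must read the model first
        f.processRecordFromFirst(2.0);
        System.out.println(f.nextInputSelection()); // SECOND: model ready, read data
        f.processRecordFromSecond(3.0);
        System.out.println(f.lastScore);            // 6.0
    }
}
```

The same hook would cover the Flink CEP case mentioned above (consume the rule set before matching events), which is why the thread also ties it to the Join sub-FLIP.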
Re: [DISCUSS] FLIP-409: DataStream V2 Building Blocks: DataStream, Partitioning and ProcessFunction
Hi Wencong: Thank you for your attention > Q1. Other DataStream types are converted into Non-Keyed DataStreams by using a "shuffle" operation to convert Input into output. Does this "shuffle" include the various repartition operations (rebalance/rescale/shuffle) from DataStream V1? Yes, The name `shuffle` is used only to represent the transformation of an arbitrary stream into a non-keyed partitioned stream and does not restrict how the data is partitioned. > Q2. Why is the design for TwoOutputStreamProcessFunction, when dealing with a KeyedStream, only outputting combinations of (Keyed + Keyed) and (Non-Keyed + Non-Keyed)? In theory, we could only provide functions that return Non-Keyed streams. If you do want a KeyedStream, you explicitly convert it to a KeyedStream via keyBy. However, because sometimes data is processed without changing the partition, we choose to provide an additional KeyedStream counterpart to reduce the shuffle overhead. We didn't introduce the non-keyed + keyed combo here simply because it's not very common, and if we really see a lot of users asking for it later on, it's easy to support it then. Best regards, Weijie Xuannan Su 于2024年1月29日周一 18:28写道: > Hi Weijie, > > Thank you for driving the design of the new DataStream API. I have a > few questions regarding the FLIP: > > 1. In the partitioning section, it says that "broadcast can only be > used as a side-input of other Inputs." Could you clarify what is meant > by "side-input"? If I understand correctly, it refer to one of the > inputs of the `TwoInputStreamProcessFunction`. If that is the case, > the term "side-input" may not be accurate. > > 2. Is there a particular reason we do not support a > `TwoInputProcessFunction` to combine a KeyedStream with a > BroadcastStream to result in a KeyedStream? There seems to be a valid > use case where a KeyedStream is enriched with a BroadcastStream and > returns a Stream that is partitioned in the same way. > > 3. 
There appears to be a typo in the example code. The > `SingleStreamProcessFunction` should probably be > `OneInputStreamProcessFunction`. > > 4. How do we set the global configuration for the > ExecutionEnvironment? Currently, we have the > StreamExecutionEnvironment.getExecutionEnvironment(Configuration) > method to provide the global configuration in the API. > > 5. I noticed that there are two `collect` methods in the Collector, > one with a timestamp and one without. Could you elaborate on the > differences between them? Additionally, in what use case would one use > the method that includes the timestamp? > > Best regards, > Xuannan > > > > On Fri, Jan 26, 2024 at 2:21 PM Yunfeng Zhou > wrote: > > > > Hi Weijie, > > > > Thanks for raising discussions about the new DataStream API. I have a > > few questions about the content of the FLIP. > > > > 1. Will we provide any API to support choosing which input to consume > > between the two inputs of TwoInputStreamProcessFunction? It would be > > helpful in online machine learning cases, where a process function > > needs to receive the first machine learning model before it can start > > predictions on input data. Similar requirements might also exist in > > Flink CEP, where a rule set needs to be consumed by the process > > function first before it can start matching the event stream against > > CEP patterns. > > > > 2. A typo might exist in the current FLIP describing the API to > > generate a global stream, as I can see either global() or coalesce() > > in different places of the FLIP. These two methods might need to be > > unified into one method. > > > > 3. The order of parameters in the current ProcessFunction is (record, > > context, output), while this FLIP proposes to change the order into > > (record, output, context). Is there any reason to make this change? > > > > 4. Why does this FLIP propose to use connectAndProcess() instead of > > connect() (+ keyBy()) + process()? The latter looks simpler to me. 
> > > > Looking forward to discussing these questions with you. > > > > Best regards, > > Yunfeng Zhou > > > > On Tue, Dec 26, 2023 at 2:44 PM weijie guo > wrote: > > > > > > Hi devs, > > > > > > > > > I'd like to start a discussion about FLIP-409: DataStream V2 Building > > > Blocks: DataStream, Partitioning and ProcessFunction [1]. > > > > > > > > > As the first sub-FLIP for DataStream API V2, we'd like to discuss and > > > try to answer some of the most fundamental questions in stream > > > processing: > > > > > >1. What kinds of data streams do we have? > > >2. How to partition data over the streams? > > >3. How to define a processing on the data stream? > > > > > > The answer to these questions involve three core concepts: DataStream, > > > Partitioning and ProcessFunction. In this FLIP, we will discuss the > > > definitions and related API primitives of these concepts in detail. > > > > > > > > > You can find more details in FLIP-409 [1]. This sub-FLIP is at the > > > heart of the entire
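For reference, the repartitioning strategies that the "shuffle" umbrella covers in DataStream V1 differ only in how each record picks a downstream channel. A minimal sketch of the channel-selection logic under simplifying assumptions (Flink's real rebalance partitioner starts its round-robin from a random initial channel, and rescale round-robins over a local subset of channels; neither detail is modeled here):

```java
import java.util.concurrent.ThreadLocalRandom;

// Channel-selection sketch only; not Flink internals.
public class PartitionerSketch {

    interface ChannelSelector { int selectChannel(int numChannels); }

    // rebalance: strict round-robin over all downstream channels
    // (simplified: real Flink starts from a random initial channel)
    static ChannelSelector rebalance() {
        int[] next = {0};
        return n -> { int c = next[0]; next[0] = (c + 1) % n; return c; };
    }

    // shuffle: a uniformly random channel per record
    static ChannelSelector shuffle() {
        return n -> ThreadLocalRandom.current().nextInt(n);
    }

    public static void main(String[] args) {
        ChannelSelector rb = rebalance();
        for (int i = 0; i < 5; i++) {
            System.out.print(rb.selectChannel(3) + " "); // 0 1 2 0 1
        }
        System.out.println();
    }
}
```

All of these produce a non-keyed partitioned stream, which matches Weijie's point that the V2 `shuffle` name only asserts "arbitrary stream in, non-keyed partitioned stream out" without fixing the distribution strategy.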
Re: Re: [DISCUSS] FLIP-331: Support EndOfStreamWindows and isOutputOnEOF operator attribute to optimize task deployment
Hi all, Thanks for the comments and suggestions. If there are no further comments, I will open the voting thread tomorrow. Best regards, Xuannan On Fri, Jan 26, 2024 at 3:16 PM Dong Lin wrote: > > Thanks Xuannan for the update! > > +1 (binding) > > On Wed, Jan 10, 2024 at 5:54 PM Xuannan Su wrote: > > > Hi all, > > > > After several rounds of offline discussions with Xingtong and Jinhao, > > we have decided to narrow the scope of the FLIP. It will now focus on > > introducing OperatorAttributes that indicate whether an operator emits > > records only after inputs have ended. We will also use the attribute > > to optimize task scheduling for better resource utilization. Setting > > the backlog status and optimizing the operator implementation during > > the backlog will be deferred to future work. > > > > In addition to the change above, we also make the following changes to > > the FLIP to address the problems mentioned by Dong: > > - Public interfaces are updated to reuse the GlobalWindows. > > - Instead of making all outputs of the upstream operators of the > > "isOutputOnlyAfterEndOfStream=true" operator blocking, we only make > > the output of the operator with "isOutputOnlyAfterEndOfStream=true" > > blocking. This can prevent the second problem Dong mentioned. In the > > future, we may introduce an extra OperatorAttributes to indicate if an > > operator has any side output. > > > > I would greatly appreciate any comment or feedback you may have on the > > updated FLIP. > > > > Best regards, > > Xuannan > > > > On Tue, Sep 26, 2023 at 11:24 AM Dong Lin wrote: > > > > > > Hi all, > > > > > > Thanks for the review! > > > > > > Becket and I discussed this FLIP offline and we agreed on several things > > > that need to be improved with this FLIP. I will summarize our discussion > > > with the problems and TODOs. We will update the FLIP and let you know > > once > > > the FLIP is ready for review again. 
> > > > > > 1) Investigate whether it is possible to update the existing > > GlobalWindows > > > in a backward-compatible way and re-use it for the same purpose > > > as EndOfStreamWindows, without introducing EndOfStreamWindows as a new > > > class. > > > > > > Note that GlobalWindows#getDefaultTrigger returns a NeverTrigger instance > > > which will not trigger window's computation even on end-of-inputs. We > > will > > > need to investigate its existing usage and see if we can re-use it in a > > > backward-compatible way. > > > > > > 2) Let JM know whether any operator in the upstream of the operator with > > > "isOutputOnEOF=true" will emit output via any side channel. The FLIP > > should > > > update the execution mode of those operators *only if* all outputs from > > > those operators are emitted only at the end of input. > > > > > > More specifically, the upstream operator might involve a user-defined > > > operator that might emit output directly to an external service, where > > the > > > emission operation is not explicitly expressed as an operator's output > > edge > > > and thus not visible to JM. Similarly, it is also possible for the > > > user-defined operator to register a timer > > > via InternalTimerService#registerEventTimeTimer and emit output to an > > > external service inside Triggerable#onEventTime. There is a chance that > > > users still need related logic to output data in real-time, even if the > > > downstream operators have isOutputOnEOF=true. > > > > > > One possible solution to address this problem is to add an extra > > > OperatorAttribute to specify whether this operator might output records > > in > > > such a way that does not go through operator's output (e.g. side output). > > > Then the JM can safely enable the runtime optimization currently > > described > > > in the FLIP when there is no such operator. 
> > > > > > 3) Create a follow-up FLIP that allows users to specify whether a source > > > with Boundedness=bounded should have isProcessingBacklog=true. > > > > > > This capability would effectively introduce a 3rd strategy to set backlog > > > status (in addition to FLIP-309 and FLIP-328). It might be useful to note > > > that, even though the data in bounded sources are backlog data in most > > > practical use-cases, it is not necessarily true. For example, users might > > > want to start a Flink job to consume real-time data from a Kafka topic > > and > > > specify that the job stops after 24 hours, which means the source is > > > technically bounded while the data is fresh/real-time. > > > > > > This capability is more generic and can cover more use-case than > > > EndOfStreamWindows. On the other hand, EndOfStreamWindows will still be > > > useful in cases where users already need to specify this window assigner > > in > > > a DataStream program, without bothering users to decide whether it is > > safe > > > to treat data in a bounded source as backlog data. > > > > > > > > > Regards, > > > Dong > > > > > > > > > > > > > > > > > > > > > On Mon, Sep 18, 2023
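The narrowed FLIP-331 scope discussed above boils down to a single operator attribute that the scheduler consults when choosing edge types. A hedged sketch follows; the names approximate the discussion (`OperatorAttributes`, `isOutputOnlyAfterEndOfStream`) rather than the final API, and the real scheduler decision involves far more than one string.

```java
// Toy model of the FLIP-331 attribute; names approximate the discussion.
public class OperatorAttributesSketch {

    record OperatorAttributes(boolean outputOnlyAfterEndOfStream) {}

    // Per the updated FLIP: only the output edge of the attributed operator
    // itself becomes blocking; upstream edges are left pipelined (this avoids
    // breaking upstream operators that emit via side channels).
    static String outputEdgeType(OperatorAttributes attrs) {
        return attrs.outputOnlyAfterEndOfStream() ? "BLOCKING" : "PIPELINED";
    }

    public static void main(String[] args) {
        System.out.println(outputEdgeType(new OperatorAttributes(true)));  // BLOCKING
        System.out.println(outputEdgeType(new OperatorAttributes(false))); // PIPELINED
    }
}
```

A blocking edge lets the scheduler free the upstream operator's resources before deploying the downstream task, which is the resource-utilization benefit the thread is after.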
Re: [DISCUSS] FLIP-408: [Umbrella] Introduce DataStream API V2
Thanks for working on this, Weijie. The design flaws of the current DataStream API (i.e., V1) have been a pain for a long time. It's great to see efforts going on trying to resolve them. Significant changes to such an important and comprehensive set of public APIs deserve caution. From that perspective, the ideas of introducing a new set of APIs that gradually replace the current one, splitting the introduction of the new APIs into many separate FLIPs, and making intermediate APIs @Experimental until all of them are completed make great sense to me. Besides, the ideas of generalized watermarks and execution hints sound quite interesting. Looking forward to more detailed discussions in the corresponding sub-FLIPs. +1 for the roadmap. Best, Xintong On Tue, Jan 30, 2024 at 11:00 AM weijie guo wrote: > Hi Wencong: > > > The Processing TimerService is currently > defined as one of the basic primitives, partly because it's understood that > you have to choose between processing time and event time. > The other part of the reason is that it needs to work based on the task's > mailbox thread model to avoid concurrency issues. Could you clarify the > second > part of the reason? > > Since the processing logic of the operators takes place in the mailbox > thread, the processing timer's callback function must also be executed in > the mailbox to ensure thread safety. > If we do not define the Processing TimerService as primitive, there is no > way for the user to dispatch custom logic to the mailbox thread. > > > Best regards, > > Weijie > > > Xuannan Su 于2024年1月29日周一 17:12写道: > > > Hi Weijie, > > > > Thanks for driving the work! There are indeed many pain points in the > > current DataStream API, which are challenging to resolve with its > > existing design. It is a great opportunity to propose a new DataStream > > API that tackles these issues. I like the way we've divided the FLIP > > into multiple sub-FLIPs; the roadmap is clear and comprehensible. 
+1 > > for the umbrella FLIP. I am eager to see the sub-FLIPs! > > > > Best regards, > > Xuannan > > > > > > > > > > On Wed, Jan 24, 2024 at 8:55 PM Wencong Liu > wrote: > > > > > > Hi Weijie, > > > > > > > > > Thank you for the effort you've put into the DataStream API ! By > > reorganizing and > > > redesigning the DataStream API, as well as addressing some of the > > unreasonable > > > designs within it, we can enhance the efficiency of job development for > > developers. > > > It also allows developers to design more flexible Flink jobs to meet > > business requirements. > > > > > > > > > I have conducted a comprehensive review of the DataStream API design in > > versions > > > 1.18 and 1.19. I found quite a few functional defects in the DataStream > > API, such as the > > > lack of corresponding APIs in batch processing scenarios. In the > > upcoming 1.20 version, > > > I will further improve the DataStream API in batch computing scenarios. > > > > > > > > > The issues existing in the old DataStream API (which can be referred to > > as V1) can be > > > addressed from a design perspective in the initial version of V2. I > hope > > to also have the > > > opportunity to participate in the development of DataStream V2 and > make > > my contribution. > > > > > > > > > Regarding FLIP-408, I have a question: The Processing TimerService is > > currently > > > defined as one of the basic primitives, partly because it's understood > > that > > > you have to choose between processing time and event time. > > > The other part of the reason is that it needs to work based on the > task's > > > mailbox thread model to avoid concurrency issues. Could you clarify the > > second > > > part of the reason? 
> > > > > > Best, > > > Wencong Liu > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > At 2023-12-26 14:42:20, "weijie guo" > wrote: > > > >Hi devs, > > > > > > > > > > > >I'd like to start a discussion about FLIP-408: [Umbrella] Introduce > > > >DataStream API V2 [1]. > > > > > > > > > > > >The DataStream API is one of the two main APIs that Flink provides for > > > >writing data processing programs. As an API that was introduced > > > >practically since day-1 of the project and has been evolved for nearly > > > >a decade, we are observing more and more problems of it. Improvements > > > >on these problems require significant breaking changes, which makes > > > >in-place refactor impractical. Therefore, we propose to introduce a > > > >new set of APIs, the DataStream API V2, to gradually replace the > > > >original DataStream API. > > > > > > > > > > > >The proposal to introduce a whole set new API is complex and includes > > > >massive changes. We are planning to break it down into multiple > > > >sub-FLIPs for incremental discussion. This FLIP is only used as an > > > >umbrella, mainly focusing on motivation, goals, and overall planning. > > > >That is to say, more design and implementation details will be > > > >discussed in
Re: [DISCUSS] FLIP-408: [Umbrella] Introduce DataStream API V2
Hi Wencong: > The Processing TimerService is currently defined as one of the basic primitives, partly because it's understood that you have to choose between processing time and event time. The other part of the reason is that it needs to work based on the task's mailbox thread model to avoid concurrency issues. Could you clarify the second part of the reason? Since the processing logic of the operators takes place in the mailbox thread, the processing timer's callback function must also be executed in the mailbox to ensure thread safety. If we do not define the Processing TimerService as primitive, there is no way for the user to dispatch custom logic to the mailbox thread. Best regards, Weijie Xuannan Su 于2024年1月29日周一 17:12写道: > Hi Weijie, > > Thanks for driving the work! There are indeed many pain points in the > current DataStream API, which are challenging to resolve with its > existing design. It is a great opportunity to propose a new DataStream > API that tackles these issues. I like the way we've divided the FLIP > into multiple sub-FLIPs; the roadmap is clear and comprehensible. +1 > for the umbrella FLIP. I am eager to see the sub-FLIPs! > > Best regards, > Xuannan > > > > > On Wed, Jan 24, 2024 at 8:55 PM Wencong Liu wrote: > > > > Hi Weijie, > > > > > > Thank you for the effort you've put into the DataStream API ! By > reorganizing and > > redesigning the DataStream API, as well as addressing some of the > unreasonable > > designs within it, we can enhance the efficiency of job development for > developers. > > It also allows developers to design more flexible Flink jobs to meet > business requirements. > > > > > > I have conducted a comprehensive review of the DataStream API design in > versions > > 1.18 and 1.19. I found quite a few functional defects in the DataStream > API, such as the > > lack of corresponding APIs in batch processing scenarios. 
In the > upcoming 1.20 version, > > I will further improve the DataStream API in batch computing scenarios. > > > > > > The issues existing in the old DataStream API (which can be referred to > as V1) can be > > addressed from a design perspective in the initial version of V2. I hope > to also have the > > opportunity to participate in the development of DataStream V2 and make > my contribution. > > > > > > Regarding FLIP-408, I have a question: The Processing TimerService is > currently > > defined as one of the basic primitives, partly because it's understood > that > > you have to choose between processing time and event time. > > The other part of the reason is that it needs to work based on the task's > > mailbox thread model to avoid concurrency issues. Could you clarify the > second > > part of the reason? > > > > Best, > > Wencong Liu > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > At 2023-12-26 14:42:20, "weijie guo" wrote: > > >Hi devs, > > > > > > > > >I'd like to start a discussion about FLIP-408: [Umbrella] Introduce > > >DataStream API V2 [1]. > > > > > > > > >The DataStream API is one of the two main APIs that Flink provides for > > >writing data processing programs. As an API that was introduced > > >practically since day-1 of the project and has been evolved for nearly > > >a decade, we are observing more and more problems of it. Improvements > > >on these problems require significant breaking changes, which makes > > >in-place refactor impractical. Therefore, we propose to introduce a > > >new set of APIs, the DataStream API V2, to gradually replace the > > >original DataStream API. > > > > > > > > >The proposal to introduce a whole set new API is complex and includes > > >massive changes. We are planning to break it down into multiple > > >sub-FLIPs for incremental discussion. This FLIP is only used as an > > >umbrella, mainly focusing on motivation, goals, and overall planning. 
> > >That is to say, more design and implementation details will be > > >discussed in other FLIPs. > > > > > > > > >Given that it's hard to imagine the detailed design of the new API if > > >we're just talking about this umbrella FLIP, and we probably won't be > > >able to give an opinion on it. Therefore, I have prepared two > > >sub-FLIPs [2][3] at the same time, and the discussion of them will be > > >posted later in separate threads. > > > > > > > > >Looking forward to hearing from you, thanks! > > > > > > > > >Best regards, > > > > > >Weijie > > > > > > > > > > > >[1] > > > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-408%3A+%5BUmbrella%5D+Introduce+DataStream+API+V2 > > > > > >[2] > > > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-409%3A+DataStream+V2+Building+Blocks%3A+DataStream%2C+Partitioning+and+ProcessFunction > > > > > > > > >[3] > > > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-410%3A++Config%2C+Context+and+Processing+Timer+Service+of+DataStream+API+V2 >
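Weijie's point above (processing-timer callbacks must be dispatched to the task's mailbox thread to stay thread-safe) can be illustrated with a stdlib-only toy model, independent of Flink's actual internals. The class name `MailboxSketch` and the two-action setup are illustrative assumptions, not Flink code:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Toy analogue of the mailbox model: all user logic runs on one thread. */
public class MailboxSketch {

    public static List<String> run() throws InterruptedException {
        BlockingQueue<Runnable> mailbox = new LinkedBlockingQueue<>();
        // Only ever touched from the mailbox thread, so no synchronization needed.
        List<String> executedOn = new ArrayList<>();

        // A processing-time "timer" fires on its own scheduler thread, but it
        // must not run the user callback there: it only enqueues it into the mailbox.
        ScheduledExecutorService timerService = Executors.newSingleThreadScheduledExecutor();
        timerService.schedule(
                () -> mailbox.add(() -> executedOn.add(Thread.currentThread().getName())),
                10, TimeUnit.MILLISECONDS);

        // Regular record processing is enqueued the same way.
        mailbox.add(() -> executedOn.add(Thread.currentThread().getName()));

        // A single mailbox thread drains the queue and executes both actions.
        Thread mailboxThread = new Thread(() -> {
            try {
                for (int i = 0; i < 2; i++) {
                    mailbox.take().run();
                }
            } catch (InterruptedException ignored) {
            }
        }, "mailbox-thread");
        mailboxThread.start();
        mailboxThread.join();
        timerService.shutdownNow();
        return executedOn;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(run()); // [mailbox-thread, mailbox-thread]
    }
}
```

Both the record action and the timer callback report the same thread name, which is the property the mailbox model guarantees.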
RE: [VOTE] Release flink-connector-jdbc, release candidate #2
+1(non-binding) - Release notes look good - Tag is present in Github - Validated checksum hash - Verified signature - Verified web PR and left minor comments Best, Jiabao On 2024/01/30 00:17:54 Sergey Nuyanzin wrote: > Hi everyone, > Please review and vote on the release candidate #2 for the version > 3.1.2, as follows: > [ ] +1, Approve the release > [ ] -1, Do not approve the release (please provide specific comments) > > This version is compatible with Flink 1.16.x, 1.17.x and 1.18.x. > > The complete staging area is available for your review, which includes: > * JIRA release notes [1], > * the official Apache source release to be deployed to dist.apache.org > [2], which are signed with the key with fingerprint > 1596BBF0726835D8 [3], > * all artifacts to be deployed to the Maven Central Repository [4], > * source code tag v3.1.2-rc2 [5], > * website pull request listing the new release [6]. > > The vote will be open for at least 72 hours. It is adopted by majority > approval, with at least 3 PMC affirmative votes. > > Thanks, > Release Manager > > [1] > https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522=12354088 > [2] > https://dist.apache.org/repos/dist/dev/flink/flink-connector-jdbc-3.1.2-rc2 > [3] https://dist.apache.org/repos/dist/release/flink/KEYS > [4] https://repository.apache.org/content/repositories/orgapacheflink-1704/ > [5] https://github.com/apache/flink-connector-jdbc/releases/tag/v3.1.2-rc2 > [6] https://github.com/apache/flink-web/pull/707 >
[jira] [Created] (FLINK-34269) Sync maven-shade-plugin across modules in flink-shaded
Sergey Nuyanzin created FLINK-34269: --- Summary: Sync maven-shade-plugin across modules in flink-shaded Key: FLINK-34269 URL: https://issues.apache.org/jira/browse/FLINK-34269 Project: Flink Issue Type: Technical Debt Components: BuildSystem / Shaded Affects Versions: shaded-18.0 Reporter: Sergey Nuyanzin Assignee: Sergey Nuyanzin -- This message was sent by Atlassian Jira (v8.20.10#820010)
[VOTE] Release flink-connector-jdbc, release candidate #2
Hi everyone, Please review and vote on the release candidate #2 for the version 3.1.2, as follows: [ ] +1, Approve the release [ ] -1, Do not approve the release (please provide specific comments) This version is compatible with Flink 1.16.x, 1.17.x and 1.18.x. The complete staging area is available for your review, which includes: * JIRA release notes [1], * the official Apache source release to be deployed to dist.apache.org [2], which are signed with the key with fingerprint 1596BBF0726835D8 [3], * all artifacts to be deployed to the Maven Central Repository [4], * source code tag v3.1.2-rc2 [5], * website pull request listing the new release [6]. The vote will be open for at least 72 hours. It is adopted by majority approval, with at least 3 PMC affirmative votes. Thanks, Release Manager [1] https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522=12354088 [2] https://dist.apache.org/repos/dist/dev/flink/flink-connector-jdbc-3.1.2-rc2 [3] https://dist.apache.org/repos/dist/release/flink/KEYS [4] https://repository.apache.org/content/repositories/orgapacheflink-1704/ [5] https://github.com/apache/flink-connector-jdbc/releases/tag/v3.1.2-rc2 [6] https://github.com/apache/flink-web/pull/707
[jira] [Created] (FLINK-34268) Add a test to verify if restore test exists for ExecNode
Bonnie Varghese created FLINK-34268: --- Summary: Add a test to verify if restore test exists for ExecNode Key: FLINK-34268 URL: https://issues.apache.org/jira/browse/FLINK-34268 Project: Flink Issue Type: Sub-task Components: Table SQL / Planner Reporter: Bonnie Varghese Assignee: Bonnie Varghese -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (FLINK-34267) Python connector test fails when running on MacBook with m1 processor
Aleksandr Pilipenko created FLINK-34267: --- Summary: Python connector test fails when running on MacBook with m1 processor Key: FLINK-34267 URL: https://issues.apache.org/jira/browse/FLINK-34267 Project: Flink Issue Type: Bug Components: Connectors / Common Environment: m1 MacBook Pro MacOS 14.2.1 Reporter: Aleksandr Pilipenko Attempt to execute lint_python.sh on m1 macbook fails while trying to install miniconda environment {code} =installing environment= installing wget... install wget... [SUCCESS] installing miniconda... download miniconda... download miniconda... [SUCCESS] installing conda... tail: illegal offset -- +018838: Invalid argument tail: illegal offset -- +018838: Invalid argument /Users/apilipenko/Dev/flink-connector-aws/flink-python/dev/download/miniconda.sh: line 353: /Users/apilipenko/Dev/flink-connector-aws/flink-python/dev/.conda/preconda.tar.bz2: No such file or directory upgrade pip... ./dev/lint-python.sh: line 215: /Users/apilipenko/Dev/flink-connector-aws/flink-python/dev/.conda/bin/python: No such file or directory upgrade pip... [SUCCESS] install conda ... [SUCCESS] install miniconda... [SUCCESS] installing python environment... installing python3.7... ./dev/lint-python.sh: line 247: /Users/apilipenko/Dev/flink-connector-aws/flink-python/dev/.conda/bin/conda: No such file or directory conda install 3.7 retrying 1/3 ./dev/lint-python.sh: line 254: /Users/apilipenko/Dev/flink-connector-aws/flink-python/dev/.conda/bin/conda: No such file or directory conda install 3.7 retrying 2/3 ./dev/lint-python.sh: line 254: /Users/apilipenko/Dev/flink-connector-aws/flink-python/dev/.conda/bin/conda: No such file or directory conda install 3.7 retrying 3/3 ./dev/lint-python.sh: line 254: /Users/apilipenko/Dev/flink-connector-aws/flink-python/dev/.conda/bin/conda: No such file or directory conda install 3.7 failed after retrying 3 times.You can retry to execute the script again. {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
Re: [VOTE] Release flink-connector-parent, release candidate #1
- Inspected the source for licenses and corresponding headers - Checksums and signature OK +1 (binding) On Tue, Jan 23, 2024 at 4:08 PM Etienne Chauchot wrote: > > Hi everyone, > > Please review and vote on the release candidate #1 for the version > 1.1.0, as follows: > > [ ] +1, Approve the release > [ ] -1, Do not approve the release (please provide specific comments) > > The complete staging area is available for your review, which includes: > * JIRA release notes [1], > * the official Apache source release to be deployed to dist.apache.org > [2], which are signed with the key with fingerprint > D1A76BA19D6294DD0033F6843A019F0B8DD163EA [3], > * all artifacts to be deployed to the Maven Central Repository [4], > * source code tag v1.1.0-rc1 [5], > * website pull request listing the new release [6] > > * confluence wiki: connector parent upgrade to version 1.1.0 that will > be validated after the artifact is released (there is no PR mechanism on > the wiki) [7] > > The vote will be open for at least 72 hours. It is adopted by majority > approval, with at least 3 PMC affirmative votes. > > Thanks, > > Etienne > > [1] > https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522=12353442 > [2] > https://dist.apache.org/repos/dist/dev/flink/flink-connector-parent-1.1.0-rc1 > [3] https://dist.apache.org/repos/dist/release/flink/KEYS > [4] https://repository.apache.org/content/repositories/orgapacheflink-1698/ > [5] > https://github.com/apache/flink-connector-shared-utils/releases/tag/v1.1.0-rc1 > [6] https://github.com/apache/flink-web/pull/717 > > [7] > https://cwiki.apache.org/confluence/display/FLINK/Externalized+Connector+development
[jira] [Created] (FLINK-34266) Output ratios should be computed over the whole metric window instead of averaged
Gyula Fora created FLINK-34266: -- Summary: Output ratios should be computed over the whole metric window instead of averaged Key: FLINK-34266 URL: https://issues.apache.org/jira/browse/FLINK-34266 Project: Flink Issue Type: Improvement Components: Autoscaler Reporter: Gyula Fora Currently, output ratios are computed during metric collection based on the current in/out metrics and stored as part of the collected metrics. During evaluation, the output ratios previously computed are then averaged together in the metric window. This, however, leads to incorrect computation due to the nature of the computation and averaging. Example: Let's look at a window operator that simply sorts and re-emits events in windows. During the window collection phase, the output ratio will be computed and stored as 0. When the window fires, the output ratio will be last_input_rate / window_size. Depending on the last input rate observation, the averaged value can be off in either direction. -- This message was sent by Atlassian Jira (v8.20.10#820010)
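The window-operator example in the ticket can be made concrete with a small stdlib-only sketch. The class `OutputRatioSketch` and the sample numbers are illustrative assumptions, not the autoscaler's actual code: the whole-window ratio is the true 1.0, while the averaged per-interval ratios are skewed by the last input rate.

```java
/** Toy numbers illustrating why averaging per-interval output ratios is biased. */
public class OutputRatioSketch {

    /** Ratio computed over the whole metric window: total out / total in. */
    public static double windowRatio(double[] in, double[] out) {
        double sumIn = 0, sumOut = 0;
        for (int i = 0; i < in.length; i++) {
            sumIn += in[i];
            sumOut += out[i];
        }
        return sumOut / sumIn;
    }

    /** Current behavior per the ticket: per-interval ratios averaged together. */
    public static double averagedRatio(double[] in, double[] out) {
        double sum = 0;
        for (int i = 0; i < in.length; i++) {
            sum += out[i] / in[i];
        }
        return sum / in.length;
    }

    public static void main(String[] args) {
        // A window operator buffers input and re-emits everything when the
        // window fires in the last interval; the last input rate is low.
        double[] in = {100, 100, 10};
        double[] out = {0, 0, 210};
        System.out.println(windowRatio(in, out));   // 1.0 (210 records out / 210 in)
        System.out.println(averagedRatio(in, out)); // 7.0 (skewed by last input rate)
    }
}
```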
[jira] [Created] (FLINK-34265) Add the doc of named parameters
Feng Jin created FLINK-34265: Summary: Add the doc of named parameters Key: FLINK-34265 URL: https://issues.apache.org/jira/browse/FLINK-34265 Project: Flink Issue Type: Sub-task Components: Documentation, Table SQL / Planner Reporter: Feng Jin -- This message was sent by Atlassian Jira (v8.20.10#820010)
RE: [VOTE] Release flink-connector-mongodb 1.1.0, release candidate #1
Thanks Leonard for driving this. +1(non-binding) - Release notes look good - Tag is present in Github - Validated checksum hash - Verified signature - Build the source with Maven by jdk8,11,17,21 - Verified web PR and left minor comments - Run a filter push down test by sql-client on Flink 1.18.1 and it works well Best, Jiabao On 2024/01/29 12:33:23 Leonard Xu wrote: > Hey all, > > Please help review and vote on the release candidate #1 for the version 1.1.0 > of the > Apache Flink MongoDB Connector as follows: > > [ ] +1, Approve the release > [ ] -1, Do not approve the release (please provide specific comments) > > The complete staging area is available for your review, which includes: > * JIRA release notes [1], > * The official Apache source release to be deployed to dist.apache.org [2], > which are signed with the key with fingerprint > 5B2F6608732389AEB67331F5B197E1F1108998AD [3], > * All artifacts to be deployed to the Maven Central Repository [4], > * Source code tag v.1.0-rc1 [5], > * Website pull request listing the new release [6]. > > The vote will be open for at least 72 hours. It is adopted by majority > approval, with at least 3 PMC affirmative votes. > > > Best, > Leonard > [1] > https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522=12353483 > [2] > https://dist.apache.org/repos/dist/dev/flink/flink-connector-mongodb-1.1.0-rc1/ > [3] https://dist.apache.org/repos/dist/release/flink/KEYS > [4] https://repository.apache.org/content/repositories/orgapacheflink-1702/ > [5] https://github.com/apache/flink-connector-mongodb/tree/v1.1.0-rc1 > [6] https://github.com/apache/flink-web/pull/719
[jira] [Created] (FLINK-34264) Prometheuspushgateway's hosturl can use basicAuth certification login
Yangzhou Huang created FLINK-34264: -- Summary: Prometheuspushgateway's hosturl can use basicAuth certification login Key: FLINK-34264 URL: https://issues.apache.org/jira/browse/FLINK-34264 Project: Flink Issue Type: Improvement Components: Runtime / Metrics Affects Versions: 1.19.0 Reporter: Yangzhou Huang Fix For: 1.19.0 The scenario is as follows: the Pushgateway is protected by Basic auth, but the current code does not support Basic auth credentials. hostUrl e.g.: [https://username:password@localhost:9091|https://username:password@localhost:9091/] If this proposal is approved, I will propose a PR improvement. -- This message was sent by Atlassian Jira (v8.20.10#820010)
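For reference, a sketch of how credentials embedded in such a hostUrl could be turned into an `Authorization` header using only the JDK. `PushgatewayAuthSketch` is a hypothetical name and this is not the actual reporter code, just the general technique:

```java
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

/** Sketch: derive a Basic auth header from userinfo embedded in the host URL. */
public class PushgatewayAuthSketch {

    public static String basicAuthHeader(String hostUrl) throws Exception {
        URL url = new URL(hostUrl);
        String userInfo = url.getUserInfo(); // "username:password", or null if absent
        if (userInfo == null) {
            return null; // no credentials in the URL, no header needed
        }
        return "Basic "
                + Base64.getEncoder()
                        .encodeToString(userInfo.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws Exception {
        System.out.println(basicAuthHeader("https://username:password@localhost:9091"));
        // Basic dXNlcm5hbWU6cGFzc3dvcmQ=
    }
}
```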
Re: Re: [VOTE] FLIP-417: Expose JobManagerOperatorMetrics via REST API
+1 (binding) On Mon, Jan 29, 2024 at 5:45 AM Maximilian Michels wrote: > +1 (binding) > > On Fri, Jan 26, 2024 at 6:03 AM Rui Fan <1996fan...@gmail.com> wrote: > > > > +1(binding) > > > > Best, > > Rui > > > > On Fri, Jan 26, 2024 at 11:55 AM Xuyang wrote: > > > > > +1 (non-binding) > > > > > > > > > -- > > > > > > Best! > > > Xuyang > > > > > > > > > > > > > > > > > > 在 2024-01-26 10:12:34,"Hang Ruan" 写道: > > > >Thanks for the FLIP. > > > > > > > >+1 (non-binding) > > > > > > > >Best, > > > >Hang > > > > > > > >Mason Chen 于2024年1月26日周五 04:51写道: > > > > > > > >> Hi Devs, > > > >> > > > >> I would like to start a vote on FLIP-417: Expose > > > JobManagerOperatorMetrics > > > >> via REST API [1] which has been discussed in this thread [2]. > > > >> > > > >> The vote will be open for at least 72 hours unless there is an > > > objection or > > > >> not enough votes. > > > >> > > > >> [1] > > > >> > > > >> > > > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-417%3A+Expose+JobManagerOperatorMetrics+via+REST+API > > > >> [2] > https://lists.apache.org/thread/tt0hf6kf5lcxd7g62v9dhpn3z978pxw0 > > > >> > > > >> Best, > > > >> Mason > > > >> > > > >
[jira] [Created] (FLINK-34263) Converting double to decimal may fail
Caican Cai created FLINK-34263: -- Summary: Converting double to decimal may fail Key: FLINK-34263 URL: https://issues.apache.org/jira/browse/FLINK-34263 Project: Flink Issue Type: Bug Components: API / Core Affects Versions: 1.18.1 Reporter: Caican Cai Fix For: 1.18.1 Converting a double to decimal fails when the value is infinity: an error is reported when the cast is evaluated. Reproducer:
{code:java}
/*
 * Licensed to the Apache Software Foundation (ASF) under one
 * or more contributor license agreements. See the NOTICE file
 * distributed with this work for additional information
 * regarding copyright ownership. The ASF licenses this file
 * to you under the Apache License, Version 2.0 (the
 * "License"); you may not use this file except in compliance
 * with the License. You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package org.apache.flink.table.examples.java.basics;

import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

/** The famous word count example that shows a minimal Flink SQL job in batch execution mode. */
public final class WordCountSQLExample {

    public static void main(String[] args) throws Exception {
        // set up the Table API
        final EnvironmentSettings settings =
                EnvironmentSettings.newInstance().inBatchMode().build();
        final TableEnvironment tableEnv = TableEnvironment.create(settings);

        // execute a Flink SQL job and print the result locally
        tableEnv.executeSql(
                        // define the aggregation
                        "SELECT CAST(power(0,-3) AS DECIMAL), word, SUM(frequency) AS `count`\n"
                                // read from an artificial fixed-size table with rows and columns
                                + "FROM (\n"
                                + " VALUES ('Hello', 1), ('Ciao', 1), ('Hello', 2)\n"
                                + ")\n"
                                // name the table and its columns
                                + "AS WordTable(word, frequency)\n"
                                // group for aggregation
                                + "GROUP BY word")
                .print();
    }
}
{code}
-- This message was sent by Atlassian Jira (v8.20.10#820010)
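A plausible root cause, shown with a plain-JDK sketch (the class `DecimalCastSketch` is illustrative, not Flink code): `power(0, -3)` evaluates to positive infinity, and `java.math.BigDecimal` has no representation for infinite values, so constructing one throws `NumberFormatException`:

```java
import java.math.BigDecimal;

/** power(0, -3) evaluates to Infinity, which has no BigDecimal representation. */
public class DecimalCastSketch {

    public static String tryConvert(double value) {
        try {
            return new BigDecimal(value).toPlainString();
        } catch (NumberFormatException e) {
            return "failed: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        double v = Math.pow(0, -3); // positive infinity
        System.out.println(Double.isInfinite(v)); // true
        System.out.println(tryConvert(v));        // failed: <exception message>
        System.out.println(tryConvert(1.5));      // 1.5
    }
}
```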
[VOTE] Release flink-connector-mongodb 1.1.0, release candidate #1
Hey all, Please help review and vote on the release candidate #1 for the version 1.1.0 of the Apache Flink MongoDB Connector as follows: [ ] +1, Approve the release [ ] -1, Do not approve the release (please provide specific comments) The complete staging area is available for your review, which includes: * JIRA release notes [1], * The official Apache source release to be deployed to dist.apache.org [2], which are signed with the key with fingerprint 5B2F6608732389AEB67331F5B197E1F1108998AD [3], * All artifacts to be deployed to the Maven Central Repository [4], * Source code tag v1.1.0-rc1 [5], * Website pull request listing the new release [6]. The vote will be open for at least 72 hours. It is adopted by majority approval, with at least 3 PMC affirmative votes. Best, Leonard [1] https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522=12353483 [2] https://dist.apache.org/repos/dist/dev/flink/flink-connector-mongodb-1.1.0-rc1/ [3] https://dist.apache.org/repos/dist/release/flink/KEYS [4] https://repository.apache.org/content/repositories/orgapacheflink-1702/ [5] https://github.com/apache/flink-connector-mongodb/tree/v1.1.0-rc1 [6] https://github.com/apache/flink-web/pull/719
Re: [DISCUSS] FLIP-410: Config, Context and Processing Timer Service of DataStream API V2
Hi Weijie, Thanks for the FLIP! I have a few questions regarding the FLIP. 1. +1 to only use XXXParititionStream if users only need to use the configurable PartitionStream. If there are use cases for both, perhaps we could use `ProcessConfigurableNonKeyedPartitionStream` or `ConfigurableNonKeyedPartitionStream` for simplicity. 2. Should we allow users to set custom configurations through the `ProcessConfigurable` interface and access these configurations in the `ProcessFunction` via `RuntimeContext`? I believe it would be useful for process function developers to be able to define custom configurations. 3. How can users define custom metrics within the `ProcessFunction`? Will there be a method like `getMetricGroup` available in the `RuntimeContext`? Best, Xuannan On Fri, Jan 26, 2024 at 2:38 PM Yunfeng Zhou wrote: > > Hi Weijie, > > Thanks for introducing this FLIP! I have a few questions about the > designs proposed. > > 1. Would it be better to have all XXXPartitionStream classes implement > ProcessConfigurable, instead of defining both XXXPartitionStream and > ProcessConfigurableAndXXXPartitionStream? I wonder whether users would > need to operate on a non-configurable PartitionStream. > > 2. The name "ProcessConfigurable" seems a little ambiguous to me. Will > there be classes other than XXXPartitionStream that implement this > interface? Will "Process" be accurate enough to describe > PartitionStream and those classes? > > 3. Apart from the detailed withConfigFoo(foo)/withConfigBar(bar) > methods, would it be better to also add a general > withConfig(configKey, configValue) method to the ProcessConfigurable > interface? Adding a method for each configuration might harm the > readability and compatibility of configurations. > > Looking forward to your response. 
> > Best regards, > Yunfeng Zhou > > On Tue, Dec 26, 2023 at 2:47 PM weijie guo wrote: > > > > Hi devs, > > > > > > I'd like to start a discussion about FLIP-410: Config, Context and > > Processing Timer Service of DataStream API V2 [1]. This is the second > > sub-FLIP of DataStream API V2. > > > > > > In FLIP-409 [2], we have defined the most basic primitive of > > DataStream V2. On this basis, this FLIP will further answer several > > important questions closely related to it: > > > >1. > >How to configure the processing over the datastreams, such as > > setting the parallelism. > >2. > >How to get access to the runtime contextual information and > > services from inside the process functions. > >3. How to work with processing-time timers. > > > > You can find more details in this FLIP. Its relationship with other > > sub-FLIPs can be found in the umbrella FLIP > > [3]. > > > > > > Looking forward to hearing from you, thanks! > > > > > > Best regards, > > > > Weijie > > > > > > [1] > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-410%3A++Config%2C+Context+and+Processing+Timer+Service+of+DataStream+API+V2 > > > > [2] > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-409%3A+DataStream+V2+Building+Blocks%3A+DataStream%2C+Partitioning+and+ProcessFunction > > > > [3] > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-408%3A+%5BUmbrella%5D+Introduce+DataStream+API+V2
Re: Re: [VOTE] FLIP-417: Expose JobManagerOperatorMetrics via REST API
+1 (binding) On Fri, Jan 26, 2024 at 6:03 AM Rui Fan <1996fan...@gmail.com> wrote: > > +1(binding) > > Best, > Rui > > On Fri, Jan 26, 2024 at 11:55 AM Xuyang wrote: > > > +1 (non-binding) > > > > > > -- > > > > Best! > > Xuyang > > > > > > > > > > > > 在 2024-01-26 10:12:34,"Hang Ruan" 写道: > > >Thanks for the FLIP. > > > > > >+1 (non-binding) > > > > > >Best, > > >Hang > > > > > >Mason Chen 于2024年1月26日周五 04:51写道: > > > > > >> Hi Devs, > > >> > > >> I would like to start a vote on FLIP-417: Expose > > JobManagerOperatorMetrics > > >> via REST API [1] which has been discussed in this thread [2]. > > >> > > >> The vote will be open for at least 72 hours unless there is an > > objection or > > >> not enough votes. > > >> > > >> [1] > > >> > > >> > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-417%3A+Expose+JobManagerOperatorMetrics+via+REST+API > > >> [2] https://lists.apache.org/thread/tt0hf6kf5lcxd7g62v9dhpn3z978pxw0 > > >> > > >> Best, > > >> Mason > > >> > >
Re: [DISCUSS] FLIP-409: DataStream V2 Building Blocks: DataStream, Partitioning and ProcessFunction
Hi Weijie,

Thank you for driving the design of the new DataStream API. I have a few questions regarding the FLIP:

1. In the partitioning section, it says that "broadcast can only be used as a side-input of other Inputs." Could you clarify what is meant by "side-input"? If I understand correctly, it refers to one of the inputs of the `TwoInputStreamProcessFunction`. If that is the case, the term "side-input" may not be accurate.

2. Is there a particular reason we do not support a `TwoInputProcessFunction` that combines a KeyedStream with a BroadcastStream to produce a KeyedStream? There seems to be a valid use case where a KeyedStream is enriched with a BroadcastStream and returns a stream that is partitioned in the same way.

3. There appears to be a typo in the example code. The `SingleStreamProcessFunction` should probably be `OneInputStreamProcessFunction`.

4. How do we set the global configuration for the ExecutionEnvironment? Currently, we have the StreamExecutionEnvironment.getExecutionEnvironment(Configuration) method to provide the global configuration in the API.

5. I noticed that there are two `collect` methods in the Collector, one with a timestamp and one without. Could you elaborate on the differences between them? Additionally, in what use case would one use the method that includes the timestamp?

Best regards,
Xuannan

On Fri, Jan 26, 2024 at 2:21 PM Yunfeng Zhou wrote:
>
> Hi Weijie,
>
> Thanks for raising discussions about the new DataStream API. I have a
> few questions about the content of the FLIP.
>
> 1. Will we provide any API to support choosing which input to consume
> between the two inputs of TwoInputStreamProcessFunction? It would be
> helpful in online machine learning cases, where a process function
> needs to receive the first machine learning model before it can start
> predictions on input data. Similar requirements might also exist in
> Flink CEP, where a rule set needs to be consumed by the process
> function first before it can start matching the event stream against
> CEP patterns.
>
> 2. A typo might exist in the current FLIP describing the API to
> generate a global stream, as I can see either global() or coalesce()
> in different places of the FLIP. These two methods might need to be
> unified into one method.
>
> 3. The order of parameters in the current ProcessFunction is (record,
> context, output), while this FLIP proposes to change the order into
> (record, output, context). Is there any reason to make this change?
>
> 4. Why does this FLIP propose to use connectAndProcess() instead of
> connect() (+ keyBy()) + process()? The latter looks simpler to me.
>
> Looking forward to discussing these questions with you.
>
> Best regards,
> Yunfeng Zhou
>
> On Tue, Dec 26, 2023 at 2:44 PM weijie guo wrote:
> >
> > Hi devs,
> >
> > I'd like to start a discussion about FLIP-409: DataStream V2 Building
> > Blocks: DataStream, Partitioning and ProcessFunction [1].
> >
> > As the first sub-FLIP for DataStream API V2, we'd like to discuss and
> > try to answer some of the most fundamental questions in stream
> > processing:
> >
> > 1. What kinds of data streams do we have?
> > 2. How to partition data over the streams?
> > 3. How to define a processing on the data stream?
> >
> > The answers to these questions involve three core concepts: DataStream,
> > Partitioning and ProcessFunction. In this FLIP, we will discuss the
> > definitions and related API primitives of these concepts in detail.
> >
> > You can find more details in FLIP-409 [1]. This sub-FLIP is at the
> > heart of the entire DataStream API V2, and its relationship with other
> > sub-FLIPs can be found in the umbrella FLIP [2].
> >
> > Looking forward to hearing from you, thanks!
> >
> > Best regards,
> >
> > Weijie
> >
> > [1]
> > https://cwiki.apache.org/confluence/display/FLINK/FLIP-409%3A+DataStream+V2+Building+Blocks%3A+DataStream%2C+Partitioning+and+ProcessFunction
> >
> > [2]
> > https://cwiki.apache.org/confluence/display/FLINK/FLIP-408%3A+%5BUmbrella%5D+Introduce+DataStream+API+V2
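To make the API questions above concrete, here is a minimal self-contained sketch (not the real Flink API — all interface names and signatures are illustrative stand-ins modeled on the FLIP's terminology) of the proposed (record, output, context) parameter order, the two `collect` variants on the Collector, and a two-input function that enriches a keyed main input with a broadcast side input:

```java
import java.util.ArrayList;
import java.util.List;

public class ProcessFunctionSketch {

    /** Stand-in for the FLIP's Collector, with and without a timestamp. */
    interface Output<OUT> {
        void collect(OUT record);
        void collect(OUT record, long timestamp); // timestamped variant
    }

    /** Stand-in for the runtime context passed to each invocation. */
    interface PartitionedContext { }

    /** The FLIP's proposed ordering: (record, output, context). */
    interface OneInputStreamProcessFunction<IN, OUT> {
        void processRecord(IN record, Output<OUT> output, PartitionedContext ctx);
    }

    /** Two inputs, e.g. a keyed main input plus a broadcast side input. */
    interface TwoInputStreamProcessFunction<IN1, IN2, OUT> {
        void processRecordFromFirstInput(IN1 record, Output<OUT> output, PartitionedContext ctx);
        void processRecordFromSecondInput(IN2 record, Output<OUT> output, PartitionedContext ctx);
    }

    public static void main(String[] args) {
        List<String> sink = new ArrayList<>();
        Output<String> out = new Output<String>() {
            public void collect(String r) { sink.add(r); }
            public void collect(String r, long ts) { sink.add(r + "@" + ts); }
        };

        TwoInputStreamProcessFunction<String, String, String> enrich =
            new TwoInputStreamProcessFunction<String, String, String>() {
                String rule = "default";
                public void processRecordFromFirstInput(String record, Output<String> o, PartitionedContext c) {
                    o.collect(record + "/" + rule);  // main (keyed) input, enriched
                }
                public void processRecordFromSecondInput(String record, Output<String> o, PartitionedContext c) {
                    rule = record;                   // broadcast side input updates local state
                }
            };

        enrich.processRecordFromSecondInput("ruleV2", out, null);
        enrich.processRecordFromFirstInput("eventA", out, null);
        System.out.println(sink); // [eventA/ruleV2]
    }
}
```

The enrichment function never repartitions its main input, which is why question 2 above asks whether the output could keep the KeyedStream property.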
Re: [VOTE] FLIP-418: Show data skew score on Flink Dashboard
At 2024-01-29 18:09:10, "Kartoglu, Emre" wrote:
> Hello,
>
> I'd like to call votes on FLIP-418: Show data skew score on Flink Dashboard.
>
> FLIP:
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-418%3A+Show+data+skew+score+on+Flink+Dashboard
> Discussion: https://lists.apache.org/thread/m5ockoork0h2zr78h77dcrn71rbt35ql
>
> Kind regards,
> Emre
[VOTE] FLIP-418: Show data skew score on Flink Dashboard
Hello,

I'd like to call votes on FLIP-418: Show data skew score on Flink Dashboard.

FLIP: https://cwiki.apache.org/confluence/display/FLINK/FLIP-418%3A+Show+data+skew+score+on+Flink+Dashboard
Discussion: https://lists.apache.org/thread/m5ockoork0h2zr78h77dcrn71rbt35ql

Kind regards,
Emre
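For readers unfamiliar with the proposal: the exact skew-score formula is defined in FLIP-418 itself. As a rough illustration only, one plausible way to quantify skew from the per-subtask record counts the dashboard already exposes (this is a hypothetical metric, not necessarily the FLIP's formula) is the maximum deviation from the mean, expressed as a percentage of the mean:

```java
public class SkewScoreSketch {

    // Hypothetical skew metric: max deviation of per-subtask record counts
    // from their mean, as a percentage of the mean. 0 = perfectly balanced.
    static double skewScore(long[] recordsPerSubtask) {
        double mean = 0;
        for (long n : recordsPerSubtask) mean += n;
        mean /= recordsPerSubtask.length;
        if (mean == 0) return 0;
        double maxDev = 0;
        for (long n : recordsPerSubtask) maxDev = Math.max(maxDev, Math.abs(n - mean));
        return 100.0 * maxDev / mean;
    }

    public static void main(String[] args) {
        System.out.println(skewScore(new long[] {100, 100, 100, 100})); // balanced: 0.0
        System.out.println(skewScore(new long[] {400, 0, 0, 0}));       // skewed: 300.0
    }
}
```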
[jira] [Created] (FLINK-34262) Support to set user specified labels for Rest Service Object
Prabhu Joseph created FLINK-34262:
-------------------------------------

Summary: Support to set user specified labels for Rest Service Object
Key: FLINK-34262
URL: https://issues.apache.org/jira/browse/FLINK-34262
Project: Flink
Issue Type: Improvement
Components: Deployment / Kubernetes
Affects Versions: 1.18.1
Reporter: Prabhu Joseph

Flink allows users to label JM and TM pods; the rest service object also requires labeling.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
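For context, a sketch of what such a configuration might look like next to the existing pod-label options. The `kubernetes.rest-service.labels` key below is the proposed addition and its name is an assumption, not an existing option:

```yaml
# Existing options: label the JobManager and TaskManager pods.
kubernetes.jobmanager.labels: team:data-platform,env:prod
kubernetes.taskmanager.labels: team:data-platform,env:prod
# Hypothetical key for this improvement (name not final): label the
# rest Service object as well.
kubernetes.rest-service.labels: team:data-platform,env:prod
```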
[jira] [Created] (FLINK-34261) When slotmanager.redundant-taskmanager-num is set to a value greater than 1, redundant task managers may be repeatedly released and requested
junzhong qin created FLINK-34261:
------------------------------------

Summary: When slotmanager.redundant-taskmanager-num is set to a value greater than 1, redundant task managers may be repeatedly released and requested
Key: FLINK-34261
URL: https://issues.apache.org/jira/browse/FLINK-34261
Project: Flink
Issue Type: Bug
Reporter: junzhong qin
Attachments: image-2024-01-29-17-29-15-453.png

Redundant task managers are extra task managers started by Flink to speed up job recovery in case of failures due to task manager loss. But when we configure

{code:java}
slotmanager.redundant-taskmanager-num: 2 // any value greater than 1{code}

Flink will release and request redundant TMs repeatedly. We can reproduce this situation with the [Flink Kubernetes Operator (using minikube here)|https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/]. Here is an example yaml file:

{code:yaml}
# redundant-tm.yaml
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
#   http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

apiVersion: flink.apache.org/v1beta1
kind: FlinkDeployment
metadata:
  name: redundant-tm
spec:
  image: flink:1.18
  flinkVersion: v1_18
  flinkConfiguration:
    taskmanager.numberOfTaskSlots: "2"
    slotmanager.redundant-taskmanager-num: "2"
    cluster.fine-grained-resource-management.enabled: "false"
  serviceAccount: flink
  jobManager:
    resource:
      memory: "1024m"
      cpu: 1
  taskManager:
    resource:
      memory: "1024m"
      cpu: 1
  job:
    jarURI: local:///opt/flink/examples/streaming/StateMachineExample.jar
    parallelism: 3
    upgradeMode: stateless
  logConfiguration:
    log4j-console.properties: |+
      rootLogger.level = DEBUG
      rootLogger.appenderRef.file.ref = LogFile
      rootLogger.appenderRef.console.ref = LogConsole
      appender.file.name = LogFile
      appender.file.type = File
      appender.file.append = false
      appender.file.fileName = ${sys:log.file}
      appender.file.layout.type = PatternLayout
      appender.file.layout.pattern = %d{yyyy-MM-dd HH:mm:ss,SSS} %-5p %-60c %x - %m%n{code}

After executing:

{code:java}
kubectl create -f redundant-tm.yaml
kubectl port-forward svc/redundant-tm 8081{code}

we can see the redundant TM being repeatedly released and requested in the JM's log:

{code:java}
2024-01-29 09:26:25,033 INFO  org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - Registering TaskManager with ResourceID redundant-tm-taskmanager-1-4 (pekko.tcp://flink@10.244.1.196:6122/user/rpc/taskmanager_0) at ResourceManager
2024-01-29 09:26:25,060 DEBUG org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager [] - Registering task executor redundant-tm-taskmanager-1-4 under 44c649b2d84e87cdd5e6c53971f8b877 at the slot manager.
2024-01-29 09:26:25,061 INFO  org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - Worker redundant-tm-taskmanager-1-4 is registered.
2024-01-29 09:26:25,061 INFO  org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - Worker redundant-tm-taskmanager-1-4 with resource spec WorkerResourceSpec {cpuCores=1.0, taskHeapSize=25.600mb (26843542 bytes), taskOffHeapSize=0 bytes, networkMemSize=64.000mb (67108864 bytes), managedMemSize=230.400mb (241591914 bytes), numSlots=2} was requested in current attempt. Current pending count after registering: 1.
2024-01-29 09:26:25,196 DEBUG org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - Update resource declarations to [ResourceDeclaration{spec=WorkerResourceSpec {cpuCores=1.0, taskHeapSize=25.600mb (26843542 bytes), taskOffHeapSize=0 bytes, networkMemSize=64.000mb (67108864 bytes), managedMemSize=230.400mb (241591914 bytes), numSlots=2}, numNeeded=3, unwantedWorkers=[]}].
2024-01-29 09:26:25,196 INFO  org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - need release 1 workers, current worker number 4, declared worker number 3
2024-01-29 09:26:25,199 INFO
{code}
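A toy model of the kind of feedback loop the log shows ("current worker number 4, declared worker number 3"). This only illustrates one plausible mechanism and is not the confirmed root cause, which is tracked in FLINK-34261: if one code path sizes the redundant capacity in slots while another sizes it in whole TaskManagers, the two disagree whenever redundant-taskmanager-num > 1, so the manager keeps releasing and re-requesting a worker.

```java
public class RedundantTmOscillationSketch {

    // Hypothetical accounting #1: treat the redundancy setting as extra
    // slots and round up to whole TaskManagers.
    static int neededCountingRedundantSlots(int jobSlots, int redundant, int slotsPerTm) {
        return (int) Math.ceil((jobSlots + redundant) / (double) slotsPerTm);
    }

    // Hypothetical accounting #2: treat the redundancy setting as extra
    // whole TaskManagers on top of what the job needs.
    static int neededCountingRedundantTms(int jobSlots, int redundant, int slotsPerTm) {
        return (int) Math.ceil(jobSlots / (double) slotsPerTm) + redundant;
    }

    public static void main(String[] args) {
        // The example deployment: parallelism 3 (3 slots needed),
        // 2 slots per TM, redundant-taskmanager-num = 2.
        int jobSlots = 3, redundant = 2, slotsPerTm = 2;
        int declared = neededCountingRedundantSlots(jobSlots, redundant, slotsPerTm); // 3
        int target = neededCountingRedundantTms(jobSlots, redundant, slotsPerTm);     // 4
        // With declared=3 but target=4, one worker is released and then
        // immediately requested again, forever.
        System.out.println("declared=" + declared + " target=" + target);
    }
}
```

With redundant-taskmanager-num = 1 the two formulas happen to agree here, which would explain why the oscillation only appears for values greater than 1.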
Re: [DISCUSS] FLIP-408: [Umbrella] Introduce DataStream API V2
Hi Weijie,

Thanks for driving the work! There are indeed many pain points in the current DataStream API, which are challenging to resolve with its existing design. It is a great opportunity to propose a new DataStream API that tackles these issues. I like the way we've divided the FLIP into multiple sub-FLIPs; the roadmap is clear and comprehensible.

+1 for the umbrella FLIP. I am eager to see the sub-FLIPs!

Best regards,
Xuannan

On Wed, Jan 24, 2024 at 8:55 PM Wencong Liu wrote:
>
> Hi Weijie,
>
> Thank you for the effort you've put into the DataStream API! By
> reorganizing and redesigning the DataStream API, as well as addressing
> some of its unreasonable designs, we can enhance the efficiency of job
> development for developers. It also allows developers to design more
> flexible Flink jobs to meet business requirements.
>
> I have conducted a comprehensive review of the DataStream API design in
> versions 1.18 and 1.19. I found quite a few functional defects in the
> DataStream API, such as the lack of corresponding APIs in batch
> processing scenarios. In the upcoming 1.20 version, I will further
> improve the DataStream API in batch computing scenarios.
>
> The issues existing in the old DataStream API (which can be referred to
> as V1) can be addressed from a design perspective in the initial version
> of V2. I hope to also have the opportunity to participate in the
> development of DataStream V2 and make my contribution.
>
> Regarding FLIP-408, I have a question: The Processing TimerService is
> currently defined as one of the basic primitives, partly because it's
> understood that you have to choose between processing time and event
> time. The other part of the reason is that it needs to work based on the
> task's mailbox thread model to avoid concurrency issues. Could you
> clarify the second part of the reason?
>
> Best,
> Wencong Liu
>
> At 2023-12-26 14:42:20, "weijie guo" wrote:
> > Hi devs,
> >
> > I'd like to start a discussion about FLIP-408: [Umbrella] Introduce
> > DataStream API V2 [1].
> >
> > The DataStream API is one of the two main APIs that Flink provides for
> > writing data processing programs. As an API that has existed
> > practically since day 1 of the project and has evolved for nearly a
> > decade, we are observing more and more problems with it. Improvements
> > on these problems require significant breaking changes, which makes an
> > in-place refactor impractical. Therefore, we propose to introduce a
> > new set of APIs, the DataStream API V2, to gradually replace the
> > original DataStream API.
> >
> > The proposal to introduce a whole new set of APIs is complex and
> > includes massive changes. We are planning to break it down into
> > multiple sub-FLIPs for incremental discussion. This FLIP is only used
> > as an umbrella, mainly focusing on motivation, goals, and overall
> > planning. That is to say, more design and implementation details will
> > be discussed in other FLIPs.
> >
> > Given that it's hard to imagine the detailed design of the new API if
> > we're just talking about this umbrella FLIP, we probably won't be able
> > to form an opinion on it. Therefore, I have prepared two sub-FLIPs
> > [2][3] at the same time, and their discussions will be posted later in
> > separate threads.
> >
> > Looking forward to hearing from you, thanks!
> >
> > Best regards,
> >
> > Weijie
> >
> > [1]
> > https://cwiki.apache.org/confluence/display/FLINK/FLIP-408%3A+%5BUmbrella%5D+Introduce+DataStream+API+V2
> >
> > [2]
> > https://cwiki.apache.org/confluence/display/FLINK/FLIP-409%3A+DataStream+V2+Building+Blocks%3A+DataStream%2C+Partitioning+and+ProcessFunction
> >
> > [3]
> > https://cwiki.apache.org/confluence/display/FLINK/FLIP-410%3A++Config%2C+Context+and+Processing+Timer+Service+of+DataStream+API+V2
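On the mailbox question in the thread above: the core idea is that record processing and timer callbacks are both enqueued as actions and drained by a single task thread, so user code never sees them run concurrently. A minimal self-contained sketch of that idea (illustrative only — not Flink's actual runtime classes):

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class MailboxSketch {
    public static void main(String[] args) {
        BlockingQueue<Runnable> mailbox = new LinkedBlockingQueue<>();
        StringBuilder state = new StringBuilder(); // shared state, no locks needed

        // Records and timer firings are enqueued as "mail", possibly by
        // different producer threads (network threads, timer service)...
        mailbox.add(() -> state.append("record1;"));
        mailbox.add(() -> state.append("timer@100;")); // a timer callback
        mailbox.add(() -> state.append("record2;"));

        // ...but all mail is drained by the single task thread, so the
        // record handler and the timer callback can never run concurrently.
        Thread taskThread = new Thread(() -> {
            Runnable mail;
            while ((mail = mailbox.poll()) != null) {
                mail.run();
            }
        });
        taskThread.start();
        try {
            taskThread.join();
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        }

        System.out.println(state); // record1;timer@100;record2;
    }
}
```

This is presumably why the Processing TimerService is tied to the task's mailbox model: firing timers through the mailbox is what spares user functions from synchronizing timer callbacks against record processing.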
[DISCUSS] Drop support for HBase v1
Hi all,

While working on adding support for Flink 1.19 for HBase, we've run into a dependency convergence issue because HBase v1 relies on a really old version of Guava. HBase v2 has been available since May 2018, and there have been no new releases of HBase v1 since August 2022.

I would like to propose that the Flink HBase connector drops support for HBase v1 and only continues to support HBase v2 in the future. I don't think this requires a full FLIP and vote, but I do want to start a discussion thread for this.

Best regards,
Martijn
[jira] [Created] (FLINK-34260) Make flink-connector-aws compatible with SinkV2 changes
Martijn Visser created FLINK-34260:
--------------------------------------

Summary: Make flink-connector-aws compatible with SinkV2 changes
Key: FLINK-34260
URL: https://issues.apache.org/jira/browse/FLINK-34260
Project: Flink
Issue Type: Bug
Components: Connectors / AWS
Affects Versions: aws-connector-4.3.0
Reporter: Martijn Visser

https://github.com/apache/flink-connector-aws/actions/runs/7689300085/job/20951547366#step:9:798

{code:java}
Error: Failed to execute goal org.apache.maven.plugins:maven-compiler-plugin:3.8.0:testCompile (default-testCompile) on project flink-connector-dynamodb: Compilation failure
Error: /home/runner/work/flink-connector-aws/flink-connector-aws/flink-connector-aws/flink-connector-dynamodb/src/test/java/org/apache/flink/connector/dynamodb/sink/DynamoDbSinkWriterTest.java:[357,40] incompatible types: org.apache.flink.connector.base.sink.writer.TestSinkInitContext cannot be converted to org.apache.flink.api.connector.sink2.Sink.InitContext
{code}
[jira] [Created] (FLINK-34259) flink-connector-jdbc fails to compile with NPE on hasGenericTypesDisabled
Martijn Visser created FLINK-34259:
--------------------------------------

Summary: flink-connector-jdbc fails to compile with NPE on hasGenericTypesDisabled
Key: FLINK-34259
URL: https://issues.apache.org/jira/browse/FLINK-34259
Project: Flink
Issue Type: Bug
Components: Connectors / JDBC
Reporter: Martijn Visser

https://github.com/apache/flink-connector-jdbc/actions/runs/7682035724/job/20935884874#step:14:150

{code:java}
Error: Tests run: 10, Failures: 5, Errors: 4, Skipped: 0, Time elapsed: 7.909 s <<< FAILURE! - in org.apache.flink.connector.jdbc.JdbcRowOutputFormatTest
Error: org.apache.flink.connector.jdbc.JdbcRowOutputFormatTest.testInvalidConnectionInJdbcOutputFormat Time elapsed: 3.254 s <<< ERROR!
java.lang.NullPointerException: Cannot invoke "org.apache.flink.api.common.serialization.SerializerConfig.hasGenericTypesDisabled()" because "config" is null
	at org.apache.flink.api.java.typeutils.GenericTypeInfo.createSerializer(GenericTypeInfo.java:85)
	at org.apache.flink.api.java.typeutils.GenericTypeInfo.createSerializer(GenericTypeInfo.java:99)
	at org.apache.flink.connector.jdbc.JdbcTestBase.getSerializer(JdbcTestBase.java:70)
	at org.apache.flink.connector.jdbc.JdbcRowOutputFormatTest.testInvalidConnectionInJdbcOutputFormat(JdbcRowOutputFormatTest.java:336)
	at java.base/java.lang.reflect.Method.invoke(Method.java:568)
	at java.base/java.util.ArrayList.forEach(ArrayList.java:1511)
	at java.base/java.util.ArrayList.forEach(ArrayList.java:1511)
{code}

Seems to be caused by FLINK-34122