[jira] [Created] (FLINK-30110) Enable from-timestamp log scan when timestamp-millis is configured
Jingsong Lee created FLINK-30110: Summary: Enable from-timestamp log scan when timestamp-millis is configured Key: FLINK-30110 URL: https://issues.apache.org/jira/browse/FLINK-30110 Project: Flink Issue Type: Improvement Components: Table Store Reporter: Jingsong Lee Fix For: table-store-0.3.0 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (FLINK-30109) Checked exceptions are sneakily transformed into unchecked exceptions in the Pulsar
Matthias Pohl created FLINK-30109: - Summary: Checked exceptions are sneakily transformed into unchecked exceptions in the Pulsar Key: FLINK-30109 URL: https://issues.apache.org/jira/browse/FLINK-30109 Project: Flink Issue Type: Technical Debt Components: Connectors / Pulsar, Documentation Affects Versions: 1.15.2, 1.16.0, 1.17.0 Reporter: Matthias Pohl [PulsarExceptionUtils|https://github.com/apache/flink/blob/c675f786c51038801161e861826d1c54654f0dde/flink-connectors/flink-connector-pulsar/src/main/java/org/apache/flink/connector/pulsar/common/utils/PulsarExceptionUtils.java#L33] provides {{sneaky*}} utility methods for hiding checked exceptions. This is a rather unusual coding pattern. Based on what's provided in the code, a reader could reasonably worry that calling code does not handle errors properly. Either we remove these methods and add proper exception handling, or we add proper documentation explaining why this workaround is necessary. [~syhily] already hinted in his [FLINK-29830 PR comment|https://github.com/apache/flink/pull/21252#discussion_r1019822514] that this is related to flaws of the Pulsar API. -- This message was sent by Atlassian Jira (v8.20.10#820010)
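For context, the "sneaky throw" idiom such utility methods typically rely on can be sketched in a few lines. This is a generic illustration of the pattern (names here are made up for the example), not the Pulsar connector's actual code:

```java
import java.io.IOException;

public class SneakyDemo {
    // Generic "sneaky throw": type inference resolves E to RuntimeException
    // at the call site, and erasure makes the cast a no-op, so the compiler
    // never forces callers to declare or catch the checked exception.
    @SuppressWarnings("unchecked")
    static <E extends Throwable> void sneakyThrow(Throwable t) throws E {
        throw (E) t;
    }

    // No "throws IOException" clause, yet an IOException escapes at runtime.
    static void failsSneakily() {
        sneakyThrow(new IOException("checked, but hidden from the compiler"));
    }

    // Returns the simple class name of whatever actually escapes.
    static String caughtType() {
        try {
            failsSneakily();
            return "none";
        } catch (Exception e) {
            return e.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        System.out.println(caughtType()); // an IOException escapes unannounced
    }
}
```

This illustrates the concern raised above: a reader cannot tell from method signatures alone whether checked exceptions are being handled, which is why documenting the workaround (or removing it) matters.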
[jira] [Created] (FLINK-30108) flink-core module tests exited with code 143
Leonard Xu created FLINK-30108: -- Summary: flink-core module tests exited with code 143 Key: FLINK-30108 URL: https://issues.apache.org/jira/browse/FLINK-30108 Project: Flink Issue Type: Bug Components: API / Core, Tests Affects Versions: 1.17.0 Reporter: Leonard Xu {noformat} Nov 18 01:02:58 [INFO] Tests run: 16, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 109.22 s - in org.apache.flink.runtime.operators.hash.InPlaceMutableHashTableTest Nov 18 01:18:09 == Nov 18 01:18:09 Process produced no output for 900 seconds. Nov 18 01:18:09 == Nov 18 01:18:09 == Nov 18 01:18:09 The following Java processes are running (JPS) Nov 18 01:18:09 == Picked up JAVA_TOOL_OPTIONS: -XX:+HeapDumpOnOutOfMemoryError Nov 18 01:18:09 924 Launcher Nov 18 01:18:09 23421 surefirebooter1178962604207099497.jar Nov 18 01:18:09 11885 Jps Nov 18 01:18:09 == Nov 18 01:18:09 Printing stack trace of Java process 924 Nov 18 01:18:09 == Picked up JAVA_TOOL_OPTIONS: -XX:+HeapDumpOnOutOfMemoryError Nov 18 01:18:09 2022-11-18 01:18:09 Nov 18 01:18:09 Full thread dump OpenJDK 64-Bit Server VM (25.292-b10 mixed mode): ... ... ... Nov 18 01:18:09 == Nov 18 01:18:09 Printing stack trace of Java process 11885 Nov 18 01:18:09 == 11885: No such process Nov 18 01:18:09 Killing process with pid=923 and all descendants /__w/2/s/tools/ci/watchdog.sh: line 113: 923 Terminated $cmd Nov 18 01:18:10 Process exited with EXIT CODE: 143. Nov 18 01:18:10 Trying to KILL watchdog (919). Nov 18 01:18:10 Searching for .dump, .dumpstream and related files in '/__w/2/s' Nov 18 01:18:16 Moving '/__w/2/s/flink-runtime/target/surefire-reports/2022-11-18T00-55-55_041-jvmRun3.dumpstream' to target directory ('/__w/_temp/debug_files') Nov 18 01:18:16 Moving '/__w/2/s/flink-runtime/target/surefire-reports/2022-11-18T00-55-55_041-jvmRun3.dump' to target directory ('/__w/_temp/debug_files') The STDIO streams did not close within 10 seconds of the exit event from process '/bin/bash'. 
This may indicate a child process inherited the STDIO streams and has not yet exited. ##[error]Bash exited with code '143'. Finishing: Test - core {noformat} https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=43277=logs=0e7be18f-84f2-53f0-a32d-4a5e4a174679=7c1d86e3-35bd-5fd5-3b7c-30c126a78702 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (FLINK-30107) ChangelogRecoveryITCase#testMaterialization
Leonard Xu created FLINK-30107: -- Summary: ChangelogRecoveryITCase#testMaterialization Key: FLINK-30107 URL: https://issues.apache.org/jira/browse/FLINK-30107 Project: Flink Issue Type: Bug Components: Runtime / Checkpointing Affects Versions: 1.16.0 Reporter: Leonard Xu {noformat} Nov 18 06:18:39 [ERROR] Errors: Nov 18 06:18:39 [ERROR] ChangelogRecoveryITCase.testMaterialization Nov 18 06:18:39 [INFO] Run 1: PASS Nov 18 06:18:39 [ERROR] Run 2: org.apache.flink.runtime.JobException: Recovery is suppressed by FixedDelayRestartBackoffTimeStrategy(maxNumberRestartAttempts=2, backoffTimeMS=0) Nov 18 06:18:39 [INFO] Run 3: PASS Nov 18 06:18:39 [INFO] Nov 18 06:18:39 [INFO] {noformat} https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=43279=logs=4d4a0d10-fca2-5507-8eed-c07f0bdf4887=7b25afdf-cc6c-566f-5459-359dc2585798 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (FLINK-30106) Build python wheels on macs failed due to install for crcmod error
Leonard Xu created FLINK-30106: -- Summary: Build python wheels on macs failed due to install for crcmod error Key: FLINK-30106 URL: https://issues.apache.org/jira/browse/FLINK-30106 Project: Flink Issue Type: Bug Components: API / Python, Build System / CI Affects Versions: 1.16.0 Reporter: Leonard Xu {noformat} note: This error originates from a subprocess, and is likely not a problem with pip. ERROR: Failed building wheel for crcmod Running setup.py clean for crcmod error: subprocess-exited-with-error × python setup.py clean did not run successfully. │ exit code: 1 ╰─> [14 lines of output] error: Multiple top-level packages discovered in a flat-layout: ['python3', 'python2']. To avoid accidental inclusion of unwanted files or directories, setuptools will not proceed with this build. If you are trying to create a single distribution with multiple packages on purpose, you should not rely on automatic discovery. Instead, consider the following options: 1. set up custom discovery (`find` directive with `include` or `exclude`) 2. use a `src-layout` 3. explicitly set `py_modules` or `packages` with a list of names To find more information, look for "package discovery" on setuptools docs. [end of output] note: This error originates from a subprocess, and is likely not a problem with pip. ERROR: Failed cleaning build dir for crcmod Building wheel for dill (setup.py): started Building wheel for dill (setup.py): finished with status 'done' Created wheel for dill: filename=dill-0.3.1.1-py3-none-any.whl size=78544 sha256=9ce160b3c3e2e1dcd24e136c59fb84ef9bd072b4715c6b34536d3ff9c39a5962 {noformat} https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=43279=logs=f73b5736-8355-5390-ec71-4dfdec0ce6c5=90f7230e-bf5a-531b-8566-ad48d3e03bbb -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (FLINK-30105) SubtaskExecutionAttemptAccumulatorsInfoTest failed with JVM exit code 239
Qingsheng Ren created FLINK-30105: - Summary: SubtaskExecutionAttemptAccumulatorsInfoTest failed with JVM exit code 239 Key: FLINK-30105 URL: https://issues.apache.org/jira/browse/FLINK-30105 Project: Flink Issue Type: Bug Components: Runtime / REST Affects Versions: 1.17.0 Reporter: Qingsheng Ren [https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=43124=logs=0da23115-68bb-5dcd-192c-bd4c8adebde1=24c3384f-1bcb-57b3-224f-51bf973bbee8=8164] {code:java} Nov 14 08:08:38 [ERROR] org.apache.flink.runtime.rest.messages.job.SubtaskExecutionAttemptAccumulatorsInfoTest Nov 14 08:08:38 [ERROR] org.apache.maven.surefire.booter.SurefireBooterForkException: ExecutionException The forked VM terminated without properly saying goodbye. VM crash or System.exit called? Nov 14 08:08:38 [ERROR] Command was /bin/sh -c cd /__w/1/s/flink-runtime && /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java -XX:+UseG1GC -Xms256m -Xmx768m -jar /__w/1/s/flink-runtime/target/surefire/surefirebooter285048696362473297.jar /__w/1/s/flink-runtime/target/surefire 2022-11-14T08-03-00_124-jvmRun3 surefire3531696916359342131tmp surefire_26906836589585616499tmp Nov 14 08:08:38 [ERROR] Error occurred in starting fork, check output in log Nov 14 08:08:38 [ERROR] Process Exit Code: 239 Nov 14 08:08:38 [ERROR] Crashed tests: Nov 14 08:08:38 [ERROR] org.apache.flink.runtime.rest.messages.job.SubtaskExecutionAttemptAccumulatorsInfoTest Nov 14 08:08:38 [ERROR] at org.apache.maven.plugin.surefire.booterclient.ForkStarter.awaitResultsDone(ForkStarter.java:532) Nov 14 08:08:38 [ERROR] at org.apache.maven.plugin.surefire.booterclient.ForkStarter.runSuitesForkOnceMultiple(ForkStarter.java:405) Nov 14 08:08:38 [ERROR] at org.apache.maven.plugin.surefire.booterclient.ForkStarter.run(ForkStarter.java:321) Nov 14 08:08:38 [ERROR] at org.apache.maven.plugin.surefire.booterclient.ForkStarter.run(ForkStarter.java:266) Nov 14 08:08:38 [ERROR] at 
org.apache.maven.plugin.surefire.AbstractSurefireMojo.executeProvider(AbstractSurefireMojo.java:1314) Nov 14 08:08:38 [ERROR] at org.apache.maven.plugin.surefire.AbstractSurefireMojo.executeAfterPreconditionsChecked(AbstractSurefireMojo.java:1159) Nov 14 08:08:38 [ERROR] at org.apache.maven.plugin.surefire.AbstractSurefireMojo.execute(AbstractSurefireMojo.java:932) Nov 14 08:08:38 [ERROR] at org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo(DefaultBuildPluginManager.java:132) Nov 14 08:08:38 [ERROR] at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:208) Nov 14 08:08:38 [ERROR] at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:153) Nov 14 08:08:38 [ERROR] at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:145) Nov 14 08:08:38 [ERROR] at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:116) Nov 14 08:08:38 [ERROR] at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:80) Nov 14 08:08:38 [ERROR] at org.apache.maven.lifecycle.internal.builder.singlethreaded.SingleThreadedBuilder.build(SingleThreadedBuilder.java:51) Nov 14 08:08:38 [ERROR] at org.apache.maven.lifecycle.internal.LifecycleStarter.execute(LifecycleStarter.java:120) Nov 14 08:08:38 [ERROR] at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:355) Nov 14 08:08:38 [ERROR] at org.apache.maven.DefaultMaven.execute(DefaultMaven.java:155) Nov 14 08:08:38 [ERROR] at org.apache.maven.cli.MavenCli.execute(MavenCli.java:584) Nov 14 08:08:38 [ERROR] at org.apache.maven.cli.MavenCli.doMain(MavenCli.java:216) Nov 14 08:08:38 [ERROR] at org.apache.maven.cli.MavenCli.main(MavenCli.java:160) Nov 14 08:08:38 [ERROR] at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) Nov 14 08:08:38 [ERROR] at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) Nov 14 08:08:38 [ERROR] at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) Nov 14 08:08:38 [ERROR] at java.lang.reflect.Method.invoke(Method.java:498) Nov 14 08:08:38 [ERROR] at org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced(Launcher.java:289) Nov 14 08:08:38 [ERROR] at org.codehaus.plexus.classworlds.launcher.Launcher.launch(Launcher.java:229) Nov 14 08:08:38 [ERROR] at org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode(Launcher.java:415) Nov 14 08:08:38 [ERROR] at org.codehaus.plexus.classworlds.launcher.Launcher.main(Launcher.java:356) Nov 14 08:08:38 [ERROR] Caused by: org.apache.maven.surefire.booter.SurefireBooterForkException: The forked VM terminated without properly saying goodbye. VM crash or System.exit called? Nov 14 08:08:38 [ERROR] Command was /bin/sh -c cd /__w/1/s/flink-runtime && /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java
[jira] [Created] (FLINK-30104) Snapshot deployment failed due to Could not transfer artifact org.apache.flink:flink-runtime-web:jar:1.15-20221119.010836-325 from/to apache.snapshots
Leonard Xu created FLINK-30104: -- Summary: Snapshot deployment failed due to Could not transfer artifact org.apache.flink:flink-runtime-web:jar:1.15-20221119.010836-325 from/to apache.snapshots Key: FLINK-30104 URL: https://issues.apache.org/jira/browse/FLINK-30104 Project: Flink Issue Type: Bug Components: Build System / CI Affects Versions: 1.15.2 Reporter: Leonard Xu {noformat} [ERROR] Failed to execute goal org.apache.maven.plugins:maven-deploy-plugin:2.8.2:deploy (default-deploy) on project flink-runtime-web: Failed to deploy artifacts: Could not transfer artifact org.apache.flink:flink-runtime-web:jar:1.15-20221119.010836-325 from/to apache.snapshots.https (https://repository.apache.org/content/repositories/snapshots): Failed to transfer file: https://repository.apache.org/content/repositories/snapshots/org/apache/flink/flink-runtime-web/1.15-SNAPSHOT/flink-runtime-web-1.15-20221119.010836-325.jar. Return code is: 502, ReasonPhrase: Proxy Error. -> [Help 1] {noformat} https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=43308=logs=eca6b3a6-1600-56cc-916a-c549b3cde3ff=e9844b5e-5aa3-546b-6c3e-5395c7c0cac7 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (FLINK-30103) Test InputFormatCacheLoaderTest.checkCounter failed due to unexpected value on azure
Leonard Xu created FLINK-30103: -- Summary: Test InputFormatCacheLoaderTest.checkCounter failed due to unexpected value on azure Key: FLINK-30103 URL: https://issues.apache.org/jira/browse/FLINK-30103 Project: Flink Issue Type: Bug Components: Table SQL / Runtime Affects Versions: 1.16.0 Reporter: Leonard Xu {noformat} Nov 20 02:43:43 [ERROR] Failures: Nov 20 02:43:43 [ERROR] InputFormatCacheLoaderTest.checkCounter:74 Nov 20 02:43:43 Expecting AtomicInteger(0) to have value: Nov 20 02:43:43 0 Nov 20 02:43:43 but did not {noformat} https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=43319=logs=de826397-1924-5900-0034-51895f69d4b7=f311e913-93a2-5a37-acab-4a63e1328f94 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (FLINK-30102) Unstable test EventTimeWindowCheckpointingITCase.testPreAggregatedSlidingTimeWindow failed runs on azure
Leonard Xu created FLINK-30102: -- Summary: Unstable test EventTimeWindowCheckpointingITCase.testPreAggregatedSlidingTimeWindow failed runs on azure Key: FLINK-30102 URL: https://issues.apache.org/jira/browse/FLINK-30102 Project: Flink Issue Type: Bug Components: Runtime / Checkpointing, Tests Affects Versions: 1.16.0 Reporter: Leonard Xu {noformat} Nov 20 06:26:32 [ERROR] Failures: Nov 20 06:26:32 [ERROR] EventTimeWindowCheckpointingITCase.testPreAggregatedSlidingTimeWindow Nov 20 06:26:32 [INFO] Run 1: PASS Nov 20 06:26:32 [ERROR] Run 2: Job execution failed. Nov 20 06:26:32 [INFO] Run 3: PASS Nov 20 06:26:32 [INFO] Run 4: PASS Nov 20 06:26:32 [INFO] Run 5: PASS Nov 20 06:26:32 [INFO] Run 6: PASS Nov 20 06:26:32 [INFO] Run 7: PASS Nov 20 06:26:32 [INFO] Run 8: PASS Nov 20 06:26:32 [INFO] Run 9: PASS Nov 20 06:26:32 [INFO] Run 10: PASS {noformat} https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=43319=logs=4d4a0d10-fca2-5507-8eed-c07f0bdf4887=7b25afdf-cc6c-566f-5459-359dc2585798=10358 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (FLINK-30101) YARN client should
Zhanghao Chen created FLINK-30101: - Summary: YARN client should Key: FLINK-30101 URL: https://issues.apache.org/jira/browse/FLINK-30101 Project: Flink Issue Type: Improvement Components: Client / Job Submission Affects Versions: 1.16.0 Reporter: Zhanghao Chen Fix For: 1.17.0 *Problem* Currently, the procedure for retrieving a Flink on YARN cluster client is as follows (in the YarnClusterDescriptor#retrieve method): # Get the application report from YARN # Set rest.address & rest.port using the info from the application report # Create a new RestClusterClient using the updated configuration, which will use the client HA service to fetch rest.address & rest.port if HA is enabled Here, we can see that the usage of client HA in step 3 is redundant, as we have already got rest.address & rest.port from the YARN application report. When ZK HA is enabled, it takes ~1.5 s to initialize the client HA services and fetch the rest IP & port. 1.5 s can mean a lot for latency-sensitive client operations. In my company, we use the Flink client to submit short-running session jobs, and e2e latency is critical. The job submission time is around 10 s on average, so 1.5 s would mean a 15% time saving. *Proposal* When retrieving a Flink on YARN cluster client, use StandaloneClientHAServices to create the RestClusterClient instead, as we have already fetched rest.address & rest.port from the YARN application report. This is also what we do in KubernetesClusterDescriptor. -- This message was sent by Atlassian Jira (v8.20.10#820010)
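The proposal above can be sketched with a toy model. All names below are hypothetical stand-ins (not Flink's real classes); the point is that once the REST address is known from the YARN application report, endpoint discovery reduces to a constant supplier, so the expensive ZK-backed lookup can be skipped:

```java
import java.util.function.Supplier;

public class ClientRetrievalSketch {
    // Models ClientHighAvailabilityServices: something that tells the
    // client where the REST endpoint lives. Hypothetical interface.
    interface RestEndpointProvider {
        String getRestEndpoint();
    }

    // Models ZooKeeper-backed discovery: correct, but pays connection and
    // session setup cost (the ~1.5 s observed in the report above).
    static RestEndpointProvider zkBacked(Supplier<String> slowLookup) {
        return slowLookup::get;
    }

    // Models StandaloneClientHAServices: simply returns a pre-fetched
    // address (here, the one already present in the application report).
    static RestEndpointProvider standalone(String host, int port) {
        return () -> host + ":" + port;
    }

    public static void main(String[] args) {
        // Step 2 of the retrieve() flow already produced these values,
        // so step 3 can build the cluster client without touching ZK.
        RestEndpointProvider provider = standalone("yarn-am-host", 8081);
        System.out.println(provider.getRestEndpoint());
    }
}
```

The design choice mirrors the Kubernetes path described in the issue: discovery via HA services is only needed when the endpoint is not already known to the caller.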
Re: [DISCUSS] FLIP-271: Autoscaling
On Sun, Nov 20, 2022 at 7:25 AM Gyula Fóra wrote: > Hi Chen! > > I think in the long term it makes sense to provide some pluggable > mechanisms but it's not completely trivial where exactly you would plug in > your custom logic at this point. > sounds good, more specifically would be great if it can accept input features (including previous scaling decisions) and output decisions. Folks might keep their own secret sauce and avoid patching oss fork. > > In any case the problems you mentioned should be solved robustly by the > algorithm itself without any customization: > - We need to be able to detect ineffective scaling decisions, let's say we > scaled up (expecting better throughput with a higher parallelism) but we > did not get a better processing capacity (this would be the external > service bottleneck) > sounds good, so we would at least try restart job once (optimistic path) as design choice. > - We are evaluating metrics in windows, and we have some flexible > boundaries to avoid scaling on minor load spikes > yes, would be great if user can feed in throughput changes over different time buckets (last 10s, 30s, 1 min, 5 mins) as input features > > Regards, > Gyula > > On Sun, Nov 20, 2022 at 12:28 AM Chen Qin wrote: > > > Hi Gyula, > > > > Do we think the scaler could be a plugin or hard coded ? > > We observed some cases scaler can't address (e.g async io dependency > > service degradation or small spike that doesn't worth restarting job) > > > > Thanks, > > Chen > > > > On Fri, Nov 18, 2022 at 1:03 AM Gyula Fóra wrote: > > > > > Hi Dong! > > > > > > Could you please confirm that your main concerns have been addressed? > > > > > > Some other minor details that might not have been fully clarified: > > > - The prototype has been validated on some production workloads yes > > > - We are only planning to use metrics that are generally available and > > are > > > previously accepted to be standardized connector metrics (not Kafka > > > specific). 
This is actually specified in the FLIP > > > - Even if some metrics (such as pendingRecords) are not accessible the > > > scaling algorithm works and can be used. For source scaling based on > > > utilization alone we still need some trivial modifications on the > > > implementation side. > > > > > > Cheers, > > > Gyula > > > > > > On Thu, Nov 17, 2022 at 5:22 PM Gyula Fóra > wrote: > > > > > > > Hi Dong! > > > > > > > > This is not an experimental feature proposal. The implementation of > the > > > > prototype is still in an experimental phase but by the time the FLIP, > > > > initial prototype and review is done, this should be in a good stable > > > first > > > > version. > > > > This proposal is pretty general as autoscalers/tuners get as far as I > > > > understand and there is no history of any alternative effort that > even > > > > comes close to the applicability of this solution. > > > > > > > > Any large features that were added to Flink in the past have gone > > through > > > > several iterations over the years and the APIs have evolved as they > > > matured. > > > > Something like the autoscaler can only be successful if there is > enough > > > > user exposure and feedback to make it good, putting it in an external > > > repo > > > > will not get us anywhere. > > > > > > > > We have a prototype implementation ready that works well and it is > more > > > or > > > > less feature complete. We proposed this FLIP based on something that > we > > > see > > > > as a working solution, please do not underestimate the effort that > went > > > > into this proposal and the validation of the ideas. So in this sense > > our > > > > approach here is the same as with the Table Store and Kubernetes > > Operator > > > > and other big components of the past. 
On the other hand it's > impossible > > > to > > > > sufficiently explain all the technical depth/implementation details > of > > > such > > > > complex components in FLIPs to 100%, I feel we have a good overview > of > > > the > > > > algorithm in the FLIP and the implementation should cover all > remaining > > > > questions. We will have an extended code review phase following the > > FLIP > > > > vote before this make it into the project. > > > > > > > > I understand your concern regarding the stability of Flink Kubernetes > > > > Operator config and metric names. We have decided to not provide > > > guarantees > > > > there yet but if you feel that it's time for the operator to support > > such > > > > guarantees please open a separate discussion on that topic, I don't > > want > > > to > > > > mix the two problems here. > > > > > > > > Regards, > > > > Gyula > > > > > > > > On Thu, Nov 17, 2022 at 5:07 PM Dong Lin > wrote: > > > > > > > >> Hi Gyula, > > > >> > > > >> If I understand correctly, this autopilot proposal is an > experimental > > > >> feature and its configs/metrics are not mature enough to provide > > > backward > > > >> compatibility yet. And the proposal provides high-level ideas
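The "ineffective scaling decision" check discussed in this thread can be sketched as a simple ratio test. Thresholds and names below are illustrative assumptions for the sketch, not the FLIP's actual algorithm:

```java
public class ScalingEffectiveness {
    /**
     * Illustrative check: after scaling parallelism from oldParallelism to
     * newParallelism, we would expect throughput to grow roughly by their
     * ratio. If the observed gain falls far short of that, the bottleneck
     * is likely external (e.g. an async I/O dependency), and further
     * scale-ups should be blocked rather than retried.
     */
    static boolean scaleUpWasEffective(int oldParallelism, int newParallelism,
                                       double oldRate, double newRate,
                                       double minGainFraction) {
        double expectedGain = (double) newParallelism / oldParallelism;
        double observedGain = newRate / oldRate;
        // Require at least minGainFraction of the ideal linear speedup.
        return observedGain >= 1.0 + (expectedGain - 1.0) * minGainFraction;
    }

    public static void main(String[] args) {
        // Doubled parallelism, throughput barely moved: ineffective.
        System.out.println(scaleUpWasEffective(2, 4, 1000, 1050, 0.5));
        // Doubled parallelism, throughput up ~1.8x: effective.
        System.out.println(scaleUpWasEffective(2, 4, 1000, 1800, 0.5));
    }
}
```

In practice the rates would come from windowed metric averages, per the flexible evaluation boundaries mentioned above, so that a single load spike does not trigger a verdict.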
Re: [DISCUSS] OLM Bundles for Flink Kubernetes Operator
Hi Ted! Thank you for the proposal. I think it would be great to have an official OLM bundle similar to the Helm chart especially if we can generate the bundle during the release process. Do you think it would make sense to generate the bundle as part of the release artifacts so that it could be tested by the community? Also should we include the bundle with the official release artifacts? ( https://dist.apache.org/repos/dist/release/flink/flink-kubernetes-operator-1.2.0/ ) Thanks, Gyula On Sun, Nov 20, 2022 at 5:18 PM Hao t Chang wrote: > Hi Mark, > I got 404 too. Try this if you were installing OLM: > curl -sL > https://github.com/operator-framework/operator-lifecycle-manager/releases/download/v0.22.0/install.sh > | bash -s v0.22.0 >
Re: [DISCUSS] OLM Bundles for Flink Kubernetes Operator
Hi Mark, I got 404 too. Try this if you were installing OLM: curl -sL https://github.com/operator-framework/operator-lifecycle-manager/releases/download/v0.22.0/install.sh | bash -s v0.22.0
Re: [DISCUSS] FLIP-271: Autoscaling
Hi Chen! I think in the long term it makes sense to provide some pluggable mechanisms but it's not completely trivial where exactly you would plug in your custom logic at this point. In any case the problems you mentioned should be solved robustly by the algorithm itself without any customization: - We need to be able to detect ineffective scaling decisions, let's say we scaled up (expecting better throughput with a higher parallelism) but we did not get a better processing capacity (this would be the external service bottleneck) - We are evaluating metrics in windows, and we have some flexible boundaries to avoid scaling on minor load spikes Regards, Gyula On Sun, Nov 20, 2022 at 12:28 AM Chen Qin wrote: > Hi Gyula, > > Do we think the scaler could be a plugin or hard coded ? > We observed some cases scaler can't address (e.g async io dependency > service degradation or small spike that doesn't worth restarting job) > > Thanks, > Chen > > On Fri, Nov 18, 2022 at 1:03 AM Gyula Fóra wrote: > > > Hi Dong! > > > > Could you please confirm that your main concerns have been addressed? > > > > Some other minor details that might not have been fully clarified: > > - The prototype has been validated on some production workloads yes > > - We are only planning to use metrics that are generally available and > are > > previously accepted to be standardized connector metrics (not Kafka > > specific). This is actually specified in the FLIP > > - Even if some metrics (such as pendingRecords) are not accessible the > > scaling algorithm works and can be used. For source scaling based on > > utilization alone we still need some trivial modifications on the > > implementation side. > > > > Cheers, > > Gyula > > > > On Thu, Nov 17, 2022 at 5:22 PM Gyula Fóra wrote: > > > > > Hi Dong! > > > > > > This is not an experimental feature proposal. 
The implementation of the > > > prototype is still in an experimental phase but by the time the FLIP, > > > initial prototype and review is done, this should be in a good stable > > first > > > version. > > > This proposal is pretty general as autoscalers/tuners get as far as I > > > understand and there is no history of any alternative effort that even > > > comes close to the applicability of this solution. > > > > > > Any large features that were added to Flink in the past have gone > through > > > several iterations over the years and the APIs have evolved as they > > matured. > > > Something like the autoscaler can only be successful if there is enough > > > user exposure and feedback to make it good, putting it in an external > > repo > > > will not get us anywhere. > > > > > > We have a prototype implementation ready that works well and it is more > > or > > > less feature complete. We proposed this FLIP based on something that we > > see > > > as a working solution, please do not underestimate the effort that went > > > into this proposal and the validation of the ideas. So in this sense > our > > > approach here is the same as with the Table Store and Kubernetes > Operator > > > and other big components of the past. On the other hand it's impossible > > to > > > sufficiently explain all the technical depth/implementation details of > > such > > > complex components in FLIPs to 100%, I feel we have a good overview of > > the > > > algorithm in the FLIP and the implementation should cover all remaining > > > questions. We will have an extended code review phase following the > FLIP > > > vote before this make it into the project. > > > > > > I understand your concern regarding the stability of Flink Kubernetes > > > Operator config and metric names. 
We have decided to not provide > > guarantees > > > there yet but if you feel that it's time for the operator to support > such > > > guarantees please open a separate discussion on that topic, I don't > want > > to > > > mix the two problems here. > > > > > > Regards, > > > Gyula > > > > > > On Thu, Nov 17, 2022 at 5:07 PM Dong Lin wrote: > > > > > >> Hi Gyula, > > >> > > >> If I understand correctly, this autopilot proposal is an experimental > > >> feature and its configs/metrics are not mature enough to provide > > backward > > >> compatibility yet. And the proposal provides high-level ideas of the > > >> algorithm but it is probably too complicated to explain it end-to-end. > > >> > > >> On the one hand, I do agree that having an auto-tuning prototype, even > > if > > >> not mature, is better than nothing for Flink users. On the other > hand, I > > >> am > > >> concerned that this FLIP seems a bit too experimental, and starting > with > > >> an > > >> immature design might make it harder for us to reach a > production-ready > > >> and > > >> generally applicable auto-tuner in the future. And introducing too > > >> backward > > >> incompatible changes generally hurts users' trust in the Flink > project. > > >> > > >> One alternative might be to develop and experiment with this feature > in > > a > > >> non-Flink
[jira] [Created] (FLINK-30100) Remove the unused CheckpointFailureReason
Rui Fan created FLINK-30100: --- Summary: Remove the unused CheckpointFailureReason Key: FLINK-30100 URL: https://issues.apache.org/jira/browse/FLINK-30100 Project: Flink Issue Type: Improvement Components: Runtime / Checkpointing Affects Versions: 1.16.0 Reporter: Rui Fan Fix For: 1.17.0 Some CheckpointFailureReasons have not been used for a long time and should be removed. E.g: TOO_MANY_CONCURRENT_CHECKPOINTS CHECKPOINT_DECLINED_TASK_NOT_CHECKPOINTING CHECKPOINT_DECLINED_ALIGNMENT_LIMIT_EXCEEDED JOB_FAILURE -- This message was sent by Atlassian Jira (v8.20.10#820010)
Re: [DISCUSS] OLM Bundles for Flink Kubernetes Operator
Hi Chang, The OLM download URL in install.sh returns a 404 (url: https://github.com/operator-framework/operator-lifecycle-manager/releases/tag/v0.22.0). default_base_url=https://github.com/operator-framework/operator-lifecycle-manager/releases/download -----Original Message----- From: dev-return-65502-lifuqiong00=126@flink.apache.org on behalf of Hao t Chang Sent: November 20, 2022 3:40 To: dev@flink.apache.org Subject: [DISCUSS] OLM Bundles for Flink Kubernetes Operator Hi everyone, A while ago, we created the OLM bundle for Flink Kubernetes Operator on OperatorHub [1]; however, the bundle is not endorsed by the Apache Flink community. I would like to ask your opinion on how we can make the bundle more official, or even part of the Flink Kubernetes Operator release. Please look at the document [2] where I have addressed some concerns from the community. I will be happy to answer any questions. Thank you. For people who are not familiar with OLM (Operator Lifecycle Manager): it manages the lifecycle of Operators on Kubernetes clusters and comes by default on the OpenShift Container Platform. OperatorHub is the online catalog for users to search for and install Operators with OLM. [1] https://operatorhub.io/operator/flink-kubernetes-operator [2] https://docs.google.com/document/d/1BiLzfPYb3Thzk01H4teB8_Crqs5FydQYS2Utj11HlLs/edit?usp=sharing -- Best, Ted Chang | Software Engineer | htch...@us.ibm.com