[jira] [Created] (FLINK-30110) Enable from-timestamp log scan when timestamp-millis is configured

2022-11-20 Thread Jingsong Lee (Jira)
Jingsong Lee created FLINK-30110:


 Summary: Enable from-timestamp log scan when timestamp-millis is 
configured
 Key: FLINK-30110
 URL: https://issues.apache.org/jira/browse/FLINK-30110
 Project: Flink
  Issue Type: Improvement
  Components: Table Store
Reporter: Jingsong Lee
 Fix For: table-store-0.3.0






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-30109) Checked exceptions are sneakily transformed into unchecked exceptions in the Pulsar

2022-11-20 Thread Matthias Pohl (Jira)
Matthias Pohl created FLINK-30109:
-

 Summary: Checked exceptions are sneakily transformed into 
unchecked exceptions in the Pulsar
 Key: FLINK-30109
 URL: https://issues.apache.org/jira/browse/FLINK-30109
 Project: Flink
  Issue Type: Technical Debt
  Components: Connectors / Pulsar, Documentation
Affects Versions: 1.15.2, 1.16.0, 1.17.0
Reporter: Matthias Pohl


[PulsarExceptionUtils|https://github.com/apache/flink/blob/c675f786c51038801161e861826d1c54654f0dde/flink-connectors/flink-connector-pulsar/src/main/java/org/apache/flink/connector/pulsar/common/utils/PulsarExceptionUtils.java#L33]
 provides {{sneaky*}} utility methods for hiding checked exceptions. This is 
rather unusual coding style. Based on what's provided in the code, as a reader I 
would have concerns that we're not handling errors properly in calling code.

Either we remove these methods and add proper exception handling, or we add 
proper documentation explaining why this workaround is necessary.

[~syhily] already hinted in his [FLINK-29830 PR 
comment|https://github.com/apache/flink/pull/21252#discussion_r1019822514] that 
this is related to flaws of the Pulsar API.
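For readers unfamiliar with the pattern, the {{sneaky*}} trick generally looks like the following minimal, self-contained sketch (illustrative only, not Flink's actual PulsarExceptionUtils code):

```java
import java.io.IOException;

// Minimal sketch of the "sneaky throws" idiom: an unchecked generic cast
// lets a checked exception escape a method that never declares it.
public class SneakyDemo {

    @SuppressWarnings("unchecked")
    static <E extends Throwable> void sneakyThrow(Throwable t) throws E {
        // E is inferred as RuntimeException at the call site, so the
        // compiler does not force callers to catch or declare anything.
        throw (E) t;
    }

    // Note: no "throws IOException" clause, yet an IOException propagates.
    static String call() {
        try {
            sneakyThrow(new IOException("checked, but undeclared"));
            return "unreachable";
        } catch (Exception e) {
            return e.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        System.out.println(call()); // prints: IOException
    }
}
```

This is why the concern above is legitimate: callers get no compile-time signal that a checked exception can surface.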





[jira] [Created] (FLINK-30108) flink-core module tests exited with code 143

2022-11-20 Thread Leonard Xu (Jira)
Leonard Xu created FLINK-30108:
--

 Summary: flink-core module tests exited with code 143
 Key: FLINK-30108
 URL: https://issues.apache.org/jira/browse/FLINK-30108
 Project: Flink
  Issue Type: Bug
  Components: API / Core, Tests
Affects Versions: 1.17.0
Reporter: Leonard Xu



{noformat}

Nov 18 01:02:58 [INFO] Tests run: 16, Failures: 0, Errors: 0, Skipped: 0, Time 
elapsed: 109.22 s - in 
org.apache.flink.runtime.operators.hash.InPlaceMutableHashTableTest
Nov 18 01:18:09 
==
Nov 18 01:18:09 Process produced no output for 900 seconds.
Nov 18 01:18:09 
==
Nov 18 01:18:09 
==
Nov 18 01:18:09 The following Java processes are running (JPS)
Nov 18 01:18:09 
==
Picked up JAVA_TOOL_OPTIONS: -XX:+HeapDumpOnOutOfMemoryError
Nov 18 01:18:09 924 Launcher
Nov 18 01:18:09 23421 surefirebooter1178962604207099497.jar
Nov 18 01:18:09 11885 Jps
Nov 18 01:18:09 
==
Nov 18 01:18:09 Printing stack trace of Java process 924
Nov 18 01:18:09 
==
Picked up JAVA_TOOL_OPTIONS: -XX:+HeapDumpOnOutOfMemoryError
Nov 18 01:18:09 2022-11-18 01:18:09
Nov 18 01:18:09 Full thread dump OpenJDK 64-Bit Server VM (25.292-b10 mixed 
mode):

...
...
...
Nov 18 01:18:09 
==
Nov 18 01:18:09 Printing stack trace of Java process 11885
Nov 18 01:18:09 
==
11885: No such process
Nov 18 01:18:09 Killing process with pid=923 and all descendants
/__w/2/s/tools/ci/watchdog.sh: line 113:   923 Terminated  $cmd
Nov 18 01:18:10 Process exited with EXIT CODE: 143.
Nov 18 01:18:10 Trying to KILL watchdog (919).
Nov 18 01:18:10 Searching for .dump, .dumpstream and related files in '/__w/2/s'
Nov 18 01:18:16 Moving 
'/__w/2/s/flink-runtime/target/surefire-reports/2022-11-18T00-55-55_041-jvmRun3.dumpstream'
 to target directory ('/__w/_temp/debug_files')
Nov 18 01:18:16 Moving 
'/__w/2/s/flink-runtime/target/surefire-reports/2022-11-18T00-55-55_041-jvmRun3.dump'
 to target directory ('/__w/_temp/debug_files')
The STDIO streams did not close within 10 seconds of the exit event from 
process '/bin/bash'. This may indicate a child process inherited the STDIO 
streams and has not yet exited.
##[error]Bash exited with code '143'.
Finishing: Test - core
{noformat}


https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=43277=logs=0e7be18f-84f2-53f0-a32d-4a5e4a174679=7c1d86e3-35bd-5fd5-3b7c-30c126a78702






[jira] [Created] (FLINK-30107) ChangelogRecoveryITCase#testMaterialization

2022-11-20 Thread Leonard Xu (Jira)
Leonard Xu created FLINK-30107:
--

 Summary: ChangelogRecoveryITCase#testMaterialization
 Key: FLINK-30107
 URL: https://issues.apache.org/jira/browse/FLINK-30107
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Checkpointing
Affects Versions: 1.16.0
Reporter: Leonard Xu



{noformat}

Nov 18 06:18:39 [ERROR] Errors: 
Nov 18 06:18:39 [ERROR] ChangelogRecoveryITCase.testMaterialization
Nov 18 06:18:39 [INFO]   Run 1: PASS
Nov 18 06:18:39 [ERROR]   Run 2: org.apache.flink.runtime.JobException: 
Recovery is suppressed by 
FixedDelayRestartBackoffTimeStrategy(maxNumberRestartAttempts=2, 
backoffTimeMS=0)
Nov 18 06:18:39 [INFO]   Run 3: PASS
Nov 18 06:18:39 [INFO] 
Nov 18 06:18:39 [INFO] 
{noformat}


https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=43279=logs=4d4a0d10-fca2-5507-8eed-c07f0bdf4887=7b25afdf-cc6c-566f-5459-359dc2585798





[jira] [Created] (FLINK-30106) Build python wheels on macs failed due to install for crcmod error

2022-11-20 Thread Leonard Xu (Jira)
Leonard Xu created FLINK-30106:
--

 Summary: Build python wheels on macs failed due to install for 
crcmod error
 Key: FLINK-30106
 URL: https://issues.apache.org/jira/browse/FLINK-30106
 Project: Flink
  Issue Type: Bug
  Components: API / Python, Build System / CI
Affects Versions: 1.16.0
Reporter: Leonard Xu




{noformat}
note: This error originates from a subprocess, and is likely not a 
problem with pip.
ERROR: Failed building wheel for crcmod
Running setup.py clean for crcmod
error: subprocess-exited-with-error
  
× python setup.py clean did not run successfully.
│ exit code: 1
╰─> [14 lines of output]
error: Multiple top-level packages discovered in a flat-layout: 
['python3', 'python2'].
  
To avoid accidental inclusion of unwanted files or directories,
setuptools will not proceed with this build.
  
If you are trying to create a single distribution with multiple 
packages
on purpose, you should not rely on automatic discovery.
Instead, consider the following options:
  
1. set up custom discovery (`find` directive with `include` or 
`exclude`)
2. use a `src-layout`
3. explicitly set `py_modules` or `packages` with a list of names
  
To find more information, look for "package discovery" on 
setuptools docs.
[end of output]
  
note: This error originates from a subprocess, and is likely not a 
problem with pip.
ERROR: Failed cleaning build dir for crcmod
Building wheel for dill (setup.py): started
Building wheel for dill (setup.py): finished with status 'done'
Created wheel for dill: filename=dill-0.3.1.1-py3-none-any.whl 
size=78544 
sha256=9ce160b3c3e2e1dcd24e136c59fb84ef9bd072b4715c6b34536d3ff9c39a5962
{noformat}




https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=43279=logs=f73b5736-8355-5390-ec71-4dfdec0ce6c5=90f7230e-bf5a-531b-8566-ad48d3e03bbb





[jira] [Created] (FLINK-30105) SubtaskExecutionAttemptAccumulatorsInfoTest failed with JVM exit code 239

2022-11-20 Thread Qingsheng Ren (Jira)
Qingsheng Ren created FLINK-30105:
-

 Summary: SubtaskExecutionAttemptAccumulatorsInfoTest failed with 
JVM exit code 239 
 Key: FLINK-30105
 URL: https://issues.apache.org/jira/browse/FLINK-30105
 Project: Flink
  Issue Type: Bug
  Components: Runtime / REST
Affects Versions: 1.17.0
Reporter: Qingsheng Ren


[https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=43124=logs=0da23115-68bb-5dcd-192c-bd4c8adebde1=24c3384f-1bcb-57b3-224f-51bf973bbee8=8164]
{code:java}
Nov 14 08:08:38 [ERROR] 
org.apache.flink.runtime.rest.messages.job.SubtaskExecutionAttemptAccumulatorsInfoTest
Nov 14 08:08:38 [ERROR] 
org.apache.maven.surefire.booter.SurefireBooterForkException: 
ExecutionException The forked VM terminated without properly saying goodbye. VM 
crash or System.exit called?
Nov 14 08:08:38 [ERROR] Command was /bin/sh -c cd /__w/1/s/flink-runtime && 
/usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java -XX:+UseG1GC -Xms256m -Xmx768m 
-jar 
/__w/1/s/flink-runtime/target/surefire/surefirebooter285048696362473297.jar 
/__w/1/s/flink-runtime/target/surefire 2022-11-14T08-03-00_124-jvmRun3 
surefire3531696916359342131tmp surefire_26906836589585616499tmp
Nov 14 08:08:38 [ERROR] Error occurred in starting fork, check output in log
Nov 14 08:08:38 [ERROR] Process Exit Code: 239
Nov 14 08:08:38 [ERROR] Crashed tests:
Nov 14 08:08:38 [ERROR] 
org.apache.flink.runtime.rest.messages.job.SubtaskExecutionAttemptAccumulatorsInfoTest
Nov 14 08:08:38 [ERROR] at 
org.apache.maven.plugin.surefire.booterclient.ForkStarter.awaitResultsDone(ForkStarter.java:532)
Nov 14 08:08:38 [ERROR] at 
org.apache.maven.plugin.surefire.booterclient.ForkStarter.runSuitesForkOnceMultiple(ForkStarter.java:405)
Nov 14 08:08:38 [ERROR] at 
org.apache.maven.plugin.surefire.booterclient.ForkStarter.run(ForkStarter.java:321)
Nov 14 08:08:38 [ERROR] at 
org.apache.maven.plugin.surefire.booterclient.ForkStarter.run(ForkStarter.java:266)
Nov 14 08:08:38 [ERROR] at 
org.apache.maven.plugin.surefire.AbstractSurefireMojo.executeProvider(AbstractSurefireMojo.java:1314)
Nov 14 08:08:38 [ERROR] at 
org.apache.maven.plugin.surefire.AbstractSurefireMojo.executeAfterPreconditionsChecked(AbstractSurefireMojo.java:1159)
Nov 14 08:08:38 [ERROR] at 
org.apache.maven.plugin.surefire.AbstractSurefireMojo.execute(AbstractSurefireMojo.java:932)
Nov 14 08:08:38 [ERROR] at 
org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo(DefaultBuildPluginManager.java:132)
Nov 14 08:08:38 [ERROR] at 
org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:208)
Nov 14 08:08:38 [ERROR] at 
org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:153)
Nov 14 08:08:38 [ERROR] at 
org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:145)
Nov 14 08:08:38 [ERROR] at 
org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:116)
Nov 14 08:08:38 [ERROR] at 
org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:80)
Nov 14 08:08:38 [ERROR] at 
org.apache.maven.lifecycle.internal.builder.singlethreaded.SingleThreadedBuilder.build(SingleThreadedBuilder.java:51)
Nov 14 08:08:38 [ERROR] at 
org.apache.maven.lifecycle.internal.LifecycleStarter.execute(LifecycleStarter.java:120)
Nov 14 08:08:38 [ERROR] at 
org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:355)
Nov 14 08:08:38 [ERROR] at 
org.apache.maven.DefaultMaven.execute(DefaultMaven.java:155)
Nov 14 08:08:38 [ERROR] at 
org.apache.maven.cli.MavenCli.execute(MavenCli.java:584)
Nov 14 08:08:38 [ERROR] at 
org.apache.maven.cli.MavenCli.doMain(MavenCli.java:216)
Nov 14 08:08:38 [ERROR] at org.apache.maven.cli.MavenCli.main(MavenCli.java:160)
Nov 14 08:08:38 [ERROR] at sun.reflect.NativeMethodAccessorImpl.invoke0(Native 
Method)
Nov 14 08:08:38 [ERROR] at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
Nov 14 08:08:38 [ERROR] at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
Nov 14 08:08:38 [ERROR] at java.lang.reflect.Method.invoke(Method.java:498)
Nov 14 08:08:38 [ERROR] at 
org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced(Launcher.java:289)
Nov 14 08:08:38 [ERROR] at 
org.codehaus.plexus.classworlds.launcher.Launcher.launch(Launcher.java:229)
Nov 14 08:08:38 [ERROR] at 
org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode(Launcher.java:415)
Nov 14 08:08:38 [ERROR] at 
org.codehaus.plexus.classworlds.launcher.Launcher.main(Launcher.java:356)
Nov 14 08:08:38 [ERROR] Caused by: 
org.apache.maven.surefire.booter.SurefireBooterForkException: The forked VM 
terminated without properly saying goodbye. VM crash or System.exit called?
Nov 14 08:08:38 [ERROR] Command was /bin/sh -c cd /__w/1/s/flink-runtime && 
/usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java 
{code}

[jira] [Created] (FLINK-30104) Snapshot deployment failed due to Could not transfer artifact org.apache.flink:flink-runtime-web:jar:1.15-20221119.010836-325 from/to apache.snapshots

2022-11-20 Thread Leonard Xu (Jira)
Leonard Xu created FLINK-30104:
--

 Summary: Snapshot deployment failed due to Could not transfer 
artifact org.apache.flink:flink-runtime-web:jar:1.15-20221119.010836-325 
from/to apache.snapshots
 Key: FLINK-30104
 URL: https://issues.apache.org/jira/browse/FLINK-30104
 Project: Flink
  Issue Type: Bug
  Components: Build System / CI
Affects Versions: 1.15.2
Reporter: Leonard Xu



{noformat}
[ERROR] Failed to execute goal 
org.apache.maven.plugins:maven-deploy-plugin:2.8.2:deploy (default-deploy) on 
project flink-runtime-web: Failed to deploy artifacts: Could not transfer 
artifact org.apache.flink:flink-runtime-web:jar:1.15-20221119.010836-325 
from/to apache.snapshots.https 
(https://repository.apache.org/content/repositories/snapshots): Failed to 
transfer file: 
https://repository.apache.org/content/repositories/snapshots/org/apache/flink/flink-runtime-web/1.15-SNAPSHOT/flink-runtime-web-1.15-20221119.010836-325.jar.
 Return code is: 502, ReasonPhrase: Proxy Error. -> [Help 1]
{noformat}

https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=43308=logs=eca6b3a6-1600-56cc-916a-c549b3cde3ff=e9844b5e-5aa3-546b-6c3e-5395c7c0cac7






[jira] [Created] (FLINK-30103) Test InputFormatCacheLoaderTest.checkCounter failed due to unexpected value on azure

2022-11-20 Thread Leonard Xu (Jira)
Leonard Xu created FLINK-30103:
--

 Summary: Test InputFormatCacheLoaderTest.checkCounter failed due 
to unexpected value on azure
 Key: FLINK-30103
 URL: https://issues.apache.org/jira/browse/FLINK-30103
 Project: Flink
  Issue Type: Bug
  Components: Table SQL / Runtime
Affects Versions: 1.16.0
Reporter: Leonard Xu



{noformat}
Nov 20 02:43:43 [ERROR] Failures: 
Nov 20 02:43:43 [ERROR]   InputFormatCacheLoaderTest.checkCounter:74 
Nov 20 02:43:43 Expecting AtomicInteger(0) to have value:
Nov 20 02:43:43   0
Nov 20 02:43:43 but did not
{noformat}

https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=43319=logs=de826397-1924-5900-0034-51895f69d4b7=f311e913-93a2-5a37-acab-4a63e1328f94






[jira] [Created] (FLINK-30102) Unstable test EventTimeWindowCheckpointingITCase.testPreAggregatedSlidingTimeWindow failed runs on azure

2022-11-20 Thread Leonard Xu (Jira)
Leonard Xu created FLINK-30102:
--

 Summary: Unstable test 
EventTimeWindowCheckpointingITCase.testPreAggregatedSlidingTimeWindow failed 
runs on azure
 Key: FLINK-30102
 URL: https://issues.apache.org/jira/browse/FLINK-30102
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Checkpointing, Tests
Affects Versions: 1.16.0
Reporter: Leonard Xu



 
{noformat}
 Nov 20 06:26:32 [ERROR] Failures: 
Nov 20 06:26:32 [ERROR] 
EventTimeWindowCheckpointingITCase.testPreAggregatedSlidingTimeWindow
Nov 20 06:26:32 [INFO]   Run 1: PASS
Nov 20 06:26:32 [ERROR]   Run 2: Job execution failed.
Nov 20 06:26:32 [INFO]   Run 3: PASS
Nov 20 06:26:32 [INFO]   Run 4: PASS
Nov 20 06:26:32 [INFO]   Run 5: PASS
Nov 20 06:26:32 [INFO]   Run 6: PASS
Nov 20 06:26:32 [INFO]   Run 7: PASS
Nov 20 06:26:32 [INFO]   Run 8: PASS
Nov 20 06:26:32 [INFO]   Run 9: PASS
Nov 20 06:26:32 [INFO]   Run 10: PASS
{noformat}



https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=43319=logs=4d4a0d10-fca2-5507-8eed-c07f0bdf4887=7b25afdf-cc6c-566f-5459-359dc2585798=10358





[jira] [Created] (FLINK-30101) YARN client should

2022-11-20 Thread Zhanghao Chen (Jira)
Zhanghao Chen created FLINK-30101:
-

 Summary: YARN client should 
 Key: FLINK-30101
 URL: https://issues.apache.org/jira/browse/FLINK-30101
 Project: Flink
  Issue Type: Improvement
  Components: Client / Job Submission
Affects Versions: 1.16.0
Reporter: Zhanghao Chen
 Fix For: 1.17.0


*Problem*

Currently, the procedure for retrieving a Flink on YARN cluster client is as 
follows (in the YarnClusterDescriptor#retrieve method):
 # Get the application report from YARN
 # Set rest.address & rest.port using the info from the application report
 # Create a new RestClusterClient using the updated configuration, which will 
use the client HA service to fetch rest.address & rest.port if HA is enabled

Here, we can see that the usage of client HA in step 3 is redundant, as we have 
already obtained rest.address & rest.port from the YARN application report. When 
ZK HA is enabled, it takes ~1.5 s to initialize the client HA services and fetch 
the rest IP & port.

1.5 s can mean a lot for latency-sensitive client operations. In my company, we 
use the Flink client to submit short-running session jobs, and e2e latency is 
critical. Job submission takes around 10 s on average, so saving 1.5 s would cut 
submission time by 15%.

*Proposal*

When retrieving a Flink on YARN cluster client, use StandaloneClientHAServices 
to create the RestClusterClient instead, as we have already fetched rest.address 
& rest.port from the YARN application report. This is also what we do in 
KubernetesClusterDescriptor.
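The steps above can be sketched with a toy model (hypothetical names only; the ClientHaServices interface below merely stands in for Flink's StandaloneClientHAServices and the ZooKeeper-based client HA services, and is not the actual API):

```java
// Toy model of the proposed retrieve() flow: when the REST endpoint is
// already known from the YARN application report, build the client from
// that static endpoint instead of resolving it again through ZooKeeper.
public class RetrieveSketch {

    interface ClientHaServices {
        String restEndpoint();
    }

    // Stands in for the ZooKeeper lookup that costs ~1.5 s today.
    static ClientHaServices zookeeperClientHaServices() {
        return () -> "endpoint-resolved-via-zk:8081";
    }

    // Stands in for StandaloneClientHAServices: returns the known address.
    static ClientHaServices standaloneClientHaServices(String endpoint) {
        return () -> endpoint;
    }

    // Proposed behavior: reuse rest.address & rest.port from the report,
    // skipping zookeeperClientHaServices() entirely.
    static String retrieve(String endpointFromYarnReport) {
        ClientHaServices ha = standaloneClientHaServices(endpointFromYarnReport);
        return ha.restEndpoint();
    }

    public static void main(String[] args) {
        System.out.println(retrieve("yarn-reported-host:8081"));
    }
}
```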





Re: [DISCUSS] FLIP-271: Autoscaling

2022-11-20 Thread Chen Qin
On Sun, Nov 20, 2022 at 7:25 AM Gyula Fóra  wrote:

> Hi Chen!
>
> I think in the long term it makes sense to provide some pluggable
> mechanisms but it's not completely trivial where exactly you would plug in
> your custom logic at this point.
>
Sounds good. More specifically, it would be great if it could accept input 
features (including previous scaling decisions) and output scaling decisions.
Folks could keep their own secret sauce without patching an OSS fork.

>
> In any case the problems you mentioned should be solved robustly by the
> algorithm itself without any customization:
>  - We need to be able to detect ineffective scaling decisions, let's say we
> scaled up (expecting better throughput with a higher parallelism) but we
> did not get a better processing capacity (this would be the external
> service bottleneck)
>
Sounds good, so we would at least try restarting the job once (optimistic path) 
as a design choice.

>  - We are evaluating metrics in windows, and we have some flexible
> boundaries to avoid scaling on minor load spikes
>
Yes, it would be great if users could feed in throughput changes over different
time buckets (last 10 s, 30 s, 1 min, 5 min) as input features.

>
> Regards,
> Gyula
>
> On Sun, Nov 20, 2022 at 12:28 AM Chen Qin  wrote:
>
> > Hi Gyula,
> >
> > Do we think the scaler could be a plugin or hard coded ?
> > We observed some cases the scaler can't address (e.g. async I/O dependency
> > service degradation, or a small spike that isn't worth restarting the job)
> >
> > Thanks,
> > Chen
> >
> > On Fri, Nov 18, 2022 at 1:03 AM Gyula Fóra  wrote:
> >
> > > Hi Dong!
> > >
> > > Could you please confirm that your main concerns have been addressed?
> > >
> > > Some other minor details that might not have been fully clarified:
> > >  - The prototype has been validated on some production workloads yes
> > >  - We are only planning to use metrics that are generally available and
> > are
> > > previously accepted to be standardized connector metrics (not Kafka
> > > specific). This is actually specified in the FLIP
> > >  - Even if some metrics (such as pendingRecords) are not accessible the
> > > scaling algorithm works and can be used. For source scaling based on
> > > utilization alone we still need some trivial modifications on the
> > > implementation side.
> > >
> > > Cheers,
> > > Gyula
> > >
> > > On Thu, Nov 17, 2022 at 5:22 PM Gyula Fóra 
> wrote:
> > >
> > > > Hi Dong!
> > > >
> > > > This is not an experimental feature proposal. The implementation of
> the
> > > > prototype is still in an experimental phase but by the time the FLIP,
> > > > initial prototype and review is done, this should be in a good stable
> > > first
> > > > version.
> > > > This proposal is pretty general as autoscalers/tuners get as far as I
> > > > understand and there is no history of any alternative effort that
> even
> > > > comes close to the applicability of this solution.
> > > >
> > > > Any large features that were added to Flink in the past have gone
> > through
> > > > several iterations over the years and the APIs have evolved as they
> > > matured.
> > > > Something like the autoscaler can only be successful if there is
> enough
> > > > user exposure and feedback to make it good, putting it in an external
> > > repo
> > > > will not get us anywhere.
> > > >
> > > > We have a prototype implementation ready that works well and it is
> more
> > > or
> > > > less feature complete. We proposed this FLIP based on something that
> we
> > > see
> > > > as a working solution, please do not underestimate the effort that
> went
> > > > into this proposal and the validation of the ideas. So in this sense
> > our
> > > > approach here is the same as with the Table Store and Kubernetes
> > Operator
> > > > and other big components of the past. On the other hand it's
> impossible
> > > to
> > > > sufficiently explain all the technical depth/implementation details
> of
> > > such
> > > > complex components in FLIPs to 100%, I feel we have a good overview
> of
> > > the
> > > > algorithm in the FLIP and the implementation should cover all
> remaining
> > > > questions. We will have an extended code review phase following the
> > FLIP
> > > > vote before this makes it into the project.
> > > >
> > > > I understand your concern regarding the stability of Flink Kubernetes
> > > > Operator config and metric names. We have decided to not provide
> > > guarantees
> > > > there yet but if you feel that it's time for the operator to support
> > such
> > > > guarantees please open a separate discussion on that topic, I don't
> > want
> > > to
> > > > mix the two problems here.
> > > >
> > > > Regards,
> > > > Gyula
> > > >
> > > > On Thu, Nov 17, 2022 at 5:07 PM Dong Lin 
> wrote:
> > > >
> > > >> Hi Gyula,
> > > >>
> > > >> If I understand correctly, this autopilot proposal is an
> experimental
> > > >> feature and its configs/metrics are not mature enough to provide
> > > backward
> > > >> compatibility yet. And the proposal provides high-level ideas 

Re: [DISCUSS] OLM Bundles for Flink Kubernetes Operator

2022-11-20 Thread Gyula Fóra
Hi Ted!

Thank you for the proposal. I think it would be great to have an official
OLM bundle similar to the Helm chart especially if we can generate the
bundle during the release process.

Do you think it would make sense to generate the bundle as part of the
release artifacts so that it could be tested by the community?
Also should we include the bundle with the official release artifacts? (
https://dist.apache.org/repos/dist/release/flink/flink-kubernetes-operator-1.2.0/
)

Thanks,
Gyula




On Sun, Nov 20, 2022 at 5:18 PM Hao t Chang  wrote:

> Hi Mark,
> I got 404 too. Try this if you were installing OLM:
> curl -sL
> https://github.com/operator-framework/operator-lifecycle-manager/releases/download/v0.22.0/install.sh
> | bash -s v0.22.0
>


Re: [DISCUSS] OLM Bundles for Flink Kubernetes Operator

2022-11-20 Thread Hao t Chang
Hi Mark,
I got 404 too. Try this if you were installing OLM:
curl -sL 
https://github.com/operator-framework/operator-lifecycle-manager/releases/download/v0.22.0/install.sh
 | bash -s v0.22.0


Re: [DISCUSS] FLIP-271: Autoscaling

2022-11-20 Thread Gyula Fóra
Hi Chen!

I think in the long term it makes sense to provide some pluggable
mechanisms but it's not completely trivial where exactly you would plug in
your custom logic at this point.

In any case the problems you mentioned should be solved robustly by the
algorithm itself without any customization:
 - We need to be able to detect ineffective scaling decisions, let's say we
scaled up (expecting better throughput with a higher parallelism) but we
did not get a better processing capacity (this would be the external
service bottleneck)
 - We are evaluating metrics in windows, and we have some flexible
boundaries to avoid scaling on minor load spikes
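As an illustration of the second point, windowed evaluation with flexible boundaries can be sketched as follows (a toy example with assumed window size and thresholds, not the FLIP-271 implementation):

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Toy sketch of windowed scaling decisions: utilization is averaged over a
// sliding window and a decision is made only when the average leaves a dead
// band, so a single short spike is absorbed instead of triggering a rescale.
public class WindowedDecision {

    private final Deque<Double> window = new ArrayDeque<>();
    private final int windowSize;
    private final double lowerBound;
    private final double upperBound;

    public WindowedDecision(int windowSize, double lowerBound, double upperBound) {
        this.windowSize = windowSize;
        this.lowerBound = lowerBound;
        this.upperBound = upperBound;
    }

    public String observe(double utilization) {
        window.addLast(utilization);
        if (window.size() > windowSize) {
            window.removeFirst();
        }
        if (window.size() < windowSize) {
            return "COLLECTING";
        }
        double avg =
                window.stream().mapToDouble(Double::doubleValue).average().orElse(0.0);
        if (avg > upperBound) {
            return "SCALE_UP";
        }
        if (avg < lowerBound) {
            return "SCALE_DOWN";
        }
        return "NO_CHANGE";
    }

    public static void main(String[] args) {
        WindowedDecision d = new WindowedDecision(3, 0.3, 0.8);
        d.observe(0.5);
        d.observe(0.5);
        System.out.println(d.observe(0.9)); // spike absorbed: NO_CHANGE
        d.observe(0.9);
        System.out.println(d.observe(0.9)); // sustained load: SCALE_UP
    }
}
```

Detecting ineffective scalings (the first point) would additionally compare observed throughput before and after a scale-up; that is left out here for brevity.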

Regards,
Gyula

On Sun, Nov 20, 2022 at 12:28 AM Chen Qin  wrote:

> Hi Gyula,
>
> Do we think the scaler could be a plugin or hard coded ?
> We observed some cases the scaler can't address (e.g. async I/O dependency
> service degradation, or a small spike that isn't worth restarting the job)
>
> Thanks,
> Chen
>
> On Fri, Nov 18, 2022 at 1:03 AM Gyula Fóra  wrote:
>
> > Hi Dong!
> >
> > Could you please confirm that your main concerns have been addressed?
> >
> > Some other minor details that might not have been fully clarified:
> >  - The prototype has been validated on some production workloads yes
> >  - We are only planning to use metrics that are generally available and
> are
> > previously accepted to be standardized connector metrics (not Kafka
> > specific). This is actually specified in the FLIP
> >  - Even if some metrics (such as pendingRecords) are not accessible the
> > scaling algorithm works and can be used. For source scaling based on
> > utilization alone we still need some trivial modifications on the
> > implementation side.
> >
> > Cheers,
> > Gyula
> >
> > On Thu, Nov 17, 2022 at 5:22 PM Gyula Fóra  wrote:
> >
> > > Hi Dong!
> > >
> > > This is not an experimental feature proposal. The implementation of the
> > > prototype is still in an experimental phase but by the time the FLIP,
> > > initial prototype and review is done, this should be in a good stable
> > first
> > > version.
> > > This proposal is pretty general as autoscalers/tuners get as far as I
> > > understand and there is no history of any alternative effort that even
> > > comes close to the applicability of this solution.
> > >
> > > Any large features that were added to Flink in the past have gone
> through
> > > several iterations over the years and the APIs have evolved as they
> > matured.
> > > Something like the autoscaler can only be successful if there is enough
> > > user exposure and feedback to make it good, putting it in an external
> > repo
> > > will not get us anywhere.
> > >
> > > We have a prototype implementation ready that works well and it is more
> > or
> > > less feature complete. We proposed this FLIP based on something that we
> > see
> > > as a working solution, please do not underestimate the effort that went
> > > into this proposal and the validation of the ideas. So in this sense
> our
> > > approach here is the same as with the Table Store and Kubernetes
> Operator
> > > and other big components of the past. On the other hand it's impossible
> > to
> > > sufficiently explain all the technical depth/implementation details of
> > such
> > > complex components in FLIPs to 100%, I feel we have a good overview of
> > the
> > > algorithm in the FLIP and the implementation should cover all remaining
> > > questions. We will have an extended code review phase following the
> FLIP
> > > vote before this makes it into the project.
> > >
> > > I understand your concern regarding the stability of Flink Kubernetes
> > > Operator config and metric names. We have decided to not provide
> > guarantees
> > > there yet but if you feel that it's time for the operator to support
> such
> > > guarantees please open a separate discussion on that topic, I don't
> want
> > to
> > > mix the two problems here.
> > >
> > > Regards,
> > > Gyula
> > >
> > > On Thu, Nov 17, 2022 at 5:07 PM Dong Lin  wrote:
> > >
> > >> Hi Gyula,
> > >>
> > >> If I understand correctly, this autopilot proposal is an experimental
> > >> feature and its configs/metrics are not mature enough to provide
> > backward
> > >> compatibility yet. And the proposal provides high-level ideas of the
> > >> algorithm but it is probably too complicated to explain it end-to-end.
> > >>
> > >> On the one hand, I do agree that having an auto-tuning prototype, even
> > if
> > >> not mature, is better than nothing for Flink users. On the other
> hand, I
> > >> am
> > >> concerned that this FLIP seems a bit too experimental, and starting
> with
> > >> an
> > >> immature design might make it harder for us to reach a
> production-ready
> > >> and
> > >> generally applicable auto-tuner in the future. And introducing too
> > >> backward
> > >> incompatible changes generally hurts users' trust in the Flink
> project.
> > >>
> > >> One alternative might be to develop and experiment with this feature
> in
> > a
> > >> non-Flink 

[jira] [Created] (FLINK-30100) Remove the unused CheckpointFailureReason

2022-11-20 Thread Rui Fan (Jira)
Rui Fan created FLINK-30100:
---

 Summary: Remove the unused CheckpointFailureReason
 Key: FLINK-30100
 URL: https://issues.apache.org/jira/browse/FLINK-30100
 Project: Flink
  Issue Type: Improvement
  Components: Runtime / Checkpointing
Affects Versions: 1.16.0
Reporter: Rui Fan
 Fix For: 1.17.0


Some CheckpointFailureReasons have not been used for a long time and should be 
removed, e.g.:

TOO_MANY_CONCURRENT_CHECKPOINTS

CHECKPOINT_DECLINED_TASK_NOT_CHECKPOINTING

CHECKPOINT_DECLINED_ALIGNMENT_LIMIT_EXCEEDED

JOB_FAILURE





Re: [DISCUSS] OLM Bundles for Flink Kubernetes Operator

2022-11-20 Thread Mark Lee
Hi Chang,
   The OLM download URL in install.sh returns 404 (url:
https://github.com/operator-framework/operator-lifecycle-manager/releases/tag/v0.22.0).
default_base_url=https://github.com/operator-framework/operator-lifecycle-manager/releases/download

-----Original Message-----
From: dev-return-65502-lifuqiong00=126@flink.apache.org
 on behalf of Hao t Chang
Sent: November 20, 2022 3:40
To: dev@flink.apache.org
Subject: [DISCUSS] OLM Bundles for Flink Kubernetes Operator

Hi everyone,



A while ago, we created the OLM bundle for Flink Kubernetes Operator on
OperatorHub [1]; however, the bundle is not endorsed by the Apache Flink
community. I would like to ask your opinion on how we can make the bundle more
official, or even part of the Flink Kubernetes Operator release. Please look at
the document [2], where I have addressed some concerns from the community. I
will be happy to answer any questions. Thank you.



For people who are not familiar with OLM (Operator Lifecycle Manager), it
manages the lifecycle of Operators on Kubernetes clusters and comes by default
on the OpenShift Container Platform. The OperatorHub is the online catalog for
users to search for and install Operators with OLM.



[1] https://operatorhub.io/operator/flink-kubernetes-operator
[2]
https://docs.google.com/document/d/1BiLzfPYb3Thzk01H4teB8_Crqs5FydQYS2Utj11HlLs/edit?usp=sharing

--
Best,
Ted Chang | Software Engineer | htch...@us.ibm.com