[jira] [Created] (FLINK-35509) Slack community invite link has expired
Ufuk Celebi created FLINK-35509: --- Summary: Slack community invite link has expired Key: FLINK-35509 URL: https://issues.apache.org/jira/browse/FLINK-35509 Project: Flink Issue Type: Bug Components: Project Website Reporter: Ufuk Celebi The Slack invite link on the website has expired. I've generated a new invite link without expiration here: [https://join.slack.com/t/apache-flink/shared_invite/zt-2k0fdioxx-D0kTYYLh3pPjMu5IItqx3Q] -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (FLINK-35481) Add HISTOGRAM function in SQL & Table API
Ufuk Celebi created FLINK-35481: --- Summary: Add HISTOGRAM function in SQL & Table API Key: FLINK-35481 URL: https://issues.apache.org/jira/browse/FLINK-35481 Project: Flink Issue Type: Improvement Components: Table SQL / API Reporter: Ufuk Celebi Fix For: 1.20.0 Consider adding a HISTOGRAM aggregate function similar to ksqlDB (https://docs.ksqldb.io/en/latest/developer-guide/ksqldb-reference/aggregate-functions/#histogram). -- This message was sent by Atlassian Jira (v8.20.10#820010)
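To make the proposal concrete, here is a minimal plain-Java sketch of the HISTOGRAM semantics described in the ksqlDB reference (a map from each distinct value to its occurrence count). This is an illustration of the intended behavior, not Flink or ksqlDB API:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of HISTOGRAM(col) semantics as described in the ksqlDB reference:
// count the occurrences of each distinct input value.
public class HistogramSketch {
    public static Map<String, Long> histogram(Iterable<String> values) {
        Map<String, Long> counts = new LinkedHashMap<>();
        for (String v : values) {
            counts.merge(v, 1L, Long::sum); // increment the count for this value
        }
        return counts;
    }
}
```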
[jira] [Created] (FLINK-35480) Add FIELD function in SQL & Table API
Ufuk Celebi created FLINK-35480: --- Summary: Add FIELD function in SQL & Table API Key: FLINK-35480 URL: https://issues.apache.org/jira/browse/FLINK-35480 Project: Flink Issue Type: Improvement Components: Table SQL / API Reporter: Ufuk Celebi Fix For: 1.20.0 Add support for the {{FIELD}} function to return the position of {{str}} in {{args}}, or 0 if not found. *References* * ksqlDB: [https://docs.ksqldb.io/en/latest/developer-guide/ksqldb-reference/scalar-functions/#field] * MySQL: https://dev.mysql.com/doc/refman/8.0/en/string-functions.html#function_elt -- This message was sent by Atlassian Jira (v8.20.10#820010)
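A minimal plain-Java sketch of the proposed FIELD semantics (1-based position of the search string among the arguments, 0 if absent or NULL, following the MySQL reference); the class and method names are illustrative, not Flink API:

```java
// Sketch of FIELD(str, args...) semantics as in MySQL/ksqlDB:
// return the 1-based position of str among args, or 0 if not found.
public class FieldSketch {
    public static int field(String str, String... args) {
        if (str == null) {
            return 0; // MySQL returns 0 when the search string is NULL
        }
        for (int i = 0; i < args.length; i++) {
            if (str.equals(args[i])) {
                return i + 1; // positions are 1-based
            }
        }
        return 0; // not found
    }
}
```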
[jira] [Created] (FLINK-35239) 1.19 docs show outdated warning
Ufuk Celebi created FLINK-35239: --- Summary: 1.19 docs show outdated warning Key: FLINK-35239 URL: https://issues.apache.org/jira/browse/FLINK-35239 Project: Flink Issue Type: Bug Components: Documentation Affects Versions: 1.19.0 Reporter: Ufuk Celebi Assignee: Ufuk Celebi Fix For: 1.19.0 Attachments: Screenshot 2024-04-25 at 15.01.57.png The docs for 1.19 are currently marked as outdated although it is the current stable release. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (FLINK-35038) Bump test dependency org.yaml:snakeyaml to 2.2
Ufuk Celebi created FLINK-35038: --- Summary: Bump test dependency org.yaml:snakeyaml to 2.2 Key: FLINK-35038 URL: https://issues.apache.org/jira/browse/FLINK-35038 Project: Flink Issue Type: Technical Debt Components: Connectors / Kafka Affects Versions: 3.1.0 Reporter: Ufuk Celebi Assignee: Ufuk Celebi Fix For: 3.1.0 Usage of SnakeYAML via {{flink-shaded}} was replaced by an explicit test scope dependency on {{org.yaml:snakeyaml:1.31}} with FLINK-34193. This outdated version of SnakeYAML triggers security warnings. These should not be an actual issue given the test scope, but we should consider bumping the version for security hygiene purposes. -- This message was sent by Atlassian Jira (v8.20.10#820010)
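For illustration, the bumped test-scope dependency would look roughly like this in the connector's pom.xml (coordinates are from the ticket; the exact placement in the pom is an assumption):

```xml
<!-- Sketch: explicit test-scope SnakeYAML dependency bumped from 1.31 to 2.2 -->
<dependency>
  <groupId>org.yaml</groupId>
  <artifactId>snakeyaml</artifactId>
  <version>2.2</version>
  <scope>test</scope>
</dependency>
```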
[jira] [Created] (FLINK-21928) DuplicateJobSubmissionException after JobManager failover
Ufuk Celebi created FLINK-21928: --- Summary: DuplicateJobSubmissionException after JobManager failover Key: FLINK-21928 URL: https://issues.apache.org/jira/browse/FLINK-21928 Project: Flink Issue Type: Bug Components: Runtime / Coordination Affects Versions: 1.12.2, 1.11.3, 1.10.3 Environment: StandaloneApplicationClusterEntryPoint using a fixed job ID, High Availability enabled Reporter: Ufuk Celebi Consider the following scenario:
* Environment: StandaloneApplicationClusterEntryPoint using a fixed job ID, high availability enabled
* Flink job reaches a globally terminal state
* Flink job is marked as finished in the high-availability service's RunningJobsRegistry
* The JobManager fails over
On recovery, the [Dispatcher throws DuplicateJobSubmissionException, because the job is marked as done in the RunningJobsRegistry|https://github.com/apache/flink/blob/release-1.12.2/flink-runtime/src/main/java/org/apache/flink/runtime/dispatcher/Dispatcher.java#L332-L340]. When this happens, users cannot get out of the situation without manually redeploying the JobManager process and changing the job ID^1^. The desired semantics are that we don't want to re-execute a job that has reached a globally terminal state. In this particular case, we know that the job has already reached such a state (as it has been marked in the registry). Therefore, we could handle this case by executing the regular termination sequence instead of throwing a DuplicateJobSubmissionException. --- ^1^ With ZooKeeper HA, the respective node is not ephemeral. In Kubernetes HA, there is no notion of ephemeral data that is tied to a session in the first place afaik. -- This message was sent by Atlassian Jira (v8.3.4#803005)
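A rough sketch of the proposed handling on recovery. The registry interface and method names here are hypothetical stand-ins, not Flink's actual Dispatcher/RunningJobsRegistry API:

```java
// Sketch: if the registry already marks the job as globally terminated,
// run the normal termination sequence instead of failing the submission
// with DuplicateJobSubmissionException. All names here are hypothetical.
public class DispatcherSketch {
    public interface JobRegistry {
        boolean isDone(String jobId);    // job reached a globally terminal state
        boolean isRunning(String jobId); // job is currently executing
    }

    public enum SubmissionOutcome { STARTED, ALREADY_DONE, DUPLICATE }

    public static SubmissionOutcome onSubmit(JobRegistry registry, String jobId) {
        if (registry.isDone(jobId)) {
            // Proposed change: acknowledge the terminal state and terminate
            // normally instead of throwing DuplicateJobSubmissionException.
            return SubmissionOutcome.ALREADY_DONE;
        }
        if (registry.isRunning(jobId)) {
            return SubmissionOutcome.DUPLICATE; // a genuine duplicate submission
        }
        return SubmissionOutcome.STARTED;
    }
}
```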
[jira] [Created] (FLINK-17500) Deploy JobGraph from file in StandaloneClusterEntrypoint
Ufuk Celebi created FLINK-17500: --- Summary: Deploy JobGraph from file in StandaloneClusterEntrypoint Key: FLINK-17500 URL: https://issues.apache.org/jira/browse/FLINK-17500 Project: Flink Issue Type: Wish Components: Deployment / Docker Reporter: Ufuk Celebi We have a requirement to deploy a pre-generated {{JobGraph}} from a file in {{StandaloneClusterEntrypoint}}. Currently, {{StandaloneClusterEntrypoint}} only supports deployment of a Flink job from the class path using {{ClassPathPackagedProgramRetriever}}. Our desired behaviour would be as follows: If {{internal.jobgraph-path}} is set, prepare a {{PackagedProgram}} from a local {{JobGraph}} file using {{FileJobGraphRetriever}}. Otherwise, deploy using {{ClassPathPackagedProgramRetriever}} (current behavior). --- I understand that this requirement is pretty niche, but wanted to get feedback whether the Flink community would be open to supporting this nonetheless. -- This message was sent by Atlassian Jira (v8.3.4#803005)
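The desired selection logic can be sketched as follows; the string return values stand in for the actual retriever instances ({{FileJobGraphRetriever}} vs. {{ClassPathPackagedProgramRetriever}}), so this is a shape-of-the-logic illustration, not Flink code:

```java
import java.util.Map;
import java.util.Optional;

// Sketch of the proposed retriever selection: if internal.jobgraph-path is
// configured, deploy the pre-generated JobGraph from that file; otherwise
// fall back to the current class-path-based behavior.
public class RetrieverSelectionSketch {
    public static String selectRetriever(Map<String, String> config) {
        return Optional.ofNullable(config.get("internal.jobgraph-path"))
                .map(path -> "file:" + path) // stand-in for FileJobGraphRetriever
                .orElse("classpath");        // stand-in for ClassPathPackagedProgramRetriever
    }
}
```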
[jira] [Created] (FLINK-15831) Add Docker image publication to release documentation
Ufuk Celebi created FLINK-15831: --- Summary: Add Docker image publication to release documentation Key: FLINK-15831 URL: https://issues.apache.org/jira/browse/FLINK-15831 Project: Flink Issue Type: Sub-task Components: Documentation Reporter: Ufuk Celebi The [release documentation in the project Wiki|https://cwiki.apache.org/confluence/display/FLINK/Creating+a+Flink+Release] describes the release process. We need to add a note to follow up with the Docker image publication process as part of the release checklist. The actual documentation should probably be self-contained in the apache/flink-docker repository, but we should definitely link to it. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-15830) Migrate docker-flink/docker-flink to apache/flink-docker
Ufuk Celebi created FLINK-15830: --- Summary: Migrate docker-flink/docker-flink to apache/flink-docker Key: FLINK-15830 URL: https://issues.apache.org/jira/browse/FLINK-15830 Project: Flink Issue Type: Sub-task Reporter: Ufuk Celebi Assignee: Patrick Lucas -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-15829) Request apache/flink-docker repository
Ufuk Celebi created FLINK-15829: --- Summary: Request apache/flink-docker repository Key: FLINK-15829 URL: https://issues.apache.org/jira/browse/FLINK-15829 Project: Flink Issue Type: Sub-task Reporter: Ufuk Celebi -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-15828) Integrate docker-flink/docker-flink into Flink release process
Ufuk Celebi created FLINK-15828: --- Summary: Integrate docker-flink/docker-flink into Flink release process Key: FLINK-15828 URL: https://issues.apache.org/jira/browse/FLINK-15828 Project: Flink Issue Type: Improvement Components: Deployment / Docker, Release System Reporter: Ufuk Celebi This ticket tracks the first phase of Flink Docker image build consolidation. The goal of this story is to integrate Docker image publication with the Flink release process and provide convenience packages of released Flink artifacts on DockerHub. For more details, check the [DISCUSS|http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-Integrate-Flink-Docker-image-publication-into-Flink-release-process-td36139.html] and [VOTE|http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/VOTE-Integrate-Flink-Docker-image-publication-into-Flink-release-process-td36982.html] threads on the mailing list. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-14145) getLatestCheckpoint returns wrong checkpoint
Ufuk Celebi created FLINK-14145: --- Summary: getLatestCheckpoint returns wrong checkpoint Key: FLINK-14145 URL: https://issues.apache.org/jira/browse/FLINK-14145 Project: Flink Issue Type: Bug Components: Runtime / Checkpointing Affects Versions: 1.9.0 Reporter: Ufuk Celebi The flag to prefer checkpoints for recovery introduced in FLINK-11159 is broken: when enabled, getLatestCheckpoint returns the wrong checkpoint as the latest one. The current implementation assumes that the latest checkpoint is a savepoint and skips over it. I attached a patch for {{StandaloneCompletedCheckpointStoreTest}} that demonstrates the issue. -- This message was sent by Atlassian Jira (v8.3.4#803005)
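The intended semantics of the preference flag can be sketched as follows. Assumed behavior: when the flag is set, skip savepoints and return the newest real checkpoint, falling back to the newest entry if only savepoints exist; the Checkpoint class is a stand-in for Flink's CompletedCheckpoint, not the actual API:

```java
import java.util.List;

// Sketch of the intended getLatestCheckpoint semantics with the FLINK-11159
// "prefer checkpoints for recovery" flag.
public class LatestCheckpointSketch {
    public static class Checkpoint {
        public final long id;
        public final boolean savepoint;
        public Checkpoint(long id, boolean savepoint) {
            this.id = id;
            this.savepoint = savepoint;
        }
    }

    // checkpoints are ordered oldest to newest, as in the completed checkpoint store
    public static Checkpoint getLatest(List<Checkpoint> checkpoints, boolean preferCheckpoints) {
        if (checkpoints.isEmpty()) {
            return null;
        }
        if (preferCheckpoints) {
            // walk backwards to find the newest non-savepoint checkpoint
            for (int i = checkpoints.size() - 1; i >= 0; i--) {
                if (!checkpoints.get(i).savepoint) {
                    return checkpoints.get(i);
                }
            }
        }
        return checkpoints.get(checkpoints.size() - 1); // newest entry overall
    }
}
```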
[jira] [Created] (FLINK-12813) Add Hadoop profile in building from source docs
Ufuk Celebi created FLINK-12813: --- Summary: Add Hadoop profile in building from source docs Key: FLINK-12813 URL: https://issues.apache.org/jira/browse/FLINK-12813 Project: Flink Issue Type: Bug Components: Documentation Affects Versions: 1.8.0 Reporter: Ufuk Celebi The docs for [building from source with Hadoop|https://ci.apache.org/projects/flink/flink-docs-release-1.8/flinkDev/building.html#hadoop-versions] omit the {{-Pinclude-hadoop}} profile in two code snippets. The two code snippets that have {{-Dhadoop.version}} set, but no {{-Pinclude-hadoop}} need to be updated to include {{-Pinclude-hadoop}} as well. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-12313) SynchronousCheckpointITCase.taskCachedThreadPoolAllowsForSynchronousCheckpoints is unstable
Ufuk Celebi created FLINK-12313: --- Summary: SynchronousCheckpointITCase.taskCachedThreadPoolAllowsForSynchronousCheckpoints is unstable Key: FLINK-12313 URL: https://issues.apache.org/jira/browse/FLINK-12313 Project: Flink Issue Type: Test Components: Runtime / Checkpointing Reporter: Ufuk Celebi {{SynchronousCheckpointITCase.taskCachedThreadPoolAllowsForSynchronousCheckpoints}} occasionally fails on Travis; the watchdog prints the thread stack traces because the build produces no output. {code} == Printing stack trace of Java process 10071 == 2019-04-24 07:55:29 Full thread dump Java HotSpot(TM) 64-Bit Server VM (25.151-b12 mixed mode):
"Attach Listener" #17 daemon prio=9 os_prio=0 tid=0x7f294892 nid=0x2cf5 waiting on condition [0x] java.lang.Thread.State: RUNNABLE
"Async calls on Test Task (1/1)" #15 daemon prio=5 os_prio=0 tid=0x7f2948dd1800 nid=0x27a9 waiting on condition [0x7f292cea9000] java.lang.Thread.State: WAITING (parking) at sun.misc.Unsafe.park(Native Method) - parking to wait for <0x8bb5e558> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject) at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039) at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442) at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748)
"Async calls on Test Task (1/1)" #14 daemon prio=5 os_prio=0 tid=0x7f2948dce800 nid=0x27a8 in Object.wait() [0x7f292cfaa000] java.lang.Thread.State: WAITING (on object monitor) at java.lang.Object.wait(Native Method) - waiting on <0x8bac58f8> (a java.lang.Object) at java.lang.Object.wait(Object.java:502) at
org.apache.flink.streaming.runtime.tasks.SynchronousSavepointLatch.blockUntilCheckpointIsAcknowledged(SynchronousSavepointLatch.java:66) - locked <0x8bac58f8> (a java.lang.Object) at org.apache.flink.streaming.runtime.tasks.StreamTask.performCheckpoint(StreamTask.java:726) at org.apache.flink.streaming.runtime.tasks.StreamTask.triggerCheckpoint(StreamTask.java:604) at org.apache.flink.streaming.runtime.tasks.SynchronousCheckpointITCase$SynchronousCheckpointTestingTask.triggerCheckpoint(SynchronousCheckpointITCase.java:174) at org.apache.flink.runtime.taskmanager.Task$1.run(Task.java:1182) at java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1626) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) "CloseableReaperThread" #13 daemon prio=5 os_prio=0 tid=0x7f2948d9b800 nid=0x27a7 in Object.wait() [0x7f292d0ab000] java.lang.Thread.State: WAITING (on object monitor) at java.lang.Object.wait(Native Method) - waiting on <0x8bbe3990> (a java.lang.ref.ReferenceQueue$Lock) at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:143) - locked <0x8bbe3990> (a java.lang.ref.ReferenceQueue$Lock) at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:164) at org.apache.flink.core.fs.SafetyNetCloseableRegistry$CloseableReaperThread.run(SafetyNetCloseableRegistry.java:193) "Test Task (1/1)" #12 prio=5 os_prio=0 tid=0x7f2948d97000 nid=0x27a6 in Object.wait() [0x7f292d1ac000] java.lang.Thread.State: WAITING (on object monitor) at java.lang.Object.wait(Native Method) - waiting on <0x8e63f7d8> (a java.lang.Object) at java.lang.Object.wait(Object.java:502) at org.apache.flink.core.testutils.OneShotLatch.await(OneShotLatch.java:63) - locked <0x8e63f7d8> (a java.lang.Object) at 
org.apache.flink.streaming.runtime.tasks.SynchronousCheckpointITCase$SynchronousCheckpointTestingTask.run(SynchronousCheckpointITCase.java:161) at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:335) at org.apache.flink.runtime.taskmanager.Task.run(Task.java:724) at java.lang.Thread.run(Thread.java:748) "process reaper" #11 daemon prio=10 os_prio=0 tid=0x7f294885e000 nid=0x2793 waiting on condition [0x7f292d7e5000]
[jira] [Created] (FLINK-12060) Unify change pom version scripts
Ufuk Celebi created FLINK-12060: --- Summary: Unify change pom version scripts Key: FLINK-12060 URL: https://issues.apache.org/jira/browse/FLINK-12060 Project: Flink Issue Type: Bug Components: Deployment / Scripts Affects Versions: 1.8.0 Reporter: Ufuk Celebi We have three places in {{tools}} that we use to update the pom versions when releasing:
1. https://github.com/apache/flink/blob/048367b/tools/change-version.sh#L31
2. https://github.com/apache/flink/blob/048367b/tools/releasing/create_release_branch.sh#L60
3. https://github.com/apache/flink/blob/048367b/tools/releasing/update_branch_version.sh#L52
The 1st option is buggy (it does not work with the new versioning of the shaded Hadoop build, e.g. {{2.4.1-1.9-SNAPSHOT}} will not be replaced). The 2nd and 3rd work for pom files, but the 2nd one misses a change for the doc version that is present in the 3rd one. I think we should unify these and call them where needed instead of duplicating this code in unexpected ways. An initial quick fix could remove the 1st script and update the 2nd one to match the 3rd one. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-11784) KryoSerializerSnapshotTest occasionally fails on Travis
Ufuk Celebi created FLINK-11784: --- Summary: KryoSerializerSnapshotTest occasionally fails on Travis Key: FLINK-11784 URL: https://issues.apache.org/jira/browse/FLINK-11784 Project: Flink Issue Type: Bug Components: API / Type Serialization System Reporter: Ufuk Celebi {{KryoSerializerSnapshotTest}} fails occasionally with: {code:java} 11:37:44.198 [ERROR] tryingToRestoreWithNonExistingClassShouldBeIncompatible(org.apache.flink.api.java.typeutils.runtime.kryo.KryoSerializerSnapshotTest) Time elapsed: 0.011 s <<< ERROR! java.io.EOFException at org.apache.flink.api.java.typeutils.runtime.kryo.KryoSerializerSnapshotTest.kryoSnapshotWithMissingClass(KryoSerializerSnapshotTest.java:120) at org.apache.flink.api.java.typeutils.runtime.kryo.KryoSerializerSnapshotTest.tryingToRestoreWithNonExistingClassShouldBeIncompatible(KryoSerializerSnapshotTest.java:105){code} See [https://travis-ci.org/apache/flink/jobs/499371953] for full build output (as part of a PR with unrelated changes). -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-11752) Move flink-python to opt
Ufuk Celebi created FLINK-11752: --- Summary: Move flink-python to opt Key: FLINK-11752 URL: https://issues.apache.org/jira/browse/FLINK-11752 Project: Flink Issue Type: Improvement Components: Build System, Python API Affects Versions: 1.7.2 Reporter: Ufuk Celebi Assignee: Ufuk Celebi As discussed on the [dev mailing list|http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-Move-flink-python-to-opt-td27347.html], we should move flink-python to opt instead of having it in lib by default. The streaming counterpart (flink-streaming-python) is already only shipped in opt. I think we don't have many users of the Python batch API; this change will make the streaming/batch experience more consistent and result in a cleaner default classpath. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-11545) Add option to manually set job ID in StandaloneJobClusterEntryPoint
Ufuk Celebi created FLINK-11545: --- Summary: Add option to manually set job ID in StandaloneJobClusterEntryPoint Key: FLINK-11545 URL: https://issues.apache.org/jira/browse/FLINK-11545 Project: Flink Issue Type: Sub-task Components: Cluster Management, Kubernetes Affects Versions: 1.7.0 Reporter: Ufuk Celebi Assignee: Ufuk Celebi Add an option to specify the job ID during job submissions via the StandaloneJobClusterEntryPoint. Currently, the entry point fixes the job ID to all zeros. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-11546) Add option to manually set job ID in CLI
Ufuk Celebi created FLINK-11546: --- Summary: Add option to manually set job ID in CLI Key: FLINK-11546 URL: https://issues.apache.org/jira/browse/FLINK-11546 Project: Flink Issue Type: Sub-task Components: Client Affects Versions: 1.7.0 Reporter: Ufuk Celebi Add an option to specify the job ID during job submissions via the CLI. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-11544) Add option to manually set job ID for job submissions via REST API
Ufuk Celebi created FLINK-11544: --- Summary: Add option to manually set job ID for job submissions via REST API Key: FLINK-11544 URL: https://issues.apache.org/jira/browse/FLINK-11544 Project: Flink Issue Type: Sub-task Components: REST Affects Versions: 1.7.0 Reporter: Ufuk Celebi Assignee: Ufuk Celebi Add an option to specify the job ID during job submissions via the REST API. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-11534) Don't exit JVM after job termination with standalone job
Ufuk Celebi created FLINK-11534: --- Summary: Don't exit JVM after job termination with standalone job Key: FLINK-11534 URL: https://issues.apache.org/jira/browse/FLINK-11534 Project: Flink Issue Type: New Feature Components: Cluster Management Affects Versions: 1.7.0 Reporter: Ufuk Celebi If a job deployed in job cluster mode terminates, the JVM running the StandaloneJobClusterEntryPoint will exit via System.exit(1). When managing such a job, this requires access to external logging systems in order to get more details about failure causes or the final termination status. I believe that there is value in having a StandaloneJobClusterEntryPoint option that does not exit the JVM after the job has terminated. This allows users to gather further information if they are monitoring the job and manually tear down the process. If there is agreement to have this feature, I would provide the implementation. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-11533) Retrieve job class name from JAR manifest in ClassPathJobGraphRetriever
Ufuk Celebi created FLINK-11533: --- Summary: Retrieve job class name from JAR manifest in ClassPathJobGraphRetriever Key: FLINK-11533 URL: https://issues.apache.org/jira/browse/FLINK-11533 Project: Flink Issue Type: New Feature Components: Cluster Management Affects Versions: 1.7.0 Reporter: Ufuk Celebi Users running job clusters distribute their user code as part of the shared classpath of all cluster components. We currently require users running {{StandaloneClusterEntryPoint}} to manually specify the job class name. JAR manifest entries that specify the main class of a JAR are ignored since they are simply part of the classpath. I propose to add another optional command line argument to the {{StandaloneClusterEntryPoint}} that specifies the location of a JAR file (such as {{lib/usercode.jar}}) and whose Manifest is respected. Arguments: {code} --job-jar --job-classname name {code} Each argument is optional, but at least one of the two is required. The job-classname has precedence over job-jar. Implementation wise we should be able to simply create the PackagedProgram from the jar file path in ClassPathJobGraphRetriever. If there is agreement to have this feature, I would provide the implementation. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
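Reading the job class from the JAR manifest can be sketched with the standard {{java.util.jar}} API; the class and method names here are illustrative, and error handling is simplified:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.jar.JarFile;
import java.util.jar.Manifest;

// Sketch: resolve the job class name, preferring an explicit --job-classname
// argument and otherwise reading the Main-Class entry of the --job-jar manifest.
public class ManifestJobClassSketch {
    public static String jobClassName(String explicitClassName, String jobJarPath) {
        if (explicitClassName != null) {
            return explicitClassName; // --job-classname takes precedence over --job-jar
        }
        try (JarFile jar = new JarFile(jobJarPath)) {
            Manifest manifest = jar.getManifest();
            String mainClass = manifest == null
                    ? null
                    : manifest.getMainAttributes().getValue("Main-Class");
            if (mainClass == null) {
                throw new IllegalArgumentException("No Main-Class manifest entry in " + jobJarPath);
            }
            return mainClass;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```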
[jira] [Created] (FLINK-11525) Add option to manually set job ID
Ufuk Celebi created FLINK-11525: --- Summary: Add option to manually set job ID Key: FLINK-11525 URL: https://issues.apache.org/jira/browse/FLINK-11525 Project: Flink Issue Type: New Feature Components: Client, REST Affects Versions: 1.7.0 Reporter: Ufuk Celebi When submitting Flink jobs programmatically it is desirable to have the option to manually set the job ID in order to have idempotent job submissions. This simplifies failure handling on the user side as duplicate submissions will be rejected by Flink. In general allowing to manually set the job ID can be beneficial for third party tooling. The default behavior should not be altered. The following job submission entry points should be extended to allow to specify this option: 1. REST API 2. StandaloneJobClusterEntrypoint 3. CLI Note that for 2., FLINK-10921 already suggested to allow to configure the job ID manually. If there is agreement to have this feature, I'll go ahead and create sub tasks for the mentioned entry points (and provide the implementation for 1. and 2.). -- This message was sent by Atlassian JIRA (v7.6.3#76005)
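For illustration, a client could derive a fixed, idempotent job ID by hashing a stable job name into the 16-byte / 32-hex-character format that Flink job IDs use. This is a sketch of possible client-side usage, not a proposed Flink API:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Sketch: derive a deterministic job ID from a stable job name, so that
// resubmitting the same job yields the same ID (and duplicates get rejected).
public class FixedJobIdSketch {
    public static String jobIdFor(String jobName) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(jobName.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b)); // 16 bytes -> 32 hex chars, like a Flink JobID
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 unavailable", e); // present in every standard JRE
        }
    }
}
```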
[jira] [Created] (FLINK-11402) User code can fail with an UnsatisfiedLinkError in the presence of multiple classloaders
Ufuk Celebi created FLINK-11402: --- Summary: User code can fail with an UnsatisfiedLinkError in the presence of multiple classloaders Key: FLINK-11402 URL: https://issues.apache.org/jira/browse/FLINK-11402 Project: Flink Issue Type: Bug Components: Distributed Coordination Affects Versions: 1.7.0 Reporter: Ufuk Celebi Attachments: hello-snappy-1.0-SNAPSHOT.jar, hello-snappy.tgz As reported on the user mailing list thread "[`env.java.opts` not persisting after job canceled or failed and then restarted|https://lists.apache.org/thread.html/37cc1b628e16ca6c0bacced5e825de8057f88a8d601b90a355b6a291@%3Cuser.flink.apache.org%3E]", there can be issues with using native libraries and user code class loading.
h2. Steps to reproduce
I was able to reproduce the issue reported on the mailing list using [snappy-java|https://github.com/xerial/snappy-java] in a user program. Running the attached user program works fine on initial submission, but results in a failure when re-executed. I'm using Flink 1.7.0 with a standalone cluster started via {{bin/start-cluster.sh}}.
0. Unpack the attached Maven project and build using {{mvn clean package}} *or* directly use the attached {{hello-snappy-1.0-SNAPSHOT.jar}}
1. Download [snappy-java-1.1.7.2.jar|http://central.maven.org/maven2/org/xerial/snappy/snappy-java/1.1.7.2/snappy-java-1.1.7.2.jar] and unpack libsnappyjava for your system: {code} jar tf snappy-java-1.1.7.2.jar | grep libsnappy ... org/xerial/snappy/native/Linux/x86_64/libsnappyjava.so ... org/xerial/snappy/native/Mac/x86_64/libsnappyjava.jnilib ... {code}
2. Configure the system library path to {{libsnappyjava}} in {{flink-conf.yaml}} (path needs to be adjusted for your system): {code} env.java.opts: -Djava.library.path=/.../org/xerial/snappy/native/Mac/x86_64 {code}
3. Run the attached {{hello-snappy-1.0-SNAPSHOT.jar}}: {code} bin/flink run hello-snappy-1.0-SNAPSHOT.jar Starting execution of program Program execution finished Job with JobID ae815b918dd7bc64ac8959e4e224f2b4 has finished.
Job Runtime: 359 ms {code} 4. Rerun attached {{hello-snappy-1.0-SNAPSHOT.jar}} {code} bin/flink run hello-snappy-1.0-SNAPSHOT.jar Starting execution of program The program finished with the following exception: org.apache.flink.client.program.ProgramInvocationException: Job failed. (JobID: 7d69baca58f33180cb9251449ddcd396) at org.apache.flink.client.program.rest.RestClusterClient.submitJob(RestClusterClient.java:268) at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:487) at org.apache.flink.streaming.api.environment.StreamContextEnvironment.execute(StreamContextEnvironment.java:66) at com.github.uce.HelloSnappy.main(HelloSnappy.java:18) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:529) at org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:421) at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:427) at org.apache.flink.client.cli.CliFrontend.executeProgram(CliFrontend.java:813) at org.apache.flink.client.cli.CliFrontend.runProgram(CliFrontend.java:287) at org.apache.flink.client.cli.CliFrontend.run(CliFrontend.java:213) at org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:1050) at org.apache.flink.client.cli.CliFrontend.lambda$main$11(CliFrontend.java:1126) at org.apache.flink.runtime.security.NoOpSecurityContext.runSecured(NoOpSecurityContext.java:30) at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1126) Caused by: org.apache.flink.runtime.client.JobExecutionException: Job execution failed. 
at org.apache.flink.runtime.jobmaster.JobResult.toJobExecutionResult(JobResult.java:146) at org.apache.flink.client.program.rest.RestClusterClient.submitJob(RestClusterClient.java:265) ... 17 more Caused by: java.lang.UnsatisfiedLinkError: Native Library /.../org/xerial/snappy/native/Mac/x86_64/libsnappyjava.jnilib already loaded in another classloader at java.lang.ClassLoader.loadLibrary0(ClassLoader.java:1907) at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1861) at java.lang.Runtime.loadLibrary0(Runtime.java:870) at java.lang.System.loadLibrary(System.java:1122) at org.xerial.snappy.SnappyLoader.loadNativeLibrary(SnappyLoader.java:182) at org.xerial.snappy.SnappyLoader.loadSnappyApi(SnappyLoader.java:154) at org.xerial.snappy.Snappy.(Snappy.java:47) at
[jira] [Created] (FLINK-11127) Make metrics query service establish connection to JobManager
Ufuk Celebi created FLINK-11127: --- Summary: Make metrics query service establish connection to JobManager Key: FLINK-11127 URL: https://issues.apache.org/jira/browse/FLINK-11127 Project: Flink Issue Type: Improvement Components: Distributed Coordination, Kubernetes, Metrics Affects Versions: 1.7.0 Reporter: Ufuk Celebi As part of FLINK-10247, the internal metrics query service has been separated into its own actor system. Before this change, the JobManager (JM) queried TaskManager (TM) metrics via the TM actor. Now, the JM needs to establish a separate connection to the TM metrics query service actor. In the context of Kubernetes, this is problematic as the JM will typically *not* be able to resolve the TMs by name, resulting in warnings as follows: {code} 2018-12-11 08:32:33,962 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink-metrics@flink-task-manager-64b868487c-x9l4b:39183] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink-metrics@flink-task-manager-64b868487c-x9l4b:39183]] Caused by: [flink-task-manager-64b868487c-x9l4b: Name does not resolve] {code} In order to expose the TMs by name in Kubernetes, users require a service *for each* TM instance, which is not practical. This currently results in the web UI not being able to display some basic metrics, such as the number of sent records. You can reproduce this by following the READMEs in {{flink-container/kubernetes}}. This worked before, because the JM is typically exposed via a service with a known name and the TMs establish the connection to it, which the metrics query service piggybacked on. A potential solution might be to let the query service connect to the JM, similar to how the TMs register. I tagged this ticket as an improvement, but in the context of Kubernetes I would consider this to be a bug. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-10971) Dependency convergence issue when building flink-s3-fs-presto
Ufuk Celebi created FLINK-10971: --- Summary: Dependency convergence issue when building flink-s3-fs-presto Key: FLINK-10971 URL: https://issues.apache.org/jira/browse/FLINK-10971 Project: Flink Issue Type: Bug Reporter: Ufuk Celebi Trying to trigger a savepoint to S3 with a clean build of {{release-1.7.0-rc2}} results in a {{java.lang.NoSuchMethodError: org.apache.flink.fs.s3presto.shaded.com.google.common.base.Preconditions.checkArgument(ZLjava/lang/String;Ljava/lang/Object;)V}}. *Environment* - Tag: {{release-1.7.0-rc2}} - Build command: {{mvn clean package -DskipTests -Dcheckstyle.skip}} - Maven version: {code} mvn -version Apache Maven 3.5.4 (1edded0938998edf8bf061f1ceb3cfdeccf443fe; 2018-06-17T20:33:14+02:00) Maven home: /usr/local/Cellar/maven/3.5.4/libexec Java version: 1.8.0_192, vendor: Oracle Corporation, runtime: /Library/Java/JavaVirtualMachines/jdk1.8.0_192.jdk/Contents/Home/jre Default locale: en_US, platform encoding: UTF-8 OS name: "mac os x", version: "10.14.1", arch: "x86_64", family: "mac" {code} *Steps to reproduce* {code} cp opt/flink-s3-fs-presto-1.7.0.jar lib bin/start-cluster.sh bin/flink run examples/streaming/TopSpeedWindowing.jar bin/flink savepoint db37f69f21cbe54e9bf6b7e259a9c09e {code} *Stacktrace* {code} The program finished with the following exception: org.apache.flink.util.FlinkException: Triggering a savepoint for the job db37f69f21cbe54e9bf6b7e259a9c09e failed. 
at org.apache.flink.client.cli.CliFrontend.triggerSavepoint(CliFrontend.java:723) at org.apache.flink.client.cli.CliFrontend.lambda$savepoint$9(CliFrontend.java:701) at org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:985) at org.apache.flink.client.cli.CliFrontend.savepoint(CliFrontend.java:698) at org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:1065) at org.apache.flink.client.cli.CliFrontend.lambda$main$11(CliFrontend.java:1126) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1556) at org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41) at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1126) Caused by: java.util.concurrent.CompletionException: java.util.concurrent.CompletionException: java.lang.NoSuchMethodError: org.apache.flink.fs.s3presto.shaded.com.google.common.base.Preconditions.checkArgument(ZLjava/lang/String;Ljava/lang/Object;)V at org.apache.flink.runtime.jobmaster.JobMaster.lambda$triggerSavepoint$14(JobMaster.java:970) at java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:870) at java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:852) at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474) at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977) at org.apache.flink.runtime.checkpoint.PendingCheckpoint.finalizeCheckpoint(PendingCheckpoint.java:292) at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.completePendingCheckpoint(CheckpointCoordinator.java:829) at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.receiveAcknowledgeMessage(CheckpointCoordinator.java:756) at 
org.apache.flink.runtime.jobmaster.JobMaster.lambda$acknowledgeCheckpoint$8(JobMaster.java:680) at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:39) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:415) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) Caused by: java.util.concurrent.CompletionException: java.lang.NoSuchMethodError: org.apache.flink.fs.s3presto.shaded.com.google.common.base.Preconditions.checkArgument(ZLjava/lang/String;Ljava/lang/Object;)V at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292) at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308) at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:593) at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:577) ... 12 more Caused by: java.lang.NoSuchMethodError:
[jira] [Created] (FLINK-10948) Add option to write out termination message with application status
Ufuk Celebi created FLINK-10948: --- Summary: Add option to write out termination message with application status Key: FLINK-10948 URL: https://issues.apache.org/jira/browse/FLINK-10948 Project: Flink Issue Type: Improvement Components: Distributed Coordination Reporter: Ufuk Celebi Assignee: Ufuk Celebi I propose to add an option to write out a termination message to a file that indicates the terminal application status. With the change proposed in FLINK-10743, we can't use the exit code to differentiate between cancelled and succeeded applications. The motivating use case for both this ticket and FLINK-10743 is Flink job clusters ({{StandaloneJobClusterEntryPoint}}) on Kubernetes. The idea of the termination message comes from Kubernetes ([https://kubernetes.io/docs/tasks/debug-application-cluster/determine-reason-pod-failure/]). With this in place, a terminated Pod will report the final status as in: {code:java} state: terminated: exitCode: 0 finishedAt: 2018-11-20T11:00:59Z message: CANCELED # <--- termination message reason: Completed startedAt: 2018-11-20T10:59:18Z {code} The implementation could be done in {{ClusterEntrypoint#runClusterEntrypoint(ClusterEntrypoint)}}, which is used by all entry points to run Flink. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
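The proposed mechanism can be sketched in plain Java (the class and helper names below are hypothetical, not Flink API): before the entry point exits, the terminal application status is written to a file, which Kubernetes then surfaces via the Pod's {{terminationMessagePath}}.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/**
 * Minimal sketch of the proposed termination-message behavior.
 * Names are hypothetical, not Flink's actual API.
 */
public class TerminationMessageWriter {

    /** Writes the terminal status (e.g. "CANCELED") to the given file. */
    public static void write(Path messageFile, String applicationStatus) {
        try {
            // Kubernetes reads /dev/termination-log by default; the path would be configurable.
            Files.write(messageFile, applicationStatus.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Illustration helper: writes the status to a fresh temp file and returns its path. */
    public static Path writeToTempFile(String applicationStatus) {
        try {
            Path file = Files.createTempFile("termination-message", ".txt");
            write(file, applicationStatus);
            return file;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads the termination message back. */
    public static String read(Path messageFile) {
        try {
            return new String(Files.readAllBytes(messageFile), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```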
[jira] [Created] (FLINK-10751) Checkpoints should be retained when job reaches suspended state
Ufuk Celebi created FLINK-10751: --- Summary: Checkpoints should be retained when job reaches suspended state Key: FLINK-10751 URL: https://issues.apache.org/jira/browse/FLINK-10751 Project: Flink Issue Type: Bug Components: Distributed Coordination Affects Versions: 1.6.2 Reporter: Ufuk Celebi Assignee: Ufuk Celebi {{CheckpointProperties}} define in which terminal job status a checkpoint should be disposed. I've noticed that the properties for {{CHECKPOINT_NEVER_RETAINED}} and {{CHECKPOINT_RETAINED_ON_FAILURE}} prescribe checkpoint disposal in the (locally) terminal job status {{SUSPENDED}}. Since a job reaches the {{SUSPENDED}} state when its {{JobMaster}} loses leadership, this would result in the checkpoint being cleaned up and not being available for recovery by the new leader. Therefore, we should retain checkpoints when reaching job status {{SUSPENDED}}. *BUT:* Because we special-case this terminal state in the only highly available {{CompletedCheckpointStore}} implementation (see [ZooKeeperCompletedCheckpointStore|https://github.com/apache/flink/blob/e7ac3ba/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/ZooKeeperCompletedCheckpointStore.java#L315]) and don't use regular checkpoint disposal, this issue has not surfaced yet. I think we should proactively fix the properties to indicate that checkpoints are retained in the {{SUSPENDED}} state. We might actually remove this special case entirely since, with this change, all properties will indicate retention on suspension. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-10743) Use 0 processExitCode for ApplicationStatus.CANCELED
Ufuk Celebi created FLINK-10743: --- Summary: Use 0 processExitCode for ApplicationStatus.CANCELED Key: FLINK-10743 URL: https://issues.apache.org/jira/browse/FLINK-10743 Project: Flink Issue Type: Bug Components: Cluster Management, Kubernetes, Mesos, YARN Affects Versions: 1.6.3 Reporter: Ufuk Celebi Assignee: Ufuk Celebi {{org.apache.flink.runtime.clusterframework.ApplicationStatus}} is used to map {{org.apache.flink.runtime.jobgraph.JobStatus}} to a process exit code. We currently map {{ApplicationStatus.CANCELED}} to a non-zero exit code ({{1444}}). Since cancellation is a user-triggered operation, I would consider this to be a successful exit and map it to exit code {{0}}. Our current behavior results in applications running via the {{StandaloneJobClusterEntryPoint}} and Kubernetes pods as documented in [flink-container|https://github.com/apache/flink/tree/master/flink-container/kubernetes] being immediately restarted when cancelled. This only leaves the option of killing the respective job cluster master container. The {{ApplicationStatus}} is also used in the YARN and Mesos clients, but I'm not familiar with that part of the code base and can't assess how changing the exit code would affect these clients. A quick usage scan for {{ApplicationStatus.CANCELED}} did not surface any problematic usages, though. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
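The proposed change amounts to adjusting the status-to-exit-code mapping. A simplified sketch (this is not Flink's actual enum; the non-zero codes other than {{1444}} are illustrative assumptions):

```java
/**
 * Sketch of the proposed mapping, not Flink's actual ApplicationStatus enum.
 * CANCELED currently maps to 1444; since cancellation is user-triggered, the
 * ticket proposes mapping it to 0 so e.g. Kubernetes does not restart a
 * cancelled job cluster container.
 */
public enum ApplicationStatusSketch {
    SUCCEEDED(0),
    // Proposed change: CANCELED exits with 0 instead of the current 1444.
    CANCELED(0),
    // Non-zero codes below are illustrative, not taken from the ticket.
    FAILED(1443),
    UNKNOWN(1445);

    private final int processExitCode;

    ApplicationStatusSketch(int processExitCode) {
        this.processExitCode = processExitCode;
    }

    public int processExitCode() {
        return processExitCode;
    }
}
```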
[jira] [Created] (FLINK-10441) Don't log warning when creating upload directory
Ufuk Celebi created FLINK-10441: --- Summary: Don't log warning when creating upload directory Key: FLINK-10441 URL: https://issues.apache.org/jira/browse/FLINK-10441 Project: Flink Issue Type: Improvement Components: REST Reporter: Ufuk Celebi {{RestServerEndpoint.createUploadDir(Path, Logger)}} logs a warning if the upload directory does not exist. {code} 2018-09-26 15:32:31,732 WARN org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint - Upload directory /var/folders/hr/cxn1_2y52qxf5nzyfq9h2scwgn/T/flink-web-2218b898-f245-4edf-b181-8f3bdc6014f3/flink-web-upload does not exist, or has been deleted externally. Previously uploaded files are no longer available. {code} I found this warning confusing as it is always logged when relying on the default configuration (via {{WebOptions}}) that picks a random directory. Ideally, our default configurations should not result in warnings being logged. Therefore, I propose to log the creation of the upload directory at {{INFO}} instead of logging the warning. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-10439) Race condition during job suspension
Ufuk Celebi created FLINK-10439: --- Summary: Race condition during job suspension Key: FLINK-10439 URL: https://issues.apache.org/jira/browse/FLINK-10439 Project: Flink Issue Type: Bug Components: Distributed Coordination Affects Versions: 1.7.0 Reporter: Ufuk Celebi Attachments: master-logs.log, race-job-suspension.png, worker-logs.log When a {{JobMaster}} in an HA setup loses leadership, it suspends the execution of its job via {{JobMaster.suspend(Exception, Time)}}. This operation involves transitioning to the {{SUSPENDING}} job state and cancelling all running tasks. In some executions it may happen that the job does *not* reach the terminal {{SUSPENDED}} job state. This is due to the fact that suspending the job stops related RPC endpoints such as the {{JobMaster}} or {{SlotPool}} (in {{JobMaster.suspend(Exception, Time)}} and {{JobMaster.suspendExecution(Exception)}}) immediately after suspending. Whenever this happens *before* the {{TaskExecutor}} instances have cancelled or failed the respective tasks, the job does not transition to {{SUSPENDED}}, because the {{ExecutionGraph}} does not receive all {{Execution}} state transitions. In practice, this should not happen frequently due to the fact that {{JobMaster}} and {{TaskExecutor}} instances are notified about the loss of leadership (or loss of ZooKeeper connection or similar events) around the same time. In this scenario, the {{TaskExecutor}} instances proactively fail the executing tasks and notify the {{JobMaster}}. All in all, the impact of this is limited by the fact that a new {{JobMaster}} leader will eventually recover the job. *Steps to reproduce*: - Start ZooKeeper - Start a Flink cluster in HA mode and submit job - Stop ZooKeeper In some executions you will find that the job does not reach the terminal state {{SUSPENDED}}. 
Furthermore, you may see log messages similar to the following in this case: {code} The rpc endpoint org.apache.flink.runtime.jobmaster.slotpool.SlotPool has not been started yet. Discarding message org.apache.flink.runtime.rpc.messages.LocalRpcInvocation until processing is started. {code} I've attached the logs of a local run that does not transition to {{SUSPENDED}} and a sequence diagram of what I think may be a problematic timing. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-10436) Example config uses deprecated key jobmanager.rpc.address
Ufuk Celebi created FLINK-10436: --- Summary: Example config uses deprecated key jobmanager.rpc.address Key: FLINK-10436 URL: https://issues.apache.org/jira/browse/FLINK-10436 Project: Flink Issue Type: Bug Components: Startup Shell Scripts Affects Versions: 1.7.0 Reporter: Ufuk Celebi The example {{flink-conf.yaml}} shipped as part of the Flink distribution (https://github.com/apache/flink/blob/master/flink-dist/src/main/resources/flink-conf.yaml) has the following entry: {code} jobmanager.rpc.address: localhost {code} When using this key, the following deprecation warning is logged. {code} 2018-09-26 12:01:46,608 WARN org.apache.flink.configuration.Configuration - Config uses deprecated configuration key 'jobmanager.rpc.address' instead of proper key 'rest.address' {code} The example config should not use deprecated config options. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-10318) Add option to build Hadoop-free job image to build.sh
Ufuk Celebi created FLINK-10318: --- Summary: Add option to build Hadoop-free job image to build.sh Key: FLINK-10318 URL: https://issues.apache.org/jira/browse/FLINK-10318 Project: Flink Issue Type: Improvement Components: Docker Affects Versions: 1.6.0 Reporter: Ufuk Celebi When building a job-specific image from a release via {{flink-container/docker/build.sh}}, users are required to specify a Hadoop version: {code} ./build.sh --job-jar flink-job.jar --from-release --flink-version 1.6.0 --hadoop-version 2.8 # <- currently required --scala-version 2.11 --image-name flink-job {code} I think for many users a Hadoop-free build is a good default. We should consider supporting this out of the box with the {{build.sh}} script. The current workaround is to manually download the Hadoop-free release and build with the {{--from-archive}} flag. Another alternative would be to drop the {{--from-release}} option and document how to build from an archive with links to the downloads. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-9832) Allow commas in job submission query params
Ufuk Celebi created FLINK-9832: -- Summary: Allow commas in job submission query params Key: FLINK-9832 URL: https://issues.apache.org/jira/browse/FLINK-9832 Project: Flink Issue Type: Bug Components: REST Affects Versions: 1.5.1 Reporter: Ufuk Celebi As reported on the user mailing list in the thread "Run programs w/ params including comma via REST api" [1], submitting a job with mainArgs that include a comma results in an exception. To reproduce, submit a job with the following mainArgs: {code} --servers 10.100.98.9:9092,10.100.98.237:9092 {code} The request fails with {code} Expected only one value [--servers 10.100.98.9:9092, 10.100.98.237:9092]. {code} As a workaround, users have to use a different delimiter such as {{;}}. The proper fix would be to make these params part of the {{POST}} request body instead of relying on query params (as noted in FLINK-9499). [1] http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Run-programs-w-params-including-comma-via-REST-api-td19662.html -- This message was sent by Atlassian JIRA (v7.6.3#76005)
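The failure mode can be modeled in a few lines of Java (a simplified stand-in for the REST handler's query-parameter parsing, not Flink code): a comma inside a single parameter value is treated as a value separator, so one argument list is seen as two values and rejected.

```java
import java.util.Arrays;
import java.util.List;

/**
 * Simplified model of the problematic behavior, not Flink's actual REST
 * handler: comma-separated query-parameter values are split into multiple
 * values, so a single value containing a comma triggers
 * "Expected only one value ...".
 */
public class CommaParamSketch {

    /** Mimics splitting a query-parameter value on commas. */
    public static List<String> parseValues(String rawValue) {
        return Arrays.asList(rawValue.split(","));
    }
}
```

Using a different delimiter such as `;` keeps the value intact, which is exactly the workaround described above.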
[jira] [Created] (FLINK-9690) Restoring state with FlinkKafkaProducer and Kafka 1.1.0 client fails
Ufuk Celebi created FLINK-9690: -- Summary: Restoring state with FlinkKafkaProducer and Kafka 1.1.0 client fails Key: FLINK-9690 URL: https://issues.apache.org/jira/browse/FLINK-9690 Project: Flink Issue Type: Improvement Components: Kafka Connector Affects Versions: 1.4.2 Reporter: Ufuk Celebi Restoring a job from a savepoint that includes a {{FlinkKafkaProducer}} packaged with {{kafka.version}} set to {{1.1.0}} fails in Flink 1.4.2: {code} java.lang.RuntimeException: Incompatible KafkaProducer version at org.apache.flink.streaming.connectors.kafka.internal.FlinkKafkaProducer.getValue(FlinkKafkaProducer.java:301) at org.apache.flink.streaming.connectors.kafka.internal.FlinkKafkaProducer.getValue(FlinkKafkaProducer.java:292) at org.apache.flink.streaming.connectors.kafka.internal.FlinkKafkaProducer.resumeTransaction(FlinkKafkaProducer.java:195) at org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer011.recoverAndCommit(FlinkKafkaProducer011.java:723) at org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer011.recoverAndCommit(FlinkKafkaProducer011.java:93) at org.apache.flink.streaming.api.functions.sink.TwoPhaseCommitSinkFunction.recoverAndCommitInternal(TwoPhaseCommitSinkFunction.java:370) at org.apache.flink.streaming.api.functions.sink.TwoPhaseCommitSinkFunction.initializeState(TwoPhaseCommitSinkFunction.java:330) at org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer011.initializeState(FlinkKafkaProducer011.java:856) at org.apache.flink.streaming.util.functions.StreamingFunctionUtils.tryRestoreFunction(StreamingFunctionUtils.java:178) at org.apache.flink.streaming.util.functions.StreamingFunctionUtils.restoreFunctionState(StreamingFunctionUtils.java:160) at org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator.initializeState(AbstractUdfStreamOperator.java:96) at org.apache.flink.streaming.api.operators.AbstractStreamOperator.initializeState(AbstractStreamOperator.java:258) at 
org.apache.flink.streaming.runtime.tasks.StreamTask.initializeOperators(StreamTask.java:692) at org.apache.flink.streaming.runtime.tasks.StreamTask.initializeState(StreamTask.java:679) at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:253) at org.apache.flink.runtime.taskmanager.Task.run(Task.java:718) at java.lang.Thread.run(Thread.java:748) Caused by: java.lang.NoSuchFieldException: sequenceNumbers at java.lang.Class.getDeclaredField(Class.java:2070) at org.apache.flink.streaming.connectors.kafka.internal.FlinkKafkaProducer.getValue(FlinkKafkaProducer.java:297) ... 16 more {code} [~pnowojski] Any ideas about this issue? Judging from the stack trace it was anticipated that reflective access might break with Kafka versions > 0.11.2.0. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-9389) Improve error message when cancelling job in state canceled via REST API
Ufuk Celebi created FLINK-9389: -- Summary: Improve error message when cancelling job in state canceled via REST API Key: FLINK-9389 URL: https://issues.apache.org/jira/browse/FLINK-9389 Project: Flink Issue Type: Improvement Components: REST Affects Versions: 1.5.0 Reporter: Ufuk Celebi - Request a job that has already been cancelled {code} $ curl http://localhost:8081/jobs/b577a914ecdf710d4b93a84105dea0c9 {"jid":"b577a914ecdf710d4b93a84105dea0c9","name":"CarTopSpeedWindowingExample","isStoppable":false,"state":"CANCELED", ...omitted...} {code} - Cancel the job again {code} $ curl -XPATCH http://localhost:8081/jobs/b577a914ecdf710d4b93a84105dea0c9\?mode\=cancel {"errors":["Job could not be found."]} (HTTP 404) {code} Since the actual job resource still exists, I think status code 404 is confusing. A better message would indicate that the job is already in state {{CANCELED}}. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-8304) Document Kubernetes and Flink HA setup
Ufuk Celebi created FLINK-8304: -- Summary: Document Kubernetes and Flink HA setup Key: FLINK-8304 URL: https://issues.apache.org/jira/browse/FLINK-8304 Project: Flink Issue Type: Improvement Components: Documentation Reporter: Ufuk Celebi Currently the Flink on Kubernetes documentation does not mention anything about running Flink in HA mode. We should add at least the following two things: - Currently, there cannot be a standby JobManager pod due to the way Flink HA works - `high-availability.jobmanager.port` has to be set to a port that is exposed via Kubernetes -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (FLINK-8253) flink-docs-stable still points to 1.3
Ufuk Celebi created FLINK-8253: -- Summary: flink-docs-stable still points to 1.3 Key: FLINK-8253 URL: https://issues.apache.org/jira/browse/FLINK-8253 Project: Flink Issue Type: Bug Components: Documentation, Project Website Reporter: Ufuk Celebi Assignee: Ufuk Celebi https://ci.apache.org/projects/flink/flink-docs-stable/ redirects to 1.3. As noted in https://cwiki.apache.org/confluence/display/FLINK/Creating+a+Flink+Release we have to update the stable alias in flink.conf found in https://svn.apache.org/repos/infra/infrastructure/buildbot/aegis/buildmaster/master1/projects. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (FLINK-8138) Race in TaskAsyncCallTest leads to test time out
Ufuk Celebi created FLINK-8138: -- Summary: Race in TaskAsyncCallTest leads to test time out Key: FLINK-8138 URL: https://issues.apache.org/jira/browse/FLINK-8138 Project: Flink Issue Type: Bug Components: Tests Reporter: Ufuk Celebi Priority: Minor {{TaskAsyncCallTest#testSetsUserCodeClassLoader}} times out with a stack trace on Travis on a personal branch with unrelated changes on top of 1.4.0 RC 0. I've attached the Travis output to this issue. The main thread is stuck in {code} "main" #1 prio=5 os_prio=0 tid=0x7ff59000a000 nid=0xb9b in Object.wait() [0x7ff598965000] java.lang.Thread.State: WAITING (on object monitor) at java.lang.Object.wait(Native Method) - waiting on <0x833994c8> (a java.lang.Object) at java.lang.Object.wait(Object.java:502) at org.apache.flink.core.testutils.OneShotLatch.await(OneShotLatch.java:56) - locked <0x833994c8> (a java.lang.Object) at org.apache.flink.runtime.taskmanager.TaskAsyncCallTest.testSetsUserCodeClassLoader(TaskAsyncCallTest.java:201) {code} There are no other Flink related threads alive. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (FLINK-7677) Add side outputs to ProcessWindowFunction
Ufuk Celebi created FLINK-7677: -- Summary: Add side outputs to ProcessWindowFunction Key: FLINK-7677 URL: https://issues.apache.org/jira/browse/FLINK-7677 Project: Flink Issue Type: Improvement Components: DataStream API Affects Versions: 1.4.0 Reporter: Ufuk Celebi As per the discussion on the user mailing list [1], we should add the required context to collect to side outputs in {{ProcessWindowFunction}}, similar to {{ProcessFunction}}. [1] http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/question-on-sideoutput-from-ProcessWindow-function-td15500.html -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (FLINK-7676) ContinuousFileMonitoringFunction fails with GoogleHadoopFileSystem
Ufuk Celebi created FLINK-7676: -- Summary: ContinuousFileMonitoringFunction fails with GoogleHadoopFileSystem Key: FLINK-7676 URL: https://issues.apache.org/jira/browse/FLINK-7676 Project: Flink Issue Type: Bug Components: Streaming Connectors Reporter: Ufuk Celebi Priority: Minor The following check in ContinuousFileMonitoringFunction fails when running against a file in Google Cloud Storage: {code} Path p = new Path(path); FileSystem fileSystem = FileSystem.get(p.toUri()); if (!fileSystem.exists(p)) { throw new FileNotFoundException("The provided file path " + path + " does not exist."); } {code} I suspect this has something to do with the consistency guarantees provided by GCS. I'm wondering if it's better to fail lazily at a later stage (e.g. when opening the stream and it doesn't work). After removing this check, everything works as expected. I can also run a batch WordCount job against the same file. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (FLINK-7674) NullPointerException in ContinuousFileMonitoringFunction close
Ufuk Celebi created FLINK-7674: -- Summary: NullPointerException in ContinuousFileMonitoringFunction close Key: FLINK-7674 URL: https://issues.apache.org/jira/browse/FLINK-7674 Project: Flink Issue Type: Bug Components: Streaming Connectors Affects Versions: 1.4.0 Reporter: Ufuk Celebi Priority: Minor If the ContinuousFileMonitoringFunction is closed before run is called (because initialization fails), we get a NullPointerException, because checkpointLock has not been set. {code} synchronized (checkpointLock) { globalModificationTime = Long.MAX_VALUE; isRunning = false; } {code} This results in a follow up error log: {code} 2017-09-23 10:25:04,096 ERROR org.apache.flink.streaming.runtime.tasks.StreamTask - Error during disposal of stream operator. java.lang.NullPointerException at org.apache.flink.streaming.api.functions.source.ContinuousFileMonitoringFunction.close(ContinuousFileMonitoringFunction.java:337) at org.apache.flink.api.common.functions.util.FunctionUtils.closeFunction(FunctionUtils.java:43) at org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator.dispose(AbstractUdfStreamOperator.java:126) at org.apache.flink.streaming.runtime.tasks.StreamTask.disposeAllOperators(StreamTask.java:429) at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:334) at org.apache.flink.runtime.taskmanager.Task.run(Task.java:702) at java.lang.Thread.run(Thread.java:748) {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029)
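A defensive fix can be sketched as follows (a simplified stand-in with hypothetical names, not the actual operator code): close() only synchronizes on the lock if run() has actually set it.

```java
/**
 * Sketch of the defensive fix for the NPE described above, not the actual
 * Flink source: guard against close() running before run() has initialized
 * the checkpoint lock.
 */
public class MonitoringFunctionSketch {
    private volatile Object checkpointLock; // set in run(); may still be null in close()
    private volatile boolean isRunning = true;

    public void run() {
        checkpointLock = new Object();
    }

    public void close() {
        Object lock = checkpointLock;
        if (lock != null) {
            synchronized (lock) {
                isRunning = false;
            }
        } else {
            // run() was never called (initialization failed); nothing to synchronize on.
            isRunning = false;
        }
    }

    public boolean isRunning() {
        return isRunning;
    }
}
```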
[jira] [Created] (FLINK-7666) ContinuousFileReaderOperator swallows chained watermarks
Ufuk Celebi created FLINK-7666: -- Summary: ContinuousFileReaderOperator swallows chained watermarks Key: FLINK-7666 URL: https://issues.apache.org/jira/browse/FLINK-7666 Project: Flink Issue Type: Improvement Components: Streaming Connectors Affects Versions: 1.3.2 Reporter: Ufuk Celebi I use event time and read from a (finite) file. I assign watermarks right after the {{ContinuousFileReaderOperator}} with parallelism 1. {code} env .readFile(new TextInputFormat(...), ...) .setParallelism(1) .assignTimestampsAndWatermarks(...) .setParallelism(1) .map()... {code} The watermarks I assign never progress through the pipeline. I can work around this by inserting a {{shuffle()}} after the file reader or starting a new chain at the assigner: {code} env .readFile(new TextInputFormat(...), ...) .setParallelism(1) .shuffle() .assignTimestampsAndWatermarks(...) .setParallelism(1) .map()... {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (FLINK-7665) Use wait/notify in ContinuousFileReaderOperator
Ufuk Celebi created FLINK-7665: -- Summary: Use wait/notify in ContinuousFileReaderOperator Key: FLINK-7665 URL: https://issues.apache.org/jira/browse/FLINK-7665 Project: Flink Issue Type: Improvement Components: Streaming Connectors Affects Versions: 1.4.0 Reporter: Ufuk Celebi Priority: Minor {{ContinuousFileReaderOperator}} has the following loop to receive input splits: {code} synchronized (checkpointLock) { if (currentSplit == null) { currentSplit = this.pendingSplits.poll(); if (currentSplit == null) { if (this.shouldClose) { isRunning = false; } else { checkpointLock.wait(50); } continue; } } } {code} I think we can replace this with a {{wait()}} and {{notify()}} in {{addSplit}} and {{close}}. If there is a reason to keep the {{wait(50)}}, feel free to close this issue. :-) -- This message was sent by Atlassian JIRA (v6.4.14#64029)
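The proposed hand-off could look roughly like this (a simplified model with hypothetical names, not the actual ContinuousFileReaderOperator): the reader blocks until addSplit() or close() signals it, instead of polling with wait(50).

```java
import java.util.ArrayDeque;
import java.util.Queue;

/** Sketch of the proposed wait()/notify() hand-off; not Flink's actual operator code. */
public class SplitQueueSketch {
    private final Object lock = new Object();
    private final Queue<String> pendingSplits = new ArrayDeque<>();
    private boolean shouldClose;

    public void addSplit(String split) {
        synchronized (lock) {
            pendingSplits.add(split);
            lock.notifyAll(); // wake the reader instead of letting it poll with wait(50)
        }
    }

    public void close() {
        synchronized (lock) {
            shouldClose = true;
            lock.notifyAll();
        }
    }

    /** Blocks until a split arrives or close() is called; returns null on close. */
    public String nextSplit() {
        synchronized (lock) {
            while (pendingSplits.isEmpty() && !shouldClose) {
                try {
                    lock.wait(); // no timeout needed: addSplit()/close() notify us
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return null;
                }
            }
            return pendingSplits.poll(); // null when closed with no pending splits
        }
    }
}
```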
[jira] [Created] (FLINK-7643) HadoopFileSystem always reloads GlobalConfiguration
Ufuk Celebi created FLINK-7643: -- Summary: HadoopFileSystem always reloads GlobalConfiguration Key: FLINK-7643 URL: https://issues.apache.org/jira/browse/FLINK-7643 Project: Flink Issue Type: Bug Components: State Backends, Checkpointing Reporter: Ufuk Celebi HadoopFileSystem always reloads GlobalConfiguration, which potentially leads to a lot of noise in the logs, because this happens on each checkpoint. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (FLINK-7595) Removing stateless task from task chain breaks savepoint restore
Ufuk Celebi created FLINK-7595: -- Summary: Removing stateless task from task chain breaks savepoint restore Key: FLINK-7595 URL: https://issues.apache.org/jira/browse/FLINK-7595 Project: Flink Issue Type: Bug Components: State Backends, Checkpointing Reporter: Ufuk Celebi Assignee: Chesnay Schepler When removing a stateless operator from a 2-task chain where the head operator is stateful breaks savepoint restore with {code} Caused by: java.lang.IllegalStateException: Failed to rollback to savepoint /var/folders/py/s_1l8vln6f19ygc77m8c4zhrgn/T/junit1167397515334838028/junit8006766303945373008/savepoint-cb0bcf-3cfa67865ac0. Cannot map savepoint state... {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (FLINK-7258) IllegalArgumentException in Netty bootstrap with large memory state segment size
Ufuk Celebi created FLINK-7258: -- Summary: IllegalArgumentException in Netty bootstrap with large memory state segment size Key: FLINK-7258 URL: https://issues.apache.org/jira/browse/FLINK-7258 Project: Flink Issue Type: Bug Components: Network Affects Versions: 1.3.1 Reporter: Ufuk Celebi Assignee: Ufuk Celebi In NettyBootstrap we configure the low and high watermarks in the following order: {code} bootstrap.childOption(ChannelOption.WRITE_BUFFER_LOW_WATER_MARK, config.getMemorySegmentSize() + 1); bootstrap.childOption(ChannelOption.WRITE_BUFFER_HIGH_WATER_MARK, 2 * config.getMemorySegmentSize()); {code} When the memory segment size is higher than the default high watermark, this throws an `IllegalArgumentException` when a client tries to connect. Hence, this unfortunately only fails at runtime when an intermediate result is requested. A simple fix is to first configure the high watermark and only then configure the low watermark. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
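The order dependency can be illustrated with a simplified model of the water-mark validation (the defaults and checks below are meant to mirror Netty's behavior, but this is not Netty code): each setter validates against the current value of the other mark, so the order of the two calls matters once values exceed the defaults.

```java
/**
 * Simplified model of Netty's write-buffer water-mark validation; not Netty
 * code. Each setter checks against the current value of the other mark, so
 * setting the low mark above the (still default) high mark fails.
 */
public class WaterMarkConfigSketch {
    // Netty's default low/high water marks are 32 KiB / 64 KiB.
    private int low = 32 * 1024;
    private int high = 64 * 1024;

    public void setLowWaterMark(int newLow) {
        if (newLow > high) {
            throw new IllegalArgumentException("low cannot exceed high (" + high + ")");
        }
        low = newLow;
    }

    public void setHighWaterMark(int newHigh) {
        if (newHigh < low) {
            throw new IllegalArgumentException("high cannot be below low (" + low + ")");
        }
        high = newHigh;
    }
}
```

Setting the high water mark first (to `2 * segmentSize`) makes the subsequent low-water-mark call valid for any segment size, which is the fix described above.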
[jira] [Created] (FLINK-7127) Remove unnecessary null check or add null check
Ufuk Celebi created FLINK-7127: -- Summary: Remove unnecessary null check or add null check Key: FLINK-7127 URL: https://issues.apache.org/jira/browse/FLINK-7127 Project: Flink Issue Type: Improvement Components: State Backends, Checkpointing Reporter: Ufuk Celebi In {{HeapKeyedStateBackend#snapshot}} we have: {code}
for (Map.Entry<String, StateTable<K, ?, ?>> kvState : stateTables.entrySet()) {
  // 1) Here we don't check for null
  metaInfoSnapshots.add(kvState.getValue().getMetaInfo().snapshot());
  kVStateToId.put(kvState.getKey(), kVStateToId.size());
  // 2) Here we check for null
  StateTable<K, ?, ?> stateTable = kvState.getValue();
  if (null != stateTable) {
    cowStateStableSnapshots.put(stateTable, stateTable.createSnapshot());
  }
}
{code} Either 1) can lead to an NPE and we should add a null check there as well, or the null check in 2) is unnecessary and should be removed. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (FLINK-7067) Cancel with savepoint does not restart checkpoint scheduler on failure
Ufuk Celebi created FLINK-7067: -- Summary: Cancel with savepoint does not restart checkpoint scheduler on failure Key: FLINK-7067 URL: https://issues.apache.org/jira/browse/FLINK-7067 Project: Flink Issue Type: Bug Components: State Backends, Checkpointing Affects Versions: 1.3.1 Reporter: Ufuk Celebi The `CancelWithSavepoint` action of the JobManager first stops the checkpoint scheduler, then triggers a savepoint, and cancels the job after the savepoint completes. If the savepoint fails, the command should not have any side effects, and we don't cancel the job. The issue is that the checkpoint scheduler is not restarted in this case. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
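The intended behavior can be sketched as follows (hypothetical names, not the JobManager's actual code): if triggering the savepoint fails, the scheduler that was stopped beforehand is restarted so the command leaves no side effects.

```java
/**
 * Sketch of the proposed fix; hypothetical names, not Flink's JobManager code.
 */
public class CancelWithSavepointSketch {

    /** Stand-in for the savepoint trigger; may throw on failure. */
    public interface SavepointAction {
        void trigger() throws Exception;
    }

    private boolean schedulerRunning = true;

    public boolean isSchedulerRunning() {
        return schedulerRunning;
    }

    public void cancelWithSavepoint(SavepointAction savepoint) {
        // Stop periodic checkpoints while the savepoint is in progress.
        schedulerRunning = false;
        try {
            savepoint.trigger();
            // Success: the job is cancelled (omitted); the scheduler stays stopped.
        } catch (Exception e) {
            // Proposed fix: a failed savepoint must leave no side effects,
            // so restart the checkpoint scheduler.
            schedulerRunning = true;
        }
    }
}
```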
[jira] [Created] (FLINK-6952) Add link to Javadocs
Ufuk Celebi created FLINK-6952: -- Summary: Add link to Javadocs Key: FLINK-6952 URL: https://issues.apache.org/jira/browse/FLINK-6952 Project: Flink Issue Type: Improvement Components: Documentation Reporter: Ufuk Celebi Assignee: Ufuk Celebi Priority: Minor The project webpage and the docs are missing links to the Javadocs. I think we should add them as part of the external links at the bottom of the doc navigation (above "Project Page"). In the same manner we could add a link to the Scaladocs, but if I remember correctly there was a problem with the build of the Scaladocs. Correct, [~aljoscha]? -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (FLINK-6182) Fix possible NPE in SourceStreamTask
Ufuk Celebi created FLINK-6182: -- Summary: Fix possible NPE in SourceStreamTask Key: FLINK-6182 URL: https://issues.apache.org/jira/browse/FLINK-6182 Project: Flink Issue Type: Bug Components: Local Runtime Reporter: Ufuk Celebi Priority: Minor If SourceStreamTask is cancelled before being invoked, `headOperator` is not set yet, which leads to an NPE. This is not critical but leads to noisy logs. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (FLINK-6175) HistoryServerTest.testFullArchiveLifecycle fails
Ufuk Celebi created FLINK-6175: -- Summary: HistoryServerTest.testFullArchiveLifecycle fails Key: FLINK-6175 URL: https://issues.apache.org/jira/browse/FLINK-6175 Project: Flink Issue Type: Test Components: Tests, Webfrontend Reporter: Ufuk Celebi Assignee: Chesnay Schepler https://s3.amazonaws.com/archive.travis-ci.org/jobs/213933823/log.txt {code} testFullArchiveLifecycle(org.apache.flink.runtime.webmonitor.history.HistoryServerTest) Time elapsed: 2.162 sec <<< FAILURE! java.lang.AssertionError: /joboverview.json did not contain valid json at org.junit.Assert.fail(Assert.java:88) at org.junit.Assert.assertTrue(Assert.java:41) at org.junit.Assert.assertNotNull(Assert.java:712) at org.apache.flink.runtime.webmonitor.history.HistoryServerTest.testFullArchiveLifecycle(HistoryServerTest.java:98) {code} Happened on a branch with unrelated changes [~Zentol]. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (FLINK-6171) Some checkpoint metrics rely on latest stat snapshot
Ufuk Celebi created FLINK-6171: -- Summary: Some checkpoint metrics rely on latest stat snapshot Key: FLINK-6171 URL: https://issues.apache.org/jira/browse/FLINK-6171 Project: Flink Issue Type: Bug Components: Metrics, State Backends, Checkpointing, Webfrontend Reporter: Ufuk Celebi Some checkpoint metrics use the latest stats snapshot to get the returned metric value. These snapshots are only updated when the {{WebRuntimeMonitor}} actually requests some stats (web UI or REST API). In practice, this means that these metrics are only updated when users are browsing the web UI. Instead of relying on the latest snapshot, the checkpoint metrics should be directly updated via the completion callbacks. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
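The proposed push-based update can be sketched like this (a simplified stand-in for Flink's metric classes, not the actual implementation): the metric value is written directly from the checkpoint completion callback, instead of being derived from a stats snapshot that is only refreshed when the web UI polls.

```java
import java.util.concurrent.atomic.AtomicLong;

/**
 * Sketch of a checkpoint metric updated directly from the completion
 * callback; simplified stand-in, not Flink's metric classes.
 */
public class CheckpointMetricsSketch {
    private final AtomicLong lastCheckpointDurationMillis = new AtomicLong(-1);

    /** Called from the checkpoint completion callback. */
    public void onCheckpointCompleted(long durationMillis) {
        lastCheckpointDurationMillis.set(durationMillis);
    }

    /** Gauge read by the metrics reporter; always current, no snapshot involved. */
    public long lastCheckpointDuration() {
        return lastCheckpointDurationMillis.get();
    }
}
```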
[jira] [Created] (FLINK-6170) Some checkpoint metrics rely on latest stat snapshot
Ufuk Celebi created FLINK-6170: -- Summary: Some checkpoint metrics rely on latest stat snapshot Key: FLINK-6170 URL: https://issues.apache.org/jira/browse/FLINK-6170 Project: Flink Issue Type: Bug Components: Metrics, State Backends, Checkpointing, Webfrontend Reporter: Ufuk Celebi Some checkpoint metrics use the latest stats snapshot to get the returned metric value. These snapshots are only updated when the {{WebRuntimeMonitor}} actually requests some stats (web UI or REST API). In practice, this means that these metrics are only updated when users are browsing the web UI. Instead of relying on the latest snapshot, the checkpoint metrics should be directly updated via the completion callbacks. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (FLINK-6127) Add MissingDeprecatedCheck to checkstyle
Ufuk Celebi created FLINK-6127: -- Summary: Add MissingDeprecatedCheck to checkstyle Key: FLINK-6127 URL: https://issues.apache.org/jira/browse/FLINK-6127 Project: Flink Issue Type: Improvement Components: Build System Reporter: Ufuk Celebi Priority: Minor We should add the MissingDeprecatedCheck to our checkstyle rules to help avoid deprecations that lack JavaDocs explaining why the deprecation happened. http://checkstyle.sourceforge.net/apidocs/com/puppycrawl/tools/checkstyle/checks/annotation/MissingDeprecatedCheck.html -- This message was sent by Atlassian JIRA (v6.3.15#6346)
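The check verifies that the {{@Deprecated}} annotation and the {{@deprecated}} Javadoc tag appear together. A minimal illustration of what would pass the check (the class and method names here are made up, not Flink code):

```java
// Illustrative only: MissingDeprecatedCheck flags a member carrying one
// of the two deprecation markers but not the other.
public class LegacyApi {

    /**
     * Returns the job name.
     *
     * @deprecated Use {@link #jobName()} instead; kept for API compatibility.
     */
    @Deprecated // annotation AND Javadoc tag must both be present to pass
    public String getJobName() {
        return jobName();
    }

    /** Preferred replacement. */
    public String jobName() {
        return "example-job";
    }
}
```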
[jira] [Created] (FLINK-5999) MiniClusterITCase.runJobWithMultipleRpcServices fails
Ufuk Celebi created FLINK-5999: -- Summary: MiniClusterITCase.runJobWithMultipleRpcServices fails Key: FLINK-5999 URL: https://issues.apache.org/jira/browse/FLINK-5999 Project: Flink Issue Type: Test Components: Distributed Coordination, Tests Reporter: Ufuk Celebi In a branch with unrelated changes to the web frontend I've seen the following test fail: {code} runJobWithMultipleRpcServices(org.apache.flink.runtime.minicluster.MiniClusterITCase) Time elapsed: 1.145 sec <<< ERROR! java.util.ConcurrentModificationException: null at java.util.HashMap$HashIterator.nextNode(HashMap.java:1429) at java.util.HashMap$ValueIterator.next(HashMap.java:1458) at org.apache.flink.runtime.resourcemanager.JobLeaderIdService.clear(JobLeaderIdService.java:114) at org.apache.flink.runtime.resourcemanager.JobLeaderIdService.stop(JobLeaderIdService.java:92) at org.apache.flink.runtime.resourcemanager.ResourceManager.shutDown(ResourceManager.java:182) at org.apache.flink.runtime.resourcemanager.ResourceManagerRunner.shutDownInternally(ResourceManagerRunner.java:83) at org.apache.flink.runtime.resourcemanager.ResourceManagerRunner.shutDown(ResourceManagerRunner.java:78) at org.apache.flink.runtime.minicluster.MiniCluster.shutdownInternally(MiniCluster.java:313) at org.apache.flink.runtime.minicluster.MiniCluster.shutdown(MiniCluster.java:281) at org.apache.flink.runtime.minicluster.MiniClusterITCase.runJobWithMultipleRpcServices(MiniClusterITCase.java:72) {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (FLINK-5993) Add section about Azure deployment
Ufuk Celebi created FLINK-5993: -- Summary: Add section about Azure deployment Key: FLINK-5993 URL: https://issues.apache.org/jira/browse/FLINK-5993 Project: Flink Issue Type: Improvement Components: Documentation Reporter: Ufuk Celebi Running Flink on Azure can lead to unexpected problems. From the user mailing list: {quote} For anyone seeing this thread in the future, we managed to solve the issue. The problem was in the Azure storage SDK. Flink is using Hadoop 2.7, so we used version 2.7.3 of the Hadoop-azure package. This package uses version 2.0.0 of the azure-storage package, dated from 2014. It has several bugs that were since fixed, specifically one where the socket timeout was infinite. We updated this package to version 5.0.0 and everything is working smoothly now. {quote} (http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Flink-checkpointing-gets-stuck-td11776.html) At least this caveat should be covered. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (FLINK-5928) Externalized checkpoints overwriting each other
Ufuk Celebi created FLINK-5928: -- Summary: Externalized checkpoints overwriting each other Key: FLINK-5928 URL: https://issues.apache.org/jira/browse/FLINK-5928 Project: Flink Issue Type: Bug Components: State Backends, Checkpointing Reporter: Ufuk Celebi Assignee: Ufuk Celebi Priority: Critical I noticed that PR #3346 accidentally broke externalized checkpoints by using a fixed meta data file name. We should restore the old behaviour of creating random files and double check why no test caught this. This will likely be superseded by upcoming changes from [~StephanEwen] to use metadata streams on the JobManager. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (FLINK-5926) Show state backend information in web UI
Ufuk Celebi created FLINK-5926: -- Summary: Show state backend information in web UI Key: FLINK-5926 URL: https://issues.apache.org/jira/browse/FLINK-5926 Project: Flink Issue Type: Improvement Components: Webfrontend Reporter: Ufuk Celebi Priority: Minor With #3411 and follow ups, the state backends will be available at the job manager via the snapshot settings. We should extend the checkpoints/configuration tab in the web UI to also show the configured state backend. https://github.com/apache/flink/pull/3411 -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (FLINK-5923) Test instability in SavepointITCase testTriggerSavepointAndResume
Ufuk Celebi created FLINK-5923: -- Summary: Test instability in SavepointITCase testTriggerSavepointAndResume Key: FLINK-5923 URL: https://issues.apache.org/jira/browse/FLINK-5923 Project: Flink Issue Type: Test Components: Tests Reporter: Ufuk Celebi Assignee: Ufuk Celebi https://s3.amazonaws.com/archive.travis-ci.org/jobs/205042538/log.txt {code} Failed tests: SavepointITCase.testTriggerSavepointAndResume:258 Checkpoints directory not cleaned up: [/tmp/junit1029044621247843839/junit7338507921051602138/checkpoints/47fa12635d098bdafd52def453e6d66c/chk-4] expected:<0> but was:<1> {code} I think this is due to a race in the test. When shutting down the cluster it can happen that in progress checkpoints linger around. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (FLINK-5876) Mention Scala type fallacies for queryable state client serializers
Ufuk Celebi created FLINK-5876: -- Summary: Mention Scala type fallacies for queryable state client serializers Key: FLINK-5876 URL: https://issues.apache.org/jira/browse/FLINK-5876 Project: Flink Issue Type: Improvement Components: Documentation Reporter: Ufuk Celebi FLINK-5801 shows a very hard to debug issue when querying state from the Scala API that can result in serializer mismatches between the client and the job. We should mention that the type serializers should be created via the Scala macros. {code} import org.apache.flink.streaming.api.scala._ ... val keySerializer = createTypeInformation[Long].createSerializer(new ExecutionConfig) {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (FLINK-5824) Fix String/byte conversion without explicit encodings
Ufuk Celebi created FLINK-5824: -- Summary: Fix String/byte conversion without explicit encodings Key: FLINK-5824 URL: https://issues.apache.org/jira/browse/FLINK-5824 Project: Flink Issue Type: Bug Components: Python API, Queryable State, State Backends, Checkpointing, Webfrontend Reporter: Ufuk Celebi In a couple of places we convert Strings to bytes and bytes back to Strings without explicitly specifying an encoding. This can lead to problems when client and server default encodings differ. The task of this JIRA is to go over the whole project and look for conversions where we don't specify an encoding and fix it to specify UTF-8 explicitly. For starters, we can {{grep -R 'getBytes()' .}}, which already reveals many problematic places. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
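The pattern to fix can be sketched in a few lines. This helper is illustrative, not Flink code: the point is that {{getBytes()}} depends on the JVM's platform default charset, while passing {{StandardCharsets.UTF_8}} makes the result deterministic on every machine.

```java
import java.nio.charset.StandardCharsets;

// Illustrative helper: the same conversion with and without an explicit
// encoding.
public class Utf8Conversion {

    // Problematic: uses the JVM's platform default charset, so the bytes
    // can differ between client and server.
    static byte[] implicitBytes(String s) {
        return s.getBytes();
    }

    // Fixed: always UTF-8, independent of any default encoding.
    static byte[] explicitBytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Decoding must name the same charset as encoding.
    static String decode(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }
}
```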
[jira] [Created] (FLINK-5781) Generate HTML from ConfigOption
Ufuk Celebi created FLINK-5781: -- Summary: Generate HTML from ConfigOption Key: FLINK-5781 URL: https://issues.apache.org/jira/browse/FLINK-5781 Project: Flink Issue Type: Sub-task Components: Documentation Reporter: Ufuk Celebi Assignee: Ufuk Celebi Use the ConfigOption instances to generate an HTML page that we can include in the docs configuration page. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (FLINK-5780) Extend ConfigOption with descriptions
Ufuk Celebi created FLINK-5780: -- Summary: Extend ConfigOption with descriptions Key: FLINK-5780 URL: https://issues.apache.org/jira/browse/FLINK-5780 Project: Flink Issue Type: Sub-task Components: Core, Documentation Reporter: Ufuk Celebi Assignee: Ufuk Celebi The {{ConfigOption}} type is meant to replace the flat {{ConfigConstants}}. As part of automating the generation of a docs config page we need to extend {{ConfigOption}} with description fields. From the ML discussion, these could be: {code} void shortDescription(String); void longDescription(String); {code} In practice, the description string should contain HTML/Markdown. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
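The proposed fields could also be written in a fluent style so that option declarations read as one expression. The following is a sketch under that assumption; the class name and builder style are illustrative and do not match the actual Flink API.

```java
// Sketch only: a pared-down ConfigOption with the proposed description
// fields. The real class has keys, defaults, deprecated keys, and more.
public class ConfigOptionSketch<T> {

    private final String key;
    private final T defaultValue;
    private String shortDescription = "";
    private String longDescription = "";

    public ConfigOptionSketch(String key, T defaultValue) {
        this.key = key;
        this.defaultValue = defaultValue;
    }

    // Returning 'this' keeps declarations readable as one chained call.
    public ConfigOptionSketch<T> shortDescription(String text) {
        this.shortDescription = text;
        return this;
    }

    public ConfigOptionSketch<T> longDescription(String text) {
        this.longDescription = text;
        return this;
    }

    public String key() { return key; }
    public T defaultValue() { return defaultValue; }
    public String shortDescriptionText() { return shortDescription; }
    public String longDescriptionText() { return longDescription; }
}
```

A docs generator could then iterate over such instances and emit one HTML table row per option.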
[jira] [Created] (FLINK-5779) Auto generate configuration docs
Ufuk Celebi created FLINK-5779: -- Summary: Auto generate configuration docs Key: FLINK-5779 URL: https://issues.apache.org/jira/browse/FLINK-5779 Project: Flink Issue Type: Improvement Components: Documentation Reporter: Ufuk Celebi Assignee: Ufuk Celebi As per discussion on the mailing list we need to improve on the configuration documentation page (http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/Discuss-Organizing-Documentation-for-Configuration-Options-td15773.html). We decided to try to (semi-)automate this in order to not get out of sync in the future. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (FLINK-5778) Split FileStateHandle into fileName and basePath
Ufuk Celebi created FLINK-5778: -- Summary: Split FileStateHandle into fileName and basePath Key: FLINK-5778 URL: https://issues.apache.org/jira/browse/FLINK-5778 Project: Flink Issue Type: Sub-task Components: State Backends, Checkpointing Reporter: Ufuk Celebi Assignee: Ufuk Celebi Store the statePath as a basePath and a fileName and allow overwriting the basePath. We cannot overwrite the base path as long as the state handle is still in flight and not persisted. Otherwise we risk a resource leak. We need this in order to be able to relocate savepoints. {code} interface RelativeBaseLocationStreamStateHandle { void clearBaseLocation(); void setBaseLocation(String baseLocation); } {code} FileStateHandle should implement this and the SavepointSerializer should forward the calls when a savepoint is stored or loaded: clear before store and set after load. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (FLINK-5777) Pass savepoint information to CheckpointingOperation
Ufuk Celebi created FLINK-5777: -- Summary: Pass savepoint information to CheckpointingOperation Key: FLINK-5777 URL: https://issues.apache.org/jira/browse/FLINK-5777 Project: Flink Issue Type: Sub-task Components: State Backends, Checkpointing Reporter: Ufuk Celebi Assignee: Ufuk Celebi In order to make savepoints self contained in a single directory, we need to pass some information to {{StreamTask#CheckpointingOperation}}. I propose to extend the {{CheckpointMetaData}} for this. We currently have some overlap with CheckpointMetaData, CheckpointMetrics, and manually passed checkpoint ID and checkpoint timestamps. We should restrict CheckpointMetaData to the integral meta data that needs to be passed to StreamTask#CheckpointingOperation. This means that we move the CheckpointMetrics out of the CheckpointMetaData and the BarrierBuffer/BarrierTracker create CheckpointMetrics separately and send it back with the acknowledge message. CheckpointMetaData should be extended with the following properties: - boolean isSavepoint - String targetDirectory There are two code paths that lead to the CheckpointingOperation: 1. From CheckpointCoordinator via RPC to StreamTask#triggerCheckpoint - Execution#triggerCheckpoint(long, long) => triggerCheckpoint(CheckpointMetaData) - TaskManagerGateway#triggerCheckpoint(ExecutionAttemptID, JobID, long, long) => TaskManagerGateway#triggerCheckpoint(ExecutionAttemptID, JobID, CheckpointMetaData) - Task#triggerCheckpointBarrier(long, long) => Task#triggerCheckpointBarrier(CheckpointMetaData) 2. 
From intermediate streams via the CheckpointBarrier to StreamTask#triggerCheckpointOnBarrier - triggerCheckpointOnBarrier(CheckpointMetaData) => triggerCheckpointOnBarrier(CheckpointMetaData, CheckpointMetrics) - CheckpointBarrier(long, long) => CheckpointBarrier(CheckpointMetaData) - AcknowledgeCheckpoint(CheckpointMetaData) => AcknowledgeCheckpoint(long, CheckpointMetrics) The state backends provide another stream factory that is called in CheckpointingOperation when the meta data indicates savepoint. The state backends can choose whether they return the regular checkpoint stream factory in that case or a special one for savepoints. That way backends that don’t checkpoint to a file system can special case savepoints easily. - FsStateBackend: return special FsCheckpointStreamFactory with different directory layout - MemoryStateBackend: return regular checkpoint stream factory (MemCheckpointStreamFactory) => The _metadata file will contain all state as the state handles are part of it -- This message was sent by Atlassian JIRA (v6.3.15#6346)
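The proposed extension of {{CheckpointMetaData}} can be sketched as follows. Field and accessor names follow the two bullet points above; everything else is illustrative, and the real class carries more state (metrics were proposed to be moved out).

```java
// Sketch of the proposed CheckpointMetaData after the split: integral
// meta data only, extended with the savepoint flag and target directory.
public class CheckpointMetaDataSketch {

    private final long checkpointId;
    private final long timestamp;
    private final boolean isSavepoint;
    private final String targetDirectory; // null for regular checkpoints

    public CheckpointMetaDataSketch(long checkpointId, long timestamp,
                                    boolean isSavepoint, String targetDirectory) {
        this.checkpointId = checkpointId;
        this.timestamp = timestamp;
        this.isSavepoint = isSavepoint;
        this.targetDirectory = targetDirectory;
    }

    // The state backend checks this to decide whether to return the
    // regular checkpoint stream factory or a savepoint-specific one.
    public boolean isSavepoint() { return isSavepoint; }

    public String getTargetDirectory() { return targetDirectory; }
    public long getCheckpointId() { return checkpointId; }
    public long getTimestamp() { return timestamp; }
}
```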
[jira] [Created] (FLINK-5765) Add noindex meta tag for older docs
Ufuk Celebi created FLINK-5765: -- Summary: Add noindex meta tag for older docs Key: FLINK-5765 URL: https://issues.apache.org/jira/browse/FLINK-5765 Project: Flink Issue Type: Improvement Components: Documentation Reporter: Ufuk Celebi Priority: Minor We should add the noindex meta tags (https://support.google.com/webmasters/answer/93710?hl=en_topic=4598466) to doc versions that are not stable and not the latest snapshot, e.g. all docs <= 1.1. There are some config values in the {{_config.yml}} file that we can use to do this automatically. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (FLINK-5764) Don't serve ancient docs
Ufuk Celebi created FLINK-5764: -- Summary: Don't serve ancient docs Key: FLINK-5764 URL: https://issues.apache.org/jira/browse/FLINK-5764 Project: Flink Issue Type: Improvement Components: Documentation Reporter: Ufuk Celebi Priority: Minor We serve very old doc versions that are quite popular when searching for Flink. An example from a discussion on the mailing list: Here's a concrete example. Do a google search for "flink hello world". The first three results are: https://ci.apache.org/projects/flink/flink-docs-release-0.7/java_api_quickstart.html https://ci.apache.org/projects/flink/flink-docs-release-0.8/scala_api_quickstart.html https://ci.apache.org/projects/flink/flink-docs-release-1.2/.../java_api_quickstart.html (http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Remove-old-lt-1-0-documentation-from-google-td11541.html) We should remove these old versions and let them redirect to the latest stable docs. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (FLINK-5745) Set uncaught exception handler for Netty threads
Ufuk Celebi created FLINK-5745: -- Summary: Set uncaught exception handler for Netty threads Key: FLINK-5745 URL: https://issues.apache.org/jira/browse/FLINK-5745 Project: Flink Issue Type: Improvement Components: Network Reporter: Ufuk Celebi Priority: Minor We pass a thread factory for the Netty event loop threads (see {{NettyServer}} and {{NettyClient}}), but don't set an uncaught exception handler. Let's add a JVM-terminating handler that exits the process in case of fatal errors. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
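One way to do this is to wrap the existing thread factory so every created thread gets a handler installed. This is a sketch, not the actual NettyServer/NettyClient wiring: a real fatal handler would log the error and terminate the JVM, while this one only records the failure so the example stays side-effect free.

```java
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.atomic.AtomicBoolean;

// Sketch: delegate thread creation and install an uncaught exception
// handler on every thread the wrapped factory produces.
public class FatalExitThreadFactory implements ThreadFactory {

    static final AtomicBoolean fatalErrorSeen = new AtomicBoolean(false);

    private final ThreadFactory delegate;

    public FatalExitThreadFactory(ThreadFactory delegate) {
        this.delegate = delegate;
    }

    @Override
    public Thread newThread(Runnable r) {
        Thread t = delegate.newThread(r);
        t.setUncaughtExceptionHandler((thread, error) -> {
            fatalErrorSeen.set(true);
            // A production version would log 'error' and then terminate
            // the JVM (e.g. Runtime.getRuntime().halt(-1)), because the
            // event loop thread is no longer usable after this point.
        });
        return t;
    }
}
```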
[jira] [Created] (FLINK-5731) Split up CI builds
Ufuk Celebi created FLINK-5731: -- Summary: Split up CI builds Key: FLINK-5731 URL: https://issues.apache.org/jira/browse/FLINK-5731 Project: Flink Issue Type: Improvement Components: Build System, Tests Reporter: Ufuk Celebi Assignee: Robert Metzger Priority: Critical Test builds regularly time out because we are hitting the Travis 50 min limit. Previously, we worked around this by splitting up the tests into groups. I think we have to split them further. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (FLINK-5675) Improve QueryableStateClient API
Ufuk Celebi created FLINK-5675: -- Summary: Improve QueryableStateClient API Key: FLINK-5675 URL: https://issues.apache.org/jira/browse/FLINK-5675 Project: Flink Issue Type: Improvement Components: Queryable State Reporter: Ufuk Celebi Issue to track queryable state client API changes. This is created in order to group some of the queryable state issues that we created for 1.2 but couldn't address. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-5670) Local RocksDB directories not cleaned up
Ufuk Celebi created FLINK-5670: -- Summary: Local RocksDB directories not cleaned up Key: FLINK-5670 URL: https://issues.apache.org/jira/browse/FLINK-5670 Project: Flink Issue Type: Bug Components: State Backends, Checkpointing Reporter: Ufuk Celebi Priority: Minor After cancelling a job with a RocksDB backend all files are properly cleaned up, but the parent directories still exist and are empty: {code} 859546fec3dac36bb9fcc8cbdd4e291e +- StreamFlatMap_3_0 +- StreamFlatMap_3_3 +- StreamFlatMap_3_4 +- StreamFlatMap_3_5 +- StreamFlatMap_3_6 {code} The number of empty folders varies between runs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-5666) Blob files are not cleaned up from ZK storage directory
Ufuk Celebi created FLINK-5666: -- Summary: Blob files are not cleaned up from ZK storage directory Key: FLINK-5666 URL: https://issues.apache.org/jira/browse/FLINK-5666 Project: Flink Issue Type: Bug Components: State Backends, Checkpointing Reporter: Ufuk Celebi When running a job with HA in a standalone cluster, the blob files are not cleaned up from the ZooKeeper storage directory. {{:zkStorageDir/blob/cache/blob_:rand}} [~NicoK] Have you seen such behaviour while refactoring the blob server? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-5665) Lingering files after failed checkpoint
Ufuk Celebi created FLINK-5665: -- Summary: Lingering files after failed checkpoint Key: FLINK-5665 URL: https://issues.apache.org/jira/browse/FLINK-5665 Project: Flink Issue Type: Bug Components: State Backends, Checkpointing Reporter: Ufuk Celebi When I discovered FLINK-5663, the first two checkpoints did not go through because of the exception at the task manager and I got a bunch of zero byte files: {code} total 0 drwxr-xr-x 29 uce staff 986B Jan 26 17:29 . drwxr-xr-x 6 uce staff 204B Jan 26 17:45 .. -rw-r--r-- 1 uce staff 0B Jan 26 17:29 03b9af6c-cff4-44f5-b235-20a1acae57c2 -rw-r--r-- 1 uce staff 0B Jan 26 17:29 091cf308-a15b-42b0-9987-25932b3a7dc9 -rw-r--r-- 1 uce staff 0B Jan 26 17:29 1f4850e2-bd46-4ec4-9540-6f22631b7b60 -rw-r--r-- 1 uce staff 0B Jan 26 17:29 24195811-87cc-4cf9-b67a-c842aa469590 -rw-r--r-- 1 uce staff 0B Jan 26 17:29 2d03efcd-2cbb-4f1f-88c7-2977215890ca -rw-r--r-- 1 uce staff 0B Jan 26 17:29 2d9ede49-fd37-479e-9973-6af3030464ef -rw-r--r-- 1 uce staff 0B Jan 26 17:29 39f5ef6e-e2c6-4bad-aeb4-0579b6d06d51 -rw-r--r-- 1 uce staff 0B Jan 26 17:29 487eb08f-60d1-44a3-92eb-ad00552047b0 -rw-r--r-- 1 uce staff 0B Jan 26 17:29 50702820-2006-43d4-9528-c5d37652c782 -rw-r--r-- 1 uce staff 0B Jan 26 17:29 515b5222-f996-4c1f-9f1f-1086040531dc -rw-r--r-- 1 uce staff 0B Jan 26 17:29 647826ae-64e8-4407-94d8-c7f8cc629e78 -rw-r--r-- 1 uce staff 0B Jan 26 17:29 6a211145-74a0-4711-bcba-d55e3ce2516d -rw-r--r-- 1 uce staff 0B Jan 26 17:29 6c99c24a-90ed-45ad-928a-c78cb677646c -rw-r--r-- 1 uce staff 0B Jan 26 17:29 752045e3-1db3-4b4e-8149-15f619c6ecc2 -rw-r--r-- 1 uce staff 0B Jan 26 17:29 774a364b-43a3-4f29-a127-ea54f8ccb093 -rw-r--r-- 1 uce staff 0B Jan 26 17:29 7d93ecbe-5138-432c-bdcc-dd9b8bfdd92e -rw-r--r-- 1 uce staff 0B Jan 26 17:29 7f664b11-afb0-45ca-9371-0c28c59d6e15 -rw-r--r-- 1 uce staff 0B Jan 26 17:29 8fb1f4d6-3e32-46ff-af40-5d9c2193bc96 -rw-r--r-- 1 uce staff 0B Jan 26 17:29 94d41bdc-73b4-45ca-b37c-9d8c46114530 -rw-r--r-- 1 uce staff 0B Jan 26 17:29 
af8ba3b6-036c-4c19-a78a-922c6b9dde35 -rw-r--r-- 1 uce staff 0B Jan 26 17:29 b0c5ba31-677c-4bf7-a82b-b2b3df2f8d43 -rw-r--r-- 1 uce staff 0B Jan 26 17:29 ba748b35-d6b6-4734-b28d-f82917aab54c -rw-r--r-- 1 uce staff 0B Jan 26 17:29 be3ea3c6-7ebb-4a07-a48d-5c5d3e2fc1ea -rw-r--r-- 1 uce staff 0B Jan 26 17:29 e0cb285c-ec38-4390-ae98-bec95c520743 -rw-r--r-- 1 uce staff 0B Jan 26 17:29 ea14095a-3527-4762-888a-74308010a702 -rw-r--r-- 1 uce staff 0B Jan 26 17:29 eac774d5-1564-41c1-9015-4c64701ff4bf -rw-r--r-- 1 uce staff 0B Jan 26 17:29 ebf3e9c7-cd96-4f32-a4c0-2ec3d8f44e60 {code} {code} total 0 drwxr-xr-x 29 uce staff 986B Jan 26 17:29 . drwxr-xr-x 6 uce staff 204B Jan 26 17:45 .. -rw-r--r-- 1 uce staff 0B Jan 26 17:29 00cb4c5b-b2d2-4934-9f90-e68606038bbc -rw-r--r-- 1 uce staff 0B Jan 26 17:29 01ca2c72-b861-4d64-bee9-3b72f10f91cc -rw-r--r-- 1 uce staff 0B Jan 26 17:29 0a27c89b-cb7f-433f-9980-fda6bf8265cc -rw-r--r-- 1 uce staff 0B Jan 26 17:29 0bff0fe5-9b46-42f6-94be-a0f3bfa4c77d -rw-r--r-- 1 uce staff 0B Jan 26 17:29 0c8c85fc-71fe-4feb-bb06-d4fb5004281b -rw-r--r-- 1 uce staff 0B Jan 26 17:29 2506b73f-790a-4a3c-b8b9-8296766760ef -rw-r--r-- 1 uce staff 0B Jan 26 17:29 260fc17d-1de3-4d71-ba99-ce9b4aa44f19 -rw-r--r-- 1 uce staff 0B Jan 26 17:29 45399fab-c7df-4c21-9b2f-e1047a82aa74 -rw-r--r-- 1 uce staff 0B Jan 26 17:29 4e67ff32-2aea-4c6f-a061-7152420684eb -rw-r--r-- 1 uce staff 0B Jan 26 17:29 4ffe9e50-e7c9-4e52-8725-8d63f5c50d17 -rw-r--r-- 1 uce staff 0B Jan 26 17:29 57e30213-45af-4645-83e3-109ab7b52829 -rw-r--r-- 1 uce staff 0B Jan 26 17:29 5ab0676d-5af6-4229-8af3-2206a4b4c263 -rw-r--r-- 1 uce staff 0B Jan 26 17:29 6240e59a-d7b0-468f-b228-9f3c1318c98f -rw-r--r-- 1 uce staff 0B Jan 26 17:29 63c099bc-a30e-4780-a390-f007c468e50e -rw-r--r-- 1 uce staff 0B Jan 26 17:29 76d35d1a-1164-4455-80a8-7a07ffa6f26b -rw-r--r-- 1 uce staff 0B Jan 26 17:29 7889bed2-a1e3-457a-996a-30116ce16983 -rw-r--r-- 1 uce staff 0B Jan 26 17:29 7c8caa1d-0cd9-4e0e-ac2d-31cc8df440f7 -rw-r--r-- 1 uce staff 0B 
Jan 26 17:29 8eb2849e-5352-433f-ba3c-86f57d6db9a1 -rw-r--r-- 1 uce staff 0B Jan 26 17:29 8f1884c5-f052-4099-abbe-f6895ffd01f0 -rw-r--r-- 1 uce staff 0B Jan 26 17:29 a19973c4-6a74-44c8-b2f7-b72fe2b68f38 -rw-r--r-- 1 uce staff 0B Jan 26 17:29 a7552360-feae-4b85-82de-0cfd6fe8e954 -rw-r--r-- 1 uce staff 0B Jan 26 17:29 bb241a71-3fa8-476b-97cc-0bb85708c811 -rw-r--r-- 1 uce staff 0B Jan 26 17:29 c9a7f073-f8ba-460d-b34b-9ead355e168a -rw-r--r-- 1 uce staff
[jira] [Created] (FLINK-5664) RocksDBBackend logging is noisy
Ufuk Celebi created FLINK-5664: -- Summary: RocksDBBackend logging is noisy Key: FLINK-5664 URL: https://issues.apache.org/jira/browse/FLINK-5664 Project: Flink Issue Type: Improvement Components: State Backends, Checkpointing Reporter: Ufuk Celebi When running a job with RocksDB and a low checkpointing interval, the logs are flooded with: {code} 2017-01-26 17:29:39,345 INFO org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend - Asynchronous RocksDB snapshot (File Stream Factory @ file:/Users/uce/Desktop/1-2-testing/fs/83889867a493a1dc80f6c588c071b679, synchronous part) in thread Thread[Flat Map -> Sink: Unnamed (3/8),5,Flink Task Threads] took 0 ms. 2017-01-26 17:29:39,398 INFO org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend - Asynchronous RocksDB snapshot (File Stream Factory @ file:/Users/uce/Desktop/1-2-testing/fs/83889867a493a1dc80f6c588c071b679, asynchronous part) in thread Thread[pool-88-thread-1,5,Flink Task Threads] took 53 ms. {code} That is two log lines per stateful task per task manager per checkpoint. I can see that this is helpful for debugging, but maybe too much by default. Should we decrease the log level to debug? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-5663) Checkpoint fails because of closed registry
Ufuk Celebi created FLINK-5663: -- Summary: Checkpoint fails because of closed registry Key: FLINK-5663 URL: https://issues.apache.org/jira/browse/FLINK-5663 Project: Flink Issue Type: Bug Components: State Backends, Checkpointing Reporter: Ufuk Celebi While testing the 1.2.0 release I got the following Exception: {code} 2017-01-26 17:29:20,602 INFO org.apache.flink.runtime.taskmanager.Task - Source: Custom Source (3/8) (2dbce778c4e53a39dec3558e868ceef4) switched from RUNNING to FAILED. java.lang.Exception: Error while triggering checkpoint 2 for Source: Custom Source (3/8) at org.apache.flink.runtime.taskmanager.Task$3.run(Task.java:1117) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: java.lang.Exception: Could not perform checkpoint 2 for operator Source: Custom Source (3/8). at org.apache.flink.streaming.runtime.tasks.StreamTask.triggerCheckpoint(StreamTask.java:533) at org.apache.flink.runtime.taskmanager.Task$3.run(Task.java:1108) ... 5 more Caused by: java.lang.Exception: Could not complete snapshot 2 for operator Source: Custom Source (3/8). 
at org.apache.flink.streaming.api.operators.AbstractStreamOperator.snapshotState(AbstractStreamOperator.java:372) at org.apache.flink.streaming.runtime.tasks.StreamTask$CheckpointingOperation.checkpointStreamOperator(StreamTask.java:1116) at org.apache.flink.streaming.runtime.tasks.StreamTask$CheckpointingOperation.executeCheckpointing(StreamTask.java:1052) at org.apache.flink.streaming.runtime.tasks.StreamTask.checkpointState(StreamTask.java:640) at org.apache.flink.streaming.runtime.tasks.StreamTask.performCheckpoint(StreamTask.java:585) at org.apache.flink.streaming.runtime.tasks.StreamTask.triggerCheckpoint(StreamTask.java:528) ... 6 more Caused by: java.io.IOException: Could not flush and close the file system output stream to file:/Users/uce/Desktop/1-2-testing/fs/83889867a493a1dc80f6c588c071b679/chk-2/e4415d0d-719c-48df-91a9-3171ba468152 in order to obtain the stream state handle at org.apache.flink.runtime.state.filesystem.FsCheckpointStreamFactory$FsCheckpointStateOutputStream.closeAndGetHandle(FsCheckpointStreamFactory.java:333) at org.apache.flink.runtime.state.DefaultOperatorStateBackend.snapshot(DefaultOperatorStateBackend.java:200) at org.apache.flink.streaming.api.operators.AbstractStreamOperator.snapshotState(AbstractStreamOperator.java:357) ... 11 more Caused by: java.io.IOException: Could not open output stream for state backend at org.apache.flink.runtime.state.filesystem.FsCheckpointStreamFactory$FsCheckpointStateOutputStream.createStream(FsCheckpointStreamFactory.java:368) at org.apache.flink.runtime.state.filesystem.FsCheckpointStreamFactory$FsCheckpointStateOutputStream.flush(FsCheckpointStreamFactory.java:225) at org.apache.flink.runtime.state.filesystem.FsCheckpointStreamFactory$FsCheckpointStateOutputStream.closeAndGetHandle(FsCheckpointStreamFactory.java:305) ... 13 more Caused by: java.io.IOException: Cannot register Closeable, registry is already closed. Closing argument. 
at org.apache.flink.util.AbstractCloseableRegistry.registerClosable(AbstractCloseableRegistry.java:63) at org.apache.flink.core.fs.ClosingFSDataOutputStream.wrapSafe(ClosingFSDataOutputStream.java:99) at org.apache.flink.core.fs.SafetyNetWrapperFileSystem.create(SafetyNetWrapperFileSystem.java:123) at org.apache.flink.runtime.state.filesystem.FsCheckpointStreamFactory$FsCheckpointStateOutputStream.createStream(FsCheckpointStreamFactory.java:359) ... 15 more {code} The job recovered and kept running. [~stefanrichte...@gmail.com] Can this be a race with the closable registry? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-5618) Add queryable state documentation
Ufuk Celebi created FLINK-5618: -- Summary: Add queryable state documentation Key: FLINK-5618 URL: https://issues.apache.org/jira/browse/FLINK-5618 Project: Flink Issue Type: Improvement Components: Documentation Reporter: Ufuk Celebi Add docs about how to use queryable state. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-5611) Add QueryableStateException type
Ufuk Celebi created FLINK-5611: -- Summary: Add QueryableStateException type Key: FLINK-5611 URL: https://issues.apache.org/jira/browse/FLINK-5611 Project: Flink Issue Type: Improvement Components: Queryable State Reporter: Ufuk Celebi Assignee: Ufuk Celebi Priority: Minor We currently have exceptions like {{UnknownJobManager}} that should be subtypes of a to-be-introduced {{QueryableStateException}}. Right now, they extend checked and unchecked Exceptions. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-5609) Add last update time to docs
Ufuk Celebi created FLINK-5609: -- Summary: Add last update time to docs Key: FLINK-5609 URL: https://issues.apache.org/jira/browse/FLINK-5609 Project: Flink Issue Type: Improvement Components: Documentation Reporter: Ufuk Celebi Assignee: Ufuk Celebi Add a small text to the start page stating when the docs were last updated. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-5607) Move location lookup retry out of KvStateLocationLookupService
Ufuk Celebi created FLINK-5607: -- Summary: Move location lookup retry out of KvStateLocationLookupService Key: FLINK-5607 URL: https://issues.apache.org/jira/browse/FLINK-5607 Project: Flink Issue Type: Improvement Components: Queryable State Reporter: Ufuk Celebi Assignee: Ufuk Celebi If a state location lookup fails because of an {{UnknownJobManager}}, the lookup service will automagically retry this. I think it's better to move this out of the lookup service and have the retry handled outside by the caller. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-5606) Remove magic number in key and namespace serialization
Ufuk Celebi created FLINK-5606: -- Summary: Remove magic number in key and namespace serialization Key: FLINK-5606 URL: https://issues.apache.org/jira/browse/FLINK-5606 Project: Flink Issue Type: Improvement Components: Queryable State Reporter: Ufuk Celebi Assignee: Ufuk Celebi Priority: Minor The serialized key and namespace for state queries contains a magic number between the key and namespace: {{key|42|namespace}}. This was for historical reasons in order to skip deserialization of the key and namespace for our old {{RocksDBStateBackend}} which used the same format. This has now been superseded by the keygroup aware state backends and there is no point in doing this. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
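The legacy layout versus the proposed one can be illustrated in a few lines. The helper names below are made up for illustration; the actual serialization lives in the queryable state code.

```java
// Sketch of the wire layouts discussed above.
public class KeyNamespaceLayout {

    static final byte MAGIC_NUMBER = 42;

    // Legacy format: key | 42 | namespace
    static byte[] withMagicNumber(byte[] key, byte[] namespace) {
        byte[] out = new byte[key.length + 1 + namespace.length];
        System.arraycopy(key, 0, out, 0, key.length);
        out[key.length] = MAGIC_NUMBER; // separator with no remaining purpose
        System.arraycopy(namespace, 0, out, key.length + 1, namespace.length);
        return out;
    }

    // Proposed format: key | namespace
    static byte[] withoutMagicNumber(byte[] key, byte[] namespace) {
        byte[] out = new byte[key.length + namespace.length];
        System.arraycopy(key, 0, out, 0, key.length);
        System.arraycopy(namespace, 0, out, key.length, namespace.length);
        return out;
    }
}
```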
[jira] [Created] (FLINK-5605) Make KvStateRequestSerializer package private
Ufuk Celebi created FLINK-5605: -- Summary: Make KvStateRequestSerializer package private Key: FLINK-5605 URL: https://issues.apache.org/jira/browse/FLINK-5605 Project: Flink Issue Type: Improvement Components: Queryable State Reporter: Ufuk Celebi Priority: Minor From early users I've seen that many people use the KvStateRequestSerializer in their programs. This was actually meant as an internal package to be used by the client and server for internal message serialization. I vote to make this package private and create an explicit {{QueryableStateClientUtil}} for user serialization. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-5604) Replace QueryableStateClient constructor
Ufuk Celebi created FLINK-5604: -- Summary: Replace QueryableStateClient constructor Key: FLINK-5604 URL: https://issues.apache.org/jira/browse/FLINK-5604 Project: Flink Issue Type: Improvement Components: Queryable State Reporter: Ufuk Celebi Assignee: Ufuk Celebi The {{QueryableStateClient}} constructor expects a configuration object which makes it very hard for users to see what's expected to be there and what not. I propose to split this constructor up and add some static helper for the common cases. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
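The proposed split can be sketched as follows. The class and factory names are hypothetical; the point is that static helpers make the required settings explicit instead of hiding them inside an opaque configuration object.

```java
// Sketch of the proposed API: a private constructor plus static factories
// for the common cases.
public class QueryableStateClientSketch {

    private final String jobManagerHost;
    private final int jobManagerPort;

    private QueryableStateClientSketch(String host, int port) {
        this.jobManagerHost = host;
        this.jobManagerPort = port;
    }

    // Common case: connect to a known JobManager address. Further static
    // helpers could cover HA setups or defaults from a config file.
    public static QueryableStateClientSketch forJobManager(String host, int port) {
        return new QueryableStateClientSketch(host, port);
    }

    public String jobManagerHost() { return jobManagerHost; }
    public int jobManagerPort() { return jobManagerPort; }
}
```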
[jira] [Created] (FLINK-5603) Use Flink's futures in QueryableStateClient
Ufuk Celebi created FLINK-5603: -- Summary: Use Flink's futures in QueryableStateClient Key: FLINK-5603 URL: https://issues.apache.org/jira/browse/FLINK-5603 Project: Flink Issue Type: Improvement Components: Queryable State Reporter: Ufuk Celebi Assignee: Ufuk Celebi Priority: Minor The current {{QueryableStateClient}} exposes Scala's Futures as the return type for queries. Since we are trying to get away from hard Scala dependencies in the current master, we should proactively replace this with Flink's Future interface. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-5602) Migration with RocksDB job led to NPE for next checkpoint
Ufuk Celebi created FLINK-5602: -- Summary: Migration with RocksDB job led to NPE for next checkpoint Key: FLINK-5602 URL: https://issues.apache.org/jira/browse/FLINK-5602 Project: Flink Issue Type: Bug Reporter: Ufuk Celebi When migrating a job with RocksDB I got the following exception when the next checkpoint was triggered. This only happened once and I could not reproduce it since. [~stefanrichte...@gmail.com] Maybe we can look over the code and check what could have failed here? Unfortunately, I don't have more of the stack trace available. I don't think this will be very helpful, will it? {code} at org.apache.flink.util.Preconditions.checkNotNull(Preconditions.java:58) at org.apache.flink.runtime.state.KeyedBackendSerializationProxy$StateMetaInfo.(KeyedBackendSerializationProxy.java:126) at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBSnapshotOperation.writeKVStateMetaData(RocksDBKeyedStateBackend.java:471) at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBSnapshotOperation.writeDBSnapshot(RocksDBKeyedStateBackend.java:382) at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$1.performOperation(RocksDBKeyedStateBackend.java:280) at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$1.performOperation(RocksDBKeyedStateBackend.java:262) at org.apache.flink.runtime.io.async.AbstractAsyncIOCallable.call(AbstractAsyncIOCallable.java:72) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at org.apache.flink.util.FutureUtil.runIfNotDoneAndGet(FutureUtil.java:37) ... 6 more {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-5600) Improve error message when triggering savepoint without specified directory
Ufuk Celebi created FLINK-5600: -- Summary: Improve error message when triggering savepoint without specified directory Key: FLINK-5600 URL: https://issues.apache.org/jira/browse/FLINK-5600 Project: Flink Issue Type: Improvement Components: State Backends, Checkpointing Reporter: Ufuk Celebi Priority: Minor When triggering a savepoint w/o specifying a custom target directory or having configured a default directory, we get a quite long stack trace: {code} java.lang.Exception: Failed to trigger savepoint at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1.applyOrElse(JobManager.scala:801) at scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33) at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33) at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25) at org.apache.flink.runtime.LeaderSessionMessageFilter$$anonfun$receive$1.applyOrElse(LeaderSessionMessageFilter.scala:36) at scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33) at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33) at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25) at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:33) at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:28) at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118) at org.apache.flink.runtime.LogMessages$$anon$1.applyOrElse(LogMessages.scala:28) at akka.actor.Actor$class.aroundReceive(Actor.scala:467) at org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:118) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516) at akka.actor.ActorCell.invoke(ActorCell.scala:487) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238) at akka.dispatch.Mailbox.run(Mailbox.scala:220) at 
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:397) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) Caused by: java.lang.IllegalStateException: No savepoint directory configured. You can either specify a directory when triggering this savepoint or configure a cluster-wide default via key 'state.savepoints.dir'. at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1.applyOrElse(JobManager.scala:764) ... 22 more {code} This is already quite good, because the Exception says what can be done to work around this problem, but we can make it even better by handling this error in the client and printing a more explicit message. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-5599) State interface docs often refer to keyed state only
Ufuk Celebi created FLINK-5599: -- Summary: State interface docs often refer to keyed state only Key: FLINK-5599 URL: https://issues.apache.org/jira/browse/FLINK-5599 Project: Flink Issue Type: Improvement Components: State Backends, Checkpointing Reporter: Ufuk Celebi Priority: Minor The JavaDocs of the {{State}} interface (and related classes) often mention keyed state only as the state interface was only exposed for keyed state until Flink 1.1. With the new {{CheckpointedFunction}} interface, this has changed and the docs should be adjusted accordingly. Would be nice to address this with 1.2.0 so that the JavaDocs are updated for users. [~stefanrichte...@gmail.com] or [~aljoscha] maybe you can have a look at this briefly? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-5587) AsyncWaitOperatorTest timed out on Travis
Ufuk Celebi created FLINK-5587: -- Summary: AsyncWaitOperatorTest timed out on Travis Key: FLINK-5587 URL: https://issues.apache.org/jira/browse/FLINK-5587 Project: Flink Issue Type: Test Reporter: Ufuk Celebi The Maven watchdog script cancelled the build and printed a stack trace for {{AsyncWaitOperatorTest.testOperatorChainWithProcessingTime(AsyncWaitOperatorTest.java:379)}}. https://s3.amazonaws.com/archive.travis-ci.org/jobs/192441719/log.txt -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-5574) Add checkpoint statistics docs
Ufuk Celebi created FLINK-5574: -- Summary: Add checkpoint statistics docs Key: FLINK-5574 URL: https://issues.apache.org/jira/browse/FLINK-5574 Project: Flink Issue Type: Improvement Components: Documentation Reporter: Ufuk Celebi Assignee: Ufuk Celebi Priority: Minor Add docs about the current state of checkpoint monitoring. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-5560) Header in checkpoint stats summary misaligned
Ufuk Celebi created FLINK-5560: -- Summary: Header in checkpoint stats summary misaligned Key: FLINK-5560 URL: https://issues.apache.org/jira/browse/FLINK-5560 Project: Flink Issue Type: Bug Components: Webfrontend Reporter: Ufuk Celebi Assignee: Ufuk Celebi Priority: Minor The checkpoint summary stats table header line is misaligned. The first and second head columns need to be swapped. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-5556) BarrierBuffer resets bytes written on spiller roll over
Ufuk Celebi created FLINK-5556: -- Summary: BarrierBuffer resets bytes written on spiller roll over Key: FLINK-5556 URL: https://issues.apache.org/jira/browse/FLINK-5556 Project: Flink Issue Type: Bug Components: State Backends, Checkpointing Reporter: Ufuk Celebi Assignee: Ufuk Celebi Priority: Minor When rolling over a spilled sequence of buffers, the tracked bytes written are reset to 0. They are reported to the checkpoint listener right after this operation, which results in the reported buffered bytes always being 0. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-5510) Replace Scala Future with FlinkFuture in QueryableStateClient
Ufuk Celebi created FLINK-5510: -- Summary: Replace Scala Future with FlinkFuture in QueryableStateClient Key: FLINK-5510 URL: https://issues.apache.org/jira/browse/FLINK-5510 Project: Flink Issue Type: Improvement Components: Queryable State Reporter: Ufuk Celebi Priority: Minor The entry point for queryable state users is the {{QueryableStateClient}} which returns query results via Scala Futures. Since merging the initial version of QueryableState we have introduced the FlinkFuture wrapper type in order to not expose our Scala dependency via the API. Since APIs tend to stick around longer than expected, it might be worthwhile to port the exposed QueryableStateClient interface to use the FlinkFuture. Early users can still get the Scala Future via FlinkFuture#getScalaFuture(). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-5509) Replace QueryableStateClient keyHashCode argument
Ufuk Celebi created FLINK-5509: -- Summary: Replace QueryableStateClient keyHashCode argument Key: FLINK-5509 URL: https://issues.apache.org/jira/browse/FLINK-5509 Project: Flink Issue Type: Improvement Components: Queryable State Reporter: Ufuk Celebi Priority: Minor When going over the low level QueryableStateClient with [~NicoK] we noticed that the key hashCode argument can be confusing to users: {code} Future getKvState( JobID jobId, String name, int keyHashCode, byte[] serializedKeyAndNamespace) {code} The {{keyHashCode}} argument is the result of calling {{hashCode()}} on the key to look up. This is what is sent to the JobManager in order to look up the location of the key. While pretty straightforward, it is repetitive and possibly confusing. As an alternative we suggest to make the method generic and simply call hashCode on the object ourselves. This way the user just provides the key object. Since there are some early users of the queryable state API already, we would suggest to rename the method in order to provoke a compilation error after upgrading to the actually released 1.2 version. (This would also work without renaming since the hashCode of an Integer (what users currently provide) is the same number, but it would be confusing why it actually works.) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
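A minimal sketch of the proposed generic lookup, assuming the client simply derives the hash from the key object itself. Class and method names are illustrative, not the actual Flink API:

```java
// Hypothetical sketch: instead of an int keyHashCode parameter, the method is
// generic over the key type and the client calls hashCode() internally.
// Names are illustrative only, not the actual Flink API.
public class KeyHashSketch {
    // Proposed replacement for the explicit keyHashCode argument.
    static <K> int deriveKeyHash(K key) {
        return key.hashCode();
    }

    public static void main(String[] args) {
        // Integer.hashCode() returns the int value itself, which is why passing
        // the raw key would "accidentally" match the old keyHashCode argument.
        System.out.println(deriveKeyHash(42));      // prints 42
        System.out.println(deriveKeyHash("count")); // some String hash code
    }
}
```

This also illustrates the remark above: for Integer keys the derived hash equals the value, so the old and new signatures would behave identically even without the proposed rename.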
[jira] [Created] (FLINK-5484) Kryo serialization changed between 1.1 and 1.2
Ufuk Celebi created FLINK-5484: -- Summary: Kryo serialization changed between 1.1 and 1.2 Key: FLINK-5484 URL: https://issues.apache.org/jira/browse/FLINK-5484 Project: Flink Issue Type: Bug Components: Type Serialization System Reporter: Ufuk Celebi I think the way that Kryo serializes data changed between 1.1 and 1.2. I have a generic Object that is serialized as part of a 1.1 savepoint that I cannot resume from with 1.2: {code} org.apache.flink.client.program.ProgramInvocationException: The program execution failed: Job execution failed. at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:427) at org.apache.flink.client.program.StandaloneClusterClient.submitJob(StandaloneClusterClient.java:101) at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:400) at org.apache.flink.streaming.api.environment.StreamContextEnvironment.execute(StreamContextEnvironment.java:68) at org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.execute(StreamExecutionEnvironment.java:1486) at com.dataartisans.DidKryoChange.main(DidKryoChange.java:74) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:528) at org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:419) at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:339) at org.apache.flink.client.CliFrontend.executeProgram(CliFrontend.java:831) at org.apache.flink.client.CliFrontend.run(CliFrontend.java:256) at org.apache.flink.client.CliFrontend.parseParameters(CliFrontend.java:1073) at org.apache.flink.client.CliFrontend$2.call(CliFrontend.java:1120) at 
org.apache.flink.client.CliFrontend$2.call(CliFrontend.java:1117) at org.apache.flink.runtime.security.HadoopSecurityContext$1.run(HadoopSecurityContext.java:43) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548) at org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:40) at org.apache.flink.client.CliFrontend.main(CliFrontend.java:1117) Caused by: org.apache.flink.runtime.client.JobExecutionException: Job execution failed. at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1$$anonfun$applyOrElse$7.apply$mcV$sp(JobManager.scala:900) at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1$$anonfun$applyOrElse$7.apply(JobManager.scala:843) at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1$$anonfun$applyOrElse$7.apply(JobManager.scala:843) at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24) at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24) at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:40) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:397) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) Caused by: java.lang.IllegalStateException: Could not initialize keyed state backend. 
at org.apache.flink.streaming.api.operators.AbstractStreamOperator.initKeyedState(AbstractStreamOperator.java:286) at org.apache.flink.streaming.api.operators.AbstractStreamOperator.initializeState(AbstractStreamOperator.java:199) at org.apache.flink.streaming.runtime.tasks.StreamTask.initializeOperators(StreamTask.java:649) at org.apache.flink.streaming.runtime.tasks.StreamTask.initializeState(StreamTask.java:636) at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:254) at org.apache.flink.runtime.taskmanager.Task.run(Task.java:654) at java.lang.Thread.run(Thread.java:745) Caused by: java.lang.RuntimeException: com.esotericsoftware.kryo.KryoException: Unable to find class: f at
[jira] [Created] (FLINK-5467) Stateless chained tasks set legacy operator state
Ufuk Celebi created FLINK-5467: -- Summary: Stateless chained tasks set legacy operator state Key: FLINK-5467 URL: https://issues.apache.org/jira/browse/FLINK-5467 Project: Flink Issue Type: Bug Components: State Backends, Checkpointing Reporter: Ufuk Celebi I discovered this while trying to rescale a job with a Kafka source with a chained stateless operator. Looking into it, it turns out that this fails because the checkpointed state contains legacy operator state for the chained operator although it is stateless. /cc [~aljoscha] You mentioned that this might be a possible duplicate? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-5466) Make production environment default in gulpfile
Ufuk Celebi created FLINK-5466: -- Summary: Make production environment default in gulpfile Key: FLINK-5466 URL: https://issues.apache.org/jira/browse/FLINK-5466 Project: Flink Issue Type: Improvement Components: Webfrontend Affects Versions: 1.1.4, 1.2.0 Reporter: Ufuk Celebi Currently the default environment set in our gulpfile is development, which leads to very large generated JS files. When building the web UI we apparently forgot to set the environment to production (built via gulp production). Since this is likely to happen again, we should make production the default environment and require selecting development explicitly. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-5448) Fix typo in StateAssignmentOperation Exception
Ufuk Celebi created FLINK-5448: -- Summary: Fix typo in StateAssignmentOperation Exception Key: FLINK-5448 URL: https://issues.apache.org/jira/browse/FLINK-5448 Project: Flink Issue Type: Improvement Components: State Backends, Checkpointing Reporter: Ufuk Celebi Priority: Trivial {code} Cannot restore the latest checkpoint because the operator cbc357ccb763df2852fee8c4fc7d55f2 has non-partitioned state and its parallelism changed. The operatorcbc357ccb763df2852fee8c4fc7d55f2 has parallelism 2 whereas the correspondingstate object has a parallelism of 4 {code} White space is missing in some places. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-5440) Misleading error message when migrating and scaling down from savepoint
Ufuk Celebi created FLINK-5440: -- Summary: Misleading error message when migrating and scaling down from savepoint Key: FLINK-5440 URL: https://issues.apache.org/jira/browse/FLINK-5440 Project: Flink Issue Type: Improvement Components: State Backends, Checkpointing Reporter: Ufuk Celebi Priority: Minor When resuming from a 1.1 savepoint with 1.2 and reducing the parallelism (while correctly setting the max parallelism), the error message says something about a missing operator, which is misleading. Restoring from the same savepoint with the savepoint parallelism works as expected. Instead, the message should state that this kind of operation is not possible. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-5439) Adjust max parallelism when migrating
Ufuk Celebi created FLINK-5439: -- Summary: Adjust max parallelism when migrating Key: FLINK-5439 URL: https://issues.apache.org/jira/browse/FLINK-5439 Project: Flink Issue Type: Improvement Components: State Backends, Checkpointing Reporter: Ufuk Celebi Priority: Minor When migrating from v1 savepoints, which don't have the notion of a max parallelism, the job needs to explicitly set the max parallelism to the parallelism of the savepoint. [~stefanrichte...@gmail.com] If this is not trivially implemented, let's close this as won't fix. -- This message was sent by Atlassian JIRA (v6.3.4#6332)