[jira] [Created] (FLINK-35509) Slack community invite link has expired
Ufuk Celebi created FLINK-35509: --- Summary: Slack community invite link has expired Key: FLINK-35509 URL: https://issues.apache.org/jira/browse/FLINK-35509 Project: Flink Issue Type: Bug Components: Project Website Reporter: Ufuk Celebi The Slack invite link on the website has expired. I've generated a new invite link without expiration here: [https://join.slack.com/t/apache-flink/shared_invite/zt-2k0fdioxx-D0kTYYLh3pPjMu5IItqx3Q] -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (FLINK-35481) Add HISTOGRAM function in SQL & Table API
Ufuk Celebi created FLINK-35481: --- Summary: Add HISTOGRAM function in SQL & Table API Key: FLINK-35481 URL: https://issues.apache.org/jira/browse/FLINK-35481 Project: Flink Issue Type: Improvement Components: Table SQL / API Reporter: Ufuk Celebi Fix For: 1.20.0 Consider adding a HISTOGRAM aggregate function similar to ksqlDB (https://docs.ksqldb.io/en/latest/developer-guide/ksqldb-reference/aggregate-functions/#histogram). -- This message was sent by Atlassian Jira (v8.20.10#820010)
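To make the proposal concrete, here is a minimal plain-Java sketch of the HISTOGRAM semantics described in the ksqlDB reference (a map from each distinct value to its occurrence count). This is an illustration of the intended behavior, not Flink or ksqlDB API:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of HISTOGRAM(col) semantics as described in the ksqlDB reference:
// count the occurrences of each distinct input value.
public class HistogramSketch {
    public static Map<String, Long> histogram(Iterable<String> values) {
        Map<String, Long> counts = new LinkedHashMap<>();
        for (String v : values) {
            counts.merge(v, 1L, Long::sum); // increment the count for this value
        }
        return counts;
    }
}
```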
[jira] [Created] (FLINK-35480) Add FIELD function in SQL & Table API
Ufuk Celebi created FLINK-35480: --- Summary: Add FIELD function in SQL & Table API Key: FLINK-35480 URL: https://issues.apache.org/jira/browse/FLINK-35480 Project: Flink Issue Type: Improvement Components: Table SQL / API Reporter: Ufuk Celebi Fix For: 1.20.0 Add support for the {{FIELD}} function to return the position of {{str}} in {{args}}, or 0 if not found. *References* * ksqlDB: [https://docs.ksqldb.io/en/latest/developer-guide/ksqldb-reference/scalar-functions/#field] * MySQL: https://dev.mysql.com/doc/refman/8.0/en/string-functions.html#function_elt -- This message was sent by Atlassian Jira (v8.20.10#820010)
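A minimal plain-Java sketch of the proposed FIELD semantics (1-based position of the search string among the arguments, 0 if absent or NULL, following the MySQL reference); the class and method names are illustrative, not Flink API:

```java
// Sketch of FIELD(str, args...) semantics as in MySQL/ksqlDB:
// return the 1-based position of str among args, or 0 if not found.
public class FieldSketch {
    public static int field(String str, String... args) {
        if (str == null) {
            return 0; // MySQL returns 0 when the search string is NULL
        }
        for (int i = 0; i < args.length; i++) {
            if (str.equals(args[i])) {
                return i + 1; // positions are 1-based
            }
        }
        return 0; // not found
    }
}
```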
[jira] [Created] (FLINK-35239) 1.19 docs show outdated warning
Ufuk Celebi created FLINK-35239: --- Summary: 1.19 docs show outdated warning Key: FLINK-35239 URL: https://issues.apache.org/jira/browse/FLINK-35239 Project: Flink Issue Type: Bug Components: Documentation Affects Versions: 1.19.0 Reporter: Ufuk Celebi Assignee: Ufuk Celebi Fix For: 1.19.0 Attachments: Screenshot 2024-04-25 at 15.01.57.png The docs for 1.19 are currently marked as outdated although it is the current stable release. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (FLINK-35038) Bump test dependency org.yaml:snakeyaml to 2.2
Ufuk Celebi created FLINK-35038: --- Summary: Bump test dependency org.yaml:snakeyaml to 2.2 Key: FLINK-35038 URL: https://issues.apache.org/jira/browse/FLINK-35038 Project: Flink Issue Type: Technical Debt Components: Connectors / Kafka Affects Versions: 3.1.0 Reporter: Ufuk Celebi Assignee: Ufuk Celebi Fix For: 3.1.0 Usage of SnakeYAML via {{flink-shaded}} was replaced by an explicit test scope dependency on {{org.yaml:snakeyaml:1.31}} with FLINK-34193. This outdated version of SnakeYAML triggers security warnings. These should not be an actual issue given the test scope, but we should consider bumping the version for security hygiene purposes. -- This message was sent by Atlassian Jira (v8.20.10#820010)
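For illustration, the bumped test-scope dependency would look roughly like this in the connector's pom.xml (coordinates are from the ticket; the exact placement in the pom is an assumption):

```xml
<!-- Sketch: explicit test-scope SnakeYAML dependency bumped from 1.31 to 2.2 -->
<dependency>
  <groupId>org.yaml</groupId>
  <artifactId>snakeyaml</artifactId>
  <version>2.2</version>
  <scope>test</scope>
</dependency>
```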
[jira] [Created] (FLINK-21928) DuplicateJobSubmissionException after JobManager failover
Ufuk Celebi created FLINK-21928: --- Summary: DuplicateJobSubmissionException after JobManager failover Key: FLINK-21928 URL: https://issues.apache.org/jira/browse/FLINK-21928 Project: Flink Issue Type: Bug Components: Runtime / Coordination Affects Versions: 1.12.2, 1.11.3, 1.10.3 Environment: StandaloneApplicationClusterEntryPoint using a fixed job ID, High Availability enabled Reporter: Ufuk Celebi Consider the following scenario:
* Environment: StandaloneApplicationClusterEntryPoint using a fixed job ID, high availability enabled
* Flink job reaches a globally terminal state
* Flink job is marked as finished in the high-availability service's RunningJobsRegistry
* The JobManager fails over
On recovery, the [Dispatcher throws DuplicateJobSubmissionException, because the job is marked as done in the RunningJobsRegistry|https://github.com/apache/flink/blob/release-1.12.2/flink-runtime/src/main/java/org/apache/flink/runtime/dispatcher/Dispatcher.java#L332-L340]. When this happens, users cannot get out of the situation without manually redeploying the JobManager process and changing the job ID^1^. The desired semantics are that we don't want to re-execute a job that has reached a globally terminal state. In this particular case, we know that the job has already reached such a state (as it has been marked in the registry). Therefore, we could handle this case by executing the regular termination sequence instead of throwing a DuplicateJobSubmissionException. --- ^1^ With ZooKeeper HA, the respective node is not ephemeral. In Kubernetes HA, there is no notion of ephemeral data that is tied to a session in the first place afaik. -- This message was sent by Atlassian Jira (v8.3.4#803005)
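A rough sketch of the proposed handling on recovery. The registry interface and method names here are hypothetical stand-ins, not Flink's actual Dispatcher/RunningJobsRegistry API:

```java
// Sketch: if the registry already marks the job as globally terminated,
// run the normal termination sequence instead of failing the submission
// with DuplicateJobSubmissionException. All names here are hypothetical.
public class DispatcherSketch {
    public interface JobRegistry {
        boolean isDone(String jobId);    // job reached a globally terminal state
        boolean isRunning(String jobId); // job is currently executing
    }

    public enum SubmissionOutcome { STARTED, ALREADY_DONE, DUPLICATE }

    public static SubmissionOutcome onSubmit(JobRegistry registry, String jobId) {
        if (registry.isDone(jobId)) {
            // Proposed change: acknowledge the terminal state and terminate
            // normally instead of throwing DuplicateJobSubmissionException.
            return SubmissionOutcome.ALREADY_DONE;
        }
        if (registry.isRunning(jobId)) {
            return SubmissionOutcome.DUPLICATE; // a genuine duplicate submission
        }
        return SubmissionOutcome.STARTED;
    }
}
```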
[jira] [Created] (FLINK-17500) Deploy JobGraph from file in StandaloneClusterEntrypoint
Ufuk Celebi created FLINK-17500: --- Summary: Deploy JobGraph from file in StandaloneClusterEntrypoint Key: FLINK-17500 URL: https://issues.apache.org/jira/browse/FLINK-17500 Project: Flink Issue Type: Wish Components: Deployment / Docker Reporter: Ufuk Celebi We have a requirement to deploy a pre-generated {{JobGraph}} from a file in {{StandaloneClusterEntrypoint}}. Currently, {{StandaloneClusterEntrypoint}} only supports deployment of a Flink job from the class path using {{ClassPathPackagedProgramRetriever}}. Our desired behaviour would be as follows: If {{internal.jobgraph-path}} is set, prepare a {{PackagedProgram}} from a local {{JobGraph}} file using {{FileJobGraphRetriever}}. Otherwise, deploy using {{ClassPathPackagedProgramRetriever}} (current behavior). --- I understand that this requirement is pretty niche, but wanted to get feedback whether the Flink community would be open to supporting this nonetheless. -- This message was sent by Atlassian Jira (v8.3.4#803005)
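The desired selection logic can be sketched as follows; the string return values stand in for the actual retriever instances ({{FileJobGraphRetriever}} vs. {{ClassPathPackagedProgramRetriever}}), so this is a shape-of-the-logic illustration, not Flink code:

```java
import java.util.Map;
import java.util.Optional;

// Sketch of the proposed retriever selection: if internal.jobgraph-path is
// configured, deploy the pre-generated JobGraph from that file; otherwise
// fall back to the current class-path-based behavior.
public class RetrieverSelectionSketch {
    public static String selectRetriever(Map<String, String> config) {
        return Optional.ofNullable(config.get("internal.jobgraph-path"))
                .map(path -> "file:" + path) // stand-in for FileJobGraphRetriever
                .orElse("classpath");        // stand-in for ClassPathPackagedProgramRetriever
    }
}
```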
[jira] [Created] (FLINK-15831) Add Docker image publication to release documentation
Ufuk Celebi created FLINK-15831: --- Summary: Add Docker image publication to release documentation Key: FLINK-15831 URL: https://issues.apache.org/jira/browse/FLINK-15831 Project: Flink Issue Type: Sub-task Components: Documentation Reporter: Ufuk Celebi The [release documentation in the project Wiki|https://cwiki.apache.org/confluence/display/FLINK/Creating+a+Flink+Release] describes the release process. We need to add a note to follow up with the Docker image publication process as part of the release checklist. The actual documentation should probably be self-contained in the apache/flink-docker repository, but we should definitely link to it. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-15830) Migrate docker-flink/docker-flink to apache/flink-docker
Ufuk Celebi created FLINK-15830: --- Summary: Migrate docker-flink/docker-flink to apache/flink-docker Key: FLINK-15830 URL: https://issues.apache.org/jira/browse/FLINK-15830 Project: Flink Issue Type: Sub-task Reporter: Ufuk Celebi Assignee: Patrick Lucas -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-15829) Request apache/flink-docker repository
Ufuk Celebi created FLINK-15829: --- Summary: Request apache/flink-docker repository Key: FLINK-15829 URL: https://issues.apache.org/jira/browse/FLINK-15829 Project: Flink Issue Type: Sub-task Reporter: Ufuk Celebi -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-15828) Integrate docker-flink/docker-flink into Flink release process
Ufuk Celebi created FLINK-15828: --- Summary: Integrate docker-flink/docker-flink into Flink release process Key: FLINK-15828 URL: https://issues.apache.org/jira/browse/FLINK-15828 Project: Flink Issue Type: Improvement Components: Deployment / Docker, Release System Reporter: Ufuk Celebi This ticket tracks the first phase of Flink Docker image build consolidation. The goal of this story is to integrate Docker image publication with the Flink release process and provide convenience packages of released Flink artifacts on DockerHub. For more details, check the [DISCUSS|http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-Integrate-Flink-Docker-image-publication-into-Flink-release-process-td36139.html] and [VOTE|http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/VOTE-Integrate-Flink-Docker-image-publication-into-Flink-release-process-td36982.html] threads on the mailing list. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-14145) getLatestCheckpoint returns wrong checkpoint
Ufuk Celebi created FLINK-14145: --- Summary: getLatestCheckpoint returns wrong checkpoint Key: FLINK-14145 URL: https://issues.apache.org/jira/browse/FLINK-14145 Project: Flink Issue Type: Bug Components: Runtime / Checkpointing Affects Versions: 1.9.0 Reporter: Ufuk Celebi The flag to prefer checkpoints for recovery introduced in FLINK-11159 is broken: when enabled, getLatestCheckpoint returns the wrong checkpoint as the latest one. The current implementation assumes that the latest checkpoint is a savepoint and skips over it. I attached a patch for {{StandaloneCompletedCheckpointStoreTest}} that demonstrates the issue. -- This message was sent by Atlassian Jira (v8.3.4#803005)
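The intended semantics of the preference flag can be sketched as follows. Assumed behavior: when the flag is set, skip savepoints and return the newest real checkpoint, falling back to the newest entry if only savepoints exist; the Checkpoint class is a stand-in for Flink's CompletedCheckpoint, not the actual API:

```java
import java.util.List;

// Sketch of the intended getLatestCheckpoint semantics with the FLINK-11159
// "prefer checkpoints for recovery" flag.
public class LatestCheckpointSketch {
    public static class Checkpoint {
        public final long id;
        public final boolean savepoint;
        public Checkpoint(long id, boolean savepoint) {
            this.id = id;
            this.savepoint = savepoint;
        }
    }

    // checkpoints are ordered oldest to newest, as in the completed checkpoint store
    public static Checkpoint getLatest(List<Checkpoint> checkpoints, boolean preferCheckpoints) {
        if (checkpoints.isEmpty()) {
            return null;
        }
        if (preferCheckpoints) {
            // walk backwards to find the newest non-savepoint checkpoint
            for (int i = checkpoints.size() - 1; i >= 0; i--) {
                if (!checkpoints.get(i).savepoint) {
                    return checkpoints.get(i);
                }
            }
        }
        return checkpoints.get(checkpoints.size() - 1); // newest entry overall
    }
}
```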
[jira] [Created] (FLINK-12813) Add Hadoop profile in building from source docs
Ufuk Celebi created FLINK-12813: --- Summary: Add Hadoop profile in building from source docs Key: FLINK-12813 URL: https://issues.apache.org/jira/browse/FLINK-12813 Project: Flink Issue Type: Bug Components: Documentation Affects Versions: 1.8.0 Reporter: Ufuk Celebi The docs for [building from source with Hadoop|https://ci.apache.org/projects/flink/flink-docs-release-1.8/flinkDev/building.html#hadoop-versions] omit the {{-Pinclude-hadoop}} profile in two code snippets. The two code snippets that have {{-Dhadoop.version}} set, but no {{-Pinclude-hadoop}} need to be updated to include {{-Pinclude-hadoop}} as well. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-12313) SynchronousCheckpointITCase.taskCachedThreadPoolAllowsForSynchronousCheckpoints is unstable
Ufuk Celebi created FLINK-12313: --- Summary: SynchronousCheckpointITCase.taskCachedThreadPoolAllowsForSynchronousCheckpoints is unstable Key: FLINK-12313 URL: https://issues.apache.org/jira/browse/FLINK-12313 Project: Flink Issue Type: Test Components: Runtime / Checkpointing Reporter: Ufuk Celebi {{SynchronousCheckpointITCase.taskCachedThreadPoolAllowsForSynchronousCheckpoints}} occasionally fails on Travis; the watchdog prints the thread stack traces because the build produces no output. {code} == Printing stack trace of Java process 10071 == 2019-04-24 07:55:29 Full thread dump Java HotSpot(TM) 64-Bit Server VM (25.151-b12 mixed mode):
"Attach Listener" #17 daemon prio=9 os_prio=0 tid=0x7f294892 nid=0x2cf5 waiting on condition [0x] java.lang.Thread.State: RUNNABLE
"Async calls on Test Task (1/1)" #15 daemon prio=5 os_prio=0 tid=0x7f2948dd1800 nid=0x27a9 waiting on condition [0x7f292cea9000] java.lang.Thread.State: WAITING (parking) at sun.misc.Unsafe.park(Native Method) - parking to wait for <0x8bb5e558> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject) at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039) at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442) at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748)
"Async calls on Test Task (1/1)" #14 daemon prio=5 os_prio=0 tid=0x7f2948dce800 nid=0x27a8 in Object.wait() [0x7f292cfaa000] java.lang.Thread.State: WAITING (on object monitor) at java.lang.Object.wait(Native Method) - waiting on <0x8bac58f8> (a java.lang.Object) at java.lang.Object.wait(Object.java:502) at
org.apache.flink.streaming.runtime.tasks.SynchronousSavepointLatch.blockUntilCheckpointIsAcknowledged(SynchronousSavepointLatch.java:66) - locked <0x8bac58f8> (a java.lang.Object) at org.apache.flink.streaming.runtime.tasks.StreamTask.performCheckpoint(StreamTask.java:726) at org.apache.flink.streaming.runtime.tasks.StreamTask.triggerCheckpoint(StreamTask.java:604) at org.apache.flink.streaming.runtime.tasks.SynchronousCheckpointITCase$SynchronousCheckpointTestingTask.triggerCheckpoint(SynchronousCheckpointITCase.java:174) at org.apache.flink.runtime.taskmanager.Task$1.run(Task.java:1182) at java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1626) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) "CloseableReaperThread" #13 daemon prio=5 os_prio=0 tid=0x7f2948d9b800 nid=0x27a7 in Object.wait() [0x7f292d0ab000] java.lang.Thread.State: WAITING (on object monitor) at java.lang.Object.wait(Native Method) - waiting on <0x8bbe3990> (a java.lang.ref.ReferenceQueue$Lock) at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:143) - locked <0x8bbe3990> (a java.lang.ref.ReferenceQueue$Lock) at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:164) at org.apache.flink.core.fs.SafetyNetCloseableRegistry$CloseableReaperThread.run(SafetyNetCloseableRegistry.java:193) "Test Task (1/1)" #12 prio=5 os_prio=0 tid=0x7f2948d97000 nid=0x27a6 in Object.wait() [0x7f292d1ac000] java.lang.Thread.State: WAITING (on object monitor) at java.lang.Object.wait(Native Method) - waiting on <0x8e63f7d8> (a java.lang.Object) at java.lang.Object.wait(Object.java:502) at org.apache.flink.core.testutils.OneShotLatch.await(OneShotLatch.java:63) - locked <0x8e63f7d8> (a java.lang.Object) at 
org.apache.flink.streaming.runtime.tasks.SynchronousCheckpointITCase$SynchronousCheckpointTestingTask.run(SynchronousCheckpointITCase.java:161) at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:335) at org.apache.flink.runtime.taskmanager.Task.run(Task.java:724) at java.lang.Thread.run(Thread.java:748) "process reaper" #11 daemon prio=10 os_prio=0 tid=0x7f294885e000 nid=0x2793 waiting on condition [0x7f292d7e5000]
[jira] [Created] (FLINK-12060) Unify change pom version scripts
Ufuk Celebi created FLINK-12060: --- Summary: Unify change pom version scripts Key: FLINK-12060 URL: https://issues.apache.org/jira/browse/FLINK-12060 Project: Flink Issue Type: Bug Components: Deployment / Scripts Affects Versions: 1.8.0 Reporter: Ufuk Celebi We have three places in {{tools}} that we use to update the pom versions when releasing:
1. https://github.com/apache/flink/blob/048367b/tools/change-version.sh#L31
2. https://github.com/apache/flink/blob/048367b/tools/releasing/create_release_branch.sh#L60
3. https://github.com/apache/flink/blob/048367b/tools/releasing/update_branch_version.sh#L52
The 1st option is buggy (it does not work with the new versioning of the shaded Hadoop build, e.g. {{2.4.1-1.9-SNAPSHOT}} will not be replaced). The 2nd and 3rd work for pom files, but the 2nd one misses a change for the doc version that is present in the 3rd one. I think we should unify these and call them where needed instead of duplicating this code in unexpected ways. An initial quick fix could remove the 1st script and update the 2nd one to match the 3rd one. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-11784) KryoSerializerSnapshotTest occasionally fails on Travis
Ufuk Celebi created FLINK-11784: --- Summary: KryoSerializerSnapshotTest occasionally fails on Travis Key: FLINK-11784 URL: https://issues.apache.org/jira/browse/FLINK-11784 Project: Flink Issue Type: Bug Components: API / Type Serialization System Reporter: Ufuk Celebi {{KryoSerializerSnapshotTest}} fails occasionally with: {code:java} 11:37:44.198 [ERROR] tryingToRestoreWithNonExistingClassShouldBeIncompatible(org.apache.flink.api.java.typeutils.runtime.kryo.KryoSerializerSnapshotTest) Time elapsed: 0.011 s <<< ERROR! java.io.EOFException at org.apache.flink.api.java.typeutils.runtime.kryo.KryoSerializerSnapshotTest.kryoSnapshotWithMissingClass(KryoSerializerSnapshotTest.java:120) at org.apache.flink.api.java.typeutils.runtime.kryo.KryoSerializerSnapshotTest.tryingToRestoreWithNonExistingClassShouldBeIncompatible(KryoSerializerSnapshotTest.java:105){code} See [https://travis-ci.org/apache/flink/jobs/499371953] for full build output (as part of a PR with unrelated changes). -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-11752) Move flink-python to opt
Ufuk Celebi created FLINK-11752: --- Summary: Move flink-python to opt Key: FLINK-11752 URL: https://issues.apache.org/jira/browse/FLINK-11752 Project: Flink Issue Type: Improvement Components: Build System, Python API Affects Versions: 1.7.2 Reporter: Ufuk Celebi Assignee: Ufuk Celebi As discussed on the [dev mailing list|http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-Move-flink-python-to-opt-td27347.html], we should move flink-python to opt instead of having it in lib by default. The streaming counterpart (flink-streaming-python) is already only shipped in opt. I think we don't have many users of the Python batch API; this change will make the streaming/batch experience more consistent and result in a cleaner default classpath. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-11545) Add option to manually set job ID in StandaloneJobClusterEntryPoint
Ufuk Celebi created FLINK-11545: --- Summary: Add option to manually set job ID in StandaloneJobClusterEntryPoint Key: FLINK-11545 URL: https://issues.apache.org/jira/browse/FLINK-11545 Project: Flink Issue Type: Sub-task Components: Cluster Management, Kubernetes Affects Versions: 1.7.0 Reporter: Ufuk Celebi Assignee: Ufuk Celebi Add an option to specify the job ID during job submissions via the StandaloneJobClusterEntryPoint. Currently, the entry point fixes the job ID to all zeros. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-11546) Add option to manually set job ID in CLI
Ufuk Celebi created FLINK-11546: --- Summary: Add option to manually set job ID in CLI Key: FLINK-11546 URL: https://issues.apache.org/jira/browse/FLINK-11546 Project: Flink Issue Type: Sub-task Components: Client Affects Versions: 1.7.0 Reporter: Ufuk Celebi Add an option to specify the job ID during job submissions via the CLI. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-11544) Add option to manually set job ID for job submissions via REST API
Ufuk Celebi created FLINK-11544: --- Summary: Add option to manually set job ID for job submissions via REST API Key: FLINK-11544 URL: https://issues.apache.org/jira/browse/FLINK-11544 Project: Flink Issue Type: Sub-task Components: REST Affects Versions: 1.7.0 Reporter: Ufuk Celebi Assignee: Ufuk Celebi Add an option to specify the job ID during job submissions via the REST API. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-11534) Don't exit JVM after job termination with standalone job
Ufuk Celebi created FLINK-11534: --- Summary: Don't exit JVM after job termination with standalone job Key: FLINK-11534 URL: https://issues.apache.org/jira/browse/FLINK-11534 Project: Flink Issue Type: New Feature Components: Cluster Management Affects Versions: 1.7.0 Reporter: Ufuk Celebi If a job deployed in job cluster mode terminates, the JVM running the StandaloneJobClusterEntryPoint will exit via System.exit(1). When managing such a job, this requires access to external logging systems in order to get more details about failure causes or the final termination status. I believe that there is value in having a StandaloneJobClusterEntryPoint option that does not exit the JVM after the job has terminated. This allows users to gather further information if they are monitoring the job and manually tear down the process. If there is agreement to have this feature, I would provide the implementation. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-11533) Retrieve job class name from JAR manifest in ClassPathJobGraphRetriever
Ufuk Celebi created FLINK-11533: --- Summary: Retrieve job class name from JAR manifest in ClassPathJobGraphRetriever Key: FLINK-11533 URL: https://issues.apache.org/jira/browse/FLINK-11533 Project: Flink Issue Type: New Feature Components: Cluster Management Affects Versions: 1.7.0 Reporter: Ufuk Celebi Users running job clusters distribute their user code as part of the shared classpath of all cluster components. We currently require users running {{StandaloneClusterEntryPoint}} to manually specify the job class name. JAR manifest entries that specify the main class of a JAR are ignored since they are simply part of the classpath. I propose to add another optional command line argument to the {{StandaloneClusterEntryPoint}} that specifies the location of a JAR file (such as {{lib/usercode.jar}}) and whose Manifest is respected. Arguments: {code} --job-jar --job-classname name {code} Each argument is optional, but at least one of the two is required. The job-classname has precedence over job-jar. Implementation wise we should be able to simply create the PackagedProgram from the jar file path in ClassPathJobGraphRetriever. If there is agreement to have this feature, I would provide the implementation. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
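Reading the job class from the JAR manifest can be sketched with the standard {{java.util.jar}} API; the class and method names here are illustrative, and error handling is simplified:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.jar.JarFile;
import java.util.jar.Manifest;

// Sketch: resolve the job class name, preferring an explicit --job-classname
// argument and otherwise reading the Main-Class entry of the --job-jar manifest.
public class ManifestJobClassSketch {
    public static String jobClassName(String explicitClassName, String jobJarPath) {
        if (explicitClassName != null) {
            return explicitClassName; // --job-classname takes precedence over --job-jar
        }
        try (JarFile jar = new JarFile(jobJarPath)) {
            Manifest manifest = jar.getManifest();
            String mainClass = manifest == null
                    ? null
                    : manifest.getMainAttributes().getValue("Main-Class");
            if (mainClass == null) {
                throw new IllegalArgumentException("No Main-Class manifest entry in " + jobJarPath);
            }
            return mainClass;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```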
[jira] [Created] (FLINK-11525) Add option to manually set job ID
Ufuk Celebi created FLINK-11525: --- Summary: Add option to manually set job ID Key: FLINK-11525 URL: https://issues.apache.org/jira/browse/FLINK-11525 Project: Flink Issue Type: New Feature Components: Client, REST Affects Versions: 1.7.0 Reporter: Ufuk Celebi When submitting Flink jobs programmatically it is desirable to have the option to manually set the job ID in order to have idempotent job submissions. This simplifies failure handling on the user side as duplicate submissions will be rejected by Flink. In general allowing to manually set the job ID can be beneficial for third party tooling. The default behavior should not be altered. The following job submission entry points should be extended to allow to specify this option: 1. REST API 2. StandaloneJobClusterEntrypoint 3. CLI Note that for 2., FLINK-10921 already suggested to allow to configure the job ID manually. If there is agreement to have this feature, I'll go ahead and create sub tasks for the mentioned entry points (and provide the implementation for 1. and 2.). -- This message was sent by Atlassian JIRA (v7.6.3#76005)
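For illustration, a client could derive a fixed, idempotent job ID by hashing a stable job name into the 16-byte / 32-hex-character format that Flink job IDs use. This is a sketch of possible client-side usage, not a proposed Flink API:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Sketch: derive a deterministic job ID from a stable job name, so that
// resubmitting the same job yields the same ID (and duplicates get rejected).
public class FixedJobIdSketch {
    public static String jobIdFor(String jobName) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(jobName.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b)); // 16 bytes -> 32 hex chars, like a Flink JobID
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 unavailable", e); // present in every standard JRE
        }
    }
}
```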
[jira] [Created] (FLINK-11402) User code can fail with an UnsatisfiedLinkError in the presence of multiple classloaders
Ufuk Celebi created FLINK-11402: --- Summary: User code can fail with an UnsatisfiedLinkError in the presence of multiple classloaders Key: FLINK-11402 URL: https://issues.apache.org/jira/browse/FLINK-11402 Project: Flink Issue Type: Bug Components: Distributed Coordination Affects Versions: 1.7.0 Reporter: Ufuk Celebi Attachments: hello-snappy-1.0-SNAPSHOT.jar, hello-snappy.tgz As reported on the user mailing list thread "[`env.java.opts` not persisting after job canceled or failed and then restarted|https://lists.apache.org/thread.html/37cc1b628e16ca6c0bacced5e825de8057f88a8d601b90a355b6a291@%3Cuser.flink.apache.org%3E]", there can be issues with using native libraries and user code class loading.
h2. Steps to reproduce
I was able to reproduce the issue reported on the mailing list using [snappy-java|https://github.com/xerial/snappy-java] in a user program. Running the attached user program works fine on initial submission, but results in a failure when re-executed. I'm using Flink 1.7.0 with a standalone cluster started via {{bin/start-cluster.sh}}.
0. Unpack the attached Maven project and build using {{mvn clean package}} *or* directly use the attached {{hello-snappy-1.0-SNAPSHOT.jar}}
1. Download [snappy-java-1.1.7.2.jar|http://central.maven.org/maven2/org/xerial/snappy/snappy-java/1.1.7.2/snappy-java-1.1.7.2.jar] and unpack libsnappyjava for your system: {code} jar tf snappy-java-1.1.7.2.jar | grep libsnappy ... org/xerial/snappy/native/Linux/x86_64/libsnappyjava.so ... org/xerial/snappy/native/Mac/x86_64/libsnappyjava.jnilib ... {code}
2. Configure the system library path to {{libsnappyjava}} in {{flink-conf.yaml}} (path needs to be adjusted for your system): {code} env.java.opts: -Djava.library.path=/.../org/xerial/snappy/native/Mac/x86_64 {code}
3. Run the attached {{hello-snappy-1.0-SNAPSHOT.jar}}: {code} bin/flink run hello-snappy-1.0-SNAPSHOT.jar Starting execution of program Program execution finished Job with JobID ae815b918dd7bc64ac8959e4e224f2b4 has finished.
Job Runtime: 359 ms {code} 4. Rerun attached {{hello-snappy-1.0-SNAPSHOT.jar}} {code} bin/flink run hello-snappy-1.0-SNAPSHOT.jar Starting execution of program The program finished with the following exception: org.apache.flink.client.program.ProgramInvocationException: Job failed. (JobID: 7d69baca58f33180cb9251449ddcd396) at org.apache.flink.client.program.rest.RestClusterClient.submitJob(RestClusterClient.java:268) at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:487) at org.apache.flink.streaming.api.environment.StreamContextEnvironment.execute(StreamContextEnvironment.java:66) at com.github.uce.HelloSnappy.main(HelloSnappy.java:18) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:529) at org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:421) at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:427) at org.apache.flink.client.cli.CliFrontend.executeProgram(CliFrontend.java:813) at org.apache.flink.client.cli.CliFrontend.runProgram(CliFrontend.java:287) at org.apache.flink.client.cli.CliFrontend.run(CliFrontend.java:213) at org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:1050) at org.apache.flink.client.cli.CliFrontend.lambda$main$11(CliFrontend.java:1126) at org.apache.flink.runtime.security.NoOpSecurityContext.runSecured(NoOpSecurityContext.java:30) at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1126) Caused by: org.apache.flink.runtime.client.JobExecutionException: Job execution failed. 
at org.apache.flink.runtime.jobmaster.JobResult.toJobExecutionResult(JobResult.java:146) at org.apache.flink.client.program.rest.RestClusterClient.submitJob(RestClusterClient.java:265) ... 17 more Caused by: java.lang.UnsatisfiedLinkError: Native Library /.../org/xerial/snappy/native/Mac/x86_64/libsnappyjava.jnilib already loaded in another classloader at java.lang.ClassLoader.loadLibrary0(ClassLoader.java:1907) at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1861) at java.lang.Runtime.loadLibrary0(Runtime.java:870) at java.lang.System.loadLibrary(System.java:1122) at org.xerial.snappy.SnappyLoader.loadNativeLibrary(SnappyLoader.java:182) at org.xerial.snappy.SnappyLoader.loadSnappyApi(SnappyLoader.java:154) at org.xerial.snappy.Snappy.(Snappy.java:47) at
[jira] [Created] (FLINK-11127) Make metrics query service establish connection to JobManager
Ufuk Celebi created FLINK-11127: --- Summary: Make metrics query service establish connection to JobManager Key: FLINK-11127 URL: https://issues.apache.org/jira/browse/FLINK-11127 Project: Flink Issue Type: Improvement Components: Distributed Coordination, Kubernetes, Metrics Affects Versions: 1.7.0 Reporter: Ufuk Celebi As part of FLINK-10247, the internal metrics query service has been separated into its own actor system. Before this change, the JobManager (JM) queried TaskManager (TM) metrics via the TM actor. Now, the JM needs to establish a separate connection to the TM metrics query service actor. In the context of Kubernetes, this is problematic as the JM will typically *not* be able to resolve the TMs by name, resulting in warnings as follows: {code} 2018-12-11 08:32:33,962 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink-metrics@flink-task-manager-64b868487c-x9l4b:39183] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink-metrics@flink-task-manager-64b868487c-x9l4b:39183]] Caused by: [flink-task-manager-64b868487c-x9l4b: Name does not resolve] {code} In order to expose the TMs by name in Kubernetes, users require a service *for each* TM instance, which is not practical. This currently results in the web UI not being able to display some basic metrics, such as the number of sent records. You can reproduce this by following the READMEs in {{flink-container/kubernetes}}. This worked before, because the JM is typically exposed via a service with a known name and the TMs establish the connection to it, which the metrics query service piggybacked on. A potential solution might be to let the query service connect to the JM, similar to how the TMs register. I tagged this ticket as an improvement, but in the context of Kubernetes I would consider this to be a bug. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-10971) Dependency convergence issue when building flink-s3-fs-presto
Ufuk Celebi created FLINK-10971: --- Summary: Dependency convergence issue when building flink-s3-fs-presto Key: FLINK-10971 URL: https://issues.apache.org/jira/browse/FLINK-10971 Project: Flink Issue Type: Bug Reporter: Ufuk Celebi Trying to trigger a savepoint to S3 with a clean build of {{release-1.7.0-rc2}} results in a {{java.lang.NoSuchMethodError: org.apache.flink.fs.s3presto.shaded.com.google.common.base.Preconditions.checkArgument(ZLjava/lang/String;Ljava/lang/Object;)V}}. *Environment* - Tag: {{release-1.7.0-rc2}} - Build command: {{mvn clean package -DskipTests -Dcheckstyle.skip}} - Maven version: {code} mvn -version Apache Maven 3.5.4 (1edded0938998edf8bf061f1ceb3cfdeccf443fe; 2018-06-17T20:33:14+02:00) Maven home: /usr/local/Cellar/maven/3.5.4/libexec Java version: 1.8.0_192, vendor: Oracle Corporation, runtime: /Library/Java/JavaVirtualMachines/jdk1.8.0_192.jdk/Contents/Home/jre Default locale: en_US, platform encoding: UTF-8 OS name: "mac os x", version: "10.14.1", arch: "x86_64", family: "mac" {code} *Steps to reproduce* {code} cp opt/flink-s3-fs-presto-1.7.0.jar lib bin/start-cluster.sh bin/flink run examples/streaming/TopSpeedWindowing.jar bin/flink savepoint db37f69f21cbe54e9bf6b7e259a9c09e {code} *Stacktrace* {code} The program finished with the following exception: org.apache.flink.util.FlinkException: Triggering a savepoint for the job db37f69f21cbe54e9bf6b7e259a9c09e failed. 
at org.apache.flink.client.cli.CliFrontend.triggerSavepoint(CliFrontend.java:723) at org.apache.flink.client.cli.CliFrontend.lambda$savepoint$9(CliFrontend.java:701) at org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:985) at org.apache.flink.client.cli.CliFrontend.savepoint(CliFrontend.java:698) at org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:1065) at org.apache.flink.client.cli.CliFrontend.lambda$main$11(CliFrontend.java:1126) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1556) at org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41) at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1126) Caused by: java.util.concurrent.CompletionException: java.util.concurrent.CompletionException: java.lang.NoSuchMethodError: org.apache.flink.fs.s3presto.shaded.com.google.common.base.Preconditions.checkArgument(ZLjava/lang/String;Ljava/lang/Object;)V at org.apache.flink.runtime.jobmaster.JobMaster.lambda$triggerSavepoint$14(JobMaster.java:970) at java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:870) at java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:852) at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474) at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977) at org.apache.flink.runtime.checkpoint.PendingCheckpoint.finalizeCheckpoint(PendingCheckpoint.java:292) at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.completePendingCheckpoint(CheckpointCoordinator.java:829) at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.receiveAcknowledgeMessage(CheckpointCoordinator.java:756) at 
org.apache.flink.runtime.jobmaster.JobMaster.lambda$acknowledgeCheckpoint$8(JobMaster.java:680) at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:39) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:415) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) Caused by: java.util.concurrent.CompletionException: java.lang.NoSuchMethodError: org.apache.flink.fs.s3presto.shaded.com.google.common.base.Preconditions.checkArgument(ZLjava/lang/String;Ljava/lang/Object;)V at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292) at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308) at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:593) at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:577) ... 12 more Caused by: java.lang.NoSuchMethodError:
[jira] [Created] (FLINK-10948) Add option to write out termination message with application status
Ufuk Celebi created FLINK-10948: --- Summary: Add option to write out termination message with application status Key: FLINK-10948 URL: https://issues.apache.org/jira/browse/FLINK-10948 Project: Flink Issue Type: Improvement Components: Distributed Coordination Reporter: Ufuk Celebi Assignee: Ufuk Celebi I propose to add an option to write out a termination message to a file that indicates the terminal application status. With the change proposed in FLINK-10743, we can't use the exit code to differentiate between cancelled and succeeded applications. The motivating use case for both this ticket and FLINK-10743 is Flink job clusters ({{StandaloneJobClusterEntryPoint}}) on Kubernetes. The idea of the termination message comes from Kubernetes ([https://kubernetes.io/docs/tasks/debug-application-cluster/determine-reason-pod-failure/]). With this in place, a terminated Pod will report the final status as in: {code:java} state: terminated: exitCode: 0 finishedAt: 2018-11-20T11:00:59Z message: CANCELED # <--- termination message reason: Completed startedAt: 2018-11-20T10:59:18Z {code} The implementation could be done in {{ClusterEntrypoint#runClusterEntrypoint(ClusterEntrypoint)}}, which is used by all entry points to run Flink. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
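The proposed mechanism can be sketched in plain Java (the class and helper names below are hypothetical, not Flink API): before the entry point exits, the terminal application status is written to a file, which Kubernetes then surfaces via the Pod's {{terminationMessagePath}}.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/**
 * Minimal sketch of the proposed termination-message behavior.
 * Names are hypothetical, not Flink's actual API.
 */
public class TerminationMessageWriter {

    /** Writes the terminal status (e.g. "CANCELED") to the given file. */
    public static void write(Path messageFile, String applicationStatus) {
        try {
            // Kubernetes reads /dev/termination-log by default; the path would be configurable.
            Files.write(messageFile, applicationStatus.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Illustration helper: writes the status to a fresh temp file and returns its path. */
    public static Path writeToTempFile(String applicationStatus) {
        try {
            Path file = Files.createTempFile("termination-message", ".txt");
            write(file, applicationStatus);
            return file;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads the termination message back. */
    public static String read(Path messageFile) {
        try {
            return new String(Files.readAllBytes(messageFile), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```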
[jira] [Created] (FLINK-10751) Checkpoints should be retained when job reaches suspended state
Ufuk Celebi created FLINK-10751: --- Summary: Checkpoints should be retained when job reaches suspended state Key: FLINK-10751 URL: https://issues.apache.org/jira/browse/FLINK-10751 Project: Flink Issue Type: Bug Components: Distributed Coordination Affects Versions: 1.6.2 Reporter: Ufuk Celebi Assignee: Ufuk Celebi {{CheckpointProperties}} define in which terminal job status a checkpoint should be disposed. I've noticed that the properties for {{CHECKPOINT_NEVER_RETAINED}} and {{CHECKPOINT_RETAINED_ON_FAILURE}} prescribe checkpoint disposal in the (locally) terminal job status {{SUSPENDED}}. Since a job reaches the {{SUSPENDED}} state when its {{JobMaster}} loses leadership, this would result in the checkpoint being cleaned up and not being available for recovery by the new leader. Therefore, we should retain checkpoints when reaching job status {{SUSPENDED}}. *BUT:* Because we special-case this terminal state in the only highly available {{CompletedCheckpointStore}} implementation (see [ZooKeeperCompletedCheckpointStore|https://github.com/apache/flink/blob/e7ac3ba/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/ZooKeeperCompletedCheckpointStore.java#L315]) and don't use regular checkpoint disposal, this issue has not surfaced yet. I think we should proactively fix the properties to indicate that checkpoints are retained in the {{SUSPENDED}} state. We might actually remove this special case entirely since, with this change, all properties will indicate retention on suspension. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-10743) Use 0 processExitCode for ApplicationStatus.CANCELED
Ufuk Celebi created FLINK-10743: --- Summary: Use 0 processExitCode for ApplicationStatus.CANCELED Key: FLINK-10743 URL: https://issues.apache.org/jira/browse/FLINK-10743 Project: Flink Issue Type: Bug Components: Cluster Management, Kubernetes, Mesos, YARN Affects Versions: 1.6.3 Reporter: Ufuk Celebi Assignee: Ufuk Celebi {{org.apache.flink.runtime.clusterframework.ApplicationStatus}} is used to map {{org.apache.flink.runtime.jobgraph.JobStatus}} to a process exit code. We currently map {{ApplicationStatus.CANCELED}} to a non-zero exit code ({{1444}}). Since cancellation is a user-triggered operation, I would consider this to be a successful exit and map it to exit code {{0}}. Our current behavior results in applications running via the {{StandaloneJobClusterEntryPoint}} and Kubernetes pods as documented in [flink-container|https://github.com/apache/flink/tree/master/flink-container/kubernetes] being immediately restarted when cancelled. This only leaves the option of killing the respective job cluster master container. The {{ApplicationStatus}} is also used in the YARN and Mesos clients, but I'm not familiar with that part of the code base and can't assess how changing the exit code would affect these clients. A quick usage scan for {{ApplicationStatus.CANCELED}} did not surface any problematic usages, though. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
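The proposed change amounts to adjusting the status-to-exit-code mapping. A simplified sketch (this is not Flink's actual enum; the non-zero codes other than {{1444}} are illustrative assumptions):

```java
/**
 * Sketch of the proposed mapping, not Flink's actual ApplicationStatus enum.
 * CANCELED currently maps to 1444; since cancellation is user-triggered, the
 * ticket proposes mapping it to 0 so e.g. Kubernetes does not restart a
 * cancelled job cluster container.
 */
public enum ApplicationStatusSketch {
    SUCCEEDED(0),
    // Proposed change: CANCELED exits with 0 instead of the current 1444.
    CANCELED(0),
    // Non-zero codes below are illustrative, not taken from the ticket.
    FAILED(1443),
    UNKNOWN(1445);

    private final int processExitCode;

    ApplicationStatusSketch(int processExitCode) {
        this.processExitCode = processExitCode;
    }

    public int processExitCode() {
        return processExitCode;
    }
}
```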
[jira] [Created] (FLINK-10441) Don't log warning when creating upload directory
Ufuk Celebi created FLINK-10441: --- Summary: Don't log warning when creating upload directory Key: FLINK-10441 URL: https://issues.apache.org/jira/browse/FLINK-10441 Project: Flink Issue Type: Improvement Components: REST Reporter: Ufuk Celebi {{RestServerEndpoint.createUploadDir(Path, Logger)}} logs a warning if the upload directory does not exist. {code} 2018-09-26 15:32:31,732 WARN org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint - Upload directory /var/folders/hr/cxn1_2y52qxf5nzyfq9h2scwgn/T/flink-web-2218b898-f245-4edf-b181-8f3bdc6014f3/flink-web-upload does not exist, or has been deleted externally. Previously uploaded files are no longer available. {code} I found this warning confusing as it is always logged when relying on the default configuration (via {{WebOptions}}) that picks a random directory. Ideally, our default configurations should not result in warnings being logged. Therefore, I propose to log the creation of the upload directory at {{INFO}} instead of logging the warning. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-10439) Race condition during job suspension
Ufuk Celebi created FLINK-10439: --- Summary: Race condition during job suspension Key: FLINK-10439 URL: https://issues.apache.org/jira/browse/FLINK-10439 Project: Flink Issue Type: Bug Components: Distributed Coordination Affects Versions: 1.7.0 Reporter: Ufuk Celebi Attachments: master-logs.log, race-job-suspension.png, worker-logs.log When a {{JobMaster}} in an HA setup loses leadership, it suspends the execution of its job via {{JobMaster.suspend(Exception, Time)}}. This operation involves transitioning to the {{SUSPENDING}} job state and cancelling all running tasks. In some executions it may happen that the job does *not* reach the terminal {{SUSPENDED}} job state. This is due to the fact that suspending the job stops related RPC endpoints such as the {{JobMaster}} or {{SlotPool}} (in {{JobMaster.suspend(Exception, Time)}} and {{JobMaster.suspendExecution(Exception)}}) immediately after suspending. Whenever this happens *before* the {{TaskExecutor}} instances have cancelled or failed the respective tasks, the job does not transition to {{SUSPENDED}}, because the {{ExecutionGraph}} does not receive all {{Execution}} state transitions. In practice, this should not happen frequently due to the fact that {{JobMaster}} and {{TaskExecutor}} instances are notified about the loss of leadership (or loss of ZooKeeper connection or similar events) around the same time. In this scenario, the {{TaskExecutor}} instances proactively fail the executing tasks and notify the {{JobMaster}}. All in all, the impact of this is limited by the fact that a new {{JobMaster}} leader will eventually recover the job. *Steps to reproduce*: - Start ZooKeeper - Start a Flink cluster in HA mode and submit job - Stop ZooKeeper In some executions you will find that the job does not reach the terminal state {{SUSPENDED}}. 
Furthermore, you may see log messages similar to the following in this case: {code} The rpc endpoint org.apache.flink.runtime.jobmaster.slotpool.SlotPool has not been started yet. Discarding message org.apache.flink.runtime.rpc.messages.LocalRpcInvocation until processing is started. {code} I've attached the logs of a local run that does not transition to {{SUSPENDED}} and a sequence diagram of what I think may be a problematic timing. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-10436) Example config uses deprecated key jobmanager.rpc.address
Ufuk Celebi created FLINK-10436: --- Summary: Example config uses deprecated key jobmanager.rpc.address Key: FLINK-10436 URL: https://issues.apache.org/jira/browse/FLINK-10436 Project: Flink Issue Type: Bug Components: Startup Shell Scripts Affects Versions: 1.7.0 Reporter: Ufuk Celebi The example {{flink-conf.yaml}} shipped as part of the Flink distribution (https://github.com/apache/flink/blob/master/flink-dist/src/main/resources/flink-conf.yaml) has the following entry: {code} jobmanager.rpc.address: localhost {code} When using this key, the following deprecation warning is logged. {code} 2018-09-26 12:01:46,608 WARN org.apache.flink.configuration.Configuration - Config uses deprecated configuration key 'jobmanager.rpc.address' instead of proper key 'rest.address' {code} The example config should not use deprecated config options. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-10318) Add option to build Hadoop-free job image to build.sh
Ufuk Celebi created FLINK-10318: --- Summary: Add option to build Hadoop-free job image to build.sh Key: FLINK-10318 URL: https://issues.apache.org/jira/browse/FLINK-10318 Project: Flink Issue Type: Improvement Components: Docker Affects Versions: 1.6.0 Reporter: Ufuk Celebi When building a job-specific image from a release via {{flink-container/docker/build.sh}}, users are required to specify a Hadoop version: {code} ./build.sh --job-jar flink-job.jar --from-release --flink-version 1.6.0 --hadoop-version 2.8 # <- currently required --scala-version 2.11 --image-name flink-job {code} I think for many users a Hadoop-free build is a good default. We should consider supporting this out of the box with the {{build.sh}} script. The current workaround is to manually download the Hadoop-free release and build with the {{--from-archive}} flag. Another alternative would be to drop the {{--from-release}} option and document how to build from an archive with links to the downloads. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-9832) Allow commas in job submission query params
Ufuk Celebi created FLINK-9832: -- Summary: Allow commas in job submission query params Key: FLINK-9832 URL: https://issues.apache.org/jira/browse/FLINK-9832 Project: Flink Issue Type: Bug Components: REST Affects Versions: 1.5.1 Reporter: Ufuk Celebi As reported on the user mailing list in the thread "Run programs w/ params including comma via REST api" [1], submitting a job with mainArgs that include a comma results in an exception. To reproduce, submit a job with the following mainArgs: {code} --servers 10.100.98.9:9092,10.100.98.237:9092 {code} The request fails with {code} Expected only one value [--servers 10.100.98.9:9092, 10.100.98.237:9092]. {code} As a workaround, users have to use a different delimiter such as {{;}}. The proper fix would be to make these params part of the {{POST}} request body instead of relying on query params (as noted in FLINK-9499). [1] http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Run-programs-w-params-including-comma-via-REST-api-td19662.html -- This message was sent by Atlassian JIRA (v7.6.3#76005)
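The failure mode can be modeled in a few lines of Java (a simplified stand-in for the REST handler's query-parameter parsing, not Flink code): a comma inside a single parameter value is treated as a value separator, so one argument list is seen as two values and rejected.

```java
import java.util.Arrays;
import java.util.List;

/**
 * Simplified model of the problematic behavior, not Flink's actual REST
 * handler: comma-separated query-parameter values are split into multiple
 * values, so a single value containing a comma triggers
 * "Expected only one value ...".
 */
public class CommaParamSketch {

    /** Mimics splitting a query-parameter value on commas. */
    public static List<String> parseValues(String rawValue) {
        return Arrays.asList(rawValue.split(","));
    }
}
```

Using a different delimiter such as `;` keeps the value intact, which is exactly the workaround described above.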
[jira] [Created] (FLINK-9690) Restoring state with FlinkKafkaProducer and Kafka 1.1.0 client fails
Ufuk Celebi created FLINK-9690: -- Summary: Restoring state with FlinkKafkaProducer and Kafka 1.1.0 client fails Key: FLINK-9690 URL: https://issues.apache.org/jira/browse/FLINK-9690 Project: Flink Issue Type: Improvement Components: Kafka Connector Affects Versions: 1.4.2 Reporter: Ufuk Celebi Restoring a job from a savepoint that includes a {{FlinkKafkaProducer}} packaged with {{kafka.version}} set to {{1.1.0}} fails in Flink 1.4.2: {code} java.lang.RuntimeException: Incompatible KafkaProducer version at org.apache.flink.streaming.connectors.kafka.internal.FlinkKafkaProducer.getValue(FlinkKafkaProducer.java:301) at org.apache.flink.streaming.connectors.kafka.internal.FlinkKafkaProducer.getValue(FlinkKafkaProducer.java:292) at org.apache.flink.streaming.connectors.kafka.internal.FlinkKafkaProducer.resumeTransaction(FlinkKafkaProducer.java:195) at org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer011.recoverAndCommit(FlinkKafkaProducer011.java:723) at org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer011.recoverAndCommit(FlinkKafkaProducer011.java:93) at org.apache.flink.streaming.api.functions.sink.TwoPhaseCommitSinkFunction.recoverAndCommitInternal(TwoPhaseCommitSinkFunction.java:370) at org.apache.flink.streaming.api.functions.sink.TwoPhaseCommitSinkFunction.initializeState(TwoPhaseCommitSinkFunction.java:330) at org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer011.initializeState(FlinkKafkaProducer011.java:856) at org.apache.flink.streaming.util.functions.StreamingFunctionUtils.tryRestoreFunction(StreamingFunctionUtils.java:178) at org.apache.flink.streaming.util.functions.StreamingFunctionUtils.restoreFunctionState(StreamingFunctionUtils.java:160) at org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator.initializeState(AbstractUdfStreamOperator.java:96) at org.apache.flink.streaming.api.operators.AbstractStreamOperator.initializeState(AbstractStreamOperator.java:258) at 
org.apache.flink.streaming.runtime.tasks.StreamTask.initializeOperators(StreamTask.java:692) at org.apache.flink.streaming.runtime.tasks.StreamTask.initializeState(StreamTask.java:679) at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:253) at org.apache.flink.runtime.taskmanager.Task.run(Task.java:718) at java.lang.Thread.run(Thread.java:748) Caused by: java.lang.NoSuchFieldException: sequenceNumbers at java.lang.Class.getDeclaredField(Class.java:2070) at org.apache.flink.streaming.connectors.kafka.internal.FlinkKafkaProducer.getValue(FlinkKafkaProducer.java:297) ... 16 more {code} [~pnowojski] Any ideas about this issue? Judging from the stack trace it was anticipated that reflective access might break with Kafka versions > 0.11.2.0. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-9389) Improve error message when cancelling job in state canceled via REST API
Ufuk Celebi created FLINK-9389: -- Summary: Improve error message when cancelling job in state canceled via REST API Key: FLINK-9389 URL: https://issues.apache.org/jira/browse/FLINK-9389 Project: Flink Issue Type: Improvement Components: REST Affects Versions: 1.5.0 Reporter: Ufuk Celebi - Request a job that has already been cancelled {code} $ curl http://localhost:8081/jobs/b577a914ecdf710d4b93a84105dea0c9 {"jid":"b577a914ecdf710d4b93a84105dea0c9","name":"CarTopSpeedWindowingExample","isStoppable":false,"state":"CANCELED", ...omitted...} {code} - Cancel the job again {code} $ curl -XPATCH http://localhost:8081/jobs/b577a914ecdf710d4b93a84105dea0c9\?mode\=cancel {"errors":["Job could not be found."]} (HTTP 404) {code} Since the actual job resource still exists, I think status code 404 is confusing. A better message would indicate that the job is already in state {{CANCELED}}. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-8304) Document Kubernetes and Flink HA setup
Ufuk Celebi created FLINK-8304: -- Summary: Document Kubernetes and Flink HA setup Key: FLINK-8304 URL: https://issues.apache.org/jira/browse/FLINK-8304 Project: Flink Issue Type: Improvement Components: Documentation Reporter: Ufuk Celebi Currently the Flink on Kubernetes documentation does not mention anything about running Flink in HA mode. We should add at least the following two things: - Currently, there cannot be a standby JobManager pod due to the way Flink HA works - `high-availability.jobmanager.port` has to be set to a port that is exposed via Kubernetes -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (FLINK-8253) flink-docs-stable still points to 1.3
Ufuk Celebi created FLINK-8253: -- Summary: flink-docs-stable still points to 1.3 Key: FLINK-8253 URL: https://issues.apache.org/jira/browse/FLINK-8253 Project: Flink Issue Type: Bug Components: Documentation, Project Website Reporter: Ufuk Celebi Assignee: Ufuk Celebi https://ci.apache.org/projects/flink/flink-docs-stable/ redirects to 1.3. As noted in https://cwiki.apache.org/confluence/display/FLINK/Creating+a+Flink+Release we have to update the stable alias in flink.conf found in https://svn.apache.org/repos/infra/infrastructure/buildbot/aegis/buildmaster/master1/projects. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (FLINK-8138) Race in TaskAsyncCallTest leads to test time out
Ufuk Celebi created FLINK-8138: -- Summary: Race in TaskAsyncCallTest leads to test time out Key: FLINK-8138 URL: https://issues.apache.org/jira/browse/FLINK-8138 Project: Flink Issue Type: Bug Components: Tests Reporter: Ufuk Celebi Priority: Minor {{TaskAsyncCallTest#testSetsUserCodeClassLoader}} times out with a stack trace on Travis on a personal branch with unrelated changes on top of 1.4.0 RC 0. I've attached the Travis output to this issue. The main thread is stuck in {code} "main" #1 prio=5 os_prio=0 tid=0x7ff59000a000 nid=0xb9b in Object.wait() [0x7ff598965000] java.lang.Thread.State: WAITING (on object monitor) at java.lang.Object.wait(Native Method) - waiting on <0x833994c8> (a java.lang.Object) at java.lang.Object.wait(Object.java:502) at org.apache.flink.core.testutils.OneShotLatch.await(OneShotLatch.java:56) - locked <0x833994c8> (a java.lang.Object) at org.apache.flink.runtime.taskmanager.TaskAsyncCallTest.testSetsUserCodeClassLoader(TaskAsyncCallTest.java:201) {code} There are no other Flink related threads alive. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (FLINK-7677) Add side outputs to ProcessWindowFunction
Ufuk Celebi created FLINK-7677: -- Summary: Add side outputs to ProcessWindowFunction Key: FLINK-7677 URL: https://issues.apache.org/jira/browse/FLINK-7677 Project: Flink Issue Type: Improvement Components: DataStream API Affects Versions: 1.4.0 Reporter: Ufuk Celebi As per the discussion on the user mailing list [1], we should add the required context to collect to side outputs in {{ProcessWindowFunction}}, similar to {{ProcessFunction}}. [1] http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/question-on-sideoutput-from-ProcessWindow-function-td15500.html -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (FLINK-7676) ContinuousFileMonitoringFunction fails with GoogleHadoopFileSystem
Ufuk Celebi created FLINK-7676: -- Summary: ContinuousFileMonitoringFunction fails with GoogleHadoopFileSystem Key: FLINK-7676 URL: https://issues.apache.org/jira/browse/FLINK-7676 Project: Flink Issue Type: Bug Components: Streaming Connectors Reporter: Ufuk Celebi Priority: Minor The following check in ContinuousFileMonitoringFunction fails when running against a file in Google Cloud Storage: {code} Path p = new Path(path); FileSystem fileSystem = FileSystem.get(p.toUri()); if (!fileSystem.exists(p)) { throw new FileNotFoundException("The provided file path " + path + " does not exist."); } {code} I suspect this has something to do with the consistency guarantees provided by GCS. I'm wondering if it's better to fail lazily at a later stage (e.g. when opening the stream and it doesn't work). After removing this check, everything works as expected. I can also run a batch WordCount job against the same file. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (FLINK-7674) NullPointerException in ContinuousFileMonitoringFunction close
Ufuk Celebi created FLINK-7674: -- Summary: NullPointerException in ContinuousFileMonitoringFunction close Key: FLINK-7674 URL: https://issues.apache.org/jira/browse/FLINK-7674 Project: Flink Issue Type: Bug Components: Streaming Connectors Affects Versions: 1.4.0 Reporter: Ufuk Celebi Priority: Minor If the ContinuousFileMonitoringFunction is closed before run is called (because initialization fails), we get a NullPointerException, because checkpointLock has not been set. {code} synchronized (checkpointLock) { globalModificationTime = Long.MAX_VALUE; isRunning = false; } {code} This results in a follow up error log: {code} 2017-09-23 10:25:04,096 ERROR org.apache.flink.streaming.runtime.tasks.StreamTask - Error during disposal of stream operator. java.lang.NullPointerException at org.apache.flink.streaming.api.functions.source.ContinuousFileMonitoringFunction.close(ContinuousFileMonitoringFunction.java:337) at org.apache.flink.api.common.functions.util.FunctionUtils.closeFunction(FunctionUtils.java:43) at org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator.dispose(AbstractUdfStreamOperator.java:126) at org.apache.flink.streaming.runtime.tasks.StreamTask.disposeAllOperators(StreamTask.java:429) at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:334) at org.apache.flink.runtime.taskmanager.Task.run(Task.java:702) at java.lang.Thread.run(Thread.java:748) {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029)
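A defensive fix can be sketched as follows (a simplified stand-in with hypothetical names, not the actual operator code): close() only synchronizes on the lock if run() has actually set it.

```java
/**
 * Sketch of the defensive fix for the NPE described above, not the actual
 * Flink source: guard against close() running before run() has initialized
 * the checkpoint lock.
 */
public class MonitoringFunctionSketch {
    private volatile Object checkpointLock; // set in run(); may still be null in close()
    private volatile boolean isRunning = true;

    public void run() {
        checkpointLock = new Object();
    }

    public void close() {
        Object lock = checkpointLock;
        if (lock != null) {
            synchronized (lock) {
                isRunning = false;
            }
        } else {
            // run() was never called (initialization failed); nothing to synchronize on.
            isRunning = false;
        }
    }

    public boolean isRunning() {
        return isRunning;
    }
}
```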
[jira] [Created] (FLINK-7666) ContinuousFileReaderOperator swallows chained watermarks
Ufuk Celebi created FLINK-7666: -- Summary: ContinuousFileReaderOperator swallows chained watermarks Key: FLINK-7666 URL: https://issues.apache.org/jira/browse/FLINK-7666 Project: Flink Issue Type: Improvement Components: Streaming Connectors Affects Versions: 1.3.2 Reporter: Ufuk Celebi I use event time and read from a (finite) file. I assign watermarks right after the {{ContinuousFileReaderOperator}} with parallelism 1. {code} env .readFile(new TextInputFormat(...), ...) .setParallelism(1) .assignTimestampsAndWatermarks(...) .setParallelism(1) .map()... {code} The watermarks I assign never progress through the pipeline. I can work around this by inserting a {{shuffle()}} after the file reader or starting a new chain at the assigner: {code} env .readFile(new TextInputFormat(...), ...) .setParallelism(1) .shuffle() .assignTimestampsAndWatermarks(...) .setParallelism(1) .map()... {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (FLINK-7665) Use wait/notify in ContinuousFileReaderOperator
Ufuk Celebi created FLINK-7665: -- Summary: Use wait/notify in ContinuousFileReaderOperator Key: FLINK-7665 URL: https://issues.apache.org/jira/browse/FLINK-7665 Project: Flink Issue Type: Improvement Components: Streaming Connectors Affects Versions: 1.4.0 Reporter: Ufuk Celebi Priority: Minor {{ContinuousFileReaderOperator}} has the following loop to receive input splits: {code} synchronized (checkpointLock) { if (currentSplit == null) { currentSplit = this.pendingSplits.poll(); if (currentSplit == null) { if (this.shouldClose) { isRunning = false; } else { checkpointLock.wait(50); } continue; } } } {code} I think we can replace this with a {{wait()}} and {{notify()}} in {{addSplit}} and {{close}}. If there is a reason to keep the {{wait(50)}}, feel free to close this issue. :-) -- This message was sent by Atlassian JIRA (v6.4.14#64029)
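The proposed hand-off could look roughly like this (a simplified model with hypothetical names, not the actual ContinuousFileReaderOperator): the reader blocks until addSplit() or close() signals it, instead of polling with wait(50).

```java
import java.util.ArrayDeque;
import java.util.Queue;

/** Sketch of the proposed wait()/notify() hand-off; not Flink's actual operator code. */
public class SplitQueueSketch {
    private final Object lock = new Object();
    private final Queue<String> pendingSplits = new ArrayDeque<>();
    private boolean shouldClose;

    public void addSplit(String split) {
        synchronized (lock) {
            pendingSplits.add(split);
            lock.notifyAll(); // wake the reader instead of letting it poll with wait(50)
        }
    }

    public void close() {
        synchronized (lock) {
            shouldClose = true;
            lock.notifyAll();
        }
    }

    /** Blocks until a split arrives or close() is called; returns null on close. */
    public String nextSplit() {
        synchronized (lock) {
            while (pendingSplits.isEmpty() && !shouldClose) {
                try {
                    lock.wait(); // no timeout needed: addSplit()/close() notify us
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return null;
                }
            }
            return pendingSplits.poll(); // null when closed with no pending splits
        }
    }
}
```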
[jira] [Created] (FLINK-7643) HadoopFileSystem always reloads GlobalConfiguration
Ufuk Celebi created FLINK-7643: -- Summary: HadoopFileSystem always reloads GlobalConfiguration Key: FLINK-7643 URL: https://issues.apache.org/jira/browse/FLINK-7643 Project: Flink Issue Type: Bug Components: State Backends, Checkpointing Reporter: Ufuk Celebi HadoopFileSystem always reloads GlobalConfiguration, which potentially leads to a lot of noise in the logs, because this happens on each checkpoint. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (FLINK-7595) Removing stateless task from task chain breaks savepoint restore
Ufuk Celebi created FLINK-7595: -- Summary: Removing stateless task from task chain breaks savepoint restore Key: FLINK-7595 URL: https://issues.apache.org/jira/browse/FLINK-7595 Project: Flink Issue Type: Bug Components: State Backends, Checkpointing Reporter: Ufuk Celebi Assignee: Chesnay Schepler When removing a stateless operator from a 2-task chain where the head operator is stateful breaks savepoint restore with {code} Caused by: java.lang.IllegalStateException: Failed to rollback to savepoint /var/folders/py/s_1l8vln6f19ygc77m8c4zhrgn/T/junit1167397515334838028/junit8006766303945373008/savepoint-cb0bcf-3cfa67865ac0. Cannot map savepoint state... {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (FLINK-7258) IllegalArgumentException in Netty bootstrap with large memory state segment size
Ufuk Celebi created FLINK-7258: -- Summary: IllegalArgumentException in Netty bootstrap with large memory state segment size Key: FLINK-7258 URL: https://issues.apache.org/jira/browse/FLINK-7258 Project: Flink Issue Type: Bug Components: Network Affects Versions: 1.3.1 Reporter: Ufuk Celebi Assignee: Ufuk Celebi In NettyBootstrap we configure the low and high watermarks in the following order: {code} bootstrap.childOption(ChannelOption.WRITE_BUFFER_LOW_WATER_MARK, config.getMemorySegmentSize() + 1); bootstrap.childOption(ChannelOption.WRITE_BUFFER_HIGH_WATER_MARK, 2 * config.getMemorySegmentSize()); {code} When the memory segment size is higher than the default high watermark, this throws an `IllegalArgumentException` when a client tries to connect. Hence, this unfortunately only fails at runtime when an intermediate result is requested. A simple fix is to first configure the high watermark and only then configure the low watermark. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
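The order dependency can be illustrated with a simplified model of the water-mark validation (the defaults and checks below are meant to mirror Netty's behavior, but this is not Netty code): each setter validates against the current value of the other mark, so the order of the two calls matters once values exceed the defaults.

```java
/**
 * Simplified model of Netty's write-buffer water-mark validation; not Netty
 * code. Each setter checks against the current value of the other mark, so
 * setting the low mark above the (still default) high mark fails.
 */
public class WaterMarkConfigSketch {
    // Netty's default low/high water marks are 32 KiB / 64 KiB.
    private int low = 32 * 1024;
    private int high = 64 * 1024;

    public void setLowWaterMark(int newLow) {
        if (newLow > high) {
            throw new IllegalArgumentException("low cannot exceed high (" + high + ")");
        }
        low = newLow;
    }

    public void setHighWaterMark(int newHigh) {
        if (newHigh < low) {
            throw new IllegalArgumentException("high cannot be below low (" + low + ")");
        }
        high = newHigh;
    }
}
```

Setting the high water mark first (to `2 * segmentSize`) makes the subsequent low-water-mark call valid for any segment size, which is the fix described above.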
[jira] [Created] (FLINK-7127) Remove unnecessary null check or add null check
Ufuk Celebi created FLINK-7127: -- Summary: Remove unnecessary null check or add null check Key: FLINK-7127 URL: https://issues.apache.org/jira/browse/FLINK-7127 Project: Flink Issue Type: Improvement Components: State Backends, Checkpointing Reporter: Ufuk Celebi In {{HeapKeyedStateBackend#snapshot}} we have: {code}
for (Map.Entry<String, StateTable<K, ?, ?>> kvState : stateTables.entrySet()) {
  // 1) Here we don't check for null
  metaInfoSnapshots.add(kvState.getValue().getMetaInfo().snapshot());
  kVStateToId.put(kvState.getKey(), kVStateToId.size());
  // 2) Here we check for null
  StateTable<K, ?, ?> stateTable = kvState.getValue();
  if (null != stateTable) {
    cowStateStableSnapshots.put(stateTable, stateTable.createSnapshot());
  }
}
{code} Either 1) can lead to an NPE and we should add a null check there as well, or the null check in 2) is unnecessary and should be removed. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (FLINK-7067) Cancel with savepoint does not restart checkpoint scheduler on failure
Ufuk Celebi created FLINK-7067: -- Summary: Cancel with savepoint does not restart checkpoint scheduler on failure Key: FLINK-7067 URL: https://issues.apache.org/jira/browse/FLINK-7067 Project: Flink Issue Type: Bug Components: State Backends, Checkpointing Affects Versions: 1.3.1 Reporter: Ufuk Celebi The `CancelWithSavepoint` action of the JobManager first stops the checkpoint scheduler, then triggers a savepoint, and cancels the job after the savepoint completes. If the savepoint fails, the command should not have any side effects, and we don't cancel the job. The issue is that the checkpoint scheduler is not restarted in this case. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
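The intended behavior can be sketched as follows (hypothetical names, not the JobManager's actual code): if triggering the savepoint fails, the scheduler that was stopped beforehand is restarted so the command leaves no side effects.

```java
/**
 * Sketch of the proposed fix; hypothetical names, not Flink's JobManager code.
 */
public class CancelWithSavepointSketch {

    /** Stand-in for the savepoint trigger; may throw on failure. */
    public interface SavepointAction {
        void trigger() throws Exception;
    }

    private boolean schedulerRunning = true;

    public boolean isSchedulerRunning() {
        return schedulerRunning;
    }

    public void cancelWithSavepoint(SavepointAction savepoint) {
        // Stop periodic checkpoints while the savepoint is in progress.
        schedulerRunning = false;
        try {
            savepoint.trigger();
            // Success: the job is cancelled (omitted); the scheduler stays stopped.
        } catch (Exception e) {
            // Proposed fix: a failed savepoint must leave no side effects,
            // so restart the checkpoint scheduler.
            schedulerRunning = true;
        }
    }
}
```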
[jira] [Created] (FLINK-6952) Add link to Javadocs
Ufuk Celebi created FLINK-6952: -- Summary: Add link to Javadocs Key: FLINK-6952 URL: https://issues.apache.org/jira/browse/FLINK-6952 Project: Flink Issue Type: Improvement Components: Documentation Reporter: Ufuk Celebi Assignee: Ufuk Celebi Priority: Minor The project webpage and the docs are missing links to the Javadocs. I think we should add them as part of the external links at the bottom of the doc navigation (above "Project Page"). In the same manner we could add a link to the Scaladocs, but if I remember correctly there was a problem with the build of the Scaladocs. Correct, [~aljoscha]? -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (FLINK-6182) Fix possible NPE in SourceStreamTask
Ufuk Celebi created FLINK-6182: -- Summary: Fix possible NPE in SourceStreamTask Key: FLINK-6182 URL: https://issues.apache.org/jira/browse/FLINK-6182 Project: Flink Issue Type: Bug Components: Local Runtime Reporter: Ufuk Celebi Priority: Minor If SourceStreamTask is cancelled before being invoked, `headOperator` is not set yet, which leads to an NPE. This is not critical but leads to noisy logs. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (FLINK-6175) HistoryServerTest.testFullArchiveLifecycle fails
Ufuk Celebi created FLINK-6175: -- Summary: HistoryServerTest.testFullArchiveLifecycle fails Key: FLINK-6175 URL: https://issues.apache.org/jira/browse/FLINK-6175 Project: Flink Issue Type: Test Components: Tests, Webfrontend Reporter: Ufuk Celebi Assignee: Chesnay Schepler https://s3.amazonaws.com/archive.travis-ci.org/jobs/213933823/log.txt {code} testFullArchiveLifecycle(org.apache.flink.runtime.webmonitor.history.HistoryServerTest) Time elapsed: 2.162 sec <<< FAILURE! java.lang.AssertionError: /joboverview.json did not contain valid json at org.junit.Assert.fail(Assert.java:88) at org.junit.Assert.assertTrue(Assert.java:41) at org.junit.Assert.assertNotNull(Assert.java:712) at org.apache.flink.runtime.webmonitor.history.HistoryServerTest.testFullArchiveLifecycle(HistoryServerTest.java:98) {code} Happened on a branch with unrelated changes [~Zentol]. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (FLINK-6171) Some checkpoint metrics rely on latest stat snapshot
Ufuk Celebi created FLINK-6171: -- Summary: Some checkpoint metrics rely on latest stat snapshot Key: FLINK-6171 URL: https://issues.apache.org/jira/browse/FLINK-6171 Project: Flink Issue Type: Bug Components: Metrics, State Backends, Checkpointing, Webfrontend Reporter: Ufuk Celebi Some checkpoint metrics use the latest stats snapshot to get the returned metric value. These snapshots are only updated when the {{WebRuntimeMonitor}} actually requests some stats (web UI or REST API). In practice, this means that these metrics are only updated when users are browsing the web UI. Instead of relying on the latest snapshot, the checkpoint metrics should be directly updated via the completion callbacks. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
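The proposed push-based update can be sketched like this (a simplified stand-in for Flink's metric classes, not the actual implementation): the metric value is written directly from the checkpoint completion callback, instead of being derived from a stats snapshot that is only refreshed when the web UI polls.

```java
import java.util.concurrent.atomic.AtomicLong;

/**
 * Sketch of a checkpoint metric updated directly from the completion
 * callback; simplified stand-in, not Flink's metric classes.
 */
public class CheckpointMetricsSketch {
    private final AtomicLong lastCheckpointDurationMillis = new AtomicLong(-1);

    /** Called from the checkpoint completion callback. */
    public void onCheckpointCompleted(long durationMillis) {
        lastCheckpointDurationMillis.set(durationMillis);
    }

    /** Gauge read by the metrics reporter; always current, no snapshot involved. */
    public long lastCheckpointDuration() {
        return lastCheckpointDurationMillis.get();
    }
}
```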
[jira] [Created] (FLINK-6170) Some checkpoint metrics rely on latest stat snapshot
Ufuk Celebi created FLINK-6170: -- Summary: Some checkpoint metrics rely on latest stat snapshot Key: FLINK-6170 URL: https://issues.apache.org/jira/browse/FLINK-6170 Project: Flink Issue Type: Bug Components: Metrics, State Backends, Checkpointing, Webfrontend Reporter: Ufuk Celebi Some checkpoint metrics use the latest stats snapshot to get the returned metric value. These snapshots are only updated when the {{WebRuntimeMonitor}} actually requests some stats (web UI or REST API). In practice, this means that these metrics are only updated when users are browsing the web UI. Instead of relying on the latest snapshot, the checkpoint metrics should be directly updated via the completion callbacks. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (FLINK-6127) Add MissingDeprecatedCheck to checkstyle
Ufuk Celebi created FLINK-6127: -- Summary: Add MissingDeprecatedCheck to checkstyle Key: FLINK-6127 URL: https://issues.apache.org/jira/browse/FLINK-6127 Project: Flink Issue Type: Improvement Components: Build System Reporter: Ufuk Celebi Priority: Minor We should add the MissingDeprecatedCheck to our checkstyle rules to help avoid deprecations that lack JavaDocs explaining why the deprecation happened. http://checkstyle.sourceforge.net/apidocs/com/puppycrawl/tools/checkstyle/checks/annotation/MissingDeprecatedCheck.html -- This message was sent by Atlassian JIRA (v6.3.15#6346)
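The check verifies that the {{@Deprecated}} annotation and the {{@deprecated}} Javadoc tag appear together. A minimal illustration of what would pass the check (the class and method names here are made up, not Flink code):

```java
// Illustrative only: MissingDeprecatedCheck flags a member carrying one
// of the two deprecation markers but not the other.
public class LegacyApi {

    /**
     * Returns the job name.
     *
     * @deprecated Use {@link #jobName()} instead; kept for API compatibility.
     */
    @Deprecated // annotation AND Javadoc tag must both be present to pass
    public String getJobName() {
        return jobName();
    }

    /** Preferred replacement. */
    public String jobName() {
        return "example-job";
    }
}
```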
[jira] [Created] (FLINK-5999) MiniClusterITCase.runJobWithMultipleRpcServices fails
Ufuk Celebi created FLINK-5999: -- Summary: MiniClusterITCase.runJobWithMultipleRpcServices fails Key: FLINK-5999 URL: https://issues.apache.org/jira/browse/FLINK-5999 Project: Flink Issue Type: Test Components: Distributed Coordination, Tests Reporter: Ufuk Celebi In a branch with unrelated changes to the web frontend I've seen the following test fail: {code} runJobWithMultipleRpcServices(org.apache.flink.runtime.minicluster.MiniClusterITCase) Time elapsed: 1.145 sec <<< ERROR! java.util.ConcurrentModificationException: null at java.util.HashMap$HashIterator.nextNode(HashMap.java:1429) at java.util.HashMap$ValueIterator.next(HashMap.java:1458) at org.apache.flink.runtime.resourcemanager.JobLeaderIdService.clear(JobLeaderIdService.java:114) at org.apache.flink.runtime.resourcemanager.JobLeaderIdService.stop(JobLeaderIdService.java:92) at org.apache.flink.runtime.resourcemanager.ResourceManager.shutDown(ResourceManager.java:182) at org.apache.flink.runtime.resourcemanager.ResourceManagerRunner.shutDownInternally(ResourceManagerRunner.java:83) at org.apache.flink.runtime.resourcemanager.ResourceManagerRunner.shutDown(ResourceManagerRunner.java:78) at org.apache.flink.runtime.minicluster.MiniCluster.shutdownInternally(MiniCluster.java:313) at org.apache.flink.runtime.minicluster.MiniCluster.shutdown(MiniCluster.java:281) at org.apache.flink.runtime.minicluster.MiniClusterITCase.runJobWithMultipleRpcServices(MiniClusterITCase.java:72) {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (FLINK-5993) Add section about Azure deployment
Ufuk Celebi created FLINK-5993: -- Summary: Add section about Azure deployment Key: FLINK-5993 URL: https://issues.apache.org/jira/browse/FLINK-5993 Project: Flink Issue Type: Improvement Components: Documentation Reporter: Ufuk Celebi Running Flink on Azure can lead to unexpected problems. From the user mailing list: {quote} For anyone seeing this thread in the future, we managed to solve the issue. The problem was in the Azure storage SDK. Flink is using Hadoop 2.7, so we used version 2.7.3 of the Hadoop-azure package. This package uses version 2.0.0 of the azure-storage package, dated from 2014. It has several bugs that were since fixed, specifically one where the socket timeout was infinite. We updated this package to version 5.0.0 and everything is working smoothly now. {quote} (http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Flink-checkpointing-gets-stuck-td11776.html) At least this caveat should be covered. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (FLINK-5928) Externalized checkpoints overwriting each other
Ufuk Celebi created FLINK-5928: -- Summary: Externalized checkpoints overwriting each other Key: FLINK-5928 URL: https://issues.apache.org/jira/browse/FLINK-5928 Project: Flink Issue Type: Bug Components: State Backends, Checkpointing Reporter: Ufuk Celebi Assignee: Ufuk Celebi Priority: Critical I noticed that PR #3346 accidentally broke externalized checkpoints by using a fixed meta data file name. We should restore the old behaviour of creating random files and double check why no test caught this. This will likely be superseded by upcoming changes from [~StephanEwen] to use metadata streams on the JobManager. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (FLINK-5926) Show state backend information in web UI
Ufuk Celebi created FLINK-5926: -- Summary: Show state backend information in web UI Key: FLINK-5926 URL: https://issues.apache.org/jira/browse/FLINK-5926 Project: Flink Issue Type: Improvement Components: Webfrontend Reporter: Ufuk Celebi Priority: Minor With #3411 and follow ups, the state backends will be available at the job manager via the snapshot settings. We should extend the checkpoints/configuration tab in the web UI to also show the configured state backend. https://github.com/apache/flink/pull/3411 -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (FLINK-5923) Test instability in SavepointITCase testTriggerSavepointAndResume
Ufuk Celebi created FLINK-5923: -- Summary: Test instability in SavepointITCase testTriggerSavepointAndResume Key: FLINK-5923 URL: https://issues.apache.org/jira/browse/FLINK-5923 Project: Flink Issue Type: Test Components: Tests Reporter: Ufuk Celebi Assignee: Ufuk Celebi https://s3.amazonaws.com/archive.travis-ci.org/jobs/205042538/log.txt {code} Failed tests: SavepointITCase.testTriggerSavepointAndResume:258 Checkpoints directory not cleaned up: [/tmp/junit1029044621247843839/junit7338507921051602138/checkpoints/47fa12635d098bdafd52def453e6d66c/chk-4] expected:<0> but was:<1> {code} I think this is due to a race in the test. When shutting down the cluster it can happen that in progress checkpoints linger around. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (FLINK-5876) Mention Scala type fallacies for queryable state client serializers
Ufuk Celebi created FLINK-5876: -- Summary: Mention Scala type fallacies for queryable state client serializers Key: FLINK-5876 URL: https://issues.apache.org/jira/browse/FLINK-5876 Project: Flink Issue Type: Improvement Components: Documentation Reporter: Ufuk Celebi FLINK-5801 shows a very hard to debug issue when querying state from the Scala API that can result in serializer mismatches between the client and the job. We should mention that the type serializers should be created via the Scala macros. {code} import org.apache.flink.streaming.api.scala._ ... val keySerializer = createTypeInformation[Long].createSerializer(new ExecutionConfig) {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (FLINK-5824) Fix String/byte conversion without explicit encodings
Ufuk Celebi created FLINK-5824: -- Summary: Fix String/byte conversion without explicit encodings Key: FLINK-5824 URL: https://issues.apache.org/jira/browse/FLINK-5824 Project: Flink Issue Type: Bug Components: Python API, Queryable State, State Backends, Checkpointing, Webfrontend Reporter: Ufuk Celebi In a couple of places we convert Strings to bytes and bytes back to Strings without explicitly specifying an encoding. This can lead to problems when client and server default encodings differ. The task of this JIRA is to go over the whole project and look for conversions where we don't specify an encoding and fix it to specify UTF-8 explicitly. For starters, we can {{grep -R 'getBytes()' .}}, which already reveals many problematic places. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
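The pattern to fix can be sketched in a few lines. This helper is illustrative, not Flink code: the point is that {{getBytes()}} depends on the JVM's platform default charset, while passing {{StandardCharsets.UTF_8}} makes the result deterministic on every machine.

```java
import java.nio.charset.StandardCharsets;

// Illustrative helper: the same conversion with and without an explicit
// encoding.
public class Utf8Conversion {

    // Problematic: uses the JVM's platform default charset, so the bytes
    // can differ between client and server.
    static byte[] implicitBytes(String s) {
        return s.getBytes();
    }

    // Fixed: always UTF-8, independent of any default encoding.
    static byte[] explicitBytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Decoding must name the same charset as encoding.
    static String decode(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }
}
```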
[jira] [Created] (FLINK-5781) Generate HTML from ConfigOption
Ufuk Celebi created FLINK-5781: -- Summary: Generate HTML from ConfigOption Key: FLINK-5781 URL: https://issues.apache.org/jira/browse/FLINK-5781 Project: Flink Issue Type: Sub-task Components: Documentation Reporter: Ufuk Celebi Assignee: Ufuk Celebi Use the ConfigOption instances to generate an HTML page that we can include in the docs configuration page. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (FLINK-5780) Extend ConfigOption with descriptions
Ufuk Celebi created FLINK-5780: -- Summary: Extend ConfigOption with descriptions Key: FLINK-5780 URL: https://issues.apache.org/jira/browse/FLINK-5780 Project: Flink Issue Type: Sub-task Components: Core, Documentation Reporter: Ufuk Celebi Assignee: Ufuk Celebi The {{ConfigOption}} type is meant to replace the flat {{ConfigConstants}}. As part of automating the generation of a docs config page we need to extend {{ConfigOption}} with description fields. From the ML discussion, these could be: {code} void shortDescription(String); void longDescription(String); {code} In practice, the description string should contain HTML/Markdown. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
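The proposed fields could also be written in a fluent style so that option declarations read as one expression. The following is a sketch under that assumption; the class name and builder style are illustrative and do not match the actual Flink API.

```java
// Sketch only: a pared-down ConfigOption with the proposed description
// fields. The real class has keys, defaults, deprecated keys, and more.
public class ConfigOptionSketch<T> {

    private final String key;
    private final T defaultValue;
    private String shortDescription = "";
    private String longDescription = "";

    public ConfigOptionSketch(String key, T defaultValue) {
        this.key = key;
        this.defaultValue = defaultValue;
    }

    // Returning 'this' keeps declarations readable as one chained call.
    public ConfigOptionSketch<T> shortDescription(String text) {
        this.shortDescription = text;
        return this;
    }

    public ConfigOptionSketch<T> longDescription(String text) {
        this.longDescription = text;
        return this;
    }

    public String key() { return key; }
    public T defaultValue() { return defaultValue; }
    public String shortDescriptionText() { return shortDescription; }
    public String longDescriptionText() { return longDescription; }
}
```

A docs generator could then iterate over such instances and emit one HTML table row per option.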
[jira] [Created] (FLINK-5779) Auto generate configuration docs
Ufuk Celebi created FLINK-5779: -- Summary: Auto generate configuration docs Key: FLINK-5779 URL: https://issues.apache.org/jira/browse/FLINK-5779 Project: Flink Issue Type: Improvement Components: Documentation Reporter: Ufuk Celebi Assignee: Ufuk Celebi As per discussion on the mailing list we need to improve on the configuration documentation page (http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/Discuss-Organizing-Documentation-for-Configuration-Options-td15773.html). We decided to try to (semi-)automate this in order to not get out of sync in the future. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (FLINK-5778) Split FileStateHandle into fileName and basePath
Ufuk Celebi created FLINK-5778: -- Summary: Split FileStateHandle into fileName and basePath Key: FLINK-5778 URL: https://issues.apache.org/jira/browse/FLINK-5778 Project: Flink Issue Type: Sub-task Components: State Backends, Checkpointing Reporter: Ufuk Celebi Assignee: Ufuk Celebi Store the statePath as a basePath and a fileName and allow overwriting the basePath. We cannot overwrite the base path as long as the state handle is still in flight and not persisted. Otherwise we risk a resource leak. We need this in order to be able to relocate savepoints. {code} interface RelativeBaseLocationStreamStateHandle { void clearBaseLocation(); void setBaseLocation(String baseLocation); } {code} FileStateHandle should implement this and the SavepointSerializer should forward the calls when a savepoint is stored or loaded: clear before store and set after load. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (FLINK-5777) Pass savepoint information to CheckpointingOperation
Ufuk Celebi created FLINK-5777: -- Summary: Pass savepoint information to CheckpointingOperation Key: FLINK-5777 URL: https://issues.apache.org/jira/browse/FLINK-5777 Project: Flink Issue Type: Sub-task Components: State Backends, Checkpointing Reporter: Ufuk Celebi Assignee: Ufuk Celebi In order to make savepoints self contained in a single directory, we need to pass some information to {{StreamTask#CheckpointingOperation}}. I propose to extend the {{CheckpointMetaData}} for this. We currently have some overlap with CheckpointMetaData, CheckpointMetrics, and manually passed checkpoint ID and checkpoint timestamps. We should restrict CheckpointMetaData to the integral meta data that needs to be passed to StreamTask#CheckpointingOperation. This means that we move the CheckpointMetrics out of the CheckpointMetaData and the BarrierBuffer/BarrierTracker create CheckpointMetrics separately and send it back with the acknowledge message. CheckpointMetaData should be extended with the following properties: - boolean isSavepoint - String targetDirectory There are two code paths that lead to the CheckpointingOperation: 1. From CheckpointCoordinator via RPC to StreamTask#triggerCheckpoint - Execution#triggerCheckpoint(long, long) => triggerCheckpoint(CheckpointMetaData) - TaskManagerGateway#triggerCheckpoint(ExecutionAttemptID, JobID, long, long) => TaskManagerGateway#triggerCheckpoint(ExecutionAttemptID, JobID, CheckpointMetaData) - Task#triggerCheckpointBarrier(long, long) => Task#triggerCheckpointBarrier(CheckpointMetaData) 2. 
From intermediate streams via the CheckpointBarrier to StreamTask#triggerCheckpointOnBarrier - triggerCheckpointOnBarrier(CheckpointMetaData) => triggerCheckpointOnBarrier(CheckpointMetaData, CheckpointMetrics) - CheckpointBarrier(long, long) => CheckpointBarrier(CheckpointMetaData) - AcknowledgeCheckpoint(CheckpointMetaData) => AcknowledgeCheckpoint(long, CheckpointMetrics) The state backends provide another stream factory that is called in CheckpointingOperation when the meta data indicates savepoint. The state backends can choose whether they return the regular checkpoint stream factory in that case or a special one for savepoints. That way backends that don’t checkpoint to a file system can special case savepoints easily. - FsStateBackend: return special FsCheckpointStreamFactory with different directory layout - MemoryStateBackend: return regular checkpoint stream factory (MemCheckpointStreamFactory) => The _metadata file will contain all state as the state handles are part of it -- This message was sent by Atlassian JIRA (v6.3.15#6346)
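The proposed extension of {{CheckpointMetaData}} can be sketched as follows. Field and accessor names follow the two bullet points above; everything else is illustrative, and the real class carries more state (metrics were proposed to be moved out).

```java
// Sketch of the proposed CheckpointMetaData after the split: integral
// meta data only, extended with the savepoint flag and target directory.
public class CheckpointMetaDataSketch {

    private final long checkpointId;
    private final long timestamp;
    private final boolean isSavepoint;
    private final String targetDirectory; // null for regular checkpoints

    public CheckpointMetaDataSketch(long checkpointId, long timestamp,
                                    boolean isSavepoint, String targetDirectory) {
        this.checkpointId = checkpointId;
        this.timestamp = timestamp;
        this.isSavepoint = isSavepoint;
        this.targetDirectory = targetDirectory;
    }

    // The state backend checks this to decide whether to return the
    // regular checkpoint stream factory or a savepoint-specific one.
    public boolean isSavepoint() { return isSavepoint; }

    public String getTargetDirectory() { return targetDirectory; }
    public long getCheckpointId() { return checkpointId; }
    public long getTimestamp() { return timestamp; }
}
```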
[jira] [Created] (FLINK-5765) Add noindex meta tag for older docs
Ufuk Celebi created FLINK-5765: -- Summary: Add noindex meta tag for older docs Key: FLINK-5765 URL: https://issues.apache.org/jira/browse/FLINK-5765 Project: Flink Issue Type: Improvement Components: Documentation Reporter: Ufuk Celebi Priority: Minor We should add the noindex meta tags (https://support.google.com/webmasters/answer/93710?hl=en_topic=4598466) to doc versions that are not stable and not the latest snapshot, e.g. all docs <= 1.1. There are some config values in the {{_config.yml}} file that we can use to do this automatically. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (FLINK-5764) Don't serve ancient docs
Ufuk Celebi created FLINK-5764: -- Summary: Don't serve ancient docs Key: FLINK-5764 URL: https://issues.apache.org/jira/browse/FLINK-5764 Project: Flink Issue Type: Improvement Components: Documentation Reporter: Ufuk Celebi Priority: Minor We serve very old doc versions that are quite popular when searching for Flink. An example from a discussion on the mailing list: Here's a concrete example. Do a google search for "flink hello world". The first three results are: https://ci.apache.org/projects/flink/flink-docs-release-0.7/java_api_quickstart.html https://ci.apache.org/projects/flink/flink-docs-release-0.8/scala_api_quickstart.html https://ci.apache.org/projects/flink/flink-docs-release-1.2/.../java_api_quickstart.html (http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Remove-old-lt-1-0-documentation-from-google-td11541.html) We should remove these old versions and let them redirect to the latest stable docs. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (FLINK-5745) Set uncaught exception handler for Netty threads
Ufuk Celebi created FLINK-5745: -- Summary: Set uncaught exception handler for Netty threads Key: FLINK-5745 URL: https://issues.apache.org/jira/browse/FLINK-5745 Project: Flink Issue Type: Improvement Components: Network Reporter: Ufuk Celebi Priority: Minor We pass a thread factory for the Netty event loop threads (see {{NettyServer}} and {{NettyClient}}), but don't set an uncaught exception handler. Let's add a JVM-terminating handler that exits the process in case of fatal errors. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
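One way to do this is to wrap the existing thread factory so every created thread gets a handler installed. This is a sketch, not the actual NettyServer/NettyClient wiring: a real fatal handler would log the error and terminate the JVM, while this one only records the failure so the example stays side-effect free.

```java
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.atomic.AtomicBoolean;

// Sketch: delegate thread creation and install an uncaught exception
// handler on every thread the wrapped factory produces.
public class FatalExitThreadFactory implements ThreadFactory {

    static final AtomicBoolean fatalErrorSeen = new AtomicBoolean(false);

    private final ThreadFactory delegate;

    public FatalExitThreadFactory(ThreadFactory delegate) {
        this.delegate = delegate;
    }

    @Override
    public Thread newThread(Runnable r) {
        Thread t = delegate.newThread(r);
        t.setUncaughtExceptionHandler((thread, error) -> {
            fatalErrorSeen.set(true);
            // A production version would log 'error' and then terminate
            // the JVM (e.g. Runtime.getRuntime().halt(-1)), because the
            // event loop thread is no longer usable after this point.
        });
        return t;
    }
}
```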
[jira] [Created] (FLINK-5731) Split up CI builds
Ufuk Celebi created FLINK-5731: -- Summary: Split up CI builds Key: FLINK-5731 URL: https://issues.apache.org/jira/browse/FLINK-5731 Project: Flink Issue Type: Improvement Components: Build System, Tests Reporter: Ufuk Celebi Assignee: Robert Metzger Priority: Critical Test builds regularly time out because we are hitting the Travis 50 min limit. Previously, we worked around this by splitting up the tests into groups. I think we have to split them further. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (FLINK-5675) Improve QueryableStateClient API
Ufuk Celebi created FLINK-5675: -- Summary: Improve QueryableStateClient API Key: FLINK-5675 URL: https://issues.apache.org/jira/browse/FLINK-5675 Project: Flink Issue Type: Improvement Components: Queryable State Reporter: Ufuk Celebi Issue to track queryable state client API changes. This is created in order to group some of the queryable state issues that we created for 1.2 but couldn't address. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-5670) Local RocksDB directories not cleaned up
Ufuk Celebi created FLINK-5670: -- Summary: Local RocksDB directories not cleaned up Key: FLINK-5670 URL: https://issues.apache.org/jira/browse/FLINK-5670 Project: Flink Issue Type: Bug Components: State Backends, Checkpointing Reporter: Ufuk Celebi Priority: Minor After cancelling a job with a RocksDB backend all files are properly cleaned up, but the parent directories still exist and are empty: {code} 859546fec3dac36bb9fcc8cbdd4e291e +- StreamFlatMap_3_0 +- StreamFlatMap_3_3 +- StreamFlatMap_3_4 +- StreamFlatMap_3_5 +- StreamFlatMap_3_6 {code} The number of empty folders varies between runs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-5666) Blob files are not cleaned up from ZK storage directory
Ufuk Celebi created FLINK-5666: -- Summary: Blob files are not cleaned up from ZK storage directory Key: FLINK-5666 URL: https://issues.apache.org/jira/browse/FLINK-5666 Project: Flink Issue Type: Bug Components: State Backends, Checkpointing Reporter: Ufuk Celebi When running a job with HA in a standalone cluster, the blob files are not cleaned up from the ZooKeeper storage directory. {{:zkStorageDir/blob/cache/blob_:rand}} [~NicoK] Have you seen such behaviour while refactoring the blob server? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-5665) Lingering files after failed checkpoint
Ufuk Celebi created FLINK-5665: -- Summary: Lingering files after failed checkpoint Key: FLINK-5665 URL: https://issues.apache.org/jira/browse/FLINK-5665 Project: Flink Issue Type: Bug Components: State Backends, Checkpointing Reporter: Ufuk Celebi When I discovered FLINK-5663, the first two checkpoints did not go through because of the exception at the task manager and I got a bunch of zero byte files: {code} total 0 drwxr-xr-x 29 uce staff 986B Jan 26 17:29 . drwxr-xr-x 6 uce staff 204B Jan 26 17:45 .. -rw-r--r-- 1 uce staff 0B Jan 26 17:29 03b9af6c-cff4-44f5-b235-20a1acae57c2 -rw-r--r-- 1 uce staff 0B Jan 26 17:29 091cf308-a15b-42b0-9987-25932b3a7dc9 -rw-r--r-- 1 uce staff 0B Jan 26 17:29 1f4850e2-bd46-4ec4-9540-6f22631b7b60 -rw-r--r-- 1 uce staff 0B Jan 26 17:29 24195811-87cc-4cf9-b67a-c842aa469590 -rw-r--r-- 1 uce staff 0B Jan 26 17:29 2d03efcd-2cbb-4f1f-88c7-2977215890ca -rw-r--r-- 1 uce staff 0B Jan 26 17:29 2d9ede49-fd37-479e-9973-6af3030464ef -rw-r--r-- 1 uce staff 0B Jan 26 17:29 39f5ef6e-e2c6-4bad-aeb4-0579b6d06d51 -rw-r--r-- 1 uce staff 0B Jan 26 17:29 487eb08f-60d1-44a3-92eb-ad00552047b0 -rw-r--r-- 1 uce staff 0B Jan 26 17:29 50702820-2006-43d4-9528-c5d37652c782 -rw-r--r-- 1 uce staff 0B Jan 26 17:29 515b5222-f996-4c1f-9f1f-1086040531dc -rw-r--r-- 1 uce staff 0B Jan 26 17:29 647826ae-64e8-4407-94d8-c7f8cc629e78 -rw-r--r-- 1 uce staff 0B Jan 26 17:29 6a211145-74a0-4711-bcba-d55e3ce2516d -rw-r--r-- 1 uce staff 0B Jan 26 17:29 6c99c24a-90ed-45ad-928a-c78cb677646c -rw-r--r-- 1 uce staff 0B Jan 26 17:29 752045e3-1db3-4b4e-8149-15f619c6ecc2 -rw-r--r-- 1 uce staff 0B Jan 26 17:29 774a364b-43a3-4f29-a127-ea54f8ccb093 -rw-r--r-- 1 uce staff 0B Jan 26 17:29 7d93ecbe-5138-432c-bdcc-dd9b8bfdd92e -rw-r--r-- 1 uce staff 0B Jan 26 17:29 7f664b11-afb0-45ca-9371-0c28c59d6e15 -rw-r--r-- 1 uce staff 0B Jan 26 17:29 8fb1f4d6-3e32-46ff-af40-5d9c2193bc96 -rw-r--r-- 1 uce staff 0B Jan 26 17:29 94d41bdc-73b4-45ca-b37c-9d8c46114530 -rw-r--r-- 1 uce staff 0B Jan 26 17:29 
af8ba3b6-036c-4c19-a78a-922c6b9dde35 -rw-r--r-- 1 uce staff 0B Jan 26 17:29 b0c5ba31-677c-4bf7-a82b-b2b3df2f8d43 -rw-r--r-- 1 uce staff 0B Jan 26 17:29 ba748b35-d6b6-4734-b28d-f82917aab54c -rw-r--r-- 1 uce staff 0B Jan 26 17:29 be3ea3c6-7ebb-4a07-a48d-5c5d3e2fc1ea -rw-r--r-- 1 uce staff 0B Jan 26 17:29 e0cb285c-ec38-4390-ae98-bec95c520743 -rw-r--r-- 1 uce staff 0B Jan 26 17:29 ea14095a-3527-4762-888a-74308010a702 -rw-r--r-- 1 uce staff 0B Jan 26 17:29 eac774d5-1564-41c1-9015-4c64701ff4bf -rw-r--r-- 1 uce staff 0B Jan 26 17:29 ebf3e9c7-cd96-4f32-a4c0-2ec3d8f44e60 {code} {code} total 0 drwxr-xr-x 29 uce staff 986B Jan 26 17:29 . drwxr-xr-x 6 uce staff 204B Jan 26 17:45 .. -rw-r--r-- 1 uce staff 0B Jan 26 17:29 00cb4c5b-b2d2-4934-9f90-e68606038bbc -rw-r--r-- 1 uce staff 0B Jan 26 17:29 01ca2c72-b861-4d64-bee9-3b72f10f91cc -rw-r--r-- 1 uce staff 0B Jan 26 17:29 0a27c89b-cb7f-433f-9980-fda6bf8265cc -rw-r--r-- 1 uce staff 0B Jan 26 17:29 0bff0fe5-9b46-42f6-94be-a0f3bfa4c77d -rw-r--r-- 1 uce staff 0B Jan 26 17:29 0c8c85fc-71fe-4feb-bb06-d4fb5004281b -rw-r--r-- 1 uce staff 0B Jan 26 17:29 2506b73f-790a-4a3c-b8b9-8296766760ef -rw-r--r-- 1 uce staff 0B Jan 26 17:29 260fc17d-1de3-4d71-ba99-ce9b4aa44f19 -rw-r--r-- 1 uce staff 0B Jan 26 17:29 45399fab-c7df-4c21-9b2f-e1047a82aa74 -rw-r--r-- 1 uce staff 0B Jan 26 17:29 4e67ff32-2aea-4c6f-a061-7152420684eb -rw-r--r-- 1 uce staff 0B Jan 26 17:29 4ffe9e50-e7c9-4e52-8725-8d63f5c50d17 -rw-r--r-- 1 uce staff 0B Jan 26 17:29 57e30213-45af-4645-83e3-109ab7b52829 -rw-r--r-- 1 uce staff 0B Jan 26 17:29 5ab0676d-5af6-4229-8af3-2206a4b4c263 -rw-r--r-- 1 uce staff 0B Jan 26 17:29 6240e59a-d7b0-468f-b228-9f3c1318c98f -rw-r--r-- 1 uce staff 0B Jan 26 17:29 63c099bc-a30e-4780-a390-f007c468e50e -rw-r--r-- 1 uce staff 0B Jan 26 17:29 76d35d1a-1164-4455-80a8-7a07ffa6f26b -rw-r--r-- 1 uce staff 0B Jan 26 17:29 7889bed2-a1e3-457a-996a-30116ce16983 -rw-r--r-- 1 uce staff 0B Jan 26 17:29 7c8caa1d-0cd9-4e0e-ac2d-31cc8df440f7 -rw-r--r-- 1 uce staff 0B 
Jan 26 17:29 8eb2849e-5352-433f-ba3c-86f57d6db9a1 -rw-r--r-- 1 uce staff 0B Jan 26 17:29 8f1884c5-f052-4099-abbe-f6895ffd01f0 -rw-r--r-- 1 uce staff 0B Jan 26 17:29 a19973c4-6a74-44c8-b2f7-b72fe2b68f38 -rw-r--r-- 1 uce staff 0B Jan 26 17:29 a7552360-feae-4b85-82de-0cfd6fe8e954 -rw-r--r-- 1 uce staff 0B Jan 26 17:29 bb241a71-3fa8-476b-97cc-0bb85708c811 -rw-r--r-- 1 uce staff 0B Jan 26 17:29 c9a7f073-f8ba-460d-b34b-9ead355e168a -rw-r--r-- 1 uce staff
[jira] [Created] (FLINK-5664) RocksDBBackend logging is noisy
Ufuk Celebi created FLINK-5664: -- Summary: RocksDBBackend logging is noisy Key: FLINK-5664 URL: https://issues.apache.org/jira/browse/FLINK-5664 Project: Flink Issue Type: Improvement Components: State Backends, Checkpointing Reporter: Ufuk Celebi When running a job with RocksDB and a low checkpointing interval, the logs are flooded with: {code} 2017-01-26 17:29:39,345 INFO org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend - Asynchronous RocksDB snapshot (File Stream Factory @ file:/Users/uce/Desktop/1-2-testing/fs/83889867a493a1dc80f6c588c071b679, synchronous part) in thread Thread[Flat Map -> Sink: Unnamed (3/8),5,Flink Task Threads] took 0 ms. 2017-01-26 17:29:39,398 INFO org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend - Asynchronous RocksDB snapshot (File Stream Factory @ file:/Users/uce/Desktop/1-2-testing/fs/83889867a493a1dc80f6c588c071b679, asynchronous part) in thread Thread[pool-88-thread-1,5,Flink Task Threads] took 53 ms. {code} That is two log lines per stateful task per task manager per checkpoint. I can see that this is helpful for debugging, but maybe too much by default. Should we decrease the log level to debug? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-5663) Checkpoint fails because of closed registry
Ufuk Celebi created FLINK-5663: -- Summary: Checkpoint fails because of closed registry Key: FLINK-5663 URL: https://issues.apache.org/jira/browse/FLINK-5663 Project: Flink Issue Type: Bug Components: State Backends, Checkpointing Reporter: Ufuk Celebi While testing the 1.2.0 release I got the following Exception: {code} 2017-01-26 17:29:20,602 INFO org.apache.flink.runtime.taskmanager.Task - Source: Custom Source (3/8) (2dbce778c4e53a39dec3558e868ceef4) switched from RUNNING to FAILED. java.lang.Exception: Error while triggering checkpoint 2 for Source: Custom Source (3/8) at org.apache.flink.runtime.taskmanager.Task$3.run(Task.java:1117) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: java.lang.Exception: Could not perform checkpoint 2 for operator Source: Custom Source (3/8). at org.apache.flink.streaming.runtime.tasks.StreamTask.triggerCheckpoint(StreamTask.java:533) at org.apache.flink.runtime.taskmanager.Task$3.run(Task.java:1108) ... 5 more Caused by: java.lang.Exception: Could not complete snapshot 2 for operator Source: Custom Source (3/8). 
at org.apache.flink.streaming.api.operators.AbstractStreamOperator.snapshotState(AbstractStreamOperator.java:372) at org.apache.flink.streaming.runtime.tasks.StreamTask$CheckpointingOperation.checkpointStreamOperator(StreamTask.java:1116) at org.apache.flink.streaming.runtime.tasks.StreamTask$CheckpointingOperation.executeCheckpointing(StreamTask.java:1052) at org.apache.flink.streaming.runtime.tasks.StreamTask.checkpointState(StreamTask.java:640) at org.apache.flink.streaming.runtime.tasks.StreamTask.performCheckpoint(StreamTask.java:585) at org.apache.flink.streaming.runtime.tasks.StreamTask.triggerCheckpoint(StreamTask.java:528) ... 6 more Caused by: java.io.IOException: Could not flush and close the file system output stream to file:/Users/uce/Desktop/1-2-testing/fs/83889867a493a1dc80f6c588c071b679/chk-2/e4415d0d-719c-48df-91a9-3171ba468152 in order to obtain the stream state handle at org.apache.flink.runtime.state.filesystem.FsCheckpointStreamFactory$FsCheckpointStateOutputStream.closeAndGetHandle(FsCheckpointStreamFactory.java:333) at org.apache.flink.runtime.state.DefaultOperatorStateBackend.snapshot(DefaultOperatorStateBackend.java:200) at org.apache.flink.streaming.api.operators.AbstractStreamOperator.snapshotState(AbstractStreamOperator.java:357) ... 11 more Caused by: java.io.IOException: Could not open output stream for state backend at org.apache.flink.runtime.state.filesystem.FsCheckpointStreamFactory$FsCheckpointStateOutputStream.createStream(FsCheckpointStreamFactory.java:368) at org.apache.flink.runtime.state.filesystem.FsCheckpointStreamFactory$FsCheckpointStateOutputStream.flush(FsCheckpointStreamFactory.java:225) at org.apache.flink.runtime.state.filesystem.FsCheckpointStreamFactory$FsCheckpointStateOutputStream.closeAndGetHandle(FsCheckpointStreamFactory.java:305) ... 13 more Caused by: java.io.IOException: Cannot register Closeable, registry is already closed. Closing argument. 
at org.apache.flink.util.AbstractCloseableRegistry.registerClosable(AbstractCloseableRegistry.java:63) at org.apache.flink.core.fs.ClosingFSDataOutputStream.wrapSafe(ClosingFSDataOutputStream.java:99) at org.apache.flink.core.fs.SafetyNetWrapperFileSystem.create(SafetyNetWrapperFileSystem.java:123) at org.apache.flink.runtime.state.filesystem.FsCheckpointStreamFactory$FsCheckpointStateOutputStream.createStream(FsCheckpointStreamFactory.java:359) ... 15 more {code} The job recovered and kept running. [~stefanrichte...@gmail.com] Can this be a race with the closable registry? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-5618) Add queryable state documentation
Ufuk Celebi created FLINK-5618: -- Summary: Add queryable state documentation Key: FLINK-5618 URL: https://issues.apache.org/jira/browse/FLINK-5618 Project: Flink Issue Type: Improvement Components: Documentation Reporter: Ufuk Celebi Add docs about how to use queryable state. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-5611) Add QueryableStateException type
Ufuk Celebi created FLINK-5611: -- Summary: Add QueryableStateException type Key: FLINK-5611 URL: https://issues.apache.org/jira/browse/FLINK-5611 Project: Flink Issue Type: Improvement Components: Queryable State Reporter: Ufuk Celebi Assignee: Ufuk Celebi Priority: Minor We currently have exceptions like {{UnknownJobManager}} that should be subtypes of a to-be-introduced {{QueryableStateException}}. Right now, they extend checked and unchecked Exceptions. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-5609) Add last update time to docs
Ufuk Celebi created FLINK-5609: -- Summary: Add last update time to docs Key: FLINK-5609 URL: https://issues.apache.org/jira/browse/FLINK-5609 Project: Flink Issue Type: Improvement Components: Documentation Reporter: Ufuk Celebi Assignee: Ufuk Celebi Add a small text to the start page stating when the docs were last updated. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-5607) Move location lookup retry out of KvStateLocationLookupService
Ufuk Celebi created FLINK-5607: -- Summary: Move location lookup retry out of KvStateLocationLookupService Key: FLINK-5607 URL: https://issues.apache.org/jira/browse/FLINK-5607 Project: Flink Issue Type: Improvement Components: Queryable State Reporter: Ufuk Celebi Assignee: Ufuk Celebi If a state location lookup fails because of an {{UnknownJobManager}}, the lookup service will automagically retry this. I think it's better to move this out of the lookup service and have the retry handled outside by the caller. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-5606) Remove magic number in key and namespace serialization
Ufuk Celebi created FLINK-5606: -- Summary: Remove magic number in key and namespace serialization Key: FLINK-5606 URL: https://issues.apache.org/jira/browse/FLINK-5606 Project: Flink Issue Type: Improvement Components: Queryable State Reporter: Ufuk Celebi Assignee: Ufuk Celebi Priority: Minor The serialized key and namespace for state queries contains a magic number between the key and namespace: {{key|42|namespace}}. This was for historical reasons in order to skip deserialization of the key and namespace for our old {{RocksDBStateBackend}} which used the same format. This has now been superseded by the keygroup aware state backends and there is no point in doing this. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
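The legacy layout versus the proposed one can be illustrated in a few lines. The helper names below are made up for illustration; the actual serialization lives in the queryable state code.

```java
// Sketch of the wire layouts discussed above.
public class KeyNamespaceLayout {

    static final byte MAGIC_NUMBER = 42;

    // Legacy format: key | 42 | namespace
    static byte[] withMagicNumber(byte[] key, byte[] namespace) {
        byte[] out = new byte[key.length + 1 + namespace.length];
        System.arraycopy(key, 0, out, 0, key.length);
        out[key.length] = MAGIC_NUMBER; // separator with no remaining purpose
        System.arraycopy(namespace, 0, out, key.length + 1, namespace.length);
        return out;
    }

    // Proposed format: key | namespace
    static byte[] withoutMagicNumber(byte[] key, byte[] namespace) {
        byte[] out = new byte[key.length + namespace.length];
        System.arraycopy(key, 0, out, 0, key.length);
        System.arraycopy(namespace, 0, out, key.length, namespace.length);
        return out;
    }
}
```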
[jira] [Created] (FLINK-5605) Make KvStateRequestSerializer package private
Ufuk Celebi created FLINK-5605: -- Summary: Make KvStateRequestSerializer package private Key: FLINK-5605 URL: https://issues.apache.org/jira/browse/FLINK-5605 Project: Flink Issue Type: Improvement Components: Queryable State Reporter: Ufuk Celebi Priority: Minor From early users I've seen that many people use the KvStateRequestSerializer in their programs. This was actually meant as an internal package to be used by the client and server for internal message serialization. I vote to make this package private and create an explicit {{QueryableStateClientUtil}} for user serialization. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-5604) Replace QueryableStateClient constructor
Ufuk Celebi created FLINK-5604: -- Summary: Replace QueryableStateClient constructor Key: FLINK-5604 URL: https://issues.apache.org/jira/browse/FLINK-5604 Project: Flink Issue Type: Improvement Components: Queryable State Reporter: Ufuk Celebi Assignee: Ufuk Celebi The {{QueryableStateClient}} constructor expects a configuration object which makes it very hard for users to see what's expected to be there and what not. I propose to split this constructor up and add some static helper for the common cases. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
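The proposed split can be sketched as follows. The class and factory names are hypothetical; the point is that static helpers make the required settings explicit instead of hiding them inside an opaque configuration object.

```java
// Sketch of the proposed API: a private constructor plus static factories
// for the common cases.
public class QueryableStateClientSketch {

    private final String jobManagerHost;
    private final int jobManagerPort;

    private QueryableStateClientSketch(String host, int port) {
        this.jobManagerHost = host;
        this.jobManagerPort = port;
    }

    // Common case: connect to a known JobManager address. Further static
    // helpers could cover HA setups or defaults from a config file.
    public static QueryableStateClientSketch forJobManager(String host, int port) {
        return new QueryableStateClientSketch(host, port);
    }

    public String jobManagerHost() { return jobManagerHost; }
    public int jobManagerPort() { return jobManagerPort; }
}
```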
[jira] [Created] (FLINK-5603) Use Flink's futures in QueryableStateClient
Ufuk Celebi created FLINK-5603: -- Summary: Use Flink's futures in QueryableStateClient Key: FLINK-5603 URL: https://issues.apache.org/jira/browse/FLINK-5603 Project: Flink Issue Type: Improvement Components: Queryable State Reporter: Ufuk Celebi Assignee: Ufuk Celebi Priority: Minor The current {{QueryableStateClient}} exposes Scala's Futures as the return type for queries. Since we are trying to get away from hard Scala dependencies in the current master, we should proactively replace this with Flink's Future interface. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-5602) Migration with RocksDB job led to NPE for next checkpoint
Ufuk Celebi created FLINK-5602: -- Summary: Migration with RocksDB job led to NPE for next checkpoint Key: FLINK-5602 URL: https://issues.apache.org/jira/browse/FLINK-5602 Project: Flink Issue Type: Bug Reporter: Ufuk Celebi When migrating a job with RocksDB I got the following exception when the next checkpoint was triggered. This only happened once and I could not reproduce it since. [~stefanrichte...@gmail.com] Maybe we can look over the code and check what could have failed here? Unfortunately, I don't have more of the stack trace available. I don't think this will be very helpful, will it? {code} at org.apache.flink.util.Preconditions.checkNotNull(Preconditions.java:58) at org.apache.flink.runtime.state.KeyedBackendSerializationProxy$StateMetaInfo.(KeyedBackendSerializationProxy.java:126) at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBSnapshotOperation.writeKVStateMetaData(RocksDBKeyedStateBackend.java:471) at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBSnapshotOperation.writeDBSnapshot(RocksDBKeyedStateBackend.java:382) at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$1.performOperation(RocksDBKeyedStateBackend.java:280) at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$1.performOperation(RocksDBKeyedStateBackend.java:262) at org.apache.flink.runtime.io.async.AbstractAsyncIOCallable.call(AbstractAsyncIOCallable.java:72) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at org.apache.flink.util.FutureUtil.runIfNotDoneAndGet(FutureUtil.java:37) ... 6 more {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-5600) Improve error message when triggering savepoint without specified directory
Ufuk Celebi created FLINK-5600: -- Summary: Improve error message when triggering savepoint without specified directory Key: FLINK-5600 URL: https://issues.apache.org/jira/browse/FLINK-5600 Project: Flink Issue Type: Improvement Components: State Backends, Checkpointing Reporter: Ufuk Celebi Priority: Minor When triggering a savepoint w/o specifying a custom target directory or having configured a default directory, we get a quite long stack trace: {code} java.lang.Exception: Failed to trigger savepoint at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1.applyOrElse(JobManager.scala:801) at scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33) at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33) at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25) at org.apache.flink.runtime.LeaderSessionMessageFilter$$anonfun$receive$1.applyOrElse(LeaderSessionMessageFilter.scala:36) at scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33) at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33) at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25) at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:33) at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:28) at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118) at org.apache.flink.runtime.LogMessages$$anon$1.applyOrElse(LogMessages.scala:28) at akka.actor.Actor$class.aroundReceive(Actor.scala:467) at org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:118) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516) at akka.actor.ActorCell.invoke(ActorCell.scala:487) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238) at akka.dispatch.Mailbox.run(Mailbox.scala:220) at 
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:397) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) Caused by: java.lang.IllegalStateException: No savepoint directory configured. You can either specify a directory when triggering this savepoint or configure a cluster-wide default via key 'state.savepoints.dir'. at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1.applyOrElse(JobManager.scala:764) ... 22 more {code} This is already quite good, because the Exception says what can be done to work around this problem, but we can make it even better by handling this error in the client and printing a more explicit message. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-5599) State interface docs often refer to keyed state only
Ufuk Celebi created FLINK-5599: -- Summary: State interface docs often refer to keyed state only Key: FLINK-5599 URL: https://issues.apache.org/jira/browse/FLINK-5599 Project: Flink Issue Type: Improvement Components: State Backends, Checkpointing Reporter: Ufuk Celebi Priority: Minor The JavaDocs of the {{State}} interface (and related classes) often mention keyed state only as the state interface was only exposed for keyed state until Flink 1.1. With the new {{CheckpointedFunction}} interface, this has changed and the docs should be adjusted accordingly. Would be nice to address this with 1.2.0 so that the JavaDocs are updated for users. [~stefanrichte...@gmail.com] or [~aljoscha] maybe you can have a look at this briefly? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-5587) AsyncWaitOperatorTest timed out on Travis
Ufuk Celebi created FLINK-5587: -- Summary: AsyncWaitOperatorTest timed out on Travis Key: FLINK-5587 URL: https://issues.apache.org/jira/browse/FLINK-5587 Project: Flink Issue Type: Test Reporter: Ufuk Celebi The Maven watchdog script cancelled the build and printed a stack trace for {{AsyncWaitOperatorTest.testOperatorChainWithProcessingTime(AsyncWaitOperatorTest.java:379)}}. https://s3.amazonaws.com/archive.travis-ci.org/jobs/192441719/log.txt -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-5574) Add checkpoint statistics docs
Ufuk Celebi created FLINK-5574: -- Summary: Add checkpoint statistics docs Key: FLINK-5574 URL: https://issues.apache.org/jira/browse/FLINK-5574 Project: Flink Issue Type: Improvement Components: Documentation Reporter: Ufuk Celebi Assignee: Ufuk Celebi Priority: Minor Add docs about the current state of checkpoint monitoring. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-5560) Header in checkpoint stats summary misaligned
Ufuk Celebi created FLINK-5560: -- Summary: Header in checkpoint stats summary misaligned Key: FLINK-5560 URL: https://issues.apache.org/jira/browse/FLINK-5560 Project: Flink Issue Type: Bug Components: Webfrontend Reporter: Ufuk Celebi Assignee: Ufuk Celebi Priority: Minor The checkpoint summary stats table header line is misaligned. The first and second head columns need to be swapped. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-5556) BarrierBuffer resets bytes written on spiller roll over
Ufuk Celebi created FLINK-5556: -- Summary: BarrierBuffer resets bytes written on spiller roll over Key: FLINK-5556 URL: https://issues.apache.org/jira/browse/FLINK-5556 Project: Flink Issue Type: Bug Components: State Backends, Checkpointing Reporter: Ufuk Celebi Assignee: Ufuk Celebi Priority: Minor When rolling over a spilled sequence of buffers, the tracked bytes written are reset to 0. They are reported to the checkpoint listener right after this operation, which results in the reported buffered bytes always being 0. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-5510) Replace Scala Future with FlinkFuture in QueryableStateClient
Ufuk Celebi created FLINK-5510: -- Summary: Replace Scala Future with FlinkFuture in QueryableStateClient Key: FLINK-5510 URL: https://issues.apache.org/jira/browse/FLINK-5510 Project: Flink Issue Type: Improvement Components: Queryable State Reporter: Ufuk Celebi Priority: Minor The entry point for queryable state users is the {{QueryableStateClient}} which returns query results via Scala Futures. Since merging the initial version of QueryableState we have introduced the FlinkFuture wrapper type in order to not expose our Scala dependency via the API. Since APIs tend to stick around longer than expected, it might be worthwhile to port the exposed QueryableStateClient interface to use the FlinkFuture. Early users can still get the Scala Future via FlinkFuture#getScalaFuture(). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-5509) Replace QueryableStateClient keyHashCode argument
Ufuk Celebi created FLINK-5509: -- Summary: Replace QueryableStateClient keyHashCode argument Key: FLINK-5509 URL: https://issues.apache.org/jira/browse/FLINK-5509 Project: Flink Issue Type: Improvement Components: Queryable State Reporter: Ufuk Celebi Priority: Minor When going over the low level QueryableStateClient with [~NicoK] we noticed that the key hashCode argument can be confusing to users: {code} Future getKvState( JobID jobId, String name, int keyHashCode, byte[] serializedKeyAndNamespace) {code} The {{keyHashCode}} argument is the result of calling {{hashCode()}} on the key to look up. This is what is sent to the JobManager in order to look up the location of the key. While pretty straightforward, it is repetitive and possibly confusing. As an alternative we suggest to make the method generic and simply call hashCode on the object ourselves. This way the user just provides the key object. Since there are some early users of the queryable state API already, we would suggest to rename the method in order to provoke a compilation error after upgrading to the actually released 1.2 version. (This would also work without renaming since the hashCode of an Integer (what users currently provide) is the same number, but it would be confusing why it actually works.) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
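A minimal sketch of the proposed generic lookup, assuming the client simply derives the hash from the key object itself. Class and method names are illustrative, not the actual Flink API:

```java
// Hypothetical sketch: instead of an int keyHashCode parameter, the method is
// generic over the key type and the client calls hashCode() internally.
// Names are illustrative only, not the actual Flink API.
public class KeyHashSketch {
    // Proposed replacement for the explicit keyHashCode argument.
    static <K> int deriveKeyHash(K key) {
        return key.hashCode();
    }

    public static void main(String[] args) {
        // Integer.hashCode() returns the int value itself, which is why passing
        // the raw key would "accidentally" match the old keyHashCode argument.
        System.out.println(deriveKeyHash(42));      // prints 42
        System.out.println(deriveKeyHash("count")); // some String hash code
    }
}
```

This also illustrates the remark above: for Integer keys the derived hash equals the value, so the old and new signatures would behave identically even without the proposed rename.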
[jira] [Created] (FLINK-5484) Kryo serialization changed between 1.1 and 1.2
Ufuk Celebi created FLINK-5484: -- Summary: Kryo serialization changed between 1.1 and 1.2 Key: FLINK-5484 URL: https://issues.apache.org/jira/browse/FLINK-5484 Project: Flink Issue Type: Bug Components: Type Serialization System Reporter: Ufuk Celebi I think the way that Kryo serializes data changed between 1.1 and 1.2. I have a generic Object that is serialized as part of a 1.1 savepoint that I cannot resume from with 1.2: {code} org.apache.flink.client.program.ProgramInvocationException: The program execution failed: Job execution failed. at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:427) at org.apache.flink.client.program.StandaloneClusterClient.submitJob(StandaloneClusterClient.java:101) at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:400) at org.apache.flink.streaming.api.environment.StreamContextEnvironment.execute(StreamContextEnvironment.java:68) at org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.execute(StreamExecutionEnvironment.java:1486) at com.dataartisans.DidKryoChange.main(DidKryoChange.java:74) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:528) at org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:419) at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:339) at org.apache.flink.client.CliFrontend.executeProgram(CliFrontend.java:831) at org.apache.flink.client.CliFrontend.run(CliFrontend.java:256) at org.apache.flink.client.CliFrontend.parseParameters(CliFrontend.java:1073) at org.apache.flink.client.CliFrontend$2.call(CliFrontend.java:1120) at 
org.apache.flink.client.CliFrontend$2.call(CliFrontend.java:1117) at org.apache.flink.runtime.security.HadoopSecurityContext$1.run(HadoopSecurityContext.java:43) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548) at org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:40) at org.apache.flink.client.CliFrontend.main(CliFrontend.java:1117) Caused by: org.apache.flink.runtime.client.JobExecutionException: Job execution failed. at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1$$anonfun$applyOrElse$7.apply$mcV$sp(JobManager.scala:900) at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1$$anonfun$applyOrElse$7.apply(JobManager.scala:843) at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1$$anonfun$applyOrElse$7.apply(JobManager.scala:843) at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24) at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24) at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:40) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:397) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) Caused by: java.lang.IllegalStateException: Could not initialize keyed state backend. 
at org.apache.flink.streaming.api.operators.AbstractStreamOperator.initKeyedState(AbstractStreamOperator.java:286) at org.apache.flink.streaming.api.operators.AbstractStreamOperator.initializeState(AbstractStreamOperator.java:199) at org.apache.flink.streaming.runtime.tasks.StreamTask.initializeOperators(StreamTask.java:649) at org.apache.flink.streaming.runtime.tasks.StreamTask.initializeState(StreamTask.java:636) at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:254) at org.apache.flink.runtime.taskmanager.Task.run(Task.java:654) at java.lang.Thread.run(Thread.java:745) Caused by: java.lang.RuntimeException: com.esotericsoftware.kryo.KryoException: Unable to find class: f at
[jira] [Created] (FLINK-5467) Stateless chained tasks set legacy operator state
Ufuk Celebi created FLINK-5467: -- Summary: Stateless chained tasks set legacy operator state Key: FLINK-5467 URL: https://issues.apache.org/jira/browse/FLINK-5467 Project: Flink Issue Type: Bug Components: State Backends, Checkpointing Reporter: Ufuk Celebi I discovered this while trying to rescale a job with a Kafka source with a chained stateless operator. Looking into it, it turns out that this fails because the checkpointed state contains legacy operator state for the chained operator although it is stateless. /cc [~aljoscha] You mentioned that this might be a possible duplicate? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-5466) Make production environment default in gulpfile
Ufuk Celebi created FLINK-5466: -- Summary: Make production environment default in gulpfile Key: FLINK-5466 URL: https://issues.apache.org/jira/browse/FLINK-5466 Project: Flink Issue Type: Improvement Components: Webfrontend Affects Versions: 1.1.4, 1.2.0 Reporter: Ufuk Celebi Currently the default environment set in our gulpfile is development, which leads to very large generated JS files. When building the web UI we apparently forgot to set the environment to production (built via gulp production). Since this is likely to happen again, we should make production the default environment and require selecting development explicitly. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-5448) Fix typo in StateAssignmentOperation Exception
Ufuk Celebi created FLINK-5448: -- Summary: Fix typo in StateAssignmentOperation Exception Key: FLINK-5448 URL: https://issues.apache.org/jira/browse/FLINK-5448 Project: Flink Issue Type: Improvement Components: State Backends, Checkpointing Reporter: Ufuk Celebi Priority: Trivial {code} Cannot restore the latest checkpoint because the operator cbc357ccb763df2852fee8c4fc7d55f2 has non-partitioned state and its parallelism changed. The operatorcbc357ccb763df2852fee8c4fc7d55f2 has parallelism 2 whereas the correspondingstate object has a parallelism of 4 {code} White space is missing in some places. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-5440) Misleading error message when migrating and scaling down from savepoint
Ufuk Celebi created FLINK-5440: -- Summary: Misleading error message when migrating and scaling down from savepoint Key: FLINK-5440 URL: https://issues.apache.org/jira/browse/FLINK-5440 Project: Flink Issue Type: Improvement Components: State Backends, Checkpointing Reporter: Ufuk Celebi Priority: Minor When resuming from a 1.1 savepoint with 1.2 and reducing the parallelism (while correctly setting the max parallelism), the error message says something about a missing operator, which is misleading. Restoring from the same savepoint with the savepoint parallelism works as expected. Instead, the message should state that this kind of operation is not possible. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-5439) Adjust max parallelism when migrating
Ufuk Celebi created FLINK-5439: -- Summary: Adjust max parallelism when migrating Key: FLINK-5439 URL: https://issues.apache.org/jira/browse/FLINK-5439 Project: Flink Issue Type: Improvement Components: State Backends, Checkpointing Reporter: Ufuk Celebi Priority: Minor When migrating from v1 savepoints, which don't have the notion of a max parallelism, the job needs to explicitly set the max parallelism to the parallelism of the savepoint. [~stefanrichte...@gmail.com] If this is not trivially implemented, let's close this as won't fix. -- This message was sent by Atlassian JIRA (v6.3.4#6332)