[jira] [Closed] (FLINK-7316) always use off-heap network buffers

2017-11-24 Thread Stefan Richter (JIRA)
[ https://issues.apache.org/jira/browse/FLINK-7316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stefan Richter closed FLINK-7316. - Resolution: Fixed Fix Version/s: 1.5.0 Merged in 1854a3de19. > always use off-h

[jira] [Reopened] (FLINK-5465) RocksDB fails with segfault while calling AbstractRocksDBState.clear()

2017-11-23 Thread Stefan Richter (JIRA)
[ https://issues.apache.org/jira/browse/FLINK-5465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stefan Richter reopened FLINK-5465: --- Assignee: Stefan Richter Reconsidered for a better solution because JVM crashed can also

[jira] [Closed] (FLINK-5465) RocksDB fails with segfault while calling AbstractRocksDBState.clear()

2017-11-23 Thread Stefan Richter (JIRA)
[ https://issues.apache.org/jira/browse/FLINK-5465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stefan Richter closed FLINK-5465. - Resolution: Fixed Fix Version/s: 1.5.0 1.4.0 merged in d86c6b6bb3

[jira] [Closed] (FLINK-6505) Proactively cleanup local FS for RocksDBKeyedStateBackend on startup

2017-11-23 Thread Stefan Richter (JIRA)
[ https://issues.apache.org/jira/browse/FLINK-6505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stefan Richter closed FLINK-6505. - Resolution: Fixed Fix Version/s: 1.5.0 1.4.0 merged in ccf917de23

[jira] [Commented] (FLINK-8098) LeaseExpiredException when using FsStateBackend for checkpointing due to multiple mappers tries to access the same file.

2017-11-23 Thread Stefan Richter (JIRA)
[ https://issues.apache.org/jira/browse/FLINK-8098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16264175#comment-16264175 ] Stefan Richter commented on FLINK-8098: --- Parallel scripts was just another option that I have seen

[jira] [Commented] (FLINK-5465) RocksDB fails with segfault while calling AbstractRocksDBState.clear()

2017-11-23 Thread Stefan Richter (JIRA)
[ https://issues.apache.org/jira/browse/FLINK-5465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16264061#comment-16264061 ] Stefan Richter commented on FLINK-5465: --- Hi [~dernasherbrezon], it is true that this problem still

[jira] [Comment Edited] (FLINK-8125) Support limiting the number of open FileSystem connections

2017-11-22 Thread Stefan Richter (JIRA)
[ https://issues.apache.org/jira/browse/FLINK-8125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16262343#comment-16262343 ] Stefan Richter edited comment on FLINK-8125 at 11/22/17 11:27 AM: -- Maybe

[jira] [Commented] (FLINK-8125) Support limiting the number of open FileSystem connections

2017-11-22 Thread Stefan Richter (JIRA)
[ https://issues.apache.org/jira/browse/FLINK-8125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16262343#comment-16262343 ] Stefan Richter commented on FLINK-8125: --- Maybe we can add this monitoring to the already existing

[jira] [Commented] (FLINK-8098) LeaseExpiredException when using FsStateBackend for checkpointing due to multiple mappers tries to access the same file.

2017-11-22 Thread Stefan Richter (JIRA)
[ https://issues.apache.org/jira/browse/FLINK-8098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16262324#comment-16262324 ] Stefan Richter commented on FLINK-8098: --- What do you mean by "parallelism of checkpointing&q

Re: Missing checkpoint when restarting failed job

2017-11-21 Thread Stefan Richter
Hi, where exactly did you read many times that incremental checkpoints cannot reference files from previous checkpoints, because we would have to correct that information. In fact, this is how incremental checkpoints work. Now for this case, I would consider it extremely unlikely that a

Re: HTTP post request example for async IO

2017-11-21 Thread Stefan Richter
Hi, did you see the example code in the documentation here: https://ci.apache.org/projects/flink/flink-docs-release-1.3/dev/stream/asyncio.html ? This code does async calls agains a database, but I think it

Re: How graceful shutdown or resource clean up happens in Flink at task level ?

2017-11-21 Thread Stefan Richter
Hi, the user function’s close() method is called in AbstractStreamOperator::close() and ::dispose(). The invocation of the user function’s close() in AbstractStreamOperator::dispose() only has an effect if there was no previous invocation of the method through AbstractStreamOperator::close().

Re: Apache Flink - Question about TriggerResult.FIRE

2017-11-20 Thread Stefan Richter
Hi, > > "In the first case, it is a new window without the previous elements, in the > second case the window reflects the old contents plus all changes since the > last trigger." > > I am assuming the first case is FIRE and second case is FIRE_AND_PURGE - I > was thinking that in the first

[jira] [Comment Edited] (FLINK-8098) LeaseExpiredException when using FsStateBackend for checkpointing due to multiple mappers tries to access the same file.

2017-11-20 Thread Stefan Richter (JIRA)
[ https://issues.apache.org/jira/browse/FLINK-8098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16259045#comment-16259045 ] Stefan Richter edited comment on FLINK-8098 at 11/20/17 10:05 AM: -- >F

[jira] [Commented] (FLINK-8098) LeaseExpiredException when using FsStateBackend for checkpointing due to multiple mappers tries to access the same file.

2017-11-20 Thread Stefan Richter (JIRA)
[ https://issues.apache.org/jira/browse/FLINK-8098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16259045#comment-16259045 ] Stefan Richter commented on FLINK-8098: --- >From a first glance, this does not look like a Fl

Re: org.apache.flink.runtime.io.network.NetworkEnvironment causing memory leak?

2017-11-16 Thread Stefan Richter
Hi, > In addition to your comments, what are the items retained by > NetworkEnvironment? They grew seems like indefinitely, do they ever reduce? > Mostly the network buffers, which should be ok. They are always recycled and should not be released until the network environment is GCed. > I

Re: [Flink] How to Converting DataStream to Dataset or Table?

2017-11-16 Thread Stefan Richter
Hi, does this answer your question: https://flink.apache.org/news/2017/03/29/table-sql-api-update.html ? Best, Stefan > Am 15.11.2017 um 20:33 schrieb Richard Xin : > > I have DataStream, is there a

[jira] [Commented] (FLINK-7873) Introduce CheckpointCacheManager for reading checkpoint data locally when performing failover

2017-11-16 Thread Stefan Richter (JIRA)
[ https://issues.apache.org/jira/browse/FLINK-7873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16255087#comment-16255087 ] Stefan Richter commented on FLINK-7873: --- Very good, that fits well because we are still pretty busy

[jira] [Commented] (FLINK-7873) Introduce CheckpointCacheManager for reading checkpoint data locally when performing failover

2017-11-16 Thread Stefan Richter (JIRA)
[ https://issues.apache.org/jira/browse/FLINK-7873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16255069#comment-16255069 ] Stefan Richter commented on FLINK-7873: --- [~sihuazhou] If you want, you can also open a PR with your

Re: R/W traffic estimation between Flink and Zookeeper

2017-11-16 Thread Stefan Richter
Hi, I think Zookeeper is only used as a meta data store in HA mode. Interactions with ZK are not part of the per-record stream processing code paths of Flink. Things that are written to ZK can (also depending on your job) include e.g. the job graph, Kafka offsets, or the meta data about

Re: org.apache.flink.runtime.io.network.NetworkEnvironment causing memory leak?

2017-11-16 Thread Stefan Richter
Hi, I cannot spot anything that indicates a leak from your screenshots. Maybe you misinterpret the numbers? In your heap dump, there is only a single instance of org.apache.flink.runtime.io.network.NetworkEnvironment and it retains about 400,000,000 bytes from being GCed because it holds

Re: Apache Flink - Question about TriggerResult.FIRE

2017-11-16 Thread Stefan Richter
Hi, I think the effect is pretty straight forward, the elements in a window are not purged if the trigger is only FIRE and not FIRE_AND_PURGE. Unfortunately, your question is a bit unclear about what exactly you mean by „new window“: a truly „new“ window or another triggering of the previous

Re: RestartStrategies & checkpoints

2017-11-16 Thread Stefan Richter
Hi, if the restart strategy that you set is configured to do restarts and you have checkpoints enabled, it will try to recover from the latest checkpoint. Is there any confusing point in the documentation that made this unclear for you, which we could improve? Best, Stefan > Am 15.11.2017

[jira] [Updated] (FLINK-8086) FlinkKafkaProducer011 can permanently fail in recovery through ProducerFencedException

2017-11-15 Thread Stefan Richter (JIRA)
[ https://issues.apache.org/jira/browse/FLINK-8086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stefan Richter updated FLINK-8086: -- Description: Chaos monkey test in a cluster environment can permanently bring down our

[jira] [Updated] (FLINK-8086) FlinkKafkaProducer011 can permanently fail in recovery through ProducerFencedException

2017-11-15 Thread Stefan Richter (JIRA)
[ https://issues.apache.org/jira/browse/FLINK-8086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stefan Richter updated FLINK-8086: -- Summary: FlinkKafkaProducer011 can permanently fail in recovery through ProducerFencedException

[jira] [Created] (FLINK-8086) FlinkKafkaProducer011 can permanently fail in recovery

2017-11-15 Thread Stefan Richter (JIRA)
Stefan Richter created FLINK-8086: - Summary: FlinkKafkaProducer011 can permanently fail in recovery Key: FLINK-8086 URL: https://issues.apache.org/jira/browse/FLINK-8086 Project: Flink Issue

[jira] [Created] (FLINK-8086) FlinkKafkaProducer011 can permanently fail in recovery

2017-11-15 Thread Stefan Richter (JIRA)
Stefan Richter created FLINK-8086: - Summary: FlinkKafkaProducer011 can permanently fail in recovery Key: FLINK-8086 URL: https://issues.apache.org/jira/browse/FLINK-8086 Project: Flink Issue

[jira] [Closed] (FLINK-8053) Default to asynchronous snapshots for FsStateBackend

2017-11-14 Thread Stefan Richter (JIRA)
[ https://issues.apache.org/jira/browse/FLINK-8053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stefan Richter closed FLINK-8053. - Resolution: Fixed Merged in 2906698b4a. > Default to asynchronous snapshots for FsStateBack

[jira] [Commented] (FLINK-7873) Introduce CheckpointCacheManager for reading checkpoint data locally when performing failover

2017-11-13 Thread Stefan Richter (JIRA)
[ https://issues.apache.org/jira/browse/FLINK-7873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16249823#comment-16249823 ] Stefan Richter commented on FLINK-7873: --- [~sihuazhou] We are already planning to introduce

[jira] [Closed] (FLINK-8040) Test instability ResourceGuard#testCloseBlockIfAcquired

2017-11-13 Thread Stefan Richter (JIRA)
[ https://issues.apache.org/jira/browse/FLINK-8040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stefan Richter closed FLINK-8040. - Resolution: Fixed Merged in ad8ef6d01b. > Test instability ResourceGu

[jira] [Assigned] (FLINK-8053) Default to asynchronous snapshots for FsStateBackend

2017-11-13 Thread Stefan Richter (JIRA)
[ https://issues.apache.org/jira/browse/FLINK-8053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stefan Richter reassigned FLINK-8053: - Assignee: Stefan Richter (was: Aljoscha Krettek) > Default to asynchronous snapsh

Re: [PATCH] firewire-ohci: work around oversized DMA reads on JMicron controllers

2017-11-12 Thread Stefan Richter
issue 0x20-byte DMA reads > > +* for descriptors, even 0x10-byte ones. This can cause page faults when > > +* an IOMMU is in use and the oversized read crosses a page boundary. > > + * Work around this by always leaving at least 0x10 bytes of padding. > > +*/ > > + desc->buffer_size = PAGE_SIZE - offset - 0x10; > > desc->buffer_bus = bus_addr + offset; > > desc->used = 0; > > -- Stefan Richter -=== =-== -==-- http://arcgraph.de/sr/

Re: [PATCH] firewire-ohci: work around oversized DMA reads on JMicron controllers

2017-11-12 Thread Stefan Richter
ors, even 0x10-byte ones. This can cause page faults when > > +* an IOMMU is in use and the oversized read crosses a page boundary. > > +* Work around this by always leaving at least 0x10 bytes of padding. > > +*/ > > + desc->buffer_size = PAGE_SIZE - offset - 0x10; > > desc->buffer_bus = bus_addr + offset; > > desc->used = 0; > > -- Stefan Richter -=== =-== -==-- http://arcgraph.de/sr/

[jira] [Comment Edited] (FLINK-7880) flink-queryable-state-java fails with core-dump

2017-11-06 Thread Stefan Richter (JIRA)
[ https://issues.apache.org/jira/browse/FLINK-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16240153#comment-16240153 ] Stefan Richter edited comment on FLINK-7880 at 11/6/17 11:20 AM: - Yes

[jira] [Commented] (FLINK-7880) flink-queryable-state-java fails with core-dump

2017-11-06 Thread Stefan Richter (JIRA)
[ https://issues.apache.org/jira/browse/FLINK-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16240153#comment-16240153 ] Stefan Richter commented on FLINK-7880: --- Yes, but the test seems to expect that waiting

[jira] [Commented] (FLINK-7880) flink-queryable-state-java fails with core-dump

2017-11-04 Thread Stefan Richter (JIRA)
[ https://issues.apache.org/jira/browse/FLINK-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16239068#comment-16239068 ] Stefan Richter commented on FLINK-7880: --- My theory is that the `dispose()` is not properly executed

[jira] [Comment Edited] (FLINK-7873) Introduce CheckpointCacheManager for reading checkpoint data locally when performing failover

2017-10-31 Thread Stefan Richter (JIRA)
[ https://issues.apache.org/jira/browse/FLINK-7873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16233608#comment-16233608 ] Stefan Richter edited comment on FLINK-7873 at 11/1/17 3:36 AM: The first

[jira] [Commented] (FLINK-7873) Introduce CheckpointCacheManager for reading checkpoint data locally when performing failover

2017-10-31 Thread Stefan Richter (JIRA)
[ https://issues.apache.org/jira/browse/FLINK-7873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16233608#comment-16233608 ] Stefan Richter commented on FLINK-7873: --- The first proposal of this JIRA had some issues

Re: state size effects latency

2017-10-31 Thread Stefan Richter
Hi, I think there are a couple of potential explanations for the increased latency. Let me point out two of the most obvious that come to my mind: 1) A state size of 20 MB sounds like something that could (completely or to a large extend) fit into some cache layer of a modern CPU, whereas 200

[jira] [Assigned] (FLINK-4809) Operators should tolerate checkpoint failures

2017-10-20 Thread Stefan Richter (JIRA)
[ https://issues.apache.org/jira/browse/FLINK-4809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stefan Richter reassigned FLINK-4809: - Assignee: Stefan Richter > Operators should tolerate checkpoint failu

Re: Custom Sink Checkpointing errors

2017-10-20 Thread Stefan Richter
Hi, the crash looks unrelated to Flink code from the dump’s trace. Since it happens somewhere in managing a jar file, it might be related to this: https://bugs.openjdk.java.net/browse/JDK-8142508 , point (2). Maybe your jar gets overwritten

[jira] [Closed] (FLINK-5372) Fix RocksDBAsyncSnapshotTest.testCancelFullyAsyncCheckpoints()

2017-10-19 Thread Stefan Richter (JIRA)
[ https://issues.apache.org/jira/browse/FLINK-5372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stefan Richter closed FLINK-5372. - Resolution: Fixed Merged in dbf4c86 >

[jira] [Comment Edited] (FLINK-5372) Fix RocksDBAsyncSnapshotTest.testCancelFullyAsyncCheckpoints()

2017-10-18 Thread Stefan Richter (JIRA)
[ https://issues.apache.org/jira/browse/FLINK-5372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16210596#comment-16210596 ] Stefan Richter edited comment on FLINK-5372 at 10/19/17 5:41 AM: - Those

[jira] [Comment Edited] (FLINK-5372) Fix RocksDBAsyncSnapshotTest.testCancelFullyAsyncCheckpoints()

2017-10-18 Thread Stefan Richter (JIRA)
[ https://issues.apache.org/jira/browse/FLINK-5372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16210596#comment-16210596 ] Stefan Richter edited comment on FLINK-5372 at 10/19/17 5:23 AM: - Those

[jira] [Commented] (FLINK-5372) Fix RocksDBAsyncSnapshotTest.testCancelFullyAsyncCheckpoints()

2017-10-18 Thread Stefan Richter (JIRA)
[ https://issues.apache.org/jira/browse/FLINK-5372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16210596#comment-16210596 ] Stefan Richter commented on FLINK-5372: --- Those are two separate problems and I don't the problem

[jira] [Assigned] (FLINK-5372) Fix RocksDBAsyncSnapshotTest.testCancelFullyAsyncCheckpoints()

2017-10-18 Thread Stefan Richter (JIRA)
[ https://issues.apache.org/jira/browse/FLINK-5372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stefan Richter reassigned FLINK-5372: - Assignee: Stefan Richter > Fix RocksDBAsyncSnapshotTest.testCancelFullyAsyncCheckpoi

[jira] [Closed] (FLINK-7757) RocksDB lock is too strict and can block snapshots in synchronous phase

2017-10-18 Thread Stefan Richter (JIRA)
[ https://issues.apache.org/jira/browse/FLINK-7757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stefan Richter closed FLINK-7757. - Resolution: Fixed Fixed in 7d026aa728. > RocksDB lock is too strict and can block snapsh

[jira] [Closed] (FLINK-7796) RocksDBKeyedStateBackend#RocksDBFullSnapshotOperation should close snapshotCloseableRegistry

2017-10-18 Thread Stefan Richter (JIRA)
[ https://issues.apache.org/jira/browse/FLINK-7796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stefan Richter closed FLINK-7796. - Resolution: Invalid > RocksDBKeyedStateBackend#RocksDBFullSnapshotOperation should cl

[jira] [Commented] (FLINK-7796) RocksDBKeyedStateBackend#RocksDBFullSnapshotOperation should close snapshotCloseableRegistry

2017-10-18 Thread Stefan Richter (JIRA)
[ https://issues.apache.org/jira/browse/FLINK-7796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16208979#comment-16208979 ] Stefan Richter commented on FLINK-7796: --- It is closed by the creator/owner in {{closeLocalRegistry

[jira] [Commented] (FLINK-6684) Remove AsyncCheckpointRunnable from StreamTask

2017-10-13 Thread Stefan Richter (JIRA)
[ https://issues.apache.org/jira/browse/FLINK-6684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16203174#comment-16203174 ] Stefan Richter commented on FLINK-6684: --- Not sure if this issue is actually valid, because most

Re: RocksDB segfault inside timer when accessing/clearing state

2017-10-08 Thread Stefan Richter
Hi, I would assume that those segfaults are only observed *after* a job is already in the process of canceling? This is a known problem, but currently „accepted“ behaviour after discussions with Stephan and Aljoscha (in CC). From that discussion, the background is that the native RocksDB

Re: Stream Task seems to be blocked after checkpoint timeout

2017-10-03 Thread Stefan Richter
l try to investigate what's the problem on my > cluster and S3. > BTW, Is there any Jira issue associated with your improvement, so that I can > track it? > > Best Regards, > Tony Wei > > 2017-10-03 16:01 GMT+08:00 Stefan Richter <s.rich...@data-artisans.com > <mai

[jira] [Created] (FLINK-7757) RocksDB lock is too strict and can block snapshots in synchronous phase

2017-10-03 Thread Stefan Richter (JIRA)
Stefan Richter created FLINK-7757: - Summary: RocksDB lock is too strict and can block snapshots in synchronous phase Key: FLINK-7757 URL: https://issues.apache.org/jira/browse/FLINK-7757 Project

[jira] [Created] (FLINK-7757) RocksDB lock is too strict and can block snapshots in synchronous phase

2017-10-03 Thread Stefan Richter (JIRA)
Stefan Richter created FLINK-7757: - Summary: RocksDB lock is too strict and can block snapshots in synchronous phase Key: FLINK-7757 URL: https://issues.apache.org/jira/browse/FLINK-7757 Project

Re: Stream Task seems to be blocked after checkpoint timeout

2017-10-03 Thread Stefan Richter
gt;> 3) It seems memory usage is bounded. I'm not sure if the status showed >>> above is fine. >>> >>> There is only one TM in my cluster for now, so all tasks are running on >>> that machine. I think that means they are in the same JVM, right? >>>

Re: Stream Task seems to be blocked after checkpoint timeout

2017-09-28 Thread Stefan Richter
t had already arrived at the > operator and started as soon as when the JM triggered the checkpoint? > > Best Regards, > Tony Wei > > 2017-09-28 18:22 GMT+08:00 Stefan Richter <s.rich...@data-artisans.com > <mailto:s.rich...@data-artisans.com>>: >

Re: Stream Task seems to be blocked after checkpoint timeout

2017-09-28 Thread Stefan Richter
howed what time the asynchronous part took. > > Best Regards, > Tony Wei > > 2017-09-28 16:25 GMT+08:00 Stefan Richter <s.rich...@data-artisans.com > <mailto:s.rich...@data-artisans.com>>: > Hi, > > when the async part takes that long I would have 3 t

Re: Flink on EMR

2017-09-28 Thread Stefan Richter
Hi, for issue 1, you could delete the slf4j jar from Flink’s lib folder, but I wonder if this producing any problems even with the warning? For issue 2, my question is where you found that only 5GB have been allocated? Did you consider that Flink only allocates a fraction of the memory for

Re: Stream Task seems to be blocked after checkpoint timeout

2017-09-28 Thread Stefan Richter
k #1245 > and #1246, you can check the picture I sent before. > > > > AFAIK, the back pressure rate usually is in LOW status, sometimes goes up to > HIGH, and always OK during the night. > > Best Regards, > Tony Wei > > > 2017-09-27 16:54 GMT+08:00 Ste

Re: Flink Application Jar file on Docker container

2017-09-27 Thread Stefan Richter
12:39 schrieb Rahul Raj <rahulrajms...@gmail.com>: > > Hi Stefan, > > I have a question in my mid out of curiosity Is it possible to run flink > application within docker container by using flink cluster set up on host? > > Rahul Raj > > On 26 September 2

[jira] [Created] (FLINK-7720) Centralize creation of backends and state related resources

2017-09-27 Thread Stefan Richter (JIRA)
Stefan Richter created FLINK-7720: - Summary: Centralize creation of backends and state related resources Key: FLINK-7720 URL: https://issues.apache.org/jira/browse/FLINK-7720 Project: Flink

[jira] [Created] (FLINK-7719) Send checkpoint id to task as part of deployment descriptor when resuming

2017-09-27 Thread Stefan Richter (JIRA)
Stefan Richter created FLINK-7719: - Summary: Send checkpoint id to task as part of deployment descriptor when resuming Key: FLINK-7719 URL: https://issues.apache.org/jira/browse/FLINK-7719 Project

[jira] [Created] (FLINK-7719) Send checkpoint id to task as part of deployment descriptor when resuming

2017-09-27 Thread Stefan Richter (JIRA)
Stefan Richter created FLINK-7719: - Summary: Send checkpoint id to task as part of deployment descriptor when resuming Key: FLINK-7719 URL: https://issues.apache.org/jira/browse/FLINK-7719 Project

Re: Stream Task seems to be blocked after checkpoint timeout

2017-09-27 Thread Stefan Richter
dcc0ace2b72c2/chk-1245/eedd7ca5-ee34-45a5-bf0b-11cc1fc67ab8 > 2017-09-26 18:13:34 37419 > tony-dev/flink-checkpoints/7c039572b13346f1b17dcc0ace2b72c2/chk-1246/9aa5c6c4-8c74-465d-8509-5fea4ed25af6 > > Hope these informations are helpful. Thank you. > > Best Regards, > Tony W

Re: Exception in BucketingSink when cancelling Flink job

2017-09-27 Thread Stefan Richter
Hi, I would speculate that the reason for this order is that we want to shutdown the tasks quickly by interrupting blocking calls in the event of failure, so that recover can begin as fast as possible. I am looping in Stephan who might give more details about this code. Best, Stefan > Am

Re: Stream Task seems to be blocked after checkpoint timeout

2017-09-27 Thread Stefan Richter
ess(CountAndTimeoutWindow) > .asyncIO(UploadToS3) > .addSink(UpdateDatabase) > > It seemed all tasks stopped like the picture I sent in the last email. > > I will keep my eye on taking a thread dump from that JVM if this happens > again. > > Best Regards, > To

Re: Stream Task seems to be blocked after checkpoint timeout

2017-09-26 Thread Stefan Richter
Hi, that is very strange indeed. I had a look at the logs and there is no error or exception reported. I assume there is also no exception in your full logs? Which version of flink are you using and what operators were running in the task that stopped? If this happens again, would it be

Re: Flink Application Jar file on Docker container

2017-09-26 Thread Stefan Richter
Hi, as in my answer to your previous mail, I suggest to take a look at https://github.com/mesoshq/flink . Unfortunately, there is not yet a lot documentation about the internals of how this works, so I am also looping in Till who might know more about

Re: Flink Job on Docker on Mesos cluster

2017-09-26 Thread Stefan Richter
Hi, I think what you need to have is a docker image that can spawn task managers as entry points. Please take a look at this project which gives some more detailed explanation: https://github.com/mesoshq/flink In particular, take a look at the shell script

Re: Questions about checkpoints/savepoints

2017-09-26 Thread Stefan Richter
Hi, I have answered your questions inline: > It seems to me that checkpoints can be treated as flink internal recovery > mechanism, and savepoints act more as user-defined recovery points. Would > that be a correct assumption? You could see it that way, but I would describe savepoints more as

Re: java.lang.OutOfMemoryError: Java heap space at com.google.protobuf.AbstractMessageLite.toByteArray(AbstractMessageLite.java:62)

2017-09-26 Thread Stefan Richter
Hi, could you give us some more information like the size of your heap space, information about where and how you implemented access to Redis and how you kep the retrieved data, and most importantly a stack trace or (much better) a log? Best, Stefan > Am 26.09.2017 um 06:52 schrieb

Re: How to clear registered timers for a merged window?

2017-09-26 Thread Stefan Richter
Hi, I think that it is currently not possible to delete timers that did not trigger, because currently some of the data structures used for timers do not support random deletes efficiently. For the second part of the question about keeping the state of merged windows, I added Aljoscha in CC

Re: [PATCH] tools: firewire: nosy-dump: fix a resource leak in main()

2017-09-26 Thread Stefan Richter
t) > + fclose(input); > return -1; > } > } When we return from main(), all files are closed implicitly. -- Stefan Richter -=== =--= ==-=- http://arcgraph.de/sr/

Re: [PATCH] tools: firewire: nosy-dump: fix a resource leak in main()

2017-09-26 Thread Stefan Richter
ose(input); > return -1; > } > } When we return from main(), all files are closed implicitly. -- Stefan Richter -=== =--= ==-=- http://arcgraph.de/sr/

[jira] [Closed] (FLINK-7619) Improve abstraction in AbstractAsyncIOCallable to better fit

2017-09-25 Thread Stefan Richter (JIRA)
[ https://issues.apache.org/jira/browse/FLINK-7619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stefan Richter closed FLINK-7619. - Resolution: Fixed Fix Version/s: 1.4.0 Merged in 5af463a9c0. > Improve abstract

[jira] [Closed] (FLINK-7524) Task "xxx" did not react to cancelling signal, but is stuck in method

2017-09-25 Thread Stefan Richter (JIRA)
[ https://issues.apache.org/jira/browse/FLINK-7524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stefan Richter closed FLINK-7524. - Assignee: Stefan Richter I still found some small potential for improvement that is merged

Re: Empty directories left over from checkpointing

2017-09-20 Thread Stefan Richter
Hi, We recently removed some cleanup code, because it involved checking some store meta data to check when we can delete a directory. For certain stores (like S3), requesting this meta data whenever we delete a file was so expensive that it could bring down the job because removing state could

Re: [DISCUSS] Dropping Scala 2.10

2017-09-20 Thread Stefan Richter
+1

Re: Could not initialize keyed state backend.

2017-09-18 Thread Stefan Richter
Hi, that is not the case, and it also would not make too much sense if you think about restoring from a checkpoint in case of a machine failure. Is there a section in the Flink documentation that was confusing and has brought you to this assumption? Best, Stefan > Am 18.09.2017 um 15:56

Re: Could not initialize keyed state backend.

2017-09-18 Thread Stefan Richter
Hi, are your checkpoints going against a local filesystem or against a distributed filesystem that is reachable from all task managers. This exception can happen in the first case: imagine your task restarts on a different machine, how could it find a file that was local to a different

[jira] [Created] (FLINK-7619) Improve abstraction in AbstractAsyncIOCallable to better fit

2017-09-14 Thread Stefan Richter (JIRA)
Stefan Richter created FLINK-7619: - Summary: Improve abstraction in AbstractAsyncIOCallable to better fit Key: FLINK-7619 URL: https://issues.apache.org/jira/browse/FLINK-7619 Project: Flink

[jira] [Created] (FLINK-7619) Improve abstraction in AbstractAsyncIOCallable to better fit

2017-09-14 Thread Stefan Richter (JIRA)
Stefan Richter created FLINK-7619: - Summary: Improve abstraction in AbstractAsyncIOCallable to better fit Key: FLINK-7619 URL: https://issues.apache.org/jira/browse/FLINK-7619 Project: Flink

Re: Task Manager segfault randomly with RocksDB

2017-09-12 Thread Stefan Richter
Hi, could you provide some logs and ideally also the core dumps from the segfault? Is this only happening when the job is already canceling? Best, Stefan > Am 12.09.2017 um 10:49 schrieb Kien Truong : > > Hi all, > > Our task managers are segfaulting randomly when

Re: Quick checkpointing related question

2017-09-08 Thread Stefan Richter
Hi, the method is only called after the checkpoint completed on the job manager. At this point _all_ work for the checkpoint is done, so doing work in this callback does not add any overhead to the checkpoint. Best, Stefan > Am 08.09.2017 um 10:20 schrieb Martin Eden

Re: File System State Backend

2017-09-08 Thread Stefan Richter
Hi, I just tried out checkpoint with FsStateBackend in 1.3.2 and everything works as expected for me. Can you give a bit more detail what you mean by „checkpoint data is not cleaning“? For example, is it not cleaned up while the job is running and accumulating „chk-[ID]“ directories or is

Re: MapState Default Value

2017-09-07 Thread Stefan Richter
I don’t know any better answer than: it was never implemented. But it could make sense, so there is no deeper reason from my point of view. > Am 07.09.2017 um 14:49 schrieb Timo Walther : > > I will loop in Stefan, who might know the answer. > > > Am 07.09.17 um 02:10

Re: Empty state restore seems to be broken for Kafka source (1.3.2)

2017-09-06 Thread Stefan Richter
Thanks for the report, I will take a look. > Am 06.09.2017 um 11:48 schrieb Gyula Fóra : > > Hi all, > > We are running into some problems with the kafka source after changing the > uid and restoring from the savepoint. > What we are expecting is to clear the partition

Re: External checkpoints not getting cleaned up/discarded - potentially causing high load

2017-09-04 Thread Stefan Richter
de some profiler data as we are able to > analyze further. > > -- > Jared Stehler > Chief Architect - Intellify Learning > o: 617.701.6330 x703 > > > >> On Jul 3, 2017, at 6:02 AM, Stefan Richter <s.rich...@data-artisans.com >> <mailto:s.rich...@data-

[jira] [Closed] (FLINK-7460) Close all ColumnFamilyHandles when restoring rescaled incremental checkpoints

2017-08-24 Thread Stefan Richter (JIRA)
[ https://issues.apache.org/jira/browse/FLINK-7460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stefan Richter closed FLINK-7460. - Resolution: Fixed merged in ca87bec4f7. > Close all ColumnFamilyHandles when restoring resca

[jira] [Closed] (FLINK-7505) Use lambdas in suppressed exception idiom

2017-08-24 Thread Stefan Richter (JIRA)
[ https://issues.apache.org/jira/browse/FLINK-7505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stefan Richter closed FLINK-7505. - Resolution: Fixed merged in 5456cf9f8f. > Use lambdas in suppressed exception id

[jira] [Closed] (FLINK-7461) Remove Backwards compatibility for Flink 1.1 from Flink 1.4

2017-08-24 Thread Stefan Richter (JIRA)
[ https://issues.apache.org/jira/browse/FLINK-7461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stefan Richter closed FLINK-7461. - Resolution: Fixed merged in 6642768ad8. > Remove Backwards compatibility for Flink 1.1 f

Re: Database connection from job

2017-08-24 Thread Stefan Richter
Hi, the lifecycle is described here: https://ci.apache.org/projects/flink/flink-docs-release-1.3/internals/task_lifecycle.html Best, Stefan > Am 24.08.2017 um 14:12 schrieb Bart Kastermans

[jira] [Created] (FLINK-7505) Use lambdas in suppressed exception idiom

2017-08-24 Thread Stefan Richter (JIRA)
Stefan Richter created FLINK-7505: - Summary: Use lambdas in suppressed exception idiom Key: FLINK-7505 URL: https://issues.apache.org/jira/browse/FLINK-7505 Project: Flink Issue Type

[jira] [Created] (FLINK-7505) Use lambdas in suppressed exception idiom

2017-08-24 Thread Stefan Richter (JIRA)
Stefan Richter created FLINK-7505: - Summary: Use lambdas in suppressed exception idiom Key: FLINK-7505 URL: https://issues.apache.org/jira/browse/FLINK-7505 Project: Flink Issue Type

Re: custom writer fail to recover

2017-08-24 Thread Stefan Richter
Hi, I think there are two different things mixed up in your analysis. The stack trace that you provided is caused by a failing checkpoint - in writing, not in reading. It seems to fail from a Timeout of your HDFS connection. This close method has also nothing to do with the close method in the

Re: Support for multiple HDFS

2017-08-24 Thread Stefan Richter
Hi, I don’t think that this is currently supported. If you see a use case for this (over creating different root directories for checkpoint data and result data) then I suggest that you open a JIRA issue with a new feature request. Best, Stefan > Am 23.08.2017 um 20:17 schrieb Vijay

Re: [DISCUSS] Stop serving docs for Flink version prior to 1.0

2017-08-22 Thread Stefan Richter
+1 > Am 22.08.2017 um 18:07 schrieb Robert Metzger : > > Please go ahead +1 > > Thank you for taking care! > > On Tue, Aug 22, 2017 at 6:02 PM, Aljoscha Krettek > wrote: > >> +1 >>> On 22. Aug 2017, at 18:01, Stephan Ewen wrote:

Re: [POLL] Dropping savepoint format compatibility for 1.1.x in the Flink 1.4.0 release

2017-08-17 Thread Stefan Richter
able) and the > key-group-oriented format (1.2.x onwards, re-scalable). It would greatly help > the development of state and checkpointing features to drop that old code.” > > Greg > > >> On Aug 17, 2017, at 5:36 AM, Stefan Richter <s.rich...@data-artisans.com> &g

Re: Why ListState of flink don't support update?

2017-08-17 Thread Stefan Richter
Hi, this is because the list state is intended to be append only. The underlying reason is that this allows certain optimizations in the underlying datastructures. For example, a list state for the RocksDB backend can make use of RocksDB’s merge operator and does not require a full rewrite to

Re: Change state backend.

2017-08-17 Thread Stefan Richter
This is not possible out of the box. Historically, the checkpoint/savepoint formats have been different between heap based and RocksDB based backends. We have already eliminated most differences in 1.3. However, there are two problems remaining. The first problem is just how the number of

<    5   6   7   8   9   10   11   12   13   14   >