Any advice on how to replay an event-timed stream?

2019-01-17 Thread Kanstantsin Kamkou
Thanks for the reply. As mentioned before, the data comes from the database. Timestamps are from one month ago, and I’m searching for a way to dump this data into a working Flink application which has already processed this data (watermarks are far ahead of those dates). On Fri 18. Jan 2019 at

Re: Dump snapshot of big table in real time using StreamingFileSink

2019-01-17 Thread knur
Hello Jamie. Thanks for taking a look at this. So, yes, I want to write only the last data for each key every X minutes. In other words, I want a snapshot of the whole database every X minutes. > The issue is that the window never gets PURGED so the data just > continues to accumulate in the

Re: Dump snapshot of big table in real time using StreamingFileSink

2019-01-17 Thread Jamie Grier
If I'm understanding you correctly, you're just trying to do some data reduction so that you write data for each key once every five minutes rather than for every CDC update. Is that correct? You also want to keep the state for the most recent key you've ever seen so you don't apply writes out of
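The per-key data reduction with out-of-order protection that this reply describes can be illustrated outside Flink. A minimal plain-Python sketch (not the Flink API; the version field stands in for whatever ordering attribute the CDC log carries):

```python
# Sketch (plain Python, not Flink): keep only the newest version seen per
# key, and drop any update that is older than what is already stored, so
# out-of-order writes are never applied.

def apply_update(state, key, version, row):
    """state: dict mapping key -> (version, row). Returns True if applied."""
    current = state.get(key)
    if current is not None and version <= current[0]:
        return False          # out-of-order or duplicate write: ignore it
    state[key] = (version, row)
    return True

state = {}
apply_update(state, "k", 2, "new")
applied = apply_update(state, "k", 1, "old")  # arrives late -> dropped
```

In a real job this dictionary would live in Flink keyed state, with a timer flushing the retained rows every five minutes.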

Re: Any advice on how to replay an event-timed stream?

2019-01-17 Thread Jamie Grier
I don't think I understood all of your question, but with regard to the watermarking and keys: you are correct that watermarking (event time advancement) is not per key. Event-time is a local property of each Task in an executing Flink job. It has nothing to do with keys. It has only to do
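The point that a watermark belongs to a task rather than to a key can be shown with a small plain-Python sketch (not the Flink API): one operator instance tracks a single watermark shared by every key it processes.

```python
# Sketch (plain Python, not Flink): one operator task keeps a single
# current watermark for ALL keys it processes. An element for key "a"
# is late relative to the task watermark even if key "a" itself has
# produced no recent timestamps.

class TaskEventTimeClock:
    def __init__(self):
        self.watermark = float("-inf")  # one watermark per task, not per key

    def on_element(self, key, timestamp):
        """Return True if the element is on time w.r.t. the task watermark."""
        return timestamp >= self.watermark

    def on_watermark(self, wm):
        # Watermarks only advance, and they apply to every key on this task.
        self.watermark = max(self.watermark, wm)

clock = TaskEventTimeClock()
clock.on_element("a", 100)                  # on time
clock.on_watermark(200)                     # advances the task-wide watermark
late = not clock.on_element("a", 150)       # 150 < 200 -> late for key "a"
also_late = not clock.on_element("b", 150)  # and equally late for key "b"
```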

Any advice on how to replay an event-timed stream?

2019-01-17 Thread Kanstantsin Kamkou
Hi guys! As I understand it (I hope I’m wrong), the current design concept of the watermarking mechanism is that it is tied to the latest watermark and there is no way to separate those watermarks by key in a keyed stream (I hope at some point it’ll be mentioned in the documentation, as it unfortunately

Dump snapshot of big table in real time using StreamingFileSink

2019-01-17 Thread knur
Hello there. So we have some Postgres tables that are mutable, and we want to create a snapshot of them in S3 every X minutes. So we plan to use Debezium to send a CDC log of every row change into a Kafka topic, and then have Flink keep the latest state of each row to save that data into S3
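The keep-latest-state idea sketched in this message can be illustrated with plain Python (not Flink or Debezium APIs; the event shape is an assumption):

```python
# Sketch (plain Python): keep only the latest CDC row per primary key,
# and hand the full retained state to a periodic snapshot dump.

class SnapshotState:
    def __init__(self):
        self.latest = {}  # primary key -> latest row payload

    def on_cdc_event(self, key, row, deleted=False):
        if deleted:
            self.latest.pop(key, None)     # a delete removes the row
        else:
            self.latest[key] = row         # later updates overwrite earlier ones

    def snapshot(self):
        # What a timed trigger would hand to StreamingFileSink every X minutes.
        return dict(self.latest)

state = SnapshotState()
state.on_cdc_event(1, {"name": "alice"})
state.on_cdc_event(2, {"name": "bob"})
state.on_cdc_event(1, {"name": "alice2"})   # update wins over insert
state.on_cdc_event(2, None, deleted=True)   # delete drops the row
snap = state.snapshot()
```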

Re: Issue with counter metrics for large number of keys

2019-01-17 Thread Jamie Grier
+1 to what Zhenghua said. You're abusing the metrics system, I think. Rather, just do a stream.keyBy().sum() and then write a Sink to do something with the data -- for example, push it to your metrics system if you wish. However, from experience, many metrics systems don't like that sort of thing.
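The "keyBy().sum() then a sink" pattern suggested here can be sketched in plain Python (not the Flink API; the `emit` callback stands in for a push to an external metrics system):

```python
# Sketch (plain Python, not Flink): aggregate per key in the stream, then
# let a sink push the result out, instead of creating one Flink counter
# metric per key.

from collections import defaultdict

def keyed_sum(events):
    """events: iterable of (key, value) pairs -> per-key running totals."""
    totals = defaultdict(int)
    for key, value in events:
        totals[key] += value
        yield key, totals[key]

def metrics_sink(records, emit):
    # 'emit' stands in for a push to an external metrics system.
    for key, total in records:
        emit(key, total)

out = {}
metrics_sink(keyed_sum([("a", 1), ("b", 2), ("a", 3)]), out.__setitem__)
```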

Re: Getting RemoteTransportException

2019-01-17 Thread Jamie Grier
Avi, The stack trace there is pretty much a red herring. That happens whenever a job shuts down for any reason and is not a root cause. To diagnose this you will want to look at all the TaskManager logs as well as the JobManager logs. If you have a way to easily grep these (all of them at

`env.java.opts` not persisting after job canceled or failed and then restarted

2019-01-17 Thread Aaron Levin
Hello! *tl;dr*: settings in `env.java.opts` seem to stop having impact when a job is canceled or fails and then is restarted (with or without savepoint/checkpoints). If I restart the task-managers, the `env.java.opts` seem to start having impact again and our job will run without failure. More
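The behavior described here is consistent with `env.java.opts` being a process-level setting: it is read when the JobManager/TaskManager JVMs are launched, not when a job is submitted into already-running JVMs, which is why restarting the task-managers makes it take effect again. A hedged flink-conf.yaml sketch (the option values are placeholders, not taken from the thread):

```yaml
# flink-conf.yaml: env.java.opts is applied when the JVM process starts,
# so cancelling and resubmitting a job into a running session cluster
# does not re-apply it. Values below are illustrative only.
env.java.opts: "-XX:+UseG1GC -Dsome.property=value"
```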

Re: Getting RemoteTransportException

2019-01-17 Thread Dominik Wosiński
*Hey,* As for the question about taskmanager.network.netty.server.numThreads . It is the size of the thread pool that will be used by the netty server. The default value is -1,

Re: delete all available flink timers on app start

2019-01-17 Thread Steven Wu
Vipul, it sounds like you don't want to checkpoint timers at all. Since 1.7, we can configure the timer state backend (HEAP/ROCKSDB). I guess a new option (NONE) could be added to support such a requirement, but it would be interesting to see your reasons. Can you elaborate? Thanks, Steven On Thu, Jan 17,
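The timer state backend mentioned here is chosen via a configuration key. A sketch of the relevant flink-conf.yaml entry as of Flink 1.7 (key and values as documented for the RocksDB state backend):

```yaml
# flink-conf.yaml (Flink 1.7+): where the RocksDB state backend stores
# timers. ROCKSDB keeps them in checkpointed state; HEAP keeps them on
# the JVM heap. There is no NONE option to skip snapshotting timers.
state.backend.rocksdb.timer-service.factory: ROCKSDB   # or HEAP
```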

Re: delete all available flink timers on app start

2019-01-17 Thread Fabian Hueske
Hi Vipul, I'm not aware of a way to do this. You could have a list of all registered timers per key as state to be able to delete them. However, the problem is to identify in user code when an application was restarted, i.e., to know when to delete timers. Also, timer deletion would need to be
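Fabian's suggestion of keeping a list of registered timers per key in state can be sketched in plain Python (not the Flink API; `FakeTimerService` stands in for Flink's `TimerService`):

```python
# Sketch (plain Python, not Flink): track registered timer timestamps per
# key so they can all be deleted later.

class FakeTimerService:
    """Stand-in for Flink's TimerService."""
    def __init__(self):
        self.pending = set()
    def register(self, ts):
        self.pending.add(ts)
    def delete(self, ts):
        self.pending.discard(ts)

class TimerTracker:
    def __init__(self, timer_service):
        self.timers = {}              # key -> set of registered timestamps
        self.service = timer_service

    def register(self, key, ts):
        self.service.register(ts)
        self.timers.setdefault(key, set()).add(ts)

    def delete_all(self, key):
        # Delete every timer we recorded for this key, then forget them.
        for ts in self.timers.pop(key, set()):
            self.service.delete(ts)

svc = FakeTimerService()
tracker = TimerTracker(svc)
tracker.register("k1", 1000)
tracker.register("k1", 2000)
tracker.delete_all("k1")
```

As the reply notes, the hard part in a real job is knowing *when* to run `delete_all`, since user code cannot easily tell that the application was just restarted.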

YARN reserved container prevents new Flink TMs

2019-01-17 Thread suraj7
Hi, I am using Amazon EMR(emr-5.20.0, hadoop: Amazon 2.8.5, Flink: 1.6.2) to run Flink Cluster on YARN. My setup consists of m4.large instances for 1 master and 3 core nodes. I have Flink Cluster running on YARN with the command: flink-yarn-session -tm 5120 -s 3 -jm 1024. This setup should

Re: EXT :Re: StreamingFileSink cannot get AWS S3 credentials

2019-01-17 Thread Vinay Patil
Hi Stephan, yes, we tried setting fs.s3a.aws.credentials.provider, but we are getting a ClassNotFoundException for InstanceProfileCredentialsProvider because of a shading issue. Regards, Vinay Patil On Thu, Jan 17, 2019 at 3:02 PM Stephan Ewen wrote: > Regarding configurations: According to

Re: EXT :Re: StreamingFileSink cannot get AWS S3 credentials

2019-01-17 Thread Stephan Ewen
Regarding configurations: According to the code [1] , all config keys starting with "s3", "s3a" and "fs.s3a" are forwarded from the
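As a sketch of the forwarding behavior described here, a flink-conf.yaml entry like the following would be mirrored into the Hadoop S3 filesystem configuration (the provider class name is illustrative; as the previous message reports, the unshaded class name may fail to load because of shading inside the flink-s3-fs-hadoop jar):

```yaml
# flink-conf.yaml: keys prefixed with "s3.", "s3a." or "fs.s3a." are
# forwarded to the Hadoop S3 filesystem. Class name is illustrative.
fs.s3a.aws.credentials.provider: com.amazonaws.auth.InstanceProfileCredentialsProvider
```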