Re: Batch mode with Flink 1.8 unstable?

2019-09-19 Thread Fabian Hueske
Hi Ken, Changing the parallelism can affect the generation of input splits. I had a look at BinaryInputFormat, and it adds a bunch of empty input splits if the number of generated splits is less than the minimum number of splits (which is equal to the parallelism). See
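To make the effect of parallelism on split generation concrete, here is a minimal sketch of the padding behavior described above. It is a simplified model, not Flink's actual BinaryInputFormat code. `SplitPadding`, `Split` and `padToMinimum` are hypothetical names, and a split is reduced to a path plus a byte range.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified model: if fewer splits are generated than the
// minimum (the parallelism), the list is padded with empty splits.
class SplitPadding {

    // Minimal stand-in for an input split: a file path plus a byte range.
    static final class Split {
        final String path;   // null for an empty (padding) split
        final long start;
        final long length;

        Split(String path, long start, long length) {
            this.path = path;
            this.start = start;
            this.length = length;
        }

        boolean isEmpty() {
            return length == 0;
        }
    }

    static List<Split> padToMinimum(List<Split> generated, int minNumSplits) {
        List<Split> result = new ArrayList<>(generated);
        // Append empty splits until the minimum count is reached.
        while (result.size() < minNumSplits) {
            result.add(new Split(null, 0L, 0L));
        }
        return result;
    }

    public static void main(String[] args) {
        List<Split> generated = new ArrayList<>();
        generated.add(new Split("/data/part-0", 0L, 1024L));
        generated.add(new Split("/data/part-1", 0L, 2048L));

        // With a parallelism of 4, two empty splits are appended.
        List<Split> splits = padToMinimum(generated, 4);
        long empty = splits.stream().filter(Split::isEmpty).count();
        System.out.println(splits.size() + " splits, " + empty + " empty");
    }
}
```

Under this model, raising the parallelism changes how many splits (some of them empty) the job sees, even though the input files are unchanged.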

Re: Batch mode with Flink 1.8 unstable?

2019-09-19 Thread Till Rohrmann
Good to hear that some of your problems have been solved, Ken. For the UTFDataFormatException it is hard to tell. Usually it says that the input has been produced using `writeUTF`. Could you provide an example program which reproduces the problem? Moreover, it would be helpful to see how the
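`UTFDataFormatException` is what `java.io.DataInputStream.readUTF` throws when the bytes it reads are not valid modified UTF-8 in the format that `writeUTF` produces (a 2-byte length followed by the encoded characters). A minimal sketch using only the JDK: a matching `writeUTF`/`readUTF` pair round-trips, while reading bytes written some other way (here, an `int`) can fail with that exception. `UtfRoundTrip` and its methods are illustrative names only.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UTFDataFormatException;

class UtfRoundTrip {

    // Writes a string with writeUTF and reads it back with the matching readUTF.
    static String roundTrip(String s) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeUTF(s); // 2-byte length prefix + modified UTF-8
        }
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
            return in.readUTF();
        }
    }

    // Writes an int, then (incorrectly) reads it back as a UTF string.
    // Returns true if readUTF rejected the bytes with UTFDataFormatException.
    static boolean mismatchedReadFails() throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            // Bytes 00 02 80 80: readUTF sees length 2, then two bytes
            // that are not valid modified UTF-8.
            out.writeInt(0x00028080);
        }
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
            in.readUTF();
            return false;
        } catch (UTFDataFormatException e) {
            return true;
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(roundTrip("input split")); // matching write/read succeeds
        System.out.println(mismatchedReadFails());    // mismatched format is rejected
    }
}
```

This is why a reproducing program helps: the exception usually points at a reader and writer that disagree about the byte layout.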

Re: Batch mode with Flink 1.8 unstable?

2019-09-18 Thread Ken Krugler
Hi Till, I tried out 1.9.0 with my workflow, and I am no longer running into the errors I described below, which is great! Just to recap, this is batch, per-job mode on YARN/EMR. Though I did run into a new issue, related to my previous problem when reading files written via

Re: Batch mode with Flink 1.8 unstable?

2019-07-02 Thread Till Rohrmann
Thanks for the update, Ken. The input splits seem to be org.apache.hadoop.mapred.FileSplit. Nothing too fancy catches my eye. Internally they use org.apache.hadoop.mapreduce.lib.input.FileSplit, which stores a Path, two long pointers and two string arrays with hosts and host infos. I would assume
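To get a feel for how large such a split is on the wire, here is a rough sketch that writes the fields listed above (a path, two longs, two string arrays) with `DataOutputStream` and counts the bytes. `FileSplitLike` is a hypothetical stand-in: the field order, the use of `writeUTF`, and treating host infos as plain strings are assumptions, not Hadoop's actual `FileSplit` serialization.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Hypothetical stand-in mirroring the fields of a file split; not Hadoop's class.
class FileSplitLike {
    final String path;        // file the split belongs to
    final long start;         // byte offset where the split begins
    final long length;        // number of bytes in the split
    final String[] hosts;     // hosts holding the data
    final String[] hostInfos; // extra per-host information (assumed to be strings)

    FileSplitLike(String path, long start, long length, String[] hosts, String[] hostInfos) {
        this.path = path;
        this.start = start;
        this.length = length;
        this.hosts = hosts;
        this.hostInfos = hostInfos;
    }

    // Writes all fields and returns the byte count as a rough size estimate.
    int serializedSize() throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeUTF(path);
            out.writeLong(start);
            out.writeLong(length);
            writeStrings(out, hosts);
            writeStrings(out, hostInfos);
        }
        return bytes.size();
    }

    private static void writeStrings(DataOutputStream out, String[] values) throws IOException {
        out.writeInt(values.length); // array length prefix
        for (String value : values) {
            out.writeUTF(value);
        }
    }

    public static void main(String[] args) throws IOException {
        FileSplitLike split = new FileSplitLike(
            "hdfs:///data/input/part-00000", 0L, 134_217_728L,
            new String[] {"node-1", "node-2", "node-3"},
            new String[] {"node-1:disk0", "node-2:disk0", "node-3:disk0"});
        System.out.println(split.serializedSize() + " bytes");
    }
}
```

With a handful of hosts the estimate stays in the low hundreds of bytes; only very long paths or host lists would make a single split of this shape large.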

Re: Batch mode with Flink 1.8 unstable?

2019-07-01 Thread Ken Krugler
Hi Stephan, Thanks for responding, comments inline below… Regards, — Ken > On Jun 26, 2019, at 7:50 AM, Stephan Ewen wrote: > > Hi Ken! > > Sorry to hear you are going through this experience. The major focus on > streaming so far means that the DataSet API has stability issues at scale. >

Re: Batch mode with Flink 1.8 unstable?

2019-07-01 Thread Ken Krugler
Hi Till, Thanks for following up. I’ve got answers to other emails on this thread pending, but wanted to respond to this one now. > On Jul 1, 2019, at 7:20 AM, Till Rohrmann wrote: > > Quick addition for problem (1): The AkkaRpcActor should serialize the > response if it is a remote RPC and

Re: Batch mode with Flink 1.8 unstable?

2019-07-01 Thread Till Rohrmann
Quick addition for problem (1): The AkkaRpcActor should serialize the response if it is a remote RPC and send an AkkaRpcException if the response's size exceeds the maximum frame size. This should be visible on the call site since the future should be completed with this exception. I'm wondering
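A minimal sketch of the check described above, using plain JDK types: a remote response is serialized, and if the payload exceeds the maximum frame size, the caller's future is completed exceptionally instead of never receiving an answer. This is not Flink's AkkaRpcActor code; `OversizedResponseCheck`, `respond`, and `ResponseTooLargeException` (standing in for AkkaRpcException) are hypothetical, and Java serialization is assumed as the wire format.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;

class OversizedResponseCheck {

    // Hypothetical stand-in for the exception the RPC layer would send back.
    static final class ResponseTooLargeException extends Exception {
        ResponseTooLargeException(String msg) {
            super(msg);
        }
    }

    static byte[] serialize(Serializable value) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(value);
        }
        return bytes.toByteArray();
    }

    // Completes the caller's future with the response, or exceptionally if a
    // remote response would exceed the maximum frame size. For simplicity the
    // original object (not the serialized bytes) is handed to the caller.
    static void respond(Serializable response, boolean remote, int maxFrameSizeBytes,
                        CompletableFuture<Object> callerFuture) {
        if (!remote) {
            callerFuture.complete(response); // local calls skip serialization
            return;
        }
        try {
            byte[] payload = serialize(response);
            if (payload.length > maxFrameSizeBytes) {
                callerFuture.completeExceptionally(new ResponseTooLargeException(
                    "Response of " + payload.length + " bytes exceeds max frame size of "
                        + maxFrameSizeBytes + " bytes"));
            } else {
                callerFuture.complete(response);
            }
        } catch (IOException e) {
            callerFuture.completeExceptionally(e);
        }
    }

    public static void main(String[] args) throws InterruptedException {
        CompletableFuture<Object> future = new CompletableFuture<>();
        respond(new byte[10_000], true, 1_024, future);
        try {
            future.get();
        } catch (ExecutionException e) {
            // The call site sees the size error instead of waiting for a timeout.
            System.out.println(e.getCause().getMessage());
        }
    }
}
```

The point of this design is that an oversized reply surfaces as an explicit error at the call site rather than as a generic timeout.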

Re: Batch mode with Flink 1.8 unstable?

2019-07-01 Thread Till Rohrmann
Hi Ken, in order to further debug your problems it would be helpful if you could share the log files on DEBUG level with us. For problem (2), I suspect that it has been caused by Flink releasing TMs too early. This should be fixed with FLINK-10941 which is part of Flink 1.8.1. The 1.8.1 release
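One way to picture why releasing TMs too early can break a batch job: if a TaskManager may still hold results that downstream tasks have not read yet, idleness alone is not a safe release criterion. The sketch below is a hypothetical model under that assumption; `TaskManagerReleasePolicy`, `canRelease`, and its parameters are illustrative and do not describe Flink's ResourceManager or the actual FLINK-10941 change.

```java
// Hypothetical model of a release decision for an idle TaskManager.
class TaskManagerReleasePolicy {

    // A TM is releasable only if it has been idle long enough AND no results
    // it produced are still waiting to be consumed (assumption for illustration).
    static boolean canRelease(long idleMillis, long idleTimeoutMillis, int unconsumedPartitions) {
        boolean idleLongEnough = idleMillis >= idleTimeoutMillis;
        boolean holdsPendingData = unconsumedPartitions > 0;
        return idleLongEnough && !holdsPendingData;
    }

    public static void main(String[] args) {
        // Idle, but still holding data for downstream tasks: keep it.
        System.out.println(canRelease(60_000L, 30_000L, 2));
        // Idle and nothing left to serve: safe to release.
        System.out.println(canRelease(60_000L, 30_000L, 0));
    }
}
```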

Re: Batch mode with Flink 1.8 unstable?

2019-06-27 Thread Biao Liu
Hi Ken again, In regard to the TimeoutException, I just realized that there is no akka.remote.OversizedPayloadException in your log file. Something else might have caused this. 1. Have you tried increasing the configuration "akka.ask.timeout"? 2. Have you ever checked the garbage
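A TimeoutException on an ask does not by itself mean the payload was too large; a reply that is merely slow produces the same symptom. The sketch below models that with JDK `CompletableFuture.orTimeout` (Java 9+) as a stand-in for an ask with a configurable timeout such as `akka.ask.timeout`. It is not how Flink implements asks; `AskTimeoutDemo` and `ask` are hypothetical names, and the delay simulates a responder that is stalled for some reason.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class AskTimeoutDemo {

    // Simulates a responder that needs delayMillis to answer and an "ask"
    // that gives up after timeoutMillis.
    static String ask(long delayMillis, long timeoutMillis) throws InterruptedException {
        CompletableFuture<String> reply = CompletableFuture.supplyAsync(() -> {
            try {
                Thread.sleep(delayMillis); // the slow responder
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            return "splits";
        }).orTimeout(timeoutMillis, TimeUnit.MILLISECONDS);
        try {
            return reply.get();
        } catch (ExecutionException e) {
            return e.getCause() instanceof TimeoutException ? "timeout" : "error";
        }
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(ask(200, 50));    // short timeout: the ask fails
        System.out.println(ask(200, 2_000)); // longer timeout: the reply arrives
    }
}
```

The same slow reply fails or succeeds depending only on the timeout, which is why raising the timeout is a cheap first experiment before looking for other causes.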

Re: Batch mode with Flink 1.8 unstable?

2019-06-26 Thread Biao Liu
Hi Ken, In regard to oversized input splits, this seems to be a rare case that I did not expect. However, it should definitely be fixed, since input splits can be user-defined and we should not assume they are small. I agree with Stephan that maybe there is something unexpectedly involved in the input

Re: Batch mode with Flink 1.8 unstable?

2019-06-26 Thread qi luo
Hi Stephan, We have run into issues similar to those Ken described. Will all these issues hopefully be fixed in 1.9? Thanks, Qi > On Jun 26, 2019, at 10:50 PM, Stephan Ewen wrote: > > Hi Ken! > > Sorry to hear you are going through this experience. The major focus on > streaming so far means that

Re: Batch mode with Flink 1.8 unstable?

2019-06-26 Thread Stephan Ewen
Hi Ken! Sorry to hear you are going through this experience. The major focus on streaming so far means that the DataSet API has stability issues at scale. So, yes, batch mode in the current Flink version can be somewhat tricky. It is a big focus of Flink 1.9 to fix the batch mode, finally, and by

Batch mode with Flink 1.8 unstable?

2019-06-23 Thread Ken Krugler
Hi all, I’ve been running a somewhat complex batch job (in EMR/YARN) with Flink 1.8.0, and it regularly fails, but for varying reasons. Has anyone else had stability with 1.8.0 in batch mode and non-trivial workflows? Thanks, — Ken 1. TimeoutException getting input splits The batch job