Hi! Okay, 100 concurrent data sources is quite a lot ;-)
Do you start a source per file? You can start a source per directory, which
will take all files in the directory...

Stephan


On Mon, Jul 7, 2014 at 7:41 PM, Kruse, Sebastian <[email protected]> wrote:

> Thanks for your answers. Based on what you say, I guess the scaling
> problem in my program is the number of data sources. This number is
> variable and can go beyond 100 (I am analyzing data dumps). Maybe the
> number of shuffles or something similar will grow with the number of
> sources, or simply because it inflates the plan. That would explain why
> the execution fails for the larger datasets.
>
> I am running 10 TaskManagers. Since these have dual-core CPUs, I chose 20
> as DOP, and was even thinking about 40 for latency hiding. What DOP would
> you suggest for this setting (disregarding the buffer limitation)?
>
> Pertaining to the number of concurrent shuffles, I would also like to know
> what causes a shuffle. Reduces, cogroups, and joins? And what about unions?
>
> If you are interested, I can play around a little bit more with the
> settings by the end of this week and report to you under which
> circumstances the execution fails or passes.
> (Update: the program just passed with 16000 buffers and a DOP of 10)
>
> Cheers,
> Sebastian
>
>
> -----Original Message-----
> From: Ufuk Celebi [mailto:[email protected]]
> Sent: Sunday, 6 July 2014 14:30
> To: [email protected]
> Subject: Re: Hardware Requirements
>
> Hey Sebastian,
>
> did you already try to increase the number of buffers in accordance with
> Stephan's suggestion? The current defaults for the number and size of
> network buffers are 2048 and 32768 bytes, resulting in 64 MB of memory for
> the network buffers.
>
> Out of curiosity: on how many machines are you running your job and what
> parallelism did you set for your program?
>
> Best,
>
> Ufuk
>
> On 04 Jul 2014, at 15:46, Kruse, Sebastian <[email protected]> wrote:
>
> > Hi everyone,
> >
> > I apologize in advance if that is not the right mailing list for my
> > question. If there is a better place for it, please let me know.
> >
> > Basically, I wanted to ask if you have some statement about the hardware
> > requirements of Flink to process larger amounts of data beginning from,
> > say, 20 GB. Currently, I am facing issues in my jobs, e.g., there are not
> > enough buffers for safe execution of some operations. Since the machines
> > that run my TaskTrackers have unfortunately very limited main memory, I
> > cannot increase the number of buffers (and heap space in general) too
> > much. Currently, I assigned them 1.5 GB.
> >
> > So, the exact questions are:
> >
> > * Do you have experience with a suitable HW setup for crunching
> >   larger amounts of data, maybe from the TU cluster?
> >
> > * Are there any configuration tips you can provide, e.g.
> >   pertaining to the buffer configuration?
> >
> > * Are there any general statements on the growth of Flink's
> >   memory requirements with respect to the size of the input data?
> >
> > Thanks for your help!
> > Sebastian
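
For reference, the buffer figures quoted in this thread can be checked with a little arithmetic. The sketch below is only an illustration: the function names are hypothetical and not part of Flink, and it assumes the run with 16000 buffers kept the default buffer size of 32768 bytes, which the thread does not state.

```python
# Rough arithmetic for the network buffer memory discussed in the thread.
# Function names are hypothetical helpers, not Flink APIs.

BYTES_PER_MIB = 1024 * 1024


def network_buffer_memory_bytes(num_buffers: int, buffer_size_bytes: int) -> int:
    """Total memory taken by the network buffers: count times size."""
    return num_buffers * buffer_size_bytes


def to_mib(num_bytes: int) -> float:
    """Convert a byte count to MiB."""
    return num_bytes / BYTES_PER_MIB


# Defaults mentioned by Ufuk: 2048 buffers of 32768 bytes each.
default_bytes = network_buffer_memory_bytes(2048, 32768)

# Setting from Sebastian's update: 16000 buffers.
# ASSUMPTION: the buffer size stayed at the default 32768 bytes.
tuned_bytes = network_buffer_memory_bytes(16000, 32768)

print(f"default: {to_mib(default_bytes):.0f} MiB")
print(f"16000 buffers: {to_mib(tuned_bytes):.0f} MiB")
```

Under that assumption, 16000 buffers come to roughly 500 MB per TaskManager, which is a large share of the 1.5 GB mentioned in the original question.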
