Hi,

I admit, it really is already quite a lot :)

However, my task at hand is inclusion dependency detection on CSV files, and
the number of such files in real-world datasets is sometimes even higher.
Since each file can have a different number of columns and since I need to
distinguish the columns from all files, I am starting a source per file.

How would you recommend setting the DOP for a cluster? Number of machines?
Number of cores? Number of cores*2?
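
To make the three options concrete for my setup (10 TaskManagers with
dual-core CPUs, see my earlier mail below), here is a small Python sketch.
It is only arithmetic, not a recommendation, and the helper names are made
up for illustration; none of this is Flink API. The second helper works out
the network buffer memory from the defaults Ufuk mentioned (2048 buffers of
32768 bytes) and from the 16000 buffers of my last run, assuming the buffer
size stayed at its default.

```python
# Back-of-the-envelope numbers for the cluster discussed in this thread.
# All names are hypothetical helpers for illustration, not Flink API.


def candidate_dops(task_managers: int, cores_per_machine: int) -> dict:
    """Return the three DOP candidates asked about above."""
    total_cores = task_managers * cores_per_machine
    return {
        "machines": task_managers,
        "cores": total_cores,
        "cores*2": 2 * total_cores,
    }


def network_buffer_mib(num_buffers: int, buffer_size_bytes: int = 32768) -> float:
    """Memory occupied by the network buffers, in MiB."""
    return num_buffers * buffer_size_bytes / (1024 * 1024)


if __name__ == "__main__":
    # 10 TaskManagers with dual-core CPUs
    print(candidate_dops(10, 2))
    # Default buffer count vs. the count used in the run that passed
    print(network_buffer_mib(2048))
    print(network_buffer_mib(16000))
```

So the candidates would be 10, 20, and 40, and going from 2048 to 16000
buffers takes a sizeable share of the 1.5 GB I assigned per machine.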

Cheers,
Sebastian

-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of Stephan 
Ewen
Sent: Monday, 7 July 2014 19:44
To: [email protected]
Subject: Re: Hardware Requirements

Hi!

Okay, 100 concurrent data sources is quite a lot ;-)

Do you start a source per file? You can start a source per directory, which 
will take all files in the directory...

Stephan


On Mon, Jul 7, 2014 at 7:41 PM, Kruse, Sebastian <[email protected]>
wrote:

> Thanks for your answers. Based on what you say, I guess the scaling
> problem in my program is the number of data sources. This number is
> variable and can go beyond 100 (I am analyzing data dumps). Maybe the
> number of shuffles or something similar grows with the number of
> sources, or the sources simply inflate the plan. That would explain
> why the execution fails for the larger datasets.
>
> I am running 10 TaskManagers. Since these have dual-core CPUs, I chose
> 20 as DOP, and was even thinking about 40 for latency hiding. What DOP
> would you suggest for this setting (disregarding the buffer
> limitation)?
>
> Pertaining to the number of concurrent shuffles, I would also like to 
> know what causes a shuffle. Reduces, cogroups, and joins? And what about 
> unions?
>
> If you are interested, I can play around a little bit more with the
> settings by the end of this week and report to you under which
> circumstances the execution fails or passes.
> (Update: the program just passed with 16000 buffers and a DOP of 10)
>
> Cheers,
> Sebastian
>
>
> -----Original Message-----
> From: Ufuk Celebi [mailto:[email protected]]
> Sent: Sunday, 6 July 2014 14:30
> To: [email protected]
> Subject: Re: Hardware Requirements
>
> Hey Sebastian,
>
> did you already try to increase the number of buffers in accordance with
> Stephan's suggestion? The current defaults for the number and size of
> network buffers are 2048 and 32768 bytes, resulting in 64 MB of memory
> for the network buffers.
>
> Out of curiosity: on how many machines are you running your job and 
> what parallelism did you set for your program?
>
> Best,
>
> Ufuk
>
> On 04 Jul 2014, at 15:46, Kruse, Sebastian <[email protected]> wrote:
>
> > Hi everyone,
> >
> > I apologize in advance if that is not the right mailing list for my
> > question. If there is a better place for it, please let me know.
> >
> > Basically, I wanted to ask if you have some statement about the hardware
> > requirements of Flink to process larger amounts of data beginning from,
> > say, 20 GB. Currently, I am facing issues in my jobs, e.g., there are not
> > enough buffers for safe execution of some operations. Since the machines
> > that run my TaskManagers unfortunately have very limited main memory, I
> > cannot increase the number of buffers (and heap space in general) too
> > much. Currently, I assigned them 1.5 GB.
> >
> > So, the exact questions are:
> >
> > * Do you have experience with a suitable HW setup for crunching
> >   larger amounts of data, maybe from the TU cluster?
> >
> > * Are there any configuration tips you can provide, e.g., pertaining
> >   to the buffer configuration?
> >
> > * Are there any general statements on the growth of Flink's memory
> >   requirements w.r.t. the size of the input data?
> >
> > Thanks for your help!
> > Sebastian
>
>
