Hi Sebastian! I think this is the right place to ask.
In principle, there are no strong hardware requirements. (Of course, more main memory and higher I/O bandwidth always help.) The memory size requirement does not grow with the data size, since the system spills to disk if needed.

The most important point is the one you touched on already: the number of network buffers. Since the current version can only do streaming exchanges, you need enough buffers to cover all streams. The rough formula for that is:

    #slots * parallelism * 2 * N

where N is the number of concurrent shuffles you plan to have. Typically an N of 4 is enough. (Slots are the scheduling unit starting in 0.6. In 0.5 and earlier, you can think of #cores instead of #slots.)

Explanation: When shuffling, each task slot needs two buffers (send side and receive side) for each target (parallelism many).

In future versions, we plan to automatically distribute memory to the network stack, but right now this is a parameter to adjust manually.

NOTE: There is currently a shortcoming that makes the memory requirement grow with the length of the processing pipeline. This is on our list to solve soon.

Let me know if you have further questions!

Stephan

On Fri, Jul 4, 2014 at 3:46 PM, Kruse, Sebastian <sebastian.kr...@hpi.de> wrote:

> Hi everyone,
>
> I apologize in advance if that is not the right mailing list for my
> question. If there is a better place for it, please let me know.
>
> Basically, I wanted to ask if you have some statement about the hardware
> requirements of Flink to process larger amounts of data beginning from,
> say, 20 GBs. Currently, I am facing issues in my jobs, e.g., there are not
> enough buffers for safe execution of some operations. Since the machines
> that run my TaskTrackers have unfortunately very limited main memory, I
> cannot increase the number of buffers (and heap space in general) too much.
> Currently, I assigned them 1.5 GB.
>
> So, the exact questions are:
>
> * Do you have experiences with a suitable HW setup for crunching
>   larger amounts of data, maybe from the TU cluster?
>
> * Are there any configuration tips, you can provide, e.g.
>   pertaining to the buffer configuration?
>
> * Are there any general statements on the growth of Flink's memory
>   requirements wrt. to the size of the input data?
>
> Thanks for your help!
> Sebastian
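
As a rough worked example of the buffer formula from the reply (#slots * parallelism * 2 * N), here is a minimal Python sketch. The function names, the example numbers, and the 32 KiB buffer size are assumptions for illustration and do not come from the thread; please check the buffer size your version actually uses.

```python
def required_network_buffers(slots, parallelism, concurrent_shuffles=4):
    """Rule of thumb from the reply: #slots * parallelism * 2 * N."""
    # Factor 2: one send-side and one receive-side buffer per target.
    # concurrent_shuffles is N; the reply suggests 4 is typically enough.
    return slots * parallelism * 2 * concurrent_shuffles


def buffer_memory_bytes(num_buffers, buffer_size_bytes=32 * 1024):
    """Memory taken by the buffers; the 32 KiB default is an assumption."""
    return num_buffers * buffer_size_bytes


if __name__ == "__main__":
    # Hypothetical setup: 4 slots, parallelism 16, N = 4.
    buffers = required_network_buffers(slots=4, parallelism=16)
    mib = buffer_memory_bytes(buffers) / (1024 * 1024)
    print(f"{buffers} buffers, about {mib:.0f} MiB")
```

For this hypothetical setup the formula gives 512 buffers, which is 16 MiB at the assumed buffer size. The count scales linearly with each factor, so doubling the parallelism doubles the number of buffers needed.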