Hi,

For a node with M gigabytes of memory and N total child tasks (map + reduce combined) running on the node, what do people typically use for the following parameters:
- Xmx (heap size per child task JVM). My question here is what percentage of the node's total memory you give to the heaps of the task JVMs. I am reusing JVMs, so there are roughly N task JVMs on one node at any time. I've tried using a very large chunk of the node's memory for heaps (i.e. close to M/N per task) and have seen better execution times without swapping, but I wonder whether this is job-specific behaviour. When I set -Xmx and -Xms to the same value (maximum and minimum heap equal, to avoid expansion and contraction overheads) I ran into some swapping; I guess Xms=Xmx should be avoided when we are close to the physical memory limit.

- io.sort.mb and io.sort.factor. I understand that to answer this we'd have to take the disk configuration into consideration. Do you treat these purely as a function of the disks, or also as a function of the heap size? Obviously io.sort.mb < heap size, but how much space do you leave for non-sort-buffer usage?

I am interested in small cluster setups (8-16 nodes), not large clusters, if that makes any difference. (A sketch of how I'm currently setting these is in the P.S. below.)

- Vasilis
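P.S. In case it helps to be concrete, here is roughly how I am setting these at the moment (old-style mapred API; the class name is just for illustration, and all the numeric values are placeholders rather than recommendations -- they are exactly what I'm asking about):

    import org.apache.hadoop.mapred.JobConf;

    public class TaskTuning {
        public static void main(String[] args) {
            JobConf conf = new JobConf();
            // Heap per child task JVM; with reuse there are ~N of these alive
            // at once, so N * Xmx should stay comfortably below M, leaving room
            // for the OS, the DataNode/TaskTracker daemons, and the JVMs'
            // non-heap memory.
            conf.set("mapred.child.java.opts", "-Xmx512m");
            // Reuse each JVM for an unlimited number of tasks of the same job.
            conf.setInt("mapred.job.reuse.jvm.num.tasks", -1);
            // Map-side sort buffer, allocated inside the task heap; must be
            // well under Xmx so user code and framework objects still fit.
            conf.setInt("io.sort.mb", 200);
            // Number of streams merged at once during the sort/merge phases.
            conf.setInt("io.sort.factor", 50);
        }
    }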