Re: swapping on hadoop
On Apr 1, 2010, at 5:04 PM, Vasilis Liaskovitis wrote:

> ok. Now, considering a map-side space buffer and a sort buffer, do both
> account for tenured space in both map and reduce JVMs? I'd think the
> map-side buffer gets used and tenured in map tasks, and the sort space gets
> used and tenured in the reduce task during the sort/merge phase. Would both
> spaces really be used in both kinds of tasks?

It is my understanding that a JVM used for a map won't also be used for a reduce. JVM reuse runs multiple maps or reduces in one process, but not across both. The mapper does the majority of the sorting; the reducer mostly merges pre-sorted data. Each kind of task tends to have a different memory footprint, dependent on the job and data.

>> The maximum number of map and reduce tasks per node applies no matter how
>> many jobs are running.
>
> Right. But depending on your job scheduler, isn't it possible that you may
> be swapping the different jobs' JVM space in and out of physical memory
> while scheduling all the parallel jobs? Especially if nodes don't have huge
> amounts of memory, this scenario sounds likely.

To be more precise, the max number of map and reduce tasks corresponds to the maximum number of active JVMs of each type at the same time. When a job finishes all of its tasks, its JVMs are killed; a new job gets new JVMs. Running concurrent jobs means that each job occupies some fraction of these JVM slots. So there should be no swapping of different jobs' JVMs in and out of RAM: the same number of active JVMs exists for one large job as for 4 concurrent jobs.

> Back to a single job running and assuming all heap space is being used,
> what percentage of a node's memory would you leave for other functions,
> esp. disk cache? I currently only have 25% of memory (~4GB) for non-heap
> JVM data; I guess there should be a sweet spot, probably dependent on the
> job's I/O characteristics.

It will depend on the job, its I/O, and the OS tuning. But 25% to 33% of memory for the system file cache has worked for me (remember, the nodes aren't just for tasks, but also for HDFS). A small amount of swap-out isn't bad, since the JVMs exit and never swap back in.

> - Vasilis
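The 25-33% rule of thumb above is easy to turn into arithmetic. A minimal sketch, assuming a hypothetical 16 GB node (the function name and figures are illustrative, not from any Hadoop config):

```python
def split_node_memory(physical_mb, cache_fraction=0.25):
    """Split a node's RAM into (MB available for task JVM heaps,
    MB reserved for file cache, daemons, and the OS)."""
    reserved_mb = int(physical_mb * cache_fraction)
    return physical_mb - reserved_mb, reserved_mb

# On a 16 GB node, reserving 25% leaves 12288 MB for child JVM heaps
# and 4096 MB for the file cache and daemons.
heap_budget_mb, reserved_mb = split_node_memory(16384, cache_fraction=0.25)
```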
Re: swapping on hadoop
Hi,

On Thu, Apr 1, 2010 at 2:02 PM, Scott Carey wrote:

>> In this example, what hadoop config parameters do the above 2 buffers
>> refer to? io.sort.mb=250, but which parameter does the "map side join"
>> 100MB refer to? Are you referring to the split size of the input data
>> handled by a single map task?
>
> "Map side join" is just an example of one of many possible use cases where
> a particular map implementation may hold on to some semi-permanent data for
> the whole task. It could be anything that takes 100MB of heap and holds the
> data across individual calls to map().

ok. Now, considering a map-side space buffer and a sort buffer, do both account for tenured space in both map and reduce JVMs? I'd think the map-side buffer gets used and tenured in map tasks and the sort space gets used and tenured in the reduce task during the sort/merge phase. Would both spaces really be used in both kinds of tasks?

> Java typically uses 5MB to 60MB for classloader data (statics, classes) and
> some space for threads, etc. The default thread stack on most OS's is about
> 1MB, and the number of threads for a task process is on the order of a
> dozen. Getting 2-3x the space in a java process outside the heap would
> require either a huge thread count, a large native library loaded, or
> perhaps a non-java hadoop job using pipes.
> It would be rather obvious in 'top' if you sort by memory (shift-M on
> linux), or vmstat, etc. To get the current size of the heap of a process,
> you can use jstat or 'kill -3' to create a stack dump and heap summary.

thanks, good to know.

>> With this new setup, I don't normally get swapping for a single job, e.g.
>> terasort or a hive job. However, the problem in general is exacerbated if
>> one spawns multiple independent hadoop jobs simultaneously. I've noticed
>> that JVMs are not re-used across jobs, in an earlier post:
>> http://www.mail-archive.com/common-...@hadoop.apache.org/msg01174.html
>> This implies that Java memory usage would blow up when submitting
>> multiple independent jobs. So this multiple-job scenario sounds more
>> susceptible to swapping.
>
> The maximum number of map and reduce tasks per node applies no matter how
> many jobs are running.

Right. But depending on your job scheduler, isn't it possible that you may be swapping the different jobs' JVM space in and out of physical memory while scheduling all the parallel jobs? Especially if nodes don't have huge amounts of memory, this scenario sounds likely.

>> A relevant question is: in production environments, do people run jobs in
>> parallel? Or is it that the majority of jobs is a serial pipeline /
>> cascade of jobs being run back to back?
>
> Jobs are absolutely run in parallel. I recommend using the fair scheduler
> with no config parameters other than 'assignmultiple = true' as the
> 'baseline' scheduler, and adjust from there accordingly. The Capacity
> Scheduler has more tuning knobs for dealing with memory constraints if
> jobs have drastically different memory needs. The out-of-the-box FIFO
> scheduler tends to have a hard time keeping the cluster utilization high
> when there are multiple jobs to run.

thanks, I'll try this.

Back to a single job running and assuming all heap space is being used, what percentage of a node's memory would you leave for other functions, esp. disk cache? I currently only have 25% of memory (~4GB) for non-heap JVM data; I guess there should be a sweet spot, probably dependent on the job's I/O characteristics.

- Vasilis
Re: swapping on hadoop
On Apr 1, 2010, at 8:38 AM, Vasilis Liaskovitis wrote:

> In this example, what hadoop config parameters do the above 2 buffers
> refer to? io.sort.mb=250, but which parameter does the "map side join"
> 100MB refer to? Are you referring to the split size of the input data
> handled by a single map task? Apart from that question, the example is
> clear to me and useful, thanks.

"Map side join" is just an example of one of many possible use cases where a particular map implementation may hold on to some semi-permanent data for the whole task. It could be anything that takes 100MB of heap and holds the data across individual calls to map().

> Quoting Allen: "Java takes more RAM than just the heap size. Sometimes
> 2-3x as much."
> Is there a clear indication that Java memory usage extends so far beyond
> its allocated heap? E.g. would java thread stacks really account for such
> a big increase, 2x to 3x? Tasks seem to be heavily threaded. What are the
> relevant config options to control the number of threads within a task?

Java typically uses 5MB to 60MB for classloader data (statics, classes) and some space for threads, etc. The default thread stack on most OS's is about 1MB, and the number of threads for a task process is on the order of a dozen. Getting 2-3x the space in a java process outside the heap would require either a huge thread count, a large native library loaded, or perhaps a non-java hadoop job using pipes.

It would be rather obvious in 'top' if you sort by memory (shift-M on linux), or vmstat, etc. To get the current size of the heap of a process, you can use jstat or 'kill -3' to create a stack dump and heap summary.

> With this new setup, I don't normally get swapping for a single job, e.g.
> terasort or a hive job. However, the problem in general is exacerbated if
> one spawns multiple independent hadoop jobs simultaneously. I've noticed
> that JVMs are not re-used across jobs, in an earlier post:
> http://www.mail-archive.com/common-...@hadoop.apache.org/msg01174.html
> This implies that Java memory usage would blow up when submitting multiple
> independent jobs. So this multiple-job scenario sounds more susceptible to
> swapping.

The maximum number of map and reduce tasks per node applies no matter how many jobs are running.

> A relevant question is: in production environments, do people run jobs in
> parallel? Or is it that the majority of jobs is a serial pipeline /
> cascade of jobs being run back to back?

Jobs are absolutely run in parallel. I recommend using the fair scheduler with no config parameters other than 'assignmultiple = true' as the 'baseline' scheduler, and adjust from there accordingly. The Capacity Scheduler has more tuning knobs for dealing with memory constraints if jobs have drastically different memory needs. The out-of-the-box FIFO scheduler tends to have a hard time keeping the cluster utilization high when there are multiple jobs to run.

> thanks,
>
> - Vasilis
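The figures quoted above make the "2-3x the heap" claim easy to sanity-check. A minimal sketch of that arithmetic, using the upper end of the numbers in this message (the function itself is illustrative):

```python
def non_heap_estimate_mb(classloader_mb=60, threads=12, stack_mb=1):
    """Rough per-process JVM memory outside the heap: classloader data
    (statics, classes) plus one stack per thread."""
    return classloader_mb + threads * stack_mb

# Even at the high end (~60 MB classloader data, a dozen 1 MB stacks),
# a pure-java task uses on the order of 72 MB outside its heap --
# nowhere near 2-3x a multi-hundred-MB heap.
```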
Re: swapping on hadoop
For multiple independent hadoop jobs, there is no easy extension for them to reuse JVMs. The JobTracker (as of hadoop 0.20.2) doesn't directly keep track of JVMs: JvmManager creates a mapJvmManager that manages map-task JVMs and a reduceJvmManager that manages reduce-task JVMs. A quick search over http://hadoop.apache.org/common/docs/current/capacity_scheduler.html doesn't yield anything on JVM reuse either.

FYI

On Thu, Apr 1, 2010 at 8:38 AM, Vasilis Liaskovitis wrote:

> All,
>
> thanks for your suggestions everyone, these are valuable. Some comments:
>
> On Wed, Mar 31, 2010 at 6:06 PM, Scott Carey wrote:
>
>> On Linux, check out the 'swappiness' OS tunable -- you can turn this down
>> from the default to reduce swapping at the expense of some system file
>> cache. However, you want a decent chunk of RAM left for the system to
>> cache files -- if it is all allocated and used by Hadoop there will be
>> extra I/O.
>
> I have set /proc/sys/vm/swappiness to 1, though I haven't tried 0.
>
>> For Java GC, if your -Xmx is above 600MB or so, try either changing
>> -XX:NewRatio to a smaller number (default is 2 for Sun JDK 6) or setting
>> the -XX:MaxNewSize parameter to around 150MB to 250MB.
>> An example of Hadoop memory use scaling as -Xmx grows:
>> Let's say you have a Hadoop job with a 100MB map side join, and 250MB of
>> hadoop sort space.
>
> In this example, what hadoop config parameters do the above 2 buffers
> refer to? io.sort.mb=250, but which parameter does the "map side join"
> 100MB refer to? Are you referring to the split size of the input data
> handled by a single map task? Apart from that question, the example is
> clear to me and useful, thanks.
>
>> Now, perhaps due to some other jobs you want to set -Xmx1200M. The above
>> job will end up using about 150MB more now, because the new space has
>> grown, although the footprint is the same. A larger new space can improve
>> performance, but with most typical hadoop jobs it won't. Making sure it
>> does not grow larger just because -Xmx is larger can help save a lot of
>> memory. Additionally, a job that would have failed with an OOME at
>> -Xmx1200M might pass at -Xmx1000M if the young generation takes 150MB
>> instead of 400MB of the space.
>
> Indeed for my jobs I haven't noticed better performance going to -Xmx900
> or 1000. I normally use -Xmx700. I haven't tried -XX:MaxNewSize or
> -XX:NewRatio but I will.
>
>> If you are using a 64 bit JRE, you can also save space with the
>> -XX:+UseCompressedOops option -- sometimes quite a bit of space.
>
> I am using this already, thanks.
>
> Quoting Allen: "Java takes more RAM than just the heap size. Sometimes
> 2-3x as much."
> Is there a clear indication that Java memory usage extends so far beyond
> its allocated heap? E.g. would java thread stacks really account for such
> a big increase, 2x to 3x? Tasks seem to be heavily threaded. What are the
> relevant config options to control the number of threads within a task?
>
> "My general rule of thumb for general purpose grids is to plan on having
> 3-4gb of free VM (swap+physical) space for the OS, monitoring, datanode,
> and task tracker processes. After that, you can carve it up however you
> want."
>
> I now have 4-6GB of free space, when taking into account the full heap
> space of all child JVMs. Does that sound reasonable for all the other
> node needs (file caching, datanode, task tracker)? Having ~1G for the
> tasktracker and ~1G for the datanode leaves 2-4GB for file caching. Even
> with this setup on a terasort run of 64GB data across 7 nodes (separate
> node for namenode/jobtracker), I run low on memory, though there is no
> swapping in the majority of cases. I assume I am running low on memory
> mainly due to file/disk caching or thread stacks? Any other possible
> reasons?
>
> With this new setup, I don't normally get swapping for a single job, e.g.
> terasort or a hive job. However, the problem in general is exacerbated if
> one spawns multiple independent hadoop jobs simultaneously. I've noticed
> that JVMs are not re-used across jobs, in an earlier post:
> http://www.mail-archive.com/common-...@hadoop.apache.org/msg01174.html
> This implies that Java memory usage would blow up when submitting multiple
> independent jobs. So this multiple-job scenario sounds more susceptible to
> swapping.
>
> A relevant question is: in production environments, do people run jobs in
> parallel? Or is it that the majority of jobs is a serial pipeline /
> cascade of jobs being run back to back?
>
> thanks,
>
> - Vasilis
Re: swapping on hadoop
All,

thanks for your suggestions everyone, these are valuable. Some comments:

On Wed, Mar 31, 2010 at 6:06 PM, Scott Carey wrote:

> On Linux, check out the 'swappiness' OS tunable -- you can turn this down
> from the default to reduce swapping at the expense of some system file
> cache. However, you want a decent chunk of RAM left for the system to
> cache files -- if it is all allocated and used by Hadoop there will be
> extra I/O.

I have set /proc/sys/vm/swappiness to 1, though I haven't tried 0.

> For Java GC, if your -Xmx is above 600MB or so, try either changing
> -XX:NewRatio to a smaller number (default is 2 for Sun JDK 6) or setting
> the -XX:MaxNewSize parameter to around 150MB to 250MB.
> An example of Hadoop memory use scaling as -Xmx grows:
> Let's say you have a Hadoop job with a 100MB map side join, and 250MB of
> hadoop sort space.

In this example, what hadoop config parameters do the above 2 buffers refer to? io.sort.mb=250, but which parameter does the "map side join" 100MB refer to? Are you referring to the split size of the input data handled by a single map task? Apart from that question, the example is clear to me and useful, thanks.

> Now, perhaps due to some other jobs you want to set -Xmx1200M. The above
> job will end up using about 150MB more now, because the new space has
> grown, although the footprint is the same. A larger new space can improve
> performance, but with most typical hadoop jobs it won't. Making sure it
> does not grow larger just because -Xmx is larger can help save a lot of
> memory. Additionally, a job that would have failed with an OOME at
> -Xmx1200M might pass at -Xmx1000M if the young generation takes 150MB
> instead of 400MB of the space.

Indeed for my jobs I haven't noticed better performance going to -Xmx900 or 1000. I normally use -Xmx700. I haven't tried -XX:MaxNewSize or -XX:NewRatio but I will.

> If you are using a 64 bit JRE, you can also save space with the
> -XX:+UseCompressedOops option -- sometimes quite a bit of space.

I am using this already, thanks.

Quoting Allen: "Java takes more RAM than just the heap size. Sometimes 2-3x as much."
Is there a clear indication that Java memory usage extends so far beyond its allocated heap? E.g. would java thread stacks really account for such a big increase, 2x to 3x? Tasks seem to be heavily threaded. What are the relevant config options to control the number of threads within a task?

"My general rule of thumb for general purpose grids is to plan on having 3-4gb of free VM (swap+physical) space for the OS, monitoring, datanode, and task tracker processes. After that, you can carve it up however you want."

I now have 4-6GB of free space, when taking into account the full heap space of all child JVMs. Does that sound reasonable for all the other node needs (file caching, datanode, task tracker)? Having ~1G for the tasktracker and ~1G for the datanode leaves 2-4GB for file caching. Even with this setup on a terasort run of 64GB data across 7 nodes (separate node for namenode/jobtracker), I run low on memory, though there is no swapping in the majority of cases. I assume I am running low on memory mainly due to file/disk caching or thread stacks? Any other possible reasons?

With this new setup, I don't normally get swapping for a single job, e.g. terasort or a hive job. However, the problem in general is exacerbated if one spawns multiple independent hadoop jobs simultaneously. I've noticed that JVMs are not re-used across jobs, in an earlier post:
http://www.mail-archive.com/common-...@hadoop.apache.org/msg01174.html
This implies that Java memory usage would blow up when submitting multiple independent jobs. So this multiple-job scenario sounds more susceptible to swapping.

A relevant question is: in production environments, do people run jobs in parallel? Or is it that the majority of jobs is a serial pipeline / cascade of jobs being run back to back?

thanks,

- Vasilis
Re: swapping on hadoop
On Linux, check out the 'swappiness' OS tunable -- you can turn this down from the default to reduce swapping at the expense of some system file cache. However, you want a decent chunk of RAM left for the system to cache files -- if it is all allocated and used by Hadoop there will be extra I/O.

For Java GC, if your -Xmx is above 600MB or so, try either changing -XX:NewRatio to a smaller number (default is 2 for Sun JDK 6) or setting the -XX:MaxNewSize parameter to around 150MB to 250MB.

An example of Hadoop memory use scaling as -Xmx grows: let's say you have a Hadoop job with a 100MB map side join, and 250MB of hadoop sort space. Both of these chunks of data will eventually get pushed to the tenured generation. So, the actual heap required will end up close to: (size of young generation) + 100MB + 250MB + misc. The default size of the young generation is 1/3 of the heap. So, at -Xmx750M this job will probably use a minimum of 600MB of java heap, plus about 50MB non-heap if this is a pure java job.

Now, perhaps due to some other jobs you want to set -Xmx1200M. The above job will end up using about 150MB more now, because the new space has grown, although the footprint is the same. A larger new space can improve performance, but with most typical hadoop jobs it won't. Making sure it does not grow larger just because -Xmx is larger can help save a lot of memory. Additionally, a job that would have failed with an OOME at -Xmx1200M might pass at -Xmx1000M if the young generation takes 150MB instead of 400MB of the space.

If you are using a 64 bit JRE, you can also save space with the -XX:+UseCompressedOops option -- sometimes quite a bit of space.

On Mar 30, 2010, at 10:15 AM, Vasilis Liaskovitis wrote:

> Hi all,
>
> I've noticed swapping for a single terasort job on a small 8-node cluster
> using hadoop-0.20.1. The swapping doesn't happen repeatably; I can have
> back-to-back runs of the same job from the same hdfs input data and get
> swapping on only 1 out of 4 identical runs. I've noticed this swapping
> behaviour on both terasort jobs and hive query jobs.
>
> - Focusing on a single job config, is there a rule of thumb about how much
> node memory should be left for use outside of child JVMs?
> I make sure that per node, there is free memory:
> (#maxmapTasksperTaskTracker + #maxreduceTasksperTaskTracker) *
> JVMHeapSize < PhysicalMemoryonNode
> The total JVM heap size per node per job from the above equation currently
> accounts for 65%-75% of the node's memory. (I've tried allocating a
> riskier 90% of the node's memory, with similar swapping observations.)
>
> - Could there be an issue with HDFS data or metadata taking up memory? I
> am not cleaning outputs or intermediate outputs from HDFS between runs.
> Is this possible?
>
> - Do people use any specific java flags (particularly garbage collection
> flags) for production environments where one job runs (or possibly more
> jobs run simultaneously)?
>
> - What are the memory requirements for the jobtracker, namenode and
> tasktracker, datanode JVMs?
>
> - I am setting io.sort.mb to about half of the JVM heap size (half of
> -Xmx in javaopts). Should this be set to a different ratio? (This setting
> doesn't sound like it should be causing swapping in the first place.)
>
> - The buffer cache is cleaned before each run (flush and echo 3 >
> /proc/sys/vm/drop_caches)
>
> any empirical advice and suggestions to solve this are appreciated.
> thanks,
>
> - Vasilis
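The heap arithmetic in the example above can be sketched directly. This assumes the Sun JDK 6 default of NewRatio=2 (young generation = 1/3 of the heap), as stated in the message; the function itself is illustrative:

```python
def min_heap_mb(tenured_mb, xmx_mb, new_ratio=2):
    """Approximate heap actually used: the young generation
    (xmx / (new_ratio + 1) by default) plus long-lived (tenured) data."""
    young_mb = xmx_mb // (new_ratio + 1)
    return young_mb + tenured_mb

# 100 MB map-side join + 250 MB sort space = 350 MB tenured.
# At -Xmx750M the young gen is ~250 MB, so ~600 MB of heap is used;
# at -Xmx1200M it grows to ~400 MB and usage rises to ~750 MB,
# even though the job's footprint is unchanged.
```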
Re: swapping on hadoop
On 3/30/10 10:15 AM, "Vasilis Liaskovitis" wrote:

> - Focusing on a single job config, is there a rule of thumb about how much
> node memory should be left for use outside of child JVMs?
> I make sure that per node, there is free memory:
> (#maxmapTasksperTaskTracker + #maxreduceTasksperTaskTracker) *
> JVMHeapSize < PhysicalMemoryonNode
> The total JVM heap size per node per job from the above equation currently
> accounts for 65%-75% of the node's memory. (I've tried allocating a
> riskier 90% of the node's memory, with similar swapping observations.)

Java takes more RAM than just the heap size. Sometimes 2-3x as much.

My general rule of thumb for general purpose grids is to plan on having 3-4GB of free VM (swap+physical) space for the OS, monitoring, datanode, and task tracker processes. After that, you can carve it up however you want. If you are running on machines with less than that much, you are likely going to face at least a little bit of swapping, or have minimal monitoring, or whatever.
Re: swapping on hadoop
Hi,

>> (#maxmapTasksperTaskTracker + #maxreduceTasksperTaskTracker) *
>> JVMHeapSize < PhysicalMemoryonNode

The tasktracker and datanode daemons also take up memory, 1GB each by default I think. Is that accounted for?

>> Could there be an issue with HDFS data or metadata taking up memory?

Is the namenode a separate machine, or does it participate in the compute nodes too?

>> What are the memory requirements for the jobtracker,namenode and
>> tasktracker,datanode JVMs?

See above; there was a thread on this forum some time back about tuning these values for the TT and DN.

>> this setting doesn't sound like it should be causing swapping in the
>> first place (io.sort.mb)

I think so too :) Just yesterday I read a tweet on machine configs for Hadoop, hope it helps you: http://bit.ly/cphF7R

Amogh

On 3/30/10 10:45 PM, "Vasilis Liaskovitis" wrote:

> Hi all,
>
> I've noticed swapping for a single terasort job on a small 8-node cluster
> using hadoop-0.20.1. The swapping doesn't happen repeatably; I can have
> back-to-back runs of the same job from the same hdfs input data and get
> swapping on only 1 out of 4 identical runs. I've noticed this swapping
> behaviour on both terasort jobs and hive query jobs.
>
> - Focusing on a single job config, is there a rule of thumb about how much
> node memory should be left for use outside of child JVMs?
> I make sure that per node, there is free memory:
> (#maxmapTasksperTaskTracker + #maxreduceTasksperTaskTracker) *
> JVMHeapSize < PhysicalMemoryonNode
> The total JVM heap size per node per job from the above equation currently
> accounts for 65%-75% of the node's memory. (I've tried allocating a
> riskier 90% of the node's memory, with similar swapping observations.)
>
> - Could there be an issue with HDFS data or metadata taking up memory? I
> am not cleaning outputs or intermediate outputs from HDFS between runs.
> Is this possible?
>
> - Do people use any specific java flags (particularly garbage collection
> flags) for production environments where one job runs (or possibly more
> jobs run simultaneously)?
>
> - What are the memory requirements for the jobtracker, namenode and
> tasktracker, datanode JVMs?
>
> - I am setting io.sort.mb to about half of the JVM heap size (half of
> -Xmx in javaopts). Should this be set to a different ratio? (This setting
> doesn't sound like it should be causing swapping in the first place.)
>
> - The buffer cache is cleaned before each run (flush and echo 3 >
> /proc/sys/vm/drop_caches)
>
> any empirical advice and suggestions to solve this are appreciated.
> thanks,
>
> - Vasilis
swapping on hadoop
Hi all,

I've noticed swapping for a single terasort job on a small 8-node cluster using hadoop-0.20.1. The swapping doesn't happen repeatably; I can have back-to-back runs of the same job from the same hdfs input data and get swapping on only 1 out of 4 identical runs. I've noticed this swapping behaviour on both terasort jobs and hive query jobs.

- Focusing on a single job config, is there a rule of thumb about how much node memory should be left for use outside of child JVMs? I make sure that per node, there is free memory:
(#maxmapTasksperTaskTracker + #maxreduceTasksperTaskTracker) * JVMHeapSize < PhysicalMemoryonNode
The total JVM heap size per node per job from the above equation currently accounts for 65%-75% of the node's memory. (I've tried allocating a riskier 90% of the node's memory, with similar swapping observations.)

- Could there be an issue with HDFS data or metadata taking up memory? I am not cleaning outputs or intermediate outputs from HDFS between runs. Is this possible?

- Do people use any specific java flags (particularly garbage collection flags) for production environments where one job runs (or possibly more jobs run simultaneously)?

- What are the memory requirements for the jobtracker, namenode and tasktracker, datanode JVMs?

- I am setting io.sort.mb to about half of the JVM heap size (half of -Xmx in javaopts). Should this be set to a different ratio? (This setting doesn't sound like it should be causing swapping in the first place.)

- The buffer cache is cleaned before each run (flush and echo 3 > /proc/sys/vm/drop_caches)

any empirical advice and suggestions to solve this are appreciated.

thanks,

- Vasilis
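The free-memory inequality above can be written as a quick check. A minimal sketch under stated assumptions: the slot counts and node size are hypothetical, and the default reserve follows Allen's 3-4GB rule of thumb from elsewhere in this thread:

```python
def fits_in_ram(max_map_tasks, max_reduce_tasks, jvm_heap_mb,
                physical_mb, reserve_mb=4096):
    """True if the total child-JVM heap plus a reserve for the OS,
    datanode, tasktracker, and file cache fits in physical memory."""
    total_heap_mb = (max_map_tasks + max_reduce_tasks) * jvm_heap_mb
    return total_heap_mb + reserve_mb <= physical_mb

# 8 map + 4 reduce slots at -Xmx700M on a 16 GB node:
# 12 * 700 + 4096 = 12496 MB <= 16384 MB, so it fits.
```

Note the check uses -Xmx as a floor, not a ceiling: as discussed in this thread, each JVM also needs non-heap space, so passing this inequality does not by itself rule out swapping.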