Hemanth, Thank you for the elaborate explanation. First of all, The total swap memory size is over 4 giga bytes, but the actual used size around several hundred kilo bytes. So I guess I can use almost whole 4 giga bytes of physical memory.
The sentence "streaming does not allow enough memory for user's processes to run" is from the book I study. So I can't say that I exactly understand the sentence. Since streaming jobs are not memory intensive I guess I'll start by using the "-Xms" option. And maybe dial down the heap size of datanode and tasktracker a little bit. I'd love it if I could put some more rams into the system but currently that is out of option so I'll have to do with tweaks and options :) Thanks again for the answer. Ed 2010/7/9 Hemanth Yamijala <yhema...@gmail.com> > Edward, > > Overall, I think the consideration should be about how much load do > you expect to support on your cluster. For HDFS, there's a good amount > of information about how much RAM is required to support a certain > amount of data stored in DFS; something similar can be found for > Map/Reduce as well. There are also a few configuration options to let > the Jobtracker use lesser memory. I suppose that depending on your > load, your answer could really have to be "increase the RAM > configuration" rather than any tweaks of the JVM heap sizes or any > other configuration. Please do consider that first. > > Anyway, some answers to your questions inline: > > > Machines in my cluster have relatively small physical memory (4GB) > > > > How much is the swap ? While it is available for use as well, it is > not advisable, because once the JVM starts to thrash to disk, in our > experience, it degrades performance rapidly. > > > I was wondering if I could reduce the heap size that namenode and > jobtracker > > are assigned. > > The default heap size is 1000MB respectively, and I know that. > > The thing is, does that 1000MB mean maximum possible memory that > namenode(or > > jobtracker) can use? > > What I mean is that does namenode start with minimum memory and increase > the > > memory size all the way up to 1000MB depending on the job status? > > Or is namenode given 1000MB from the beginning so that there is no > > flexibility at all? > > If you want you can control this using another parameter -Xms set to > the JVM. This specifies the VM to start with the specified heap size > and then increase. > > > If namenode and jobtracker do start with solid 1000MB then I would have > to > > dial them down to several hundreds of mega byte since I only 4GB of > memory. > > 2giga bytes of memory taken up just by namenode and jobtracker is too > much > > an expense for me. > > > > My question also applies to heap size of child JVM. I know that they are > > originally given 200MB of heap size. > > I intend to increase the heap size to 512MB, but if the heap size > allocation > > has no flexibility then I'd have to maintain the 200MB configuration. > > Take out the 2GB (used by namenode and jobtracker) from the total 4GB, I > can > > have only 4 map/reduce tasks with 512MB configuration and since I have > quad > > core CPU this would be a waste. > > > > Please also take into account datanodes/tasktrackers and the OS itself. > > > Oh, and one last thing. > > I am using Hadoop streaming. > > I read from a book that when you are using hadoop streaming, you should > > allocate less heap size to child JVM. (I am not sure if it meant less > than > > 200MB or less than 400MB) > > Because streaming does not allow enough memory for user's processes to > run. > > So what is the optimal heap size for map/reduce tasks in hadoop > streaming? > > My plan was to increase the heap size of the child JVM to 512MB. > > But if what the book says is true, there is no point. > > > > I think the intent is to say that when you are using Streaming, the > Child task is not really memory intensive as all the work is going to > be done by the streaming executable and so you can experiment with > much lower values than if you want to run pure Java M/R tasks. I am > not sure what you mean by "streaming does not allow enough memory for > user's processes to run". > > Thanks > hemanth >