Hemanth,

Thank you for the elaborate explanation.
First of all, the total swap size is over 4 gigabytes, but only a few
hundred kilobytes of it are actually in use.
So I guess I can use almost the whole 4 gigabytes of physical memory.

The sentence "streaming does not allow enough memory for user's processes to
run" is from the book I am studying, so I can't say that I understand it
exactly.

Since streaming jobs are not memory intensive, I guess I'll start by using
the "-Xms" option,
and maybe dial down the heap size of the datanode and tasktracker a little bit.
I'd love to put some more RAM into the system, but currently that
is not an option, so I'll have to make do with tweaks and options :)
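
To make the tweaking concrete, here is a rough sketch of the memory budget I am working with. The 4096MB total, the 2GB taken by namenode and jobtracker, and the 512MB/200MB child heaps are the figures from this thread; the class name, the method name and the 512MB OS reserve are made up for illustration. It counts heap only, and a real JVM process uses more than its heap, so the results are an upper bound.

```java
// Rough per-node memory budget; all figures are in MB.
// HeapBudget and taskSlots are hypothetical names, not part of Hadoop.
class HeapBudget {

    // Number of child JVMs with heap childHeapMb that fit in physical memory
    // after setting aside the daemon heaps and a reserve for the OS.
    // Heap only: non-heap JVM memory and streaming executables are ignored.
    static int taskSlots(int physicalMb, int daemonHeapMb, int osReserveMb, int childHeapMb) {
        if (childHeapMb <= 0) {
            throw new IllegalArgumentException("childHeapMb must be positive");
        }
        int freeMb = physicalMb - daemonHeapMb - osReserveMb;
        return freeMb <= 0 ? 0 : freeMb / childHeapMb;
    }

    public static void main(String[] args) {
        // What this JVM itself was given: the -Xmx ceiling and the heap
        // reserved so far (which starts near -Xms and can grow).
        Runtime rt = Runtime.getRuntime();
        long mb = 1024L * 1024L;
        System.out.println("max heap (ceiling): " + rt.maxMemory() / mb + " MB");
        System.out.println("heap reserved now:  " + rt.totalMemory() / mb + " MB");

        // 4GB node, 2GB for namenode + jobtracker, nothing else reserved.
        System.out.println(taskSlots(4096, 2000, 0, 512));   // 4
        // Same node with an assumed 512MB OS reserve and 200MB child heaps.
        System.out.println(taskSlots(4096, 2000, 512, 200)); // 7
    }
}
```

Running it with something like `java -Xms64m -Xmx1000m HeapBudget` should show the reserved heap starting well below the ceiling, which is the behaviour I am hoping to rely on for the daemons.
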
Thanks again for the answer.

Ed


2010/7/9 Hemanth Yamijala <yhema...@gmail.com>

> Edward,
>
> Overall, I think the consideration should be about how much load do
> you expect to support on your cluster. For HDFS, there's a good amount
> of information about how much RAM is required to support a certain
> amount of data stored in DFS; something similar can be found for
> Map/Reduce as well. There are also a few configuration options to let
> the Jobtracker use less memory. I suppose that depending on your
> load, your answer could really have to be "increase the RAM
> configuration" rather than any tweaks of the JVM heap sizes or any
> other configuration. Please do consider that first.
>
> Anyway, some answers to your questions inline:
>
> > Machines in my cluster have relatively small physical memory (4GB)
> >
>
> How much swap do you have? While it is available for use as well,
> relying on it is not advisable, because once the JVM starts to thrash
> to disk, in our experience, performance degrades rapidly.
>
> > I was wondering if I could reduce the heap size that namenode and
> > jobtracker are assigned.
> > The default heap size is 1000MB each, and I know that.
> > The thing is, does that 1000MB mean the maximum possible memory that
> > namenode (or jobtracker) can use?
> > What I mean is, does namenode start with minimum memory and increase
> > the memory size all the way up to 1000MB depending on the job status?
> > Or is namenode given 1000MB from the beginning so that there is no
> > flexibility at all?
>
> If you want, you can control this using another JVM parameter, -Xms.
> It tells the VM to start with the specified heap size and then grow
> from there, up to the maximum set by -Xmx.
>
> > If namenode and jobtracker do start with a solid 1000MB then I would
> > have to dial them down to several hundred megabytes, since I only have
> > 4GB of memory.
> > 2 gigabytes of memory taken up just by namenode and jobtracker is too
> > much of an expense for me.
> >
> > My question also applies to the heap size of the child JVMs. I know
> > that they are originally given 200MB of heap.
> > I intend to increase the heap size to 512MB, but if the heap size
> > allocation has no flexibility then I'd have to keep the 200MB
> > configuration.
> > Taking out the 2GB (used by namenode and jobtracker) from the total
> > 4GB, I can have only 4 map/reduce tasks with the 512MB configuration,
> > and since I have a quad core CPU this would be a waste.
> >
>
> Please also take into account datanodes/tasktrackers and the OS itself.
>
> > Oh, and one last thing.
> > I am using Hadoop streaming.
> > I read in a book that when you are using hadoop streaming, you should
> > allocate less heap size to the child JVM. (I am not sure if it meant
> > less than 200MB or less than 400MB)
> > Because streaming does not allow enough memory for user's processes to
> > run.
> > So what is the optimal heap size for map/reduce tasks in hadoop
> > streaming?
> > My plan was to increase the heap size of the child JVM to 512MB.
> > But if what the book says is true, there is no point.
> >
>
> I think the intent is to say that when you are using Streaming, the
> Child task is not really memory intensive as all the work is going to
> be done by the streaming executable and so you can experiment with
> much lower values than if you want to run pure Java M/R tasks. I am
> not sure what you mean by "streaming does not allow enough memory for
> user's processes to run".
>
> Thanks
> hemanth
>
