Torsten Curdt wrote:

Being a complete idiot when it comes to distributed computing, I would say it is easy to blow up a JVM when running such distributed jobs (whether it be from an OOM or anything else).

Then restrict what people can do - at least Google went that route.

I don't know the specifics of what Google did :)

If you want to do that with Java and restrict memory usage, CPU usage and descriptor access within each in-VM instance, that's a considerable amount of work that likely implies writing a specific agent for the VM (or rather an agent for a specific VM, since it's pretty unlikely you would get the same results across VMs), assuming it can then really be done at the classloader level for each task (which seems insanely complex to me once you consider allocations done at the parent classloader level, etc.). See the rough sketch below.
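To give a rough idea of where the complexity starts, here is a minimal sketch of a java.lang.instrument agent (the class name TaskResourceAgent is made up for illustration). All it does is log which classloader loads each class; even that already shows classes coming in through parent loaders, which is exactly the attribution problem, and it is nowhere near actual memory or CPU accounting.

    import java.lang.instrument.ClassFileTransformer;
    import java.lang.instrument.Instrumentation;
    import java.security.ProtectionDomain;

    public class TaskResourceAgent {

        // Invoked by the JVM before main() when started with -javaagent:agent.jar
        // (the jar manifest must declare Premain-Class: TaskResourceAgent).
        public static void premain(String agentArgs, Instrumentation inst) {
            inst.addTransformer(new ClassFileTransformer() {
                public byte[] transform(ClassLoader loader, String className,
                                        Class<?> classBeingRedefined,
                                        ProtectionDomain protectionDomain,
                                        byte[] classfileBuffer) {
                    // Classes loaded by the bootstrap/parent loaders show up here
                    // too, which is why attributing allocations to one task
                    // classloader is so hard.
                    System.out.println("loaded " + className + " via " + loader);
                    return null; // null means "leave the class unchanged"
                }
            });
        }
    }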

At least by forking a VM you get some reasonably bounded control over resource usage (or at least memory) without bringing everything down, since a VM is already bounded to some degree.
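For concreteness, a toy sketch of that approach (not the actual task runner; ChildTask is a placeholder main class): the parent forks a child JVM with its own heap cap, so a runaway task OOMs on its own without touching the parent's heap.

    import java.io.IOException;

    public class ForkedTaskLauncher {
        public static void main(String[] args) throws IOException, InterruptedException {
            ProcessBuilder pb = new ProcessBuilder(
                    "java",
                    "-Xmx256m",          // bound the heap of the child only
                    "-cp", System.getProperty("java.class.path"),
                    "ChildTask");        // placeholder main class for the task
            pb.inheritIO();              // child writes straight to our stdout/stderr
            Process child = pb.start();

            // The parent just supervises: an OOM or crash in the child surfaces
            // as an exit code, not as a crash of the parent.
            int exit = child.waitFor();
            System.out.println("child task exited with code " + exit);
        }
    }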


Failing jobs are not exactly uncommon, and running things in a sandboxed environment with less risk for the tracker seems like a perfectly reasonable choice. So yeah, VM pooling certainly makes perfect sense for it.

I am still not convinced - sorry

It's a bit like wanting to run JSPs in a separate JVM because they might take down the servlet container.

That is a bit too extreme in granularity. I think it is more like deciding whether or not to run n different webapps within the same VM: if one webapp is a resource hog, separating it would not harm the n-1 other applications, and you would either create another server instance or move it away to another node.

I know of environments with a large number of nodes (not related to Hadoop) where they also reboot a set of nodes daily to ensure that all machines are really in working condition. It is usually when a machine reboots due to a failure or whatever that someone has to rush to it because some service forgot to be registered or something like that, so doing this periodic check gives people a better idea of their response time to failure. That depends on operational procedures, for sure.

I don't think it should be done in the spirit that everything is perfect in a perfect world, because we know it is not like that. So there will be a compromise between safety and performance, and having something reasonably tolerant to failure is also a performance advantage.

Doing something as simple as a deleteOnExit in a task is enough, on some VMs, to leak a few KB each time that stay there until the VM dies (fixed in 1.5.0_10 if I remember correctly). Figuring out things like that is likely to take a severe amount of time, considering it is an internal leak and will not show up in your favorite Java profiler either.
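A toy reproduction of that pattern (not actual task code) would look like this: every call to deleteOnExit() registers the path in a JVM-internal list that is only processed at shutdown, so a long-lived VM running many tasks keeps accumulating entries even when the files themselves are deleted right away.

    import java.io.File;
    import java.io.IOException;

    public class DeleteOnExitLeak {
        public static void main(String[] args) throws IOException {
            for (int i = 0; i < 1000000; i++) {
                File tmp = File.createTempFile("task", ".tmp");
                tmp.deleteOnExit();   // entry stays in the internal shutdown list...
                tmp.delete();         // ...even though the file is already gone
            }
            // None of this shows up as "your" objects in a profiler: the
            // references live in the JDK's own shutdown-hook bookkeeping.
        }
    }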

Bottom line: even if you're 100% sure of your own code, which is quite unlikely (at least as far as I'm concerned), you don't know the third-party code. So without being totally paranoid, this is something that cannot be ignored.

-- stephane
