Hi,

Is it possible (and does it make sense) to reuse JVMs across jobs?
The mapred.job.reuse.jvm.num.tasks config option is a job-specific parameter, as its name implies. When running multiple independent jobs simultaneously with mapred.job.reuse.jvm.num.tasks=-1 (meaning always reuse), I see a lot of different Java PIDs over the duration of the job runs (far more than mapred.tasktracker.map.tasks.maximum + mapred.tasktracker.reduce.tasks.maximum), instead of the same Java processes persisting. The number of live JVMs on a given node/tasktracker at any one time never exceeds map.tasks.maximum + reduce.tasks.maximum, as expected, but idle JVMs are torn down and new ones spawned quite often. For example, here are the numbers of distinct Java PIDs observed when submitting 1, 2, 4, and 32 copies of the same job in parallel:

  copies of job    distinct Java PIDs
        1                  28
        2                  39
        4                 106
       32                 740

The relevant killing and spawning logic should be in src/mapred/org/apache/hadoop/mapred/JvmManager.java, particularly the reapJvm() method, but I haven't dug deeper. I am wondering whether it would be possible, and worthwhile from a performance standpoint, to reuse JVMs across jobs, i.e. to have a common JVM pool for all Hadoop jobs.

thanks,
- Vasilis
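P.S. In case the exact setting matters: each job enables unlimited reuse through the JobConf API, which is equivalent to setting mapred.job.reuse.jvm.num.tasks=-1 in the job's configuration. A minimal sketch:

    import org.apache.hadoop.mapred.JobConf;

    public class ReuseConf {
        public static void main(String[] args) {
            JobConf conf = new JobConf();
            // -1 = no limit on the number of tasks a JVM may run (within one job)
            conf.setNumTasksToExecutePerJvm(-1);
            // prints -1, confirming the underlying property this maps to
            System.out.println(conf.get("mapred.job.reuse.jvm.num.tasks"));
        }
    }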
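And to sketch what I mean by a common pool (hypothetical names only, this is not the actual JvmManager code): each tasktracker would keep a node-wide pool of idle JVMs with no job-specific key, hand an idle JVM to whichever task asks next, and only fork a new process while the pool is under its cap:

    import java.util.ArrayDeque;
    import java.util.Deque;

    class SharedJvmPool {
        private final int maxJvms;  // e.g. map.tasks.maximum for the map pool
        private final Deque<JvmHandle> idle = new ArrayDeque<JvmHandle>();
        private int spawned = 0;    // JVM processes forked so far

        SharedJvmPool(int maxJvms) { this.maxJvms = maxJvms; }

        // Hand out an idle JVM left over from any earlier job if one exists;
        // fork a new process only while we are under the per-node cap.
        synchronized JvmHandle acquire() throws InterruptedException {
            while (idle.isEmpty() && spawned >= maxJvms) {
                wait();  // all JVMs are busy running tasks
            }
            if (!idle.isEmpty()) {
                return idle.pop();    // cross-job reuse happens here
            }
            spawned++;
            return JvmHandle.fork();  // hypothetical process launch
        }

        // Called when a task (or a whole job) finishes: keep the JVM alive
        // and return it to the pool instead of reaping it.
        synchronized void release(JvmHandle jvm) {
            idle.push(jvm);
            notifyAll();
        }

        static final class JvmHandle {
            static JvmHandle fork() { return new JvmHandle(); }
        }
    }

The difference from the current behavior would be in release(): a finished job's JVMs would stay alive for the next job's tasks instead of being reaped.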