Re: Reusing jobs

Jason Venner Fri, 18 Apr 2008 06:18:12 -0700

We have terrible issues with threads in the JVMs holding down resources and causing the compute nodes to run out of memory and lock up. We in fact patch the JobTracker to cause the mapper/reducer JVM to System.exit, to ensure that the resources are freed.

This is particularly a problem for mappers/reducers that enable JMX or spool off many threads for internal processing.

Our solution is to tune the input split size so that the minimum mapper time is more than 1 minute.
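
As a rough illustration of that tuning, here is a minimal sketch that derives a split size from a measured per-mapper throughput. The class name, the method name and the 2 MiB/s figure are hypothetical, not taken from this thread. In Hadoop releases of that era the result would typically be fed to the mapred.min.split.size job property; treat that property name as an assumption to check against your version.

```java
public class SplitSizing {
    /**
     * Smallest split size (in bytes) for which a mapper that processes
     * bytesPerSecond keeps running for at least minSeconds.
     */
    static long minSplitBytes(long bytesPerSecond, long minSeconds) {
        if (bytesPerSecond <= 0 || minSeconds <= 0) {
            throw new IllegalArgumentException("rate and duration must be positive");
        }
        // multiplyExact throws instead of silently overflowing
        return Math.multiplyExact(bytesPerSecond, minSeconds);
    }

    public static void main(String[] args) {
        long measuredRate = 2L * 1024 * 1024; // hypothetical: 2 MiB/s per mapper
        long splitBytes = minSplitBytes(measuredRate, 60);
        System.out.println("minimum split size: " + splitBytes + " bytes");
    }
}
```

The throughput has to be measured on the real job; a mapper that spends most of its time on computation rather than I/O will need a much smaller split to reach the same run time.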


Karl Wettin wrote:

Ted Dunning wrote:

Hadoop has enormous startup costs that are relatively inherent in the
current design.

Most notably, mappers and reducers are executed in a standalone JVM
(ostensibly for safety reasons).

Is it possible to hack in support to reuse JVMs? Keep it alive until timed out and have it execute the jobs by opening a socket and saying hello? What classes should I start looking in? Could be a fun exercise.
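
To make the idea concrete, here is a minimal sketch in plain Java of a worker that stays alive, answers one request per connection, and shuts down once no client has connected for a given idle period. It does not use any Hadoop classes; ReusableWorker, serve and submit are hypothetical names, and the one-line "task" protocol is an assumption made only for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;
import java.nio.charset.StandardCharsets;

public class ReusableWorker {
    /** Handles one task per connection until idle for idleTimeoutMillis; returns tasks handled. */
    static int serve(ServerSocket server, int idleTimeoutMillis) throws IOException {
        server.setSoTimeout(idleTimeoutMillis); // accept() now throws SocketTimeoutException when idle
        int handled = 0;
        while (true) {
            try (Socket client = server.accept();
                 BufferedReader in = new BufferedReader(
                         new InputStreamReader(client.getInputStream(), StandardCharsets.UTF_8));
                 PrintWriter out = new PrintWriter(
                         new OutputStreamWriter(client.getOutputStream(), StandardCharsets.UTF_8), true)) {
                String task = in.readLine();   // placeholder for reading a real task description
                out.println("done " + task);   // placeholder for running the task in this JVM
                handled++;
            } catch (SocketTimeoutException idle) {
                return handled;                // nobody said hello in time: let the JVM go
            }
        }
    }

    /** Client side: send one task line and wait for the one-line reply. */
    static String submit(int port, String task) throws IOException {
        try (Socket s = new Socket(InetAddress.getLoopbackAddress(), port);
             PrintWriter out = new PrintWriter(
                     new OutputStreamWriter(s.getOutputStream(), StandardCharsets.UTF_8), true);
             BufferedReader in = new BufferedReader(
                     new InputStreamReader(s.getInputStream(), StandardCharsets.UTF_8))) {
            out.println(task);
            return in.readLine();
        }
    }

    public static void main(String[] args) throws Exception {
        try (ServerSocket server = new ServerSocket(0, 50, InetAddress.getLoopbackAddress())) {
            Thread worker = new Thread(() -> {
                try { serve(server, 1000); } catch (IOException e) { e.printStackTrace(); }
            });
            worker.start();
            System.out.println(submit(server.getLocalPort(), "hello"));
            System.out.println(submit(server.getLocalPort(), "hello again"));
            worker.join(); // returns about one second after the last request
        }
    }
}
```

A long-lived worker of this kind would keep whatever threads and memory earlier tasks left behind, so the idle timeout is only part of the picture.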



          karl




On 4/17/08 6:00 PM, "Karl Wettin" <[EMAIL PROTECTED]> wrote:

Is it possible to execute a job more than once?

I use map reduce when adding a new instance to a hierarchical cluster
tree. It finds the least distant node and inserts the new instance as a
sibling to that node.
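
For readers who want to picture the insertion step, here is a minimal local sketch in Java. It rests on assumptions that are not stated in this thread: instances are plain double vectors, distance is Euclidean, only leaves are compared against the new instance, and "inserting as a sibling" means replacing the nearest leaf with a new internal node that holds both. ClusterTree and its members are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

public class ClusterTree {
    static final class Node {
        final double[] instance; // null for internal (joining) nodes
        Node parent, left, right;
        Node(double[] instance) { this.instance = instance; }
    }

    Node root;
    final List<Node> leaves = new ArrayList<>();

    /** Euclidean distance; the metric is an assumption. */
    static double distance(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    /** Linear scan over all inserted instances; returns null for an empty tree. */
    Node nearestLeaf(double[] x) {
        Node best = null;
        double bestDistance = Double.POSITIVE_INFINITY;
        for (Node leaf : leaves) {
            double d = distance(leaf.instance, x);
            if (d < bestDistance) {
                bestDistance = d;
                best = leaf;
            }
        }
        return best;
    }

    /** Inserts x as a sibling of its least distant leaf and returns the new leaf. */
    Node insert(double[] x) {
        Node added = new Node(x);
        Node nearest = nearestLeaf(x);
        leaves.add(added);
        if (nearest == null) {          // first instance becomes the root
            root = added;
            return added;
        }
        Node join = new Node(null);     // new parent of the nearest leaf and the new leaf
        join.parent = nearest.parent;
        if (nearest.parent == null) {
            root = join;
        } else if (nearest.parent.left == nearest) {
            nearest.parent.left = join;
        } else {
            nearest.parent.right = join;
        }
        join.left = nearest;
        join.right = added;
        nearest.parent = join;
        added.parent = join;
        return added;
    }

    public static void main(String[] args) {
        ClusterTree tree = new ClusterTree();
        tree.insert(new double[] {0, 0});
        tree.insert(new double[] {10, 10});
        Node c = tree.insert(new double[] {1, 1});
        System.out.println("sibling of (1,1) starts with " + c.parent.left.instance[0]);
    }
}
```

The linear scan in nearestLeaf is the part that would correspond to the map (distance to each stored instance) and reduce (keep the minimum) steps of a job.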

As far as I know, it is in the very nature of this algorithm that one
inserts one instance at a time, and that this is how the second dimension
is created that makes it better than a vector cluster. It would be possible
to map all permutations of instances and skip the reduction, but that
would result in many more calculations than iteratively training the tree,
as the latter only requires one to test against the instances already
inserted into the tree.

Iteratively training this tree using Hadoop means executing one job per
instance that measures the distance to all instances in a file, to which
I also append the new instance once it is inserted in the tree.

All of the above is very inefficient, especially with a young tree that
could be trained in nanoseconds locally. So I do that until it takes 20
seconds to insert an instance.

But really, this is all Hadoop framework overhead. I'm not quite sure of
all it does when I execute a job, but it seems like quite a lot. And all
I'm doing is executing a couple of identical jobs over and over again
using new data.

It would be very nice if it just took a few milliseconds to do that.


       karl
