RE: Reusing jobs

Devaraj Das Fri, 18 Apr 2008 07:11:31 -0700

Jason, didn't get that. The jvm should exit naturally even without calling
System.exit. Where exactly did you insert the System.exit?  Please clarify.
Thanks!


> -----Original Message-----
> From: Jason Venner [mailto:[EMAIL PROTECTED] 
> Sent: Friday, April 18, 2008 6:48 PM
> To: core-user@hadoop.apache.org
> Subject: Re: Reusing jobs
> 
> We have terrible issues with threads in the JVM's holding 
> down resources and causing the compute nodes to run out of 
> memory and lock up. We in fact patch the JobTracker to cause 
> the mapper/reduce jvm to System.exit, to ensure that the 
> resources are freed.
> 
> This is particularly a problem for mapper/reducers that 
> enable jmx or spool off many threads for internal processing.
> 
> Our solution is to tune the input split size so that the 
> minimum mapper time is > 1 minute
> 
> Karl Wettin wrote:
> > Ted Dunning skrev:
> >> Hadoop has enormous startup costs that are relatively 
> inherent in the 
> >> current design.
> >>
> >> Most notably, mappers and reducers are executed in a 
> standalone JVM 
> >> (ostensibly for safety reasons).
> >
> > Is it possible to hack in support to reuse JVMs? Keep it 
> alive until 
> > timed out and have it execute the jobs by opening a socket and say 
> > hello? What classes should I start looking in? Could be a 
> fun exercise.
> >
> >
> >           karl
> >
> >
> >
> >>
> >>
> >>
> >> On 4/17/08 6:00 PM, "Karl Wettin" <[EMAIL PROTECTED]> wrote:
> >>
> >>> Is it possible to execute a job more than once?
> >>>
> >>> I use map reduce when adding a new instance to a 
> hierarchial cluster 
> >>> tree. It finds the least distant node and inserts the new 
> instance 
> >>> as a sibling to that node.
> >>>
> >>> As far as I know it is in very the nature of this 
> algorithm that one 
> >>> inserts one instance at a time, that this is how the second 
> >>> dimension is created that makes it better than a vector 
> cluster. It 
> >>> would be possible to map all permutations of instances 
> and skip the 
> >>> reduction, but that would result in many more calulations than 
> >>> iteratively training the tree as the latter only require 
> one to test 
> >>> against the instances already inserted to the tree.
> >>>
> >>> Iteratively training this tree using Hadoop means 
> executing one job 
> >>> per instance that measure distance to all instances in a 
> file that I 
> >>> also append the new instance to once inserted in the tree.
> >>>
> >>> All of above is very inefficient, especially with a young 
> tree that 
> >>> could be trained in nanoseconds locally. So I do that 
> until it takes 
> >>> 20 seconds to insert an instance.
> >>>
> >>> But really, this is all Hadoop framework overhead. I'm not quite 
> >>> sure of all it does when I execute a job, but it seems 
> like quite a 
> >>> lot. And all I'm doing is executing a couple of identical 
> jobs over 
> >>> and over again using new data.
> >>>
> >>> It would be very nice if I it just took a few 
> milliseconds to do that.
> >>>
> >>>
> >>>        karl
> >>
> >
>

RE: Reusing jobs

Reply via email to