Re: Reusing jobs

2008-04-18 Thread Jason Venner
When there are non-daemon threads (JMX threads being our #1 cause), the 
JVM will not exit without help.


This is in TaskTracker.java; in 0.16.0 it is at line 2088, in the finally 
clause of Child.main:

   LogManager.shutdown();
   // Force the JVM to exit even if it still has threads running; this
   // prevents memory-expensive JVMs from being left around.
   System.exit(0);
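
A minimal standalone sketch of the failure mode (the class is made up for
illustration; it is not part of the patch): a single non-daemon thread keeps
the JVM alive after main() returns, so the process lingers until something
calls System.exit.

    // Illustration only: one non-daemon thread pins the JVM open.
    public class NonDaemonDemo {
        public static void main(String[] args) {
            Thread worker = new Thread(new Runnable() {
                public void run() {
                    while (true) {
                        try {
                            Thread.sleep(60000L); // e.g. a JMX housekeeping thread
                        } catch (InterruptedException e) {
                            return;
                        }
                    }
                }
            });
            // worker.setDaemon(true);  // with this, the JVM would exit normally
            worker.start();
            System.out.println("main() is about to return");
            System.exit(0); // remove this and the process never terminates
        }
    }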



Devaraj Das wrote:
> Jason, didn't get that. The JVM should exit naturally even without calling
> System.exit. Where exactly did you insert the System.exit? Please clarify.
> Thanks!
>
>> -Original Message-
>> From: Jason Venner [mailto:[EMAIL PROTECTED]
>> Sent: Friday, April 18, 2008 6:48 PM
>> To: core-user@hadoop.apache.org
>> Subject: Re: Reusing jobs
>>
>> We have terrible issues with threads in the JVMs holding down resources
>> and causing the compute nodes to run out of memory and lock up. We in
>> fact patch the JobTracker to cause the mapper/reducer JVM to
>> System.exit, to ensure that the resources are freed.
>>
>> This is particularly a problem for mappers/reducers that enable JMX or
>> spool off many threads for internal processing.
>>
>> Our solution is to tune the input split size so that the minimum mapper
>> time is > 1 minute.
>>
>> Karl Wettin wrote:
>>> Ted Dunning skrev:
>>>> Hadoop has enormous startup costs that are relatively inherent in the
>>>> current design.
>>>>
>>>> Most notably, mappers and reducers are executed in a standalone JVM
>>>> (ostensibly for safety reasons).
>>>
>>> Is it possible to hack in support to reuse JVMs? Keep it alive until
>>> timed out and have it execute the jobs by opening a socket and saying
>>> hello? What classes should I start looking in? Could be a fun exercise.
>>>
>>>   karl
>>>
>>>> On 4/17/08 6:00 PM, "Karl Wettin" <[EMAIL PROTECTED]> wrote:
>>>>
>>>>> Is it possible to execute a job more than once?
>>>>>
>>>>> I use map reduce when adding a new instance to a hierarchical cluster
>>>>> tree. It finds the least distant node and inserts the new instance as
>>>>> a sibling to that node.
>>>>>
>>>>> As far as I know it is in the very nature of this algorithm that one
>>>>> inserts one instance at a time; this is how the second dimension is
>>>>> created that makes it better than a vector cluster. It would be
>>>>> possible to map all permutations of instances and skip the reduction,
>>>>> but that would result in many more calculations than iteratively
>>>>> training the tree, as the latter only requires one to test against
>>>>> the instances already inserted into the tree.
>>>>>
>>>>> Iteratively training this tree using Hadoop means executing one job
>>>>> per instance that measures distance to all instances in a file that I
>>>>> also append the new instance to once it is inserted in the tree.
>>>>>
>>>>> All of the above is very inefficient, especially with a young tree
>>>>> that could be trained in nanoseconds locally. So I do that until it
>>>>> takes 20 seconds to insert an instance.
>>>>>
>>>>> But really, this is all Hadoop framework overhead. I'm not quite sure
>>>>> of all it does when I execute a job, but it seems like quite a lot.
>>>>> And all I'm doing is executing a couple of identical jobs over and
>>>>> over again using new data.
>>>>>
>>>>> It would be very nice if it just took a few milliseconds to do that.
>>>>>
>>>>>    karl
--
Jason Venner
Attributor - Publish with Confidence <http://www.attributor.com/>
Attributor is hiring Hadoop Wranglers, contact if interested


RE: Reusing jobs

2008-04-18 Thread Devaraj Das
Jason, didn't get that. The JVM should exit naturally even without calling
System.exit. Where exactly did you insert the System.exit?  Please clarify.
Thanks! 

> -Original Message-
> From: Jason Venner [mailto:[EMAIL PROTECTED] 
> Sent: Friday, April 18, 2008 6:48 PM
> To: core-user@hadoop.apache.org
> Subject: Re: Reusing jobs
> 
> We have terrible issues with threads in the JVMs holding down resources 
> and causing the compute nodes to run out of memory and lock up. We in 
> fact patch the JobTracker to cause the mapper/reducer JVM to 
> System.exit, to ensure that the resources are freed.
> 
> This is particularly a problem for mappers/reducers that enable JMX or 
> spool off many threads for internal processing.
> 
> Our solution is to tune the input split size so that the minimum mapper 
> time is > 1 minute.
> 
> Karl Wettin wrote:
> > Ted Dunning skrev:
> >> Hadoop has enormous startup costs that are relatively inherent in the 
> >> current design.
> >>
> >> Most notably, mappers and reducers are executed in a standalone JVM 
> >> (ostensibly for safety reasons).
> >
> > Is it possible to hack in support to reuse JVMs? Keep it alive until 
> > timed out and have it execute the jobs by opening a socket and saying 
> > hello? What classes should I start looking in? Could be a fun exercise.
> >
> >   karl
> >
> >> On 4/17/08 6:00 PM, "Karl Wettin" <[EMAIL PROTECTED]> wrote:
> >>
> >>> Is it possible to execute a job more than once?
> >>>
> >>> I use map reduce when adding a new instance to a hierarchical cluster 
> >>> tree. It finds the least distant node and inserts the new instance 
> >>> as a sibling to that node.
> >>>
> >>> As far as I know it is in the very nature of this algorithm that one 
> >>> inserts one instance at a time; this is how the second dimension is 
> >>> created that makes it better than a vector cluster. It would be 
> >>> possible to map all permutations of instances and skip the 
> >>> reduction, but that would result in many more calculations than 
> >>> iteratively training the tree, as the latter only requires one to 
> >>> test against the instances already inserted into the tree.
> >>>
> >>> Iteratively training this tree using Hadoop means executing one job 
> >>> per instance that measures distance to all instances in a file that 
> >>> I also append the new instance to once it is inserted in the tree.
> >>>
> >>> All of the above is very inefficient, especially with a young tree 
> >>> that could be trained in nanoseconds locally. So I do that until it 
> >>> takes 20 seconds to insert an instance.
> >>>
> >>> But really, this is all Hadoop framework overhead. I'm not quite 
> >>> sure of all it does when I execute a job, but it seems like quite a 
> >>> lot. And all I'm doing is executing a couple of identical jobs over 
> >>> and over again using new data.
> >>>
> >>> It would be very nice if it just took a few milliseconds to do that.
> >>>
> >>>    karl
> >>
> >



Re: Reusing jobs

2008-04-18 Thread Jason Venner
We have terrible issues with threads in the JVMs holding down resources 
and causing the compute nodes to run out of memory and lock up. We in 
fact patch the JobTracker to cause the mapper/reducer JVM to System.exit, 
to ensure that the resources are freed.

This is particularly a problem for mappers/reducers that enable JMX or 
spool off many threads for internal processing.

Our solution is to tune the input split size so that the minimum mapper 
time is > 1 minute.
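
A sketch of what that tuning can look like (the property name is the
0.16-era one read by FileInputFormat; check your release): raising the
minimum split size yields fewer, longer-running map tasks, which amortizes
the per-task JVM startup cost.

    import org.apache.hadoop.mapred.JobConf;

    public class SplitTuning {
        // Sketch, assuming 0.16-era configuration names: force larger
        // input splits so each map task runs well past a minute.
        public static void tune(JobConf conf) {
            // Lower bound on split size (256 MB here), even when the DFS
            // block size is smaller; FileInputFormat applies it in getSplits.
            conf.setLong("mapred.min.split.size", 256L * 1024 * 1024);
            // The requested map count is only a hint; the fewer, larger
            // splits are what actually reduce the number of JVM launches.
            conf.setNumMapTasks(1);
        }
    }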


Karl Wettin wrote:
> Ted Dunning skrev:
>> Hadoop has enormous startup costs that are relatively inherent in the
>> current design.
>>
>> Most notably, mappers and reducers are executed in a standalone JVM
>> (ostensibly for safety reasons).
>
> Is it possible to hack in support to reuse JVMs? Keep it alive until 
> timed out and have it execute the jobs by opening a socket and saying 
> hello? What classes should I start looking in? Could be a fun exercise.
>
>   karl
>
>> On 4/17/08 6:00 PM, "Karl Wettin" <[EMAIL PROTECTED]> wrote:
>>
>>> Is it possible to execute a job more than once?
>>>
>>> I use map reduce when adding a new instance to a hierarchical cluster
>>> tree. It finds the least distant node and inserts the new instance as a
>>> sibling to that node.
>>>
>>> As far as I know it is in the very nature of this algorithm that one
>>> inserts one instance at a time; this is how the second dimension is
>>> created that makes it better than a vector cluster. It would be possible
>>> to map all permutations of instances and skip the reduction, but that
>>> would result in many more calculations than iteratively training the
>>> tree, as the latter only requires one to test against the instances
>>> already inserted into the tree.
>>>
>>> Iteratively training this tree using Hadoop means executing one job per
>>> instance that measures distance to all instances in a file that I also
>>> append the new instance to once it is inserted in the tree.
>>>
>>> All of the above is very inefficient, especially with a young tree that
>>> could be trained in nanoseconds locally. So I do that until it takes 20
>>> seconds to insert an instance.
>>>
>>> But really, this is all Hadoop framework overhead. I'm not quite sure of
>>> all it does when I execute a job, but it seems like quite a lot. And all
>>> I'm doing is executing a couple of identical jobs over and over again
>>> using new data.
>>>
>>> It would be very nice if it just took a few milliseconds to do that.
>>>
>>>    karl






Re: Reusing jobs

2008-04-17 Thread Spiros Papadimitriou
Hi --

Not really sure that JVM startup is the main overhead -- you could take a
look at the logfiles of the individual TIPs and compare the timestamp of the
first log message to the time the jobtracker reports that TIP was started.
In my experience, that is well under a second (once the cluster has warmed
up), but please do correct me if I'm wrong -- I'd really be interested to
know what others observe.

BTW, some very rough benchmarks on something similar:
  http://www.cs.cmu.edu/~spapadim/hadoop/timeline.html

The last plot shows executing the job locally (with a chunk size of 128MB)
vs a hand-coded C++ program -- both do a simple regex match and then
construct a histogram of counts of the matched strings.  The overhead is
impressively small -- I'm assuming that local execution of a Hadoop job will
still fire up a separate JVM for each map chunk (I didn't double-check
this).

Cheers,
Spiros

On Thu, Apr 17, 2008 at 10:43 PM, Karl Wettin <[EMAIL PROTECTED]> wrote:

> Ted Dunning skrev:
>
> > Hadoop has enormous startup costs that are relatively inherent in the
> > current design.
> >
> > Most notably, mappers and reducers are executed in a standalone JVM
> > (ostensibly for safety reasons).
> >
>
> Is it possible to hack in support to reuse JVMs? Keep it alive until timed
> out and have it execute the jobs by opening a socket and saying hello? What
> classes should I start looking in? Could be a fun exercise.
>
>
>  karl
>
>
>
>
>
> >
> >
> > On 4/17/08 6:00 PM, "Karl Wettin" <[EMAIL PROTECTED]> wrote:
> >
> > > Is it possible to execute a job more than once?
> > >
> > > I use map reduce when adding a new instance to a hierarchical cluster
> > > tree. It finds the least distant node and inserts the new instance as
> > > a sibling to that node.
> > >
> > > As far as I know it is in the very nature of this algorithm that one
> > > inserts one instance at a time; this is how the second dimension is
> > > created that makes it better than a vector cluster. It would be
> > > possible to map all permutations of instances and skip the reduction,
> > > but that would result in many more calculations than iteratively
> > > training the tree, as the latter only requires one to test against
> > > the instances already inserted into the tree.
> > >
> > > Iteratively training this tree using Hadoop means executing one job
> > > per instance that measures distance to all instances in a file that I
> > > also append the new instance to once it is inserted in the tree.
> > >
> > > All of the above is very inefficient, especially with a young tree
> > > that could be trained in nanoseconds locally. So I do that until it
> > > takes 20 seconds to insert an instance.
> > >
> > > But really, this is all Hadoop framework overhead. I'm not quite sure
> > > of all it does when I execute a job, but it seems like quite a lot.
> > > And all I'm doing is executing a couple of identical jobs over and
> > > over again using new data.
> > >
> > > It would be very nice if it just took a few milliseconds to do that.
> > >
> > >   karl
> > >
> >
> >
>


Re: Reusing jobs

2008-04-17 Thread Karl Wettin

Ted Dunning skrev:
> Hadoop has enormous startup costs that are relatively inherent in the
> current design.
>
> Most notably, mappers and reducers are executed in a standalone JVM
> (ostensibly for safety reasons).

Is it possible to hack in support to reuse JVMs? Keep it alive until 
timed out and have it execute the jobs by opening a socket and saying 
hello? What classes should I start looking in? Could be a fun exercise.
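
One shape the hack could take, as a purely hypothetical sketch (none of this
is an existing Hadoop class; the child JVM is launched around
TaskTracker/TaskRunner, which is where to start reading): a worker JVM that
stays alive, accepts task descriptions over a socket, and exits after an
idle timeout.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.ServerSocket;
    import java.net.Socket;

    // Hypothetical long-lived worker: no JVM fork per task.
    public class ReusableWorker {
        public static void main(String[] args) throws Exception {
            ServerSocket server = new ServerSocket(0); // ephemeral port
            server.setSoTimeout(60 * 1000);            // idle timeout: 60 s
            System.out.println("listening on port " + server.getLocalPort());
            while (true) {
                try {
                    Socket task = server.accept();
                    BufferedReader in = new BufferedReader(
                            new InputStreamReader(task.getInputStream()));
                    String jobSpec = in.readLine(); // e.g. a serialized task id
                    runTask(jobSpec);
                    task.close();
                } catch (java.net.SocketTimeoutException idle) {
                    break; // no work for a minute: give up and exit
                }
            }
            System.exit(0); // force exit despite any stray threads
        }

        private static void runTask(String jobSpec) {
            System.out.println("pretending to run: " + jobSpec);
        }
    }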



  karl


> On 4/17/08 6:00 PM, "Karl Wettin" <[EMAIL PROTECTED]> wrote:
>
>> Is it possible to execute a job more than once?
>>
>> I use map reduce when adding a new instance to a hierarchical cluster
>> tree. It finds the least distant node and inserts the new instance as a
>> sibling to that node.
>>
>> As far as I know it is in the very nature of this algorithm that one
>> inserts one instance at a time; this is how the second dimension is
>> created that makes it better than a vector cluster. It would be possible
>> to map all permutations of instances and skip the reduction, but that
>> would result in many more calculations than iteratively training the
>> tree, as the latter only requires one to test against the instances
>> already inserted into the tree.
>>
>> Iteratively training this tree using Hadoop means executing one job per
>> instance that measures distance to all instances in a file that I also
>> append the new instance to once it is inserted in the tree.
>>
>> All of the above is very inefficient, especially with a young tree that
>> could be trained in nanoseconds locally. So I do that until it takes 20
>> seconds to insert an instance.
>>
>> But really, this is all Hadoop framework overhead. I'm not quite sure of
>> all it does when I execute a job, but it seems like quite a lot. And all
>> I'm doing is executing a couple of identical jobs over and over again
>> using new data.
>>
>> It would be very nice if it just took a few milliseconds to do that.
>>
>>    karl






Re: Reusing jobs

2008-04-17 Thread Ted Dunning

Hadoop has enormous startup costs that are relatively inherent in the
current design.

Most notably, mappers and reducers are executed in a standalone JVM
(ostensibly for safety reasons).



On 4/17/08 6:00 PM, "Karl Wettin" <[EMAIL PROTECTED]> wrote:

> Is it possible to execute a job more than once?
> 
> I use map reduce when adding a new instance to a hierarchical cluster
> tree. It finds the least distant node and inserts the new instance as a
> sibling to that node.
> 
> As far as I know it is in the very nature of this algorithm that one
> inserts one instance at a time; this is how the second dimension is
> created that makes it better than a vector cluster. It would be possible
> to map all permutations of instances and skip the reduction, but that
> would result in many more calculations than iteratively training the
> tree, as the latter only requires one to test against the instances
> already inserted into the tree.
> 
> Iteratively training this tree using Hadoop means executing one job per
> instance that measures distance to all instances in a file that I also
> append the new instance to once it is inserted in the tree.
> 
> All of the above is very inefficient, especially with a young tree that
> could be trained in nanoseconds locally. So I do that until it takes 20
> seconds to insert an instance.
> 
> But really, this is all Hadoop framework overhead. I'm not quite sure of
> all it does when I execute a job, but it seems like quite a lot. And all
> I'm doing is executing a couple of identical jobs over and over again
> using new data.
> 
> It would be very nice if it just took a few milliseconds to do that.
> 
> 
>karl
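
Since the thread keeps coming back to Karl's per-instance distance job, a
minimal sketch of what it might look like against the old
org.apache.hadoop.mapred API (all names, the config key, and the placeholder
metric are assumptions, not taken from the thread): each map call scores one
stored instance against the new one, and a single reducer would keep the
minimum.

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    // Hypothetical mapper for the per-instance job described above.
    public class DistanceMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, NullWritable, Text> {

        private String candidate; // the new instance, passed via the JobConf

        public void configure(JobConf job) {
            candidate = job.get("cluster.candidate.instance", "");
        }

        public void map(LongWritable offset, Text storedInstance,
                        OutputCollector<NullWritable, Text> out,
                        Reporter reporter) throws IOException {
            double d = distance(candidate, storedInstance.toString());
            // One key for everything, so a single reducer sees all
            // distances and can keep the closest node.
            out.collect(NullWritable.get(), new Text(d + "\t" + storedInstance));
        }

        private static double distance(String a, String b) {
            return Math.abs(a.length() - b.length()); // placeholder metric
        }
    }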