Geoffry,

 Hadoop's DistributedCache (as of now) is used to "cache" M/R
application-specific files.
 These files are used only by the M/R application, not by the framework.
(Normally as a side lookup.)

 You can certainly try to use Hibernate to query your SQL-based back-end from
within the M/R code.
 But think about what happens when a few hundred or a few thousand M/R tasks
do that concurrently.
 Your back-end is going to cry. (If it can, before it dies.)

 So IMO, prepping your M/R job with DistributedCache files (pulling the data
down from the database first) is a better approach.
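
 Something along these lines (a rough sketch against the 1.x
org.apache.hadoop.filecache API; the paths and class names are made up):

import java.io.IOException;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;

public class SideLookupJob {

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "side-lookup");
    job.setJarByClass(SideLookupJob.class);
    // Placeholder HDFS path: assume the SQL data was exported here once,
    // before submission, instead of being queried by every task.
    DistributedCache.addCacheFile(new URI("/app/lookup.dat"),
        job.getConfiguration());
    job.setMapperClass(LookupMapper.class);
    // ... input/output formats and paths go here ...
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }

  public static class LookupMapper
      extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void setup(Context ctx)
        throws IOException, InterruptedException {
      // Each task reads its node-local copy once, not once per record.
      Path[] local =
          DistributedCache.getLocalCacheFiles(ctx.getConfiguration());
      // parse local[0] into an in-memory map and use it in map()
    }
  }
}

 That way a thousand tasks each read a local file instead of opening a
thousand database connections.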

 Also, MPI is pretty much out of the question (it isn't baked into the
framework).
 You'll likely have to roll your own. (And try to trick the JobTracker into
not starting the same task.)

 Does anyone have a better solution for Geoffry?



-----Original Message-----
From: Geoffry Roberts [mailto:geoffry.robe...@gmail.com] 
Sent: Friday, March 02, 2012 9:42 AM
To: common-user@hadoop.apache.org
Subject: Re: Hadoop and Hibernate

This is a tardy response.  I'm spread pretty thinly right now.

DistributedCache
<http://hadoop.apache.org/common/docs/r1.0.0/mapred_tutorial.html#DistributedCache>
is apparently deprecated.  Is there a replacement?  I didn't see anything
about this in the documentation, but then I am still using 0.21.0.  I have to
for performance reasons; 1.0.1 is too slow and the client won't have it.
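
If the replacement is simply the same calls moved onto Job/JobContext (I'm
guessing; I haven't verified this against 0.21.0), I'd expect something like:

// Driver side (same effect as DistributedCache.addCacheFile):
Job job = new Job(new Configuration(), "side-lookup");
job.addCacheFile(new URI("/app/lookup.dat"));   // placeholder path

// Inside Mapper.setup(): fetch the node-local copies via the context:
Path[] local = context.getLocalCacheFiles();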

Also, the DistributedCache
<http://hadoop.apache.org/common/docs/r1.0.0/mapred_tutorial.html#DistributedCache>
approach seems to work only from within a Hadoop job, i.e. from within a
Mapper or a Reducer, but not from within a Driver.  I have libraries that I
must access from both places.  I take it that I am stuck keeping two copies of
these libraries in sync--correct?  It's either that, or copy them into HDFS,
replacing them all at the beginning of each job run.
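
The only pattern I've come across that might avoid the two copies is
ToolRunner plus -libjars.  As I understand it (unverified), GenericOptionsParser
puts the -libjars entries on the client classpath and also ships them to the
tasks via the distributed cache.  A bare-bones sketch with made-up names:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MyDriver extends Configured implements Tool {

  @Override
  public int run(String[] args) throws Exception {
    // Build and submit the job here; getConf() already carries the
    // -libjars entries that GenericOptionsParser consumed.
    return 0;
  }

  public static void main(String[] args) throws Exception {
    System.exit(ToolRunner.run(new Configuration(), new MyDriver(), args));
  }
}

invoked as something like:

hadoop jar myapp.jar MyDriver -libjars shared-lib.jar <in> <out>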

Looking for best practices.

Thanks

On 28 February 2012 10:17, Owen O'Malley <omal...@apache.org> wrote:

> On Tue, Feb 28, 2012 at 5:15 PM, Geoffry Roberts 
> <geoffry.robe...@gmail.com> wrote:
>
> > If I create an executable jar file that contains all dependencies
> > required by the MR job, do all said dependencies get distributed to
> > all nodes?
>
> You can make a single jar and that will be distributed to all of the 
> machines that run the task, but it is better in most cases to use the 
> distributed cache.
>
> See
> http://hadoop.apache.org/common/docs/r1.0.0/mapred_tutorial.html#DistributedCache
>
> > If I specify but one reducer, which node in the cluster will the 
> > reducer run on?
>
> The scheduling is done by the JobTracker and it isn't possible to 
> control the location of the reducers.
>
> -- Owen
>



--
Geoffry Roberts
