Re: Hadoop and Hibernate

2012-03-02 Thread Geoffry Roberts
This is a tardy response.  I'm spread pretty thinly right now.

DistributedCache (http://hadoop.apache.org/common/docs/r1.0.0/mapred_tutorial.html#DistributedCache)
is apparently deprecated.  Is there a replacement?  I didn't see anything
about this in the documentation, but then I am still using 0.21.0.  I have
to for performance reasons.  1.0.1 is too slow and the client won't have
it.

Also, the DistributedCache (http://hadoop.apache.org/common/docs/r1.0.0/mapred_tutorial.html#DistributedCache)
approach seems only to work from within a Hadoop job, i.e., from within a
Mapper or a Reducer, but not from within a Driver.  I have libraries that I
must access from both places.  I take it that I am stuck keeping two
copies of these libraries in sync--correct?  It's either that, or copy
them into HDFS, replacing them all at the beginning of each job run.
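
For concreteness, the kind of driver-side call I am talking about looks like
this -- just a sketch, assuming the jar has already been copied into HDFS
(the path and job name are placeholders):

// Sketch only: make a dependency jar visible on the classpath of every task.
// Assumes the jar has already been copied into HDFS; the path is a placeholder.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;

public class DriverSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // Visible to the Mappers and Reducers, but not to this driver JVM;
    // the driver still needs the library on its own local classpath.
    DistributedCache.addFileToClassPath(
        new Path("/libs/hibernate-and-friends.jar"), conf);

    Job job = new Job(conf, "cache-classpath-sketch");
    job.setJarByClass(DriverSketch.class);
    // ... set mapper, reducer, and input/output as usual ...
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}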

Looking for best practices.

Thanks

On 28 February 2012 10:17, Owen O'Malley omal...@apache.org wrote:

 On Tue, Feb 28, 2012 at 5:15 PM, Geoffry Roberts
 geoffry.robe...@gmail.com wrote:

  If I create an executable jar file that contains all dependencies
 required
  by the MR job do all said dependencies get distributed to all nodes?

 You can make a single jar and that will be distributed to all of the
 machines that run the task, but it is better in most cases to use the
 distributed cache.

 See
 http://hadoop.apache.org/common/docs/r1.0.0/mapred_tutorial.html#DistributedCache

  If I specify but one reducer, which node in the cluster will the reducer
  run on?

 The scheduling is done by the JobTracker and it isn't possible to
 control the location of the reducers.

 -- Owen




-- 
Geoffry Roberts


Re: Hadoop and Hibernate

2012-03-02 Thread Kunaal
Are you looking to use DistributedCache for better performance?

On Fri, Mar 2, 2012 at 9:42 AM, Geoffry Roberts
geoffry.robe...@gmail.com wrote:

 This is a tardy response.  I'm spread pretty thinly right now.

 DistributedCache
 http://hadoop.apache.org/common/docs/r1.0.0/mapred_tutorial.html#DistributedCache
 is
 apparently deprecated.  Is there a replacement?  I didn't see anything
 about this in the documentation, but then I am still using 0.21.0. I have
 to for performance reasons.  1.0.1 is too slow and the client won't have
 it.

 Also, the DistributedCache
 http://hadoop.apache.org/common/docs/r1.0.0/mapred_tutorial.html#DistributedCache
 approach
 seems only to work from within a hadoop job.  i.e. From within a
 Mapper or a Reducer, but not from within a Driver.  I have libraries that I
 must access both from both places.  I take it that I am stuck keeping two
 copies of these libraries in synch--Correct?  It's either that, or copy
 them into hdfs, replacing them all at the beginning of each job run.

 Looking for best practices.

 Thanks

 On 28 February 2012 10:17, Owen O'Malley omal...@apache.org wrote:

  On Tue, Feb 28, 2012 at 5:15 PM, Geoffry Roberts
  geoffry.robe...@gmail.com wrote:
 
   If I create an executable jar file that contains all dependencies
  required
   by the MR job do all said dependencies get distributed to all nodes?
 
  You can make a single jar and that will be distributed to all of the
  machines that run the task, but it is better in most cases to use the
  distributed cache.
 
  See
 
 http://hadoop.apache.org/common/docs/r1.0.0/mapred_tutorial.html#DistributedCache
 
   If I specify but one reducer, which node in the cluster will the
 reducer
   run on?
 
  The scheduling is done by the JobTracker and it isn't possible to
  control the location of the reducers.
 
  -- Owen
 



 --
 Geoffry Roberts




-- 
What we are is the universe's gift to us.
What we become is our gift to the universe.


RE: Hadoop and Hibernate

2012-03-02 Thread Leo Leung
Geoffry,

 Hadoop's DistributedCache (as of now) is used to cache M/R application-specific
files.
 These files are used by the M/R application only, not by the framework (normally
as a side lookup).

 You can certainly try to use Hibernate to query your SQL-based back-end within
the M/R code.
 But think of what happens when a few hundred or a few thousand M/R tasks do that
concurrently.
 Your back-end is going to cry (if it can, before it dies).

 So IMO, prepping your M/R job with DistributedCache files (pulling the data down
first) is the better approach.
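 Roughly, the pattern is to pull the cached file into memory in setup() and do the
lookups from there.  A rough sketch only -- file name, format, and key/value types
are made up:

// Sketch: read a side-lookup file that was placed in the distributed cache.
// Assumes the driver added it with DistributedCache.addCacheFile(...); the
// tab-separated format is an assumption for illustration.
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class LookupMapper extends Mapper<LongWritable, Text, Text, Text> {
  private final Map<String, String> lookup = new HashMap<String, String>();

  @Override
  protected void setup(Context ctx) throws IOException {
    // Local paths of the cached files on this task's node.
    Path[] cached = DistributedCache.getLocalCacheFiles(ctx.getConfiguration());
    if (cached != null && cached.length > 0) {
      BufferedReader in = new BufferedReader(new FileReader(cached[0].toString()));
      String line;
      while ((line = in.readLine()) != null) {
        String[] parts = line.split("\t", 2);   // assumed tab-separated key/value
        if (parts.length == 2) lookup.put(parts[0], parts[1]);
      }
      in.close();
    }
  }

  @Override
  protected void map(LongWritable key, Text value, Context ctx)
      throws IOException, InterruptedException {
    String enriched = lookup.get(value.toString()); // side lookup instead of a DB query
    if (enriched != null) ctx.write(value, new Text(enriched));
  }
}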

 Also, MPI is pretty much out of the question (it is not baked into the framework).
 You'll likely have to roll your own (and try to trick the JobTracker into not
starting the same task).

 Does anyone have a better solution for Geoffry?



-Original Message-
From: Geoffry Roberts [mailto:geoffry.robe...@gmail.com] 
Sent: Friday, March 02, 2012 9:42 AM
To: common-user@hadoop.apache.org
Subject: Re: Hadoop and Hibernate

This is a tardy response.  I'm spread pretty thinly right now.

DistributedCache (http://hadoop.apache.org/common/docs/r1.0.0/mapred_tutorial.html#DistributedCache) is
apparently deprecated.  Is there a replacement?  I didn't see anything about 
this in the documentation, but then I am still using 0.21.0. I have to for 
performance reasons.  1.0.1 is too slow and the client won't have it.

Also, the 
DistributedCache (http://hadoop.apache.org/common/docs/r1.0.0/mapred_tutorial.html#DistributedCache) approach
seems only to work from within a hadoop job.  i.e. From within a Mapper or a 
Reducer, but not from within a Driver.  I have libraries that I must access 
both from both places.  I take it that I am stuck keeping two copies of these 
libraries in synch--Correct?  It's either that, or copy them into hdfs, 
replacing them all at the beginning of each job run.

Looking for best practices.

Thanks

On 28 February 2012 10:17, Owen O'Malley omal...@apache.org wrote:

 On Tue, Feb 28, 2012 at 5:15 PM, Geoffry Roberts 
 geoffry.robe...@gmail.com wrote:

  If I create an executable jar file that contains all dependencies
 required
  by the MR job do all said dependencies get distributed to all nodes?

 You can make a single jar and that will be distributed to all of the 
 machines that run the task, but it is better in most cases to use the 
 distributed cache.

 See
 http://hadoop.apache.org/common/docs/r1.0.0/mapred_tutorial.html#DistributedCache

  If I specify but one reducer, which node in the cluster will the 
  reducer run on?

 The scheduling is done by the JobTracker and it isn't possible to 
 control the location of the reducers.

 -- Owen




--
Geoffry Roberts


Re: Hadoop and Hibernate

2012-03-02 Thread Geoffry Roberts
No, I am using 0.21.0 for better performance.  I am interested in
DistributedCache so certain libraries can be found during MR processing.
As it is now, I'm getting ClassNotFoundException thrown by the
Reducers.  The Driver throws no error; the Reducer(s) do.  It would seem
something is not being distributed across the cluster as I assumed it
would be.  After all, the whole business is in a single, executable jar file.
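
To narrow it down, I may add a quick probe like this to the reducer -- a
throwaway diagnostic sketch, nothing more (key/value types are placeholders):

// Diagnostic sketch only: check whether the missing class is actually visible
// to the task-side classloader.
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class ProbeReducer extends Reducer<Text, Text, Text, Text> {
  @Override
  protected void setup(Context ctx) {
    try {
      Class.forName("qq.mob.depart.EpiState");
      System.err.println("EpiState IS on the task classpath");
    } catch (ClassNotFoundException e) {
      System.err.println("EpiState is NOT on the task classpath: " + e);
    }
  }
}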

On 2 March 2012 09:46, Kunaal kunaalbha...@gmail.com wrote:

 Are you looking to use DistributedCache for better performance?

 On Fri, Mar 2, 2012 at 9:42 AM, Geoffry Roberts
 geoffry.robe...@gmail.com wrote:

  This is a tardy response.  I'm spread pretty thinly right now.
 
  DistributedCache
 
 http://hadoop.apache.org/common/docs/r1.0.0/mapred_tutorial.html#DistributedCache
  is
  apparently deprecated.  Is there a replacement?  I didn't see anything
  about this in the documentation, but then I am still using 0.21.0. I have
  to for performance reasons.  1.0.1 is too slow and the client won't have
  it.
 
  Also, the DistributedCache
 
 http://hadoop.apache.org/common/docs/r1.0.0/mapred_tutorial.html#DistributedCache
  approach
  seems only to work from within a hadoop job.  i.e. From within a
  Mapper or a Reducer, but not from within a Driver.  I have libraries
 that I
  must access both from both places.  I take it that I am stuck keeping two
  copies of these libraries in synch--Correct?  It's either that, or copy
  them into hdfs, replacing them all at the beginning of each job run.
 
  Looking for best practices.
 
  Thanks
 
  On 28 February 2012 10:17, Owen O'Malley omal...@apache.org wrote:
 
   On Tue, Feb 28, 2012 at 5:15 PM, Geoffry Roberts
   geoffry.robe...@gmail.com wrote:
  
If I create an executable jar file that contains all dependencies
   required
by the MR job do all said dependencies get distributed to all nodes?
  
   You can make a single jar and that will be distributed to all of the
   machines that run the task, but it is better in most cases to use the
   distributed cache.
  
   See
  
 
 http://hadoop.apache.org/common/docs/r1.0.0/mapred_tutorial.html#DistributedCache
  
If I specify but one reducer, which node in the cluster will the
  reducer
run on?
  
   The scheduling is done by the JobTracker and it isn't possible to
   control the location of the reducers.
  
   -- Owen
  
 
 
 
  --
  Geoffry Roberts
 



 --
 What we are is the universe's gift to us.
 What we become is our gift to the universe.




-- 
Geoffry Roberts


Re: Hadoop and Hibernate

2012-03-02 Thread Geoffry Roberts
Thanks Leo.  I appreciate your response.

Let me explain my situation more precisely.

I am running a series of MR sub-jobs all harnessed together so they run as
a single job.  The last MR sub-job does nothing more than aggregate the
output of the previous sub-job into a single file (or files).  It does this by
having but a single reducer.  I could eliminate this aggregation sub-job if
I could have the aforementioned previous sub-job insert its output into a
database instead of HDFS.  Doing this would also eliminate my current
dependence on MultipleOutputs.
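
One option I am weighing is Hadoop's DBOutputFormat, which would let the
reducer write straight to a JDBC table and skip Hibernate entirely.  A rough
sketch only -- the JDBC driver, URL, credentials, table, and column names are
all placeholders:

// Sketch: configure a job so the reduce output goes straight to a JDBC table
// via DBOutputFormat instead of HDFS.  The reducer's output KEY class must
// implement DBWritable (it supplies the values for the INSERT).
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.db.DBConfiguration;
import org.apache.hadoop.mapreduce.lib.db.DBOutputFormat;

public class DbOutputDriverSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    DBConfiguration.configureDB(conf,
        "org.postgresql.Driver",                  // assumed JDBC driver
        "jdbc:postgresql://dbhost/epidemiology",  // placeholder URL
        "user", "password");                      // placeholder credentials

    Job job = new Job(conf, "write-to-db-sketch");
    job.setJarByClass(DbOutputDriverSketch.class);
    job.setOutputFormatClass(DBOutputFormat.class);
    DBOutputFormat.setOutput(job, "epi_state", "id", "name", "count");

    // ... mapper, reducer, and input settings as usual ...
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}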

The trouble comes when the Reducer(s) cannot find the persistent classes,
hence the dreaded CNFE.  I find this odd because they are in the same
package as the Reducer.

Your comment about the back end crying is duly noted.

btw,
MPI = Message Passing Interface?

On 2 March 2012 10:30, Leo Leung
 lle...@ddn.com wrote:

 Geoffry,

  Hadoop distributedCache (as of now) is used to cache M/R application
 specific files.
  These files are used by M/R app only and not the framework. (Normally as
 side-lookup)

  You can certainly try to use Hibernate to query your SQL based back-end
 within the M/R code.
  But think of what happens when a few hundred or thousands of M/R task do
 that concurrently.
  Your back-end is going to cry. (if it can - before it dies)

  So IMO,  prep your M/R job with distributedCache files (pull it down
 first) is a better approach.

  Also, MPI is pretty much out of question (not baked into the framework).
  You'll likely have to roll your own.  (And try to trick the JobTracker in
 not starting the same task)

  Anyone has a better solution for Geoffry?



 -Original Message-
 From: Geoffry Roberts [mailto:geoffry.robe...@gmail.com]
 Sent: Friday, March 02, 2012 9:42 AM
 To: common-user@hadoop.apache.org
 Subject: Re: Hadoop and Hibernate

 This is a tardy response.  I'm spread pretty thinly right now.

 DistributedCache
 http://hadoop.apache.org/common/docs/r1.0.0/mapred_tutorial.html#DistributedCache
 is
 apparently deprecated.  Is there a replacement?  I didn't see anything
 about this in the documentation, but then I am still using 0.21.0. I have
 to for performance reasons.  1.0.1 is too slow and the client won't have it.

 Also, the DistributedCache
 http://hadoop.apache.org/common/docs/r1.0.0/mapred_tutorial.html#DistributedCache
 approach
 seems only to work from within a hadoop job.  i.e. From within a Mapper or
 a Reducer, but not from within a Driver.  I have libraries that I must
 access both from both places.  I take it that I am stuck keeping two copies
 of these libraries in synch--Correct?  It's either that, or copy them into
 hdfs, replacing them all at the beginning of each job run.

 Looking for best practices.

 Thanks

 On 28 February 2012 10:17, Owen O'Malley omal...@apache.org wrote:

  On Tue, Feb 28, 2012 at 5:15 PM, Geoffry Roberts
  geoffry.robe...@gmail.com wrote:
 
   If I create an executable jar file that contains all dependencies
  required
   by the MR job do all said dependencies get distributed to all nodes?
 
  You can make a single jar and that will be distributed to all of the
  machines that run the task, but it is better in most cases to use the
  distributed cache.
 
  See
  http://hadoop.apache.org/common/docs/r1.0.0/mapred_tutorial.html#DistributedCache
 
   If I specify but one reducer, which node in the cluster will the
   reducer run on?
 
  The scheduling is done by the JobTracker and it isn't possible to
  control the location of the reducers.
 
  -- Owen
 



 --
 Geoffry Roberts




-- 
Geoffry Roberts


Re: Hadoop and Hibernate

2012-03-02 Thread Geoffry Roberts
Queries are nothing but inserts.  Create an object, populate it, persist
it.  If it worked, life would be good right now.

I've considered JDBC and may yet take that approach.
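
If I do, something like a plain batched insert from cleanup() is what I have
in mind -- only a sketch; the URL, table, and columns are invented:

// Sketch: plain JDBC batch insert from a reducer's cleanup(), as an
// alternative to Hibernate.  Connection details and schema are invented.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class JdbcInsertSketch {
  // rows would be whatever the reducer accumulated during reduce()
  static void insertAll(Iterable<String[]> rows) throws Exception {
    Connection con = DriverManager.getConnection(
        "jdbc:postgresql://dbhost/epidemiology", "user", "password");
    con.setAutoCommit(false);
    PreparedStatement ps =
        con.prepareStatement("INSERT INTO epi_state (name, count) VALUES (?, ?)");
    try {
      for (String[] row : rows) {
        ps.setString(1, row[0]);
        ps.setInt(2, Integer.parseInt(row[1]));
        ps.addBatch();
      }
      ps.executeBatch();
      con.commit();   // one commit per task keeps the round trips down
    } finally {
      ps.close();
      con.close();
    }
  }
}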

re: Hibernate outside of Spring -- I'm getting tired already.

Interesting thing:  I use EMF (Eclipse Modeling Framework).  The
supporting jar files for EMF and Ecore are built into the job.  They are
being found by the Driver(s) and the MR(s), no problemo.  If these work, why
not the Hibernate stuff?  Mystery!

On 2 March 2012 10:50, Tarjei Huse tar...@scanmine.com wrote:

 On 03/02/2012 07:31 PM, Geoffry Roberts wrote:
  No, I am using 0.21.0 for better performance.  I am interested in
  DistributedCache so certain libraries can be found during MR processing.
  As it is now, I'm getting ClassNotFoundException being thrown by the
  Reducers.  The Driver throws no error, the Reducer(s) does.  It would
 seem
  something is not being distributed across the cluster as I assumed it
  would.  After all, the whole business is in a single, executable jar
 file.

 How complex are the queries you are doing?

 Have you considered one of the following:

 1) Use plain jdbc instead of integrating Hibernate into Hadoop.
 2) Create a local version of the db that can be in the Distributed Cache.

 I tried using Hibernate with hadoop (the queries were not an important
 part of the size of the jobs) but I ran up against so many issues trying
 to get Hibernate to start up within the MR job that i ended up just
 exporting the tables, loading them into memory and doing queries against
 them with basic HashMap lookups.

 My best advice is that if you can, you should consider a way to abstract
 away Hibernate from the job and use something closer to the metal like
 either JDBC or just dump the data to files. Getting Hibernate to run
 outside of Spring and friends can quickly grow tiresome.

 T
 
  On 2 March 2012 09:46, Kunaal kunaalbha...@gmail.com wrote:
 
  Are you looking to use DistributedCache for better performance?
 
  On Fri, Mar 2, 2012 at 9:42 AM, Geoffry Roberts
  geoffry.robe...@gmail.com wrote:
 
  This is a tardy response.  I'm spread pretty thinly right now.
 
  DistributedCache
 
 
 http://hadoop.apache.org/common/docs/r1.0.0/mapred_tutorial.html#DistributedCache
  is
  apparently deprecated.  Is there a replacement?  I didn't see anything
  about this in the documentation, but then I am still using 0.21.0. I
 have
  to for performance reasons.  1.0.1 is too slow and the client won't
 have
  it.
 
  Also, the DistributedCache
 
 
 http://hadoop.apache.org/common/docs/r1.0.0/mapred_tutorial.html#DistributedCache
  approach
  seems only to work from within a hadoop job.  i.e. From within a
  Mapper or a Reducer, but not from within a Driver.  I have libraries
  that I
  must access both from both places.  I take it that I am stuck keeping
 two
  copies of these libraries in synch--Correct?  It's either that, or copy
  them into hdfs, replacing them all at the beginning of each job run.
 
  Looking for best practices.
 
  Thanks
 
  On 28 February 2012 10:17, Owen O'Malley omal...@apache.org wrote:
 
  On Tue, Feb 28, 2012 at 5:15 PM, Geoffry Roberts
  geoffry.robe...@gmail.com wrote:
 
  If I create an executable jar file that contains all dependencies
  required
  by the MR job do all said dependencies get distributed to all nodes?
  You can make a single jar and that will be distributed to all of the
  machines that run the task, but it is better in most cases to use the
  distributed cache.
 
  See
 
 
 http://hadoop.apache.org/common/docs/r1.0.0/mapred_tutorial.html#DistributedCache
  If I specify but one reducer, which node in the cluster will the
  reducer
  run on?
  The scheduling is done by the JobTracker and it isn't possible to
  control the location of the reducers.
 
  -- Owen
 
 
 
  --
  Geoffry Roberts
 
 
 
  --
  What we are is the universe's gift to us.
  What we become is our gift to the universe.
 
 
 


 --
 Regards / Med vennlig hilsen
 Tarjei Huse
 Mobil: 920 63 413




-- 
Geoffry Roberts


Re: Hadoop and Hibernate

2012-03-02 Thread Tarjei Huse
On 03/02/2012 07:59 PM, Geoffry Roberts wrote:
 Queries are nothing but inserts.  Create an object, populated it, persist
 it. If it worked, life would be good right now.

 I've considered JDBC and may yet take that approach.
I'm using MyBatis on a project now; it's also worth considering if you want a
more ORM-like feel to the job.

 re: Hibernate outside of Spring -- I'm getting tired already.

 Interesting thing:  I use EMF (Eclipse Modelling Framework).  The
 supporting jar files for emf and ecore are built into the job.  They are
 being found by the Driver(s) and the MR(s) no problemo.  If these work, why
 not the hibernate stuff?  Mystery!
I wish I knew. :)


T

 On 2 March 2012 10:50, Tarjei Huse tar...@scanmine.com wrote:

 On 03/02/2012 07:31 PM, Geoffry Roberts wrote:
 No, I am using 0.21.0 for better performance.  I am interested in
 DistributedCache so certain libraries can be found during MR processing.
 As it is now, I'm getting ClassNotFoundException being thrown by the
 Reducers.  The Driver throws no error, the Reducer(s) does.  It would
 seem
 something is not being distributed across the cluster as I assumed it
 would.  After all, the whole business is in a single, executable jar
 file.

 How complex are the queries you are doing?

 Have you considered one of the following:

 1) Use plain jdbc instead of integrating Hibernate into Hadoop.
 2) Create a local version of the db that can be in the Distributed Cache.

 I tried using Hibernate with hadoop (the queries were not an important
 part of the size of the jobs) but I ran up against so many issues trying
 to get Hibernate to start up within the MR job that i ended up just
 exporting the tables, loading them into memory and doing queries against
 them with basic HashMap lookups.

 My best advice is that if you can, you should consider a way to abstract
 away Hibernate from the job and use something closer to the metal like
 either JDBC or just dump the data to files. Getting Hibernate to run
 outside of Spring and friends can quickly grow tiresome.

 T
 On 2 March 2012 09:46, Kunaal kunaalbha...@gmail.com wrote:

 Are you looking to use DistributedCache for better performance?

 On Fri, Mar 2, 2012 at 9:42 AM, Geoffry Roberts
 geoffry.robe...@gmail.com wrote:

 This is a tardy response.  I'm spread pretty thinly right now.

 DistributedCache

 http://hadoop.apache.org/common/docs/r1.0.0/mapred_tutorial.html#DistributedCache
 is
 apparently deprecated.  Is there a replacement?  I didn't see anything
 about this in the documentation, but then I am still using 0.21.0. I
 have
 to for performance reasons.  1.0.1 is too slow and the client won't
 have
 it.

 Also, the DistributedCache

 http://hadoop.apache.org/common/docs/r1.0.0/mapred_tutorial.html#DistributedCache
 approach
 seems only to work from within a hadoop job.  i.e. From within a
 Mapper or a Reducer, but not from within a Driver.  I have libraries
 that I
 must access both from both places.  I take it that I am stuck keeping
 two
 copies of these libraries in synch--Correct?  It's either that, or copy
 them into hdfs, replacing them all at the beginning of each job run.

 Looking for best practices.

 Thanks

 On 28 February 2012 10:17, Owen O'Malley omal...@apache.org wrote:

 On Tue, Feb 28, 2012 at 5:15 PM, Geoffry Roberts
 geoffry.robe...@gmail.com wrote:

 If I create an executable jar file that contains all dependencies
 required
 by the MR job do all said dependencies get distributed to all nodes?
 You can make a single jar and that will be distributed to all of the
 machines that run the task, but it is better in most cases to use the
 distributed cache.

 See

 http://hadoop.apache.org/common/docs/r1.0.0/mapred_tutorial.html#DistributedCache
 If I specify but one reducer, which node in the cluster will the
 reducer
 run on?
 The scheduling is done by the JobTracker and it isn't possible to
 control the location of the reducers.

 -- Owen


 --
 Geoffry Roberts


 --
 What we are is the universe's gift to us.
 What we become is our gift to the universe.



 --
 Regards / Med vennlig hilsen
 Tarjei Huse
 Mobil: 920 63 413





-- 
Regards / Med vennlig hilsen
Tarjei Huse
Mobil: 920 63 413



Hadoop and Hibernate

2012-02-28 Thread Geoffry Roberts
All,

I am trying to use Hibernate within my reducer and it goeth not well.  Has
anybody ever successfully done this?

I have a Java package that contains my Hadoop driver, mapper, and reducer,
along with a persistence class.  I call Hibernate from the cleanup() method
in my reducer class.  It complains that it cannot find the persistence
class.  The class is in the same package as the reducer, and this all works
outside of Hadoop.  The error is thrown when I attempt to begin a
transaction.

The error:

org.hibernate.MappingException: Unknown entity: qq.mob.depart.EpiState

The code:

protected void cleanup(Context ctx) throws IOException,
    InterruptedException {
  ...
  org.hibernate.cfg.Configuration cfg = new org.hibernate.cfg.Configuration();
  SessionFactory sessionFactory =
      cfg.configure("hibernate.cfg.xml").buildSessionFactory();
  cfg.addAnnotatedClass(EpiState.class); // This class is in the same package as the reducer.
  Session session = sessionFactory.openSession();
  Transaction tx = session.getTransaction();
  tx.begin(); // Error is thrown here.
  ...
}
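
For comparison, a self-contained sketch of the shape I am aiming for, with the
annotated class registered before the SessionFactory is built (assuming a valid
hibernate.cfg.xml on the task classpath):

// Sketch only.  EpiState is the annotated entity mentioned above.
import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.hibernate.Session;
import org.hibernate.SessionFactory;
import org.hibernate.Transaction;

public class EpiStateReducerSketch extends Reducer<Text, Text, Text, Text> {
  @Override
  protected void cleanup(Context ctx) throws IOException, InterruptedException {
    // Register mapped classes before buildSessionFactory(); changes made to the
    // Configuration afterwards do not affect a factory that is already built.
    org.hibernate.cfg.Configuration cfg = new org.hibernate.cfg.Configuration()
        .configure("hibernate.cfg.xml")
        .addAnnotatedClass(EpiState.class);
    SessionFactory sessionFactory = cfg.buildSessionFactory();
    Session session = sessionFactory.openSession();
    Transaction tx = session.beginTransaction();
    try {
      // session.save(...) the accumulated objects here
      tx.commit();
    } catch (RuntimeException e) {
      tx.rollback();
      throw e;
    } finally {
      session.close();
      sessionFactory.close();
    }
  }
}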

If I create an executable jar file that contains all dependencies required
by the MR job do all said dependencies get distributed to all nodes?

If I specify but one reducer, which node in the cluster will the reducer
run on?

Thanks
-- 
Geoffry Roberts


Re: Hadoop and Hibernate

2012-02-28 Thread Owen O'Malley
On Tue, Feb 28, 2012 at 5:15 PM, Geoffry Roberts
geoffry.robe...@gmail.com wrote:

 If I create an executable jar file that contains all dependencies required
 by the MR job do all said dependencies get distributed to all nodes?

You can make a single jar and that will be distributed to all of the
machines that run the task, but it is better in most cases to use the
distributed cache.

See 
http://hadoop.apache.org/common/docs/r1.0.0/mapred_tutorial.html#DistributedCache

 If I specify but one reducer, which node in the cluster will the reducer
 run on?

The scheduling is done by the JobTracker and it isn't possible to
control the location of the reducers.

-- Owen