Re: Hadoop and Hibernate

2012-03-02 Thread Geoffry Roberts
This is a tardy response.  I'm spread pretty thinly right now.

DistributedCache
(http://hadoop.apache.org/common/docs/r1.0.0/mapred_tutorial.html#DistributedCache)
is apparently deprecated.  Is there a replacement?  I didn't see anything
about this in the documentation, but then I am still using 0.21.0.  I have
to for performance reasons.  1.0.1 is too slow and the client won't have
it.

Also, the DistributedCache
(http://hadoop.apache.org/common/docs/r1.0.0/mapred_tutorial.html#DistributedCache)
approach seems only to work from within a Hadoop job, i.e. from within a
Mapper or a Reducer, but not from within a Driver.  I have libraries that I
must access from both places.  I take it that I am stuck keeping two
copies of these libraries in sync--Correct?  It's either that, or copy
them into hdfs, replacing them all at the beginning of each job run.
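
For concreteness, this is roughly what I have in mind for the copy-into-hdfs
route.  A sketch only: the paths and jar name are made up, and it assumes the
org.apache.hadoop.filecache.DistributedCache API.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShipLibs {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Push the shared library from the driver machine into HDFS,
    // overwriting whatever was left there by the previous run.
    Path local = new Path("/opt/myapp/lib/hibernate-core.jar");   // made up
    Path inHdfs = new Path("/apps/myapp/lib/hibernate-core.jar"); // made up
    fs.copyFromLocalFile(false, true, local, inHdfs);

    // Put the jar on the task classpath so Mappers and Reducers can load it.
    DistributedCache.addFileToClassPath(inHdfs, conf);

    // ... then build and submit the Job with this conf ...
  }
}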

Looking for best practices.

Thanks

On 28 February 2012 10:17, Owen O'Malley omal...@apache.org wrote:

 On Tue, Feb 28, 2012 at 5:15 PM, Geoffry Roberts
 geoffry.robe...@gmail.com wrote:

  If I create an executable jar file that contains all dependencies
 required
  by the MR job do all said dependencies get distributed to all nodes?

 You can make a single jar and that will be distributed to all of the
 machines that run the task, but it is better in most cases to use the
 distributed cache.

 See
 http://hadoop.apache.org/common/docs/r1.0.0/mapred_tutorial.html#DistributedCache

  If I specify but one reducer, which node in the cluster will the reducer
  run on?

 The scheduling is done by the JobTracker and it isn't possible to
 control the location of the reducers.

 -- Owen




-- 
Geoffry Roberts


Re: Hadoop and Hibernate

2012-03-02 Thread Geoffry Roberts
No, I am using 0.21.0 for better performance.  I am interested in
DistributedCache so certain libraries can be found during MR processing.
As it is now, I'm getting a ClassNotFoundException thrown by the
Reducers.  The Driver throws no error; the Reducer(s) do.  It would seem
something is not being distributed across the cluster as I assumed it
would.  After all, the whole business is in a single, executable jar file.

On 2 March 2012 09:46, Kunaal kunaalbha...@gmail.com wrote:

 Are you looking to use DistributedCache for better performance?

 On Fri, Mar 2, 2012 at 9:42 AM, Geoffry Roberts
 geoffry.robe...@gmail.comwrote:

  This is a tardy response.  I'm spread pretty thinly right now.
 
  DistributedCache
  (http://hadoop.apache.org/common/docs/r1.0.0/mapred_tutorial.html#DistributedCache)
  is apparently deprecated.  Is there a replacement?  I didn't see anything
  about this in the documentation, but then I am still using 0.21.0.  I have
  to for performance reasons.  1.0.1 is too slow and the client won't have
  it.
 
  Also, the DistributedCache
  (http://hadoop.apache.org/common/docs/r1.0.0/mapred_tutorial.html#DistributedCache)
  approach seems only to work from within a Hadoop job, i.e. from within a
  Mapper or a Reducer, but not from within a Driver.  I have libraries that I
  must access from both places.  I take it that I am stuck keeping two
  copies of these libraries in sync--Correct?  It's either that, or copy
  them into hdfs, replacing them all at the beginning of each job run.
 
  Looking for best practices.
 
  Thanks
 
  On 28 February 2012 10:17, Owen O'Malley omal...@apache.org wrote:
 
   On Tue, Feb 28, 2012 at 5:15 PM, Geoffry Roberts
   geoffry.robe...@gmail.com wrote:
  
If I create an executable jar file that contains all dependencies
   required
by the MR job do all said dependencies get distributed to all nodes?
  
   You can make a single jar and that will be distributed to all of the
   machines that run the task, but it is better in most cases to use the
   distributed cache.
  
   See
  
 
 http://hadoop.apache.org/common/docs/r1.0.0/mapred_tutorial.html#DistributedCache
  
If I specify but one reducer, which node in the cluster will the
  reducer
run on?
  
   The scheduling is done by the JobTracker and it isn't possible to
   control the location of the reducers.
  
   -- Owen
  
 
 
 
  --
  Geoffry Roberts
 



 --
 What we are is the universe's gift to us.
 What we become is our gift to the universe.




-- 
Geoffry Roberts


Re: Hadoop and Hibernate

2012-03-02 Thread Geoffry Roberts
Thanks Leo.  I appreciate your response.

Let me explain my situation more precisely.

I am running a series of MR sub-jobs all harnessed together so they run as
a single job.  The last MR sub-job does nothing more than aggregate the
output of the previous sub-job into a single file(s).  It does this by
having but a single reducer.  I could eliminate this aggregation sub-job if
I could have the aforementioned previous sub-job insert its output into a
database instead of hdfs.  Doing this would also eliminate my current
dependence on MultipleOutputs.
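
If I go the database route, I imagine the driver would look something like
this, using Hadoop's DBOutputFormat.  A sketch only: the JDBC driver class,
connection string, table, and column names are all invented.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.db.DBConfiguration;
import org.apache.hadoop.mapreduce.lib.db.DBOutputFormat;

public class DbOutputSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Connection details are invented; the JDBC driver jar would also
    // have to be visible to the tasks.
    DBConfiguration.configureDB(conf, "org.postgresql.Driver",
        "jdbc:postgresql://dbhost/epi", "user", "password");

    Job job = new Job(conf, "final-aggregation");
    job.setOutputFormatClass(DBOutputFormat.class);
    // The reducer's output key would need to implement DBWritable and
    // map onto these (invented) columns.
    DBOutputFormat.setOutput(job, "epi_state", "id", "state", "count");
    // ... mapper/reducer classes, input paths, etc. ...
    job.waitForCompletion(true);
  }
}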

The trouble comes when the Reducer(s) cannot find the persistence classes,
hence the dreaded CNFE.  I find this odd because they are in the same
package as the Reducer.

Your comment about the back end crying is duly noted.

btw,
MPI = Message Passing Interface?

On 2 March 2012 10:30, Leo Leung
 lle...@ddn.com wrote:

 Geoffry,

  Hadoop DistributedCache (as of now) is used to cache M/R application
 specific files.
  These files are used by the M/R app only, not by the framework (normally as
 a side-lookup).

  You can certainly try to use Hibernate to query your SQL based back-end
 within the M/R code.
  But think of what happens when a few hundred or thousands of M/R tasks do
 that concurrently.
  Your back-end is going to cry. (if it can - before it dies)

  So IMO, prepping your M/R job with DistributedCache files (pulling them down
 first) is a better approach.

  Also, MPI is pretty much out of the question (not baked into the framework).
  You'll likely have to roll your own.  (And try to trick the JobTracker into
 not starting the same task.)

  Anyone has a better solution for Geoffry?



 -Original Message-
 From: Geoffry Roberts [mailto:geoffry.robe...@gmail.com]
 Sent: Friday, March 02, 2012 9:42 AM
 To: common-user@hadoop.apache.org
 Subject: Re: Hadoop and Hibernate

 This is a tardy response.  I'm spread pretty thinly right now.

 DistributedCache
 (http://hadoop.apache.org/common/docs/r1.0.0/mapred_tutorial.html#DistributedCache)
 is apparently deprecated.  Is there a replacement?  I didn't see anything
 about this in the documentation, but then I am still using 0.21.0.  I have
 to for performance reasons.  1.0.1 is too slow and the client won't have it.

 Also, the DistributedCache
 (http://hadoop.apache.org/common/docs/r1.0.0/mapred_tutorial.html#DistributedCache)
 approach seems only to work from within a Hadoop job, i.e. from within a Mapper or
 a Reducer, but not from within a Driver.  I have libraries that I must
 access from both places.  I take it that I am stuck keeping two copies
 of these libraries in sync--Correct?  It's either that, or copy them into
 hdfs, replacing them all at the beginning of each job run.

 Looking for best practices.

 Thanks

 On 28 February 2012 10:17, Owen O'Malley omal...@apache.org wrote:

  On Tue, Feb 28, 2012 at 5:15 PM, Geoffry Roberts
  geoffry.robe...@gmail.com wrote:
 
   If I create an executable jar file that contains all dependencies
  required
   by the MR job do all said dependencies get distributed to all nodes?
 
  You can make a single jar and that will be distributed to all of the
  machines that run the task, but it is better in most cases to use the
  distributed cache.
 
  See
  http://hadoop.apache.org/common/docs/r1.0.0/mapred_tutorial.html#DistributedCache
 
   If I specify but one reducer, which node in the cluster will the
   reducer run on?
 
  The scheduling is done by the JobTracker and it isn't possible to
  control the location of the reducers.
 
  -- Owen
 



 --
 Geoffry Roberts




-- 
Geoffry Roberts


Re: Hadoop and Hibernate

2012-03-02 Thread Geoffry Roberts
Queries are nothing but inserts.  Create an object, populate it, persist
it.  If it worked, life would be good right now.

I've considered JDBC and may yet take that approach.
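
If I do, I picture something along these lines: plain JDBC in the reducer, no
Hibernate at all.  A sketch only: the connection string, table, and columns
are invented, and the batching and commit handling would need more care than
shown here.

import java.io.IOException;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class JdbcInsertReducer
    extends Reducer<Text, Text, NullWritable, NullWritable> {
  private Connection conn;
  private PreparedStatement insert;

  @Override
  protected void setup(Context ctx) throws IOException {
    try {
      // Invented connection details; one connection per reducer task.
      conn = DriverManager.getConnection(
          "jdbc:postgresql://dbhost/epi", "user", "password");
      conn.setAutoCommit(false);
      insert = conn.prepareStatement(
          "INSERT INTO epi_state (id, state) VALUES (?, ?)");
    } catch (Exception e) {
      throw new IOException(e);
    }
  }

  @Override
  protected void reduce(Text key, Iterable<Text> values, Context ctx)
      throws IOException, InterruptedException {
    try {
      for (Text v : values) {
        insert.setString(1, key.toString());
        insert.setString(2, v.toString());
        insert.addBatch();
      }
      insert.executeBatch();
    } catch (Exception e) {
      throw new IOException(e);
    }
  }

  @Override
  protected void cleanup(Context ctx) throws IOException {
    try {
      conn.commit();
      conn.close();
    } catch (Exception e) {
      throw new IOException(e);
    }
  }
}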

re: Hibernate outside of Spring -- I'm getting tired already.

Interesting thing:  I use EMF (Eclipse Modelling Framework).  The
supporting jar files for emf and ecore are built into the job.  They are
being found by the Driver(s) and the MR(s) no problemo.  If these work, why
not the hibernate stuff?  Mystery!

On 2 March 2012 10:50, Tarjei Huse tar...@scanmine.com wrote:

 On 03/02/2012 07:31 PM, Geoffry Roberts wrote:
  No, I am using 0.21.0 for better performance.  I am interested in
  DistributedCache so certain libraries can be found during MR processing.
  As it is now, I'm getting a ClassNotFoundException thrown by the
  Reducers.  The Driver throws no error; the Reducer(s) do.  It would
 seem
  something is not being distributed across the cluster as I assumed it
  would.  After all, the whole business is in a single, executable jar
 file.

 How complex are the queries you are doing?

 Have you considered one of the following:

 1) Use plain jdbc instead of integrating Hibernate into Hadoop.
 2) Create a local version of the db that can be in the Distributed Cache.

 I tried using Hibernate with Hadoop (the queries were not a major part of
 the job) but I ran up against so many issues trying
 to get Hibernate to start up within the MR job that I ended up just
 exporting the tables, loading them into memory, and doing queries against
 them with basic HashMap lookups.
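
 Roughly like this, if it helps.  A sketch only: the cached file name, the
 tab-separated layout, and the column meanings are invented for illustration.

 import java.io.BufferedReader;
 import java.io.FileReader;
 import java.io.IOException;
 import java.util.HashMap;
 import java.util.Map;
 import org.apache.hadoop.filecache.DistributedCache;
 import org.apache.hadoop.fs.Path;
 import org.apache.hadoop.io.Text;
 import org.apache.hadoop.mapreduce.Mapper;

 public class LookupMapper extends Mapper<Object, Text, Text, Text> {
   private final Map<String, String> lookup = new HashMap<String, String>();

   @Override
   protected void setup(Context ctx) throws IOException {
     // Cached files are materialized on each task node's local disk.
     Path[] cached = DistributedCache.getLocalCacheFiles(ctx.getConfiguration());
     BufferedReader in = new BufferedReader(new FileReader(cached[0].toString()));
     String line;
     while ((line = in.readLine()) != null) {
       String[] cols = line.split("\t");
       lookup.put(cols[0], cols[1]); // key column -> value column
     }
     in.close();
   }

   @Override
   protected void map(Object key, Text value, Context ctx)
       throws IOException, InterruptedException {
     // The old per-record query becomes an in-memory lookup.
     String joined = lookup.get(value.toString());
     if (joined != null) {
       ctx.write(value, new Text(joined));
     }
   }
 }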

 My best advice is that, if you can, you should abstract Hibernate away from
 the job and use something closer to the metal, like JDBC, or just dump the
 data to files.  Getting Hibernate to run outside of Spring and friends can
 quickly grow tiresome.

 T
 
  On 2 March 2012 09:46, Kunaal kunaalbha...@gmail.com wrote:
 
  Are you looking to use DistributedCache for better performance?
 
  On Fri, Mar 2, 2012 at 9:42 AM, Geoffry Roberts
  geoffry.robe...@gmail.comwrote:
 
  This is a tardy response.  I'm spread pretty thinly right now.
 
  DistributedCache
  (http://hadoop.apache.org/common/docs/r1.0.0/mapred_tutorial.html#DistributedCache)
  is apparently deprecated.  Is there a replacement?  I didn't see anything
  about this in the documentation, but then I am still using 0.21.0.  I have
  to for performance reasons.  1.0.1 is too slow and the client won't have
  it.
 
  Also, the DistributedCache
  (http://hadoop.apache.org/common/docs/r1.0.0/mapred_tutorial.html#DistributedCache)
  approach seems only to work from within a Hadoop job, i.e. from within a
  Mapper or a Reducer, but not from within a Driver.  I have libraries that I
  must access from both places.  I take it that I am stuck keeping two
  copies of these libraries in sync--Correct?  It's either that, or copy
  them into hdfs, replacing them all at the beginning of each job run.
 
  Looking for best practices.
 
  Thanks
 
  On 28 February 2012 10:17, Owen O'Malley omal...@apache.org wrote:
 
  On Tue, Feb 28, 2012 at 5:15 PM, Geoffry Roberts
  geoffry.robe...@gmail.com wrote:
 
  If I create an executable jar file that contains all dependencies
  required
  by the MR job do all said dependencies get distributed to all nodes?

  You can make a single jar and that will be distributed to all of the
  machines that run the task, but it is better in most cases to use the
  distributed cache.
 
  See
 
 
 http://hadoop.apache.org/common/docs/r1.0.0/mapred_tutorial.html#DistributedCache

  If I specify but one reducer, which node in the cluster will the
  reducer
  run on?

  The scheduling is done by the JobTracker and it isn't possible to
  control the location of the reducers.
 
  -- Owen
 
 
 
  --
  Geoffry Roberts
 
 
 
  --
  What we are is the universe's gift to us.
  What we become is our gift to the universe.
 
 
 


 --
 Regards / Med vennlig hilsen
 Tarjei Huse
 Mobil: 920 63 413




-- 
Geoffry Roberts


Hadoop and Hibernate

2012-02-28 Thread Geoffry Roberts
All,

I am trying to use Hibernate within my reducer and it goeth not well.  Has
anybody ever successfully done this?

I have a Java package that contains my Hadoop driver, mapper, and reducer
along with a persistence class.  I call Hibernate from the cleanup() method
in my reducer class.  It complains that it cannot find the persistence
class.  The class is in the same package as the reducer and this all would
work outside of Hadoop. The error is thrown when I attempt to begin a
transaction.

The error:

org.hibernate.MappingException: Unknown entity: qq.mob.depart.EpiState

The code:

protected void cleanup(Context ctx) throws IOException,
    InterruptedException {
  ...
  org.hibernate.cfg.Configuration cfg = new org.hibernate.cfg.Configuration();
  SessionFactory sessionFactory =
      cfg.configure("hibernate.cfg.xml").buildSessionFactory();
  cfg.addAnnotatedClass(EpiState.class); // This class is in the same
                                         // package as the reducer.
  Session session = sessionFactory.openSession();
  Transaction tx = session.getTransaction();
  tx.begin(); // Error is thrown here.
  ...
}

If I create an executable jar file that contains all dependencies required
by the MR job do all said dependencies get distributed to all nodes?

If I specify but one reducer, which node in the cluster will the reducer
run on?

Thanks
-- 
Geoffry Roberts


Runtime Comparison of Hadoop 0.21.0 and 1.0.1

2012-02-22 Thread Geoffry Roberts
All,

I saw the announcement that the hadoop 1.0.1 micro release was available.  I
have been waiting for this because I need the MultipleOutputs capability,
which 1.0.0 didn't support.  I grabbed a copy of the release candidate.  I
was happy to see that the directory structure once again conforms (mostly)
to the older releases as opposed to what was in the 1.0.0 release.

I did a comparison of run times between 1.0.1 and 0.21.0, which is my
production version.  It seems that 1.0.1 runs roughly 3.4 times slower than
0.21.0.

With the same code, same hardware, same configuration, and the same data
set; end to end times are:

0.21.0 =   8.83 minutes.
1.0.1   = 30.26 minutes.

Is this a known condition?

Thanks

-- 
Geoffry Roberts


Re: Same Hadoop, Same MR Job, Different cluster

2012-02-09 Thread Geoffry Roberts
All,

This a follow up to my last post.

Turns out there was yet another Hadoop cluster available, running Ubuntu
10.10 on hardware identical to what I was using in my last post.  The
difference is between the two versions of Ubuntu: 10.10 vs. 11.10.  All else
is equal.

Things work on 10.10.  Things do not work on 11.10.

I can't explain.  I'm not going to try.  I'm rolling back to 10.10 for my
project.

On 8 February 2012 13:45, Geoffry Roberts geoffry.robe...@gmail.com wrote:

 All,

 I am setting up a new cluster on some new machines and am experiencing
 errors that did not happen before.  I am using the same Hadoop-1.0 with the
 same configuration, and the same jvm on both machines.  The thing that is
 different is the OS. The old cluster was Ubuntu 10.10 and the new is Ubuntu
 11.10.

 Since the OS seems to be the difference, are there any issues with
 Hadoop-1.0 and Ubuntu 11.10?

 Also, I remember at one time Hadoop having troubles with ipv6.  Do I still
 need to disable it?

 The same job runs on the old but not on the new.

 Error message is below.

 Thanks

 12/02/08 13:29:39 INFO mapred.JobClient:  map 0% reduce 0%
 12/02/08 13:29:46 INFO mapred.JobClient: Task Id :
 attempt_201202081216_0003_m_99_0, Status : FAILED
 java.lang.Throwable: Child Error
 at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:271)
 Caused by: java.io.IOException: Task process exit with nonzero status of 1.
 at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:258)

 12/02/08 13:29:46 WARN mapred.JobClient: Error reading task
 output http://q001q:50060/tasklog?plaintext=true&attemptid=attempt_201202081216_0003_m_99_0&filter=stdout
 12/02/08 13:29:46 WARN mapred.JobClient: Error reading task
 output http://q001q:50060/tasklog?plaintext=true&attemptid=attempt_201202081216_0003_m_99_0&filter=stderr
 12/02/08 13:29:52 INFO mapred.JobClient: Task Id :
 attempt_201202081216_0003_r_11_0, Status : FAILED
 java.lang.Throwable: Child Error
 at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:271)
 Caused by: java.io.IOException: Task process exit with nonzero status of 1.
 at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:258)


 --
 Geoffry Roberts




-- 
Geoffry Roberts


Same Hadoop, Same MR Job, Different cluster

2012-02-08 Thread Geoffry Roberts
All,

I am setting up a new cluster on some new machines and am experiencing
errors that did not happen before.  I am using the same Hadoop-1.0 with the
same configuration, and the same jvm on both machines.  The thing that is
different is the OS. The old cluster was Ubuntu 10.10 and the new is Ubuntu
11.10.

Since the OS seems to be the difference, are there any issues with
Hadoop-1.0 and Ubuntu 11.10?

Also, I remember at one time Hadoop having troubles with ipv6.  Do I still
need to disable it?

The same job runs on the old but not on the new.

Error message is below.

Thanks

12/02/08 13:29:39 INFO mapred.JobClient:  map 0% reduce 0%
12/02/08 13:29:46 INFO mapred.JobClient: Task Id :
attempt_201202081216_0003_m_99_0, Status : FAILED
java.lang.Throwable: Child Error
at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:271)
Caused by: java.io.IOException: Task process exit with nonzero status of 1.
at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:258)

12/02/08 13:29:46 WARN mapred.JobClient: Error reading task
output http://q001q:50060/tasklog?plaintext=true&attemptid=attempt_201202081216_0003_m_99_0&filter=stdout
12/02/08 13:29:46 WARN mapred.JobClient: Error reading task
output http://q001q:50060/tasklog?plaintext=true&attemptid=attempt_201202081216_0003_m_99_0&filter=stderr
12/02/08 13:29:52 INFO mapred.JobClient: Task Id :
attempt_201202081216_0003_r_11_0, Status : FAILED
java.lang.Throwable: Child Error
at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:271)
Caused by: java.io.IOException: Task process exit with nonzero status of 1.
at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:258)


-- 
Geoffry Roberts


Seeking Advice on Upgrading a Cluster

2011-04-22 Thread Geoffry Roberts
All,

 I am a developer, not a super networking guy or hardware guy, and new to
Hadoop.

I'm working on a research project, and funds are limited.  I have a compute
problem where I need to improve performance when processing large text files,
and no doubt Hadoop can help if I do things well.

I am cobbling my cluster together, to the greatest extent possible, out of
spare parts.  I can spend some money, but must do so with deliberation and
prudence.

I have at my disposal twelve, one time desk top computers:

   - Pentium 4 3.80GHz
   - 2-4G of memory
   - 1 Gigabit NIC
   - 1 Disk, Serial ATA/150 7,200 RPM

I have installed:

   - Ubuntu 10.10 /64 server
   - JDK /64
   - Hadoop 0.21.0

Processing is still slow.  I am tuning Hadoop, but I'm guessing I should
also upgrade my hardware.

What will give me the most bang for my buck?

   - Should I bring all machines up to 8G of memory? or is 4G good enough?
   (8 is the max.)
   - Should I double up the NICs and use LACP?
   - Should I double up the disks and attempt to flow my I/O from one disk
   to another, on the theory that this will minimize contention?
   - Should I get another switch?  (I have a 10/100, 24 port Dlink and it's
   about 5 years old.)

Thanks in advance
-- 
Geoffry Roberts