Re: Adding $CLASSPATH to Map/Reduce tasks

2008-09-26 Thread Samuel Guo
maybe you can use
bin/hadoop jar -libjars ${your-depends-jars} your.mapred.jar args

see details:
http://hadoop.apache.org/core/docs/r0.18.1/api/org/apache/hadoop/mapred/JobShell.html

 On Thu, Sep 25, 2008 at 12:26 PM, David Hall [EMAIL PROTECTED] wrote:

 On Sun, Sep 21, 2008 at 9:41 PM, David Hall [EMAIL PROTECTED]
 wrote:
  On Sun, Sep 21, 2008 at 9:35 PM, Arun C Murthy [EMAIL PROTECTED]
 wrote:
 
  On Sep 21, 2008, at 2:05 PM, David Hall wrote:
 
  (New to this list)
 
  Hi,
 
  My research group is setting up a small (20-node) cluster. All of
  these machines are linked by NFS. We have a fairly entrenched
  codebase/development cycle, and in particular we'd like to be able to
  access user $CLASSPATHs in the forked JVMs run by the Map and Reduce
  tasks. However, TaskRunner.java (http://tinyurl.com/4enkg4) seems to
  disallow this by specifying its own.
 
 
  Using jars on NFS might hurt if you have thousands of tasks, each
  pulling the jars over the network, causing too much load.
 
  The better solution might be to use the DistributedCache:
 
 http://hadoop.apache.org/core/docs/current/mapred_tutorial.html#DistributedCache
 
  Specifically:
 
 http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/filecache/DistributedCache.html#addArchiveToClassPath(org.apache.hadoop.fs.Path,%20org.apache.hadoop.conf.Configuration)
 
 http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/filecache/DistributedCache.html#addFileToClassPath(org.apache.hadoop.fs.Path,%20org.apache.hadoop.conf.Configuration)
 
  Arun
 
  Good point. I hadn't thought of that, but at the moment we're dealing
  with barrier-to-adoption rather than efficiency. We'll have to go back
  to PBS if we can't get users (read: picky PhD students) on board. I'd
  rather avoid that scenario...
 
  In the meantime, I think I figured out a hack that I'm going to try.

 In case anyone's curious, the hack is to create a jar file with a
 manifest that has the Class-Path field set to all the directories and
 jars you want, and to put that in the lib/ folder of another jar, and
 pass that final jar in as the User Jar to a job.

 Works like a charm. :-)
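
[Editor's note: the manifest trick described above can be sketched with only the standard java.util.jar API. File names and classpath entries below are illustrative, not from the original thread.]

```java
import java.io.ByteArrayOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.jar.Attributes;
import java.util.jar.JarEntry;
import java.util.jar.JarOutputStream;
import java.util.jar.Manifest;

public class ClasspathJarHack {
    // Build an inner jar whose manifest Class-Path names the directories
    // and jars we want on the task classpath, then wrap it in an outer
    // jar under lib/ so the task runner picks it up with the job jar.
    public static void main(String[] args) throws IOException {
        // 1. Inner jar: carries nothing but a manifest with a Class-Path.
        Manifest inner = new Manifest();
        inner.getMainAttributes().put(Attributes.Name.MANIFEST_VERSION, "1.0");
        inner.getMainAttributes().put(Attributes.Name.CLASS_PATH,
                "/nfs/shared/classes/ /nfs/shared/lib/lucene-core.jar");
        ByteArrayOutputStream innerBytes = new ByteArrayOutputStream();
        new JarOutputStream(innerBytes, inner).close();

        // 2. Outer jar: place the inner jar in lib/, where Hadoop looks
        //    for a job's bundled dependencies.
        try (JarOutputStream outer = new JarOutputStream(
                new FileOutputStream("job-with-classpath.jar"))) {
            outer.putNextEntry(new JarEntry("lib/classpath-hack.jar"));
            outer.write(innerBytes.toByteArray());
            outer.closeEntry();
        }
    }
}
```

The resulting jar is then submitted as the job's user jar, e.g. bin/hadoop jar job-with-classpath.jar your.MainClass args.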

 -- David

 
  Thanks!
 
  -- David
 
 
  Is there any easy way to trick Hadoop into making these visible? If
  not, if I were to submit a patch that would (optionally) add
  $CLASSPATH to the forked JVMs' classpath, would it be considered?
 
  Thanks,
  David Hall
 
 
 



Re: Adding $CLASSPATH to Map/Reduce tasks

2008-09-26 Thread Joe Shaw
Hi,

On Fri, Sep 26, 2008 at 10:50 AM, Samuel Guo [EMAIL PROTECTED] wrote:
 maybe you can use
 bin/hadoop jar -libjars ${your-depends-jars} your.mapred.jar args

 see details:
 http://hadoop.apache.org/core/docs/r0.18.1/api/org/apache/hadoop/mapred/JobShell.html

Indeed, I was having the same issue trying to get a Lucene jar file
into a running task.  Despite what the docs say, -libjars works with
the "jar" option to the hadoop command.  (The docs I read said it only
worked with "job" and a couple of other commands; unfortunately I don't
have a link to that page at the moment.)

Joe


Re: Adding $CLASSPATH to Map/Reduce tasks

2008-09-26 Thread David Hall
On Fri, Sep 26, 2008 at 7:50 AM, Samuel Guo [EMAIL PROTECTED] wrote:
 maybe you can use
 bin/hadoop jar -libjars ${your-depends-jars} your.mapred.jar args

 see details:
 http://hadoop.apache.org/core/docs/r0.18.1/api/org/apache/hadoop/mapred/JobShell.html

Most of our classes aren't packaged in jars. I suppose it wouldn't be
too bad to tell ant to jar them up, but with the hack, it's easy enough
not to bother.

-- David





Adding $CLASSPATH to Map/Reduce tasks

2008-09-21 Thread David Hall
(New to this list)

Hi,

My research group is setting up a small (20-node) cluster. All of
these machines are linked by NFS. We have a fairly entrenched
codebase/development cycle, and in particular we'd like to be able to
access user $CLASSPATHs in the forked JVMs run by the Map and Reduce
tasks. However, TaskRunner.java (http://tinyurl.com/4enkg4) seems to
disallow this by specifying its own.

Is there any easy way to trick Hadoop into making these visible? If
not, if I were to submit a patch that would (optionally) add
$CLASSPATH to the forked JVMs' classpath, would it be considered?
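
[Editor's note: to make the proposal concrete, here is a rough sketch of the optional behaviour being asked for. The class and names are hypothetical, not from TaskRunner.java.]

```java
import java.io.File;
import java.util.Arrays;
import java.util.List;

public class ForkWithUserClasspath {
    // Append the user's $CLASSPATH to the classpath the task runner
    // builds, rather than replacing it, when the option is enabled.
    static String combine(String taskClasspath, String userClasspath) {
        if (userClasspath == null || userClasspath.isEmpty()) {
            return taskClasspath;
        }
        return taskClasspath + File.pathSeparator + userClasspath;
    }

    public static void main(String[] args) {
        String combined = combine("build/classes", System.getenv("CLASSPATH"));
        // The forked child JVM would then be launched with this classpath.
        List<String> cmd = Arrays.asList(
                "java", "-classpath", combined, "org.example.ChildTask");
        System.out.println(String.join(" ", cmd));
    }
}
```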

Thanks,
David Hall


Re: Adding $CLASSPATH to Map/Reduce tasks

2008-09-21 Thread Arun C Murthy


On Sep 21, 2008, at 2:05 PM, David Hall wrote:


(New to this list)

Hi,

My research group is setting up a small (20-node) cluster. All of
these machines are linked by NFS. We have a fairly entrenched
codebase/development cycle, and in particular we'd like to be able to
access user $CLASSPATHs in the forked JVMs run by the Map and Reduce
tasks. However, TaskRunner.java (http://tinyurl.com/4enkg4) seems to
disallow this by specifying its own.



Using jars on NFS might hurt if you have thousands of tasks, each
pulling the jars over the network, causing too much load.


The better solution might be to use the DistributedCache:
http://hadoop.apache.org/core/docs/current/mapred_tutorial.html#DistributedCache

Specifically:
http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/filecache/DistributedCache.html#addArchiveToClassPath(org.apache.hadoop.fs.Path,%20org.apache.hadoop.conf.Configuration)
http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/filecache/DistributedCache.html#addFileToClassPath(org.apache.hadoop.fs.Path,%20org.apache.hadoop.conf.Configuration)
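
[Editor's note: using the two methods linked above from the 0.18-era mapred API might look roughly like this; this fragment requires a Hadoop cluster and the HDFS paths are illustrative.]

```java
JobConf conf = new JobConf(MyJob.class);
// The dependency must already be in HDFS; DistributedCache then ships
// it to each node and adds it to the task classpath.
DistributedCache.addFileToClassPath(
        new Path("/deps/lucene-core.jar"), conf);   // a single jar
DistributedCache.addArchiveToClassPath(
        new Path("/deps/all-libs.zip"), conf);      // an unpacked archive
JobClient.runJob(conf);
```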

Arun


Is there any easy way to trick Hadoop into making these visible? If
not, if I were to submit a patch that would (optionally) add
$CLASSPATH to the forked JVMs' classpath, would it be considered?

Thanks,
David Hall