Re: C++ pipes on full (nonpseudo) cluster

2010-03-31 Thread Gianluigi Zanetti
Could you please state exactly the steps you do in setting up the pipes
run?

The two critical things to watch are:

a/ where did you load the executable on hdfs e.g.,
$ ls pipes-bin
genreads
pair-reads
seqal

$ hadoop dfs -put pipes-bin pipes-bin

$ hadoop dfs -ls hdfs://host:53897/user/zag/pipes-bin
Found 3 items
-rw-r--r--   3 zag supergroup        480 2010-03-17 12:28 /user/zag/pipes-bin/genreads
-rw-r--r--   3 zag supergroup        692 2010-03-17 15:02 /user/zag/pipes-bin/pair_reads
-rw-r--r--   3 zag supergroup        477 2010-03-17 12:33 /user/zag/pipes-bin/seqal
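
If the uploaded binaries still need the executable bit set on HDFS, it can be adjusted in place; a small sketch (the mode and path are illustrative):

$ hadoop dfs -chmod 755 pipes-bin/genreads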

b/ how you started the program

$ hadoop pipes -D hadoop.pipes.executable=pipes-bin/genreads -D 
hadoop.pipes.java.recordreader=true -D hadoop.pipes.java.recordwriter=true 
-input input  -output output
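
An equivalent way to launch the same job, assuming the same pipes-bin layout as above, is to point -program at the HDFS path instead of setting hadoop.pipes.executable:

$ hadoop pipes -D hadoop.pipes.java.recordreader=true \
    -D hadoop.pipes.java.recordwriter=true \
    -program pipes-bin/genreads -input input -output output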


Hope this is clear enough.


--gianluigi

On Wed, 2010-03-31 at 06:57 -0700, Keith Wiley wrote:
> On 2010, Mar 31, at 4:25 AM, Gianluigi Zanetti wrote:
> 
> > What happens if you try this:
> >
> > $ hadoop fs -rmr HDFSPATH/output ; hadoop pipes -D  
> > hadoop.pipes.executable=EXECUTABLE -D  
> > hadoop.pipes.java.recordreader=true -D  
> > hadoop.pipes.java.recordwriter=true -input HDFSPATH/input -output  
> > HDFSPATH/output
> 
> 
> Not good news.  This is what I got:
> 
> $ hadoop pipes -D hadoop.pipes.executable=/Users/keithwiley/Astro_LSST/hadoop-0.20.1+152/Mosaic/clue/Mosaic/src/cpp/Mosaic -D hadoop.pipes.java.recordreader=true -D hadoop.pipes.java.recordwriter=true -input /uwphysics/kwiley/mosaic/input -output /uwphysics/kwiley/mosaic/output
> Exception in thread "main" java.io.FileNotFoundException: File does not exist: /Users/keithwiley/Astro_LSST/hadoop-0.20.1+152/Mosaic/clue/Mosaic/src/cpp/Mosaic
>   at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:457)
>   at org.apache.hadoop.filecache.DistributedCache.getTimestamp(DistributedCache.java:509)
>   at org.apache.hadoop.mapred.JobClient.configureCommandLineOptions(JobClient.java:681)
>   at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:802)
>   at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:771)
>   at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1290)
>   at org.apache.hadoop.mapred.pipes.Submitter.runJob(Submitter.java:248)
>   at org.apache.hadoop.mapred.pipes.Submitter.run(Submitter.java:479)
>   at org.apache.hadoop.mapred.pipes.Submitter.main(Submitter.java:494)
> 
> Incidentally, just in case you're wondering:
> $ ls -l /Users/keithwiley/Astro_LSST/hadoop-0.20.1+152/Mosaic/clue/Mosaic/src/cpp/
> total 800
> 368 -rwxr-xr-x  1 keithwiley  keithwiley  185184 Mar 29 19:08 Mosaic*
> ...other files...
> 
> The path is obviously correct on my local machine.  The only  
> explanation is that Hadoop is looking for it on HDFS under that path.
> 
> I'm desperate.  I don't understand why I'm the only person who can get  
> this working.  Could you please describe to me the set of commands you  
> use to run a pipes program on a fully distributed cluster?
> 
> 
> Keith Wiley kwi...@keithwiley.com keithwiley.com 
> music.keithwiley.com
> 
> "The easy confidence with which I know another man's religion is folly  
> teaches
> me to suspect that my own is also."
> --  Mark Twain
> 
> 


Re: C++ pipes on full (nonpseudo) cluster

2010-03-31 Thread Gianluigi Zanetti
What happens if you try this:

$ hadoop fs -rmr HDFSPATH/output ; hadoop pipes -D 
hadoop.pipes.executable=EXECUTABLE -D hadoop.pipes.java.recordreader=true -D 
hadoop.pipes.java.recordwriter=true -input HDFSPATH/input -output 
HDFSPATH/output
> Deleted hdfs://mainclusternn.hipods.ihost.com/HDFSPATH/output

On Tue, 2010-03-30 at 15:05 -0700, Keith Wiley wrote:
> $ hadoop fs -rmr HDFSPATH/output ; hadoop pipes -D 
> hadoop.pipes.java.recordreader=true -D hadoop.pipes.java.recordwriter=true 
> -input HDFSPATH/input -output HDFSPATH/output -program HDFSPATH/EXECUTABLE
> Deleted hdfs://mainclusternn.hipods.ihost.com/HDFSPATH/output
> 10/03/30 14:56:55 WARN mapred.JobClient: No job jar file set.  User classes 
> may not be found. See JobConf(Class) or JobConf#setJar(String).
> 10/03/30 14:56:55 INFO mapred.FileInputFormat: Total input paths to process : 
> 1
> 10/03/30 14:57:05 INFO mapred.JobClient: Running job: job_201003241650_1076
> 10/03/30 14:57:06 INFO mapred.JobClient:  map 0% reduce 0%
> ^C
> $
> 
> At that point the terminal hung, so I eventually ctrl-Ced to break it.  Now 
> if I investigate the Hadoop task logs for the mapper, I see this:
> 
> stderr logs
> bash: /data/disk2/hadoop/mapred/local/taskTracker/archive/mainclusternn.hipods.ihost.com/uwphysics/kwiley/mosaic/c++_bin/Mosaic/Mosaic: cannot execute binary file
> 
> ...which makes perfect sense in light of the following:
> 
> $ hd fs -ls /uwphysics/kwiley/mosaic/c++_bin
> Found 1 items
> -rw-r--r--   1 kwiley uwphysics 211808 2010-03-30 10:26 
> /uwphysics/kwiley/mosaic/c++_bin/Mosaic
> $ hd fs -chmod 755 /uwphysics/kwiley/mosaic/c++_bin/Mosaic
> $ hd fs -ls /uwphysics/kwiley/mosaic/c++_bin
> Found 1 items
> -rw-r--r--   1 kwiley uwphysics 211808 2010-03-30 10:26 
> /uwphysics/kwiley/mosaic/c++_bin/Mosaic
> $
> 
> Note that this is all an attempt to run an executable that was uploaded to 
> HDFS in advance.  In this example I am not attempting to run an executable 
> stored on my local machine.  Any attempt to do that results in a file not 
> found error:
> 
> $ hadoop fs -rmr HDFSPATH/output ; hadoop pipes -D 
> hadoop.pipes.java.recordreader=true -D hadoop.pipes.java.recordwriter=true 
> -input HDFSPATH/input -output HDFSPATH/output -program LOCALPATH/EXECUTABLE
> Deleted hdfs://mainclusternn.hipods.ihost.com/uwphysics/kwiley/mosaic/output
> Exception in thread "main" java.io.FileNotFoundException: File does not 
> exist: /Users/kwiley/hadoop-0.20.1+152/Mosaic/clue/Mosaic/src/cpp/Mosaic
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:457)
>   at 
> org.apache.hadoop.filecache.DistributedCache.getTimestamp(DistributedCache.java:509)
>   at 
> org.apache.hadoop.mapred.JobClient.configureCommandLineOptions(JobClient.java:681)
>   at 
> org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:802)
>   at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:771)
>   at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1290)
>   at org.apache.hadoop.mapred.pipes.Submitter.runJob(Submitter.java:248)
>   at org.apache.hadoop.mapred.pipes.Submitter.run(Submitter.java:479)
>   at org.apache.hadoop.mapred.pipes.Submitter.main(Submitter.java:494)
> $
> 
> It's clearly looking for the executable in HDFS, not on the local system, thus 
> the file not found error.
> 
> 
> Keith Wiley   kwi...@keithwiley.com   
> www.keithwiley.com
> 
> "What I primarily learned in grad school is how much I *don't* know.
> Consequently, I left grad school with a higher ignorance to knowledge ratio 
> than
> when I entered."
>   -- Keith Wiley
> 
> 
> 
> 
> 


Re: C++ pipes on full (nonpseudo) cluster

2010-03-30 Thread Gianluigi Zanetti
What are the symptoms? 
Pipes should run out of the box in a standard installation.
BTW what version of bash are you using? Is it bash 4.0 by any chance?
See https://issues.apache.org/jira/browse/HADOOP-6388

--gianluigi


On Tue, 2010-03-30 at 14:13 -0700, Keith Wiley wrote:
> My cluster admin noticed that there is some additional pipes package he could 
> add to the cluster configuration, but he admits to knowing very little about 
> how the C++ pipes component of Hadoop works.
> 
> Can you offer any insight into this cluster configuration package?  What 
> exactly does it do that makes a cluster capable of running pipes programs 
> (and what symptom should its absence present from a user's point of view)?
> 
> On Mar 30, 2010, at 13:43 , Gianluigi Zanetti wrote:
> 
> > Hello.
> > Did you try following the tutorial in 
> > http://wiki.apache.org/hadoop/C++WordCount ?
> > 
> > We use C++ pipes in production on a large cluster, and it works.
> > 
> > --gianluigi
> 
> 
> 
> Keith Wiley   kwi...@keithwiley.com   
> www.keithwiley.com
> 
> "I do not feel obliged to believe that the same God who has endowed us with
> sense, reason, and intellect has intended us to forgo their use."
>   -- Galileo Galilei
> 
> 
> 
> 


Re: C++ pipes on full (nonpseudo) cluster

2010-03-30 Thread Gianluigi Zanetti
Hello.
Did you try following the tutorial in 
http://wiki.apache.org/hadoop/C++WordCount ?
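
For reference, building a pipes binary against the C++ libraries shipped in the 0.20 tarball looks roughly like this (the platform directory and source file name are illustrative; note that the binary must be built for the OS and architecture of the compute nodes, not of the submitting machine):

$ g++ -O2 wordcount.cpp -o wordcount \
    -I${HADOOP_HOME}/c++/Linux-amd64-64/include \
    -L${HADOOP_HOME}/c++/Linux-amd64-64/lib \
    -lhadooppipes -lhadooputils -lpthread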

We use C++ pipes in production on a large cluster, and it works.

--gianluigi


On Tue, 2010-03-30 at 13:28 -0700, Keith Wiley wrote:
> No responses yet, although I admit it's only been a few hours.
> 
> As a follow-up, permit me to pose the following question:
> 
> Is it, in fact, impossible to run C++ pipes on a fully-distributed system (as 
> opposed to a pseudo-distributed system)?  I haven't found any definitive 
> clarification on this topic one way or the other.  The only statement that I 
> found in the least bit illuminating is in the O'Reilly book (not official 
> Hadoop documentation mind you), p.38, which states:
> 
> "To run a Pipes job, we need to run Hadoop in pseudo-distributed mode...Pipes 
> doesn't run in standalone (local) mode, since it relies on Hadoop's 
> distributed cache mechanism, which works only when HDFS is running."
> 
> The phrasing of those statements is a little unclear in that the distinction 
> being made appears to be between standalone and pseudo-distributed mode, 
> without any specific reference to fully-distributed mode.  Namely, the 
> section that qualifies the need for pseudo-distributed mode (the need for 
> HDFS) would obviously also apply to full distributed mode despite the lack of 
> mention of fully distributed mode in the quoted section.  So can pipes run in 
> fully distributed mode or not?
> 
> Bottom line, I can't get C++ pipes to work on a fully distributed cluster yet 
> and I don't know if I am wasting my time, if this is a truly impossible 
> effort or if it can be done and I simply haven't figured out how to do it yet.
> 
> Thanks for any help.
> 
> 
> Keith Wiley   kwi...@keithwiley.com   
> www.keithwiley.com
> 
> "The easy confidence with which I know another man's religion is folly teaches
> me to suspect that my own is also."
>   -- Mark Twain
> 
> 
> 
> 


Re: sun grid engine and map reduce

2009-12-10 Thread Gianluigi Zanetti
Hello Himanshu.
Could you please describe in more detail your use case?

There are two basic Gridengine integration schemes:

1/ Native integration in grid engine
This is the one referred to in Dan Templeton's blog. It is based on the
assumption that hdfs is always running, and it will essentially have
your map-reduce job (including a per-job jobtracker) scheduled as a
parallel environment 'as-close-as-possible' to your hdfs data. 
It will be in GE 6.2u5, which is currently in beta and should be out any
moment now. It is possible to back-port to 6.2u4 and probably to 6.2u3.
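
Submission in that scheme then looks like any other parallel-environment job; a sketch, assuming a PE named 'hadoop' has been configured along the lines of Dan Templeton's blog (PE name and slot count are illustrative):

$ qsub -pe hadoop 16 run_mapreduce_job.sh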

2/ HOD integration
What hod does, in a nutshell, is to allocate a group of machines as a
parallel environment within GE and to run a jobtracker and a namenode
that will control the allocated machines. Users then submit their
jobs to that jobtracker and use the HDFS controlled by that namenode.
The resulting Hadoop environment is transient, since as far as GE is
concerned it is simply a parallel job; what 'transient' means in
practice depends on how you set up your queues.
We have developed a patch to add Gridengine support to hadoop hod,
http://issues.apache.org/jira/browse/HADOOP-6369
This is pretty undemanding in terms of GE version, but it is not very
efficient HDFS-wise, since Gridengine is unaware of HDFS data locality. In
practice, either you ask hod to use an independent HDFS that is always
up -- but there is no guarantee that the tasktracker nodes will be close
to the data -- or you upload your data to a new hdfs that will be
created by hod.
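
In practice the hod side of 2/ looks roughly like this (the cluster directory, node count, and job jar are purely illustrative):

$ hod allocate -d ~/hod-clusters/test -n 8
$ hadoop --config ~/hod-clusters/test jar my-job.jar ...
$ hod deallocate -d ~/hod-clusters/test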

Thus, 1/ is definitely more efficient and 'cluster-wide' while 2/ is
more like a sort of cluster partitioning.



--gianluigi







On Wed, 2009-12-09 at 12:43 -0800, himanshu chandola wrote:
> Hi all,
> We are integrating the hadoop jobs with the sun grid engine. Most of
> the map reduce jobs that start on our cluster are sequential map and
> reduce. I also found integration guidelines
> here :http://blogs.sun.com/templedf/entry/beta_testing_the_sun_grid
> and http://blogs.sun.com/ravee/entry/creating_hadoop_pe_under_sge .
> 
> I wanted to know whether every sequential map-reduce job would be counted as 
> a separate job to sun sge. That's necessary because in total the sequential 
> map-reduce runs for days.
> 
> Thanks
> H
> 
>  Morpheus: Do you believe in fate, Neo?
> Neo: No.
> Morpheus: Why Not?
> Neo: Because I don't like the idea that I'm not in control of my life.
> 
> 