Re: C++ pipes on full (nonpseudo) cluster
Could you please state exactly the steps you take in setting up the pipes
run? The two critical things to watch are:

a/ where you loaded the executable on HDFS, e.g.:

$ ls pipes-bin
genreads  pair-reads  seqal
$ hadoop dfs -put pipes-bin pipes-bin
$ hadoop dfs -ls hdfs://host:53897/user/zag/pipes-bin
Found 3 items
-rw-r--r--   3 zag supergroup   480 2010-03-17 12:28 /user/zag/pipes-bin/genreads
-rw-r--r--   3 zag supergroup   692 2010-03-17 15:02 /user/zag/pipes-bin/pair_reads
-rw-r--r--   3 zag supergroup   477 2010-03-17 12:33 /user/zag/pipes-bin/seqal

b/ how you started the program:

$ hadoop pipes -D hadoop.pipes.executable=pipes-bin/genreads \
    -D hadoop.pipes.java.recordreader=true \
    -D hadoop.pipes.java.recordwriter=true \
    -input input -output output

Hope this is clear enough.

--gianluigi

On Wed, 2010-03-31 at 06:57 -0700, Keith Wiley wrote:
> On 2010, Mar 31, at 4:25 AM, Gianluigi Zanetti wrote:
>
> > What happens if you try this:
> >
> > $ hadoop fs -rmr HDFSPATH/output ; hadoop pipes \
> >     -D hadoop.pipes.executable=EXECUTABLE \
> >     -D hadoop.pipes.java.recordreader=true \
> >     -D hadoop.pipes.java.recordwriter=true \
> >     -input HDFSPATH/input -output HDFSPATH/output
>
> Not good news. This is what I got:
>
> $ hadoop pipes \
>     -D hadoop.pipes.executable=/Users/keithwiley/Astro_LSST/hadoop-0.20.1+152/Mosaic/clue/Mosaic/src/cpp/Mosaic \
>     -D hadoop.pipes.java.recordreader=true \
>     -D hadoop.pipes.java.recordwriter=true \
>     -input /uwphysics/kwiley/mosaic/input \
>     -output /uwphysics/kwiley/mosaic/output
> Exception in thread "main" java.io.FileNotFoundException: File does not exist: /Users/keithwiley/Astro_LSST/hadoop-0.20.1+152/Mosaic/clue/Mosaic/src/cpp/Mosaic
>     at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:457)
>     at org.apache.hadoop.filecache.DistributedCache.getTimestamp(DistributedCache.java:509)
>     at org.apache.hadoop.mapred.JobClient.configureCommandLineOptions(JobClient.java:681)
>     at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:802)
>     at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:771)
>     at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1290)
>     at org.apache.hadoop.mapred.pipes.Submitter.runJob(Submitter.java:248)
>     at org.apache.hadoop.mapred.pipes.Submitter.run(Submitter.java:479)
>     at org.apache.hadoop.mapred.pipes.Submitter.main(Submitter.java:494)
>
> Incidentally, just in case you're wondering:
> $ ls -l /Users/keithwiley/Astro_LSST/hadoop-0.20.1+152/Mosaic/clue/Mosaic/src/cpp/
> total 800
> 368 -rwxr-xr-x  1 keithwiley  keithwiley  185184 Mar 29 19:08 Mosaic*
> ...other files...
>
> The path is obviously correct on my local machine. The only explanation is
> that Hadoop is looking for it on HDFS under that path.
>
> I'm desperate. I don't understand why I'm the only person who can't get this
> working. Could you please describe to me the set of commands you use to run
> a pipes program on a fully distributed cluster?
>
> Keith Wiley     kwi...@keithwiley.com     keithwiley.com     music.keithwiley.com
>
> "The easy confidence with which I know another man's religion is folly teaches
> me to suspect that my own is also."
>    -- Mark Twain
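[For anyone retracing these steps, the whole exchange condenses to the sketch
below. It is only a sketch -- the Mosaic/kwiley paths are reused from this
thread for illustration -- but the point everything above turns on is that
hadoop.pipes.executable must name a file on HDFS, not on the submitting
machine:

# Build the binary for the *cluster's* platform (e.g. Linux x86_64),
# then stage it on HDFS; hadoop.pipes.executable is resolved against HDFS:
$ hadoop fs -put Mosaic /uwphysics/kwiley/pipes-bin/Mosaic
# Submit, pointing at the HDFS copy:
$ hadoop pipes \
    -D hadoop.pipes.executable=/uwphysics/kwiley/pipes-bin/Mosaic \
    -D hadoop.pipes.java.recordreader=true \
    -D hadoop.pipes.java.recordwriter=true \
    -input /uwphysics/kwiley/mosaic/input \
    -output /uwphysics/kwiley/mosaic/output

At submit time the JobClient only checks that the HDFS path exists; the
binary itself is shipped to the tasktrackers via the distributed cache.]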
Re: C++ pipes on full (nonpseudo) cluster
What happens if you try this:

$ hadoop fs -rmr HDFSPATH/output ; hadoop pipes \
    -D hadoop.pipes.executable=EXECUTABLE \
    -D hadoop.pipes.java.recordreader=true \
    -D hadoop.pipes.java.recordwriter=true \
    -input HDFSPATH/input -output HDFSPATH/output

On Tue, 2010-03-30 at 15:05 -0700, Keith Wiley wrote:
> $ hadoop fs -rmr HDFSPATH/output ; hadoop pipes \
>     -D hadoop.pipes.java.recordreader=true \
>     -D hadoop.pipes.java.recordwriter=true \
>     -input HDFSPATH/input -output HDFSPATH/output \
>     -program HDFSPATH/EXECUTABLE
> Deleted hdfs://mainclusternn.hipods.ihost.com/HDFSPATH/output
> 10/03/30 14:56:55 WARN mapred.JobClient: No job jar file set. User classes
> may not be found. See JobConf(Class) or JobConf#setJar(String).
> 10/03/30 14:56:55 INFO mapred.FileInputFormat: Total input paths to process : 1
> 10/03/30 14:57:05 INFO mapred.JobClient: Running job: job_201003241650_1076
> 10/03/30 14:57:06 INFO mapred.JobClient:  map 0% reduce 0%
> ^C
> $
>
> At that point the terminal hung, so I eventually ctrl-Ced to break it. Now,
> if I investigate the Hadoop task logs for the mapper, I see this:
>
> stderr logs:
> bash: /data/disk2/hadoop/mapred/local/taskTracker/archive/mainclusternn.hipods.ihost.com/uwphysics/kwiley/mosaic/c++_bin/Mosaic/Mosaic:
> cannot execute binary file
>
> ...which makes perfect sense in light of the following:
>
> $ hd fs -ls /uwphysics/kwiley/mosaic/c++_bin
> Found 1 items
> -rw-r--r--   1 kwiley uwphysics   211808 2010-03-30 10:26 /uwphysics/kwiley/mosaic/c++_bin/Mosaic
> $ hd fs -chmod 755 /uwphysics/kwiley/mosaic/c++_bin/Mosaic
> $ hd fs -ls /uwphysics/kwiley/mosaic/c++_bin
> Found 1 items
> -rw-r--r--   1 kwiley uwphysics   211808 2010-03-30 10:26 /uwphysics/kwiley/mosaic/c++_bin/Mosaic
> $
>
> Note that this is all an attempt to run an executable that was uploaded to
> HDFS in advance. In this example I am not attempting to run an executable
> stored on my local machine. Any attempt to do that results in a
> file-not-found error:
>
> $ hadoop fs -rmr HDFSPATH/output ; hadoop pipes \
>     -D hadoop.pipes.java.recordreader=true \
>     -D hadoop.pipes.java.recordwriter=true \
>     -input HDFSPATH/input -output HDFSPATH/output \
>     -program LOCALPATH/EXECUTABLE
> Deleted hdfs://mainclusternn.hipods.ihost.com/uwphysics/kwiley/mosaic/output
> Exception in thread "main" java.io.FileNotFoundException: File does not exist: /Users/kwiley/hadoop-0.20.1+152/Mosaic/clue/Mosaic/src/cpp/Mosaic
>     at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:457)
>     at org.apache.hadoop.filecache.DistributedCache.getTimestamp(DistributedCache.java:509)
>     at org.apache.hadoop.mapred.JobClient.configureCommandLineOptions(JobClient.java:681)
>     at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:802)
>     at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:771)
>     at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1290)
>     at org.apache.hadoop.mapred.pipes.Submitter.runJob(Submitter.java:248)
>     at org.apache.hadoop.mapred.pipes.Submitter.run(Submitter.java:479)
>     at org.apache.hadoop.mapred.pipes.Submitter.main(Submitter.java:494)
> $
>
> It's clearly looking for the executable in HDFS, not on the local system,
> thus the file-not-found error.
>
> Keith Wiley     kwi...@keithwiley.com     www.keithwiley.com
>
> "What I primarily learned in grad school is how much I *don't* know.
> Consequently, I left grad school with a higher ignorance to knowledge ratio
> than when I entered."
>    -- Keith Wiley
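[A hedged aside on the "cannot execute binary file" error in this message:
bash emits that for a wrong-format binary -- a permissions problem would read
"Permission denied" instead -- so the chmod dead end above may be a red
herring. The /Users/... paths suggest the binary was built on Mac OS X while
the tasktrackers run Linux, and `file` distinguishes the two at a glance
(output lines are illustrative):

$ file Mosaic
Mosaic: Mach-O 64-bit executable x86_64
# A Mac build like the above cannot run on Linux tasktrackers; a
# cluster-runnable build would instead report something like:
#   Mosaic: ELF 64-bit LSB executable, x86-64, ...
# If it is Mach-O, rebuild on (or cross-compile for) the cluster's platform.]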
Re: C++ pipes on full (nonpseudo) cluster
What are the symptoms? Pipes should run out of the box in a standard
installation.

BTW, what version of bash are you using? Is it bash 4.0 by any chance?
See https://issues.apache.org/jira/browse/HADOOP-6388

--gianluigi

On Tue, 2010-03-30 at 14:13 -0700, Keith Wiley wrote:
> My cluster admin noticed that there is some additional pipes package he could
> add to the cluster configuration, but he admits to knowing very little about
> how the C++ pipes component of Hadoop works.
>
> Can you offer any insight into this cluster configuration package? What
> exactly does it do that makes a cluster capable of running pipes programs
> (and what symptom should its absence present from a user's point of view)?
>
> On Mar 30, 2010, at 13:43, Gianluigi Zanetti wrote:
>
> > Hello.
> > Did you try following the tutorial in
> > http://wiki.apache.org/hadoop/C++WordCount ?
> >
> > We use C++ pipes in production on a large cluster, and it works.
> >
> > --gianluigi
>
> Keith Wiley     kwi...@keithwiley.com     www.keithwiley.com
>
> "I do not feel obliged to believe that the same God who has endowed us with
> sense, reason, and intellect has intended us to forgo their use."
>    -- Galileo Galilei
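[A quick check for the HADOOP-6388 incompatibility mentioned above -- run it
on the tasktracker nodes, since that is where bash launches the pipes
executable; the output line is illustrative:

$ bash --version | head -1
GNU bash, version 4.0.33(1)-release (x86_64-pc-linux-gnu)
# A 4.0.x answer here would point at HADOOP-6388.]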
Re: C++ pipes on full (nonpseudo) cluster
Hello.
Did you try following the tutorial in
http://wiki.apache.org/hadoop/C++WordCount ?

We use C++ pipes in production on a large cluster, and it works.

--gianluigi

On Tue, 2010-03-30 at 13:28 -0700, Keith Wiley wrote:
> No responses yet, although I admit it's only been a few hours.
>
> As a follow-up, permit me to pose the following question:
>
> Is it, in fact, impossible to run C++ pipes on a fully distributed system (as
> opposed to a pseudo-distributed system)? I haven't found any definitive
> clarification on this topic one way or the other. The only statement I have
> found that is in the least bit illuminating is in the O'Reilly book (not
> official Hadoop documentation, mind you), p. 38, which states:
>
> "To run a Pipes job, we need to run Hadoop in pseudo-distributed mode...Pipes
> doesn't run in standalone (local) mode, since it relies on Hadoop's
> distributed cache mechanism, which works only when HDFS is running."
>
> The phrasing is a little unclear in that the distinction being made appears
> to be between standalone and pseudo-distributed mode, with no specific
> reference to fully distributed mode. The qualification given for needing
> pseudo-distributed mode (the need for HDFS) would obviously also be satisfied
> in fully distributed mode, despite that mode going unmentioned in the quoted
> passage. So can pipes run in fully distributed mode or not?
>
> Bottom line: I can't get C++ pipes to work on a fully distributed cluster
> yet, and I don't know whether I'm wasting my time on a truly impossible
> effort, or whether it can be done and I simply haven't figured out how yet.
>
> Thanks for any help.
>
> Keith Wiley     kwi...@keithwiley.com     www.keithwiley.com
>
> "The easy confidence with which I know another man's religion is folly teaches
> me to suspect that my own is also."
>    -- Mark Twain
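[For anyone following the wiki tutorial on a real cluster, the build step
looks roughly like this -- a sketch only; the Linux-amd64-64 directory name
varies with platform and Hadoop build, and the binary must match the
tasktracker nodes' architecture, not the submit host's:

$ g++ wordcount.cpp -o wordcount \
    -I$HADOOP_HOME/c++/Linux-amd64-64/include \
    -L$HADOOP_HOME/c++/Linux-amd64-64/lib \
    -lhadooppipes -lhadooputils -lpthread]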
Re: sun grid engine and map reduce
Hello Himanshu.

Could you please describe your use case in more detail? There are two basic
Grid Engine integration schemes:

1/ Native integration in Grid Engine

This is the one referred to in Dan Templeton's blog. It is based on the
assumption that HDFS is always running, and it essentially has your
map-reduce job (including a per-job jobtracker) scheduled as a parallel
environment 'as-close-as-possible' to your HDFS data. It will be in GE 6.2u5,
which is currently in beta and should be out any moment now. It is possible
to back-port it to 6.2u4 and probably to 6.2u3.

2/ HOD integration

What HOD (Hadoop on Demand) does, in a nutshell, is allocate a group of
machines as a parallel environment within GE and run a jobtracker and a
namenode that control the allocated machines. Users then submit their jobs to
that jobtracker and use the HDFS controlled by that namenode. Of course, the
resulting Hadoop environment is transient since, as far as GE is concerned,
it is simply a parallel job; what 'transient' means in practice depends on
how you set up your queues.

We have developed a patch that adds Grid Engine support to Hadoop HOD:
http://issues.apache.org/jira/browse/HADOOP-6369

This approach is undemanding as to GE version, but it is not very efficient
HDFS-wise, since Grid Engine is ignorant of HDFS data locality. In practice,
either you ask HOD to use an independent HDFS that is always up -- but then
there is no guarantee that the tasktracker nodes will be close to the data --
or you upload your data to a new HDFS created by HOD.

Thus, 1/ is definitely more efficient and 'cluster-wide', while 2/ is more
like a sort of cluster partitioning.

--gianluigi

On Wed, 2009-12-09 at 12:43 -0800, himanshu chandola wrote:
> Hi all,
> We are integrating our Hadoop jobs with Sun Grid Engine. Most of the
> map-reduce jobs that start on our cluster are sequential map and reduce.
> I also found integration guidelines here:
> http://blogs.sun.com/templedf/entry/beta_testing_the_sun_grid and
> http://blogs.sun.com/ravee/entry/creating_hadoop_pe_under_sge
>
> I wanted to know whether every sequential map-reduce job would be counted
> as a separate job by SGE. That matters because, in total, the sequential
> map-reduce runs for days.
>
> Thanks
> H
>
> Morpheus: Do you believe in fate, Neo?
> Neo: No.
> Morpheus: Why Not?
> Neo: Because I don't like the idea that I'm not in control of my life.
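[To make scheme 2/ concrete, a typical HOD session looks roughly like this --
a sketch following the Hadoop 0.20 HOD documentation; the cluster directory
and node count are illustrative:

$ hod allocate -d ~/hod-clusters/test -n 8
# GE schedules 8 nodes; HOD starts a jobtracker on them (and, if so
# configured, a transient HDFS).
$ hadoop --config ~/hod-clusters/test jar my-job.jar MyJob in out
# Jobs go to the HOD-provisioned jobtracker via the generated config.
$ hod deallocate -d ~/hod-clusters/test
# Release the nodes; the transient cluster disappears.]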