Packaging for Hadoop - what about the Hadoop libraries?
Hi, when packaging additional libraries for an MR job, I can use a script or a Maven Hadoop plugin, but what about the Hadoop libraries themselves? Should I package them in, or should I rely on those jars that are already present in the Hadoop installation where the code will be running? What is the best practice? Thank you, Mark
Re: Packaging for Hadoop - what about the Hadoop libraries?
Just package the libraries that your MR jobs depend on. There is no need to package the Hadoop libraries themselves, but make sure your Hadoop client version matches the server version. Praveen
Re: Packaging for Hadoop - what about the Hadoop libraries?
Use the ones that are already present. It is a little trickier for the other (non-Hadoop) jars, though not really once you "get it": passing -libjars with a list of supporting jars on the command line will ship those jars out with the job to the map and reduce tasks. However, if you need any of them during job submission itself, they won't be present there; you either need to have them on the command-line classpath or bundle them. Cheers, James.
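To make the -libjars route concrete, here is a rough sketch of a driver that goes through ToolRunner/GenericOptionsParser so that -libjars is honoured. The class name, jar names and paths below are made up for the example and are not from the thread:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MyJobDriver extends Configured implements Tool {
  public int run(String[] args) throws Exception {
    // getConf() already has -libjars (and -fs / -jt) applied by ToolRunner.
    Job job = new Job(getConf(), "my job");
    job.setJarByClass(MyJobDriver.class);
    // ... set mapper/reducer classes and input/output paths here ...
    return job.waitForCompletion(true) ? 0 : 1;
  }

  public static void main(String[] args) throws Exception {
    System.exit(ToolRunner.run(new Configuration(), new MyJobDriver(), args));
  }
}

// Example invocation: the supporting jars are shipped to the tasks; they are only
// needed in HADOOP_CLASSPATH if the driver itself uses them at submission time.
//   hadoop jar myjob.jar MyJobDriver -libjars guava.jar,joda-time.jar /input /output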
definition of slots in Hadoop scheduling
Hi, How is a task slot in Hadoop defined with respect to scheduling map/reduce tasks on the slots available on TaskTrackers? Thanks, Bikash
Catching mapred exceptions on the client
Hello all, I have a few MapReduce jobs that I am calling from a Java driver. The problem I am facing is that when there is an exception in a MapReduce job, the exception is not propagated to the client, so even if the first job failed, it goes on to the second job and so on. Is there another way of catching exceptions from MapReduce jobs on the client side? I am using hadoop-0.20.2. My example is:

Driver {
  try {
    Call MapredJob1;
    Call MapredJob2;
    ...
  } catch (Exception e) {
    throw new exception;
  }
}

When MapredJob1 throws ClassNotFoundException, MapredJob2 and the others still execute. Any insight into it is appreciated. Praveen
Re: Catching mapred exceptions on the client
Hello, It is hard to give advice without the specific code. However, if your job submission is not set up to wait for completion, then it might be launching all your jobs at the same time. Check how your jobs are being submitted. Sorry I can't be more helpful. James
Re: definition of slots in Hadoop scheduling
Please see this archived thread for a very similar question on what tasks really are: http://mail-archives.apache.org/mod_mbox/hadoop-general/201011.mbox/%3c126335.8536...@web112111.mail.gq1.yahoo.com%3E Right now, they're just a hand-configured cap on parallelism, irrespective of the machine's capabilities. However, a Scheduler may take a machine's state into account while assigning tasks to it. -- Harsh J www.harshj.com
Re: definition of slots in Hadoop scheduling
Thanks very much, Harsh. It seems then that slots are not defined in terms of actual machine resource capacities such as CPU, memory, disk, and network bandwidth. -bikash
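For what it's worth, a small sketch of how those per-TaskTracker caps look from code in 0.20.x; the default value of 2 used below is an assumption, so verify it against your own configuration:

import org.apache.hadoop.conf.Configuration;

public class SlotConfig {
  public static void main(String[] args) {
    // Slots are plain integers set by the admin per TaskTracker;
    // they are not derived from CPU, memory, disk or network capacity.
    Configuration conf = new Configuration();
    int mapSlots = conf.getInt("mapred.tasktracker.map.tasks.maximum", 2);
    int reduceSlots = conf.getInt("mapred.tasktracker.reduce.tasks.maximum", 2);
    System.out.println("map slots per TaskTracker:    " + mapSlots);
    System.out.println("reduce slots per TaskTracker: " + reduceSlots);
  }
}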
RE: Catching mapred exceptions on the client
James, Thanks for the response. I am using waitForCompletion (job.waitForCompletion(true);) for each job, so the jobs are definitely executed sequentially. Praveen
Lost in HDFS_BYTES_READ/WRITTEN
Hello, please help me clear up my ideas! When a reducer reads map-output data remotely, is that reflected in HDFS_BYTES_READ? Or are HDFS_BYTES_READ/WRITTEN only for the start and end of a job, i.e. the first data read by the maps as input and the last data written by the reducer as output for the user to see? Thank you in advance, Maha
Re: Lost in HDFS_BYTES_READ/WRITTEN
From what I could gather, all FileSystem instances put an entry into a static 'statistics' map. This map is used to update the counters for each Task. Hence, all operations done on the same HDFS URI, by either the task or your application code, are counted as one. In fact, even if you are reading off another HDFS, only the scheme is matched, so it would aggregate into the same counter as well. I'm not very sure of this, though. Perhaps writing a simple test would be adequate to learn the truth. -- Harsh J www.harshj.com
Re: Catching mapred exceptions on the client
Hello, There's a Job.isSuccessful() call you could use, perhaps, after waitForCompletion(true) returns. -- Harsh J www.harshj.com
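A minimal sketch of the kind of sequential driver being discussed, assuming the new org.apache.hadoop.mapreduce API in 0.20.2; the job names and placeholder configuration are made up for the example:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ChainedDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    Job job1 = new Job(conf, "MapredJob1");
    // ... set jar, mapper/reducer classes and input/output paths for job1 here ...
    // waitForCompletion(true) returns false when the job fails, so the driver
    // can stop here instead of blindly launching the next job.
    if (!job1.waitForCompletion(true)) {
      throw new RuntimeException("Job failed: " + job1.getJobName());
    }

    Job job2 = new Job(conf, "MapredJob2");
    // ... configure job2 here ...
    job2.waitForCompletion(true);
    // Alternatively, check Job.isSuccessful() after waiting, as suggested above.
    if (!job2.isSuccessful()) {
      throw new RuntimeException("Job failed: " + job2.getJobName());
    }
  }
}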
Error in Reducer stderr preventing reducer from completing
I'm running into an inconsistent problem where, with multiple reducers (and extremely small amounts of data), a reducer task attempt will hang at 33% until the JobTracker starts a second attempt. It does not seem to be tied to any one machine, and when I change the number of reducers to 1 it runs fine every time. The reducer's stderr shows:

log4j:ERROR Failed to flush writer, java.io.InterruptedIOException
  at java.io.FileOutputStream.writeBytes(Native Method)
  at java.io.FileOutputStream.write(FileOutputStream.java:260)
  at sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:202)
  at sun.nio.cs.StreamEncoder.implFlushBuffer(StreamEncoder.java:272)
  at sun.nio.cs.StreamEncoder.implFlush(StreamEncoder.java:276)
  at sun.nio.cs.StreamEncoder.flush(StreamEncoder.java:122)
  at java.io.OutputStreamWriter.flush(OutputStreamWriter.java:212)
  at org.apache.log4j.helpers.QuietWriter.flush(QuietWriter.java:58)
  at org.apache.log4j.WriterAppender.subAppend(WriterAppender.java:316)
  at org.apache.log4j.WriterAppender.append(WriterAppender.java:160)
  at org.apache.hadoop.mapred.TaskLogAppender.append(TaskLogAppender.java:58)
  at org.apache.log4j.AppenderSkeleton.doAppend(AppenderSkeleton.java:251)
  at org.apache.log4j.helpers.AppenderAttachableImpl.appendLoopOnAppenders(AppenderAttachableImpl.java:66)
  at org.apache.log4j.Category.callAppenders(Category.java:206)
  at org.apache.log4j.Category.forcedLog(Category.java:391)
  at org.apache.log4j.Category.log(Category.java:856)
  at org.apache.commons.logging.impl.Log4JLogger.info(Log4JLogger.java:199)
  at org.apache.hadoop.mapreduce.task.reduce.ShuffleScheduler.freeHost(ShuffleScheduler.java:345)
  at org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:152)

Aaron Baff | Developer | Telescope, Inc.
Re: Lost in HDFS_BYTES_READ/WRITTEN
Thanks for your reply Harsh, but this is confusing me more :( I can't experiment with this because I'm using a single machine now and everything is reported as local reads/writes. Or can I? I'm using this line: hdfs = FileSystem.get(getConf()); which I think means that the instance created is distributed, but the job counters never use it for intermediate results (e.g. for reducers to read map outputs). So if you can answer my question further, I would truly appreciate it! Maha
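One way to see which counters actually move is to read them back in the driver once a job has finished. A rough sketch, assuming the group/counter names that 0.20.x reports under FileSystemCounters (treat the exact strings as an assumption and verify them against your own job output):

import org.apache.hadoop.mapreduce.Counters;
import org.apache.hadoop.mapreduce.Job;

public class PrintFsCounters {
  // Call this after job.waitForCompletion(true) has returned.
  static void printFileSystemCounters(Job job) throws Exception {
    Counters counters = job.getCounters();
    long hdfsRead = counters.findCounter("FileSystemCounters", "HDFS_BYTES_READ").getValue();
    long hdfsWritten = counters.findCounter("FileSystemCounters", "HDFS_BYTES_WRITTEN").getValue();
    long localRead = counters.findCounter("FileSystemCounters", "FILE_BYTES_READ").getValue();
    // Map output fetched by reducers travels over HTTP during the shuffle,
    // so it is counted under the shuffle/local-file counters, not HDFS_BYTES_READ.
    System.out.println("HDFS_BYTES_READ    = " + hdfsRead);
    System.out.println("HDFS_BYTES_WRITTEN = " + hdfsWritten);
    System.out.println("FILE_BYTES_READ    = " + localRead);
  }
}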
RE: Hadoop build fails
Download JDK 1.5 and Apache Forrest, then run:

ant -Djava5.home=/PATH_TO_jdk1.5 -Dforrest.home=/PATH_TO_apache-forrest tar

However, if what you want is just a binary tarball without the source and documentation, just do:

ant binary

Regards, Tanping

-----Original Message-----
From: Mark Kerzner [mailto:markkerz...@gmail.com]
Sent: Thursday, February 17, 2011 3:37 PM
To: Hadoop Discussion Group
Subject: Hadoop build fails

Hi, I got the latest trunk out of svn, ran this command

ant -Djavac.args="-Xlint -Xmaxwarns 1000" clean test tar

and got the following error:

/hadoop-common-trunk/build.xml:950: 'forrest.home' is not defined. Please pass -Dforrest.home=<base of Apache Forrest installation> to Ant on the command-line.

What should I do? Did I miss something in the documentation? Thank you, Mark
Re: is there a smarter way to execute a hadoop cluster?
Hi, if it is possible, could you give me some examples of how to load the configuration info? I've tested it by putting the hadoop and hadoop/conf paths in my $CLASSPATH -- not a solution for me. How can I load the cluster configuration? Thanks.

On Feb 25, 2011, at 3:32 PM, Harsh J wrote:

Hello again, Finals won't help; all the logic you require is performed in the front-end/Driver code. If you're using fs.default.name inside a Task somehow, final will help there. It is best if your application gets the right configuration files on its classpath itself, so that the right values are read (how else would it know your values!). Alternatively, you can use GenericOptionsParser to parse -fs and -jt arguments when the Driver is launched from the command line.

On Fri, Feb 25, 2011 at 11:46 AM, Jun Young Kim juneng...@gmail.com wrote:

Hi, Harsh. I've already tried using the final tag to make the setting unmodifiable, but my result is no different.

core-site.xml:

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost</value>
    <final>true</final>
  </property>
</configuration>

The other *-site.xml files are also modified by this rule. Thanks. Junyoung Kim (juneng...@gmail.com)

On 02/25/2011 02:50 PM, Harsh J wrote:

Hi, On Fri, Feb 25, 2011 at 10:17 AM, Jun Young Kim juneng...@gmail.com wrote: hi, I found the reason for my problem. When submitting a job via the shell, conf.get("fs.default.name") is hdfs://localhost; when submitting a job from a Java application directly, conf.get("fs.default.name") is file://localhost, so I couldn't read any files from HDFS. I think my Java app couldn't read the *-site.xml configurations properly.

Have a look at this Q: http://wiki.apache.org/hadoop/FAQ#How_do_I_get_my_MapReduce_Java_Program_to_read_the_Cluster.27s_set_configuration_and_not_just_defaults.3F -- Harsh J www.harshj.com
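In case a concrete example helps, here is a minimal sketch of loading the cluster configuration from a standalone Java application; the conf paths below are examples only, so adjust them for your installation:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LoadClusterConf {
  public static void main(String[] args) throws Exception {
    // Option 1: point the Configuration at the cluster's *-site.xml files
    // explicitly (paths are examples; use your own conf directory).
    Configuration conf = new Configuration();
    conf.addResource(new Path("/usr/local/hadoop/conf/core-site.xml"));
    conf.addResource(new Path("/usr/local/hadoop/conf/hdfs-site.xml"));
    conf.addResource(new Path("/usr/local/hadoop/conf/mapred-site.xml"));

    // Option 2 (instead of the above): keep the conf directory itself on the
    // application's classpath so that new Configuration() picks it up, or
    // launch the driver through ToolRunner and pass -fs / -jt on the command line.

    System.out.println("fs.default.name = " + conf.get("fs.default.name"));
    FileSystem fs = FileSystem.get(conf);
    System.out.println("Working with: " + fs.getUri());
  }
}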
kmeans
Hi, I am facing a problem: I am not able to run the k-means clustering algorithm on a single node. So far I have only run the wordcount program. What are the steps for doing so?