Packaging for Hadoop - what about the Hadoop libraries?

2011-02-25 Thread Mark Kerzner
Hi,

when packaging additional libraries for an MR job, I can use a script or a
Maven Hadoop plugin, but what about the Hadoop libraries themselves? Should
I package them in, or should I rely on those jars that are already present
in the Hadoop installation where the code will be running? What is the best
practice?

Thank you,
Mark


Re: Packaging for Hadoop - what about the Hadoop libraries?

2011-02-25 Thread praveen.peddi
Just package the libraries that your MR jobs depend on. There is no need to
package the Hadoop libraries themselves. But make sure your Hadoop client version
matches the server version.
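
For example, if you happen to build with Maven, one common approach (just a sketch;
the version shown is illustrative) is to declare the Hadoop dependency with
"provided" scope, so you compile against it but do not bundle it into your job jar:

<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-core</artifactId>
    <version>0.20.2</version>
    <scope>provided</scope>
</dependency>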

Praveen

On Feb 25, 2011, at 8:07 AM, ext Mark Kerzner markkerz...@gmail.com wrote:

 Hi,
 
 when packaging additional libraries for an MR job, I can use a script or a
 Maven Hadoop plugin, but what about the Hadoop libraries themselves? Should
 I package them in, or should I rely on those jars that are already present
 in the Hadoop installation where the code will be running? What is the best
 practice?
 
 Thank you,
 Mark


Re: Packaging for Hadoop - what about the Hadoop libraries?

2011-02-25 Thread James Seigel
Rely on the ones that are already present in the installation.

It is a little tricky for the other (non-Hadoop) jars, however; well, not really
once you "get it".

Passing -libjars <comma-separated list of supporting jars> on the command line will
ship the "supporting" jars out with the job to the map and reduce tasks. However, if
for some reason you need them during job submission, they won't be present on the
client side; you either need to have them on the command-line classpath or bundled
into your job jar.
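
For what it's worth, here is a minimal driver sketch (class, job and jar names are
hypothetical) that goes through ToolRunner, so GenericOptionsParser strips out
-libjars (and -fs/-jt) before your own code sees the arguments:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MyJobDriver extends Configured implements Tool {
    public int run(String[] args) throws Exception {
        // getConf() already reflects -libjars/-fs/-jt handled by GenericOptionsParser.
        Job job = new Job(getConf(), "my-job");
        job.setJarByClass(MyJobDriver.class);
        // ... set mapper, reducer and input/output paths from args here ...
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new Configuration(), new MyJobDriver(), args));
    }
}

You would then launch it with something like (jar names hypothetical):
hadoop jar myjob.jar MyJobDriver -libjars lib1.jar,lib2.jar input output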

Cheers
James.


On 2011-02-25, at 6:06 AM, Mark Kerzner wrote:

 Hi,
 
 when packaging additional libraries for an MR job, I can use a script or a
 Maven Hadoop plugin, but what about the Hadoop libraries themselves? Should
 I package them in, or should I rely on those jars that are already present
 in the Hadoop installation where the code will be running? What is the best
 practice?
 
 Thank you,
 Mark



definition of slots in Hadoop scheduling

2011-02-25 Thread bikash sharma
Hi,
How is a task slot in Hadoop defined with respect to scheduling map/reduce
tasks on the slots available on TaskTrackers?

Thanks,
Bikash


Catching mapred exceptions on the client

2011-02-25 Thread praveen.peddi
Hello all,
I have a few MapReduce jobs that I am calling from a Java driver. The problem I
am facing is that when there is an exception in a mapred job, the exception is
not propagated to the client, so even if the first job fails, it goes on to the
second job and so on. Is there another way of catching exceptions from mapred jobs
on the client side?

I am using hadoop-0.20.2.

My example is:
Driver {
    try {
        Call MapredJob1;
        Call MapredJob2;
        ...
    } catch (Exception e) {
        throw new Exception(e);
    }
}

When MapredJob1 throws a ClassNotFoundException, MapredJob2 and the others still
execute.

Any insight into it is appreciated.

Praveen



Re: Catching mapred exceptions on the client

2011-02-25 Thread James Seigel
Hello,

It is hard to give advice without the specific code.  However, if you don’t 
have your job submission set up to wait for completion then it might be 
launching all your jobs at the same time.

Check to see how your jobs are being submitted.

Sorry, I can’t be more helpful.

James


On 2011-02-25, at 9:00 AM, praveen.pe...@nokia.com wrote:

 Hello all,
 I have few mapreduce jobs that I am calling from a java driver. The problem I 
 am facing is that when there is an exception in mapred job, the exception is 
 not propogated to the client so even if first job failed, its going to second 
 job and so on. Is there an another way of catching exceptions from mapred 
 jobs on the client side?
 
 I am using hadoop-0.20.2.
 
 My Example is:
 Driver {
 try {
Call MapredJob1;
Call MapredJob2;
..
..
}catch(Exception e) {
throw new exception;
}
 }
 
 When MapredJob1 throws ClassNotFoundException, MapredJob2 and others are 
 still executing.
 
 Any insight into it is appreciated.
 
 Praveen
 



Re: definition of slots in Hadoop scheduling

2011-02-25 Thread Harsh J
Please see this archived thread for a very similar question on what
tasks really are:
http://mail-archives.apache.org/mod_mbox/hadoop-general/201011.mbox/%3c126335.8536...@web112111.mail.gq1.yahoo.com%3E

Right now, they're just a hand-configured cap on parallelization, set
irrespective of the machine's capabilities. However, a Scheduler may take a
machine's state into account while assigning tasks to it.
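
If it helps, on 0.20 those per-TaskTracker caps are just set by hand in
mapred-site.xml, along these lines (the values below are purely illustrative):

<property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>4</value>
</property>
<property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>2</value>
</property>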

On Fri, Feb 25, 2011 at 7:22 PM, bikash sharma sharmabiks...@gmail.com wrote:
 Hi,
 How is task slot in Hadoop defined with respect to scheduling a map/reduce
 task on such slots available on TaskTrackers?

 Thanks,
 Bikash




-- 
Harsh J
www.harshj.com


Re: definition of slots in Hadoop scheduling

2011-02-25 Thread bikash sharma
Thanks very much, Harsh. It seems then that slots are not defined in terms of
actual machine resource capacities such as CPU, memory, disk, and network
bandwidth.

-bikash

On Fri, Feb 25, 2011 at 11:33 AM, Harsh J qwertyman...@gmail.com wrote:

 Please see this archived thread for a very similar question on what
 tasks really are:

 http://mail-archives.apache.org/mod_mbox/hadoop-general/201011.mbox/%3c126335.8536...@web112111.mail.gq1.yahoo.com%3E

 Right now, they're just a cap number for parallelization,
 hand-configured and irrespective of the machine's capabilities.
 However, a Scheduler may take machine's states into account while
 assigning tasks to one.

 On Fri, Feb 25, 2011 at 7:22 PM, bikash sharma sharmabiks...@gmail.com
 wrote:
  Hi,
  How is task slot in Hadoop defined with respect to scheduling a
 map/reduce
  task on such slots available on TaskTrackers?
 
  Thanks,
  Bikash
 



 --
 Harsh J
 www.harshj.com



RE: Catching mapred exceptions on the client

2011-02-25 Thread praveen.peddi
James,
Thanks for the response. I am using waitForCompletion 
(job.waitForCompletion(true);) for each job. So the jobs are definitely 
executed sequentially.

Praveen

-Original Message-
From: ext James Seigel [mailto:ja...@tynt.com] 
Sent: Friday, February 25, 2011 11:15 AM
To: common-user@hadoop.apache.org
Subject: Re: Catching mapred exceptions on the client

Hello,

It is hard to give advice without the specific code.  However, if you don't 
have your job submission set up to wait for completion then it might be 
launching all your jobs at the same time.

Check to see how your jobs are being submitted.

Sorry, I can't be more helpful.

James


On 2011-02-25, at 9:00 AM, praveen.pe...@nokia.com wrote:

 Hello all,
 I have few mapreduce jobs that I am calling from a java driver. The problem I 
 am facing is that when there is an exception in mapred job, the exception is 
 not propogated to the client so even if first job failed, its going to second 
 job and so on. Is there an another way of catching exceptions from mapred 
 jobs on the client side?
 
 I am using hadoop-0.20.2.
 
 My Example is:
 Driver {
 try {
Call MapredJob1;
Call MapredJob2;
..
..
}catch(Exception e) {
throw new exception;
}
 }
 
 When MapredJob1 throws ClassNotFoundException, MapredJob2 and others are 
 still executing.
 
 Any insight into it is appreciated.
 
 Praveen
 



Lost in HDFS_BYTES_READ/WRITTEN

2011-02-25 Thread maha
Hello, please help me clear up my ideas!

  When a reducer reads map-output data remotely ... is that reflected in
HDFS_BYTES_READ?

  Or are HDFS_BYTES_READ/WRITTEN only for the start and end of a job? I.e.,
the first data read by maps as input and the last data written by the reducer as
output for the user to see.


Thank you in advance,

Maha

Re: Lost in HDFS_BYTES_READ/WRITTEN

2011-02-25 Thread Harsh J
From what I could gather, all FileSystem instances put an entry into a
static 'statistics' map. This map is used to update the counters for each
Task. Hence, all operations done on the same HDFS URI, whether by the task
or by your application code, should be counted as one. In fact, even if you
are reading off another HDFS, only the scheme is matched, so it would
aggregate to the same counter as well.

I'm not very sure of this though. Perhaps writing a simple test should
be adequate to learn the truth.
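
One way to check is to dump every counter group after the job finishes and see
where the shuffle bytes actually show up. A rough sketch, assuming the new
org.apache.hadoop.mapreduce API and a Job you already hold a reference to
(the class name is hypothetical):

import org.apache.hadoop.mapreduce.Counter;
import org.apache.hadoop.mapreduce.CounterGroup;
import org.apache.hadoop.mapreduce.Counters;
import org.apache.hadoop.mapreduce.Job;

public class CounterDump {
    // 'job' is assumed to have finished, e.g. after job.waitForCompletion(true).
    public static void dump(Job job) throws Exception {
        Counters counters = job.getCounters();
        for (CounterGroup group : counters) {      // e.g. the file system counters group
            for (Counter counter : group) {        // e.g. HDFS_BYTES_READ, FILE_BYTES_READ
                System.out.println(group.getName() + "\t"
                        + counter.getName() + " = " + counter.getValue());
            }
        }
    }
}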

On Sat, Feb 26, 2011 at 1:04 AM, maha m...@umail.ucsb.edu wrote:
 Hello, please help me clear me ideas!

  When a reducer reads map-output data remotely ... Is that reflected in the 
 HDFS_BYTES_READ?

  Or is HDFS_BYTES_READ/WRITTEN is only for the start and end of a job ? ie. 
 first data read for maps as input and last data written from reducer as 
 output for user to see.


 Thank you in advance,

 Maha



-- 
Harsh J
www.harshj.com


Re: Catching mapred exceptions on the client

2011-02-25 Thread Harsh J
Hello,

On Sat, Feb 26, 2011 at 12:24 AM,  praveen.pe...@nokia.com wrote:
 James,
 Thanks for the response. I am using waitForCompletion 
 (job.waitForCompletion(true);) for each job. So the jobs are definitely 
 executed sequentially.

There's a Job.isSuccessful() call you could use, perhaps, after
waitForCompletion(true) returns.
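
A rough sketch of what the driver could look like (class and job names are
hypothetical), stopping the chain as soon as one job does not succeed:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ChainedDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        Job job1 = new Job(conf, "MapredJob1");
        // ... configure job1's mapper, reducer and input/output paths here ...
        if (!job1.waitForCompletion(true)) {
            // job1 failed or was killed; do not start job2.
            throw new RuntimeException("MapredJob1 failed");
        }

        Job job2 = new Job(conf, "MapredJob2");
        // ... configure job2 here ...
        if (!job2.waitForCompletion(true)) {
            throw new RuntimeException("MapredJob2 failed");
        }
    }
}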

-- 
Harsh J
www.harshj.com


Error in Reducer stderr preventing reducer from completing

2011-02-25 Thread Aaron Baff
I'm running into an inconsistent problem: with multiple reducers (and extremely
small amounts of data) I'll hit a situation where a reduce task attempt hangs at
33% until the JobTracker starts up a second attempt. It does not seem to be tied
to any one machine, and when I change the number of reducers to 1, it runs fine
every time.



log4j:ERROR Failed to flush writer,
java.io.InterruptedIOException
    at java.io.FileOutputStream.writeBytes(Native Method)
    at java.io.FileOutputStream.write(FileOutputStream.java:260)
    at sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:202)
    at sun.nio.cs.StreamEncoder.implFlushBuffer(StreamEncoder.java:272)
    at sun.nio.cs.StreamEncoder.implFlush(StreamEncoder.java:276)
    at sun.nio.cs.StreamEncoder.flush(StreamEncoder.java:122)
    at java.io.OutputStreamWriter.flush(OutputStreamWriter.java:212)
    at org.apache.log4j.helpers.QuietWriter.flush(QuietWriter.java:58)
    at org.apache.log4j.WriterAppender.subAppend(WriterAppender.java:316)
    at org.apache.log4j.WriterAppender.append(WriterAppender.java:160)
    at org.apache.hadoop.mapred.TaskLogAppender.append(TaskLogAppender.java:58)
    at org.apache.log4j.AppenderSkeleton.doAppend(AppenderSkeleton.java:251)
    at org.apache.log4j.helpers.AppenderAttachableImpl.appendLoopOnAppenders(AppenderAttachableImpl.java:66)
    at org.apache.log4j.Category.callAppenders(Category.java:206)
    at org.apache.log4j.Category.forcedLog(Category.java:391)
    at org.apache.log4j.Category.log(Category.java:856)
    at org.apache.commons.logging.impl.Log4JLogger.info(Log4JLogger.java:199)
    at org.apache.hadoop.mapreduce.task.reduce.ShuffleScheduler.freeHost(ShuffleScheduler.java:345)
    at org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:152)





Aaron Baff | Developer | Telescope, Inc.

email: aaron.b...@telescope.tv | office: 424 270 2913 | www.telescope.tv








Re: Lost in HDFS_BYTES_READ/WRITTEN

2011-02-25 Thread maha
Thanks for your reply Harsh, but this is confusing me more :(

I can't experiment with this because I'm using a single machine now, and everything
is reported as local read/written.

Or can I?
I'm using this line: hdfs = FileSystem.get(getConf()); which I think means that
the instance created is distributed.
But the job counters never use it for intermediate results (e.g. for reducers to
read map-outputs).

So if you can answer my question further, I'd truly appreciate it!

Maha

On Feb 25, 2011, at 12:00 PM, Harsh J wrote:

 From what I could gather, all FileSystem instances put in an entry
 into a static 'statistics' map. This map is used to update the
 counters for each Task. Hence, all operations done on the same HDFS
 URI by either the task or your application code, must be counted as
 one. In fact, even if you are reading off another HDFS, the scheme
 match is alone seen, so it would aggregate to the same counter as
 well.
 
 I'm not very sure of this though. Perhaps writing a simple test should
 be adequate to learn the truth.
 
 On Sat, Feb 26, 2011 at 1:04 AM, maha m...@umail.ucsb.edu wrote:
 Hello, please help me clear me ideas!
 
  When a reducer reads map-output data remotely ... Is that reflected in the 
 HDFS_BYTES_READ?
 
  Or is HDFS_BYTES_READ/WRITTEN is only for the start and end of a job ? ie. 
 first data read for maps as input and last data written from reducer as 
 output for user to see.
 
 
 Thank you in advance,
 
 Maha
 
 
 
 -- 
 Harsh J
 www.harshj.com



RE: Hadoop build fails

2011-02-25 Thread Tanping Wang
Download jdk1.5
Download apache-forrest
ant -Djava5.home=/PATH_TO_jdk1.5 -Dforrest.home=/PATH_TO_apache-forrest tar

However, if what you want is just a binary tarball without source and
documentation, just do:
ant binary

Regards,
Tanping
-Original Message-
From: Mark Kerzner [mailto:markkerz...@gmail.com] 
Sent: Thursday, February 17, 2011 3:37 PM
To: Hadoop Discussion Group
Subject: Hadoop build fails

Hi,

I got the latest trunk out of svn, ran this command

ant -Djavac.args="-Xlint -Xmaxwarns 1000" clean test tar

and got the following error

/hadoop-common-trunk/build.xml:950: 'forrest.home' is not defined. Please pass
-Dforrest.home=<base of Apache Forrest installation> to Ant on the command-line.

What should I do? Did I miss something in the documentation?

Thank you,
Mark


Re: is there more smarter way to execute a hadoop cluster?

2011-02-25 Thread JunYoung Kim
Hi,

If it is possible, could you give me some examples of how to load the configuration info?

I've tried putting the hadoop and hadoop/conf paths in my $CLASSPATH -- that was not a
solution for me.

How can I load the cluster configurations?

thanks.

On Feb 25, 2011, at 3:32 PM, Harsh J wrote:

 Hello again,
 
 Finals won't help all the logic you require to be performed in the
 front-end/Driver code. If you're using fs.default.name inside a Task
 somehow, final will help there. It is best if your application gets
 the right configuration files on its classpath itself, so that the
 right values are read (how else would it know your values!).
 
 Alternatively, you can use GenericOptionsParser to parse -fs and -jt
 arguments when the Driver is launched from commandline.
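
 As a rough sketch (the conf paths below are hypothetical), you can also point
 the client's Configuration at the cluster's files explicitly:

 import org.apache.hadoop.conf.Configuration;
 import org.apache.hadoop.fs.FileSystem;
 import org.apache.hadoop.fs.Path;

 public class ClusterConfExample {
     public static void main(String[] args) throws Exception {
         Configuration conf = new Configuration();
         // Hypothetical locations; point these at your cluster's conf directory.
         conf.addResource(new Path("/usr/local/hadoop/conf/core-site.xml"));
         conf.addResource(new Path("/usr/local/hadoop/conf/hdfs-site.xml"));
         conf.addResource(new Path("/usr/local/hadoop/conf/mapred-site.xml"));
         System.out.println("fs.default.name = " + conf.get("fs.default.name"));
         FileSystem fs = FileSystem.get(conf);  // should now be HDFS, not the local FS
     }
 }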
 
 On Fri, Feb 25, 2011 at 11:46 AM, Jun Young Kim juneng...@gmail.com wrote:
 Hi, Harsh.
 
 I've already tried to do use final tag to set it unmodifiable.
 but, my result is not different.
 
 *core-site.xml:*
 <configuration>
   <property>
     <name>fs.default.name</name>
     <value>hdfs://localhost</value>
     <final>true</final>
   </property>
 </configuration>
 
 other *-site.xml files are also modified by this rule.
 
 thanks.
 
 Junyoung Kim (juneng...@gmail.com)
 
 
 On 02/25/2011 02:50 PM, Harsh J wrote:
 
 Hi,
 
 On Fri, Feb 25, 2011 at 10:17 AM, Jun Young Kim juneng...@gmail.com wrote:
 
 hi,
 
 I got the reason of my problem.
 
 in case of submitting a job by shell,
 
 conf.get("fs.default.name") is hdfs://localhost
 
 in case of submitting a job by a java application directly,
 
 conf.get("fs.default.name") is file://localhost
 so I couldn't read any files from hdfs.
 
 I think the execution of my java app couldn't read *-site.xml
 configurations
 properly.
 
 Have a look at this Q:
 
 http://wiki.apache.org/hadoop/FAQ#How_do_I_get_my_MapReduce_Java_Program_to_read_the_Cluster.27s_set_configuration_and_not_just_defaults.3F
 
 
 
 
 
 -- 
 Harsh J
 www.harshj.com



kmeans

2011-02-25 Thread MANISH SINGLA
Hi,
I am facing a problem... I am not able to run the k-means clustering algorithm
on a single node. Till now I have just run the wordcount program. What are the
steps for doing so?