Re: How to get the HDFS I/O information

2012-04-24 Thread George Datskos

Qu,

Every job has a history file that is, by default, stored under 
$HADOOP_LOG_DIR/history.  These "job history" files list the amount of 
HDFS data read and written (and lots of other things) for every task.
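
For example, the same per-task HDFS counters can also be pulled programmatically
while the JobTracker still remembers the job. A rough sketch against the Hadoop
1.x mapred API (the class name and the job ID string are placeholders; the
"FileSystemCounters" group and the HDFS_BYTES_READ / HDFS_BYTES_WRITTEN counter
names are the ones 0.20/1.x uses):

import org.apache.hadoop.mapred.Counters;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.JobID;
import org.apache.hadoop.mapred.TaskReport;

public class PerTaskHdfsIo {
    public static void main(String[] args) throws Exception {
        JobClient client = new JobClient(new JobConf());
        JobID id = JobID.forName("job_201204240000_0001");  // placeholder job id

        // Per-task counters for the map tasks; getReduceTaskReports() works the same way
        for (TaskReport report : client.getMapTaskReports(id)) {
            Counters counters = report.getCounters();
            long read = counters.findCounter("FileSystemCounters", "HDFS_BYTES_READ").getCounter();
            long written = counters.findCounter("FileSystemCounters", "HDFS_BYTES_WRITTEN").getCounter();
            System.out.println(report.getTaskID() + "  read=" + read + "  written=" + written);
        }
    }
}

For jobs that have already retired from the JobTracker, the history files above
remain the place to look.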


On 2012/04/25 7:25, Qu Chen wrote:
Let me add, I'd like to do this periodically to gather some 
performance profile information.


On Tue, Apr 24, 2012 at 5:47 PM, Qu Chen wrote:


I am trying to gather the info regarding the amount of HDFS
read/write for each task in a given map-reduce job. How can I do that?








Re: Reducer not firing

2012-04-17 Thread George Datskos

Arko,

Change Iterator to Iterable
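
For reference, with the type parameters inferred from the driver configuration
quoted later in this thread (map output IntWritable/Text, job output
NullWritable/Text), the corrected class would look roughly like this sketch:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class Reduce_First extends Reducer<IntWritable, Text, NullWritable, Text> {
    @Override
    public void reduce(IntWritable key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        for (Text value : values) {
            // process each value
        }
        // finally emit, for example:
        // context.write(NullWritable.get(), new Text("result"));
    }
}

Once the signature matches what the framework calls, the @Override annotation
that Steven suggested also compiles cleanly.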


George


On 2012/04/18 8:16, Arko Provo Mukherjee wrote:

Hello,

Thanks everyone for helping me. Here are my observations:

Devaraj - I didn't find any errors in the log files. In fact, none of the
print statements in my reducer are even appearing in the logs. I can
share the syslogs if you want. I didn't paste them here so that the
email doesn't get cluttered.

Kasi -  Thanks for the suggestion. I tried it but got the same output.
The system just created 1 reducer as my test data set is small.

Bejoy -  Can you please advise how I can pinpoint whether the
IdentityReducer is being used or not?

Steven - I tried compiling with your suggestion. However, if I put an
@Override on top of my reduce method, I get the following error:
"method does not override or implement a method from a supertype".
The code compiles without it. I do have an @Override on top of my map
method though.
public class Reduce_First extends Reducer<IntWritable, Text, NullWritable, Text>
{
    public void reduce (IntWritable key, Iterator<Text> values,
        Context context) throws IOException, InterruptedException
    {
        while ( values.hasNext() )
            // Process

        // Finally emit
    }
}

Thanks a lot again!
Warm regards
Arko


On Tue, Apr 17, 2012 at 3:19 PM, Steven Willis  wrote:

Try putting @Override before your reduce method to make sure you're
overriding the method properly. You’ll get a compile time error if not.



-Steven Willis





From: Bejoy KS [mailto:bejoy.had...@gmail.com]
Sent: Tuesday, April 17, 2012 10:03 AM


To: mapreduce-user@hadoop.apache.org
Subject: Re: Reducer not firing



Hi Arko
From the naming of the output files, your job has a reduce phase. But the
reducer being used is the IdentityReducer instead of your custom reducer.
That is why you are seeing the same map output in the output files as
well. You need to evaluate your code and logs to see why the IdentityReducer is
being triggered.
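
For reference: in the new API the base Reducer class itself plays the
IdentityReducer role. Its default reduce method is roughly the pass-through
below, and it is what runs whenever a custom reduce method does not actually
override it (for example, because the parameter types don't match):

// Roughly what org.apache.hadoop.mapreduce.Reducer does when reduce() is not overridden
protected void reduce(KEYIN key, Iterable<VALUEIN> values, Context context)
        throws IOException, InterruptedException {
    for (VALUEIN value : values) {
        // identity: map output is written straight to the job output
        context.write((KEYOUT) key, (VALUEOUT) value);
    }
}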

Regards
Bejoy KS

Sent from handheld, please excuse typos.



From: kasi subrahmanyam

Date: Tue, 17 Apr 2012 19:10:33 +0530

To:

ReplyTo: mapreduce-user@hadoop.apache.org

Subject: Re: Reducer not firing



Could you comment out the property where you are setting the number of reducer
tasks and see the behaviour of the program once?
If you have already tried that, could you share the output?

On Tue, Apr 17, 2012 at 3:00 PM, Devaraj k  wrote:

Can you check the task attempt logs in your cluster and find out what is
happening in the reduce phase? By default, task attempt logs are present in
$HADOOP_LOG_DIR/userlogs//. There could be a bug in your
reducer which is leading to this output.


Thanks
Devaraj


From: Arko Provo Mukherjee [arkoprovomukher...@gmail.com]

Sent: Tuesday, April 17, 2012 2:07 PM
To: mapreduce-user@hadoop.apache.org
Subject: Re: Reducer not firing


Hello,

Many thanks for the reply.

The 'no_of_reduce_tasks' is set to 2. I have a print statement before
the code I pasted below to check that.

Also, I can find two output files, part-r-0 and part-r-1. But
they contain the values that have been output by the Mapper logic.

Please let me know what I can check further.

Thanks a lot in advance!

Warm regards
Arko

On Tue, Apr 17, 2012 at 12:48 AM, Devaraj k  wrote:

Hi Arko,

What is the value of 'no_of_reduce_tasks'?

If the number of reduce tasks is 0, then the map tasks will directly write their
output into the job output path.

Thanks
Devaraj


From: Arko Provo Mukherjee [arkoprovomukher...@gmail.com]
Sent: Tuesday, April 17, 2012 10:32 AM
To: mapreduce-user@hadoop.apache.org
Subject: Reducer not firing

Dear All,

I am porting code from the old API to the new API (Context objects)
and running it on Hadoop 0.20.203.

Job job_first = new Job();

job_first.setJarByClass(My.class);
job_first.setNumReduceTasks(no_of_reduce_tasks);
job_first.setJobName("My_Job");

FileInputFormat.addInputPath( job_first, new Path (Input_Path) );
FileOutputFormat.setOutputPath( job_first, new Path (Output_Path) );

job_first.setMapperClass(Map_First.class);
job_first.setReducerClass(Reduce_First.class);

job_first.setMapOutputKeyClass(IntWritable.class);
job_first.setMapOutputValueClass(Text.class);

job_first.setOutputKeyClass(NullWritable.class);
job_first.setOutputValueClass(Text.class);

job_first.waitForCompletion(true);

The problem I am facing is that instead of emitting values to the
reducers, the mappers are directly writing their output to the
OutputPath, and the reducers are not processing anything.

As described in the online materials, both my Map and
Reduce methods use the context.write method to emit their values.

Please help. Thanks a lot in advance!!

Warm regards
Arko










Re: Run mapred job without hadoop script

2012-04-05 Thread George Datskos

Yes, you can do this.  Take a look at JobClient.runJob and JobClient.submitJob.
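
A rough sketch of that approach with the old mapred API is below. The identity
mapper/reducer and the args[0]/args[1] paths are only placeholders; also note
that the hadoop script mainly supplies the classpath and the *-site.xml
configuration, so this JVM needs to see those as well, otherwise the job runs
with the local job runner.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RunningJob;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public class SubmitFromJava {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(SubmitFromJava.class);
        conf.setJobName("submitted-from-java");

        // substitute your own mapper/reducer; the identity classes keep the sketch self-contained
        conf.setMapperClass(IdentityMapper.class);
        conf.setReducerClass(IdentityReducer.class);
        conf.setOutputKeyClass(LongWritable.class);  // TextInputFormat keys
        conf.setOutputValueClass(Text.class);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        // runJob blocks until completion; submitJob returns a RunningJob handle immediately
        RunningJob job = JobClient.runJob(conf);
        System.out.println("successful = " + job.isSuccessful());
    }
}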


George

On 2012/04/06 7:50, GUOJUN Zhu wrote:


We are building a system with hadoop mapred as the back-end 
distributed-computing engine.  The front-end is also in Java.  So it 
would be nice to start a hadoop job, or interact with HDFS directly, 
from within Java without invoking the hadoop script.  Is there any tutorial or 
guide for this?  Otherwise, I have to read through the entire hadoop 
script to reproduce the proper settings.  Thank you very much.


Zhu, Guojun
Modeling Sr Graduate
571-3824370
guojun_...@freddiemac.com
Financial Engineering
Freddie Mac 






Re: Best practices configuring libraries on the backend.

2012-03-28 Thread George Datskos

Dmitriy

I've tested it on hadoop 1.0.0 and 1.0.1.  (I don't know which version 
cdh3u3 is based off of)


In hadoop-env.sh if I set 
HADOOP_TASKTRACKER_OPTS="-Djava.library.path=/usr/blah" the TaskTracker 
sees that option.  Then it gets passed along to all M/R child tasks on 
that node.  Can you confirm that your TaskTrackers are actually seeing 
the passed option? (through the ps command)
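
Another way to check, from the task side rather than with ps, is a throwaway
mapper that logs the property the child JVM actually ended up with. A sketch
using the new API (the class name is arbitrary); the line shows up in that
task's logs under userlogs:

import java.io.IOException;
import org.apache.hadoop.mapreduce.Mapper;

public class LibraryPathProbe<KEYIN, VALUEIN, KEYOUT, VALUEOUT>
        extends Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // What the child task JVM really received after the TaskTracker and
        // mapred.child.java.opts settings were combined
        System.err.println("java.library.path = " + System.getProperty("java.library.path"));
    }
}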



George


On 2012/03/29 5:19, Dmitriy Lyubimov wrote:

Hm. doesn't seem to work for me (with cdh3u3)
I defined

export HADOOP_TASKTRACKER_OPTS="-Djava.library.path=/usr/"

and it doesn't seem to work (as opposed to when I set it with the
property mapred.child.java.opts on the data node).

Still puzzling.

On Tue, Mar 27, 2012 at 7:17 PM, George Datskos
  wrote:

Dmitriy,

I just double-checked, and the caveat I stated earlier is incorrect.  So,
  "-Djava.library.path" set in the client's {mapred.child.java.opts} should
just append to the "-Djava.library.path" that each TaskTracker has when
creating the library path for each child (M/R) task.  So that's even better
I guess.


George



On 2012/03/28 11:06, George Datskos wrote:

Dmitriy,

To deal with different servers having various shared libraries in
different locations, you can simply make sure the _TaskTracker_'s
-Djava.library.path is set correctly on each server.  That library path
should be passed along to each child (M/R) task.  (in *addition* to the
{mapred.child.java.opts} that you specify on the client-side configuration
options)

One caveat: on the client-side, don't include "-Djava.library.path" or
that path will be passed along to all of the child tasks, overriding the
site-specific one you set on the TaskTracker.


George


On 2012/03/28 10:43, Dmitriy Lyubimov wrote:

Hello,

I have a couple of questions regarding mapreduce configurations.

We install various platforms on data nodes that require mixed set of
native libraries.

Part of the problem is that, in the general case, these software platforms
may be installed in different locations on the backend (we try to
unify this, but still). This means a site-specific
-Djava.library.path setting may be required.

I configured individual JVM options (mapred.child.java.opts) on each
node to include a specific set of paths. However, I encountered two
problems:

#1: my setting doesn't go into effect unless I also declare it final
on the data node. It's just being overridden by the default -Xmx200m value
from the driver, EVEN when I don't set it on the driver at all (and
there seems to be no way to unset it).

However, using the "final" spec on the backend creates a problem if one
of the numerous jobs we run still wishes to override the setting. The
ideal behavior is: if I don't set it in the driver, then the backend value
kicks in; otherwise, the driver's value wins. But I did not find a way to
do that for this particular setting for some reason. Could somebody
clarify the best workaround? Thank you.

#2: The ideal behavior here would actually be to merge driver-specific and
backend-specific settings. E.g. the backend may need to configure specific
software package locations, while the client may sometimes wish to set the
heap size, etc. Is there a best practice to achieve this effect?

Thank you very much in advance.
-Dmitriy


















Re: Mapper Record Spillage

2012-03-12 Thread George Datskos
Actually if you set {io.sort.mb} to 2048, your map tasks will always 
fail.  The maximum {io.sort.mb} is hard-coded to 2047.  Which means if 
you think you've set 2048 and your tasks aren't failing, then you 
probably haven't actually changed io.sort.mb.  Double-check what 
configuration settings the Jobtracker actually saw by looking at


$ hadoop fs -cat hdfs:///_logs/history/*.xml | grep io.sort.mb




George


On 2012/03/11 22:38, Harsh J wrote:

Hans,

I don't think io.sort.mb can support a whole 2048 value (it builds one
array of that size, and the JVM may not allow that). Can you lower
it to 2000 ± 100 and try again?

On Sun, Mar 11, 2012 at 1:36 PM, Hans Uhlig  wrote:

If that is the case, then these two lines should provide more than enough
memory on a virtually unused cluster.

job.getConfiguration().setInt("io.sort.mb", 2048);
job.getConfiguration().set("mapred.map.child.java.opts", "-Xmx3072M");

So a conversion from 1 GB of CSV text to binary primitives should fit
easily, but Java still throws a heap error even when there is 25 GB of
memory free.

On Sat, Mar 10, 2012 at 11:50 PM, Harsh J  wrote:

Hans,

You can change memory requirements for tasks of a single job, but not
of a single task inside that job.

This is briefly how the 0.20 framework (by default) works: the TT has
notions only of "slots", and carries a maximum _number_ of
simultaneous slots it may run. It does not know what each task,
occupying one slot, will demand in resource terms. Your job then
supplies a number of map tasks, and the amount of memory required per map task
in general, as configuration. TTs then merely start the task JVMs
with the provided heap configuration.
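
In driver code the per-job knobs discussed in this thread end up looking roughly
like the sketch below (the class name and the values are only an example, chosen
so the sort buffer stays under the 2047 cap and fits inside the requested heap):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class HeapAndSortBuffer {
    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "heap-and-sort-buffer");
        // Heap for every map task of this job (set per job, not per individual task)
        job.getConfiguration().set("mapred.map.child.java.opts", "-Xmx3072m");
        // Sort buffer: comfortably below the heap and never above the hard-coded 2047 MB cap
        job.getConfiguration().setInt("io.sort.mb", 1024);
        // ... set mapper/reducer, input and output paths as usual, then submit
    }
}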

On Sun, Mar 11, 2012 at 11:24 AM, Hans Uhlig  wrote:

That was a typo in my email, not in the configuration. Is the memory
reserved for the tasks when the task tracker starts? You seem to be
suggesting that I need to set the memory to be the same for all map tasks.
Is there no way to override it for a single map task?


On Sat, Mar 10, 2012 at 8:41 PM, Harsh J  wrote:

Hans,

It's possible you may have a typo issue: mapred.map.child.jvm.opts -
such a property does not exist. Perhaps you wanted
"mapred.map.child.java.opts"?

Additionally, the computation you need to do is: (# of map slots on a
TT * per-map-task heap requirement) should be less than (Total RAM -
2 or 3 GB). With your 4 GB requirement, I guess you can support a max of
6-7 slots per machine (i.e. not counting reducer heap requirements in
parallel).

On Sun, Mar 11, 2012 at 9:30 AM, Hans Uhlig  wrote:

I am attempting to speed up a mapping process whose input is GZIP-compressed
CSV files. The files range from 1-2 GB, and I am running on a cluster where
each node has a total of 32 GB of memory available to use. I have attempted to
tweak mapred.map.child.jvm.opts with -Xmx4096mb and io.sort.mb to 2048 to
accommodate the size, but I keep getting Java heap errors or other
memory-related problems. My row count per mapper is below the
Integer.MAX_VALUE limit by several orders of magnitude, and the box is NOT
using anywhere close to its full memory allotment. How can I specify that
this map task can have 3-4 GB of memory for the collection, partition and
sort process without constantly spilling records to disk?



--
Harsh J





--
Harsh J










Re: Tracking Job completion times

2012-03-04 Thread George Datskos

Bharath,

Try the hadoop job -history API



On 2012/03/05 8:06, Bharath Ravi wrote:
The Web UI does give me start and finish times, but I was wondering if 
there is
a way to access these stats through an API, without having to grep 
through HTML.


The "hadoop jobs -status" API was useful, but it doesn't seem to list 
wall completion times.

(It does give me CPU time though). Am I missing something?