Pangool: easier Hadoop, same performance

2012-03-06 Thread Pere Ferrera
Hi,
I'd like to introduce you to Pangool, an easier
low-level MapReduce API for Hadoop. I'm one of the developers. We just
open-sourced it yesterday.

Pangool is a Java, low-level MapReduce API with the same flexibility and
performance as the plain Java Hadoop MapReduce API. The difference is
that it makes a lot of things easier to code and understand.

A few of Pangool's features:
- Tuple-based intermediate serialization (allowing easier development).
- Built-in, easy-to-use group by and sort by (removing boilerplate code for
things like secondary sort; see the sketch after this list).
- Built-in, easy-to-use reduce-side joins (which are quite hard to
implement in Hadoop).
- Augmented Hadoop API: Built-in multiple inputs / outputs, configuration
via object instance.
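
To give a sense of the boilerplate that the built-in group by / sort by replaces,
this is roughly the driver-side wiring a hand-rolled secondary sort needs in plain
MapReduce (a sketch only: the composite key, partitioner and comparator classes
named below are hypothetical and would also have to be written by hand):

// Driver-side wiring for a plain-MapReduce secondary sort.
// CompositeKey, NaturalKeyPartitioner, CompositeKeyComparator and
// NaturalKeyGroupingComparator are hypothetical user-written classes.
job.setMapOutputKeyClass(CompositeKey.class);
job.setPartitionerClass(NaturalKeyPartitioner.class);               // partition on the "group by" field only
job.setSortComparatorClass(CompositeKeyComparator.class);           // order by group field, then the "sort by" field
job.setGroupingComparatorClass(NaturalKeyGroupingComparator.class); // group reducer input on the "group by" field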

Pangool aims to make Hadoop's steep learning curve a lot gentler
while retaining all of its features, power and flexibility. It
differs from high-level tools like Pig or Hive in that it can be used as a
replacement for the low-level API. There is no performance or flexibility
penalty for using Pangool.

We ran an initial benchmark to demonstrate this.

I'd be very interested in hearing your feedback, opinions and questions on
it.

Cheers,

Pere.


Re: why does my mapper class read my input file twice?

2012-03-06 Thread Jane Wayne
Harsh,

Thanks. I went into the code for FileInputFormat.addInputPath(Job, Path) and
it is as you stated. That makes sense now. I simply commented out
FileInputFormat.addInputPath(job, input) and
FileOutputFormat.setOutputPath(job, output), and everything
automagically works now.

Thanks a bunch!
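
For reference, a minimal sketch of the driver with Harsh's suggested fix (quoted
below): keep the addInputPath/setOutputPath calls, but read the input from a
non-reserved key such as input.path so mapred.input.dir only gets populated once.

@Override
public int run(String[] args) throws Exception {
    Configuration conf = getConf();
    // "input.path" is an arbitrary, non-reserved key; the framework's
    // "mapred.input.dir" is then set exactly once, by addInputPath() below.
    Path input = new Path(conf.get("input.path"));
    Path output = new Path(conf.get("mapred.output.dir"));

    Job job = new Job(conf, "dummy job");
    job.setJarByClass(MyJob.class);
    job.setMapperClass(MyMapper.class);
    job.setReducerClass(MyReducer.class);
    job.setMapOutputKeyClass(LongWritable.class);
    job.setMapOutputValueClass(Text.class);
    job.setOutputKeyClass(LongWritable.class);
    job.setOutputValueClass(Text.class);

    FileInputFormat.addInputPath(job, input);
    FileOutputFormat.setOutputPath(job, output);

    return job.waitForCompletion(true) ? 0 : 1;
}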

On Tue, Mar 6, 2012 at 2:06 AM, Harsh J  wrote:

> Its your use of the mapred.input.dir property, which is a reserved
> name in the framework (its what FileInputFormat uses).
>
> You have a config you extract path from:
> Path input = new Path(conf.get("mapred.input.dir"));
>
> Then you do:
> FileInputFormat.addInputPath(job, input);
>
> Which internally, simply appends a path to a config prop called
> "mapred.input.dir". Hence your job gets launched with two input files
> (the very same) - one added by default Tool-provided configuration
> (cause of your -Dmapred.input.dir) and the other added by you.
>
> Fix the input path line to use a different config:
> Path input = new Path(conf.get("input.path"));
>
> And run job as:
> hadoop jar dummy-0.1.jar dummy.MyJob -Dinput.path=data/dummy.txt
> -Dmapred.output.dir=result
>
> On Tue, Mar 6, 2012 at 9:03 AM, Jane Wayne 
> wrote:
> > i have code that reads in a text file. i notice that each line in the
> text
> > file is somehow being read twice. why is this happening?
> >
> > my mapper class looks like the following:
> >
> > public class MyMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
> >
> > private static final Log _log = LogFactory.getLog(MyMapper.class);
> >  @Override
> > public void map(LongWritable key, Text value, Context context) throws
> > IOException, InterruptedException {
> > String s = (new
> > StringBuilder()).append(value.toString()).append("m").toString();
> > context.write(key, new Text(s));
> > _log.debug(key.toString() + " => " + s);
> > }
> > }
> >
> > my reducer class looks like the following:
> >
> > public class MyReducer extends Reducer<LongWritable, Text, LongWritable, Text> {
> >
> > private static final Log _log = LogFactory.getLog(MyReducer.class);
> >  @Override
> > public void reduce(LongWritable key, Iterable<Text> values, Context
> > context) throws IOException, InterruptedException {
> > for(Iterator<Text> it = values.iterator(); it.hasNext();) {
> > Text txt = it.next();
> > String s = (new
> > StringBuilder()).append(txt.toString()).append("r").toString();
> > context.write(key, new Text(s));
> > _log.debug(key.toString() + " => " + s);
> > }
> > }
> > }
> >
> > my job class looks like the following:
> >
> > public class MyJob extends Configured implements Tool {
> >
> > public static void main(String[] args) throws Exception {
> > ToolRunner.run(new Configuration(), new MyJob(), args);
> > }
> >
> > @Override
> > public int run(String[] args) throws Exception {
> > Configuration conf = getConf();
> > Path input = new Path(conf.get("mapred.input.dir"));
> >Path output = new Path(conf.get("mapred.output.dir"));
> >
> >Job job = new Job(conf, "dummy job");
> >job.setMapOutputKeyClass(LongWritable.class);
> >job.setMapOutputValueClass(Text.class);
> >job.setOutputKeyClass(LongWritable.class);
> >job.setOutputValueClass(Text.class);
> >
> >job.setMapperClass(MyMapper.class);
> >job.setReducerClass(MyReducer.class);
> >
> >FileInputFormat.addInputPath(job, input);
> >FileOutputFormat.setOutputPath(job, output);
> >
> >job.setJarByClass(MyJob.class);
> >
> >return job.waitForCompletion(true) ? 0 : 1;
> > }
> > }
> >
> > the text file that i am trying to read in looks like the following. as
> you
> > can see, there are 9 lines.
> >
> > T, T
> > T, T
> > T, T
> > F, F
> > F, F
> > F, F
> > F, F
> > T, F
> > F, T
> >
> > the output file that i get after my Job runs looks like the following. as
> > you can see, there are 18 lines. each key is emitted twice from the
> mapper
> > to the reducer.
> >
> > 0   T, Tmr
> > 0   T, Tmr
> > 6   T, Tmr
> > 6   T, Tmr
> > 12  T, Tmr
> > 12  T, Tmr
> > 18  F, Fmr
> > 18  F, Fmr
> > 24  F, Fmr
> > 24  F, Fmr
> > 30  F, Fmr
> > 30  F, Fmr
> > 36  F, Fmr
> > 36  F, Fmr
> > 42  T, Fmr
> > 42  T, Fmr
> > 48  F, Tmr
> > 48  F, Tmr
> >
> > the way i execute my Job is as follows (cygwin + hadoop 0.20.2).
> >
> > hadoop jar dummy-0.1.jar dummy.MyJob -Dmapred.input.dir=data/dummy.txt
> > -Dmapred.output.dir=result
> >
> > originally, this happened when i read in a sequence file, but even for a
> > text file, this problem is still happening. is it the way i have setup my
> > Job?
>
>
>
> --
> Harsh J
>


Re: is there any way to detect the file size as I am writing a sequence file?

2012-03-06 Thread Joey Echeverria
I think you mean Writer.getLength(). It returns the current position
in the output stream in bytes (more or less the current size of the
file).
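
A minimal sketch of the 64 MB rollover described below, checking getLength()
after each append (this assumes Text keys/values and simply numbers the output
files; adjust as needed):

import java.io.IOException;
import java.util.Map;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class RollingSeqFileWriter {
  private static final long LIMIT = 64L * 1024 * 1024; // roll at ~64 MB

  /** Writes (fileName -> fileContent) pairs, starting a new sequence file
   *  whenever the current one grows past LIMIT. */
  public static void write(FileSystem fs, Configuration conf, Path outDir,
                           Map<String, String> files) throws IOException {
    int part = 0;
    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, new Path(outDir, "part-" + part), Text.class, Text.class);
    try {
      for (Map.Entry<String, String> e : files.entrySet()) {
        writer.append(new Text(e.getKey()), new Text(e.getValue()));
        // getLength() is the current position in the output stream in bytes,
        // i.e. roughly the current size of the file.
        if (writer.getLength() >= LIMIT) {
          writer.close();
          part++;
          writer = SequenceFile.createWriter(
              fs, conf, new Path(outDir, "part-" + part), Text.class, Text.class);
        }
      }
    } finally {
      writer.close();
    }
  }
}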

-Joey

On Tue, Mar 6, 2012 at 9:53 AM, Jane Wayne  wrote:
> hi,
>
> i am writing a little util class to recurse into a directory and add all
> *.txt files into a sequence file (key is the file name, value is the
> content of the corresponding text file). as i am writing (i.e.
> SequenceFile.Writer.append(key, value)), is there any way to detect how
> large the sequence file is?
>
> for example, i want to create a new sequence file as soon as the current
> one exceeds 64 MB.
>
> i notice there is a SequenceFile.Writer.getLong() which the javadocs says
> "returns the current length of the output file," but that is vague. what is
> this Writer.getLong() method? is it the number of bytes, kilobytes,
> megabytes, or something else?
>
> thanks,



-- 
Joseph Echeverria
Cloudera, Inc.
443.305.9434


Re: is there any way to detect the file size as I am writing a sequence file?

2012-03-06 Thread Jane Wayne
Thanks Joey. That's what I meant (I've been staring at the screen too
long). :)

On Tue, Mar 6, 2012 at 10:00 AM, Joey Echeverria  wrote:

> I think you mean Writer.getLength(). It returns the current position
> in the output stream in bytes (more or less the current size of the
> file).
>
> -Joey
>
> On Tue, Mar 6, 2012 at 9:53 AM, Jane Wayne 
> wrote:
> > hi,
> >
> > i am writing a little util class to recurse into a directory and add all
> > *.txt files into a sequence file (key is the file name, value is the
> > content of the corresponding text file). as i am writing (i.e.
> > SequenceFile.Writer.append(key, value)), is there any way to detect how
> > large the sequence file is?
> >
> > for example, i want to create a new sequence file as soon as the current
> > one exceeds 64 MB.
> >
> > i notice there is a SequenceFile.Writer.getLong() which the javadocs says
> > "returns the current length of the output file," but that is vague. what
> is
> > this Writer.getLong() method? is it the number of bytes, kilobytes,
> > megabytes, or something else?
> >
> > thanks,
>
>
>
> --
> Joseph Echeverria
> Cloudera, Inc.
> 443.305.9434
>


how to get rid of -libjars ?

2012-03-06 Thread Jane Wayne
currently, i have my main jar and then 2 dependent jars. what i do is
1. copy dependent-1.jar to $HADOOP/lib
2. copy dependent-2.jar to $HADOOP/lib

then, when i need to run my job, MyJob inside main.jar, i do the following.

hadoop jar main.jar demo.MyJob -libjars dependent-1.jar,dependent-2.jar
-Dmapred.input.dir=/input/path -Dmapred.output.dir=/output/path

what i want to do is NOT copy the dependent jars to $HADOOP/lib and always
specify -libjars. is there any way around this multi-step procedure? i
really do not want to clutter $HADOOP/lib or specify a comma-delimited list
of jars for -libjars.

any help is appreciated.


Re: how to get rid of -libjars ?

2012-03-06 Thread Joey Echeverria
If you're using -libjars, there's no reason to copy the jars into
$HADOOP/lib. You may have to add the jars to HADOOP_CLASSPATH if
you use them from your main() method:

export HADOOP_CLASSPATH=dependent-1.jar:dependent-2.jar
hadoop jar main.jar demo.MyJob -libjars
dependent-1.jar,dependent-2.jar -Dmapred.input.dir=/input/path
-Dmapred.output.dir=/output/path
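
If the goal is to avoid typing -libjars on every run, a related option is to
ship the jars from the driver itself through the distributed cache. A minimal
sketch, assuming the two jars have already been copied once to HDFS under a
hypothetical /libs directory:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MyJob extends Configured implements Tool {
  @Override
  public int run(String[] args) throws Exception {
    Configuration conf = getConf();
    // Programmatic equivalent of -libjars: the jars are pulled from HDFS into
    // the distributed cache and added to every task's classpath.
    DistributedCache.addFileToClassPath(new Path("/libs/dependent-1.jar"), conf);
    DistributedCache.addFileToClassPath(new Path("/libs/dependent-2.jar"), conf);

    Job job = new Job(conf, "my job");
    job.setJarByClass(MyJob.class);
    // ... mapper/reducer and input/output setup as usual ...
    return job.waitForCompletion(true) ? 0 : 1;
  }

  public static void main(String[] args) throws Exception {
    System.exit(ToolRunner.run(new Configuration(), new MyJob(), args));
  }
}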

-Joey

On Tue, Mar 6, 2012 at 10:37 AM, Jane Wayne  wrote:
> currently, i have my main jar and then 2 depedent jars. what i do is
> 1. copy dependent-1.jar to $HADOOP/lib
> 2. copy dependent-2.jar to $HADOOP/lib
>
> then, when i need to run my job, MyJob inside main.jar, i do the following.
>
> hadoop jar main.jar demo.MyJob -libjars dependent-1.jar,dependent-2.jar
> -Dmapred.input.dir=/input/path -Dmapred.output.dir=/output/path
>
> what i want to do is NOT copy the dependent jars to $HADOOP/lib and always
> specify -libjars. is there any way around this multi-step procedure? i
> really do not want to clutter $HADOOP/lib or specify a comma-delimited list
> of jars for -libjars.
>
> any help is appreciated.



-- 
Joseph Echeverria
Cloudera, Inc.
443.305.9434


Re: how to get rid of -libjars ?

2012-03-06 Thread Ioan Eugen Stan

Pe 06.03.2012 17:37, Jane Wayne a scris:

currently, i have my main jar and then 2 depedent jars. what i do is
1. copy dependent-1.jar to $HADOOP/lib
2. copy dependent-2.jar to $HADOOP/lib

then, when i need to run my job, MyJob inside main.jar, i do the following.

hadoop jar main.jar demo.MyJob -libjars dependent-1.jar,dependent-2.jar
-Dmapred.input.dir=/input/path -Dmapred.output.dir=/output/path

what i want to do is NOT copy the dependent jars to $HADOOP/lib and always
specify -libjars. is there any way around this multi-step procedure? i
really do not want to clutter $HADOOP/lib or specify a comma-delimited list
of jars for -libjars.

any help is appreciated.



Hello,

Specify the full path to the jar on the -libjars? My experience with 
-libjars is that it didn't work as advertised.


Search for an older post on the list about this issue ( -libjars not 
working). I tried adding a lot of jars and some got on the job classpath 
(2), some didn't (most of them).


I got over this by including all the jars in a lib directory inside the 
main jar.


Cheers,
--
Ioan Eugen Stan
http://ieugen.blogspot.com


Re: how to get rid of -libjars ?

2012-03-06 Thread Bejoy Ks
Hi Jane

+ Adding on to Joey's comments

If you want to eliminate the process of distributing the dependent
jars every time, then you need to manually pre-distribute these jars across
the nodes and add them to the classpath of all nodes. This approach may
be chosen if you periodically run, at a high frequency, some job on your
cluster that needs external jars.

Regards
Bejoy.K.S

On Tue, Mar 6, 2012 at 9:23 PM, Joey Echeverria  wrote:

> If you're using -libjars, there's no reason to copy the jars into
> $HADOOP lib. You may have to add the jars to the HADOOP_CLASSPATH if
> you use them from your main() method:
>
> export HADOOP_CLASSPATH=dependent-1.jar:dependent-2.jar
> hadoop jar main.jar demo.MyJob -libjars
> dependent-1.jar,dependent-2.jar -Dmapred.input.dir=/input/path
> -Dmapred.output.dir=/output/path
>
> -Joey
>
> On Tue, Mar 6, 2012 at 10:37 AM, Jane Wayne 
> wrote:
> > currently, i have my main jar and then 2 depedent jars. what i do is
> > 1. copy dependent-1.jar to $HADOOP/lib
> > 2. copy dependent-2.jar to $HADOOP/lib
> >
> > then, when i need to run my job, MyJob inside main.jar, i do the
> following.
> >
> > hadoop jar main.jar demo.MyJob -libjars dependent-1.jar,dependent-2.jar
> > -Dmapred.input.dir=/input/path -Dmapred.output.dir=/output/path
> >
> > what i want to do is NOT copy the dependent jars to $HADOOP/lib and
> always
> > specify -libjars. is there any way around this multi-step procedure? i
> > really do not want to clutter $HADOOP/lib or specify a comma-delimited
> list
> > of jars for -libjars.
> >
> > any help is appreciated.
>
>
>
> --
> Joseph Echeverria
> Cloudera, Inc.
> 443.305.9434
>


Re: how to get rid of -libjars ?

2012-03-06 Thread Shi Yu
1. Wrap all your jar files inside your artifact; they should be under a
lib folder. Sometimes this can make your jar file quite big; if you
want to save time uploading big jar files remotely, see 2.
2. Using -libjars with a full path or a relative path (w.r.t. your jar
package) should work.


On 3/6/2012 9:55 AM, Ioan Eugen Stan wrote:

Pe 06.03.2012 17:37, Jane Wayne a scris:

currently, i have my main jar and then 2 depedent jars. what i do is
1. copy dependent-1.jar to $HADOOP/lib
2. copy dependent-2.jar to $HADOOP/lib

then, when i need to run my job, MyJob inside main.jar, i do the 
following.


hadoop jar main.jar demo.MyJob -libjars dependent-1.jar,dependent-2.jar
-Dmapred.input.dir=/input/path -Dmapred.output.dir=/output/path

what i want to do is NOT copy the dependent jars to $HADOOP/lib and 
always

specify -libjars. is there any way around this multi-step procedure? i
really do not want to clutter $HADOOP/lib or specify a 
comma-delimited list

of jars for -libjars.

any help is appreciated.



Hello,

Specify the full path to the jar on the -libjars? My experience with 
-libjars is that it didn't work as advertised.


Search for an older post on the list about this issue ( -libjars not 
working). I tried adding a lot of jars and some got on the job 
classpath (2), some didn't (most of them).


I got over this by including all the jars in a lib directory inside 
the main jar.


Cheers,




HDFS Reporting Tools

2012-03-06 Thread Oren Livne

Dear All,

We are maintaining a 60-node hadoop cluster for external users, and 
would like to be automatically notified via email when an HDFS crash or 
some other infrastructure failure occurs that is not due to a user 
programming error. We've been encountering such "soft" errors, where 
hadoop does not crash, but becomes very slow and jobs hang for a long 
time and then fail.


Are there existing tools that provide this capability? Or do we have to 
manually monitor the web services at http://namenode and 
http://namenode:50030?


Thank you so much,
Oren

--
"We plan ahead, which means we don't do anything right now."
  -- Valentine (Tremors)




Hadoop EC2 user-data script

2012-03-06 Thread Sagar Nikam
Hi,

I am new to Hadoop. I want to try a Hadoop installation using OpenStack.
The OpenStack API for launching an instance (VM) has a parameter for passing
user-data, where we can pass scripts which will be executed on first
boot.

This is similar to EC2 user-data.

I would like to know about the hadoop user-data script. Any help on this is
appreciated.

Thanks in advance.

Regards,
Sagar


Re: Java Heap space error

2012-03-06 Thread Mohit Anchlia
I am still trying to see how to narrow this down. Is it possible to set
the HeapDumpOnOutOfMemoryError option on these individual tasks?
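
One way to do that for the child task JVMs is to pass the flags through
mapred.child.java.opts; a sketch for a plain MapReduce driver (the dump path
is a hypothetical directory that must exist on the task nodes):

// In the driver, before the Job is created: ask each child task JVM to dump
// its heap when it hits an OutOfMemoryError.
Configuration conf = getConf();
conf.set("mapred.child.java.opts",
    "-Xmx1024m -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/task_dumps");
Job job = new Job(conf, "my job");

For a Pig job the same property can usually be set from the script with Pig's
set command.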

On Mon, Mar 5, 2012 at 5:49 PM, Mohit Anchlia wrote:

> Sorry for multiple emails. I did find:
>
>
> 2012-03-05 17:26:35,636 INFO
> org.apache.pig.impl.util.SpillableMemoryManager: first memory handler call-
> Usage threshold init = 715849728(699072K) used = 575921696(562423K)
> committed = 715849728(699072K) max = 715849728(699072K)
>
> 2012-03-05 17:26:35,719 INFO
> org.apache.pig.impl.util.SpillableMemoryManager: Spilled an estimate of
> 7816154 bytes from 1 objects. init = 715849728(699072K) used =
> 575921696(562423K) committed = 715849728(699072K) max = 715849728(699072K)
>
> 2012-03-05 17:26:36,881 INFO
> org.apache.pig.impl.util.SpillableMemoryManager: first memory handler call
> - Collection threshold init = 715849728(699072K) used = 358720384(350312K)
> committed = 715849728(699072K) max = 715849728(699072K)
>
> 2012-03-05 17:26:36,885 INFO org.apache.hadoop.mapred.TaskLogsTruncater:
> Initializing logs' truncater with mapRetainSize=-1 and reduceRetainSize=-1
>
> 2012-03-05 17:26:36,888 FATAL org.apache.hadoop.mapred.Child: Error
> running child : java.lang.OutOfMemoryError: Java heap space
>
> at java.nio.HeapCharBuffer.(HeapCharBuffer.java:39)
>
> at java.nio.CharBuffer.allocate(CharBuffer.java:312)
>
> at java.nio.charset.CharsetDecoder.decode(CharsetDecoder.java:760)
>
> at org.apache.hadoop.io.Text.decode(Text.java:350)
>
> at org.apache.hadoop.io.Text.decode(Text.java:327)
>
> at org.apache.hadoop.io.Text.toString(Text.java:254)
>
> at
> org.apache.pig.piggybank.storage.SequenceFileLoader.translateWritableToPigDataType(SequenceFileLoader.java:105)
>
> at
> org.apache.pig.piggybank.storage.SequenceFileLoader.getNext(SequenceFileLoader.java:139)
>
> at
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigRecordReader.nextKeyValue(PigRecordReader.java:187)
>
> at
> org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:456)
>
> at org.apache.hadoop.mapreduce.MapContext.nextKeyValue(MapContext.java:67)
>
> at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:143)
>
> at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:647)
>
> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:323)
>
> at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
>
> at java.security.AccessController.doPrivileged(Native Method)
>
> at javax.security.auth.Subject.doAs(Subject.java:396)
>
> at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1157)
>
> at org.apache.hadoop.mapred.Child.main(Child.java:264)
>
>
>   On Mon, Mar 5, 2012 at 5:46 PM, Mohit Anchlia wrote:
>
>> All I see in the logs is:
>>
>>
>> 2012-03-05 17:26:36,889 FATAL org.apache.hadoop.mapred.TaskTracker: Task:
>> attempt_201203051722_0001_m_30_1 - Killed : Java heap space
>>
>> Looks like task tracker is killing the tasks. Not sure why. I increased
>> heap from 512 to 1G and still it fails.
>>
>>
>> On Mon, Mar 5, 2012 at 5:03 PM, Mohit Anchlia wrote:
>>
>>> I currently have java.opts.mapred set to 512MB and I am getting heap
>>> space errors. How should I go about debugging heap space issues?
>>>
>>
>>
>


hadoop cluster ssh username

2012-03-06 Thread Pat Ferrel
I have a small cluster of servers that runs hadoop. I have a laptop from 
which I'd like to use that cluster when it is available. I setup hadoop on 
the laptop so I can switch from running locally to running on the cluster. 
Local works. I have setup passwordless ssh between all machines to work 
with 'hadoop-user' which is the linux username on the cluster machines 
so I can ssh from the laptop to the servers without a password thusly:


ssh hadoop-user@master

But my username on the laptop is pferrel, not hadoop-user so when 
running 'start-all.sh' it tries


ssh pferrel@master

How do I tell it to use the linux user 'hadoop-user'? I assume there is 
something in the config directory xml files that will do this.


Hadoop error=12 Cannot allocate memory

2012-03-06 Thread Rohini U
Hi,

I have a hadoop cluster of size 5 and data of size 1 GB. I am running
a simple map reduce program which reads text data and outputs
sequence files.
I found some solutions to this problem suggesting setting overcommit
to 0 and increasing the ulimit.
I have memory overcommit set to 0 and ulimit set to unlimited. Even
with this, I keep getting the following error. Is anyone aware
of any workarounds for this?
java.io.IOException: Cannot run program "bash": java.io.IOException:
error=12, Cannot allocate memory
at java.lang.ProcessBuilder.start(ProcessBuilder.java:459)
at org.apache.hadoop.util.Shell.runCommand(Shell.java:149)
at org.apache.hadoop.util.Shell.run(Shell.java:134)
at org.apache.hadoop.fs.DF.getAvailable(DF.java:73)
at 
org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:296)
at 
org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:124)
at 
org.apache.hadoop.mapred.MapOutputFile.getSpillFileForWrite(MapOutputFile.java:107)
at 
org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:734)
at 
org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:694)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:220)
at 
org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2124)



Thanks
-Rohini


Re: hadoop cluster ssh username

2012-03-06 Thread Harsh J
Pat,

Your question seems to be in multiple parts if am right?

1. How do you manage configuration so that switching between local and
wider-cluster mode both work?

My suggestion would be to create two git branches in your conf
directory and switch them as you need, with simple git checkouts.

2. How do you get the start/stop scripts to ssh as hadoop-user instead
of using your user name?

In your masters and slaves file, instead of placing just a list of
"hostnames", place "hadoop-user@hostnames". That should do the trick.

If you want your SSH itself to use a different username when being
asked to connect to a hostname, follow the per-host configuration to
specify a username to automatically pick when provided a hostname:
http://technosophos.com/content/ssh-host-configuration

On Tue, Mar 6, 2012 at 11:42 PM, Pat Ferrel  wrote:
> I have a small cluster of servers that runs hadoop. I have a laptop that I'd
> like to use that cluster when it is available. I setup hadoop on the laptop
> so I can switch from running local to running on the cluster. Local works. I
> have setup passwordless ssh between all machines to work with 'hadoop-user'
> which is the linux username on the cluster machines so I can ssh from the
> laptop to the servers without a password thusly:
>
> ssh hadoop-user@master
>
> But my username on the laptop is pferrel, not hadoop-user so when running
> 'start-all.sh' it tries
>
> ssh pferrel@master
>
> How do I tell it to use the linux user 'hadoop-user'. I assume there is
> something in the config directory xml files that will do this?



-- 
Harsh J


Re: getting NullPointerException while running Word count example

2012-03-06 Thread Harsh J
Hi Sujit,

Please also tell us which version/distribution of Hadoop is this?

On Tue, Mar 6, 2012 at 11:27 PM, Sujit Dhamale  wrote:
> Hi,
>
> I am new to Hadoop., i install Hadoop as per
> http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/
>
>
> while running the Word count example i am getting NullPointerException.
>
> can some one please look in to this issue?
>
> Thanks in Advance !!!
>
>
>
> hduser@sujit:~/Desktop/hadoop$ bin/hadoop dfs -ls /user/hduser/data
> Found 3 items
> -rw-r--r--   1 hduser supergroup     674566 2012-03-06 23:04
> /user/hduser/data/pg20417.txt
> -rw-r--r--   1 hduser supergroup    1573150 2012-03-06 23:04
> /user/hduser/data/pg4300.txt
> -rw-r--r--   1 hduser supergroup    1423801 2012-03-06 23:04
> /user/hduser/data/pg5000.txt
>
> hduser@sujit:~/Desktop/hadoop$ bin/hadoop jar hadoop*examples*.jar
> wordcount /user/hduser/data /user/hduser/gutenberg-outputd
>
> 12/03/06 23:14:33 INFO input.FileInputFormat: Total input paths to process
> : 3
> 12/03/06 23:14:33 INFO mapred.JobClient: Running job: job_201203062221_0002
> 12/03/06 23:14:34 INFO mapred.JobClient:  map 0% reduce 0%
> 12/03/06 23:14:49 INFO mapred.JobClient:  map 66% reduce 0%
> 12/03/06 23:14:55 INFO mapred.JobClient:  map 100% reduce 0%
> 12/03/06 23:14:58 INFO mapred.JobClient: Task Id :
> attempt_201203062221_0002_r_00_0, Status : FAILED
> Error: java.lang.NullPointerException
>    at
> java.util.concurrent.ConcurrentHashMap.get(ConcurrentHashMap.java:768)
>    at
> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$GetMapEventsThread.getMapCompletionEvents(ReduceTask.java:2900)
>    at
> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$GetMapEventsThread.run(ReduceTask.java:2820)
>
> 12/03/06 23:15:07 INFO mapred.JobClient: Task Id :
> attempt_201203062221_0002_r_00_1, Status : FAILED
> Error: java.lang.NullPointerException
>    at
> java.util.concurrent.ConcurrentHashMap.get(ConcurrentHashMap.java:768)
>    at
> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$GetMapEventsThread.getMapCompletionEvents(ReduceTask.java:2900)
>    at
> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$GetMapEventsThread.run(ReduceTask.java:2820)
>
> 12/03/06 23:15:16 INFO mapred.JobClient: Task Id :
> attempt_201203062221_0002_r_00_2, Status : FAILED
> Error: java.lang.NullPointerException
>    at
> java.util.concurrent.ConcurrentHashMap.get(ConcurrentHashMap.java:768)
>    at
> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$GetMapEventsThread.getMapCompletionEvents(ReduceTask.java:2900)
>    at
> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$GetMapEventsThread.run(ReduceTask.java:2820)
>
> 12/03/06 23:15:31 INFO mapred.JobClient: Job complete: job_201203062221_0002
> 12/03/06 23:15:31 INFO mapred.JobClient: Counters: 20
> 12/03/06 23:15:31 INFO mapred.JobClient:   Job Counters
> 12/03/06 23:15:31 INFO mapred.JobClient:     Launched reduce tasks=4
> 12/03/06 23:15:31 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=22084
> 12/03/06 23:15:31 INFO mapred.JobClient:     Total time spent by all
> reduces waiting after reserving slots (ms)=0
> 12/03/06 23:15:31 INFO mapred.JobClient:     Total time spent by all maps
> waiting after reserving slots (ms)=0
> 12/03/06 23:15:31 INFO mapred.JobClient:     Launched map tasks=3
> 12/03/06 23:15:31 INFO mapred.JobClient:     Data-local map tasks=3
> 12/03/06 23:15:31 INFO mapred.JobClient:     Failed reduce tasks=1
> 12/03/06 23:15:31 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=16799
> 12/03/06 23:15:31 INFO mapred.JobClient:   FileSystemCounters
> 12/03/06 23:15:31 INFO mapred.JobClient:     FILE_BYTES_READ=740520
> 12/03/06 23:15:31 INFO mapred.JobClient:     HDFS_BYTES_READ=3671863
> 12/03/06 23:15:31 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=2278287
> 12/03/06 23:15:31 INFO mapred.JobClient:   File Input Format Counters
> 12/03/06 23:15:31 INFO mapred.JobClient:     Bytes Read=3671517
> 12/03/06 23:15:31 INFO mapred.JobClient:   Map-Reduce Framework
> 12/03/06 23:15:31 INFO mapred.JobClient:     Map output materialized
> bytes=1474341
> 12/03/06 23:15:31 INFO mapred.JobClient:     Combine output records=102322
> 12/03/06 23:15:31 INFO mapred.JobClient:     Map input records=77932
> 12/03/06 23:15:31 INFO mapred.JobClient:     Spilled Records=153640
> 12/03/06 23:15:31 INFO mapred.JobClient:     Map output bytes=6076095
> 12/03/06 23:15:31 INFO mapred.JobClient:     Combine input records=629172
> 12/03/06 23:15:31 INFO mapred.JobClient:     Map output records=629172
> 12/03/06 23:15:31 INFO mapred.JobClient:     SPLIT_RAW_BYTES=346
> hduser@sujit:~/Desktop/hadoop$



-- 
Harsh J


Hadoop runtime metrics

2012-03-06 Thread en-hui chang
Hi All,

We have a medium cluster running 70 nodes, using 0.20.2-cdh3u1. We collect 
run-time metrics through Ganglia. We found that certain metrics like 
waiting_reduces and tasks_failed_timeout are high, and it looks like the values 
are cumulative. Any thoughts on this will be helpful.

Thanks


Re: how is userlogs supposed to be cleaned up?

2012-03-06 Thread Arun C Murthy

On Mar 6, 2012, at 10:22 AM, Chris Curtin wrote:

> Hi,
> 
> We had a fun morning trying to figure out why our cluster was failing jobs,
> removing nodes from the cluster etc. The majority of the errors were
> something like:
> 
[snip]

> We are running CDH3u3.

You'll need to check with CDH lists. 

However, hadoop-1.0 (and prior releases, starting with hadoop-0.20.203) have 
mechanisms to clean up userlogs automatically; otherwise, as you've found out, 
operating large clusters (4k nodes) with millions of jobs per month is too painful.

Arun

--
Arun C. Murthy
Hortonworks Inc.
http://hortonworks.com/




Re: Hadoop runtime metrics

2012-03-06 Thread Arun C Murthy

On Mar 6, 2012, at 10:54 AM, en-hui chang wrote:

> Hi All,
> 
> We have a medium cluster running 70 nodes, using  0.20.2-cdh3u1. We collect 
> run-time metrics thru Ganglia. We found that the certain metrics like  
> waiting_reduces , tasks_failed_timeout is high and looks the values are 
> getting cumulative. Any thoughts on this will be helpful.


You'll need to check with CDH lists.

--
Arun C. Murthy
Hortonworks Inc.
http://hortonworks.com/




Re: HDFS Reporting Tools

2012-03-06 Thread Jamack, Peter
You could set up tools like Ganglia or Nagios to monitor and send alerts on
events and issues.
Within the Hadoop ecosystem, there are things like Vaidya, and maybe
Ambari (not sure, as I've not used it); Splunk even has a new beta of a
Shep/Splunk Hadoop monitoring app.

Peter Jamack

On 3/6/12 8:35 AM, "Oren Livne"  wrote:

>Dear All,
>
>We are maintaining a 60-node hadoop cluster for external users, and
>would like to be automatically notified via email when an HDFS crash or
>some other infrastructure failure occurs that is not due to a user
>programming error. We've been encountering such "soft" errors, where
>hadoop does not crash, but becomes very slow and job hand for a long
>time and fail.
>
>Are there existing tools that provide this capability? Or do we have to
>manually monitor the web services at on http://namenode and
>http://namenode:50030?
>
>Thank you so much,
>Oren
>
>-- 
>"We plan ahead, which means we don't do anything right now."
>   -- Valentine (Tremors)
>
>



Re: Comparison of Apache Pig Vs. Hadoop Streaming M/R

2012-03-06 Thread Russell Jurney
Rules of thumb IMO:

You should be using Pig in place of MR jobs at all times when performance
isn't absolutely crucial.  Writing unnecessary MR is needless technical
debt that you will regret as people are replaced and your organization
scales.  Pig gets it done in much less time.  If you need faster jobs, then
optimize your Pig, and if that doesn't work, put a single MAPREDUCE job
at the bottleneck.  Also, realize that it can be hard to actually beat
Pig's performance without experience.  Check that your MR job is actually
faster than Pig at the same load before assuming you can do better than Pig.

Streaming is good if your data doesn't easily map to tuples, you really
like using the abstractions of your favorite language's MR library, or you
are doing something weird like simulations/pure batch jobs (no MR).

If you're doing a lot of joins and performance is a problem - consider
doing fewer joins.  I would strongly suggest that you prioritize
de-normalizing and duplicating data over switching to raw MR jobs because
HIVE joins are slow.  MapReduce is slow at joins.  Programmer time is more
valuable than machine time.  If you're having to write tons of raw MR, then
get more machines.

On Fri, Mar 2, 2012 at 6:21 AM, Subir S  wrote:

> On Fri, Mar 2, 2012 at 12:38 PM, Harsh J  wrote:
>
> > On Fri, Mar 2, 2012 at 10:18 AM, Subir S 
> > wrote:
> > > Hello Folks,
> > >
> > > Are there any pointers to such comparisons between Apache Pig and
> Hadoop
> > > Streaming Map Reduce jobs?
> >
> > I do not see why you seek to compare these two. Pig offers a language
> > that lets you write data-flow operations and runs these statements as
> > a series of MR jobs for you automatically (Making it a great tool to
> > use to get data processing done really quick, without bothering with
> > code), while streaming is something you use to write non-Java, simple
> > MR jobs. Both have their own purposes.
> >
>
> Basically we are comparing these two to see the benefits and how much they
> help in improving the productive coding time, without jeopardizing the
> performance of MR jobs.
>
>
> > > Also there was a claim in our company that Pig performs better than Map
> > > Reduce jobs? Is this true? Are there any such benchmarks available
> >
> > Pig _runs_ MR jobs. It does do job design (and some data)
> > optimizations based on your queries, which is what may give it an edge
> > over designing elaborate flows of plain MR jobs with tools like
> > Oozie/JobControl (Which takes more time to do). But regardless, Pig
> > only makes it easy doing the same thing with Pig Latin statements for
> > you.
> >
>
> I knew that Pig runs MR jobs, as Hive runs MR jobs. But Hive jobs become
> pretty slow with lot of joins, which we can achieve faster with writing raw
> MR jobs. So with that context was trying to see how Pig runs MR jobs. Like
> for example what kind of projects should consider Pig. Say when we have a
> lot of Joins, which writing with plain MR jobs takes time. Thoughts?
>
> Thank you Harsh for your comments. They are helpful!
>
>
> >
> > --
> > Harsh J
> >
>



-- 
Russell Jurney twitter.com/rjurney russell.jur...@gmail.com datasyndrome.com


Re: hadoop cluster ssh username

2012-03-06 Thread Harsh J
Pat,

On Wed, Mar 7, 2012 at 4:10 AM, Pat Ferrel  wrote:
> Thanks, #2 below gets me partway.
>
> I can start-all.sh and stop-all.sh from the laptop and can fs -ls but
> copying gives me:
>
> Maclaurin:mahout-distribution-0.6 pferrel$ fs -copyFromLocal
> wikipedia-seqfiles/ wikipedia-seqfiles/
> 2012-03-06 13:45:04.225 java[7468:1903] Unable to load realm info from
> SCDynamicStore
> copyFromLocal: org.apache.hadoop.security.AccessControlException: Permission
> denied: user=pferrel, access=WRITE, inode="user":pat:supergroup:rwxr-xr-x

This seems like a totally different issue now, and deals with HDFS
permissions not cluster start/stop.

Yes, you have some files created (or some daemons running) with
username pat, while you try to access now as pferrel (your local
user). This you can't work around against or evade and will need to
fix via "hadoop fs -chmod/-chown" and such. You can disable
permissions if you do not need it though, simply set dfs.permissions
to false in NameNode's hdfs-site.xml and restart NN.

-- 
Harsh J


Fair Scheduler Problem

2012-03-06 Thread hao.wang
Hi All,
I encountered a problem using Cloudera Hadoop 0.20.2-cdh3u1. When I use 
the Fair Scheduler, I find the scheduler seems not to support preemption.
Can anybody tell me whether preemption is supported in this version?
This is my configuration:
 mapred-site.xml
<property>
  <name>mapred.jobtracker.taskScheduler</name>
  <value>org.apache.hadoop.mapred.FairScheduler</value>
</property>
<property>
  <name>mapred.fairscheduler.allocation.file</name>
  <value>/usr/lib/hadoop-0.20/conf/fair-scheduler.xml</value>
</property>
<property>
  <name>mapred.fairscheduler.preemption</name>
  <value>true</value>
</property>
<property>
  <name>mapred.fairscheduler.preemption.only.log</name>
  <value>true</value>
</property>
<property>
  <name>mapred.fairscheduler.preemption.interval</name>
  <value>15000</value>
</property>
<property>
  <name>mapred.fairscheduler.weightadjuster</name>
  <value>org.apache.hadoop.mapred.NewJobWeightBooster</value>
</property>
<property>
  <name>mapred.fairscheduler.sizebasedweight</name>
  <value>true</value>
</property>
fair-scheduler.xml 

   
  10
5
200
   80
   100
  30
1.0
  
  
   10
5
   80
   80
5
   30
   1.0
  
  
   10
  
20
   10
   30
   30


regards,

2012-03-07 



hao.wang 


Re: getting NullPointerException while running Word count example

2012-03-06 Thread Sujit Dhamale
Hadoop version: hadoop-0.20.203.0rc1.tar
Operating System: Ubuntu 11.10


On Wed, Mar 7, 2012 at 12:19 AM, Harsh J  wrote:

> Hi Sujit,
>
> Please also tell us which version/distribution of Hadoop is this?
>
> On Tue, Mar 6, 2012 at 11:27 PM, Sujit Dhamale 
> wrote:
> > Hi,
> >
> > I am new to Hadoop., i install Hadoop as per
> >
> http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/
> <
> http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluste
> >
> >
> >
> > while running Word cont example i am getting *NullPointerException
> >
> > *can some one please look in to this issue ?*
> >
> > *Thanks in Advance*  !!!
> >
> > *
> >
> >
> > duser@sujit:~/Desktop/hadoop$ bin/hadoop dfs -ls /user/hduser/data
> > Found 3 items
> > -rw-r--r--   1 hduser supergroup 674566 2012-03-06 23:04
> > /user/hduser/data/pg20417.txt
> > -rw-r--r--   1 hduser supergroup1573150 2012-03-06 23:04
> > /user/hduser/data/pg4300.txt
> > -rw-r--r--   1 hduser supergroup1423801 2012-03-06 23:04
> > /user/hduser/data/pg5000.txt
> >
> > hduser@sujit:~/Desktop/hadoop$ bin/hadoop jar hadoop*examples*.jar
> > wordcount /user/hduser/data /user/hduser/gutenberg-outputd
> >
> > 12/03/06 23:14:33 INFO input.FileInputFormat: Total input paths to
> process
> > : 3
> > 12/03/06 23:14:33 INFO mapred.JobClient: Running job:
> job_201203062221_0002
> > 12/03/06 23:14:34 INFO mapred.JobClient:  map 0% reduce 0%
> > 12/03/06 23:14:49 INFO mapred.JobClient:  map 66% reduce 0%
> > 12/03/06 23:14:55 INFO mapred.JobClient:  map 100% reduce 0%
> > 12/03/06 23:14:58 INFO mapred.JobClient: Task Id :
> > attempt_201203062221_0002_r_00_0, Status : FAILED
> > Error: java.lang.NullPointerException
> >at
> > java.util.concurrent.ConcurrentHashMap.get(ConcurrentHashMap.java:768)
> >at
> >
> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$GetMapEventsThread.getMapCompletionEvents(ReduceTask.java:2900)
> >at
> >
> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$GetMapEventsThread.run(ReduceTask.java:2820)
> >
> > 12/03/06 23:15:07 INFO mapred.JobClient: Task Id :
> > attempt_201203062221_0002_r_00_1, Status : FAILED
> > Error: java.lang.NullPointerException
> >at
> > java.util.concurrent.ConcurrentHashMap.get(ConcurrentHashMap.java:768)
> >at
> >
> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$GetMapEventsThread.getMapCompletionEvents(ReduceTask.java:2900)
> >at
> >
> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$GetMapEventsThread.run(ReduceTask.java:2820)
> >
> > 12/03/06 23:15:16 INFO mapred.JobClient: Task Id :
> > attempt_201203062221_0002_r_00_2, Status : FAILED
> > Error: java.lang.NullPointerException
> >at
> > java.util.concurrent.ConcurrentHashMap.get(ConcurrentHashMap.java:768)
> >at
> >
> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$GetMapEventsThread.getMapCompletionEvents(ReduceTask.java:2900)
> >at
> >
> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$GetMapEventsThread.run(ReduceTask.java:2820)
> >
> > 12/03/06 23:15:31 INFO mapred.JobClient: Job complete:
> job_201203062221_0002
> > 12/03/06 23:15:31 INFO mapred.JobClient: Counters: 20
> > 12/03/06 23:15:31 INFO mapred.JobClient:   Job Counters
> > 12/03/06 23:15:31 INFO mapred.JobClient: Launched reduce tasks=4
> > 12/03/06 23:15:31 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=22084
> > 12/03/06 23:15:31 INFO mapred.JobClient: Total time spent by all
> > reduces waiting after reserving slots (ms)=0
> > 12/03/06 23:15:31 INFO mapred.JobClient: Total time spent by all maps
> > waiting after reserving slots (ms)=0
> > 12/03/06 23:15:31 INFO mapred.JobClient: Launched map tasks=3
> > 12/03/06 23:15:31 INFO mapred.JobClient: Data-local map tasks=3
> > 12/03/06 23:15:31 INFO mapred.JobClient: Failed reduce tasks=1
> > 12/03/06 23:15:31 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=16799
> > 12/03/06 23:15:31 INFO mapred.JobClient:   FileSystemCounters
> > 12/03/06 23:15:31 INFO mapred.JobClient: FILE_BYTES_READ=740520
> > 12/03/06 23:15:31 INFO mapred.JobClient: HDFS_BYTES_READ=3671863
> > 12/03/06 23:15:31 INFO mapred.JobClient: FILE_BYTES_WRITTEN=2278287
> > 12/03/06 23:15:31 INFO mapred.JobClient:   File Input Format Counters
> > 12/03/06 23:15:31 INFO mapred.JobClient: Bytes Read=3671517
> > 12/03/06 23:15:31 INFO mapred.JobClient:   Map-Reduce Framework
> > 12/03/06 23:15:31 INFO mapred.JobClient: Map output materialized
> > bytes=1474341
> > 12/03/06 23:15:31 INFO mapred.JobClient: Combine output
> records=102322
> > 12/03/06 23:15:31 INFO mapred.JobClient: Map input records=77932
> > 12/03/06 23:15:31 INFO mapred.JobClient: Spilled Records=153640
> > 12/03/06 23:15:31 INFO mapred.JobClient: Map output bytes=6076095
> > 12/03/06 23:15:31 INFO mapred.JobClient: Combine input records=629172
> > 12/03/06 23:15:31 INFO mapred.JobClient: Map output records=629172
> > 12/03/06 23:15:31 INFO mapred.J

Re: Fair Scheduler Problem

2012-03-06 Thread Harsh J
Hello Hao,

Its best to submit CDH user queries to
https://groups.google.com/a/cloudera.org/group/cdh-user/topics
(cdh-u...@cloudera.org) where the majority of CDH users community
resides.

How do you determine that preemption did not/does not work? Preemption
between pools occurs if a pool's minShare isn't satisfied within
preemption-timeout seconds. In this case, it will preempt tasks from
other pools.

Your settings look alright on a high level. Does your log not carry
any preemption logs? What was your pool's share scenario when you
tried to observe if it works or not?

On Wed, Mar 7, 2012 at 8:35 AM, hao.wang  wrote:
> Hi ,All,
>    I encountered a problem in using Cloudera Hadoop 0.20.2-cdh3u1. When I use 
> the fair Scheduler I find the scheduler seems  not support preemption.
>    Can anybody tell me whether preemption is supported in this version?
>    This is my configration:
>  mapred-site.xml
> 
>  mapred.jobtracker.taskScheduler
>  org.apache.hadoop.mapred.FairScheduler
> 
> 
>      mapred.fairscheduler.allocation.file
>      /usr/lib/hadoop-0.20/conf/fair-scheduler.xml
> 
> 
> mapred.fairscheduler.preemption
> true
> 
> 
> mapred.fairscheduler.preemption.only.log
> true
> 
> 
> mapred.fairscheduler.preemption.interval
> 15000
> 
> 
>  mapred.fairscheduler.weightadjuster
>  org.apache.hadoop.mapred.NewJobWeightBooster
> 
> 
>  mapred.fairscheduler.sizebasedweight
>  true
> 
> fair-scheduler.xml
> 
>   
>      10
>    5
>    200
>   80
>       100
>      30
>        1.0
>  
>  
>       10
>    5
>   80
>   80
>        5
>       30
>       1.0
>  
>  
>       10
>  
>    20
>   10
>   30
>   30
> 
>
> regards,
>
> 2012-03-07
>
>
>
> hao.wang



-- 
Harsh J


Re: Re: Fair Scheduler Problem

2012-03-06 Thread hao.wang
Hi, thanks for your reply!
I have solved this problem by setting "mapred.fairscheduler.preemption.only.log" 
to "false". The preemption works!
But I don't understand why it cannot be set to "true". Is it a bug?

regards,

2012-03-07 



hao.wang 



From: Harsh J 
Date: 2012-03-07 14:14:05 
To: common-user 
CC: 
Subject: Re: Fair Scheduler Problem 
 
Hello Hao,
Its best to submit CDH user queries to
https://groups.google.com/a/cloudera.org/group/cdh-user/topics
(cdh-u...@cloudera.org) where the majority of CDH users community
resides.
How do you determine that preemption did not/does not work? Preemption
between pools occurs if a pool's minShare isn't satisfied within
preemption-timeout seconds. In this case, it will preempt tasks from
other pools.
Your settings look alright on a high level. Does your log not carry
any preemption logs? What was your pool's share scenario when you
tried to observe if it works or not?
On Wed, Mar 7, 2012 at 8:35 AM, hao.wang  wrote:
> Hi ,All,
>I encountered a problem in using Cloudera Hadoop 0.20.2-cdh3u1. When I use 
> the fair Scheduler I find the scheduler seems  not support preemption.
>Can anybody tell me whether preemption is supported in this version?
>This is my configration:
>  mapred-site.xml
> 
>  mapred.jobtracker.taskScheduler
>  org.apache.hadoop.mapred.FairScheduler
> 
> 
>  mapred.fairscheduler.allocation.file
>  /usr/lib/hadoop-0.20/conf/fair-scheduler.xml
> 
> 
> mapred.fairscheduler.preemption
> true
> 
> 
> mapred.fairscheduler.preemption.only.log
> true
> 
> 
> mapred.fairscheduler.preemption.interval
> 15000
> 
> 
>  mapred.fairscheduler.weightadjuster
>  org.apache.hadoop.mapred.NewJobWeightBooster
> 
> 
>  mapred.fairscheduler.sizebasedweight
>  true
> 
> fair-scheduler.xml
> 
>   
>  10
>5
>200
>   80
>   100
>  30
>1.0
>  
>  
>   10
>5
>   80
>   80
>5
>   30
>   1.0
>  
>  
>   10
>  
>20
>   10
>   30
>   30
> 
>
> regards,
>
> 2012-03-07
>
>
>
> hao.wang
-- 
Harsh J


Re: Re: Fair Scheduler Problem

2012-03-06 Thread Harsh J
Ah my bad that I missed it when reading your doc.

Yes that property being true would make it only LOG about preemption
scenarios, not do preemption.

On Wed, Mar 7, 2012 at 12:05 PM, hao.wang  wrote:
> Hi, Thanks for your reply!
> I have solved this problem by setting
> "mapred.fairscheduler.preemption.only.log " to "false". The preemption
> works!
> But I don't know why can not set "mapred.fairscheduler.preemption.only.log "
> to "true". Is it a bug?
>
> regards,
>
>
>
> 2012-03-07
> 
> hao.wang
> 
> From: Harsh J
> Date: 2012-03-07 14:14:05
> To: common-user
> CC:
> Subject: Re: Fair Scheduler Problem
> Hello Hao,
> Its best to submit CDH user queries to
> https://groups.google.com/a/cloudera.org/group/cdh-user/topics
> (cdh-u...@cloudera.org) where the majority of CDH users community
> resides.
> How do you determine that preemption did not/does not work? Preemption
> between pools occurs if a pool's minShare isn't satisfied within
> preemption-timeout seconds. In this case, it will preempt tasks from
> other pools.
> Your settings look alright on a high level. Does your log not carry
> any preemption logs? What was your pool's share scenario when you
> tried to observe if it works or not?
> On Wed, Mar 7, 2012 at 8:35 AM, hao.wang  wrote:
>> Hi ,All,
>>I encountered a problem in using Cloudera Hadoop 0.20.2-cdh3u1. When I 
>> use the fair Scheduler I find the scheduler seems  not support preemption.
>>Can anybody tell me whether preemption is supported in this version?
>>This is my configration:
>>  mapred-site.xml
>> 
>>  mapred.jobtracker.taskScheduler
>>  org.apache.hadoop.mapred.FairScheduler
>> 
>> 
>>  mapred.fairscheduler.allocation.file
>>  /usr/lib/hadoop-0.20/conf/fair-scheduler.xml
>> 
>> 
>> mapred.fairscheduler.preemption
>> true
>> 
>> 
>> mapred.fairscheduler.preemption.only.log
>> true
>> 
>> 
>> mapred.fairscheduler.preemption.interval
>> 15000
>> 
>> 
>>  mapred.fairscheduler.weightadjuster
>>  org.apache.hadoop.mapred.NewJobWeightBooster
>> 
>> 
>>  mapred.fairscheduler.sizebasedweight
>>  true
>> 
>> fair-scheduler.xml
>> 
>>   
>>  10
>>5
>>200
>>   80
>>   100
>>  30
>>1.0
>>  
>>  
>>   10
>>5
>>   80
>>   80
>>5
>>   30
>>   1.0
>>  
>>  
>>   10
>>  
>>20
>>   10
>>   30
>>   30
>> 
>>
>> regards,
>>
>> 2012-03-07
>>
>>
>>
>> hao.wang
> --
> Harsh J



-- 
Harsh J


Re: how is userlogs supposed to be cleaned up?

2012-03-06 Thread Joep Rottinghuis
Aside from cleanup, it seems like you are running into the maximum number of 
subdirectories per directory on ext3.

Joep

Sent from my iPhone

On Mar 6, 2012, at 10:22 AM, Chris Curtin  wrote:

> Hi,
> 
> We had a fun morning trying to figure out why our cluster was failing jobs,
> removing nodes from the cluster etc. The majority of the errors were
> something like:
> 
> 
> Error initializing attempt_201203061035_0047_m_02_0:
> 
> org.apache.hadoop.util.Shell$ExitCodeException: chmod: cannot access
> `/disk1/userlogs/job_201203061035_0047': No such file or directory
> 
> 
> 
>at org.apache.hadoop.util.Shell.runCommand(Shell.java:255)
> 
>at org.apache.hadoop.util.Shell.run(Shell.java:182)
> 
>at
> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:375)
> 
>at org.apache.hadoop.util.Shell.execCommand(Shell.java:461)
> 
>at org.apache.hadoop.util.Shell.execCommand(Shell.java:444)
> 
>at
> org.apache.hadoop.fs.RawLocalFileSystem.execCommand(RawLocalFileSystem.java:533)
> 
>at
> org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:524)
> 
>at
> org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:344)
> 
>at
> org.apache.hadoop.mapred.JobLocalizer.initializeJobLogDir(JobLocalizer.java:240)
> 
>at
> org.apache.hadoop.mapred.DefaultTaskController.initializeJob(DefaultTaskController.java:216)
> 
>at
> org.apache.hadoop.mapred.TaskTracker$4.run(TaskTracker.java:1352)
> 
> 
> 
> Finally we shutdown the entire cluster and found that the 'userlogs'
> directory on the failed nodes had 30,000+ directories and the 'live' nodes
> 25,000+. Looking at creation timestamps it looks like around adding
> 30,000th directory the node falls over.
> 
> 
> 
> Many of the directorys are weeks old and a few were months old.
> 
> 
> 
> Deleting ALL the directories on all the nodes allowed us to bring the
> cluster up and things to run again. (Some users are claiming it is running
> faster now?)
> 
> 
> 
> Our question: what is supposed to be cleaning up these directories? How
> often is that process or step taken?
> 
> 
> 
> We are running CDH3u3.
> 
> 
> 
> Thanks,
> 
> 
> 
> Chris