fairscheduler : group.name | Please edit patch to work for 0.20.205
Can someone have a look at the patch MAPREDUCE-2457 and see if it can be modified to work for 0.20.205? I am very new to Java and have no idea what's going on in that patch. If you have any pointers for me, I will see if I can do it on my own.

Thanks, Austin

On Fri, Mar 2, 2012 at 7:15 PM, Austin Chungath austi...@gmail.com wrote:

I tried the patch MAPREDUCE-2457 but it didn't work for my Hadoop 0.20.205. Are you sure this patch will work for 0.20.205? According to the description, the patch works for 0.21 and 0.22, and it says that 0.20 supports group.name without this patch... So does this patch also apply to 0.20.205?

Thanks, Austin

On Thu, Mar 1, 2012 at 11:24 PM, Harsh J ha...@cloudera.com wrote:

The group.name scheduler support was introduced in https://issues.apache.org/jira/browse/HADOOP-3892 but may have been broken by the security changes present in 0.20.205. You'll need the fix presented in https://issues.apache.org/jira/browse/MAPREDUCE-2457 to have group.name support.

On Thu, Mar 1, 2012 at 6:42 PM, Austin Chungath austi...@gmail.com wrote:

I am running the fair scheduler on Hadoop 0.20.205.0: http://hadoop.apache.org/common/docs/r0.20.205.0/fair_scheduler.html

The above page talks about the property *mapred.fairscheduler.poolnameproperty*, which I can set to *group.name*. The default is user.name: when a user submits a job, the fair scheduler assigns the job to a pool named after that user. I am trying to change it to group.name so that the job is submitted to a pool named after the user's Linux group. That way, all jobs from any user in a given group go to the same pool instead of an individual pool for every user. But *group.name* doesn't seem to work; has anyone tried this before? *user.name* and *mapred.job.queue.name* work. Is group.name supported in 0.20.205.0? I don't see it mentioned in the docs.

Thanks, Austin

-- Harsh J
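For reference, the pool-name property discussed in this thread would normally be set in mapred-site.xml, along the lines of the sketch below. The property names come from the thread itself; the scheduler class value and file placement are assumptions about a typical 0.20.x setup, and, as Harsh notes, group.name only behaves as expected on releases carrying the HADOOP-3892 / MAPREDUCE-2457 behavior.

```xml
<!-- mapred-site.xml (sketch): enable the fair scheduler and route each
     job to a pool named after the submitting user's primary group
     instead of the default user.name. -->
<property>
  <name>mapred.jobtracker.taskScheduler</name>
  <value>org.apache.hadoop.mapred.FairScheduler</value>
</property>
<property>
  <name>mapred.fairscheduler.poolnameproperty</name>
  <value>group.name</value>
</property>
```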
Re: Setting up Hadoop single node setup on Mac OS X
On 02/27/2012 11:53 AM, W.P. McNeill wrote:

You don't need any virtualization. Mac OS X is Linux and runs Hadoop as is.

Nitpick: OS X is not Linux; it descends from NeXTSTEP and is built on the Mach kernel, a POSIX-compliant system that is distinct from Linux.
Re: AWS MapReduce
AWS MapReduce (EMR) does not use S3 for its HDFS persistence. If it did, your S3 billing would be massive :) EMR reads all input jar files and input data from S3, but it copies these files down to its local disks. It then starts the MR process, doing all HDFS reads and writes against the local disks. At the end of the MR job, it copies the job output and all process logs to S3, and then tears down the VM instances.

You can see this for yourself if you spin up a small EMR cluster but turn off the configuration flag that kills the VMs at the end of the MR job. Then look at the Hadoop configuration files to see how Hadoop is configured.

I really like EMR. Amazon has done a lot of work to optimize the Hadoop configurations and VM instance AMIs to execute MR jobs fairly efficiently on a VM cluster. I had to do a lot of (expensive) trial-and-error work to figure out an optimal Hadoop/VM configuration that runs our MR jobs without crashing or timing out. The only reason we didn't standardize on EMR was that it strongly bound our code base and process to EMR for Hadoop processing, versus a flexible infrastructure that could use a local cluster or a cluster on a different cloud provider.

On Sun, Mar 4, 2012 at 8:51 AM, Mohit Anchlia mohitanch...@gmail.com wrote:

As far as I see in the docs, it looks like you could also use HDFS instead of S3. But what I am not sure of is whether these are local disks or EBS.

On Sun, Mar 4, 2012 at 2:27 AM, Hannes Carl Meyer hannesc...@googlemail.com wrote:

Hi, yes, it's loaded from S3. IMHO Amazon AWS Map-Reduce is pretty slow. The setup is done pretty fast and there are some configuration parameters you can tune, for example block sizes etc., but in the end, IMHO, setting up EC2 instances by copying images is the better alternative.

Kind Regards, Hannes

On Sun, Mar 4, 2012 at 2:31 AM, Mohit Anchlia mohitanch...@gmail.com wrote:

I think I found the answer to this question. However, it's still not clear whether HDFS is on local disks or EBS volumes. Does anyone know?

On Sat, Mar 3, 2012 at 3:54 PM, Mohit Anchlia mohitanch...@gmail.com wrote:

Just want to check how many are using AWS MapReduce and understand the pros and cons of Amazon's MapReduce machines. Is it true that these map reduce machines are really reading and writing from S3 instead of local disks? Has anyone found issues with Amazon MapReduce, and how does it compare with running MapReduce on locally attached disks versus S3?

---
www.informera.de Hadoop Big Data Services

-- Thanks, John C
Re: AWS MapReduce
On Mon, Mar 5, 2012 at 7:40 AM, John Conwell j...@iamjohn.me wrote:

AWS MapReduce (EMR) does not use S3 for its HDFS persistence. If it did, your S3 billing would be massive :) EMR reads all input jar files and input data from S3, but it copies these files down to its local disks. It then starts the MR process, doing all HDFS reads and writes against the local disks. At the end of the MR job, it copies the job output and all process logs to S3, and then tears down the VM instances. [...]

Thanks for your input. I am assuming HDFS is created on ephemeral disks and not EBS. Also, is it possible to share some of your findings?

On Sun, Mar 4, 2012 at 8:51 AM, Mohit Anchlia mohitanch...@gmail.com wrote:

As far as I see in the docs, it looks like you could also use HDFS instead of S3. But what I am not sure of is whether these are local disks or EBS. [...]
Re: Custom Seq File Loader: ClassNotFoundException
Hi Madhu, it has the following line:

    TermDocFreqArrayWritable () {}

but I'll try it with public access in case it's being called from outside my package.

Thank you, Mark

On Sun, Mar 4, 2012 at 9:55 PM, madhu phatak phatak@gmail.com wrote:

Hi, please make sure that your CustomWritable has a default constructor.

On Sat, Mar 3, 2012 at 4:56 AM, Mark question markq2...@gmail.com wrote:

Hello, I'm trying to debug my code through Eclipse, which worked fine with the bundled Hadoop applications (e.g. wordcount), but as soon as I run it on my application with my custom sequence input file/types, I get:

    java.lang.RuntimeException: java.io.IOException: WritableName can't load class
        at SequenceFile$Reader.getValueClass(SequenceFile.java)

because my value class is custom. In other words, how can I add/build my CustomWritable class so that it sits alongside Hadoop's LongWritable, IntWritable, etc.? Has anyone used Eclipse for this?

Mark

-- Join me at http://hadoopworkshop.eventbrite.com/
Re: Comparison of Apache Pig Vs. Hadoop Streaming M/R
Streaming is good for simulation: long-running map-only processes, where Pig doesn't really help and it is simple to fire off a streaming process. You do have to set some options so they can take a long time to return and report counters.

Russell Jurney http://datasyndrome.com

On Mar 5, 2012, at 12:38 PM, Eli Finkelshteyn iefin...@gmail.com wrote:

I'm really interested in this as well. I have trouble seeing a really good use case for streaming map-reduce. Is there something I can do in streaming that I can't do in Pig? If I want to re-use previously written Python functions from my code base, I can do that in Pig as much as in streaming, and from what I've experienced thus far, Python streaming seems to go slower than or at the same speed as Pig. So why would I want to write a whole lot of harder-to-read mappers and reducers when I can write equally fast, shorter, and clearer code in Pig? Maybe it's obvious, but currently I just can't think of the right use case.

Eli

On 3/2/12 9:21 AM, Subir S wrote:

On Fri, Mar 2, 2012 at 12:38 PM, Harsh J ha...@cloudera.com wrote:

On Fri, Mar 2, 2012 at 10:18 AM, Subir S subir.sasiku...@gmail.com wrote:

Hello folks, are there any pointers to comparisons between Apache Pig and Hadoop Streaming map reduce jobs?

I do not see why you seek to compare these two. Pig offers a language that lets you write data-flow operations and runs these statements as a series of MR jobs for you automatically (making it a great tool for getting data processing done really quickly, without bothering with code), while streaming is something you use to write non-Java, simple MR jobs. Both have their own purposes.

Basically we are comparing these two to see the benefits and how much they help in improving productive coding time, without jeopardizing the performance of MR jobs.

Also, there was a claim in our company that Pig performs better than plain Map Reduce jobs. Is this true? Are there any such benchmarks available?

Pig _runs_ MR jobs. It does do job-design (and some data) optimizations based on your queries, which is what may give it an edge over designing elaborate flows of plain MR jobs with tools like Oozie/JobControl (which takes more time to do). But regardless, Pig only makes doing the same thing easy, via Pig Latin statements.

I knew that Pig runs MR jobs, as Hive runs MR jobs. But Hive jobs become pretty slow with a lot of joins, which we can do faster by writing raw MR jobs. With that context, I was trying to see how Pig runs MR jobs: for example, what kind of projects should consider Pig, say when we have a lot of joins that take time to write as plain MR jobs. Thoughts?

Thank you Harsh for your comments. They are helpful!

-- Harsh J
Re: Custom Seq File Loader: ClassNotFoundException
Unfortunately, public didn't change my error... Any other ideas? Has anyone run Hadoop in Eclipse with custom sequence inputs?

Thank you, Mark

On Mon, Mar 5, 2012 at 9:58 AM, Mark question markq2...@gmail.com wrote:

Hi Madhu, it has the following line:

    TermDocFreqArrayWritable () {}

but I'll try it with public access in case it's being called from outside my package. [...]

On Sun, Mar 4, 2012 at 9:55 PM, madhu phatak phatak@gmail.com wrote:

Hi, please make sure that your CustomWritable has a default constructor. [...]

-- Join me at http://hadoopworkshop.eventbrite.com/
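The default-constructor requirement Madhu mentions can be seen with plain JDK reflection, which is essentially how Hadoop instantiates a Writable when deserializing a sequence file value (via ReflectionUtils.newInstance). The sketch below is JDK-only; the class names are invented for illustration and do not come from the thread. It shows why a package-private constructor like the one in TermDocFreqArrayWritable is invisible to code outside the package, while a public one is not.

```java
import java.lang.reflect.Constructor;

public class WritableCtorDemo {
    // Stand-in for a custom Writable whose no-arg constructor is
    // package-private, like `TermDocFreqArrayWritable () {}`.
    static class PackagePrivateCtor {
        PackagePrivateCtor() {}
    }

    // Stand-in for the fixed version with a public no-arg constructor.
    public static class PublicCtor {
        public PublicCtor() {}
    }

    public static void main(String[] args) throws Exception {
        // getConstructor() only returns *public* constructors, which is
        // what a framework living in another package effectively needs.
        Constructor<PublicCtor> ok = PublicCtor.class.getConstructor();
        System.out.println("public ctor found: " + (ok != null));

        try {
            // The package-private constructor is not visible here,
            // so this throws NoSuchMethodException.
            PackagePrivateCtor.class.getConstructor();
            System.out.println("package-private ctor found");
        } catch (NoSuchMethodException e) {
            System.out.println("package-private ctor NOT visible");
        }
    }
}
```

Note that a separate issue can coexist with this one: if the class itself is not on the task's classpath (or not registered under the name stored in the sequence file header), WritableName fails with the "can't load class" error regardless of constructor visibility.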
Re: OutOfMemoryError: unable to create new native thread
Hi Rohini,

I ran into a similar problem just yesterday. In my case the max process count (ulimit -u) was set to 1024, which was too small; when I increased it, the problem went away. But you said ulimit on the machine is set to unlimited, so I'm not sure whether this will help :) Also check `cat /proc/sys/kernel/threads-max`; this is a system-wide setting for the total number of threads.

On Tue, Mar 6, 2012 at 4:30 AM, Rohini U rohin...@gmail.com wrote:

Hi all, I am running a map reduce job that uses around 120 MB of data, and I get this out-of-memory error. Ulimit on the machine is set to unlimited. Any ideas on how to fix this? The stack trace is given below:

    Exception in thread "main" org.apache.hadoop.ipc.RemoteException: java.io.IOException: java.lang.OutOfMemoryError: unable to create new native thread
        at java.lang.Thread.start0(Native Method)
        at java.lang.Thread.start(Thread.java:597)
        at org.apache.hadoop.mapred.JvmManager$JvmManagerForType$JvmRunner.kill(JvmManager.java:553)
        at org.apache.hadoop.mapred.JvmManager$JvmManagerForType.killJvmRunner(JvmManager.java:317)
        at org.apache.hadoop.mapred.JvmManager$JvmManagerForType.killJvm(JvmManager.java:297)
        at org.apache.hadoop.mapred.JvmManager$JvmManagerForType.taskKilled(JvmManager.java:289)
        at org.apache.hadoop.mapred.JvmManager.taskKilled(JvmManager.java:158)
        at org.apache.hadoop.mapred.TaskRunner.kill(TaskRunner.java:782)
        at org.apache.hadoop.mapred.TaskTracker$TaskInProgress.kill(TaskTracker.java:2938)
        at org.apache.hadoop.mapred.TaskTracker$TaskInProgress.jobHasFinished(TaskTracker.java:2910)
        at org.apache.hadoop.mapred.TaskTracker.purgeTask(TaskTracker.java:1974)
        at org.apache.hadoop.mapred.TaskTracker.fatalError(TaskTracker.java:3327)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:557)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1434)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1430)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1127)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1428)
        at org.apache.hadoop.ipc.Client.call(Client.java:1107)
        at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:226)
        at $Proxy0.fatalError(Unknown Source)
        at org.apache.hadoop.mapred.Child.main(Child.java:325)

Thanks, Rohini

-- Kindest Regards, Clay Chiang
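As a quick way to inspect the system-wide limit Clay mentions without shelling out, the same value can be read from procfs. This is a small JDK-only sketch of my own (not from the thread); it is Linux-specific and falls back gracefully elsewhere.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class ThreadsMaxCheck {
    // Returns the kernel's system-wide thread limit from
    // /proc/sys/kernel/threads-max, or -1 if the procfs entry is
    // unavailable (e.g. on non-Linux systems).
    public static long threadsMax() {
        Path p = Paths.get("/proc/sys/kernel/threads-max");
        try {
            return Long.parseLong(Files.readAllLines(p).get(0).trim());
        } catch (Exception e) {
            return -1L;
        }
    }

    public static void main(String[] args) {
        long max = threadsMax();
        if (max > 0) {
            System.out.println("kernel threads-max = " + max);
        } else {
            System.out.println("threads-max unavailable on this system");
        }
    }
}
```

Remember that "unable to create new native thread" is about native resources (process/thread count limits, per-thread stack memory), not the Java heap, so raising -Xmx typically does not help and can even make it worse.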
Re: Java Heap space error
All I see in the logs is:

    2012-03-05 17:26:36,889 FATAL org.apache.hadoop.mapred.TaskTracker: Task: attempt_201203051722_0001_m_30_1 - Killed : Java heap space

It looks like the task tracker is killing the tasks; I'm not sure why. I increased the heap from 512 MB to 1 GB and it still fails.

On Mon, Mar 5, 2012 at 5:03 PM, Mohit Anchlia mohitanch...@gmail.com wrote:

I currently have mapred.child.java.opts set to 512 MB and I am getting heap space errors. How should I go about debugging heap space issues?
Re: Java Heap space error
Sorry for the multiple emails. I did find:

    2012-03-05 17:26:35,636 INFO org.apache.pig.impl.util.SpillableMemoryManager: first memory handler call- Usage threshold init = 715849728(699072K) used = 575921696(562423K) committed = 715849728(699072K) max = 715849728(699072K)
    2012-03-05 17:26:35,719 INFO org.apache.pig.impl.util.SpillableMemoryManager: Spilled an estimate of 7816154 bytes from 1 objects. init = 715849728(699072K) used = 575921696(562423K) committed = 715849728(699072K) max = 715849728(699072K)
    2012-03-05 17:26:36,881 INFO org.apache.pig.impl.util.SpillableMemoryManager: first memory handler call - Collection threshold init = 715849728(699072K) used = 358720384(350312K) committed = 715849728(699072K) max = 715849728(699072K)
    2012-03-05 17:26:36,885 INFO org.apache.hadoop.mapred.TaskLogsTruncater: Initializing logs' truncater with mapRetainSize=-1 and reduceRetainSize=-1
    2012-03-05 17:26:36,888 FATAL org.apache.hadoop.mapred.Child: Error running child : java.lang.OutOfMemoryError: Java heap space
        at java.nio.HeapCharBuffer.<init>(HeapCharBuffer.java:39)
        at java.nio.CharBuffer.allocate(CharBuffer.java:312)
        at java.nio.charset.CharsetDecoder.decode(CharsetDecoder.java:760)
        at org.apache.hadoop.io.Text.decode(Text.java:350)
        at org.apache.hadoop.io.Text.decode(Text.java:327)
        at org.apache.hadoop.io.Text.toString(Text.java:254)
        at org.apache.pig.piggybank.storage.SequenceFileLoader.translateWritableToPigDataType(SequenceFileLoader.java:105)
        at org.apache.pig.piggybank.storage.SequenceFileLoader.getNext(SequenceFileLoader.java:139)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigRecordReader.nextKeyValue(PigRecordReader.java:187)
        at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:456)
        at org.apache.hadoop.mapreduce.MapContext.nextKeyValue(MapContext.java:67)
        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:143)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:647)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:323)
        at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1157)
        at org.apache.hadoop.mapred.Child.main(Child.java:264)

On Mon, Mar 5, 2012 at 5:46 PM, Mohit Anchlia mohitanch...@gmail.com wrote:

All I see in the logs is:

    2012-03-05 17:26:36,889 FATAL org.apache.hadoop.mapred.TaskTracker: Task: attempt_201203051722_0001_m_30_1 - Killed : Java heap space

It looks like the task tracker is killing the tasks; I'm not sure why. I increased the heap from 512 MB to 1 GB and it still fails.

On Mon, Mar 5, 2012 at 5:03 PM, Mohit Anchlia mohitanch...@gmail.com wrote:

I currently have mapred.child.java.opts set to 512 MB and I am getting heap space errors. How should I go about debugging heap space issues?
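For anyone hitting the same trace: in 0.20/1.x-era Hadoop, the heap of each map/reduce child JVM is governed by mapred.child.java.opts. A sketch of the 1 GB setting described in this thread, with the caveat that the file placement in mapred-site.xml (rather than per-job configuration) is an assumption about the poster's setup:

```xml
<!-- mapred-site.xml (sketch): give each map/reduce child JVM a 1 GB
     heap, matching the 512 MB -> 1 GB increase attempted above. -->
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx1024m</value>
</property>
```

Note the trace points at SequenceFileLoader.translateWritableToPigDataType decoding a Text value, so a single very large record can exhaust the heap no matter how high it is set; checking record sizes in the input sequence file is a useful second step.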
hadoop 1.0 / HOD or CloneZilla?
Hi all, I have experience with Hadoop 0.20.204 on a 3-machine pilot cluster, and now I'm trying to set up a real cluster on 32 Linux machines. I have some questions:

1. Is Hadoop 1.0 stable? On the Hadoop site this version is indicated as a beta release.

2. As you know, installing and setting up Hadoop on all 32 machines separately is not a good idea, so what can I do?
   1. Use Hadoop on Demand (HOD)?
   2. Or use an OS image replication tool such as Clonezilla? I think this method is better because, in addition to Hadoop, I can clone other settings such as SSH or Samba to all machines.

Let me know your ideas.

B.S, Masoud.
Re: why does my mapper class reads my input file twice?
It's your use of the mapred.input.dir property, which is a reserved name in the framework (it's what FileInputFormat uses). You have a config you extract a path from:

    Path input = new Path(conf.get("mapred.input.dir"));

Then you do:

    FileInputFormat.addInputPath(job, input);

which internally simply appends a path to a config prop called mapred.input.dir. Hence your job gets launched with two input files (the very same): one added by the default Tool-provided configuration (because of your -Dmapred.input.dir) and the other added by you.

Fix the input path line to use a different config key:

    Path input = new Path(conf.get("input.path"));

And run the job as:

    hadoop jar dummy-0.1.jar dummy.MyJob -Dinput.path=data/dummy.txt -Dmapred.output.dir=result

On Tue, Mar 6, 2012 at 9:03 AM, Jane Wayne jane.wayne2...@gmail.com wrote:

I have code that reads in a text file. I notice that each line in the text file is somehow being read twice. Why is this happening? My mapper class looks like the following:

    public class MyMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
        private static final Log _log = LogFactory.getLog(MyMapper.class);

        @Override
        public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            String s = (new StringBuilder()).append(value.toString()).append("m").toString();
            context.write(key, new Text(s));
            _log.debug(key.toString() + "=" + s);
        }
    }

My reducer class looks like the following:

    public class MyReducer extends Reducer<LongWritable, Text, LongWritable, Text> {
        private static final Log _log = LogFactory.getLog(MyReducer.class);

        @Override
        public void reduce(LongWritable key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
            for (Iterator<Text> it = values.iterator(); it.hasNext();) {
                Text txt = it.next();
                String s = (new StringBuilder()).append(txt.toString()).append("r").toString();
                context.write(key, new Text(s));
                _log.debug(key.toString() + "=" + s);
            }
        }
    }

My job class looks like the following:

    public class MyJob extends Configured implements Tool {
        public static void main(String[] args) throws Exception {
            ToolRunner.run(new Configuration(), new MyJob(), args);
        }

        @Override
        public int run(String[] args) throws Exception {
            Configuration conf = getConf();
            Path input = new Path(conf.get("mapred.input.dir"));
            Path output = new Path(conf.get("mapred.output.dir"));
            Job job = new Job(conf, "dummy job");
            job.setMapOutputKeyClass(LongWritable.class);
            job.setMapOutputValueClass(Text.class);
            job.setOutputKeyClass(LongWritable.class);
            job.setOutputValueClass(Text.class);
            job.setMapperClass(MyMapper.class);
            job.setReducerClass(MyReducer.class);
            FileInputFormat.addInputPath(job, input);
            FileOutputFormat.setOutputPath(job, output);
            job.setJarByClass(MyJob.class);
            return job.waitForCompletion(true) ? 0 : 1;
        }
    }

The text file that I am trying to read in looks like the following. As you can see, there are 9 lines:

    T, T
    T, T
    T, T
    F, F
    F, F
    F, F
    F, F
    T, F
    F, T

The output file that I get after my job runs looks like the following. As you can see, there are 18 lines; each key is emitted twice from the mapper to the reducer:

    0   T, Tmr
    0   T, Tmr
    6   T, Tmr
    6   T, Tmr
    12  T, Tmr
    12  T, Tmr
    18  F, Fmr
    18  F, Fmr
    24  F, Fmr
    24  F, Fmr
    30  F, Fmr
    30  F, Fmr
    36  F, Fmr
    36  F, Fmr
    42  T, Fmr
    42  T, Fmr
    48  F, Tmr
    48  F, Tmr

The way I execute my job is as follows (cygwin + hadoop 0.20.2):

    hadoop jar dummy-0.1.jar dummy.MyJob -Dmapred.input.dir=data/dummy.txt -Dmapred.output.dir=result

Originally this happened when I read in a sequence file, but even for a text file the problem is still happening. Is it the way I have set up my Job?

-- Harsh J
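Harsh's diagnosis can be modeled without Hadoop on the classpath: Hadoop's Configuration stores input paths as one comma-separated string under mapred.input.dir, and FileInputFormat.addInputPath appends to whatever is already there. The JDK-only sketch below is a simplified stand-in for that mechanism (the HashMap plays the role of Configuration; the names are illustrative, not Hadoop's actual internals).

```java
import java.util.HashMap;
import java.util.Map;

public class DuplicateInputDemo {
    static final String KEY = "mapred.input.dir";

    // Mimics FileInputFormat.addInputPath: append the new path to the
    // existing comma-separated list rather than replacing it.
    static void addInputPath(Map<String, String> conf, String path) {
        String old = conf.get(KEY);
        conf.put(KEY, old == null ? path : old + "," + path);
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        // -Dmapred.input.dir=data/dummy.txt on the command line already
        // populates the reserved property via the Tool/GenericOptionsParser...
        conf.put(KEY, "data/dummy.txt");
        // ...and the job's own addInputPath(job, input) then appends the
        // very same path a second time.
        addInputPath(conf, "data/dummy.txt");
        // The job therefore sees two input files, so every record is
        // read (and emitted) twice.
        System.out.println(conf.get(KEY));
        System.out.println("inputs: " + conf.get(KEY).split(",").length);
    }
}
```

Running this prints `data/dummy.txt,data/dummy.txt` and `inputs: 2`, which is exactly the doubled-input situation behind the duplicated output lines in the thread.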