Behaviour of FileSystem.rename() and hadoop fs -mv command
Hi, I'm hoping someone can point out where I've been particularly silly. The behaviour I've observed when using the fs shell -mv command is that it creates any missing parent directories while doing the move. The shell appears to call rename() on the FileSystem, yet when I invoke rename() directly it doesn't create the parent directories - it just returns false. If I want to move a file on HDFS, creating any parent directories along the way, is there an existing class I can use? Thanks, Paul
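A minimal sketch of one way to get that behaviour yourself, assuming the usual FileSystem API from org.apache.hadoop.fs (the moveTo helper name is made up for illustration, not an existing Hadoop class): create the missing parent directories first, then rename.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class MoveWithParents {
  // Hypothetical helper: mimics what the fs shell's -mv appears to do by
  // creating missing parent directories before calling rename().
  public static boolean moveTo(FileSystem fs, Path src, Path dst) throws IOException {
    Path parent = dst.getParent();
    if (parent != null && !fs.exists(parent)) {
      if (!fs.mkdirs(parent)) {   // create missing ancestors first
        return false;
      }
    }
    return fs.rename(src, dst);   // rename() alone just returns false here
  }

  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    System.out.println(moveTo(fs, new Path(args[0]), new Path(args[1])));
  }
}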
why not zookeeper for the namenode
Hi, yesterday I read the documentation of ZooKeeper and the ZK contrib BookKeeper. From what I read, I thought that BookKeeper would be the ideal enhancement for the namenode, to make it distributed and therefore finally highly available. I then searched whether work in that direction had already started and found out that apparently a totally different approach has been chosen: http://issues.apache.org/jira/browse/HADOOP-4539 Since I'm new to Hadoop, I do trust in your decision. However I'd be glad if somebody could satisfy my curiosity: - Why hasn't ZooKeeper (with BookKeeper) been chosen? Especially since it seems to do a similar job already in HBase. - Isn't it the case that with HADOOP-4539 clients can only connect to one namenode at a time, leaving the burden of all reads and writes on that one node's shoulders? - Isn't it the case that ZooKeeper would be more network efficient? It requires only a majority of nodes to receive a change, while HADOOP-4539 seems to require all backup nodes to receive a change before it is persisted. Thanks for any explanation, Thomas Koch, http://www.koch.ro
Re: why not zookeeper for the namenode
On Fri, Feb 19, 2010 at 12:41 AM, Thomas Koch tho...@koch.ro wrote: Hi, yesterday I read the documentation of ZooKeeper and the ZK contrib BookKeeper. From what I read, I thought that BookKeeper would be the ideal enhancement for the namenode, to make it distributed and therefore finally highly available. I then searched whether work in that direction had already started and found out that apparently a totally different approach has been chosen: http://issues.apache.org/jira/browse/HADOOP-4539 Since I'm new to Hadoop, I do trust in your decision. However I'd be glad if somebody could satisfy my curiosity: I didn't work on that particular design, but I'll do my best to answer your questions below: - Why hasn't ZooKeeper (with BookKeeper) been chosen? Especially since it seems to do a similar job already in HBase. HBase does not use BookKeeper currently. Rather, it just uses ZK for election and a small amount of metadata tracking. It is therefore only storing a small amount of data in ZK, whereas the Hadoop NN would have to store many GB worth of namesystem data. I don't think anyone has tried putting such a large amount of data in ZK yet, and being the first to do something is never without problems :) Additionally, when this design was made, BookKeeper was very new. It's still in development, as I understand it. - Isn't it the case that with HADOOP-4539 clients can only connect to one namenode at a time, leaving the burden of all reads and writes on that one node's shoulders? Yes. - Isn't it the case that ZooKeeper would be more network efficient? It requires only a majority of nodes to receive a change, while HADOOP-4539 seems to require all backup nodes to receive a change before it is persisted. Potentially. However, "all backup nodes" is usually just one node. In our experience, and the experience of most other Hadoop deployments I've spoken with, the primary factors decreasing NN availability are *not* system crashes, but rather lack of online upgrade capability, slow restart time for planned restarts, etc. Adding a hot standby can help with the planned upgrade situation, but two standbys don't give you much more reliability than one. In a datacenter, the failure correlations are generally such that racks either fail independently or the entire DC has lost power. So there aren't a lot of cases where 3 NN replicas would buy you much over 2. -Todd Thanks for any explanation, Thomas Koch, http://www.koch.ro
Data-Intensive Text Processing with MapReduce
Hi everyone, I'm pleased to present the first complete draft of a forthcoming book: Data-Intensive Text Processing with MapReduce by Jimmy Lin and Chris Dyer The complete text is available at: http://www.umiacs.umd.edu/~jimmylin/book.html It's slated for publication by Morgan Claypool in mid-2010. This text is currently being used in the MapReduce course at the University of Maryland. The focus of the book is on algorithm design and thinking at scale. Quite explicitly, the book is *not* about Hadoop programming. Tom White's book already does that quite well... :) Table of Contents 1. Introduction 2. MapReduce Basics 3. MapReduce algorithm design 4. Inverted Indexing for Text Retrieval 5. Graph Algorithms 6. EM Algorithms for Text Processing 7. Closing Remarks We hope you find this resource helpful... Comments and feedback are welcome! Best, Jimmy
Re: Developing cross-component patches post-split
Hi, I am working on the 0.18.3 branch and it works fine for me. Could you tell me how you build the code? Make sure to check out the project and create a new Java project. Also make sure not to use the default bin directory, as there will be an issue because Hadoop writes some class files there and Eclipse overrides them. Ankit -- View this message in context: http://old.nabble.com/Developing-cross-component-patches-post-split-tp27634796p27656854.html Sent from the Hadoop core-user mailing list archive at Nabble.com.
JVM heap and sort buffer size guidelines
Hi, For a node with M gigabytes of memory and N total child tasks (both map + reduce) running on the node, what do people typically use for the following parameters: - Xmx (heap size per child task JVM)? I.e. my question here is what percentage of the node's total memory you use for the heaps of the tasks' JVMs. I am trying to reuse JVMs, and there are roughly N task JVMs on one node at any time. I've tried using a very large chunk of my node's memory for heaps (i.e. close to M/N) and I have seen better execution times without experiencing swapping; but I am wondering if this is job-specific behaviour. When I've set both -Xmx and -Xms to the same heap size (i.e. maximum and minimum heap size the same, to avoid contraction and expansion overheads) I have run into some swapping; I guess Xms=Xmx should be avoided if we are close to the physical memory limit. - io.sort.mb and io.sort.factor. I understand that to answer this we'd have to take the disk configuration into consideration. Do you consider this only a function of disk or also a function of the heap size? Obviously io.sort.mb has to be smaller than the heap size, but how much space do you leave for non-sort buffer usage? I am interested in small cluster setups (8-16 nodes), and not large clusters, if that makes any difference. - Vasilis
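For concreteness, here is a hedged sketch of how these knobs can be set per job with the 0.20-era JobConf API; the numbers are illustrative starting points only, not recommendations, and the right values depend on the slot count and the job:

import org.apache.hadoop.mapred.JobConf;

public class TuningSketch {
  public static void main(String[] args) {
    JobConf conf = new JobConf(TuningSketch.class);

    // Heap per child task JVM. (map slots + reduce slots) * Xmx should stay
    // well below the node's physical memory, leaving room for the OS and
    // the DataNode/TaskTracker daemons, or swapping starts.
    conf.set("mapred.child.java.opts", "-Xmx1024m");

    // The sort buffer is carved out of the child heap, so io.sort.mb < Xmx,
    // with headroom left for the map output records themselves.
    conf.setInt("io.sort.mb", 256);

    // Number of spill segments merged at once during the sort/merge phase.
    conf.setInt("io.sort.factor", 64);
  }
}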
Re: JNI in Map Reduce
See http://issues.apache.org/jira/browse/HADOOP-2867 (and https://issues.apache.org/jira/browse/HADOOP-5980 if you are using 0.21 or Y! Hadoop w/LinuxTaskController). What version of Hadoop are you using? Also, if this is custom code, what does the runtime link path look like (-R at compile time)? Using $ORIGIN might be useful here. On 2/19/10 8:47 AM, Utkarsh Agarwal unrealutka...@gmail.com wrote: How to set the LD_LIBRARY_PATH for the child? Configuring mapred-site.xml doesn't work. Also, setting -Djava.library.path is not good enough, since it only gets a reference to the lib I am trying to load (let's say lib.so), but that lib has dependencies on other libs like lib1.so, resulting in UnsatisfiedLinkError. Thus, LD_LIBRARY_PATH has to be set. On Thu, Feb 18, 2010 at 10:03 PM, Jason Venner jason.had...@gmail.com wrote: We used to do this all the time at Attributor. Now if I can remember how we did it. If the libraries are constant you can just install them on your nodes to save pushing them through the distributed cache, and then set up the LD_LIBRARY_PATH correctly. The key issue if you push them through the distributed cache is ensuring that the directory the library gets dropped in is actually in the runtime java.library.path. You can also give explicit paths to System.load. The -Djava.library.path in the child options (mapred.child.java.opts, if I have the param correct) should work also. On Thu, Feb 18, 2010 at 6:49 PM, Utkarsh Agarwal unrealutka...@gmail.com wrote: My .so file has other .so dependencies, so would I have to add them all to the DistributedCache? Also I tried setting LD_LIBRARY_PATH in mapred-site.xml as <property> <name>mapred.child.env</name> <value>LD_LIBRARY_PATH=/opt/libs/</value> </property> but it doesn't work; java.library.path is not sufficient to set, you have to get LD_LIBRARY_PATH set. -Utkarsh On Thu, Feb 18, 2010 at 3:14 PM, Allen Wittenauer awittena...@linkedin.com wrote: Like this: http://hadoop.apache.org/common/docs/current/native_libraries.html#Loading+native+libraries+through+DistributedCache On 2/16/10 5:29 PM, Jason Rutherglen jason.rutherg...@gmail.com wrote: How would this work? On Fri, Feb 12, 2010 at 10:45 AM, Allen Wittenauer awittena...@linkedin.com wrote: ... or just use distributed cache. On 2/12/10 10:02 AM, Alex Kozlov ale...@cloudera.com wrote: All native libraries should be on each of the cluster nodes. You need to set the java.library.path property to point to your libraries (or just put them in the default system dirs). On Fri, Feb 12, 2010 at 9:12 AM, Utkarsh Agarwal unrealutka...@gmail.com wrote: Can anybody point me to how to use JNI calls in a map reduce program? My .so files have other dependencies also; is there a way to load the LD_LIBRARY_PATH for child processes? Should all the native stuff be in HDFS? Thanks, Utkarsh. -- Pro Hadoop, a book to guide you from beginner to hadoop mastery, http://www.amazon.com/dp/1430219424?tag=jewlerymall www.prohadoopbook.com a community for Hadoop Professionals
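The DistributedCache route linked above looks roughly like the following. This is a hedged sketch against the 0.20-era API; the library names and HDFS paths are made up for illustration, and note that it only addresses java.library.path, not the LD_LIBRARY_PATH problem for transitively loaded .so dependencies discussed in this thread:

import java.net.URI;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.mapred.JobConf;

public class NativeLibJobSetup {
  public static void configure(JobConf conf) throws Exception {
    // Ship shared libraries that already sit in HDFS (hypothetical paths)
    // and ask for symlinks in each task's working directory.
    DistributedCache.addCacheFile(new URI("hdfs:///libs/libfoo.so#libfoo.so"), conf);
    DistributedCache.addCacheFile(new URI("hdfs:///libs/libfoodep.so#libfoodep.so"), conf);
    DistributedCache.createSymlink(conf);

    // Put the task's working directory on the JVM's native library search
    // path so System.loadLibrary() can find the symlinked files.
    conf.set("mapred.child.java.opts", "-Xmx200m -Djava.library.path=.");
  }
}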
Re: JNI in Map Reduce
I am using Hadoop 0.20.1. I added the attached patch, but child processes still don't get the path :( On Fri, Feb 19, 2010 at 10:57 AM, Allen Wittenauer awittena...@linkedin.com wrote: See http://issues.apache.org/jira/browse/HADOOP-2867 (and https://issues.apache.org/jira/browse/HADOOP-5980 if you are using 0.21 or Y! Hadoop w/LinuxTaskController). What version of Hadoop are you using? Also, if this is custom code, what does the runtime link path look like (-R at compile time)? Using $ORIGIN might be useful here. On 2/19/10 8:47 AM, Utkarsh Agarwal unrealutka...@gmail.com wrote: How to set the LD_LIBRARY_PATH for the child? Configuring mapred-site.xml doesn't work. Also, setting -Djava.library.path is not good enough, since it only gets a reference to the lib I am trying to load (let's say lib.so), but that lib has dependencies on other libs like lib1.so, resulting in UnsatisfiedLinkError. Thus, LD_LIBRARY_PATH has to be set. On Thu, Feb 18, 2010 at 10:03 PM, Jason Venner jason.had...@gmail.com wrote: We used to do this all the time at Attributor. Now if I can remember how we did it. If the libraries are constant you can just install them on your nodes to save pushing them through the distributed cache, and then set up the LD_LIBRARY_PATH correctly. The key issue if you push them through the distributed cache is ensuring that the directory the library gets dropped in is actually in the runtime java.library.path. You can also give explicit paths to System.load. The -Djava.library.path in the child options (mapred.child.java.opts, if I have the param correct) should work also. On Thu, Feb 18, 2010 at 6:49 PM, Utkarsh Agarwal unrealutka...@gmail.com wrote: My .so file has other .so dependencies, so would I have to add them all to the DistributedCache? Also I tried setting LD_LIBRARY_PATH in mapred-site.xml as <property> <name>mapred.child.env</name> <value>LD_LIBRARY_PATH=/opt/libs/</value> </property> but it doesn't work; java.library.path is not sufficient to set, you have to get LD_LIBRARY_PATH set. -Utkarsh On Thu, Feb 18, 2010 at 3:14 PM, Allen Wittenauer awittena...@linkedin.com wrote: Like this: http://hadoop.apache.org/common/docs/current/native_libraries.html#Loading+native+libraries+through+DistributedCache On 2/16/10 5:29 PM, Jason Rutherglen jason.rutherg...@gmail.com wrote: How would this work? On Fri, Feb 12, 2010 at 10:45 AM, Allen Wittenauer awittena...@linkedin.com wrote: ... or just use distributed cache. On 2/12/10 10:02 AM, Alex Kozlov ale...@cloudera.com wrote: All native libraries should be on each of the cluster nodes. You need to set the java.library.path property to point to your libraries (or just put them in the default system dirs). On Fri, Feb 12, 2010 at 9:12 AM, Utkarsh Agarwal unrealutka...@gmail.com wrote: Can anybody point me to how to use JNI calls in a map reduce program? My .so files have other dependencies also; is there a way to load the LD_LIBRARY_PATH for child processes? Should all the native stuff be in HDFS? Thanks, Utkarsh. -- Pro Hadoop, a book to guide you from beginner to hadoop mastery, http://www.amazon.com/dp/1430219424?tag=jewlerymall www.prohadoopbook.com a community for Hadoop Professionals

Index: src/mapred/mapred-default.xml
===================================================================
--- src/mapred/mapred-default.xml (revision 781689)
+++ src/mapred/mapred-default.xml (working copy)
@@ -406,6 +406,16 @@
 </property>

 <property>
+  <name>mapred.child.env</name>
+  <value></value>
+  <description>User added environment variables for the task tracker child
+  processes. Example :
+  1) A=foo This will set the env variable A to foo
+  2) B=$B:c This is inherit tasktracker's B env variable.
+  </description>
+</property>
+
+<property>
   <name>mapred.child.ulimit</name>
   <value></value>
   <description>The maximum virtual memory, in KB, of a process launched by the
Index: src/mapred/org/apache/hadoop/mapred/TaskRunner.java
===================================================================
--- src/mapred/org/apache/hadoop/mapred/TaskRunner.java (revision 781689)
+++ src/mapred/org/apache/hadoop/mapred/TaskRunner.java (working copy)
@@ -399,6 +399,25 @@
       ldLibraryPath.append(oldLdLibraryPath);
     }
     env.put("LD_LIBRARY_PATH", ldLibraryPath.toString());
+
+    // add the env variables passed by the user
+    String mapredChildEnv = conf.get("mapred.child.env");
+    if (mapredChildEnv != null && mapredChildEnv.length() > 0) {
+      String childEnvs[] = mapredChildEnv.split(",");
+      for (String cEnv : childEnvs) {
+        String[] parts = cEnv.split("="); // split on '='
+        String value = env.get(parts[0]);
+        if (value != null) {
+          // replace $env with the tt's value of env
+          value
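If a build that includes the mapred.child.env property from the patch above is in place, per-job usage would presumably be a single setting; a hedged sketch (the /opt/libs path is a placeholder for wherever the dependent .so files live on every node):

import org.apache.hadoop.mapred.JobConf;

public class ChildEnvSketch {
  public static void main(String[] args) {
    JobConf conf = new JobConf(ChildEnvSketch.class);
    // Comma-separated NAME=VALUE pairs, as described by the patch above.
    conf.set("mapred.child.env", "LD_LIBRARY_PATH=/opt/libs");
  }
}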
HDFS vs Giant Direct Attached Arrays
Hadoop is great. Almost every day I live gives me more reasons to like it. My story for today: We have a system running a file system with a 48 TB disk array on 4 shelves. Today I got this information about firmware updates. (Don't you love firmware updates?) --- "Any controllers configured with any of the SATA hard drives listed in the Scope section might exhibit slow virtual disk initialization/expansion, rare drive stalls and timeouts, scrubbing errors, and reduced performance. Data might be at risk if multiple drive failures occur. Proactive hard drive replacement is neither necessary, nor authorized. Updating the firmware of disk drives in a virtual disk risks the loss of data and causes the drives to be temporarily inaccessible." --- In a nutshell, the safest way is to offline the system and update disks one at a time (we don't know how long updating one disk takes). Or we have to smart-fail disks and move them out of this array into another array (lucky we have another one), apply the firmware, put the disk back in, wait for re-stripe, and repeat 47 times! So the options are: 1) Risky -- do the update online and hope we do not corrupt the thing. 2) Slow -- offline the system and update 1 disk at a time as suggested. No option has zero downtime. Also note that this update fixes "reduced performance", so this chassis was never operating at max efficiency, due to whatever reason: RAID card complexity, firmware, back plane, whatever. Now, imagine if this was a 6-node Hadoop system with 8 disks per node and we had to do a firmware update. Wow, this would be easy. We could accomplish it with no system-wide outage, at our leisure. With a file replication factor of 3 we could hot swap disks, or even safely fail an entire node with no outage. We would not need spare hardware, need to inform people of an outage, or disable alerts. Hadoop would not care if the firmware on all the disks did not match. Hadoop would not have some complicated RAID running at reduced performance all this time. Hadoop just uses independent disks, much less complexity. HDFS ForTheWin!
Some information on Hadoop Sort
Hello, I was wondering if someone could give me some information on how Hadoop does the sorting. From what I have read there does not seem to be a map class or reduce class for it? Where and how is the sorting parallelized? Best Regards from Buffalo Abhishek Agrawal SUNY-Buffalo (716-435-7122)
Re: Some information on Hadoop Sort
Hi, the sorting is done by the MapReduce framework. On the map side, each output record first goes to a sorting buffer, where the sorting, partitioning and combining (if there is a combiner) happen. If necessary, multi-phase sorting is done to produce a single sorted result for each map task. On the reduce side, all the data from the multiple map tasks is merged (each input is already sorted on the map side, so only a merge sort is needed here). This goes through multiple rounds if necessary. -Gang ----- Original Message ----- From: aa...@buffalo.edu To: common-user@hadoop.apache.org Sent: 2010/2/19 (Fri) 2:25:50 PM Subject: Some information on Hadoop Sort Hello, I was wondering if someone could give me some information on how Hadoop does the sorting. From what I have read there does not seem to be a map class or reduce class for it? Where and how is the sorting parallelized? Best Regards from Buffalo Abhishek Agrawal SUNY-Buffalo (716-435-7122)
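To make the "no map class or reduce class" point concrete: a sort job can simply plug identity functions into the framework and let the shuffle do the sorting. A minimal sketch along those lines using the old 0.20 API (SequenceFile input with Text keys and values is an assumption here; adjust to your data):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public class SortSketch {
  public static void main(String[] args) throws Exception {
    JobConf job = new JobConf(SortSketch.class);
    job.setJobName("sort-sketch");

    // Identity map and reduce: the framework's shuffle sorts the keys,
    // in parallel across map tasks and merged at each reduce task.
    job.setMapperClass(IdentityMapper.class);
    job.setReducerClass(IdentityReducer.class);

    job.setInputFormat(SequenceFileInputFormat.class);
    job.setOutputFormat(SequenceFileOutputFormat.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);

    FileInputFormat.setInputPaths(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    JobClient.runJob(job);
  }
}

Each reduce task writes one sorted file; getting a single globally ordered output across several reducers additionally needs a total-order partitioner, which is how the sorting is parallelized beyond one reducer.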
On CDH2, (Cloudera EC2) No valid local directories in property: mapred.local.dir
Hello, Not sure if I should post this here or on Cloudera's message board, but here goes. When I run EC2 using the latest CDH2 and Hadoop 0.20 (by setting the env variables via hadoop-ec2), and launch a job with hadoop jar ... I get the following error: 10/02/19 17:04:55 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same. org.apache.hadoop.ipc.RemoteException: java.io.IOException: No valid local directories in property: mapred.local.dir at org.apache.hadoop.conf.Configuration.getLocalPath(Configuration.java:975) at org.apache.hadoop.mapred.JobConf.getLocalPath(JobConf.java:279) at org.apache.hadoop.mapred.JobInProgress.init(JobInProgress.java:256) at org.apache.hadoop.mapred.JobInProgress.init(JobInProgress.java:240) at org.apache.hadoop.mapred.JobTracker.submitJob(JobTracker.java:3026) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:508) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:966) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:962) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:960) at org.apache.hadoop.ipc.Client.call(Client.java:740) at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220) at org.apache.hadoop.mapred.$Proxy0.submitJob(Unknown Source) at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:841) at org.apache.hadoop.mapreduce.Job.submit(Job.java:432) at org.godhuli.f.RHMR.submitAndMonitorJob(RHMR.java:195) but the value of mapred.local.dir is /mnt/hadoop/mapred/local Any ideas?
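The exception says the JobTracker found no usable directory among those listed in mapred.local.dir. A hedged, standalone diagnostic sketch (not Hadoop's actual implementation) for checking what a node actually sees:

import java.io.File;
import org.apache.hadoop.mapred.JobConf;

public class LocalDirCheck {
  public static void main(String[] args) {
    // JobConf loads mapred-default.xml and mapred-site.xml from the classpath;
    // the fallback value below just mirrors the one reported in this thread.
    JobConf conf = new JobConf();
    String dirs = conf.get("mapred.local.dir", "/mnt/hadoop/mapred/local");
    for (String d : dirs.split(",")) {
      File f = new File(d.trim());
      System.out.println(d + " exists=" + f.exists() + " writable=" + f.canWrite());
    }
  }
}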
hadoop-streaming tutorial with -archives option
Hi, Hadoop/HDFS newbie. Been struggling with getting the streaming example working with -archives, c.f. http://hadoop.apache.org/common/docs/r0.20.1/streaming.html#Large+files+and+archives+in+Hadoop+Streaming My environment is the pseudo-distributed setup per: http://hadoop.apache.org/common/docs/current/quickstart.html#PseudoDistributed I've run into a couple of issues. The first issue is a FileNotFoundException when the #symlink suffix is specified with the -archives or -files options as per the tutorial. hadoop jar $HADOOP_HOME/hadoop-0.20.1-streaming.jar -archives hdfs://localhost:9000/user/me/samples/cachefile/cachedir.jar#testlink -input samples/cachefile/input.txt -mapper "xargs cat" -reducer "cat" -output samples/cachefile/out java.io.FileNotFoundException: File hdfs://localhost:9000/user/me/samples/cachefile/cachedir.jar#testlink does not exist. at org.apache.hadoop.util.GenericOptionsParser.validateFiles(GenericOptionsParser.java:349) at org.apache.hadoop.util.GenericOptionsParser.processGeneralOptions(GenericOptionsParser.java:275) at org.apache.hadoop.util.GenericOptionsParser.parseGeneralOptions(GenericOptionsParser.java:375) at org.apache.hadoop.util.GenericOptionsParser.init(GenericOptionsParser.java:153) at org.apache.hadoop.util.GenericOptionsParser.init(GenericOptionsParser.java:138) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:59) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79) at org.apache.hadoop.streaming.HadoopStreaming.main(HadoopStreaming.java:32) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.util.RunJar.main(RunJar.java:156) If I remove the #testlink from the archives definition, the error goes away, but the symlink described in the tutorial documentation is not created. I've seen the JIRA issue http://issues.apache.org/jira/browse/HADOOP-6178; it shows no Fix version, but its Issue Links point to others which are supposedly fixed in 0.20.1, which I have. The 2nd issue is "Unrecognized option: -archives" when -archives is specified at the end of the arg list. hadoop jar $HADOOP_HOME/hadoop/hadoop-0.20.1-streaming.jar -input samples/cachefile/input.txt -mapper "xargs cat" -reducer "cat" -output samples/cachefile/out9 -archives hdfs://localhost:9000/user/me/samples/cachefile/cachedir.jar#testlink 10/02/19 14:29:11 ERROR streaming.StreamJob: Unrecognized option: -archives Any help getting past this is appreciated. Am I missing a configuration setting that allows symlinking? Really hoping to use the archives feature. -Michael
Fixed Re: On CDH2, (Cloudera EC2) No valid local directories in property: mapred.local.dir
As a follow up, this is not peculiar to my programs but hadoop ones too: I started the cluster as (using Fedora AMI and c1.medium) hadoop-ec2 launch-cluster --env REPO=testing --env HADOOP_VERSION=0.20 test2 1 and error is [1] But it works(and so do my programs) if I launch a cluster with e.g. 4 nodes as opposed to 1. Regards Saptarshi [1] [r...@ip-10-250-195-225 ~]# hadoop jar /usr/lib/hadoop-0.20/hadoop-0.20.1+169.56-examples.jar wordcount /tmp/foo /tmp/f1 10/02/19 17:31:45 WARN conf.Configuration: DEPRECATED: hadoop-site.xml found in the classpath. Usage of hadoop-site.xml is deprecated. Instead use core-site.xml, mapred-site.xml and hdfs-site.xml to override properties of core-default.xml, mapred-default.xml and hdfs-default.xml respectively 10/02/19 17:31:45 INFO input.FileInputFormat: Total input paths to process : 1 org.apache.hadoop.ipc.RemoteException: java.io.IOException: No valid local directories in property: mapred.local.dir at org.apache.hadoop.conf.Configuration.getLocalPath(Configuration.java:975) at org.apache.hadoop.mapred.JobConf.getLocalPath(JobConf.java:279) at org.apache.hadoop.mapred.JobInProgress.init(JobInProgress.java:256) at org.apache.hadoop.mapred.JobInProgress.init(JobInProgress.java:240) at org.apache.hadoop.mapred.JobTracker.submitJob(JobTracker.java:3026) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:508) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:966) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:962) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:960) at org.apache.hadoop.ipc.Client.call(Client.java:740) at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220) at org.apache.hadoop.mapred.$Proxy0.submitJob(Unknown Source) at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:841) at org.apache.hadoop.mapreduce.Job.submit(Job.java:432) at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:447) at org.apache.hadoop.examples.WordCount.main(WordCount.java:67) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68) at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139) at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:64) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.util.RunJar.main(RunJar.java:156) On Fri, Feb 19, 2010 at 5:13 PM, Saptarshi Guha saptarshi.g...@gmail.com wrote: Hello, Not sure if i should post this here or on Cloudera's message board, but here goes. When I run EC2 using the latest CDH2 and Hadoop 0.20 (by settiing the env variables are hadoop-ec2), and launch a job hadoop jar ... 
I get the following error 10/02/19 17:04:55 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same. org.apache.hadoop.ipc.RemoteException: java.io.IOException: No valid local directories in property: mapred.local.dir at org.apache.hadoop.conf.Configuration.getLocalPath(Configuration.java:975) at org.apache.hadoop.mapred.JobConf.getLocalPath(JobConf.java:279) at org.apache.hadoop.mapred.JobInProgress.init(JobInProgress.java:256) at org.apache.hadoop.mapred.JobInProgress.init(JobInProgress.java:240) at org.apache.hadoop.mapred.JobTracker.submitJob(JobTracker.java:3026) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:508) at
reuse JVMs across multiple jobs
Hi, Is it possible (and does it make sense) to reuse JVMs across jobs? The mapred.job.reuse.jvm.num.tasks config option is a job-specific parameter, as its name implies. When running multiple independent jobs simultaneously with job.reuse.jvm=-1 (which means always reuse), I see a lot of different Java PIDs (far more than map.tasks.maximum + reduce.tasks.maximum) across the duration of the job runs, instead of the same Java processes persisting. The number of live JVMs on a given node/tasktracker at any time never exceeds map.tasks.maximum + reduce.tasks.maximum, as expected, but we do tear down idle JVMs and spawn new ones quite often. For example, here are the numbers of distinct Java PIDs when submitting 1, 2, 4 and 32 copies of the same job in parallel: 1 job: 28 PIDs, 2 jobs: 39, 4 jobs: 106, 32 jobs: 740. The relevant killing and spawning logic should be in src/mapred/org/apache/hadoop/mapred/JvmManager.java, particularly the reapJvm() method, but I haven't dug deeper. I am wondering if it would be possible and worthwhile from a performance standpoint to be able to reuse JVMs across jobs, i.e. have a common JVM pool for all Hadoop jobs? thanks, - Vasilis
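For reference, within a single job the reuse knob is set per JobConf; a hedged sketch of the 0.20-era call, equivalent to setting the property named above:

import org.apache.hadoop.mapred.JobConf;

public class JvmReuseSketch {
  public static void main(String[] args) {
    JobConf conf = new JobConf(JvmReuseSketch.class);

    // -1 means "reuse the JVM for an unlimited number of tasks of this job",
    // i.e. mapred.job.reuse.jvm.num.tasks=-1.
    conf.setNumTasksToExecutePerJvm(-1);

    // Because the setting is scoped to the job, an idle JVM left over from
    // job A is not handed to job B; that cross-job pooling is exactly what
    // the question above is asking about.
  }
}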
Re: Many child processes don't exit
Hello Edson, Thank you for your reply. I don't want to kill them, I want to know why these child processes don't exit, and to know how to make them exit successfully when they finish. Any ideas? Thank you. LvZheng 2010/2/18 Edson Ramiro erlfi...@gmail.com Do you want to kill them ? if yes, you can use ./bin/slaves.sh pkill java but it will kill the datanode and tasktracker processes in all slaves and you'll need to start these processes again. Edson Ramiro On 14 February 2010 22:09, Zheng Lv lvzheng19800...@gmail.com wrote: any idea? 2010/2/11 Zheng Lv lvzheng19800...@gmail.com Hello Everyone, We often find many child processes in datanodes, which have already finished for long time. And following are the jstack log: Full thread dump Java HotSpot(TM) 64-Bit Server VM (14.3-b01 mixed mode): DestroyJavaVM prio=10 tid=0x2aaac8019800 nid=0x2422 waiting on condition [0x] java.lang.Thread.State: RUNNABLE NioProcessor-31 prio=10 tid=0x439fa000 nid=0x2826 runnable [0x4100a000] java.lang.Thread.State: RUNNABLE at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method) at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:215) at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:65) at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:69) - locked 0x2aaab9b5f6f8 (a sun.nio.ch.Util$1) - locked 0x2aaab9b5f710 (a java.util.Collections$UnmodifiableSet) - locked 0x2aaab9b5f680 (a sun.nio.ch.EPollSelectorImpl) at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:80) at org.apache.mina.transport.socket.nio.NioProcessor.select(NioProcessor.java:65) at org.apache.mina.common.AbstractPollingIoProcessor$Worker.run(AbstractPollingIoProcessor.java:672) at org.apache.mina.util.NamePreservingRunnable.run(NamePreservingRunnable.java:51) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.java:619) pool-15-thread-1 prio=10 tid=0x2aaac802d000 nid=0x2825 waiting on condition [0x41604000] java.lang.Thread.State: WAITING (parking) at sun.misc.Unsafe.park(Native Method) - parking to wait for 0x2aaab9b61620 (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject) at java.util.concurrent.locks.LockSupport.park(LockSupport.java:158) at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1925) at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:358) at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:947) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:907) at java.lang.Thread.run(Thread.java:619) Attach Listener daemon prio=10 tid=0x43a22800 nid=0x2608 waiting on condition [0x] java.lang.Thread.State: RUNNABLE pool-3-thread-1 prio=10 tid=0x438ad000 nid=0x243f waiting on condition [0x42575000] java.lang.Thread.State: TIMED_WAITING (parking) at sun.misc.Unsafe.park(Native Method) - parking to wait for 0x2aaab9ae8770 (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject) at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:198) at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:1963) at java.util.concurrent.DelayQueue.take(DelayQueue.java:164) at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:583) at 
java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:576) at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:947) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:907) at java.lang.Thread.run(Thread.java:619) Thread for syncLogs daemon prio=10 tid=0x2aaac4446800 nid=0x2437 waiting on condition [0x40f09000] java.lang.Thread.State: TIMED_WAITING (sleeping) at java.lang.Thread.sleep(Native Method) at org.apache.hadoop.mapred.Child$2.run(Child.java:87) Low Memory Detector daemon prio=10 tid=0x43771800 nid=0x242d runnable [0x] java.lang.Thread.State: RUNNABLE CompilerThread1 daemon prio=10 tid=0x4376e800 nid=0x242c waiting