RE: Issue with usage of fs -test
Maybe https://issues.apache.org/jira/browse/HADOOP-3792 ?

Koji

-----Original Message-----
From: pankaj jairath [mailto:pjair...@yahoo-inc.com]
Sent: Thursday, May 28, 2009 4:49 AM
To: core-user@hadoop.apache.org
Subject: Issue with usage of fs -test

Hello,

I am facing a strange issue where /fs -test -e/ fails while /fs -ls/ successfully lists the file. Here is a transcript of such a result:

  [...@mymachine bin]$ hadoop fs -ls /projects/myproject///.done
  Found 1 items
  -rw-------   3 user hdfs          0 2009-03-19 22:28 /projects/myproject///.done
  [...@mymachine bin]$ echo $?
  0
  [...@mymachine bin]$ hadoop fs -test -e /projects/myproject///.done
  [...@mymachine bin]$ echo $?
  1

What is the cause of this behaviour? Any pointers would be much appreciated. (HADOOP_CONF_DIR and HADOOP_HOME are set correctly in the environment.)

Thanks
Pankaj
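If the shell's -test exit code is suspect, the same existence check can be made directly against the FileSystem API. A minimal Java sketch against the 0.18-era API, assuming a default configuration on the classpath; the path below is illustrative, not the poster's exact file:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class ExistsCheck {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();   // picks up hadoop-site.xml from HADOOP_CONF_DIR
      FileSystem fs = FileSystem.get(conf);
      // Illustrative path; substitute the file being tested.
      boolean exists = fs.exists(new Path("/projects/myproject/.done"));
      System.exit(exists ? 0 : 1);                // mirror 'fs -test -e' semantics
    }
  }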
RE: Setting up another machine as secondary node
> The secondary namenode takes a snapshot at 5 minute (configurable) intervals,

This is a bit too aggressive. Checkpointing is still an expensive operation. I'd say every hour or even every day. Isn't the default 3600 seconds?

Koji

-----Original Message-----
From: jason hadoop [mailto:jason.had...@gmail.com]
Sent: Thursday, May 14, 2009 7:46 AM
To: core-user@hadoop.apache.org
Subject: Re: Setting up another machine as secondary node

Any machine put in the conf/masters file becomes a secondary namenode. At some point there was confusion about the safety of running more than one, which I believe was settled: multiple secondaries are safe. The secondary namenode takes a snapshot at 5 minute (configurable) intervals, rebuilds the fsimage, and sends that back to the namenode. There is some performance advantage to having it on the local machine, and some safety advantage to having it on an alternate machine. Could someone who remembers speak up on single vs. multiple secondary namenodes?

On Thu, May 14, 2009 at 6:07 AM, David Ritch <david.ri...@gmail.com> wrote:

First of all, the secondary namenode is not what you might think a secondary is - it is not a failover device. It does make a copy of the filesystem metadata periodically, and it integrates the edits into the image. It does *not* provide failover. Second, you specify its IP address in hadoop-site.xml. This is where you can override the defaults set in hadoop-default.xml.

dbr

On Thu, May 14, 2009 at 9:03 AM, Rakhi Khatwani <rakhi.khatw...@gmail.com> wrote:

Hi,

I want to set up a cluster of 5 nodes in such a way that
node1 - master
node2 - secondary namenode
node3 - slave
node4 - slave
node5 - slave

How do we go about that? There is no property in hadoop-env where I can set the IP address of the secondary namenode. If I set node1 and node2 in masters, then when we start DFS, the namenode and secondary namenode processes are present on both machines, but I think only node1 is active, and my namenode failover operation fails. Any suggestions?

Regards,
Rakhi

--
Alpha Chapters of my book on Hadoop are available
http://www.apress.com/book/view/9781430219422
www.prohadoopbook.com a community for Hadoop Professionals
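For reference, the interval Koji mentions is controlled by fs.checkpoint.period in the secondary namenode's hadoop-site.xml, which defaults to 3600 seconds in this era. A sketch, assuming the 0.18/0.19-style property name:

  <property>
    <name>fs.checkpoint.period</name>
    <value>3600</value>
    <description>Seconds between two periodic checkpoints by the secondary namenode.</description>
  </property>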
RE: Setting up another machine as secondary node
Before 0.19, fsimage/edits were in the same directory. So whenever the secondary finished checkpointing, it copied back the fsimage while the namenode still kept writing to the edits file. We usually observed some latency on the namenode side during that time. HADOOP-3948 would probably help on 0.19 or later.

Koji

-----Original Message-----
From: Brian Bockelman [mailto:bbock...@cse.unl.edu]
Sent: Thursday, May 14, 2009 10:32 AM
To: core-user@hadoop.apache.org
Subject: Re: Setting up another machine as secondary node

Hey Koji,

It's an expensive operation - for the secondary namenode, not the namenode itself, right? I don't particularly care if I stress out a dedicated node that doesn't have to respond to queries ;)

Locally we checkpoint+backup fairly frequently (not 5 minutes ... maybe less than the default hour) due to sheer paranoia about losing metadata.

Brian

On May 14, 2009, at 12:25 PM, Koji Noguchi wrote:

> The secondary namenode takes a snapshot at 5 minute (configurable) intervals,

This is a bit too aggressive. Checkpointing is still an expensive operation. I'd say every hour or even every day. Isn't the default 3600 seconds?

Koji
RE: Blocks replication in downtime event
http://hadoop.apache.org/core/docs/current/hdfs_design.html#Data+Disk+Failure%2C+Heartbeats+and+Re-Replication

Hope this helps.

Koji

-----Original Message-----
From: Stas Oskin [mailto:stas.os...@gmail.com]
Sent: Monday, April 27, 2009 4:11 AM
To: core-user@hadoop.apache.org
Subject: Blocks replication in downtime event

Hi.

I have a question: if I have N DataNodes, and one or several of the nodes become unavailable, will HDFS re-replicate the blocks automatically, according to the replication level set? And if yes, when? As soon as the offline node is detected, or only on file access?

Regards.
RE: core-user Digest 23 Apr 2009 02:09:48 -0000 Issue 887
Owen,

> Is it just the patches that have already been applied to the 18 branch? Or are there more?

The former. Just the patches that have already been applied to the 0.18 branch. I especially want HADOOP-5465 in for the 'stable' release. (This patch is also missing in 0.19.1.)

Koji

-----Original Message-----
From: Owen O'Malley [mailto:omal...@apache.org]
Sent: Thursday, April 23, 2009 11:54 AM
To: core-user@hadoop.apache.org
Subject: Re: core-user Digest 23 Apr 2009 02:09:48 -0000 Issue 887

On Apr 22, 2009, at 10:44 PM, Koji Noguchi wrote:

> Nigel, when you have time, could you release 0.18.4 containing some of the patches that make our clusters 'stable'?

Is it just the patches that have already been applied to the 18 branch? Or are there more?

-- Owen
RE: core-user Digest 23 Apr 2009 02:09:48 -0000 Issue 887
Nigel,

When you have time, could you release 0.18.4 containing some of the patches that make our clusters 'stable'?

Koji

-----Original Message-----
From: Nigel Daley [mailto:nda...@yahoo-inc.com]
Sent: Wednesday, April 22, 2009 10:31 PM
To: core-user@hadoop.apache.org
Subject: Re: core-user Digest 23 Apr 2009 02:09:48 -0000 Issue 887

No, I didn't mark 0.19.1 stable. I left 0.18.3 as our most stable release. My company skipped deploying 0.19.x, so I have no experience with that branch. Others?

Nige

> Has the release 0.19 now become a stable one?

On Wed, Apr 22, 2009 at 4:53 PM, Nigel Daley <nda...@yahoo-inc.com> wrote:

Release 0.20.0 contains many improvements, new features, bug fixes and optimizations. For Hadoop release details and downloads, visit: http://hadoop.apache.org/core/releases.html

Hadoop 0.20.0 Release Notes are at http://hadoop.apache.org/core/docs/r0.20.0/releasenotes.html

Thanks to all who contributed to this release!

Nigel
RE: Multiple outputs and getmerge?
Stuart,

I once used MultipleOutputFormat and created

  (mapred.work.output.dir)/type1/part-_
  (mapred.work.output.dir)/type2/part-_
  ...

and the JobTracker took care of renaming them to

  (mapred.output.dir)/type{1,2}/part-__

Would that work for you?

Koji

-----Original Message-----
From: Stuart White [mailto:stuart.whi...@gmail.com]
Sent: Monday, April 20, 2009 1:15 PM
To: core-user@hadoop.apache.org
Subject: Multiple outputs and getmerge?

I've written an MR job with multiple outputs. The normal output goes to files named part-X, and my secondary output records go to files I've chosen to name ExceptionDocuments (which are therefore named ExceptionDocuments-m-X). I'd like to pull merged copies of these files to my local filesystem: two separate merged files, one containing the normal output and one containing the ExceptionDocuments output. But since Hadoop lands both of these outputs in files residing in the same directory, when I issue hadoop dfs -getmerge, what I get is a file that contains both outputs. To get around this, I have to move files around on HDFS so that my different outputs are in different directories. Is this the best/only way to deal with this? It would be better if Hadoop offered the option of writing different outputs to different output directories, or if getmerge offered the ability to specify a file prefix for the files to be merged. Thanks!
RE: Multiple outputs and getmerge?
Something along the lines of

  class MyOutputFormat extends MultipleTextOutputFormat<Text, Text> {
    protected String generateFileNameForKeyValue(Text key, Text v, String name) {
      Path outpath = new Path(key.toString(), name);
      return outpath.toString();
    }
  }

would create a directory per key.

If you just want to keep your side-effect files separate, then get your working dir with FileOutputFormat.getWorkOutputPath(...) or $mapred_work_output_dir, dfs -mkdir workdir/NewDir, and put the secondary files there. Explained in http://hadoop.apache.org/core/docs/r0.18.3/api/org/apache/hadoop/mapred/FileOutputFormat.html#getWorkOutputPath(org.apache.hadoop.mapred.JobConf)

Koji

-----Original Message-----
From: Stuart White [mailto:stuart.whi...@gmail.com]
Sent: Tuesday, April 21, 2009 11:46 AM
To: core-user@hadoop.apache.org
Subject: Re: Multiple outputs and getmerge?

On Tue, Apr 21, 2009 at 1:00 PM, Koji Noguchi <knogu...@yahoo-inc.com> wrote:

> I once used MultipleOutputFormat and created
> (mapred.work.output.dir)/type1/part-_
> (mapred.work.output.dir)/type2/part-_
> ...
> And JobTracker took care of the renaming to
> (mapred.output.dir)/type{1,2}/part-__
> Would that work for you?

Can you please explain this in more detail? It looks like you're using MultipleOutputFormat for *both* of your outputs? So you simply don't use the OutputCollector passed as a parm to Mapper#map()?
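A sketch of the driver-side wiring for such an output format against the 0.18 mapred API; the class names and the identity map/reduce defaults are illustrative assumptions, not from the thread:

  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.FileInputFormat;
  import org.apache.hadoop.mapred.FileOutputFormat;
  import org.apache.hadoop.mapred.JobClient;
  import org.apache.hadoop.mapred.JobConf;

  public class MultiOutputDriver {
    public static void main(String[] args) throws Exception {
      JobConf conf = new JobConf(MultiOutputDriver.class);
      conf.setOutputKeyClass(Text.class);
      conf.setOutputValueClass(Text.class);
      // MyOutputFormat is the MultipleTextOutputFormat subclass shown above;
      // each record lands under (output dir)/<key>/part-NNNNN.
      conf.setOutputFormat(MyOutputFormat.class);
      FileInputFormat.setInputPaths(conf, new Path(args[0]));
      FileOutputFormat.setOutputPath(conf, new Path(args[1]));
      JobClient.runJob(conf);
    }
  }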
RE: mapred.tasktracker.map.tasks.maximum
It's probably a silly question, but you do have more than 2 mappers in your second job? If yes, I have no idea what's happening.

Koji

-----Original Message-----
From: javateck javateck [mailto:javat...@gmail.com]
Sent: Tuesday, April 21, 2009 1:38 PM
To: core-user@hadoop.apache.org
Subject: Re: mapred.tasktracker.map.tasks.maximum

Right, I set it in hadoop-site.xml before starting the whole set of Hadoop processes. I have one job that fully utilizes the 10 map tasks, but subsequent queries use only 2 of them, and I don't know why. I have enough RAM, and no paging out is happening. I'm running 0.18.3. Right now I put all processes on one machine (namenode, datanode, jobtracker, tasktracker); it has two quad-core CPUs and 20GB RAM.

On Tue, Apr 21, 2009 at 1:25 PM, Koji Noguchi <knogu...@yahoo-inc.com> wrote:

This is a cluster config, not a per-job config. So it has to be set when the mapreduce cluster first comes up.

Koji

-----Original Message-----
From: javateck javateck [mailto:javat...@gmail.com]
Sent: Tuesday, April 21, 2009 1:20 PM
To: core-user@hadoop.apache.org
Subject: mapred.tasktracker.map.tasks.maximum

I set my mapred.tasktracker.map.tasks.maximum to 10, but when I run a task it's only using 2 out of 10. Any way to know why it's only using 2? Thanks.
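For reference, the knob under discussion lives in each tasktracker's hadoop-site.xml and is read once at tasktracker startup; a sketch, using the value from this thread:

  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>10</value>
    <description>Maximum number of map tasks run simultaneously by one tasktracker; read at tasktracker startup, not per job.</description>
  </property>

(Note also that the number of map tasks a given job runs is determined by its input splits, so a job with only two splits will occupy only two slots regardless of this setting.)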
RE: reduce task specific jvm arg
This sounds like a reasonable request. Created https://issues.apache.org/jira/browse/HADOOP-5684

On our clusters, sometimes users want thin mappers and large reducers.

Koji

-----Original Message-----
From: Jun Rao [mailto:jun...@almaden.ibm.com]
Sent: Thursday, April 09, 2009 10:30 AM
To: core-user@hadoop.apache.org
Subject: reduce task specific jvm arg

Hi,

Is there a way to set JVM parameters only for reduce tasks in Hadoop? Thanks,

Jun
IBM Almaden Research Center
K55/B1, 650 Harry Road, San Jose, CA 95120-6099
jun...@almaden.ibm.com
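For context, the only JVM-argument knob in this era is the shared one below, which applies to map and reduce children alike; HADOOP-5684 asks for a reduce-only variant. The -Xmx value here is illustrative:

  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx512m</value>
    <description>Java options passed to each map and reduce child JVM alike; no reduce-only variant exists yet.</description>
  </property>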
RE: Issue distcp'ing from 0.19.2 to 0.18.3
Bryan,

> hftp://ds-nn1:7276 hdfs://ds-nn2:7276

Are you using the same port number for hftp and hdfs? Looking at the stack trace, it seems like it failed before starting a distcp job.

Koji

-----Original Message-----
From: Bryan Duxbury [mailto:br...@rapleaf.com]
Sent: Wednesday, April 08, 2009 11:40 PM
To: core-user@hadoop.apache.org
Subject: Issue distcp'ing from 0.19.2 to 0.18.3

Hey all,

I was trying to copy some data from our cluster on 0.19.2 to a new cluster on 0.18.3 by using distcp and the hftp:// filesystem. Everything seemed to be going fine for a few hours, but then a few tasks failed because a few files got 500 errors when being read from the 19 cluster. As a result the job died. Now that I'm trying to restart it, I get this error:

  [rapl...@ds-nn2 ~]$ hadoop distcp hftp://ds-nn1:7276/ hdfs://ds-nn2:7276/cluster-a
  09/04/08 23:32:39 INFO tools.DistCp: srcPaths=[hftp://ds-nn1:7276/]
  09/04/08 23:32:39 INFO tools.DistCp: destPath=hdfs://ds-nn2:7276/cluster-a
  With failures, global counters are inaccurate; consider running with -i
  Copy failed: java.net.SocketException: Unexpected end of file from server
          at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:769)
          at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:632)
          at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:766)
          at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:632)
          at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1000)
          at org.apache.hadoop.dfs.HftpFileSystem$LsParser.fetchList(HftpFileSystem.java:183)
          at org.apache.hadoop.dfs.HftpFileSystem$LsParser.getFileStatus(HftpFileSystem.java:193)
          at org.apache.hadoop.dfs.HftpFileSystem.getFileStatus(HftpFileSystem.java:222)
          at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:667)
          at org.apache.hadoop.tools.DistCp.checkSrcPath(DistCp.java:588)
          at org.apache.hadoop.tools.DistCp.copy(DistCp.java:609)
          at org.apache.hadoop.tools.DistCp.run(DistCp.java:768)
          at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
          at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
          at org.apache.hadoop.tools.DistCp.main(DistCp.java:788)

I changed nothing at all between the first attempt and the subsequent failed attempts. The only clue in the namenode log for the 19 cluster is:

  2009-04-08 23:29:09,786 WARN org.apache.hadoop.ipc.Server: Incorrect header or version mismatch from 10.100.50.252:47733 got version 47 expected version 2

Anyone have any ideas?

-Bryan
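A likely reading of Koji's hint: hftp speaks HTTP to the namenode's web port (dfs.http.address, default 50070), while hdfs:// uses the IPC port, and pointing an HTTP client at the IPC port is consistent with both the "Unexpected end of file" and the version-mismatch warning above. A sketch of the corrected invocation, assuming the default HTTP port on the source cluster (the real value comes from dfs.http.address there):

  hadoop distcp hftp://ds-nn1:50070/ hdfs://ds-nn2:7276/cluster-a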
RE: Very assymetric data allocation
Marcus,

One known issue in 0.18.3 is HADOOP-5465.

Copy-paste from https://issues.apache.org/jira/browse/HADOOP-4489?focusedCommentId=12693956&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12693956

Hairong said: "This bug might be caused by HADOOP-5465. Once a datanode hits HADOOP-5465, NameNode sends an empty replication request to the data node on every reply to a heartbeat, thus not a single scheduled block deletion request can be sent to the data node."

(Also, if you're always writing from one of the nodes, that node is more likely to get full.)

Nigel, not sure if this is the issue, but it would be nice to have 0.18.4 out.

Koji

-----Original Message-----
From: Marcus Herou [mailto:marcus.he...@tailsweep.com]
Sent: Tuesday, April 07, 2009 12:45 AM
To: hadoop-u...@lucene.apache.org
Subject: Very assymetric data allocation

Hi.

We are running Hadoop 0.18.3 and noticed a strange issue when one of our machines ran out of disk yesterday. The table below shows that the server mapredcoord is 66.91% allocated while the others are almost empty. How can that be? Any information about this would be very helpful. mapredcoord is also our jobtracker.

//Marcus

  Node         Last Contact  Admin State  Size (GB)  Used (%)  Remaining (GB)  Blocks
  mapredcoord  2             In Service   416.69     66.91     90.94           19806
  mapreduce2   2             In Service   416.69     6.71      303.54          456
  mapreduce3   2             In Service   416.69     0.44      351.69          3975
  mapreduce4   0             In Service   416.69     0.25      355.82          1549
  mapreduce5   2             In Service   416.69     0.42      347.68          3995
  mapreduce6   0             In Service   416.69     0.43      352.7           3982
  mapreduce7   0             In Service   416.69     0.5       351.91          4079
  mapreduce8   1             In Service   416.69     0.48      350.15          4169

--
Marcus Herou
CTO and co-founder Tailsweep AB
+46702561312
marcus.he...@tailsweep.com
http://www.tailsweep.com/
http://blogg.tailsweep.com/
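Separately from the JIRA, a skew like this can be evened out with the HDFS balancer, which ships with this era of Hadoop; the threshold is the allowed deviation, in percentage points, of each datanode's utilization from the cluster average. An illustrative invocation:

  hadoop balancer -threshold 10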
RE: Socket closed Exception
Hi Lohit,

My initial guess would be https://issues.apache.org/jira/browse/HADOOP-4040

When this happened on our 0.17 cluster, all of our (task) clients were using the max idle time of 1 hour due to this bug, instead of the configured value of a few seconds. Thus each client kept the connection up much longer than we expected. (Not sure if this applies to your 0.15 cluster, but it sounds similar to what we observed.)

This worked until the namenode started hitting the max limit of ipc.client.idlethreshold:

  <property>
    <name>ipc.client.idlethreshold</name>
    <value>4000</value>
    <description>Defines the threshold number of connections after which connections will be inspected for idleness.</description>
  </property>

When inspecting for idleness, the namenode uses:

  <property>
    <name>ipc.client.maxidletime</name>
    <value>12</value>
    <description>Defines the maximum idle time for a connected client after which it may be disconnected.</description>
  </property>

As a result, many connections got disconnected at once. Clients only see the timeouts when they try to reuse those sockets the next time and wait for 1 minute. That's why the failures are not exactly at the same time, but *almost* the same time.

# If this solves your problem, Raghu should get the credit. He spent so many hours solving this mystery for us. :)

Koji

-----Original Message-----
From: lohit [mailto:lohit...@yahoo.com]
Sent: Sunday, March 29, 2009 11:56 AM
To: core-user@hadoop.apache.org
Subject: Socket closed Exception

Recently we are seeing a lot of "Socket closed" exceptions in our cluster. Many tasks' open/create/getFileInfo calls get back a 'SocketException' with message 'Socket closed'. We seem to see many tasks fail with the same error around the same time. There are no warning or info messages in the NameNode/TaskTracker/Task logs. (This is on HDFS 0.15.) Are there cases where the NameNode closes sockets due to heavy load or contention for resources of any kind?

Thanks,
Lohit
RE: corrupt unreplicated block in dfs (0.18.3)
Mike, you might want to look at the -move option in fsck.

  bash-3.00$ hadoop fsck
  Usage: DFSck <path> [-move | -delete | -openforwrite] [-files [-blocks [-locations | -racks]]]
          <path>          start checking from this path
          -move           move corrupted files to /lost+found
          -delete         delete corrupted files
          -files          print out files being checked
          -openforwrite   print out files opened for write
          -blocks         print out block report
          -locations      print out locations for every block
          -racks          print out network topology for data-node locations

I never use it since I would rather have users' jobs fail than have jobs succeed with incomplete inputs.

Koji

-----Original Message-----
From: Aaron Kimball [mailto:aa...@cloudera.com]
Sent: Thursday, March 26, 2009 9:41 AM
To: core-user@hadoop.apache.org
Subject: Re: corrupt unreplicated block in dfs (0.18.3)

Just because a block is corrupt doesn't mean the entire file is corrupt. Furthermore, the presence/absence of a file in the namespace is a completely separate issue from the data in the file. I think it would be a surprising interface change if files suddenly disappeared just because 1 out of potentially many blocks was corrupt.

- Aaron

On Thu, Mar 26, 2009 at 1:21 PM, Mike Andrews <m...@xoba.com> wrote:

I noticed that when a file with no replication (i.e., replication=1) develops a corrupt block, Hadoop takes no action aside from the datanode throwing an exception to the client trying to read the file. I manually corrupted a block in order to observe this. Obviously, with replication=1 it's impossible to fix the block, but I thought perhaps Hadoop would take some other action, such as deleting the file outright, moving it to a "corrupt" directory, or marking it or keeping track of it somehow to note that there's un-fixable corruption in the filesystem. Thus, the current behaviour seems to sweep the corruption under the rug and allow its continued existence, aside from notifying the specific client doing the read with an exception. If anyone has any information about this issue or how to work around it, please let me know. On the other hand, I tested that corrupting a block in a replication=3 file causes Hadoop to re-replicate the block from another existing copy, which is good and is what I expected.

Best,
Mike

--
permanent contact information at http://mikerandrews.com
RE: streaming error when submitting the job: Cannot run program "chmod": java.io.IOException: error=12, Cannot allocate memory
Shixing,

The discussion on https://issues.apache.org/jira/browse/HADOOP-5059 may be related.

Koji

-----Original Message-----
From: shixing [mailto:paradise...@gmail.com]
Sent: Wednesday, March 11, 2009 1:31 AM
To: core-user@hadoop.apache.org
Subject: streaming error when submitting the job: Cannot run program "chmod": java.io.IOException: error=12, Cannot allocate memory

  09/03/11 15:43:55 ERROR streaming.StreamJob: Error Launching job : java.io.IOException: Cannot run program "chmod": java.io.IOException: error=12, Cannot allocate memory
          at java.lang.ProcessBuilder.start(ProcessBuilder.java:459)
          at org.apache.hadoop.util.Shell.runCommand(Shell.java:149)
          at org.apache.hadoop.util.Shell.run(Shell.java:134)
          at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:286)
          at org.apache.hadoop.util.Shell.execCommand(Shell.java:338)
          at org.apache.hadoop.fs.RawLocalFileSystem.execCommand(RawLocalFileSystem.java:480)
          at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:472)
          at org.apache.hadoop.fs.FilterFileSystem.setPermission(FilterFileSystem.java:274)
          at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:364)
          at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:487)
          at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:468)
          at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:375)
          at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:208)
          at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:142)
          at org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:1238)
          at org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:1219)
          at org.apache.hadoop.mapred.JobInProgress.<init>(JobInProgress.java:247)
          at org.apache.hadoop.mapred.JobTracker.submitJob(JobTracker.java:2426)
          at sun.reflect.GeneratedMethodAccessor16.invoke(Unknown Source)
          at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
          at java.lang.reflect.Method.invoke(Method.java:597)
          at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:467)
          at org.apache.hadoop.ipc.Server$Handler.run(Server.java:902)
  Caused by: java.io.IOException: java.io.IOException: error=12, Cannot allocate memory
          at java.lang.UNIXProcess.<init>(UNIXProcess.java:148)
          at java.lang.ProcessImpl.start(ProcessImpl.java:65)
          at java.lang.ProcessBuilder.start(ProcessBuilder.java:452)
          ... 22 more

And when I resubmit the job, it succeeds!

--
Best wishes!
My Friend~
RE: Potential race condition (Hadoop 18.3)
Ryan,

If you're using getOutputPath, try replacing it with getWorkOutputPath.

http://hadoop.apache.org/core/docs/r0.18.3/api/org/apache/hadoop/mapred/FileOutputFormat.html#getWorkOutputPath(org.apache.hadoop.mapred.JobConf)

Koji

-----Original Message-----
From: Ryan Shih [mailto:ryan.s...@gmail.com]
Sent: Monday, March 02, 2009 11:01 AM
To: core-user@hadoop.apache.org
Subject: Potential race condition (Hadoop 18.3)

Hi - I'm not sure yet, but I think I might be hitting a race condition in Hadoop 18.3. What seems to happen is that in the reduce phase, some of my tasks perform speculative execution, but when the initial task completes successfully, it sends a kill to the newly started task. After all is said and done, perhaps one in every five or ten jobs that kill their second task ends up with zero or truncated output. When I code it to turn off speculative execution, the problem goes away. Are there known race conditions I should be aware of in this area?

Thanks in advance,
Ryan
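A minimal sketch of the pattern Koji points at, against the 0.18 mapred API: side files written under the task attempt's work output path are promoted to mapred.output.dir only when the attempt commits, which is what keeps killed speculative attempts from clobbering output. The helper class and file name here are illustrative:

  import java.io.IOException;
  import org.apache.hadoop.fs.FSDataOutputStream;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.mapred.FileOutputFormat;
  import org.apache.hadoop.mapred.JobConf;

  public class SideFileUtil {
    // Create a side file in the task-attempt-private work directory.
    // The framework renames it into mapred.output.dir on task commit.
    public static FSDataOutputStream createSideFile(JobConf job, String name) throws IOException {
      Path workDir = FileOutputFormat.getWorkOutputPath(job); // not getOutputPath
      FileSystem fs = workDir.getFileSystem(job);
      return fs.create(new Path(workDir, name));
    }
  }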
RE: how can I decommission nodes on-the-fly?
+1

Created a Jira: https://issues.apache.org/jira/browse/HADOOP-4733

Koji

Steve Loughran wrote:

> At some point in the future, I could imagine it being handy to have the ability to decommission a tasktracker, which would tell it to stop accepting new work and run the rest down. This would be good when tasks take time to run but you still want to be agile in your cluster management.
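For the datanode side of the original question, on-the-fly decommissioning already exists in this era: list the host in an exclude file named by dfs.hosts.exclude and ask the namenode to re-read it. A sketch, with an illustrative file path:

  <property>
    <name>dfs.hosts.exclude</name>
    <value>/home/hadoop/conf/excludes</value>
    <description>File naming hosts to decommission, one hostname per line.</description>
  </property>

Then, after adding the datanode's hostname to that file:

  hadoop dfsadmin -refreshNodes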
RE: Cannot run program "bash": java.io.IOException: error=12, Cannot allocate memory
We had a similar issue before, with the secondary namenode failing with:

  2008-10-09 02:00:58,288 ERROR org.apache.hadoop.dfs.NameNode.Secondary: java.io.IOException: javax.security.auth.login.LoginException: Login failed: Cannot run program "whoami": java.io.IOException: error=12, Cannot allocate memory

In our case, simply increasing the swap space fixed the problem.

http://hudson.gotdns.com/wiki/display/HUDSON/IOException+Not+enough+space

When checking with strace, it was failing at

  [pid 7927] clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x4133c9f0) = -1 ENOMEM (Cannot allocate memory)

without CLONE_VM. From the clone man page: "If CLONE_VM is not set, the child process runs in a separate copy of the memory space of the calling process at the time of clone. Memory writes or file mappings/unmappings performed by one of the processes do not affect the other, as with fork(2)."

Koji

-----Original Message-----
From: Brian Bockelman [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, November 18, 2008 3:12 PM
To: core-user@hadoop.apache.org
Subject: Re: Cannot run program "bash": java.io.IOException: error=12, Cannot allocate memory

Hey Xavier,

Don't forget, the Linux kernel reserves the memory; current heap space is disregarded. How much heap space do your datanode and tasktracker get? (PS: overcommit_ratio is disregarded if overcommit_memory=2.) You also have to remember that there is some overhead from the OS, the Java code cache, and a bit from running the JVM. Add at least 64 MB per JVM for code cache and running, and we get 400MB of memory left for the OS and any other processes. You're definitely running out of memory. Either allow overcommitting (which will mean Java is no longer locked out of swap) or reduce memory consumption.

Brian

On Nov 18, 2008, at 4:57 PM, Xavier Stevens wrote:

1) It doesn't look like I'm out of memory, but it is coming really close.
2) overcommit_memory is set to 2, overcommit_ratio = 100

As for the JVM, I am using Java 1.6.

**Note of interest**: the virtual memory I see allocated in top for each task is more than what I am specifying in the hadoop job/site configs. Currently each physical box has 16 GB of memory. When idle, I see the datanode and tasktracker using:

               RES    VIRT
  Datanode     145m   1408m
  Tasktracker  206m   1439m

So taking that into account, I do 16000 MB - (1408+1439) MB, which leaves me with about 13200 MB. In my old settings I was using 8 map tasks, so 13200 / 8 = 1650 MB. My mapred.child.java.opts is -Xmx1536m, which should leave me a little headroom. When running, though, I see some tasks reporting 1900m.

-Xavier

-----Original Message-----
From: Brian Bockelman [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, November 18, 2008 2:42 PM
To: core-user@hadoop.apache.org
Subject: Re: Cannot run program "bash": java.io.IOException: error=12, Cannot allocate memory

Hey Xavier,

1) Are you out of memory (dumb question, but it doesn't hurt to ask...)? What does Ganglia tell you about the node?
2) Do you have /proc/sys/vm/overcommit_memory set to 2? Telling Linux not to overcommit memory on Java 1.5 JVMs can be very problematic. Java 1.5 asks for min heap size + 1 GB of reserved, non-swap memory on Linux systems by default. The 1GB of reserved, non-swap memory is used for the JIT to compile code; this bug wasn't fixed until later Java 1.5 updates.

Brian

On Nov 18, 2008, at 4:32 PM, Xavier Stevens wrote:

I'm still seeing this problem on a cluster using Hadoop 0.18.2. I tried dropping the max number of map tasks per node from 8 to 7. I still get the error, although it's less frequent. But I don't get the error at all when using Hadoop 0.17.2. Does anyone have any suggestions?

-Xavier

-----Original Message-----
From: [EMAIL PROTECTED] On Behalf Of Edward J. Yoon
Sent: Thursday, October 09, 2008 2:07 AM
To: core-user@hadoop.apache.org
Subject: Re: Cannot run program "bash": java.io.IOException: error=12, Cannot allocate memory

Thanks Alexander!!

On Thu, Oct 9, 2008 at 4:49 PM, Alexander Aristov <[EMAIL PROTECTED]> wrote:

I received such errors when I overloaded data nodes. You may increase swap space or run fewer tasks.

Alexander

2008/10/9 Edward J. Yoon <[EMAIL PROTECTED]>

Hi, I received the message below. Can anyone explain it?

  08/10/09 11:53:33 INFO mapred.JobClient: Task Id : task_200810081842_0004_m_00_0, Status : FAILED
  java.io.IOException: Cannot run program "bash": java.io.IOException: error=12, Cannot allocate memory
          at java.lang.ProcessBuilder.start(ProcessBuilder.java:459)
          at org.apache.hadoop.util.Shell.runCommand(Shell.java:149)
          at org.apache.hadoop.util.Shell.run(Shell.java:134)
          at org.apache.hadoop.fs.DF.getAvailable(DF.java:73)
          at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:296)
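The overcommit settings discussed in this thread live under /proc/sys/vm; a quick way to inspect them and, where appropriate, relax them (mode 0 is the heuristic default that lets a large JVM fork small helpers like bash/whoami/chmod):

  # 0 = heuristic overcommit, 1 = always overcommit, 2 = never overcommit
  cat /proc/sys/vm/overcommit_memory
  cat /proc/sys/vm/overcommit_ratio

  # as root: return to heuristic overcommit
  sysctl -w vm.overcommit_memory=0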
RE: Specify per file replication factor in dfs -put command line
Try:

  hadoop dfs -D dfs.replication=2 -put abc bcd

Koji

-----Original Message-----
From: Kevin [mailto:[EMAIL PROTECTED]]
Sent: Friday, August 29, 2008 11:11 AM
To: core-user@hadoop.apache.org
Subject: Specify per file replication factor in dfs -put command line

Hi,

Does anyone happen to know how to specify the replication factor of a file when I upload it with the hadoop dfs -put command? Thank you!

Best,
-Kevin
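The replication factor of a file already in HDFS can also be changed after the fact with setrep; -R recurses into directories. Paths here are illustrative:

  hadoop dfs -setrep 2 bcd
  hadoop dfs -setrep -R 2 /user/kevin/somedir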
RE: HDFS -rmr permissions
Hi Brian,

I believe dfs -rmr does check the permission of each file. What's allowing you to delete other users' data is the trash feature: each user's Trash is expunged by the namenode process, which is a superuser. More discussion in http://issues.apache.org/jira/browse/HADOOP-2514

My guess is that what we really need is a 'sticky bit' that won't allow dfs -mv of files/directories under a dir with 777 permission. I couldn't find a Jira, so I opened a new one: https://issues.apache.org/jira/browse/HADOOP-3953

Koji

===
  (userB) hadoop dfs -ls / | grep ' /tmp'
  drwxrwxrwx   - knoguchi supergroup          0 2008-08-14 16:47 /tmp
  (userB) hadoop dfs -Dfs.trash.interval=0 -ls /tmp
  Found 1 items
  drwxr-xr-x   - userA users          0 2008-08-14 16:45 /tmp/userA-dir
  (userB) hadoop dfs -Dfs.trash.interval=0 -lsr /tmp
  drwxr-xr-x   - userA users          0 2008-08-14 16:45 /tmp/userA-dir
  drwxr-xr-x   - userA users          0 2008-08-14 16:45 /tmp/userA-dir/foo1
  -rw-r--r--   1 userA users         13 2008-08-14 16:45 /tmp/userA-dir/foo1/a
  -rw-r--r--   1 userA users         15 2008-08-14 16:45 /tmp/userA-dir/foo1/b
  -rw-r--r--   1 userA users         25 2008-08-14 16:45 /tmp/userA-dir/foo1/c
  (userB) hadoop dfs -Dfs.trash.interval=0 -rmr /tmp/userA-dir
  rmr: org.apache.hadoop.fs.permission.AccessControlException: Permission denied: user=userB, access=ALL, inode="userA-dir":userA:users:rwxr-xr-x
  (userB) hadoop dfs -Dfs.trash.interval=1 -rmr /tmp/userA-dir
  Moved to trash: hdfs://ucdev13.inktomisearch.com:47522/tmp/userA-dir
===

-----Original Message-----
From: Brian Karlak [mailto:[EMAIL PROTECTED]]
Sent: Thursday, August 07, 2008 11:27 AM
To: core-user@hadoop.apache.org
Cc: Colin Evans
Subject: HDFS -rmr permissions

Hello --

As far as I can tell, hadoop dfs -rmr only checks the permissions of the directory to be deleted and its parent. Unlike Unix, however, it does not seem to check the permissions of the directories/files contained within the directory to be deleted.

Is this by design? It seems dangerous. For instance, we have a directory where we want to allow people to deposit common resources for a project. Its permissions need to be 777, otherwise only one person can write to it. But with 777 permissions, anyone can accidentally wipe it. (Of course, if we have /trash set up, accidental deletes are not as big a deal, but still ...)

Thoughts / comments? Is there a way to make -rmr check the permissions of the files within the directories it's deleting, just as Unix does? If not, is this a legitimate feature request? (I checked JIRA, but I didn't find anything on this ...)

Thanks,
Brian
RE: MapReduce with multi-languages
Hi,

I asked Runping about this. Here's his reply.

Koji

=====
On 7/10/08 11:16 PM, Koji Noguchi [EMAIL PROTECTED] wrote:

> Runping,
> Can they use the Buffer class?
> Koji

Yes, use Buffer or BytesWritable for the key/value classes. But the critical point is to implement their own record reader / input format classes.

Runping
=====

-----Original Message-----
From: NOMURA Yoshihide [mailto:[EMAIL PROTECTED]]
Sent: Thursday, July 10, 2008 10:36 PM
To: core-user@hadoop.apache.org
Subject: Re: MapReduce with multi-languages

Mr. Taeho Kang,

I need to analyze text in different character encodings too, and I have suggested supporting an encoding configuration in TextInputFormat: https://issues.apache.org/jira/browse/HADOOP-3481

But I think you should convert the text file encoding to UTF-8 for the present.

Regards,

Taeho Kang wrote:

Dear Hadoop User Group,

What are elegant ways to run mapred jobs on text-based data encoded with something other than UTF-8? It looks like Hadoop assumes the text data is always UTF-8 and handles it that way - encoding with UTF-8 and decoding with UTF-8 - and problems arise whenever the data is not UTF-8 encoded. Here is what I'm thinking of doing to clear up the situation; correct and advise me if my approaches look bad!

(1) Re-encode the original data with UTF-8?
(2) Replace the parts of the source code where the UTF-8 encoder and decoder are used?

Has anyone here had trouble running map-red jobs on data in multiple languages? Any suggestions/advice are welcome and appreciated!

Regards,
Taeho

--
NOMURA Yoshihide:
Software Innovation Laboratory, Fujitsu Labs. Ltd., Japan
Tel: 044-754-2675 (Ext: 7106-6916)
Fax: 044-754-2570 (Ext: 7108-7060)
E-Mail: [EMAIL PROTECTED]
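A sketch of the byte-level decoding Runping's suggestion implies, against the 0.18 mapred API: Text is only a byte container, so a mapper can decode the raw bytes with the real charset and re-emit UTF-8. This works for encodings whose newline byte is unambiguous (e.g. EUC-*, Shift_JIS in most cases; not UTF-16); the charset name below is an illustrative assumption:

  import java.io.IOException;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.MapReduceBase;
  import org.apache.hadoop.mapred.Mapper;
  import org.apache.hadoop.mapred.OutputCollector;
  import org.apache.hadoop.mapred.Reporter;

  public class DecodeMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, Text> {
    public void map(LongWritable offset, Text line,
                    OutputCollector<Text, Text> out, Reporter reporter) throws IOException {
      // Text holds the raw input bytes; decode them with the real charset,
      // then re-emit as UTF-8 (Text's native encoding).
      String decoded = new String(line.getBytes(), 0, line.getLength(), "EUC-KR");
      out.collect(new Text(decoded), new Text(""));
    }
  }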