Re: Not a host:port pair when running balancer
Please try using the port number 8020. Hairong On 3/11/09 9:42 AM, Stuart White stuart.whi...@gmail.com wrote: I've been running hadoop-0.19.0 for several weeks successfully. Today, for the first time, I tried to run the balancer, and I'm receiving: java.lang.RuntimeException: Not a host:port pair: hvcwydev0601 In my hadoop-site.xml, I have this: <property> <name>fs.default.name</name> <value>hdfs://hvcwydev0601/</value> </property> What do I need to change to get the balancer to work? It seems I need to add a port to fs.default.name. If so, what port? Can I just pick any port? If I specify a port, do I need to specify any other parms accordingly? I searched the forum, and found a few posts on this topic, but it seems that the configuration parms have changed over time, so I'm not sure what the current correct configuration is. Also, if fs.default.name is supposed to have a port, I'll point out that the docs don't say so: http://hadoop.apache.org/core/docs/r0.19.1/cluster_setup.html The example given for fs.default.name is hdfs://hostname/. Thanks!
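For reference, a sketch of the corrected hadoop-site.xml entry with an explicit port (8020 is the port Hairong suggests; the hostname is the asker's own):

```xml
<property>
  <name>fs.default.name</name>
  <value>hdfs://hvcwydev0601:8020/</value>
</property>
```

Any free port works as long as every daemon and client in the cluster uses the same value; the cluster needs a restart after the change.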
Re: Question about HDFS capacity and remaining
It's taken by non-dfs files. Hairong On 1/29/09 3:23 PM, Bryan Duxbury br...@rapleaf.com wrote: Hey all, I'm currently installing a new cluster, and noticed something a little confusing. My DFS is *completely* empty - 0 files in DFS. However, in the namenode web interface, the reported capacity is 3.49 TB, but the remaining is 3.25TB. Where'd that .24TB go? There are literally zero other files on the partitions hosting the DFS data directories. Where am I losing 240GB? -Bryan
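The gap is simply non-DFS usage (local-filesystem overhead, logs, and other non-DFS files) subtracted from raw capacity. A back-of-envelope check with the numbers from the report above (treated as exact for illustration):

```shell
# non-DFS used = capacity - DFS used - remaining (all in TB)
capacity=3.49
dfs_used=0
remaining=3.25
awk -v c="$capacity" -v u="$dfs_used" -v r="$remaining" \
  'BEGIN { printf "%.2f TB taken by non-DFS files\n", c - u - r }'
```

This prints 0.24 TB, matching the "missing" 240GB.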
Re: hadoop balanceing data
%Remaining fluctuates much more than %dfs used, because dfs shares the disks with mapred, and mapred tasks may temporarily use a lot of disk. So trying to keep the same %free is impossible most of the time. Hairong On 1/19/09 10:28 PM, Billy Pearson sa...@pearsonwholesale.com wrote: Why do we not use the Remaining % in place of the Used % when we are selecting a datanode for new data and when running the balancer? From what I can tell we are using the % used and we do not factor in non-DFS used at all. I see a datanode with only a 60GB hard drive fill up completely 100% before the other servers that have 130+GB hard drives get half full. Seems like trying to keep the same % free on the drives in the cluster would be more optimal in production. I know this still may not be perfect but would be nice if we tried. Billy
Re: getting HDFS to rack-aware mode
Using the -w option for the set replication command will wait until replication is done. Then run fsck to check whether all blocks are on at least two racks. Hairong On 10/14/08 12:06 PM, Sriram Rao [EMAIL PROTECTED] wrote: Hi, We have a cluster where we are running HDFS in non-rack-aware mode. Now we want to switch HDFS to run in rack-aware mode. Apart from the config changes (and restarting HDFS), to rackify the existing data we were thinking of increasing/decreasing the replication level a few times to get the data spread. Are there any tools that will enable us to know when we are done? Sriram
RE: Could not get block locations. Aborting... exception
Does your failed map task open a lot of files to write? Could you please check the log of the datanode running on the machine where the map tasks failed? Do you see any error message containing exceeds the limit of concurrent xcievers? Hairong From: Bryan Duxbury [mailto:[EMAIL PROTECTED]] Sent: Fri 9/26/2008 4:36 PM To: core-user@hadoop.apache.org Subject: Could not get block locations. Aborting... exception Hey all. We've been running into a very annoying problem pretty frequently lately. We'll be running some job, for instance a distcp, and it'll be moving along quite nicely, until all of a sudden it sort of freezes up. It takes a while, and then we'll get an error like this one: attempt_200809261607_0003_m_02_0: Exception closing file /tmp/dustin/input/input_dataunits/_distcp_tmp_1dk90o/part-01897.bucketfile attempt_200809261607_0003_m_02_0: java.io.IOException: Could not get block locations. Aborting... attempt_200809261607_0003_m_02_0: at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2143) attempt_200809261607_0003_m_02_0: at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.access$1400(DFSClient.java:1735) attempt_200809261607_0003_m_02_0: at org.apache.hadoop.dfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:1889) At approximately the same time, we start seeing lots of these errors in the namenode log: 2008-09-26 16:19:26,502 WARN org.apache.hadoop.dfs.StateChange: DIR* NameSystem.startFile: failed to create file /tmp/dustin/input/input_dataunits/_distcp_tmp_1dk90o/part-01897.bucketfile for DFSClient_attempt_200809261607_0003_m_02_1 on client 10.100.11.83 because current leaseholder is trying to recreate file.
2008-09-26 16:19:26,502 INFO org.apache.hadoop.ipc.Server: IPC Server handler 8 on 7276, call create(/tmp/dustin/input/input_dataunits/_distcp_tmp_1dk90o/part-01897.bucketfile, rwxr-xr-x, DFSClient_attempt_200809261607_0003_m_02_1, true, 3, 67108864) from 10.100.11.83:60056: error: org.apache.hadoop.dfs.AlreadyBeingCreatedException: failed to create file /tmp/dustin/input/input_dataunits/_distcp_tmp_1dk90o/part-01897.bucketfile for DFSClient_attempt_200809261607_0003_m_02_1 on client 10.100.11.83 because current leaseholder is trying to recreate file. org.apache.hadoop.dfs.AlreadyBeingCreatedException: failed to create file /tmp/dustin/input/input_dataunits/_distcp_tmp_1dk90o/part-01897.bucketfile for DFSClient_attempt_200809261607_0003_m_02_1 on client 10.100.11.83 because current leaseholder is trying to recreate file. at org.apache.hadoop.dfs.FSNamesystem.startFileInternal(FSNamesystem.java:952) at org.apache.hadoop.dfs.FSNamesystem.startFile(FSNamesystem.java:903) at org.apache.hadoop.dfs.NameNode.create(NameNode.java:284) at sun.reflect.GeneratedMethodAccessor8.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:452) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:888) Eventually, the job fails because of these errors. Subsequent job runs also experience this problem and fail. The only way we've been able to recover is to restart the DFS. It doesn't happen every time, but it does happen often enough that I'm worried. Does anyone have any ideas as to why this might be happening? I thought that https://issues.apache.org/jira/browse/HADOOP-2669 might be the culprit, but today we upgraded to hadoop 0.18.1 and the problem still happens. Thanks, Bryan
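The limit Hairong mentions is the datanode's cap on concurrent block transceiver threads. In 0.18-era releases this is governed by the dfs.datanode.max.xcievers property (note the historical misspelling), and the usual workaround for jobs that write many files at once is to raise it in hadoop-site.xml. The value below is only illustrative; verify the property name and a sensible value against your release:

```xml
<property>
  <name>dfs.datanode.max.xcievers</name>
  <value>1024</value>
</property>
```

Datanodes must be restarted for the change to take effect.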
Re: Unknown protocol to name node: JobSubmissionProtocol
JobClient is supposed to talk to a JobTracker, but the stack trace shows that it talked to a namenode. Could you check your configuration to see if the jobtracker port # was set to be the same as the namenode port #? Hairong On 7/30/08 6:56 AM, Arv Mistry [EMAIL PROTECTED] wrote: Can anyone provide any hints as to why this might be happening? I have hadoop running all processes on one machine (for trouble-shooting), and when I go to submit a job from another machine I get the following exception: INFO | jvm 2| 2008/07/30 06:05:05 | 2008-07-30 06:05:05,117 ERROR [HadoopJobTool] java.io.IOException: Unknown protocol to name node: org.apache.hadoop.mapred.JobSubmissionProtocol INFO | jvm 2| 2008/07/30 06:05:05 | at org.apache.hadoop.dfs.NameNode.getProtocolVersion(NameNode.java:84) INFO | jvm 2| 2008/07/30 06:05:05 | at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source) INFO | jvm 2| 2008/07/30 06:05:05 | at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) INFO | jvm 2| 2008/07/30 06:05:05 | at java.lang.reflect.Method.invoke(Method.java:597) INFO | jvm 2| 2008/07/30 06:05:05 | at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:446) INFO | jvm 2| 2008/07/30 06:05:05 | at org.apache.hadoop.ipc.Server$Handler.run(Server.java:896) INFO | jvm 2| 2008/07/30 06:05:05 | INFO | jvm 2| 2008/07/30 06:05:05 | org.apache.hadoop.ipc.RemoteException: java.io.IOException: Unknown protocol to name node: org.apache.hadoop.mapred.JobSubmissionProtocol INFO | jvm 2| 2008/07/30 06:05:05 | at org.apache.hadoop.dfs.NameNode.getProtocolVersion(NameNode.java:84) INFO | jvm 2| 2008/07/30 06:05:05 | at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source) INFO | jvm 2| 2008/07/30 06:05:05 | at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) INFO | jvm 2| 2008/07/30 06:05:05 | at java.lang.reflect.Method.invoke(Method.java:597) INFO | jvm 2| 2008/07/30 06:05:05 | at
org.apache.hadoop.ipc.RPC$Server.call(RPC.java:446) INFO | jvm 2| 2008/07/30 06:05:05 | at org.apache.hadoop.ipc.Server$Handler.run(Server.java:896) INFO | jvm 2| 2008/07/30 06:05:05 | INFO | jvm 2| 2008/07/30 06:05:05 | at org.apache.hadoop.ipc.Client.call(Client.java:557) INFO | jvm 2| 2008/07/30 06:05:05 | at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:212) INFO | jvm 2| 2008/07/30 06:05:05 | at $Proxy4.getProtocolVersion(Unknown Source) INFO | jvm 2| 2008/07/30 06:05:05 | at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:313) INFO | jvm 2| 2008/07/30 06:05:05 | at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:300) INFO | jvm 2| 2008/07/30 06:05:05 | at org.apache.hadoop.mapred.JobClient.createRPCProxy(JobClient.java:383) INFO | jvm 2| 2008/07/30 06:05:05 | at org.apache.hadoop.mapred.JobClient.init(JobClient.java:376) INFO | jvm 2| 2008/07/30 06:05:05 | at org.apache.hadoop.mapred.JobClient.init(JobClient.java:346) INFO | jvm 2| 2008/07/30 06:05:05 | at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:958) INFO | jvm 2| 2008/07/30 06:05:05 | at com.rialto.profiler.profiler.clickstream.hadoop.HadoopJobTool.run(HadoopJobTool.java:129) INFO | jvm 2| 2008/07/30 06:05:05 | at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) INFO | jvm 2| 2008/07/30 06:05:05 | at com.rialto.profiler.profiler.clickstream.hadoop.HadoopJobTool.launchJob(HadoopJobTool.java:142) INFO | jvm 2| 2008/07/30 06:05:05 | at com.rialto.profiler.profiler.clickstream.RawStreamGenerator.run(RawStreamGenerator.java:138)
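Hairong's diagnosis is that mapred.job.tracker points at the namenode's address, so the JobClient's JobSubmissionProtocol RPC lands on the NameNode. A sketch of a hadoop-site.xml with the two services on distinct ports (hostname and ports here are made up for illustration):

```xml
<property>
  <name>fs.default.name</name>
  <value>hdfs://master:8020/</value>
</property>
<property>
  <name>mapred.job.tracker</name>
  <value>master:8021</value>
</property>
```

The machine submitting the job must carry the same configuration as the cluster, otherwise its JobClient will connect to whatever address it finds locally.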
Re: utility to get block locations for a HDFS file
Try bin/hadoop fsck with the -files -blocks -locations options; that dumps the block locations for each file under the given path. Hairong On 7/30/08 8:23 AM, Jun Rao [EMAIL PROTECTED] wrote: Hi, Is there a Hadoop utility that takes a directory and dumps the block locations for each file in that directory to a text output? Thanks, Jun IBM Almaden Research Center K55/B1, 650 Harry Road, San Jose, CA 95120-6099 [EMAIL PROTECTED] (408)927-1886 (phone) (408)927-3215 (fax)
Re: how does one rebalance data nodes
If you set dfs.datanode.du.reserved to 10G, this guarantees that dfs won't use more than (the total partition space - 10G). In my opinion, dfs.datanode.du.pct is not of much use, so you can ignore it for now. Hairong On 5/29/08 8:32 AM, prasana.iyengar [EMAIL PROTECTED] wrote: 1. After adding new data nodes, is there a way to force a rebalance of the data blocks across the new nodes? We recently added 6 nodes to the cluster - the original 4 nodes seem to have 80+% hdfs usage. 2. In 0.16.0 I also have the following settings in hadoop-site.xml: dfs.datanode.du.reserved - 10G [default = 0] dfs.datanode.du.pct - 0.9f [default = 0.98f] Q: will this stop the fill-up of the data node @ 90% and/or 10G remaining on the partition [whichever is earlier]? thanks, -prasana
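A sketch of the reservation Hairong describes. In this era dfs.datanode.du.reserved takes a plain byte count (a 10G suffix may not parse), so 10 GB is written out as 10 * 1024^3 bytes; verify the accepted format against your release:

```xml
<property>
  <name>dfs.datanode.du.reserved</name>
  <value>10737418240</value>
</property>
```

With this set, DFS on each volume stops allocating once only 10 GB of the partition remains for non-DFS use.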
Re: reading a directory children in DFS?
Your code is trying to list a directory in the local file system. You should use the dfs handle instead: Path[] children = FileUtil.stat2Paths(dfs.listStatus(parentDirectoryPath)); Hairong On 5/20/08 8:17 AM, Deyaa Adranale [EMAIL PROTECTED] wrote: hello, i have a problem reading the children of a directory in the distributed file system of hadoop: when I read the results of the reduce, I know the output folder (which I have specified using JobConf), but I don't know the file names inside it, and I still do not know how to access them using Java code. I have tried this: Configuration conf = new Configuration(); FileSystem fs = FileSystem.get(conf); String outputDir = Path inDir = new Path(); if (!fs.exists(inDir)) throw new Exception("Directory does not exist"); File jInDir = new RawLocalFileSystem().pathToFile(inDir); String[] files = jInDir.list(); for (int i = 0; i < files.length; i++) { Path inFile = new Path(inDir, files[i]); but the files array is null, and I get a null pointer exception at files.length. any suggestions? I have searched the internet, wiki and the archive, but could not find anything useful. thanks for the help Deyaa
Re: About HDFS`s Certification and Authorization?
Release 0.15 does not have any permission/security control. Release 0.16 supports permission control. An initial design for user authentication is coming soon; a jira issue regarding this will be opened in the next couple of weeks. Please contribute if you have any ideas. Hairong On 5/16/08 1:32 AM, wangxiaowei [EMAIL PROTECTED] wrote: hi, all. I now use hadoop-0.15.3. Does its HDFS have the functionality of certification and authorization, so that one user can access just one part of HDFS and can't access other parts without permission? If it does, how can I implement it? Thanks a lot.
Re: Balancer not balancing 100%?
Please check the balancer user guide at http://issues.apache.org/jira/secure/attachment/12370966/BalancerUserGuide2.pdf. As stated in the document, a cluster is balanced iff |utilization(DNi) - average utilization| <= threshold for each datanode DNi. When you run the balancer, the default threshold is 10%. If you want the cluster to end up more balanced, you may use a smaller threshold. Good luck, Hairong On 5/12/08 10:30 AM, Ted Dunning [EMAIL PROTECTED] wrote: I think the balancer has a pretty lenient feeling about what balanced means. If you want to shave off the last slivers, try the trick of increasing replication on each file, one at a time, and then decreasing it after 30-60 seconds. You can do this at whatever rate your disk space limits you to (i.e. if your disk is 80% full, you can double the replication on 1/4 of your files without running out of disk). On 5/11/08 11:48 AM, Otis Gospodnetic [EMAIL PROTECTED] wrote: Oh, and on top of the above, I just observed that even though bin/hadoop balancer exits immediately and reports the cluster is fully balanced, I do see *very* few blocks (1-2 blocks per node) getting moved every time I run the balancer. It feels as if the balancer does actually find some blocks it could move around, moves them, but then quickly gets lazy and just exits, claiming the cluster is/was already balanced. I just ran the balancer about 10 times and each time it moved a couple of blocks and then exited. Makes me want to do ugly stuff like: for ((i=1; i = ; i++)); do echo $i; bin/hadoop balancer; done ...just to get to the point where all 4 nodes have the same number of blocks and thus the same percentage of disk used... Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Otis Gospodnetic [EMAIL PROTECTED] To: core-user@hadoop.apache.org Sent: Sunday, May 11, 2008 2:36:24 PM Subject: Balancer not balancing 100%? Hi, I have 4 identical nodes in a Hadoop cluster (all functioning as DNs).
One of the 4 nodes is a new node that I recently added. I ran the balancer a few times and it did move some of the blocks from the other 3 nodes to the new node. However, the 4 nodes are still not 100% balanced (according to the GUI), even though running bin/hadoop balancer says the cluster is balanced: Time Stamp Iteration# Bytes Already Moved Bytes Left To Move Bytes Being Moved The cluster is balanced. Exiting... Balancing took 666.0 milliseconds The 3 old DNs are about 60% full (around 24K blocks), while the 1 new DN is only about 50% full (around 21K blocks). I restarted the NN and re-ran the balancer, but got the same output: The cluster is balanced. Exiting... Is this a bug, or is it somehow possible for a cluster to be balanced yet have nodes with different numbers of blocks? Thanks, Otis
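Hairong's criterion explains Otis's observation: with the default 10% threshold, a 60%-vs-50% split already counts as balanced. A quick sketch of the check, using made-up per-node utilizations close to the numbers in the thread:

```shell
# Balanced iff |util(DNi) - avg| <= threshold for every datanode.
# Utilizations (percent) are illustrative, not from a real cluster.
utils="60 60 60 50"
threshold=10
echo "$utils" | awk -v t="$threshold" '{
  n = NF; sum = 0
  for (i = 1; i <= n; i++) sum += $i
  avg = sum / n                        # 57.5 for these sample numbers
  verdict = "balanced"
  for (i = 1; i <= n; i++) {
    d = $i - avg; if (d < 0) d = -d    # |util - avg|
    if (d > t) verdict = "not balanced"
  }
  print verdict
}'
```

With these numbers the script prints balanced; lowering the threshold to 5 makes the new node's 7.5-point deviation fail the test and prints not balanced, which is why a smaller threshold yields a tighter spread.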
Re: How to re-balance, NN safe mode
Otis, I would recommend the following steps: 1. Bring up all 4 DNs (both old and new). 2. Decommission the DN that you want to remove. See http://wiki.apache.org/hadoop/FAQ#17 3. Run the balancer. Hairong On 5/8/08 9:11 PM, Otis Gospodnetic [EMAIL PROTECTED] wrote: Hi, (I should prefix this by saying that bin/hadoop fsck reported corrupt HDFS after I replaced one of the DNs with a new/empty DN) I've removed 1 old DN and added 1 new DN. The cluster has 4 nodes total (all 4 act as DNs) and a replication factor of 3. I'm trying to re-balance the data by following http://wiki.apache.org/hadoop/FAQ#6: - I stopped all daemons - I removed the old DN and added the new DN to conf/slaves - I started all daemons The new DN shows in the JT and NN GUIs, and bin/hadoop dfsadmin -report shows it. At this point I expected the NN to figure out that it needs to re-replicate under-replicated blocks and start pushing data to the new DN. However, no data got copied to the new DN. I pumped the replication factor up to 6 and restarted all daemons, but still nothing. I noticed the NN GUI says the NN is in safe mode, but it has been stuck there for 10+ minutes now - too long, it seems. I then tried running bin/hadoop balancer, but got this: $ bin/hadoop balancer Received an IO exception: org.apache.hadoop.dfs.SafeModeException: Cannot create file /system/balancer.id. Name node is in safe mode. Safe mode will be turned off automatically. at org.apache.hadoop.dfs.FSNamesystem.startFileInternal(FSNamesystem.java:947) at org.apache.hadoop.dfs.FSNamesystem.startFile(FSNamesystem.java:931) ... ... So now I'm wondering what steps one needs to follow when replacing a DN? Just pulling it out and listing a new one in conf/slaves leads to the NN getting into the permanent(?) safe mode, it seems. I know I can run bin/hadoop dfsadmin -safemode leave but is that safe? ;) If I do that, will I then be able to run bin/hadoop balancer and get some replicas of the old HDFS data on the newly added DN?
Thanks, Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
Re: Corrupt HDFS and salvaging data
A default replication factor of 3 does not mean that every block's replication factor in the file system is 3. In case (1), some blocks have a replication factor which is less than 3, so the average replication factor is less than 3, but there are no missing replicas. In case (2), some blocks have zero replicas, so only 92.72564% are minimally replicated. Those missing blocks must have had a replication factor of 1 and been placed on the removed DN. Hairong On 5/9/08 7:16 AM, Otis Gospodnetic [EMAIL PROTECTED] wrote: Hi, Here are 2 bin/hadoop fsck / -files -blocks -locations reports: 1) For the old HDFS cluster, reportedly HEALTHY, but with this inconsistency: http://www.krumpir.com/fsck-old.txt.zip (< 1MB) Total blocks: 32264 (avg. block size 11591245 B) Minimally replicated blocks: 32264 (100.0 %) <== looks GOOD, matches Total blocks Over-replicated blocks: 0 (0.0 %) Under-replicated blocks: 0 (0.0 %) Mis-replicated blocks: 0 (0.0 %) Default replication factor: 3 <== should have 3 copies of each block Average block replication: 2.418051 <== ??? shouldn't this be 3?? Missing replicas: 0 (0.0 %) <== if the above is 2.41... how can I have 0 missing replicas? 2) For the cluster with 1 old DN replaced with 1 new DN: http://www.krumpir.com/fsck-1newDN.txt.zip (< 800KB) Minimally replicated blocks: 29917 (92.72564 %) Over-replicated blocks: 0 (0.0 %) Under-replicated blocks: 17124 (53.074635 %) Mis-replicated blocks: 0 (0.0 %) Default replication factor: 3 Average block replication: 1.8145611 Missing replicas: 17124 (29.249296 %) Any help would be appreciated. Thanks, Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: lohit [EMAIL PROTECTED] To: core-user@hadoop.apache.org Sent: Friday, May 9, 2008 2:47:39 AM Subject: Re: Corrupt HDFS and salvaging data When you say all daemons, do you mean the entire cluster, including the namenode? According to your explanation, this means that after I removed 1 DN I started missing about 30% of the blocks, right?
No, you would only miss the replicas. If all of your blocks have a replication factor of 3, then you would miss only the one replica which was on this DN. It would be good to see the full report; could you run hadoop fsck / -files -blocks -locations? That would give you much more detailed information. - Original Message From: Otis Gospodnetic To: core-user@hadoop.apache.org Sent: Thursday, May 8, 2008 10:54:53 PM Subject: Re: Corrupt HDFS and salvaging data Lohit, I ran fsck after I replaced 1 DN (with data on it) with 1 blank DN and started all daemons. I see the fsck report does include this: Missing replicas: 17025 (29.727087 %) According to your explanation, this means that after I removed 1 DN I started missing about 30% of the blocks, right? Wouldn't that mean that 30% of all blocks were *only* on the 1 DN that I removed? But how could that be when I have a replication factor of 3? If I run bin/hadoop balancer with my old DN back in the cluster (and new DN removed), I do get the happy The cluster is balanced response. So wouldn't that mean that everything is peachy and that if my replication factor is 3, then when I remove 1 DN I should have only some portion of blocks under-replicated, but not *completely* missing from HDFS? Thanks, Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: lohit To: core-user@hadoop.apache.org Sent: Friday, May 9, 2008 1:33:56 AM Subject: Re: Corrupt HDFS and salvaging data Hi Otis, Namenode has location information about all replicas of a block. When you run fsck, namenode checks for those replicas. If all replicas are missing, then fsck reports the block as missing. Otherwise they are added to under-replicated blocks. If you specify the -move or -delete option along with fsck, files with such missing blocks are moved to /lost+found or deleted depending on the option. At what point did you run the fsck command, was it after the datanodes were stopped?
When you run namenode -format, it deletes the directories specified in dfs.name.dir. If a directory exists, it asks for confirmation. Thanks, Lohit - Original Message From: Otis Gospodnetic To: core-user@hadoop.apache.org Sent: Thursday, May 8, 2008 9:00:34 PM Subject: Re: Corrupt HDFS and salvaging data Hi, Update: It seems fsck reports HDFS as corrupt when a significant-enough number of block replicas is missing (or something like that). fsck reported corrupt HDFS after I replaced 1 old DN with 1 new DN. After I restarted Hadoop with the old set of DNs, fsck stopped reporting corrupt HDFS and started reporting a *healthy* HDFS.
Re: Corrupt HDFS and salvaging data
The default replication factor takes effect only at file creation time. If you want to increase the replication factor of existing blocks, you need to run the command hadoop fs -setrep. It's better to finish the decommission first, remove the old DN, and then rebalance. Rebalancing moves blocks around but does not replicate blocks. Hope it helps, Hairong On 5/9/08 9:38 AM, Otis Gospodnetic [EMAIL PROTECTED] wrote: Hi, A default replication factor of 3 does not mean that every block's replication factor in the file system is 3. Hm, and I thought that is exactly what it meant. What does it mean then? Or are you saying: the number of block replicas matches the r.f. that was in place when the block was *created*? In case (1), some blocks have a replication factor which is less than 3. So the average replication factor is less than 3. But no missing replicas. Makes sense. Most likely due to the repl. fact. being = 1 at some point. But then why does bin/hadoop balancer tell me that the cluster is balanced? Does it not take into consideration the *current* replication factor? In case 2, some blocks have zero replicas, so only 92.72564% are minimally replicated. Those missing blocks must have a replication factor of 1 and were placed on the removed DN. Makes sense. So there are two things that need to be done: - get the blocks on the about-to-be-removed DN off of that DN, so copies exist elsewhere (decommissioning) - get the cluster to re-balance, factoring in the *current* replication factor (re-balancing). Is this correct? I think that's what your other email said (FAQ #17). I'm doing that now and it seems to be progressing, although I started the balancer immediately after running dfsadmin -refreshNodes (it didn't block, so I thought it didn't work...). I hope the fact that decommission and balancer are running simultaneously doesn't cause problems... Thanks!
Otis On 5/9/08 7:16 AM, Otis Gospodnetic wrote: Hi, Here are 2 bin/hadoop fsck / -files -blocks -locations reports: 1) For the old HDFS cluster, reportedly HEALTHY, but with this inconsistency: http://www.krumpir.com/fsck-old.txt.zip (< 1MB) Total blocks: 32264 (avg. block size 11591245 B) Minimally replicated blocks: 32264 (100.0 %) <== looks GOOD, matches Total blocks Over-replicated blocks: 0 (0.0 %) Under-replicated blocks: 0 (0.0 %) Mis-replicated blocks: 0 (0.0 %) Default replication factor: 3 <== should have 3 copies of each block Average block replication: 2.418051 <== ??? shouldn't this be 3?? Missing replicas: 0 (0.0 %) <== if the above is 2.41... how can I have 0 missing replicas? 2) For the cluster with 1 old DN replaced with 1 new DN: http://www.krumpir.com/fsck-1newDN.txt.zip (< 800KB) Minimally replicated blocks: 29917 (92.72564 %) Over-replicated blocks: 0 (0.0 %) Under-replicated blocks: 17124 (53.074635 %) Mis-replicated blocks: 0 (0.0 %) Default replication factor: 3 Average block replication: 1.8145611 Missing replicas: 17124 (29.249296 %) Any help would be appreciated. Thanks, Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: lohit To: core-user@hadoop.apache.org Sent: Friday, May 9, 2008 2:47:39 AM Subject: Re: Corrupt HDFS and salvaging data When you say all daemons, do you mean the entire cluster, including the namenode? According to your explanation, this means that after I removed 1 DN I started missing about 30% of the blocks, right? No, you would only miss the replicas. If all of your blocks have a replication factor of 3, then you would miss only the one replica which was on this DN. It would be good to see the full report; could you run hadoop fsck / -files -blocks -locations? That would give you much more detailed information.
- Original Message From: Otis Gospodnetic To: core-user@hadoop.apache.org Sent: Thursday, May 8, 2008 10:54:53 PM Subject: Re: Corrupt HDFS and salvaging data Lohit, I run fsck after I replaced 1 DN (with data on it) with 1 blank DN and started all daemons. I see the fsck report does include this: Missing replicas: 17025 (29.727087 %) According to your explanation, this means that after I removed 1 DN I started missing about 30% of the blocks, right? Wouldn't that mean that 30% of all blocks were *only* on the 1 DN that I removed? But how could that be when I have replication factor of 3? If I run bin/hadoop balancer with my old DN back in the cluster (and new DN removed), I do get the happy The cluster is balanced response. So wouldn't that mean that everything is peachy and that if my replication factor is 3 then when I remove 1 DN, I should have only some portion of blocks
Re: could only be replicated to 0 nodes, instead of 1
Could you please go to the dfs webUI and check how many datanodes are up and how much available space each has? Hairong On 5/8/08 3:30 AM, jasongs [EMAIL PROTECTED] wrote: I get the same error when doing a put and my cluster is running ok, i.e. has capacity and all nodes are live. Error message is org.apache.hadoop.ipc.RemoteException: java.io.IOException: File /test/test.txt could only be replicated to 0 nodes, instead of 1 at org.apache.hadoop.dfs.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1127) at org.apache.hadoop.dfs.NameNode.addBlock(NameNode.java:312) at sun.reflect.GeneratedMethodAccessor11.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:585) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:409) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:901) at org.apache.hadoop.ipc.Client.call(Client.java:512) at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:198) at org.apache.hadoop.dfs.$Proxy0.addBlock(Unknown Source) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:585) at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82) at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59) at org.apache.hadoop.dfs.$Proxy0.addBlock(Unknown Source) at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.locateFollowingBlock(DFSClient.java:2074) at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:1967) at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.access$1500(DFSClient.java:1487) at org.apache.hadoop.dfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:1601) I would appreciate any
help/suggestions. Thanks. jerrro wrote: I am trying to install/configure hadoop on a cluster with several computers. I followed exactly the instructions on the hadoop website for configuring multiple slaves, and when I run start-all.sh I get no errors - both datanode and tasktracker are reported to be running (doing ps awux | grep hadoop on the slave nodes returns two java processes). Also, the log files are empty - nothing is printed there. Still, when I try to use bin/hadoop dfs -put, I get the following error: # bin/hadoop dfs -put w.txt w.txt put: java.io.IOException: File /user/scohen/w4.txt could only be replicated to 0 nodes, instead of 1 and a file of size 0 is created on the DFS (bin/hadoop dfs -ls shows it). I couldn't find much information about this error, but I did manage to see somewhere that it might mean there are no datanodes running. But as I said, start-all does not give any errors. Any ideas what could be the problem? Thanks. Jerr.
Re: Where is the files?
DFS files are mapped into blocks. Blocks are stored under dfs.data.dir/current. Hairong On 5/7/08 7:36 AM, hong [EMAIL PROTECTED] wrote: Hi All, I started Hadoop in standalone mode and put some files onto HDFS. I strictly followed the instructions in the Hadoop Quick Start. HDFS is mapped to a local directory in my local file system, right? And where is it? Thank you in advance!
Re: Read timed out, Abandoning block blk_-5476242061384228962
Taking the timeout out is very dangerous; it may cause your application to hang. You could instead change the timeout parameter to a larger number. HADOOP-2188 fixed this problem; check https://issues.apache.org/jira/browse/HADOOP-2188. Hairong On 5/7/08 2:36 PM, James Moore [EMAIL PROTECTED] wrote: I noticed that there was a hard-coded timeout value of 6000 (ms) in src/java/org/apache/hadoop/dfs/DFSClient.java - as an experiment, I took that way down and now I'm not noticing the problem. (Doesn't mean it's not there, I just don't feel the pain...) This feels like a terrible solution^H^H^H^H^H^hack though, particularly since I haven't yet taken the time to actually understand the code.
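On releases that include HADOOP-2188 the socket timeouts are configurable rather than hard-coded, so raising them in hadoop-site.xml is the safer route Hairong suggests. The property names and values below are a sketch from memory of the post-2188 configuration; verify both against your release before relying on them:

```xml
<property>
  <name>dfs.socket.timeout</name>
  <value>120000</value> <!-- client read timeout in ms (assumed default 60000) -->
</property>
<property>
  <name>dfs.datanode.socket.write.timeout</name>
  <value>960000</value> <!-- write timeout in ms (assumed default 480000) -->
</property>
```

Unlike deleting the timeout, a large-but-finite value still lets a truly stuck pipeline fail and be retried instead of hanging forever.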
Re: Where are passed the JobConf?
JobConf gets passed to a mapper in Mapper.configure(JobConf job). Check http://hadoop.apache.org/core/docs/r0.16.1/api/org/apache/hadoop/mapred/MapReduceBase.html#configure(org.apache.hadoop.mapred.JobConf) Hairong On 4/13/08 11:44 PM, Steve Han [EMAIL PROTECTED] wrote: I am reading the Map/Reduce tutorial on the official site of hadoop core. It says that Overall, Mapper implementations are passed the JobConf for the job via the JobConfigurable.configure(JobConf) method (http://hadoop.apache.org/core/docs/r0.16.1/api/org/apache/hadoop/mapred/JobConfigurable.html#configure%28org.apache.hadoop.mapred.JobConf%29) and override it to initialize themselves. Where is the place in the code where JobConf is passed to the Mapper implementation (in WordCount v1.0 or v2.0)? Any idea? Thanks a lot.
Re: HDFS interface
http://wiki.apache.org/hadoop/HadoopDfsReadWriteExample Hairong On 3/12/08 1:21 PM, Arun C Murthy [EMAIL PROTECTED] wrote: http://hadoop.apache.org/core/docs/r0.16.0/hdfs_user_guide.html Arun On Mar 12, 2008, at 1:16 PM, Cagdas Gerede wrote: I would like to use the HDFS component of Hadoop but am not interested in MapReduce. All the Hadoop examples I have seen so far use MapReduce classes, and from these examples there is no reference to HDFS classes, including the FileSystem API of Hadoop (http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/fs/FileSystem.html). Everything seems to happen under the hood. I was wondering if there is any example source code that uses HDFS directly. Thanks, - CEG
Re: HDFS interface
If you add the configuration directory to the class path, the configuration files will be automatically loaded. Hairong On 3/12/08 5:32 PM, Cagdas Gerede [EMAIL PROTECTED] wrote: I found the solution. Please let me know if you have a better idea. I added the following addResource lines: Configuration conf = new Configuration(); conf.addResource(new Path("location_of_hadoop-default.xml")); conf.addResource(new Path("location_of_hadoop-site.xml")); FileSystem fs = FileSystem.get(conf); (Would be good to update the wiki page.) - CEG On Wed, Mar 12, 2008 at 5:04 PM, Cagdas Gerede [EMAIL PROTECTED] wrote: I see the following paragraphs in the wiki (http://wiki.apache.org/hadoop/HadoopDfsReadWriteExample): "Create a FileSystem (http://hadoop.apache.org/core/api/org/apache/hadoop/fs/FileSystem.html) instance by passing a new Configuration object. Please note that the following example code assumes that the Configuration object will automatically load the hadoop-default.xml and hadoop-site.xml configuration files. You may need to explicitly add these resource paths if you are not running inside of the Hadoop runtime environment." and: Configuration conf = new Configuration(); FileSystem fs = FileSystem.get(conf); When I do Path[] apples = fs.globPaths(new Path("*")); for(Path apple : apples) { System.out.println(apple); } it prints out all the local file names. How do I point my application to a running HDFS instance? What does "explicitly add these resource paths if you are not running inside of the Hadoop runtime environment" mean? Thanks, - CEG
Re: Does Hadoop Honor Reserved Space?
I think you have a misunderstanding of the reserved parameter. As I commented on HADOOP-1463, remember that dfs.datanode.du.reserved is the space for non-dfs usage, including the space for map/reduce, other applications, fs metadata, etc. In your case, since /usr already takes 45GB, it far exceeds the reserved limit of 1G. You should set the reserved space to be 50G. Hairong On 3/10/08 4:54 PM, Joydeep Sen Sarma [EMAIL PROTECTED] wrote: Filed https://issues.apache.org/jira/browse/HADOOP-2991 -Original Message- From: Joydeep Sen Sarma [mailto:[EMAIL PROTECTED] Sent: Monday, March 10, 2008 12:56 PM To: core-user@hadoop.apache.org; core-user@hadoop.apache.org Cc: Pete Wyckoff Subject: RE: Does Hadoop Honor Reserved Space? folks - Jimmy is right - as we have unfortunately hit it as well: https://issues.apache.org/jira/browse/HADOOP-1463 caused a regression. we have left some comments on the bug - but can't reopen it. this is going to affect all 0.15 and 0.16 deployments! -Original Message- From: Hairong Kuang [mailto:[EMAIL PROTECTED] Sent: Thu 3/6/2008 2:01 PM To: core-user@hadoop.apache.org Subject: Re: Does Hadoop Honor Reserved Space? In addition to the version, could you please send us a copy of the datanode report by running the command bin/hadoop dfsadmin -report? Thanks, Hairong On 3/6/08 11:56 AM, Joydeep Sen Sarma [EMAIL PROTECTED] wrote: but intermediate data is stored in a different directory from dfs/data (something like mapred/local by default i think). what version are u running? -Original Message- From: Ashwinder Ahluwalia on behalf of [EMAIL PROTECTED] Sent: Thu 3/6/2008 10:14 AM To: core-user@hadoop.apache.org Subject: RE: Does Hadoop Honor Reserved Space? I've run into a similar issue in the past. From what I understand, this parameter only controls the HDFS space usage. However, the intermediate data in the map reduce job is stored on the local file system (not HDFS) and is not subject to this configuration. 
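In other words, dfs.datanode.du.reserved must cover everything on the volume that is not HDFS block data. A hedged hadoop-site.xml example for the 50G case mentioned above (50 GB = 53,687,091,200 bytes; the figure is illustrative - size it to whatever non-dfs usage your volume actually carries):

```xml
<!-- Illustrative sizing: on a volume where non-DFS use (OS files, /usr,
     map/reduce spill space) already takes ~45 GB, reserve more than that. -->
<property>
  <name>dfs.datanode.du.reserved</name>
  <value>53687091200</value>
  <description>Reserved space in bytes per volume. Always leave this much
  space free for non dfs use.</description>
</property>
```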
In the past I have used mapred.local.dir.minspacekill and mapred.local.dir.minspacestart to control the amount of space that is allowable for use by this temporary data. Not sure if that is the best approach though, so I'd love to hear what other people have done. In your case, you have a map-red job that will consume too much space (without setting a limit, you didn't have enough disk capacity for the job), so looking at mapred.output.compress and mapred.compress.map.output might be useful to decrease the job's disk requirements. --Ash -Original Message- From: Jimmy Wan [mailto:[EMAIL PROTECTED] Sent: Thursday, March 06, 2008 9:56 AM To: core-user@hadoop.apache.org Subject: Does Hadoop Honor Reserved Space? I've got 2 datanodes set up with the following configuration parameter: <property> <name>dfs.datanode.du.reserved</name> <value>429496729600</value> <description>Reserved space in bytes per volume. Always leave this much space free for non dfs use.</description> </property> Both are housed on 800GB volumes, so I thought this would keep about half the volume free for non-HDFS usage. After some long-running jobs last night, both disk volumes were completely filled. The bulk of the data was in: ${my.hadoop.tmp.dir}/hadoop-hadoop/dfs/data This is running as the user hadoop. Am I interpreting these parameters incorrectly? I noticed this issue, but it is marked as closed: http://issues.apache.org/jira/browse/HADOOP-2549
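The parameters named above can be sketched as a hadoop-site.xml fragment. All values here are illustrative, not recommendations; check the defaults for your version before copying them:

```xml
<!-- Hedged sketch; thresholds are placeholders in bytes. -->
<property>
  <name>mapred.local.dir.minspacestart</name>
  <value>1073741824</value> <!-- don't accept new tasks below 1 GB free -->
</property>
<property>
  <name>mapred.local.dir.minspacekill</name>
  <value>536870912</value> <!-- kill running tasks below 512 MB free -->
</property>
<property>
  <name>mapred.compress.map.output</name>
  <value>true</value> <!-- compress intermediate map output on local disks -->
</property>
<property>
  <name>mapred.output.compress</name>
  <value>true</value> <!-- compress final job output -->
</property>
```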