HDFS: Forcing Master to Leave Safemode
What are the implications of manually forcing the namenode to leave safemode? What properties does HDFS lose by doing that? One gain is that the file system becomes available for writes immediately. Cagdas -- Best Regards, Cagdas Evren Gerede Home Page: http://cagdasgerede.info
Master Heap Size and Master Startup Time vs. Number of Blocks
In the system I am working on, we have 6 million blocks in total, the namenode heap size is about 600 MB, and it takes about 5 minutes for the namenode to leave safemode. I am trying to estimate what the heap size would be if we had 100-150 million blocks, and how long the namenode would take to leave safemode. Extrapolating from the numbers I have, I am calculating very scary numbers for both (terabytes for heap size, and half an hour or so of startup time). I am hoping that my extrapolation is not accurate. From your clusters, could you provide some numbers for the number of files and blocks in the system versus the master heap size and master startup time? I really appreciate your help. Thanks. Cagdas
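For what it is worth, a simple linear extrapolation from the numbers above gives gigabytes, not terabytes. This is only a back-of-the-envelope sketch; it assumes heap usage and startup time scale linearly with block count, which the real namenode may well not do:

```java
// Linear extrapolation from the figures in this message:
// 600 MB of heap and 5 minutes of startup for 6 million blocks.
public class NamenodeExtrapolation {
    // Heap (MB) at a target block count, assuming linear scaling.
    static double heapMb(double heapMbNow, double blocksNow, double blocksTarget) {
        return heapMbNow * (blocksTarget / blocksNow);
    }

    // Startup minutes at a target block count, same assumption.
    static double startupMinutes(double minutesNow, double blocksNow, double blocksTarget) {
        return minutesNow * (blocksTarget / blocksNow);
    }

    public static void main(String[] args) {
        System.out.println(heapMb(600, 6e6, 150e6));       // 15000.0 MB, i.e. ~15 GB
        System.out.println(startupMinutes(5, 6e6, 150e6)); // 125.0 minutes
    }
}
```

So even under linear scaling, 150 million blocks suggest roughly 15 GB of heap and about two hours of startup; the startup figure is still worrying, which is the point of the question.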
Re: Master Heap Size and Master Startup Time vs. Number of Blocks
Thanks, Doug, for your answers. Our interest is more in the distributed file system part than in MapReduce. I must confess that our block size is not as large as many people configure it. I would appreciate your and others' input. Do you think these numbers are suitable? We will have 5 million files, each having 20 blocks of 2 MB. With the minimum replication of 3, we would have 300 million blocks. 300 million blocks would store 600 TB. At ~10 TB/node, this means a 60-node system. Do you think these numbers are suitable for Hadoop DFS? Cagdas

Doug had written: At ~100M per block, 100M blocks would store 10PB. At ~1TB/node, this means a ~10,000 node system, larger than Hadoop currently supports well (for this and other reasons). Doug
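Spelling out the arithmetic in the message above (a sketch; the 10 TB/node figure is the message's own assumption): 5 million files at 20 blocks each is 100 million unique blocks; replication 3 makes 300 million block replicas; at 2 MB each, that is 600 TB of raw storage, or 60 nodes at 10 TB per node.

```java
// Capacity arithmetic from the message above.
public class CapacityMath {
    static long uniqueBlocks(long files, long blocksPerFile) { return files * blocksPerFile; }
    static long replicas(long uniqueBlocks, long replication) { return uniqueBlocks * replication; }
    static long rawTb(long replicas, long blockMb) { return replicas * blockMb / 1_000_000; }

    public static void main(String[] args) {
        long unique = uniqueBlocks(5_000_000L, 20); // 100,000,000 unique blocks
        long reps = replicas(unique, 3);            // 300,000,000 block replicas
        long tb = rawTb(reps, 2);                   // 600 TB of raw storage
        System.out.println(unique + " blocks, " + reps + " replicas, "
                + tb + " TB raw, " + (tb / 10) + " nodes at 10 TB/node");
    }
}
```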
Re: Master Heap Size and Master Startup Time vs. Number of Blocks
We picked it for fault tolerance. As files are uploaded to our server, we can continuously write the data in small chunks, and if our server fails, we can tolerate the failure by switching the user to another server, where the user can continue to write. Otherwise we have to wait on the server until we have the whole file before writing it to Hadoop (and if the server fails, we lose all the data), or we need the user to cache all the data he is generating, which is not feasible for our requirements. I would appreciate your comments on this. Cagdas

On Fri, May 2, 2008 at 1:09 PM, Ted Dunning wrote: Why did you pick such a small block size? Why not go with the default of 64MB? That would give you only 10 million blocks for your 600TB. I don't see any advantage to the tiny block size.
Re: Block reports: memory vs. file system, and Dividing offerService into 2 threads
As far as I understand, the current focus is on how to reduce the namenode's CPU time for processing block reports from many datanodes. Aren't we missing another issue? Doesn't the way a block report is computed delay the master startup time? I have to make sure the master comes up as quickly as possible for maximum availability. The bottleneck seems to be the scanning of the local disk. I wrote a simple Java program that only scanned the datanode directories the way the Hadoop code does, and the time it took was equivalent to about 90% of the time taken for block report generation and sending. Scanning appears to be very costly; it takes about 2-4 minutes.

To address the problem, can we have two types of block reports: one generated from memory and the other from the local file system? For master startup we could trigger the block report generated from memory, and for the periodic ones we could trigger the report computed from the local file system.

Another issue: even if we compute block reports only every 10 days, once one happens, it almost freezes the datanode. More specifically, the datanode cannot report new blocks to the namenode until the report is computed, which takes at least a couple of minutes per datanode in my system. As a result, the master thinks a block is not yet replicated enough and rejects the addition of a new block to the file. Since it does not wait long enough, this eventually causes the write of the file to fail. To address this, can we separate the scanning of the underlying disk into a thread distinct from the reporting of newly received blocks? Dhruba points out: "This sequential nature is critical in ensuring that there is no erroneous race condition in the Namenode." I do not have any insight into this. Cagdas
Block reports: memory vs. file system, and Dividing offerService into 2 threads
Currently, block reports are computed by scanning all the files and folders on the local disk. This happens not only at startup, but also for the periodic block reports. What is the intention behind doing it this way instead of creating the block report from the data structure already in memory?

One issue with the current approach is that it takes quite some time. For instance, the system I am working with has 1 million blocks on each datanode, and every time the block report is computed it takes about 4 minutes. You might expect this to have no effect, but it does. The problem is that computing and sending block reports and sending the list of newly received blocks are part of the same thread. As a result, when the block report computation takes a long time, it delays the reporting of newly received blocks. As you know, when you are writing a multi-block file and you request a new block from the namenode, the namenode checks whether the immediately preceding block is replicated enough times. If at this point the datanode is computing its block report, it very likely has not yet informed the namenode about that preceding block, so the namenode rejects the request. The client then retries 5 more times, doubling the wait in between up to about 6 seconds (it starts with 200 ms and doubles it 5 times), and finally raises an exception to the application. Given that block report computation takes minutes, this situation is very likely to occur.

Do you have any suggestions on how to handle this situation? Are there any plans to move the block reporting code and the reporting of newly received blocks into separate threads (ref: the offerService method in DataNode.java)? Thanks for your response,
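To make the proposal concrete, here is a minimal, self-contained sketch (plain Java, not actual Hadoop code) of what splitting offerService might look like: the expensive full-report scan runs on its own worker thread while the service loop keeps draining and sending newly received block notifications, so the two no longer serialize.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.LinkedBlockingQueue;

// Sketch of the proposed split: one thread computes the slow full block
// report while the service loop keeps forwarding newly received blocks.
public class SplitOfferService {
    private final BlockingQueue<String> receivedBlocks = new LinkedBlockingQueue<>();
    private final ExecutorService reportWorker = Executors.newSingleThreadExecutor();

    // Simulates the expensive local-disk scan (minutes in this thread's numbers).
    private List<String> computeFullBlockReport() throws InterruptedException {
        Thread.sleep(50); // stand-in for a multi-minute scan
        return List.of("blk_1", "blk_2");
    }

    // Service loop: the full report is handed to a worker; new-block
    // notifications drain immediately instead of waiting behind the scan.
    public int offerServiceOnce(List<String> newBlocks) throws Exception {
        Future<List<String>> fullReport =
            reportWorker.submit(this::computeFullBlockReport);
        receivedBlocks.addAll(newBlocks);
        List<String> toSend = new ArrayList<>();
        receivedBlocks.drainTo(toSend);   // the "blockReceived" RPC would go here
        fullReport.get();                 // full report completes independently
        reportWorker.shutdown();
        return toSend.size();             // blocks reported without delay
    }
}
```

Whether this is safe depends on the race-condition concern Dhruba raises in the reply above; the sketch only illustrates the threading shape, not the namenode-side correctness argument.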
Re: Help: When is it safe to discard a block in the application layer
In the DataStreamer class (in DFSClient.java), there is a line in the run() method like this: if (progress != null) { progress.progress(); } I think progress is called only if the block is replicated at least the minimum number of times. I can pass my own progress object and wait on it until this method is invoked before deleting my application cache. Does this seem right? Am I missing something? Thanks for your answer.

On Thu, Apr 17, 2008 at 10:21 AM, dhruba Borthakur wrote: The DFSClient caches small packets (e.g. 64K write buffers) and they are lazily flushed to the datanodes in the pipeline. So, when an application completes an out.write() call, it is definitely not guaranteed that the data has been sent to even one datanode. One option would be to retrieve cache hints from the Namenode and determine if the block has three locations. Thanks, dhruba

From: Cagdas Gerede, Sent: Wednesday, April 16, 2008 7:40 PM, To: core-user@hadoop.apache.org, Subject: Help: When is it safe to discard a block in the application layer

I am working on an application on top of the Hadoop Distributed File System. The high-level flow goes like this: a user data block arrives at the application server; the application server uses the DistributedFileSystem API of Hadoop and writes the data block to the file system; once the block is replicated three times, the application server notifies the user so that the user can get rid of the data, since it is now in persistent, fault-tolerant storage. What I couldn't figure out is the following. Let's say this is my sample program to write a block:

byte data[] = new byte[blockSize];
out.write(data, 0, data.length);

... where out is

out = fs.create(outFile, FsPermission.getDefault(), true, 4096, (short) replicationCount, blockSize, progress);

My application writes the data to the stream and then goes to the next line. At this point, I believe I cannot be sure that the block is replicated at least, say, 3 times. Possibly, under the hood, the DFSClient is still trying to push the data out. Given this, how is my application going to know that the data block is replicated 3 times and that it is safe to discard the data? There are a couple of things you might suggest: 1) Set the minimum replica property to 3: even if you do this, the application still goes to the next line before the data is actually replicated 3 times. 2) Right after you write, continuously get cache hints from the master and check whether the master is aware of 3 replicas of this block: my problem with this approach is that the application will wait a while for every block it stores, since it takes some time for datanodes to report and for the master to process the block reports. What is worse, if some datanode in the pipeline fails, we have no way of knowing about the error. To sum up, I am not sure when the right time is to discard a block of data with the guarantee that it is replicated a certain number of times. Please help. Thanks, Cagdas
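Option (2) above can at least be wrapped in a bounded poll. Below is a self-contained sketch of that idea; ReplicationOracle is a hypothetical stand-in for however you query the namenode's view of a block's locations (e.g. via cache hints), not a real Hadoop API, and the backoff constants mirror the client retry behavior described elsewhere in this thread.

```java
// Poll a (hypothetical) view of the namenode's replica count for a block
// until it reaches the desired minimum, or give up after a timeout.
public class ReplicationWaiter {
    public interface ReplicationOracle {
        int visibleReplicas(String blockId); // e.g. backed by cache hints
    }

    public static boolean waitForReplicas(ReplicationOracle oracle, String blockId,
                                          int minReplicas, long timeoutMillis)
            throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        long backoff = 200; // start at 200 ms and double, as the DFS client retry does
        long remaining;
        while ((remaining = deadline - System.currentTimeMillis()) > 0) {
            if (oracle.visibleReplicas(blockId) >= minReplicas) {
                return true; // safe to discard the application-side copy
            }
            Thread.sleep(Math.min(backoff, remaining));
            backoff = Math.min(backoff * 2, 6_400);
        }
        return oracle.visibleReplicas(blockId) >= minReplicas;
    }
}
```

Note that the caveat in the message still applies: a pipeline failure shows up here only as a timeout, not as an explicit error.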
Benchmarking and Statistics for Hadoop Distributed File System
Is anybody aware of any benchmarking of the Hadoop Distributed File System? Some numbers I am interested in: - How long does it take for the master to recover if there are, say, 1 million blocks in the system? - How does the recovery time change as the number of blocks in the system changes? Is it linear? Is it exponential? - What is the file read/write throughput of the Hadoop File System under different configurations and loads?
Please Help: Namenode Safemode
I have a Hadoop distributed file system with 3 datanodes. I have only 150 blocks on each datanode, yet it takes a little more than a minute for the namenode to start and pass the safemode phase. The steps for namenode startup, as far as I understand, are:

1) A datanode sends a heartbeat to the namenode, and the namenode tells the datanode to send a block report as a piggyback on the heartbeat.
2) The datanode computes the block report.
3) The datanode sends it to the namenode.
4) The namenode processes the block report.
5) The namenode's safemode monitor thread checks for exit, and the namenode exits safemode if the threshold is reached and the extension time has passed.

Here are my numbers:
Step 1) Datanodes send heartbeats every 3 seconds.
Step 2) Computing the block report takes about 20 milliseconds, as shown in the datanodes' logs.
Step 3) No idea (it depends on the size of the block report; I suspect this should not be more than a couple of seconds).
Step 4) No idea; it shouldn't be more than a couple of seconds.
Step 5) The thread checks every second. The extension value in my configuration is 0, so there is no wait once the threshold is reached.

Given these numbers, can anybody explain where the one minute comes from? Shouldn't this take 10-20 seconds? Please help; I am very confused.
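Adding up the steps above, with the two unmeasured steps generously capped at a couple of seconds each (an assumption, since steps 3 and 4 were not measured), the expected total is indeed well under the observed minute:

```java
// Rough startup-time budget from the five steps in this message.
// Steps 3 and 4 were not measured; a 2-second cap is assumed for each.
public class SafemodeBudget {
    static double totalSeconds() {
        double heartbeatWait = 3.0;   // step 1: heartbeats every 3 s (worst case)
        double reportCompute = 0.02;  // step 2: ~20 ms from the datanode logs
        double reportSend    = 2.0;   // step 3: assumed upper bound
        double reportProcess = 2.0;   // step 4: assumed upper bound
        double monitorPoll   = 1.0;   // step 5: checked once per second
        return heartbeatWait + reportCompute + reportSend + reportProcess + monitorPoll;
    }

    public static void main(String[] args) {
        System.out.println(totalSeconds()); // ~8 seconds, far from the observed minute
    }
}
```

The gap between this budget and the observed minute is exactly what the question is asking the list to explain.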
Re: multiple datanodes in the same machine
Testing when I do not have 10 machines.

On 4/15/08, Ted Dunning wrote: Why do you want to do this perverse thing? How does it help to have more than one datanode per machine? And what in the world is better when you have 10?

On 4/15/08 12:53 PM, Cagdas Gerede wrote: I have a follow-up question. Is there a way to programmatically configure datanode parameters and start the datanode process? If I want to create 10 datanodes on the same host, do I have to create 10 config files?

On Tue, Apr 15, 2008 at 12:29 PM, dhruba Borthakur wrote: Yes, just point the datanodes to different config files, different sets of ports, different data directories, etc. Thanks, dhruba

-----Original Message----- From: Cagdas Gerede, Sent: Tuesday, April 15, 2008 11:21 AM, To: core-user@hadoop.apache.org, Subject: multiple datanodes in the same machine

Is there a way to run multiple datanodes in the same machine?
Re: multiple datanodes in the same machine
I am working on the distributed file system part. I do not use the MapReduce part, and I need to run multiple processes to test some scenarios on the file system.

On Tue, Apr 15, 2008 at 1:37 PM, Ted Dunning wrote: I have had no issues in scaling the number of datanodes. The location of the data is almost invisible to MR programs. I have had issues in going from local to distributed mode, but that has entirely been due to classpath-like issues. Since MR naturally restricts your focus, it is pretty much the rule that programs scale without much thought. If you test with two tasktrackers and one datanode, you should have a pretty solid test environment.
HDFS: What happens when a harddrive fails
I was wondering: 1) What happens if a datanode is alive but its hard drive fails? Does it throw an exception and die? 2) If it continues to run and continues to do block reporting, is there a console showing datanodes with healthy hard drives versus unhealthy ones? I know the namenode's web server shows running and not-running datanodes, but I am not sure whether it differentiates between datanodes with healthy and unhealthy hard drives. Thanks for your help, Cagdas
HDFS: how to append
The HDFS documentation says it is possible to append to an HDFS file. However, in the org.apache.hadoop.dfs.DistributedFileSystem class there is no method to open an existing file for writing (there are methods for reading). The only similar methods are the create methods, which return an FSDataOutputStream. When I look at the FSDataOutputStream class, there seems to be no append method, and all the write methods overwrite an existing file or return an error if such a file exists. Does anybody know how to append to a file in HDFS? I appreciate your help. Thanks, Cagdas
HDFS: Flash Application and Available APIs
I have two questions: - I was wondering whether an HDFS client can be invoked from a Flash application. - What are the available APIs for HDFS? (I read that there is a C/C++ API for Hadoop MapReduce, but is there a C/C++ API for HDFS, or can it only be invoked from a Java application?) Thanks for your help, Cagdas
HadoopDfsReadWriteExample
I tried HadoopDfsReadWriteExample and I am getting the following error. I appreciate any help; I provide more info at the end.

Error while copying file
Exception in thread "main" java.io.IOException: Cannot run program "df": CreateProcess error=2, The system cannot find the file specified
    at java.lang.ProcessBuilder.start(ProcessBuilder.java:459)
    at java.lang.Runtime.exec(Runtime.java:593)
    at java.lang.Runtime.exec(Runtime.java:466)
    at org.apache.hadoop.fs.ShellCommand.runCommand(ShellCommand.java:48)
    at org.apache.hadoop.fs.ShellCommand.run(ShellCommand.java:42)
    at org.apache.hadoop.fs.DF.getAvailable(DF.java:72)
    at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:296)
    at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.createTmpFileForWrite(LocalDirAllocator.java:326)
    at org.apache.hadoop.fs.LocalDirAllocator.createTmpFileForWrite(LocalDirAllocator.java:155)
    at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.newBackupFile(DFSClient.java:1483)
    at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.openBackupStream(DFSClient.java:1450)
    at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.writeChunk(DFSClient.java:1592)
    at org.apache.hadoop.fs.FSOutputSummer.writeChecksumChunk(FSOutputSummer.java:140)
    at org.apache.hadoop.fs.FSOutputSummer.flushBuffer(FSOutputSummer.java:122)
    at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.close(DFSClient.java:1728)
    at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:49)
    at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:64)
    at HadoopDFSFileReadWrite.main(HadoopDFSFileReadWrite.java:106)
Caused by: java.io.IOException: CreateProcess error=2, The system cannot find the file specified
    at java.lang.ProcessImpl.create(Native Method)
    at java.lang.ProcessImpl.init(ProcessImpl.java:81)
    at java.lang.ProcessImpl.start(ProcessImpl.java:30)
    at java.lang.ProcessBuilder.start(ProcessBuilder.java:452)
    ... 17 more

Note: I am on a Windows machine, and the namenode is running on the same Windows machine. The way I initialized the configuration is:

Configuration conf = new Configuration();
conf.addResource(new Path("C:\\cygwin\\hadoop-management\\hadoop-conf\\hadoop-site.xml"));
FileSystem fs = FileSystem.get(conf);

Any suggestions? Cagdas
Fault Tolerance: Inquiry for approaches to solve single point of failure when name node fails
I have a question. As we know, the namenode is a single point of failure. In a production environment, I imagine a namenode would run in a data center. If that data center fails, how would you put a new namenode in place in another data center so that it can take over with minimal interruption? I was wondering if anyone has experience/ideas/comments on this. Thanks, -Cagdas
Re: Fault Tolerance: Inquiry for approaches to solve single point of failure when name node fails
If your data center fails, then you probably have to worry more about how to get your data back. I assume we have multiple data centers. I know that, thanks to HDFS replication, the data in the other data center will be sufficient. However, as far as I can see, HDFS has no support for replicating the namenode. Is this true? If there is no automated support, and I need to do this replication with custom code or manual intervention, what are the steps? Any help is appreciated. Cagdas
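One partial mitigation in this era of Hadoop is to list more than one directory in dfs.name.dir: the namenode writes its image and edit log to every directory listed, so putting one of them on a remote (e.g. NFS-mounted) volume keeps an off-machine copy of the metadata. A sketch of a hadoop-site.xml fragment (both paths are placeholders for illustration):

```xml
<!-- Namenode metadata written to a local disk and to an NFS mount;
     both paths are placeholders, not real defaults. -->
<property>
  <name>dfs.name.dir</name>
  <value>/local/hadoop/name,/mnt/remote-nfs/hadoop/name</value>
</property>
```

This does not give automatic failover; a replacement namenode still has to be started by hand against the surviving copy of the metadata.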
Re: HDFS interface
I would like to use the HDFS component of Hadoop but am not interested in MapReduce. All the Hadoop examples I have seen so far use MapReduce classes, and in these examples there is no reference to HDFS classes, including Hadoop's FileSystem API (http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/fs/FileSystem.html). Everything seems to happen under the hood. I was wondering if there is any example source code that uses HDFS directly. Thanks, - CEG
Searching email list
Is there an easy way to search this email list? I couldn't find any web interface. Please help. CEG
Re: HDFS interface
I see the following paragraphs in the wiki (http://wiki.apache.org/hadoop/HadoopDfsReadWriteExample):

"Create a FileSystem (http://hadoop.apache.org/core/api/org/apache/hadoop/fs/FileSystem.html) instance by passing a new Configuration object. Please note that the following example code assumes that the Configuration object will automatically load the hadoop-default.xml and hadoop-site.xml configuration files. You may need to explicitly add these resource paths if you are not running inside of the Hadoop runtime environment."

and

Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);

When I do

Path[] apples = fs.globPaths(new Path("*"));
for (Path apple : apples) { System.out.println(apple); }

it prints out all the local file names. How do I point my application to a running HDFS instance? What does "explicitly add these resource paths if you are not running inside of the Hadoop runtime environment" mean? Thanks, - CEG
Re: HDFS interface
I found the solution. Please let me know if you have a better idea. I added the following addResource lines:

Configuration conf = new Configuration();
conf.addResource(new Path("location_of_hadoop-default.xml"));
conf.addResource(new Path("location_of_hadoop-site.xml"));
FileSystem fs = FileSystem.get(conf);

(It would be good to update the wiki page.) - CEG -- Best Regards, Cagdas Evren Gerede Home Page: http://www.cs.ucsb.edu/~gerede Pronunciation: http://www.cs.ucsb.edu/~gerede/cagdas.html