Re: Namenode BlocksMap on Disk
Billy Pearson wrote:
> We are looking for a way to support smaller clusters as well, which might overrun their heap size and cause the cluster to crash.

Support for namespaces larger than RAM would indeed be a good feature to have. Implementing it without impacting in-memory namenode performance on large clusters should be possible, but may or may not be easy. You are welcome to tackle this task if it is a priority for you.

Doug
Namenode BlocksMap on Disk
From time to time a message pops up on the mailing list about OOM errors on the namenode caused by too many files. Most recently there was a 1.7-million-file installation that was failing. I know the simple solution is to give the namenode a larger Java heap. The non-simple way would be to store the NameNode's BlocksMap on disk and then query and update it there for operations. This would eliminate memory problems for large file installations, but might also degrade performance slightly.

Questions:

1) Is there any current work to allow the namenode to store the BlocksMap on disk versus in memory? This could be a configurable option.

2) Besides a possible slight degradation in performance, is there a reason why the BlocksMap shouldn't or couldn't be stored on disk?

I am willing to put forth the work to make this happen. I just want to make sure I am not going down the wrong path to begin with.

Dennis
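One way to picture the configurable option Dennis describes is a pluggable store behind the block lookup path. The sketch below is purely hypothetical (none of these interface or class names exist in HDFS); it only illustrates separating the lookup interface from its backing storage so a disk-backed implementation could be swapped in by configuration:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: a pluggable store for blockId -> metadata lookups.
// These names are invented for illustration, not actual HDFS APIs.
interface BlockStore {
    byte[] get(long blockId);
    void put(long blockId, byte[] info);
}

// Current behavior: everything lives on the heap.
class InMemoryBlockStore implements BlockStore {
    private final Map<Long, byte[]> map = new HashMap<>();
    public byte[] get(long blockId) { return map.get(blockId); }
    public void put(long blockId, byte[] info) { map.put(blockId, info); }
}

public class BlockStoreDemo {
    public static void main(String[] args) {
        // A DiskBlockStore implementing the same interface could be
        // selected here based on a configuration key instead.
        BlockStore store = new InMemoryBlockStore();
        store.put(42L, new byte[]{1, 2, 3});
        System.out.println(store.get(42L).length); // prints 3
    }
}
```

The point of the indirection is that callers never know whether a lookup hit the heap or the disk, so the trade-off becomes an operator decision rather than a code change.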
Re: Namenode BlocksMap on Disk
I would like to see something like this as well. I run 32-bit servers, so I am limited in how much memory I can use for the heap. Besides just storing to disk, I would like to see some sort of cache, like a block cache, that holds parts of the BlocksMap. This would help reduce the disk hits for lookups while still letting us lower the memory requirement for the namenode.

Billy

Dennis Kubes wrote:
> From time to time a message pops up on the mailing list about OOM errors for the namenode because of too many files. [...] Besides possible slight degradation in performance, is there a reason why the BlocksMap shouldn't or couldn't be stored on disk?
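The cache Billy describes could be prototyped with a plain LRU map: keep the hottest BlocksMap entries on the heap and fall through to disk on a miss. A minimal sketch, assuming an arbitrary capacity and leaving the disk fallback out; `LinkedHashMap` in access-order mode gives the LRU eviction policy almost for free:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of a bounded LRU cache for BlocksMap entries. Capacity is a
// made-up tuning knob; on a miss, a real implementation would consult
// the on-disk store and insert the result here.
public class BlocksMapCache<K, V> extends LinkedHashMap<K, V> {
    private final int capacity;

    public BlocksMapCache(int capacity) {
        super(16, 0.75f, true); // true = access order, i.e. LRU iteration
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > capacity; // evict the least-recently-used entry
    }

    public static void main(String[] args) {
        BlocksMapCache<Long, String> cache = new BlocksMapCache<>(2);
        cache.put(1L, "block-1");
        cache.put(2L, "block-2");
        cache.get(1L);            // touch block-1 so it is most recent
        cache.put(3L, "block-3"); // evicts block-2, the LRU entry
        System.out.println(cache.containsKey(2L)); // prints false
        System.out.println(cache.containsKey(1L)); // prints true
    }
}
```

With a skewed access pattern (a small hot set of files), such a cache would absorb most lookups and leave only cold entries paying the seek cost.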
Re: Namenode BlocksMap on Disk
We could also try mounting the relevant directory on ramfs to reduce the performance degradation.

-Sagar

Billy Pearson wrote:
> Besides just storing to disk I would like to see some sort of cache, like a block cache, that caches parts of the BlocksMap. [...] This would help reduce the hits to disk for lookups and still give us the ability to lower the memory requirement for the namenode.
Re: Namenode BlocksMap on Disk
Dennis Kubes wrote:
> 2) Besides possible slight degradation in performance, is there a reason why the BlocksMap shouldn't or couldn't be stored on disk?

I think the assumption is that it would be considerably more than a slight degradation. I've seen the namenode benchmarked at over 50,000 opens per second. If file data is on disk, and the namespace is considerably bigger than RAM, then a seek would be required per access. At 10 ms/seek, that would give only 100 opens per second, or 500x slower. Flash storage today peaks at around 5k seeks/second. For smaller clusters the namenode might not need to perform 50k opens/second, but for larger clusters we do not want the namenode to become a bottleneck.

Doug
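Doug's back-of-the-envelope numbers can be checked directly: if every open costs one seek, throughput is bounded by the seek rate.

```java
// Seek-bound throughput: one seek per open means
// opens/sec = 1000 ms / (ms per seek).
public class SeekMath {
    static long opensPerSecond(double seekMillis) {
        return Math.round(1000.0 / seekMillis);
    }

    public static void main(String[] args) {
        long inMemory  = 50_000;              // benchmarked opens/sec in RAM
        long diskBound = opensPerSecond(10);  // 10 ms per disk seek -> 100
        System.out.println(diskBound);            // 100
        System.out.println(inMemory / diskBound); // 500 (the 500x slowdown)
        System.out.println(opensPerSecond(0.2));  // ~5,000/sec, flash-class
    }
}
```

The flash figure matches Doug's ~5k seeks/second: even solid-state storage would cap the namenode an order of magnitude below the in-memory benchmark.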
Re: Namenode BlocksMap on Disk
Dennis Kubes wrote:
> 1) Is there any current work to allow the namenode to store on disk versus in memory? This could be a configurable option.
> 2) Besides possible slight degradation in performance, is there a reason why the BlocksMap shouldn't or couldn't be stored on disk?

As Doug mentioned, the main worry is that this would drastically reduce performance. Part of the reason is that a large chunk of the work on the NameNode happens under a single global lock, so if there is a seek under that lock, it affects everything else.

One good long-term fix is to make it easy to split the namespace between multiple namenodes. There was some work done on supporting volumes. Also, the fact that HDFS now supports symbolic links might make it easier for someone adventurous to use them as a quick hack to get around this.

If you have a rough prototype implementation, I am sure there will be a lot of interest in evaluating it. If Java has any disk-based or memory-mapped data structures, that might be the quickest way to try out its effects.

Raghu
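Java does ship one building block in the direction Raghu suggests: `java.nio`'s `MappedByteBuffer` lets the OS page file-backed data in and out of RAM. The sketch below treats a fixed-width record file as a crude on-disk array of block entries; the record layout (one long per block slot) is made up purely for illustration:

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Quick prototype idea: memory-map a fixed-width record file so the OS
// page cache decides which parts of the "BlocksMap" stay resident.
// The 8-byte record layout here is hypothetical.
public class MappedBlocks {
    static final int RECORD = 8; // one long per record slot

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("blocksmap", ".dat");
        try (FileChannel ch = FileChannel.open(file,
                StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            // Map room for 100 records; untouched pages cost no RAM.
            MappedByteBuffer buf =
                ch.map(FileChannel.MapMode.READ_WRITE, 0, 100 * RECORD);
            buf.putLong(5 * RECORD, 12345L);             // write slot 5
            System.out.println(buf.getLong(5 * RECORD)); // prints 12345
        }
        Files.delete(file);
    }
}
```

This gets the paging behavior cheaply, but note it does nothing about the global-lock problem Raghu raises: a page fault under that lock would still stall every other operation.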
Re: Namenode BlocksMap on Disk
On Nov 26, 2008, at 12:08 PM, Doug Cutting wrote:
> I've seen the namenode benchmarked at over 50,000 opens per second.

:) Do you have any graphs you can share showing 50k opens/second (publicly or privately)? The more external benchmarking data I have, the more I can encourage adoption at my university...

Brian
Re: Namenode BlocksMap on Disk
Brian Bockelman wrote:
> Do you have any graphs you can share showing 50k opens/second (publicly or privately)? The more external benchmarking data I have, the more I can encourage adoption at my university...

The 50k opens/second figure is from some internal benchmarks run at Y! nearly a year ago. (It doesn't look like Y! runs that benchmark regularly anymore, as far as I can tell.) I copied the graph to: http://people.apache.org/~cutting/nn500.png

Note that all of the operations that modify the namespace top out at around 5k/second, since these are logged and flushed to disk.

I found some more recent namenode micro-benchmarks at: http://tinyurl.com/6bxoxz These indicate that actual use doesn't hit these levels, but would still, on large clusters, be adversely affected by moving to a disk-based namespace.

Doug