Re: Namenode BlocksMap on Disk

2008-12-01 Thread Doug Cutting

Billy Pearson wrote:
We are looking for a way to also support smaller clusters that might 
overrun their heap size, causing the cluster to crash.


Support for namespaces larger than RAM would indeed be a good feature to 
have.  Implementing this without impacting large cluster in-memory 
namenode performance should be possible, but may or may not be easy. 
You are welcome to tackle this task if it is a priority for you.


Doug


Namenode BlocksMap on Disk

2008-11-26 Thread Dennis Kubes
From time to time a message pops up on the mailing list about OOM 
errors for the namenode because of too many files.  Most recently there 
was a 1.7 million file installation that was failing.  I know the simple 
solution to this is to have a larger java heap for the namenode.  But 
the non-simple way would be to convert the BlocksMap for the NameNode to 
be stored on disk and then queried and updated for operations.  This 
would eliminate memory problems for large file installations but also 
might degrade performance slightly.  Questions:


1) Is there any current work to allow the namenode to store on disk 
versus in memory?  This could be a configurable option.


2) Besides possible slight degradation in performance, is there a reason 
why the BlocksMap shouldn't or couldn't be stored on disk?


I am willing to put forth the work to make this happen.  Just want to 
make sure I am not going down the wrong path to begin with.


Dennis


Re: Namenode BlocksMap on Disk

2008-11-26 Thread Billy Pearson
I would like to see something like this also.  I run 32-bit servers, so I am 
limited in how much memory I can use for heap.  Besides just storing to disk, 
I would like to see some sort of cache, like a block cache, that will cache 
parts of the BlocksMap.  This would help reduce the hits to disk for lookups 
and still give us the ability to lower the memory requirement for the namenode.
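
A minimal sketch of the cache idea, assuming a least-recently-used eviction 
policy and a generic backing store.  None of these names exist in Hadoop: 
CachedBlockLookup and BackingStore are hypothetical, and the String value 
stands in for whatever per-block metadata the BlocksMap would hold.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical LRU cache in front of a slower, disk-like lookup. Not HDFS code. */
class CachedBlockLookup {
    /** Stand-in for an on-disk store; a real one would seek and read. */
    interface BackingStore {
        String load(long blockId);
    }

    private final BackingStore store;
    private final Map<Long, String> cache;
    private long misses = 0;

    CachedBlockLookup(BackingStore store, final int capacity) {
        this.store = store;
        // accessOrder=true keeps entries ordered from least to most recently used.
        this.cache = new LinkedHashMap<Long, String>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<Long, String> eldest) {
                // Evict the least recently used entry once capacity is exceeded.
                return size() > capacity;
            }
        };
    }

    /** Returns the cached value, going to the backing store only on a miss. */
    synchronized String get(long blockId) {
        String value = cache.get(blockId);
        if (value == null) {
            misses++;
            value = store.load(blockId);
            cache.put(blockId, value);
        }
        return value;
    }

    /** Number of lookups that had to go to the backing store. */
    synchronized long misses() {
        return misses;
    }
}
```

Only the misses pay the disk cost, so the benefit depends entirely on how 
much locality the lookups have.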


Billy


Re: Namenode BlocksMap on Disk

2008-11-26 Thread Sagar Naik
We could also try mounting that particular directory on ramfs to reduce 
the performance degradation.


-Sagar

Re: Namenode BlocksMap on Disk

2008-11-26 Thread Doug Cutting

Dennis Kubes wrote:
2) Besides possible slight degradation in performance, is there a reason 
why the BlocksMap shouldn't or couldn't be stored on disk?


I think the assumption is that it would be considerably more than slight 
degradation.  I've seen the namenode benchmarked at over 50,000 opens 
per second.  If file data is on disk, and the namespace is considerably 
bigger than RAM, then a seek would be required per access.  At 
10 ms/seek, that would give only 100 opens per second, or 500x slower. 
Flash storage today peaks at around 5k seeks/second.
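
The arithmetic above can be checked with a short sketch.  The figures are 
the ones quoted in this message; the class and method names are hypothetical 
and the model assumes exactly one seek per open, with no caching.

```java
/** Back-of-the-envelope throughput figures taken from the message above. */
class SeekBoundEstimate {
    /** Upper bound on operations per second if each operation costs one seek. */
    static double opsPerSecond(double seekMillis) {
        return 1000.0 / seekMillis;
    }

    /** How many times slower a seek-bound namenode is than the in-memory figure. */
    static double slowdown(double inMemoryOpsPerSecond, double seekMillis) {
        return inMemoryOpsPerSecond / opsPerSecond(seekMillis);
    }

    public static void main(String[] args) {
        double inMemory = 50_000.0;          // opens/second quoted for the in-memory namenode
        double diskSeekMs = 10.0;            // seek time assumed in the message
        double flashSeeksPerSec = 5_000.0;   // flash figure quoted in the message

        System.out.println("disk opens/s:   " + opsPerSecond(diskSeekMs));
        System.out.println("disk slowdown:  " + slowdown(inMemory, diskSeekMs));
        System.out.println("flash slowdown: " + inMemory / flashSeeksPerSec);
    }
}
```

Under these assumptions the quoted flash rate would still leave the namenode 
about 10x slower than the in-memory benchmark.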


For smaller clusters the namenode might not need to be able to perform 
50k opens/second, but for larger clusters we do not want the namenode to 
become a bottleneck.


Doug


Re: Namenode BlocksMap on Disk

2008-11-26 Thread Raghu Angadi

Dennis Kubes wrote:


From time to time a message pops up on the mailing list about OOM 
errors for the namenode because of too many files.  Most recently there 
was a 1.7 million file installation that was failing.  I know the simple 
solution to this is to have a larger java heap for the namenode.  But 
the non-simple way would be to convert the BlocksMap for the NameNode to 
be stored on disk and then queried and updated for operations.  This 
would eliminate memory problems for large file installations but also 
might degrade performance slightly.  Questions:


1) Is there any current work to allow the namenode to store on disk 
versus in memory?  This could be a configurable option.


2) Besides possible slight degradation in performance, is there a reason 
why the BlocksMap shouldn't or couldn't be stored on disk?


As Doug mentioned, the main worry is that this will drastically reduce 
performance. Part of the reason is that a large chunk of the work on the 
NameNode happens under a single global lock. So if there is a seek under 
this lock, it affects everything else.


One good long-term fix for this is to make it easy to split the 
namespace between multiple namenodes. There was some work done on 
supporting volumes. Also, the fact that HDFS now supports symbolic 
links might make it easier for someone adventurous to use them as a 
quick hack to get around this.


If you have a rough prototype implementation I am sure there will be a 
lot of interest in evaluating it. If Java has any disk-based or 
memory-mapped data structures, that might be the quickest way to try out 
its effects.
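
For a quick experiment along these lines, java.nio offers memory-mapped 
files through FileChannel.map.  The following is only a sketch under stated 
assumptions: MappedLongTable is a hypothetical class, it stores one long per 
fixed-width slot rather than real BlocksMap entries, and a single mapping is 
limited to 2 GB.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Hypothetical fixed-width table of long values backed by a memory-mapped file. */
class MappedLongTable {
    private static final int SLOT_BYTES = Long.BYTES;
    private final MappedByteBuffer buffer;
    private final int slots;

    MappedLongTable(Path file, int slots) throws IOException {
        this.slots = slots;
        MappedByteBuffer mapped;
        try (FileChannel ch = FileChannel.open(file,
                StandardOpenOption.CREATE,
                StandardOpenOption.READ,
                StandardOpenOption.WRITE)) {
            // A READ_WRITE mapping extends the file if it is shorter than the
            // requested size, and stays valid after the channel is closed.
            mapped = ch.map(FileChannel.MapMode.READ_WRITE, 0, (long) slots * SLOT_BYTES);
        }
        this.buffer = mapped;
    }

    /** Writes a value into the given slot; the OS pages it out to disk as needed. */
    void put(int slot, long value) {
        buffer.putLong(checked(slot) * SLOT_BYTES, value);
    }

    /** Reads a slot; this may trigger a page fault (and a seek) if not resident. */
    long get(int slot) {
        return buffer.getLong(checked(slot) * SLOT_BYTES);
    }

    /** Forces modified pages out to the underlying file. */
    void flush() {
        buffer.force();
    }

    private int checked(int slot) {
        if (slot < 0 || slot >= slots) {
            throw new IndexOutOfBoundsException("slot " + slot);
        }
        return slot;
    }
}
```

With a structure like this the operating system's page cache decides what 
stays in RAM, so a lookup of a non-resident slot still costs a seek, which 
is the case the performance concern above is about.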


Raghu.


Re: Namenode BlocksMap on Disk

2008-11-26 Thread Brian Bockelman


On Nov 26, 2008, at 12:08 PM, Doug Cutting wrote:


Dennis Kubes wrote:
2) Besides possible slight degradation in performance, is there a  
reason why the BlocksMap shouldn't or couldn't be stored on disk?


I think the assumption is that it would be considerably more than  
slight degradation.  I've seen the namenode benchmarked at over  
50,000 opens per second.  If file data is on disk, and the namespace  
is considerably bigger than RAM, then a seek would be required per  
access.  At 10 ms/seek, that would give only 100 opens per second, or  
500x slower. Flash storage today peaks at around 5k seeks/second.


For smaller clusters the namenode might not need to be able to  
perform 50k opens/second, but for larger clusters we do not want the  
namenode to become a bottleneck.




:)

Do you have any graphs you can share showing 50k opens / second (could  
be publicly or privately)?  The more external benchmarking data I  
have, the more I can encourage adoption amongst my university...


Brian



Re: Namenode BlocksMap on Disk

2008-11-26 Thread Doug Cutting

Brian Bockelman wrote:
Do you have any graphs you can share showing 50k opens / second (could 
be publicly or privately)?  The more external benchmarking data I have, 
the more I can encourage adoption amongst my university...


The 50k opens/second is from some internal benchmarks run at Y! nearly a 
year ago.  (It doesn't look like Y! runs that benchmark regularly 
anymore, as far as I can tell.)  I copied the graph to:


http://people.apache.org/~cutting/nn500.png

Note that all of the operations that modify the namespace top out at 
around 5k/second, since these are logged and flushed to disk.


I found some more recent micro namenode benchmarks at:

http://tinyurl.com/6bxoxz

These indicate that actual use doesn't hit these levels, but would 
still, on large clusters, be adversely affected by moving to a 
disk-based namespace.


Doug