Re: Datanode block scans

2008-11-14 Thread Steve Loughran

Raghu Angadi wrote:


How often is safe enough depends on what probability of loss you are willing to accept.

I just checked: on one of our clusters with 4PB of data, the scanner fixes 
about 1 block a day. Assuming an average block size of 64MB (pretty high), 
the probability that all 3 replicas of a block go bad within 3 weeks is on 
the order of 1e-12. In reality it is probably 2-3 orders of magnitude less likely.


Raghu.



That's quite interesting data. Any plans to publish a paper on disk 
failures in an HDFS cluster?


On a related note: do you ever scan the rest of the disk for trouble, 
that is, the OS filesystem as root, just to catch problems in the server 
itself that could lead to failing jobs?





Re: Datanode block scans

2008-11-13 Thread Raghu Angadi

Brian Bockelman wrote:

Hey all,

I noticed that the maximum throttle for the datanode block scanner is 
hardcoded at 8MB/s.


I think this is insufficient; on a fully loaded Sun Thumper, a full scan 
at 8MB/s would take something like 70 days.


Is it possible to make this throttle a bit smarter?  At the very least, 
would anyone object to a patch which exposed this throttle as a config 
option?  Alternatively, a smarter idea would be to throttle the block 
scanner at (8MB/s) * (# of volumes), under the assumption that there is 
at least 1 disk per volume.


Making the max configurable seems useful. Either of the above options is 
fine, though the first one might be simpler for configuration.


8MB/s was calculated for around 4TB of data on a node. Given 80k seconds 
a day, that works out to around 6-7 days. 8-10 MB/s is not too bad a load 
on a 2-4 disk machine.
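
A minimal sketch of what the configurable maximum could look like, with the 
per-volume variant alongside it. The property name is hypothetical (the thread 
does not settle on one), and the surrounding scanner code is elided; the 6-7 
day figure above falls out of the same numbers.

import org.apache.hadoop.conf.Configuration;

// Sketch only: "dfs.datanode.scan.rate.max" is an illustrative key, not an
// existing Hadoop property.
public class ScanRateConfig {
  static final long DEFAULT_MAX_SCAN_RATE = 8L * 1024 * 1024;  // today's hardcoded 8MB/s

  // Option 1: read the ceiling from the config, falling back to 8MB/s.
  static long maxScanRate(Configuration conf) {
    return conf.getLong("dfs.datanode.scan.rate.max", DEFAULT_MAX_SCAN_RATE);
  }

  // Option 2: scale the default by the number of data volumes on the node.
  static long maxScanRate(int volumes) {
    return DEFAULT_MAX_SCAN_RATE * Math.max(1, volumes);
  }

  public static void main(String[] args) {
    long rate = maxScanRate(new Configuration());
    // 4TB at 8MB/s is roughly 500,000 seconds; at ~80k usable seconds a day
    // that is the ~6-7 day scan described above.
    System.out.printf("%d bytes/s -> ~%.1f days for 4TB%n", rate, 4e12 / rate / 80000);
  }
}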


Hm... on second thought, however trivial the resulting disk I/O would 
be, in the Thumper example the maximum throttle would be about 3Gbps: 
that's a nontrivial load on the bus.


How do other big sites handle this?  We're currently at 110TB raw, are 
considering converting ~240TB over from another file system, and are 
planning to grow to 800TB during 2009.  A quick calculation shows that 
to do a weekly scan at that size, we're talking ~10Gbps of sustained reads.


You have 110TB on a single datanode and are moving to 800TB nodes? Note 
that this rate applies to the amount of data on a single datanode.
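
For reference, a quick check of the arithmetic behind those two figures, 
assuming a 48-volume Thumper and a 7-day scan window (both taken from the 
surrounding discussion):

public class ScanBandwidthEstimate {
  public static void main(String[] args) {
    // Per-volume throttling on a 48-disk Thumper: 8MB/s * 48 = 384MB/s.
    double thumper = 8e6 * 48;
    System.out.printf("Thumper ceiling: ~%.1f Gbit/s%n", thumper * 8 / 1e9);     // ~3.1

    // Scanning 800TB once a week, summed over the whole cluster.
    double weekly = 800e12 / (7 * 86400.0);
    System.out.printf("800TB/week: ~%.1f Gbit/s aggregate%n", weekly * 8 / 1e9); // ~10.6
    // Per Raghu's point above, the throttle is per datanode, so the aggregate
    // figure is spread across every node rather than one machine.
  }
}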


Raghu.

I still worry that the rate is too low; if we have a suspicious node, or 
users report a problematic file, waiting a week for a full scan is too 
long.  I've asked a student to implement a tool which can trigger a full 
block scan of a path (the idea being to be able to run hadoop fsck 
/path/to/file -deep).  What would be the best approach for him to take 
to initiate a high-rate full volume or full datanode scan?
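
One low-tech starting point for such a tool (a sketch, not an existing 
command; the class name and the read-everything strategy are assumptions): 
stream every file under the path through a normal HDFS read. That verifies 
checksums on whichever replica each read hits and lets the client report any 
corrupt block it finds, but it does not touch the other replicas or raise the 
datanode scanner's own rate.

import java.io.InputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical helper, not part of Hadoop: force a checksum-verified read of
// every file directly under a path (recursion left out for brevity).
public class DeepRead {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    byte[] buf = new byte[1 << 20];
    for (FileStatus stat : fs.listStatus(new Path(args[0]))) {
      if (stat.isDir()) continue;
      InputStream in = fs.open(stat.getPath());
      try {
        // Reading the data verifies block checksums; corrupt replicas that
        // the read hits are reported back to the namenode by the client.
        while (in.read(buf) != -1) { }
      } finally {
        in.close();
      }
    }
  }
}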





Re: Datanode block scans

2008-11-13 Thread Brian Bockelman


On Nov 13, 2008, at 11:32 AM, Raghu Angadi wrote:




You have 110TB on a single datanode and are moving to 800TB nodes? Note 
that this rate applies to the amount of data on a single datanode.




Nah, 110TB total in the system (200 datanodes), and we will move to 800TB 
total (probably 250 datanodes).


However, we do have some larger nodes (we range from 80GB to 48TB per  
node); recent and planned purchases are in the 4-8TB per node range,  
but I'd sure hate to throw away 48TB of disks :)


On the 48TB node, a scan at 8MB/s would take 70 days.  I'd have to run  
at a rate of 80MB/s to scan through in 7 days.  While 80MB/s over 48  
disks is not much, I was curious about how the rest of the system  
would perform (the node is in production on a different file system  
right now, so borrowing it is not easy...); 80MB/s sounds like an  
awful lot for background noise.
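
The per-disk numbers behind that, assuming 48 data disks and a 7-day window:

public class PerDiskScanRate {
  public static void main(String[] args) {
    double nodeRate = 48e12 / (7 * 86400.0);   // 48TB in a week: ~79MB/s for the node
    double perDisk = nodeRate / 48;            // ~1.7MB/s per disk
    System.out.printf("node ~%.0f MB/s, per disk ~%.1f MB/s%n",
        nodeRate / 1e6, perDisk / 1e6);
  }
}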


Do any other large sites run such large nodes?  How long a period 
between block scans do sites use in order to feel safe?


Brian




Re: Datanode block scans

2008-11-13 Thread Raghu Angadi


How often is safe enough depends on what probability of loss you are willing to accept.

I just checked: on one of our clusters with 4PB of data, the scanner fixes 
about 1 block a day. Assuming an average block size of 64MB (pretty high), 
the probability that all 3 replicas of a block go bad within 3 weeks is on 
the order of 1e-12. In reality it is probably 2-3 orders of magnitude less likely.
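
A rough reconstruction of that estimate, assuming the 4PB counts replica 
bytes, 3x replication, a 3-week scan window, and independent replica failures:

public class LossProbabilityEstimate {
  public static void main(String[] args) {
    double replicas = 4e15 / 64e6;                 // ~6.25e7 block replicas in 4PB
    double blocks = replicas / 3;                  // ~2.1e7 distinct blocks
    double badIn3Weeks = 21 * (1.0 / replicas);    // scanner finds ~1 bad replica per day
    // Chance that some block loses all 3 of its replicas within one 3-week window.
    double pLoss = blocks * Math.pow(badIn3Weeks, 3);
    System.out.printf("~%.0e%n", pLoss);           // on the order of 1e-12
  }
}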


Raghu.

Brian Bockelman wrote:


Do any other large sites run such large nodes?  How long a period 
between block scans do sites use in order to feel safe?

