How often is "safe" depends on what probabilities you are willing to accept.

I just checked on one of our clusters with 4PB of data: the scanner fixes about 1 block a day. Assuming an average block size of 64MB (pretty high), the probability that all 3 replicas of a block go bad within 3 weeks is on the order of 1e-12. In reality it is mostly 2-3 orders of magnitude less probable.
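Back of the envelope, one way to land in that range (treating the 4PB as raw capacity, i.e. roughly $6\times10^{7}$ replicas and $2\times10^{7}$ unique blocks at 3x replication, and treating corruptions as independent -- those assumptions are mine, so take the exact figure loosely):

$$p_{\text{replica}} \approx \frac{21\ \text{bad replicas}}{6.25\times10^{7}\ \text{replicas}} \approx 3.4\times10^{-7}, \qquad P_{\text{some block loses all 3}} \approx 2.1\times10^{7}\cdot p_{\text{replica}}^{3} \approx 8\times10^{-13}$$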

Raghu.

Brian Bockelman wrote:

On Nov 13, 2008, at 11:32 AM, Raghu Angadi wrote:

Brian Bockelman wrote:
Hey all,
I noticed that the maximum throttle for the datanode block scanner is hardcoded at 8MB/s. I think this is insufficient; on a fully loaded Sun Thumper, a full scan at 8MB/s would take something like 70 days. Is it possible to make this throttle a bit smarter? At the very least, would anyone object to a patch which exposed this throttle as a config option? Alternatively, a smarter idea would be to throttle the block scanner at (8MB/s) * (# of volumes), under the assumption that there is at least 1 disk per volume.
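Rough sketch of what I mean, in either flavor (the property name below is made up purely for illustration, and the real change would hook into DataBlockScanner's existing throttle rather than live in a separate class):

  import org.apache.hadoop.conf.Configuration;

  public class ScanRateSketch {
    static final long HARDCODED_MAX = 8L * 1024 * 1024;  // today's fixed 8MB/s cap

    /** Option 1: expose the cap as a config knob (hypothetical property name). */
    static long fromConfig(Configuration conf) {
      return conf.getLong("dfs.datanode.scan.rate.bytes", HARDCODED_MAX);
    }

    /** Option 2: scale the cap by the number of volumes, assuming >= 1 disk per volume. */
    static long perVolume(int numVolumes) {
      return HARDCODED_MAX * numVolumes;
    }
  }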

Making the max configurable seems useful. Either of the above options is fine, though the first one might be simpler for configuration.

8MB/s was calculated for around 4TB of data on a node: given ~80k seconds a day, a full scan takes around 6-7 days. 8-10 MB/s is not too bad a load on a 2-4 disk machine.
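That is:

$$\frac{4\ \text{TB}}{8\ \text{MB/s}} = 5\times10^{5}\ \text{s} \approx 6.25\ \text{days}$$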

Hm... on second thought, however trivial the resulting per-disk I/O would be, on the Thumper example the maximum throttle would be 3Gbps: that's a nontrivial load on the bus. How do other "big sites" handle this? We're currently at 110TB raw, are considering converting ~240TB over from another file system, and are planning to grow to 800TB during 2009. A quick calculation shows that to do a weekly scan at that size, we're talking ~10Gbps of sustained reads.
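(For the numbers: the Thumper has 48 disks, so per-volume scaling gives $48 \times 8\ \text{MB/s} = 384\ \text{MB/s} \approx 3\ \text{Gbps}$; and a weekly pass over 800TB, summed across the whole system, is $\frac{800\ \text{TB}}{7\ \text{days}} \approx 1.3\ \text{GB/s} \approx 10.6\ \text{Gbps}$.)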

You have 110 TB on a single datanode and are moving to 800TB nodes? Note that this rate applies to the amount of data on a single datanode.


Nah - 110TB total in the system (200 datanodes), and we will move to 800TB total (probably 250 datanodes).

However, we do have some larger nodes (we range from 80GB to 48TB per node); recent and planned purchases are in the 4-8TB per node range, but I'd sure hate to throw away 48TB of disks :)

On the 48TB node, a scan at 8MB/s would take 70 days. I'd have to run at a rate of 80MB/s to scan through in 7 days. While 80MB/s over 48 disks is not much, I was curious about how the rest of the system would perform (the node is in production on a different file system right now, so borrowing it is not easy...); 80MB/s sounds like an awful lot for "background noise".
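That is, $\frac{48\ \text{TB}}{8\ \text{MB/s}} = 6\times10^{6}\ \text{s} \approx 69\ \text{days}$, while finishing in 7 days needs $\frac{48\ \text{TB}}{7\ \text{days}} \approx 80\ \text{MB/s}$.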

Do any other large sites run such large nodes? How long of a period between block scans do sites use in order to feel "safe"?

Brian

Raghu.

I still worry that the rate is too low; if we have a suspicious node, or users report a problematic file, waiting a week for a full scan is too long. I've asked a student to implement a tool which can trigger a full block scan of a path (the idea would be to be able to do "hadoop fsck /path/to/file -deep"). What would be the best approach for him to take to initiate a high-rate "full volume" or "full datanode" scan?
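For the single-file case, the simplest thing I can come up with is to just read the whole path back through the client API, since (as I understand it) the client verifies checksums as it reads; something like the sketch below. But that drags all the data over the network, which is why I'm asking about a way to kick off a datanode-local "full volume" or "full datanode" scan instead.

  import java.io.IOException;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FSDataInputStream;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  /** Sketch: read every byte of a path so its blocks get checksum-verified on the way through. */
  public class DeepCheck {
    public static void main(String[] args) throws IOException {
      Configuration conf = new Configuration();
      FileSystem fs = FileSystem.get(conf);
      byte[] buf = new byte[1 << 20];            // 1MB read buffer
      FSDataInputStream in = fs.open(new Path(args[0]));
      try {
        while (in.read(buf) != -1) {
          // data is discarded; we only care about the checksum check during the read
        }
      } finally {
        in.close();
      }
    }
  }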


