How often is often enough depends on what probability of loss you are willing to accept.
I just checked on one of our clusters with 4PB of data: the scanner fixes
about 1 block a day. Assuming an average block size of 64MB (pretty high),
the probability that all 3 replicas of one block go bad within 3 weeks is
on the order of 1e-12. In reality it is probably 2-3 orders of magnitude
less likely.
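To spell out how a figure in that range falls out (this is my own back-of-envelope reconstruction, assuming corruptions are independent and uniform across block replicas, not Raghu's exact math):

```python
# Back-of-envelope for the ~1e-12 figure above.
# Assumption: corruptions are independent and uniformly distributed.
DATA = 4e15          # 4 PB of data on the cluster
BLOCK = 64e6         # assumed 64 MB average block size
n_blocks = DATA / BLOCK            # ~62.5 million blocks

# The scanner fixes ~1 bad block per day, so the per-replica daily
# corruption probability is roughly 1 / n_blocks.
p_3weeks = 21 / n_blocks           # per-replica, over 3 weeks

# Probability that all 3 replicas of one particular block go bad,
# summed over all blocks (union bound).
p_any_loss = n_blocks * p_3weeks ** 3
print(f"{p_any_loss:.1e}")         # ~2.4e-12
```

The union bound overestimates slightly, which is consistent with the real risk being a couple of orders of magnitude lower.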
Raghu.
Brian Bockelman wrote:
On Nov 13, 2008, at 11:32 AM, Raghu Angadi wrote:
Brian Bockelman wrote:
Hey all,
I noticed that the maximum throttle for the datanode block scanner is
hardcoded at 8MB/s.
I think this is insufficient; on a fully loaded Sun Thumper, a full
scan at 8MB/s would take something like 70 days.
Is it possible to make this throttle a bit smarter? At the very
least, would anyone object to a patch which exposed this throttle as
a config option? Alternately, a smarter idea would be to throttle
the block scanner at (8MB/s) * (# of volumes), under the assumption
that there is at least 1 disk per volume.
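A quick sketch of the arithmetic behind both options (the 48TB / 48-volume Thumper figures are assumptions from this thread, not measured values):

```python
MB = 1e6
TB = 1e12
DAY = 86400  # seconds

def scan_days(capacity_bytes, rate_bytes_per_sec):
    """Days for the block scanner to cover all data at a given rate."""
    return capacity_bytes / rate_bytes_per_sec / DAY

# Hardcoded 8MB/s throttle on a fully loaded Thumper (48 TB assumed):
print(f"{scan_days(48 * TB, 8 * MB):.0f} days")       # ~69 days

# Proposed: scale the throttle by volume count (1 disk per volume):
volumes = 48
print(f"{scan_days(48 * TB, 8 * MB * volumes):.1f} days")  # ~1.4 days
```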
Making the max configurable seems useful. Either of the above options
is fine, though the first one might be simpler for configuration.
8MB/s was calculated for around 4TB of data on a node. Given 80k
seconds in a day, that works out to around 6-7 days per scan. 8-10 MB/s
is not too bad a load on a 2-4 disk machine.
Hm... on second thought, however trivial the resulting disk I/O would
be, on the Thumper example the maximum throttle would be about 3Gbps:
that's a nontrivial load on the bus.
How do other "big sites" handle this? We're currently at 110TB raw,
are considering converting ~240TB over from another file system, and
are planning to grow to 800TB during 2009. A quick calculation shows
that to do a weekly scan at that size, we're talking ~10Gbps of
sustained reads.
You have 110 TB on a single datanode and are moving to 800TB nodes? Note
that this rate applies to the amount of data on a single datanode.
Nah - 110TB total in the system (200 datanodes), and we will move to
800TB total (probably 250 datanodes).
However, we do have some larger nodes (we range from 80GB to 48TB per
node); recent and planned purchases are in the 4-8TB per node range, but
I'd sure hate to throw away 48TB of disks :)
On the 48TB node, a scan at 8MB/s would take 70 days. I'd have to run
at a rate of 80MB/s to scan through in 7 days. While 80MB/s over 48
disks is not much, I was curious about how the rest of the system would
perform (the node is in production on a different file system right now,
so borrowing it is not easy...); 80MB/s sounds like an awful lot for
"background noise".
Do any other large sites run such large nodes? How long of a period
between block scans do sites use in order to feel "safe" ?
Brian
Raghu.
I still worry that the rate is too low; if we have a suspicious node,
or users report a problematic file, waiting a week for a full scan is
too long. I've asked a student to implement a tool that can trigger
a full block scan of a path (the idea being to support something like
"hadoop fsck /path/to/file -deep"). What would be the best approach for
him to take to initiate a high-rate "full volume" or "full datanode" scan?