Brian Bockelman wrote:
Hey all,
I noticed that the maximum throttle for the datanode block scanner is
hardcoded at 8MB/s.
I think this is insufficient; on a fully loaded Sun Thumper, a full scan
at 8MB/s would take something like 70 days.
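For reference, the ~70-day figure is easy to sanity-check; the ~48 TB raw capacity for a fully loaded Thumper (Sun Fire X4500, 48 x 1 TB disks) is my assumption:

```python
# Back-of-envelope check of the "something like 70 days" claim.
# Assumes a fully loaded Thumper holds ~48 TB raw (48 x 1 TB disks).
thumper_bytes = 48 * 10**12
scan_rate = 8 * 10**6            # the hardcoded 8 MB/s throttle

days = thumper_bytes / scan_rate / 86_400
print(round(days))               # ~69 days
```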
Is it possible to make this throttle a bit smarter? At the very least,
would anyone object to a patch that exposes this throttle as a config
option? Alternatively, a smarter idea would be to throttle the block
scanner at (8 MB/s) * (# of volumes), under the assumption that there is
at least one disk per volume.
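A minimal sketch of the per-volume option (the names here are mine, not from the Hadoop source):

```python
# Hypothetical per-volume throttle: scale the current hardcoded cap by
# the number of configured data volumes (e.g. entries in dfs.data.dir).
BASE_RATE = 8 * 1024 * 1024          # today's hardcoded 8 MB/s

def max_scan_rate(num_volumes, base_rate=BASE_RATE):
    """Throttle ceiling in bytes/s, assuming >= 1 physical disk per volume."""
    return base_rate * max(1, num_volumes)

print(max_scan_rate(4))              # 4-volume node -> 33554432 B/s (32 MB/s)
```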
Making the max configurable seems useful. Either of the above options is
fine, though the first one might be simpler for configuration.
8 MB/s was calculated for around 4 TB of data on a node. Given roughly
80k seconds in a day, a full scan takes around 6-7 days. 8-10 MB/s is
not too bad a load on a 2-4 disk machine.
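That estimate checks out, using the same ~80k usable seconds per day:

```python
# Raghu's figure: 4 TB at 8 MB/s, counting ~80k usable seconds per day.
scan_seconds = 4 * 10**12 / (8 * 10**6)   # 500,000 s
print(scan_seconds / 80_000)              # 6.25 -> "around 6-7 days"
```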
Hm... on second thought: however trivial the per-disk I/O would be, on
the Thumper example the aggregate throttle (48 volumes at 8 MB/s each)
would be about 3 Gbps, and that's a nontrivial load on the bus.
How do other "big sites" handle this? We're currently at 110TB raw, are
considering converting ~240TB over from another file system, and are
planning to grow to 800TB during 2009. A quick calculation shows that
to do a weekly scan at that size, we're talking ~10Gbps of sustained reads.
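The ~10 Gbps figure follows directly:

```python
# Sustained read rate needed to scan 800 TB once a week.
bytes_total = 800 * 10**12
week_seconds = 7 * 86_400                 # 604,800 s

gbps = bytes_total * 8 / week_seconds / 10**9
print(round(gbps, 1))                     # ~10.6 Gbps
```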
You have 110 TB on a single datanode and are moving to 800 TB nodes?
Note that this rate applies to the amount of data on a single datanode.
Raghu.
I still worry that the rate is too low; if we have a suspicious node, or
users report a problematic file, waiting a week for a full scan is too
long. I've asked a student to implement a tool that can trigger a full
block scan of a path (the idea being that one could run "hadoop fsck
/path/to/file -deep"). What would be the best approach for him to take
to initiate a high-rate "full volume" or "full datanode" scan?
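In the meantime, one crude way to "deep-check" a single file today is simply to read it end-to-end: the HDFS client verifies block checksums as it reads and reports any corrupt replica it encounters back to the namenode. A sketch, assuming the standard `hadoop fs -cat` shell command (the `cmd` parameter just keeps the helper from being hardwired to a Hadoop install):

```python
import subprocess

def deep_check(path, cmd=("hadoop", "fs", "-cat")):
    """Read every byte of `path` through the HDFS client, discarding the
    data. The client checksums each block as it reads, so a zero exit
    status means the whole file was read and verified successfully."""
    return subprocess.call(list(cmd) + [path],
                           stdout=subprocess.DEVNULL) == 0
```

Note this only covers the replicas the client happens to read from, so it complements rather than replaces a high-rate full-volume scan.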