On 29/09/11 13:28, Brian Bockelman wrote:

On Sep 29, 2011, at 1:50 AM, praveenesh kumar wrote:

Hi,

I want to know whether we can use SAN storage for a Hadoop cluster setup.
If yes, what are the best practices?

Is it a good approach, considering that "the underlying power of Hadoop
is co-locating the processing power (CPU) with the data storage and thus it
must be local storage to be effective"?
*But is it still right to say “local is better” when I have a single local
5400 RPM IDE drive, which would be dramatically slower than SAN storage
striped across many drives spinning at 10k RPM and accessed via
Fibre Channel?*

Hi Praveenesh,

Two things:
1) If the choice is between a single 5400 RPM IDE drive (you can still buy those?) and a high-end SAN, the
high-end SAN is going to win.  That's often a false comparison: the real question is usually "What can
I buy for $50k?".  In that case (setting aside organizational politics), you can buy more
spindles in the "traditional" Hadoop setup than in the SAN.
   - Also, if you're latency limited, you're likely working against yourself.  
The best thing I ever did for my organization was make our software work just 
as well with 100ms latency as with 1ms latency.
2) As Paul pointed out, you have to ask yourself whether the SAN is shared or 
dedicated.  Many SANs don't have the ability to strongly partition workloads 
between users.
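
The budget argument in point 1 can be made concrete with a rough Python sketch. Only the $50k budget comes from the message above. Every price and per-drive throughput figure below is a made-up placeholder, not a vendor quote or a measurement, and the function names are hypothetical. It also assumes aggregate streaming bandwidth scales linearly with the number of spindles, which ignores controller, network and fabric limits.

```python
def spindles_for_budget(budget_usd, cost_per_spindle_usd):
    """Whole number of drives a budget buys at a given all-in cost per drive."""
    return int(budget_usd // cost_per_spindle_usd)


def aggregate_mb_per_s(spindles, mb_per_s_per_spindle):
    """Naive aggregate streaming bandwidth: assumes perfect linear scaling."""
    return spindles * mb_per_s_per_spindle


BUDGET_USD = 50_000  # the "$50k" from the message above

# ASSUMPTION: placeholder all-in costs per drive (drive plus its share of
# chassis, controllers, switches). Replace with real quotes before using.
LOCAL_COST_PER_SPINDLE_USD = 500
SAN_COST_PER_SPINDLE_USD = 2_500

# ASSUMPTION: placeholder sequential throughput per drive, in MB/s.
LOCAL_MB_PER_S = 80
SAN_MB_PER_S = 120

local_spindles = spindles_for_budget(BUDGET_USD, LOCAL_COST_PER_SPINDLE_USD)
san_spindles = spindles_for_budget(BUDGET_USD, SAN_COST_PER_SPINDLE_USD)

local_bandwidth = aggregate_mb_per_s(local_spindles, LOCAL_MB_PER_S)
san_bandwidth = aggregate_mb_per_s(san_spindles, SAN_MB_PER_S)

print(f"local disks: {local_spindles} spindles, ~{local_bandwidth} MB/s aggregate")
print(f"SAN:         {san_spindles} spindles, ~{san_bandwidth} MB/s aggregate")
```

With these placeholder numbers the faster SAN drives still lose on aggregate bandwidth because the same money buys far fewer of them. Different prices can reverse that result, so the comparison is only worth running with real quotes.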

Brian


One more: a SAN is a SPOF. [Gray05] includes the impact of a SAN outage on MS TerraServer, while [Jiang08] provides evidence that entry-level Fibre Channel storage is less reliable than SATA due to interconnects.

Anyone who criticises the NameNode for being a SPOF and relies on a SAN instead is missing something obvious.

[Gray05] Empirical Measurements of Disk Failure Rates and Error Rates
[Jiang08] Are disks the dominant contributor for storage failures?
