I agree with Matei.  Whether you will get good ROI on 10GigE depends very much 
on the types of jobs you run.
--Matt

On Jun 28, 2011, at 4:02 PM, Matei Zaharia wrote:

Ideally, to evaluate whether you want to go for 10GbE NICs, you would profile 
your target Hadoop workload and see whether it's communication-bound. Hadoop 
jobs can definitely be communication-bound if you shuffle a lot of data between 
map and reduce, but I've also seen a lot of clusters that are CPU-bound (due to 
decompression, running Python, or just running expensive user code) or 
disk-IO-bound. You might be surprised at what your bottleneck is.
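
As a rough illustration of that kind of check, here is a back-of-envelope sketch in Python. It is not a profiler: the function names and every number in it (shuffle volume per node, disk throughput, CPU seconds) are made-up assumptions that you would replace with measurements from your own jobs. It only compares how long each resource would need for the same job at 1GbE and at 10GbE.

```python
def phase_times(shuffle_bytes_per_node, nic_gbit_s, disk_mb_s, cpu_seconds):
    """Rough per-node seconds needed by network, disk, and CPU.

    All inputs are hypothetical estimates, not values reported by Hadoop.
    """
    network = shuffle_bytes_per_node * 8 / (nic_gbit_s * 1e9)  # bits / (bits per second)
    disk = shuffle_bytes_per_node / (disk_mb_s * 1e6)          # bytes / (bytes per second)
    return {"network": network, "disk": disk, "cpu": cpu_seconds}


def bottleneck(times):
    """Name of the resource with the largest estimated time."""
    return max(times, key=times.get)


if __name__ == "__main__":
    per_node = 50e9  # assumed: 50 GB shuffled per node
    for nic in (1, 10):  # 1GbE vs 10GbE
        t = phase_times(per_node, nic, disk_mb_s=200, cpu_seconds=300)
        print(f"{nic} GbE: bottleneck={bottleneck(t)} times={t}")
```

With these invented inputs the 1GbE case is limited by the network, while at 10GbE the CPU estimate dominates, so a faster NIC would only pay off up to the point where another resource becomes the limit.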

Matei

On Jun 28, 2011, at 3:06 PM, Saqib Jang -- Margalla Communications wrote:

> Matt,
> Thanks, this is helpful. I was wondering whether you have any thoughts
> on the list of other potential benefits of 10GbE NICs for Hadoop
> (listed in my original e-mail to the list)?
> 
> regards,
> Saqib
> 
> -----Original Message-----
> From: Matthew Foley [mailto:ma...@yahoo-inc.com] 
> Sent: Tuesday, June 28, 2011 12:04 PM
> To: common-user@hadoop.apache.org
> Cc: Matthew Foley
> Subject: Re: Sanity check re: value of 10GbE NICs for Hadoop?
> 
> Hadoop common provides an abstract FileSystem class, and Hadoop applications
> should be designed to run on that.  HDFS is just one implementation of a
> valid Hadoop filesystem, and ports to S3 and KFS as well as OS-supported
> LocalFileSystem are provided in Hadoop common.  Use of NFS-mounted storage
> would fall under the LocalFileSystem model.
> 
> However, one of the core values of Hadoop is the model of "bring the
> computation to the data".  This does not seem viable with an NFS-based
> NAS-model storage subsystem.  Thus, while it will "work" for small clusters
> and small jobs, it is unlikely to scale with high performance to thousands
> of nodes and petabytes of data in the way Hadoop can scale with HDFS or S3.
> 
> --Matt
> 
> 
> On Jun 28, 2011, at 10:41 AM, Darren Govoni wrote:
> 
> I see. However, Hadoop is designed to operate best with HDFS because of its
> inherent striping and blocking strategy - which is tracked by Hadoop.
> Going outside of that mechanism will probably yield poor results and/or
> confuse Hadoop.
> 
> Just my thoughts.
> 
> On 06/28/2011 01:27 PM, Saqib Jang -- Margalla Communications wrote:
>> Darren,
>> Thanks. The last point was basically about 10GbE potentially allowing the 
>> use of a network file system, e.g. via NFS, as an alternative to HDFS; 
>> the question is whether there is any merit in this. Basically, I was 
>> exploring whether the commercial clustered NAS products offer any 
>> high-availability or data management benefits for use with Hadoop.
>> 
>> Saqib
>> 
>> -----Original Message-----
>> From: Darren Govoni [mailto:dar...@ontrenet.com]
>> Sent: Tuesday, June 28, 2011 10:21 AM
>> To: common-user@hadoop.apache.org
>> Subject: Re: Sanity check re: value of 10GbE NICs for Hadoop?
>> 
>> Hadoop, like other parallel networked computation architectures, is 
>> predominantly I/O bound.
>> This means any increase in network bandwidth is "A Good Thing" and can 
>> have drastic positive effects on performance. All your points stem 
>> from this simple realization.
>> 
>> I'm confused by your #6, though. Hadoop already uses a distributed 
>> file system: HDFS.
>> 
>> On 06/28/2011 01:16 PM, Saqib Jang -- Margalla Communications wrote:
>>> Folks,
>>> 
>>> I've been digging into the potential benefits of using 10 Gigabit
>>> Ethernet (10GbE) NIC server connections for Hadoop and wanted to run
>>> what I've come up with through initial research by the list for
>>> 'sanity check' feedback. I'd very much appreciate your input on the
>>> importance (or lack of it) of the following potential benefits of
>>> 10GbE server connectivity, as well as other thoughts regarding 10GbE
>>> and Hadoop. (My interest is specifically in the value of 10GbE server
>>> connections and 10GbE switching infrastructure over scenarios such as
>>> bonded 1GbE server connections with 10GbE switching.)
>>> 
>>> 1. HDFS Data Loading. The higher throughput enabled by 10GbE server
>>>    and switching infrastructure allows faster processing and
>>>    distribution of data.
>>> 
>>> 2. Hadoop Cluster Scalability. High performance for initial data
>>>    processing and distribution directly impacts the degree of
>>>    parallelism or scalability supported by the cluster.
>>> 
>>> 3. HDFS Replication. Higher-speed server connections allow faster
>>>    file replication.
>>> 
>>> 4. Map/Reduce Shuffle Phase. Improved end-to-end throughput and
>>>    latency directly impact the shuffle phase of a data set reduction,
>>>    especially for document-level tasks (including large documents)
>>>    that generate a lot of metadata, as well as for video analytics
>>>    and images.
>>> 
>>> 5. Data Reporting. 10GbE server network performance can improve data
>>>    reporting performance, especially if the Hadoop cluster is running
>>>    multiple data reductions.
>>> 
>>> 6. Support of Cluster File Systems. With 10GbE NICs, Hadoop could be
>>>    reorganized to use a cluster or network file system. This would
>>>    allow Hadoop, even with its Java implementation, to have
>>>    higher-performance I/O and not have to be so concerned with disk
>>>    drive density in the same server.
>>> 
>>> 7. Others?
>>> 
>>> thanks,
>>> Saqib
>>> 
>>> Saqib Jang
>>> Principal/Founder
>>> Margalla Communications, Inc.
>>> 1339 Portola Road, Woodside, CA 94062
>>> (650) 274 8745
>>> www.margallacomm.com
>>> 
>> 
> 
> 
> 