I agree with Matei. Whether you will get good ROI on 10GigE depends very much on the types of jobs you run.

--Matt
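To make the "depends on the types of jobs" point concrete, here is a rough back-of-envelope sketch in Python. It is not a profiler and not part of Hadoop: the names (JobProfile, phase_seconds, bottleneck) and every number in it are hypothetical, and it assumes each resource runs at its full rated speed. You would substitute figures measured on your own cluster.

```python
from dataclasses import dataclass


@dataclass
class JobProfile:
    """Per-node figures for one job. All values are hypothetical inputs
    that you would measure on your own cluster."""
    shuffle_gb: float    # data each node moves over the network in the shuffle
    disk_gb: float       # data each node reads and writes on local disks
    cpu_seconds: float   # total CPU time per node (decompression, user code)


def phase_seconds(job, nic_gbit_s, disk_mb_s, cores):
    """Rough time per resource, assuming each one runs at full rated speed."""
    return {
        "network": job.shuffle_gb * 8 / nic_gbit_s,   # GB -> Gbit
        "disk": job.disk_gb * 1000 / disk_mb_s,       # GB -> MB
        "cpu": job.cpu_seconds / cores,
    }


def bottleneck(job, nic_gbit_s, disk_mb_s, cores):
    """Name of the resource with the largest estimated time."""
    times = phase_seconds(job, nic_gbit_s, disk_mb_s, cores)
    return max(times, key=times.get)


# Made-up example node: 50 GB shuffled, 100 GB of disk I/O, 2400 CPU-seconds.
job = JobProfile(shuffle_gb=50, disk_gb=100, cpu_seconds=2400)
for nic in (1, 10):
    times = phase_seconds(job, nic_gbit_s=nic, disk_mb_s=300, cores=8)
    print(nic, "GbE:", times, "->", bottleneck(job, nic, 300, 8))
```

With these made-up inputs the network estimate falls from 400 s to 40 s when moving from 1GbE to 10GbE, but the disk estimate stays at roughly 333 s, so a job whose phases overlap perfectly would only improve from about 400 s to about 333 s. A shuffle-heavy job with little disk I/O would look very different, which is why measuring the real workload comes first.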
On Jun 28, 2011, at 4:02 PM, Matei Zaharia wrote:

Ideally, to evaluate whether you want to go for 10GbE NICs, you would profile your target Hadoop workload and see whether it's communication-bound. Hadoop jobs can definitely be communication-bound if you shuffle a lot of data between map and reduce, but I've also seen a lot of clusters that are CPU-bound (due to decompression, running python, or just running expensive user code) or disk-IO-bound. You might be surprised at what your bottleneck is.

Matei

On Jun 28, 2011, at 3:06 PM, Saqib Jang -- Margalla Communications wrote:

> Matt,
> Thanks, this is helpful, I was wondering if you may have some thoughts
> on the list of other potential benefits of 10GbE NICs for Hadoop
> (listed in my original e-mail to the list)?
>
> regards,
> Saqib
>
> -----Original Message-----
> From: Matthew Foley [mailto:ma...@yahoo-inc.com]
> Sent: Tuesday, June 28, 2011 12:04 PM
> To: common-user@hadoop.apache.org
> Cc: Matthew Foley
> Subject: Re: Sanity check re: value of 10GbE NICs for Hadoop?
>
> Hadoop common provides an abstract FileSystem class, and Hadoop applications
> should be designed to run on that. HDFS is just one implementation of a
> valid Hadoop filesystem, and ports to S3 and KFS as well as OS-supported
> LocalFileSystem are provided in Hadoop common. Use of NFS-mounted storage
> would fall under the LocalFileSystem model.
>
> However, one of the core values of Hadoop is the model of "bring the
> computation to the data". This does not seem viable with an NFS-based
> NAS-model storage subsystem. Thus, while it will "work" for small clusters
> and small jobs, it is unlikely to scale with high performance to thousands
> of nodes and petabytes of data in the way Hadoop can scale with HDFS or S3.
>
> --Matt
>
>
> On Jun 28, 2011, at 10:41 AM, Darren Govoni wrote:
>
> I see. However, Hadoop is designed to operate best with HDFS because of its
> inherent striping and blocking strategy - which is tracked by Hadoop.
> Going outside of that mechanism will probably yield poor results and/or
> confuse Hadoop.
>
> Just my thoughts.
>
> On 06/28/2011 01:27 PM, Saqib Jang -- Margalla Communications wrote:
>> Darren,
>> Thanks, the last pt was basically about 10GbE potentially allowing the
>> use of a network file system e.g. via NFS as an alternative to HDFS,
>> the question is there any merit in this. Basically, I was exploring if
>> the commercial clustered NAS products offer any high-availability or
>> data management benefits for use with Hadoop?
>>
>> Saqib
>>
>> -----Original Message-----
>> From: Darren Govoni [mailto:dar...@ontrenet.com]
>> Sent: Tuesday, June 28, 2011 10:21 AM
>> To: common-user@hadoop.apache.org
>> Subject: Re: Sanity check re: value of 10GbE NICs for Hadoop?
>>
>> Hadoop, like other parallel networked computation architectures is I/O
>> bound, predominantly.
>> This means any increase in network bandwidth is "A Good Thing" and can
>> have drastic positive effects on performance. All your points stem
>> from this simple realization.
>>
>> Although I'm confused by your #6. Hadoop already uses a distributed
>> file system. HDFS.
>>
>> On 06/28/2011 01:16 PM, Saqib Jang -- Margalla Communications wrote:
>>> Folks,
>>>
>>> I've been digging into the potential benefits of using
>>> 10 Gigabit Ethernet (10GbE) NIC server connections for
>>> Hadoop and wanted to run what I've come up with
>>> through initial research by the list for 'sanity check'
>>> feedback. I'd very much appreciate your input on
>>> the importance (or lack of it) of the following potential benefits of
>>> 10GbE server connectivity as well as other thoughts regarding
>>> 10GbE and Hadoop (My interest is specifically in the value
>>> of 10GbE server connections and 10GbE switching infrastructure,
>>> over scenarios such as bonded 1GbE server connections with
>>> 10GbE switching).
>>>
>>> 1. HDFS Data Loading.
>>> The higher throughput enabled by 10GbE
>>> server and switching infrastructure allows faster processing and
>>> distribution of data.
>>>
>>> 2. Hadoop Cluster Scalability. High-performance for initial data
>>> processing and distribution directly impacts the degree of parallelism
>>> or scalability supported by the cluster.
>>>
>>> 3. HDFS Replication. Higher speed server connections allow faster
>>> file replication.
>>>
>>> 4. Map/Reduce Shuffle Phase. Improved end-to-end throughput and
>>> latency directly impact the shuffle phase of a data set reduction
>>> especially for tasks that are at the document level (including large
>>> documents) and lots of metadata generated by those documents as well
>>> as video analytics and images.
>>>
>>> 5. Data Reporting. 10GbE server network performance can
>>> improve data reporting performance, especially if the Hadoop cluster
>>> is running multiple data reductions.
>>>
>>> 6. Support of Cluster File Systems. With 10 GbE NICs, Hadoop could be
>>> reorganized to use a cluster or network file system. This would allow
>>> Hadoop even with its Java implementation to have higher performance I/O
>>> and not have to be so concerned with disk drive density in the same
>>> server.
>>>
>>> 7. Others?
>>>
>>> thanks,
>>> Saqib
>>>
>>> Saqib Jang
>>> Principal/Founder
>>> Margalla Communications, Inc.
>>> 1339 Portola Road, Woodside, CA 94062
>>> (650) 274 8745
>>> www.margallacomm.com