RE: Most Practical Bandwidth for a Hadoop Cluster?
Jeff,

I've seen tests showing that 10 Gigabit Ethernet networking benefits Hadoop clusters, and the benefit is especially pronounced if the Hadoop nodes use SSDs on the back end. Also, as each node does both storage I/O and processing, 10GbE NICs that offload protocol processing are especially beneficial, e.g. "High-Performance Networking for Optimized Hadoop Deployments":
http://www.chelsio.com/wp-content/uploads/2011/08/Hadoop-White-Paper-w-tutorial-8.11.pdf

Saqib

-----Original Message-----
From: Jeff Kubina [mailto:jeff.kub...@gmail.com]
Sent: Thursday, March 15, 2012 11:30 AM
To: common-user@hadoop.apache.org
Subject: Most Practical Bandwidth for a Hadoop Cluster?

Suppose you have Hadoop jobs that are communication-bound (due to lots of data shuffling between maps and reduces). What is the most practical network bandwidth to strive for in such a cluster? I think it should be the sustained read bandwidth of the disks on the nodes times the number of nodes, since any more bandwidth than this could not be utilized. Agree or disagree? If you disagree, could you explain what you think it should be?

Thanks.
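Jeff's rule of thumb can be put into numbers. The sketch below is illustrative only: the node count, disks per node, and per-disk sustained read rate are made-up inputs, not figures from this thread, and the function names are hypothetical. It converts disk read rates (megabytes per second) into megabits per second so they can be compared with NIC line rates.

```python
def per_node_disk_read_mbps(disks_per_node, disk_read_mb_per_s):
    """Sustained disk read rate of one node, in megabits per second."""
    return disks_per_node * disk_read_mb_per_s * 8


def aggregate_disk_read_mbps(nodes, disks_per_node, disk_read_mb_per_s):
    """Jeff's proposed ceiling: per-node disk read rate times the node count."""
    return nodes * per_node_disk_read_mbps(disks_per_node, disk_read_mb_per_s)


# Hypothetical cluster, chosen only to make the arithmetic concrete.
NODES = 20
DISKS_PER_NODE = 4
DISK_READ_MB_PER_S = 100  # assumed sustained sequential read per disk

per_node = per_node_disk_read_mbps(DISKS_PER_NODE, DISK_READ_MB_PER_S)
cluster = aggregate_disk_read_mbps(NODES, DISKS_PER_NODE, DISK_READ_MB_PER_S)

print(f"per-node disk read rate: {per_node} Mbit/s")
print(f"cluster-wide ceiling:    {cluster} Mbit/s")

# Compare the per-node disk rate with common NIC line rates (in Mbit/s).
for name, link_mbps in (("1GbE", 1000), ("10GbE", 10000)):
    if per_node > link_mbps:
        print(f"{name}: the disks can outrun the link")
    else:
        print(f"{name}: the link can carry everything the disks deliver")
```

Under these assumed inputs the disks on one node deliver more than a single 1GbE link can carry but less than a 10GbE link, which is the kind of comparison the rule is meant to support. Real sustained rates depend on the drives and the access pattern.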
WAN-based Hadoop high availability (HA)?
Hello,

I'm a market analyst researching the Hadoop space and had a quick question. I was wondering what requirements, if any, there may be for WAN-based high availability in Hadoop configurations, e.g. for disaster recovery, and what types of solutions may be available for such applications.

thanks,
Saqib

Saqib Jang
Principal/Founder
Margalla Communications, Inc.
1339 Portola Road, Woodside, CA 94062
(650) 274 8745
www.margallacomm.com
Hadoop BC/DR options?
Hello,

I was researching high availability options for Hadoop and ran into a quick question: what types of solutions are available for business continuity and disaster recovery for Hadoop, and what recovery time objectives (RTO) do such solutions support?

thanks,
Saqib
RE: Hadoop cluster network requirement
Thanks,

I'm independently doing some digging into Hadoop networking requirements and had a couple of quick follow-ups. Could I have some specific info on why different data centers cannot be supported for master node and data node communications? Also, what may be the benefits/use cases for such a scenario?

Saqib

-----Original Message-----
From: jonathan.hw...@accenture.com [mailto:jonathan.hw...@accenture.com]
Sent: Sunday, July 31, 2011 12:09 PM
To: common-user@hadoop.apache.org
Subject: Hadoop cluster network requirement

I was asked by our IT folks if we can put the Hadoop name node's storage on a shared disk storage unit. Does anyone have experience of how much I/O throughput is required on the name nodes? What are the latency/data throughput requirements between the master and data nodes, and can this tolerate network routing? Has anyone published any throughput requirements or recommendations for the best network setup?

Thanks!
Jonathan
Sanity check re: value of 10GbE NICs for Hadoop?
Folks,

I've been digging into the potential benefits of using 10 Gigabit Ethernet (10GbE) NIC server connections for Hadoop and wanted to run what I've come up with through initial research by the list for 'sanity check' feedback. I'd very much appreciate your input on the importance (or lack of it) of the following potential benefits of 10GbE server connectivity, as well as other thoughts regarding 10GbE and Hadoop. (My interest is specifically in the value of 10GbE server connections and 10GbE switching infrastructure over scenarios such as bonded 1GbE server connections with 10GbE switching.)

1. HDFS Data Loading. The higher throughput enabled by 10GbE server and switching infrastructure allows faster processing and distribution of data.
2. Hadoop Cluster Scalability. High performance for initial data processing and distribution directly impacts the degree of parallelism or scalability supported by the cluster.
3. HDFS Replication. Higher-speed server connections allow faster file replication.
4. Map/Reduce Shuffle Phase. Improved end-to-end throughput and latency directly impact the shuffle phase of a data set reduction, especially for tasks that are at the document level (including large documents) with lots of metadata generated by those documents, as well as video analytics and images.
5. Data Reporting. 10GbE server network performance can improve data reporting performance, especially if the Hadoop cluster is running multiple data reductions.
6. Support of Cluster File Systems. With 10GbE NICs, Hadoop could be reorganized to use a cluster or network file system. This would allow Hadoop, even with its Java implementation, to have higher-performance I/O and not have to be so concerned with disk drive density in the same server.
7. Others?

thanks,
Saqib

Saqib Jang
Principal/Founder
Margalla Communications, Inc.
1339 Portola Road, Woodside, CA 94062
(650) 274 8745
www.margallacomm.com
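For the data loading and replication points in the list above, a rough back-of-the-envelope calculation shows what the link speed alone contributes. This is a minimal sketch under stated assumptions: the dataset size and the 0.9 efficiency factor are invented for illustration, the function name is hypothetical, and the result ignores disk speed, switch oversubscription, and protocol behaviour, any of which may be the real bottleneck.

```python
def transfer_seconds(data_gb, link_gbps, efficiency=0.9):
    """Seconds to move data_gb gigabytes over one link of link_gbps gigabits/s.

    efficiency is an assumed fraction of line rate that is usable payload.
    """
    if link_gbps <= 0:
        raise ValueError("link_gbps must be positive")
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    return (data_gb * 8) / (link_gbps * efficiency)


# Hypothetical workload: 1000 GB pushed through a single server connection.
DATASET_GB = 1000
LINKS_GBPS = {"1GbE": 1, "10GbE": 10}

for name, gbps in LINKS_GBPS.items():
    minutes = transfer_seconds(DATASET_GB, gbps) / 60
    print(f"{name}: about {minutes:.0f} minutes to move {DATASET_GB} GB")
```

The tenfold difference only materializes if the disks and the switching fabric can keep up, so this is best read as an upper bound on what the NIC upgrade can deliver.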
RE: Sanity check re: value of 10GbE NICs for Hadoop?
Darren,

Thanks. The last point was basically about 10GbE potentially allowing the use of a network file system, e.g. via NFS, as an alternative to HDFS; the question is whether there is any merit in this. Basically, I was exploring whether the commercial clustered NAS products offer any high-availability or data management benefits for use with Hadoop.

Saqib

-----Original Message-----
From: Darren Govoni [mailto:dar...@ontrenet.com]
Sent: Tuesday, June 28, 2011 10:21 AM
To: common-user@hadoop.apache.org
Subject: Re: Sanity check re: value of 10GbE NICs for Hadoop?

Hadoop, like other parallel networked computation architectures, is predominantly I/O bound. This means any increase in network bandwidth is A Good Thing and can have drastic positive effects on performance. All your points stem from this simple realization.

Although I'm confused by your #6. Hadoop already uses a distributed file system: HDFS.

On 06/28/2011 01:16 PM, Saqib Jang -- Margalla Communications wrote:
[...]
RE: Sanity check re: value of 10GbE NICs for Hadoop?
Matt,

Thanks, this is helpful. I was wondering if you may have some thoughts on the list of other potential benefits of 10GbE NICs for Hadoop (listed in my original e-mail to the list)?

regards,
Saqib

-----Original Message-----
From: Matthew Foley [mailto:ma...@yahoo-inc.com]
Sent: Tuesday, June 28, 2011 12:04 PM
To: common-user@hadoop.apache.org
Cc: Matthew Foley
Subject: Re: Sanity check re: value of 10GbE NICs for Hadoop?

Hadoop common provides an abstract FileSystem class, and Hadoop applications should be designed to run on that. HDFS is just one implementation of a valid Hadoop filesystem, and ports to S3 and KFS as well as the OS-supported LocalFileSystem are provided in Hadoop common. Use of NFS-mounted storage would fall under the LocalFileSystem model.

However, one of the core values of Hadoop is the model of bringing the computation to the data. This does not seem viable with an NFS-based NAS-model storage subsystem. Thus, while it will work for small clusters and small jobs, it is unlikely to scale with high performance to thousands of nodes and petabytes of data in the way Hadoop can scale with HDFS or S3.

--Matt

On Jun 28, 2011, at 10:41 AM, Darren Govoni wrote:

I see. However, Hadoop is designed to operate best with HDFS because of its inherent striping and blocking strategy, which is tracked by Hadoop. Going outside of that mechanism will probably yield poor results and/or confuse Hadoop.

Just my thoughts.

On 06/28/2011 01:27 PM, Saqib Jang -- Margalla Communications wrote:
[...]
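Matt's point about applications being written against an abstract filesystem, with the concrete store chosen separately, can be illustrated with a toy model. The sketch below is not Hadoop's Java API: every class and function name in it is hypothetical, and the in-memory store merely stands in for some other storage backend. It only shows the pattern of application code that stays unchanged while the backend is selected by URI scheme.

```python
from abc import ABC, abstractmethod
from urllib.parse import urlparse


class ToyFileSystem(ABC):
    """Hypothetical stand-in for an abstract filesystem interface."""

    @abstractmethod
    def read(self, path: str) -> bytes:
        """Return the full contents stored at path."""


class ToyLocalFileSystem(ToyFileSystem):
    """Reads through the operating system.

    An NFS mount would look the same as a local disk from here, which is
    why it would be used through the local-filesystem model.
    """

    def read(self, path: str) -> bytes:
        with open(path, "rb") as handle:
            return handle.read()


class ToyMemoryFileSystem(ToyFileSystem):
    """In-memory store standing in for a different storage backend."""

    def __init__(self):
        self.blobs = {}

    def read(self, path: str) -> bytes:
        return self.blobs[path]


# Backends are looked up by URI scheme; the schemes here are illustrative.
REGISTRY = {"file": ToyLocalFileSystem(), "mem": ToyMemoryFileSystem()}


def get_filesystem(uri: str) -> ToyFileSystem:
    """Pick the backend registered for the scheme of uri."""
    return REGISTRY[urlparse(uri).scheme]


def count_bytes(uri: str) -> int:
    """Application code: works against any registered backend unchanged."""
    return len(get_filesystem(uri).read(urlparse(uri).path))


REGISTRY["mem"].blobs["/logs/part-0"] = b"hello hadoop"
print(count_bytes("mem:///logs/part-0"))
```

What the toy model cannot show is the data-locality concern Matt raises: the interface hides where the bytes live, so an application that runs correctly against a NAS-backed local filesystem may still move every byte across the network.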