RE: Most Practical Bandwidth for a Hadoop Cluster?

2012-03-18 Thread Saqib Jang -- Margalla Communications
Jeff,

I've seen tests showing that 10 Gigabit Ethernet networking benefits
Hadoop clusters, and the benefit is especially pronounced if the Hadoop
nodes use SSDs on the back end. Also, as each node does both storage
I/O and processing, 10GbE NICs that offload protocol processing
are especially beneficial, e.g.:

High-Performance Networking for Optimized Hadoop Deployments
http://www.chelsio.com/wp-content/uploads/2011/08/Hadoop-White-Paper-w-tutorial-8.11.pdf

Saqib

-Original Message-
From: Jeff Kubina [mailto:jeff.kub...@gmail.com] 
Sent: Thursday, March 15, 2012 11:30 AM
To: common-user@hadoop.apache.org
Subject: Most Practical Bandwidth for a Hadoop Cluster?

 

Suppose you have Hadoop jobs that are communication-bound (due to lots of
data shuffling between maps and reduces), what is the most practical network
bandwidth to strive for in such a cluster? I think it should be the
sustained read bandwidth of the disks on the nodes times the number of
nodes, since any more bandwidth than this could not be utilized. Agree or
disagree? If you disagree, could you explain what you think it should be.
Thanks.
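
Jeff's proposed upper bound can be worked through as a quick back-of-the-envelope calculation. The following is a minimal sketch only: the cluster size, disk count, and per-disk read rate are hypothetical figures chosen for illustration and do not come from the thread, and the function names are made up for this example.

```python
def aggregate_disk_read_mb_s(nodes, disks_per_node, disk_read_mb_s):
    """Cluster-wide sustained disk read bandwidth in MB/s."""
    return nodes * disks_per_node * disk_read_mb_s


def per_node_link_gbit_s(disks_per_node, disk_read_mb_s):
    """Link speed (Gbit/s) one node needs to ship everything its disks can read."""
    # 1 MB/s = 8 Mbit/s; 1000 Mbit/s = 1 Gbit/s (decimal units assumed).
    return disks_per_node * disk_read_mb_s * 8 / 1000


# Hypothetical cluster, for illustration only.
NODES, DISKS_PER_NODE, DISK_READ_MB_S = 20, 4, 100

print(aggregate_disk_read_mb_s(NODES, DISKS_PER_NODE, DISK_READ_MB_S), "MB/s cluster-wide")
print(per_node_link_gbit_s(DISKS_PER_NODE, DISK_READ_MB_S), "Gbit/s per node")
```

Under these assumed numbers, each node's disks could feed more than a single 1GbE link can carry, which is the situation in which the rule above would favor faster server connections.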



WAN-based Hadoop high availability (HA)?

2012-02-21 Thread Saqib Jang -- Margalla Communications
Hello,

I'm a market analyst researching the Hadoop space and had
a quick question. What requirements, if any, are there for
WAN-based high availability in Hadoop configurations,
e.g. for disaster recovery, and what types of solutions are available
for such applications?

thanks,
Saqib

Saqib Jang
Principal/Founder
Margalla Communications, Inc.
1339 Portola Road, Woodside, CA 94062
(650) 274 8745
www.margallacomm.com



Hadoop BC/DR options?

2011-10-24 Thread Saqib Jang -- Margalla Communications
Hello,
I was researching high availability options for Hadoop
and ran into a quick question: what types of solutions are available
for business continuity and disaster recovery for Hadoop, and what
recovery time objectives (RTO) do such solutions support?

thanks,
Saqib





RE: Hadoop cluster network requirement

2011-07-31 Thread Saqib Jang -- Margalla Communications
Thanks. I'm independently doing some digging into Hadoop networking
requirements and had a couple of quick follow-ups. Could I have some
specific info on why different data centers cannot be supported for
master node and data node comms? Also, what might be the benefits/use
cases for such a scenario?

Saqib

-Original Message-
From: jonathan.hw...@accenture.com [mailto:jonathan.hw...@accenture.com] 
Sent: Sunday, July 31, 2011 12:09 PM
To: common-user@hadoop.apache.org
Subject: Hadoop cluster network requirement

I was asked by our IT folks if we can put Hadoop name node storage on a
shared disk storage unit.  Does anyone have experience of how much I/O
throughput is required on the name nodes?  What are the latency/data
throughput requirements between the master and data nodes - can this
tolerate network routing?

Has anyone published throughput requirements for a recommended network
setup?

Thanks!
Jonathan



This message is for the designated recipient only and may contain
privileged, proprietary, or otherwise private information. If you have
received it in error, please notify the sender immediately and delete the
original. Any other use of the email by you is prohibited.



Sanity check re: value of 10GbE NICs for Hadoop?

2011-06-28 Thread Saqib Jang -- Margalla Communications
Folks,

I've been digging into the potential benefits of using
10 Gigabit Ethernet (10GbE) NIC server connections for
Hadoop and wanted to run what I've come up with
through initial research by the list for 'sanity check'
feedback. I'd very much appreciate your input on
the importance (or lack of it) of the following potential benefits of
10GbE server connectivity, as well as other thoughts regarding
10GbE and Hadoop. (My interest is specifically in the value
of 10GbE server connections and 10GbE switching infrastructure
over scenarios such as bonded 1GbE server connections with
10GbE switching.)

1.   HDFS Data Loading. The higher throughput enabled by 10GbE
server and switching infrastructure allows faster processing and
distribution of data.

2.   Hadoop Cluster Scalability. High performance for initial data
processing and distribution directly impacts the degree of parallelism
or scalability supported by the cluster.

3.   HDFS Replication. Higher-speed server connections allow faster
file replication.

4.   Map/Reduce Shuffle Phase. Improved end-to-end throughput and
latency directly impact the shuffle phase of a data set reduction,
especially for tasks that operate at the document level (including
large documents) with lots of metadata generated by those documents,
as well as for video analytics and images.

5.   Data Reporting. 10GbE server networking performance can
improve data reporting performance, especially if the Hadoop cluster is
running multiple data reductions.

6.   Support of Cluster File Systems.  With 10GbE NICs, Hadoop could be
reorganized to use a cluster or network file system. This would allow
Hadoop, even with its Java implementation, to have higher-performance
I/O and not have to be so concerned with disk drive density in the
same server.

7.   Others?

thanks,
Saqib

 

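
The data loading and replication points in the list above come down to how long a given volume of data takes to cross a server's link. A rough sketch follows, assuming line-rate throughput with no protocol overhead, an ideally aggregated pair of bonded 1GbE links, and a hypothetical 1 TB of data; none of these figures come from the thread, and `transfer_seconds` is a made-up helper name.

```python
def transfer_seconds(data_gb, link_gbit_s, links=1):
    """Seconds to move data_gb gigabytes over `links` links of link_gbit_s each.

    Uses decimal units and assumes perfect aggregation across links,
    which real bonded connections will not fully achieve.
    """
    return data_gb * 8 / (link_gbit_s * links)


DATA_GB = 1000  # hypothetical 1 TB load or re-replication volume

# (link speed in Gbit/s, number of links) per scenario
scenarios = {
    "1GbE": (1, 1),
    "2 x 1GbE bonded": (1, 2),
    "10GbE": (10, 1),
}

for name, (speed, links) in scenarios.items():
    minutes = transfer_seconds(DATA_GB, speed, links) / 60
    print(f"{name}: {minutes:.1f} min")
```

These are best-case numbers. Real throughput will be lower, and bonded links often do not deliver their full aggregate rate to a single flow, so the gap between bonded 1GbE and 10GbE may be wider in practice than this idealized comparison suggests.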

 

 



RE: Sanity check re: value of 10GbE NICs for Hadoop?

2011-06-28 Thread Saqib Jang -- Margalla Communications
Darren,
Thanks. The last point was basically about 10GbE potentially allowing the
use of a network file system, e.g. via NFS, as an alternative to HDFS; the
question is whether there is any merit in this. Basically, I was exploring
whether the commercial clustered NAS products offer any high-availability
or data management benefits for use with Hadoop.

Saqib

-Original Message-
From: Darren Govoni [mailto:dar...@ontrenet.com] 
Sent: Tuesday, June 28, 2011 10:21 AM
To: common-user@hadoop.apache.org
Subject: Re: Sanity check re: value of 10GbE NICs for Hadoop?

Hadoop, like other parallel networked computation architectures, is
predominantly I/O bound.
This means any increase in network bandwidth is A Good Thing and can have
drastic positive effects on performance. All your points stem from this
simple realization.

I'm confused by your #6, though. Hadoop already uses a distributed file
system: HDFS.





RE: Sanity check re: value of 10GbE NICs for Hadoop?

2011-06-28 Thread Saqib Jang -- Margalla Communications
Matt,
Thanks, this is helpful. I was wondering if you might have some thoughts
on the other potential benefits of 10GbE NICs for Hadoop
(listed in my original e-mail to the list)?

regards,
Saqib

-Original Message-
From: Matthew Foley [mailto:ma...@yahoo-inc.com] 
Sent: Tuesday, June 28, 2011 12:04 PM
To: common-user@hadoop.apache.org
Cc: Matthew Foley
Subject: Re: Sanity check re: value of 10GbE NICs for Hadoop?

Hadoop common provides an abstract FileSystem class, and Hadoop applications
should be designed to run on that.  HDFS is just one implementation of a
valid Hadoop filesystem, and ports to S3 and KFS as well as OS-supported
LocalFileSystem are provided in Hadoop common.  Use of NFS-mounted storage
would fall under the LocalFileSystem model.

However, one of the core values of Hadoop is the model of bringing the
computation to the data.  This does not seem viable with an NFS-based
NAS-model storage subsystem.  Thus, while it will work for small clusters
and small jobs, it is unlikely to scale with high performance to thousands
of nodes and petabytes of data in the way Hadoop can scale with HDFS or S3.

--Matt
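
Matt's point about bringing the computation to the data can be illustrated with a toy model. The sketch below is not Hadoop's scheduler: the host names, block placement, and the `remote_read_fraction` helper are all hypothetical. It counts how many block reads must cross the network when tasks can run on a node holding a replica, versus when every block sits on a separate NAS host that runs no tasks.

```python
def remote_read_fraction(block_replicas, compute_nodes):
    """Fraction of blocks that must be read over the network.

    block_replicas: dict mapping block id -> set of hosts storing a copy.
    compute_nodes: set of hosts that can run tasks.
    A block counts as a local read if any replica lives on a compute node
    (toy assumption: that node always has a free task slot).
    """
    if not block_replicas:
        return 0.0
    remote = sum(
        1 for hosts in block_replicas.values() if not (hosts & compute_nodes)
    )
    return remote / len(block_replicas)


compute = {"node1", "node2", "node3", "node4"}

# HDFS-style layout: each block has three replicas (HDFS's default
# replication factor) placed on the compute nodes themselves.
hdfs_layout = {
    "blk_0": {"node1", "node2", "node3"},
    "blk_1": {"node2", "node3", "node4"},
    "blk_2": {"node1", "node3", "node4"},
}

# NAS-style layout: every block lives on a storage host that runs no tasks.
nas_layout = {blk: {"nas1"} for blk in hdfs_layout}

print("HDFS-style remote fraction:", remote_read_fraction(hdfs_layout, compute))
print("NAS-style remote fraction:", remote_read_fraction(nas_layout, compute))
```

In this simplified model every read in the NAS layout is a network read, so the storage host's links carry all input data regardless of how many compute nodes are added. That is one way to picture the scaling concern raised above.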


On Jun 28, 2011, at 10:41 AM, Darren Govoni wrote:

I see. However, Hadoop is designed to operate best with HDFS because of its
inherent striping and blocking strategy - which is tracked by Hadoop.
Going outside of that mechanism will probably yield poor results and/or
confuse Hadoop.

Just my thoughts.
