Re: recommendation on HDDs

2011-02-12 Thread Edward Capriolo
On Fri, Feb 11, 2011 at 7:14 PM, Ted Dunning tdunn...@maprtech.com wrote:
 Bandwidth is definitely better with more active spindles.  I would recommend
 several larger disks.  The cost is very nearly the same.

 On Fri, Feb 11, 2011 at 3:52 PM, Shrinivas Joshi jshrini...@gmail.com wrote:

 Thanks for your inputs, Michael.  We have 6 open SATA ports on the
 motherboards. That is the reason why we are thinking of 4 to 5 data disks
 and 1 OS disk.
 Are you suggesting using one 2TB disk instead of, let's say, four 500GB disks?
 I thought that HDFS utilization/throughput increases with the number of disks
 per node (assuming that the total usable IO bandwidth increases
 proportionally).

 -Shrinivas

 On Thu, Feb 10, 2011 at 4:25 PM, Michael Segel michael_se...@hotmail.com
 wrote:

 
  Shrinivas,
 
  Assuming you're in the US, I'd recommend the following:
 
  Go with 2TB 7200 SATA hard drives.
  (Not sure what type of hardware you have)
 
  What  we've found is that in the data nodes, there's an optimal
  configuration that balances price versus performance.
 
  While your chassis may hold 8 drives, how many open SATA ports are on the
  motherboard? Since you're using JBOD, you don't want the additional
 expense
  of having to purchase a separate controller card for the additional
 drives.
 
  I'm running Seagate drives at home and I haven't had any problems for
  years.
   When you look at a drive, you need to know total storage, speed (RPM),
   and cache size.
   Looking at Microcenter's pricing... a 2TB 3.0Gb/s SATA Hitachi was $110.00,
   a 1TB Seagate was $70.00,
   and a 250GB SATA drive was $45.00.
 
   So 2TB = $110, $140, $180 (respectively)
 
  So you get a better deal on 2TB.
 
   So if you go out and get more drives of lower density, you'll end up
   spending more money and using more energy, but I doubt you'll see a real
   performance difference.
 
  The other thing is that if you want to add more disk, you have room to
  grow. (Just add more disk and restart the node, right?)
   If all of your disk slots are filled, you're SOL. You have to take out the
   box, replace all of the drives, then add it back to the cluster as a 'new'
   node.
 
   Just my $0.02.
 
  HTH
 
  -Mike
 
   Date: Thu, 10 Feb 2011 15:47:16 -0600
   Subject: Re: recommendation on HDDs
   From: jshrini...@gmail.com
   To: common-user@hadoop.apache.org
  
   Hi Ted, Chris,
  
    Much appreciate your quick reply. The reason we are looking for smaller
    capacity drives is that we are not anticipating huge growth in our data
    footprint, and we also read somewhere that the larger a drive's capacity,
    the more platters it has, which could affect drive performance. But it
    looks like you can get 1TB drives with only 2 platters. Large-capacity
    drives should be OK for us as long as they perform equally well.
  
    Also, the systems that we have can host up to 8 SATA drives. In that
    case, would backplanes offer additional advantages?
  
    Any suggestions on 5400 vs. 7200 vs. 10K RPM disks? I guess 10K RPM disks
    would be overkill considering their perf/cost advantage?
  
   Thanks for your inputs.
  
   -Shrinivas
  
   On Thu, Feb 10, 2011 at 2:48 PM, Chris Collins 
   chris_j_coll...@yahoo.com wrote:
  
 Of late we have had serious issues with Seagate drives in our Hadoop
 cluster. These were purchased over several purchasing cycles, and we are
 pretty sure it wasn't just a single bad batch. Because of this we switched
 to buying 2TB Hitachi drives, which seem to have been considerably more
 reliable.
   
Best
   
C
On Feb 10, 2011, at 12:43 PM, Ted Dunning wrote:
   
 Get bigger disks.  Data only grows and having extra is always good.

 You can get 2TB drives for $100 and 1TB for  $75.

  As far as transfer rates are concerned, any 3Gb/s SATA drive is going to
  be about the same (ish). Seek times will vary a bit with rotation speed,
  but with Hadoop, you will be doing long reads and writes.

  Your controller and backplane will have a MUCH bigger vote in getting
  acceptable performance. With only 4 or 5 drives, you don't have to worry
  about a super-duper backplane, but you can still kill performance with a
  lousy controller.

 On Thu, Feb 10, 2011 at 12:26 PM, Shrinivas Joshi 
  jshrini...@gmail.com
wrote:

  What would be a good hard drive for a 7-node cluster which is targeted to
  run a mix of IO- and CPU-intensive Hadoop workloads? We are looking for
  around 1 TB of storage on each node, distributed amongst 4 or 5 disks. So
  either 250GB * 4 disks or 160GB * 5 disks. Also, it should be less than
  $100 each ;)

  I looked at HDD benchmark comparisons on tomshardware, storagereview, etc.
  Got overwhelmed with the number of benchmarks and different aspects of HDD
  performance.

 Appreciate your help on this.

 -Shrinivas


RE: recommendation on HDDs

2011-02-12 Thread Michael Segel

All, 

I'd like to clarify some things...

First, the concept is to build out a cluster of commodity hardware.
So when you do your shopping, you want to get the most bang for your buck. That
is the 'sweet spot' I'm talking about.
When you look at the E5500 or E5600 chip sets, you will want to go with 4
cores per CPU, dual CPUs, and a clock speed around 2.53GHz or so.
(Faster chips are more expensive and the performance edge falls off, so you end
up paying a premium.)

Looking at your disks, you start by using the onboard SATA controller. Why?
Because it means you don't have to pay for a controller card.
If you are building a cluster for general-purpose computing... assuming 1U
boxes, you have room for four 3.5" SATA drives, which still give you the best
performance for your buck.
Can you go with 2.5" drives? Yes, but you are going to be paying a premium.

Price-wise, a 2TB SATA II 7200 RPM drive is going to be your best deal. You
could go with SATA III drives if your motherboard supports SATA III ports, but
you're still paying a slight premium.

The OP felt that all he would need was 1TB of disk and was considering four
250GB drives. (More spindles... yada yada yada...)

My suggestion is to forget that nonsense and go with one 2TB drive because it's
a better deal, and if you want to add more disk to the node, you can. (It's
easier to add disk than it is to replace it.)
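
To make the 'just add a disk' part concrete, here's a minimal hdfs-site.xml
sketch using the 0.20-era dfs.data.dir property; the mount points below are
made up. Adding a disk later is just mounting it, appending its directory to
the list, and restarting the DataNode:

  <property>
    <name>dfs.data.dir</name>
    <!-- one directory per physical disk; the DataNode spreads blocks across them -->
    <value>/data/1/dfs/data,/data/2/dfs/data</value>
    <!-- new disk? mount it, append /data/3/dfs/data, restart the DataNode -->
  </property>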

Now do you need to create a separate OS drive? No. Some people who have an
internal 3.5" bay sometimes do. That's OK, and you can put your Hadoop logging
there. (Just make sure you have a lot of disk space...)
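
If you do that, pointing the logs at the OS disk is one line in hadoop-env.sh
(a sketch; the path is made up):

  # hadoop-env.sh: keep daemon logs off the data disks
  export HADOOP_LOG_DIR=/os-disk/hadoop/logs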

The truth is that there really isn't any single *right* answer. There are a lot 
of options and budget constraints as well as physical constraints like power, 
space, and location of the hardware.

Also, you may be building out a cluster whose main purpose is to be a backup
location for your production cluster. So your production cluster has lots of
nodes, while your backup cluster has lots of disks per node, because your main
focus there is as much storage per node as possible.

So here you may end up buying a 4U rack box and loading it up with 3.5" drives
and a couple of SATA controller cards. You care less about performance and more
about storage space. Here you may go with 3TB SATA drives, 12 or more per box.
(I don't know how many you can fit into a 4U chassis these days.) So you have
10 DNs backing up a 100+ DN cluster in your main data center. But that's
another story.

I think the main takeaway is that if you look at the price point... your best
price per GB is on a 2TB drive, until the prices drop on 3TB drives.
Since the OP believes that their requirement is 1TB per node... a single 2TB
drive would be the best choice. It allows for additional space, and you really
shouldn't be too worried about disk I/O being your bottleneck.

HTH

-Mike



Re: recommendation on HDDs

2011-02-12 Thread James Seigel
The only concern is that HDFS doesn't seem to do exceptionally well with
different-sized disks in practice.

James

Sent from my mobile. Please excuse the typos.

On 2011-02-12, at 8:43 AM, Edward Capriolo edlinuxg...@gmail.com wrote:

 You also do not need a dedicated OS 

Re: recommendation on HDDs

2011-02-12 Thread Ted Dunning
The original poster also seemed somewhat interested in disk bandwidth.

That is facilitated by having more than one disk in the box.

On Sat, Feb 12, 2011 at 8:26 AM, Michael Segel michael_se...@hotmail.com wrote:

 Since the OP believes that their requirement is 1TB per node... a single
 2TB would be the best choice. It allows for additional space and you really
 shouldn't be too worried about disk i/o being your bottleneck.


Re: Which strategy is proper to run in this environment?

2011-02-12 Thread Ted Dunning
This sounds like it will be very inefficient.  There is considerable
overhead in starting Hadoop jobs.  As you describe it, you will be starting
thousands of jobs and paying this penalty many times.

Is there a way that you could process all of the directories in one
map-reduce job?  Can you combine these directories into a single directory
with a few large files?
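
As a rough illustration, here is a minimal driver sketch (Hadoop 0.20-era
mapreduce API) that feeds every per-batch directory into one job instead of
submitting one job per directory; the paths, job name, and class name are made
up, and you would still set your own Mapper/Reducer classes:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileStatus;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
  import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

  public class AllDirsDriver {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      Job job = new Job(conf, "all-dirs-in-one-pass");
      job.setJarByClass(AllDirsDriver.class);
      // Add every per-batch directory as an input path of the same job.
      FileSystem fs = FileSystem.get(conf);
      for (FileStatus stat : fs.listStatus(new Path("/data/batches"))) {
        if (stat.isDir()) {
          FileInputFormat.addInputPath(job, stat.getPath());
        }
      }
      // Or, equivalently, a single glob:
      // FileInputFormat.addInputPath(job, new Path("/data/batches/*"));
      FileOutputFormat.setOutputPath(job, new Path("/data/batches-out"));
      System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
  }

Whether one big job beats concatenating the small inputs into larger files
depends on how many of them are near the 1 MB end of the range; either way,
the goal is to pay the job-startup cost once rather than thousands of times.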

On Fri, Feb 11, 2011 at 8:07 PM, Jun Young Kim juneng...@gmail.com wrote:

 Hi.

  I have a small cluster (9 nodes) to run Hadoop on here.

  On this cluster, Hadoop will process thousands of directories sequentially.

  In each dir there are two input files for m/r. Input file sizes range from
  1 MB to 5 GB.
  In summary, each Hadoop job will take one of these dirs.

  To get the best performance, which strategy is proper for us?

  Could you suggest something?
  Which configuration is best?

  PS) Physical memory is 12 GB on each node.