Re: FAQ for New to Hadoop

2010-07-11 Thread Alex Baranau
Ken,

You can also take a look at the FAQ section in the posts we publish
periodically. It started with
http://blog.sematext.com/2010/02/16/hadoop-digest-february-2010/. The
frequently asked questions are mainly retrieved from the project's user
mailing lists.

We also cover HBase (you can find posts on http://blog.sematext.com as well).

Alex Baranau

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch - Hadoop - HBase
Hadoop ecosystem search :: http://search-hadoop.com/


On Fri, Jul 9, 2010 at 1:35 AM, Mark Kerzner markkerz...@gmail.com wrote:

 Cool, Ken, thank you, I think it is very useful.

 Mark

 On Thu, Jul 8, 2010 at 4:35 PM, Ken Krugler kkrugler_li...@transpac.com
 wrote:

  Hi all,
 
   I recently hosted an Intro to Hadoop session at the BigDataCamp
   unconference last week. I later wrote down questions from the audience
   that seemed useful to other Hadoop beginners, and then compared them to
   the Hadoop project FAQ at http://wiki.apache.org/hadoop/FAQ
  
   There was overlap, but not as much as I expected - the Hadoop FAQ has
   more "how do I do X" versus "can I do X" or "why should I do X".
  
   I posted these questions to
   http://www.scaleunlimited.com/blog/intro-to-hadoop-at-bigdatacamp , and
   would appreciate any input - e.g. questions you think should be there,
   and answers you think aren't very clear (though mea culpa in advance, I
   jotted these down quickly, so I realize they're pretty rough).
 
  Thanks,
 
  -- Ken
 
  
  Ken Krugler
  +1 530-210-6378
  http://bixolabs.com
  e l a s t i c   w e b   m i n i n g
 
 



FAQ for New to Hadoop

2010-07-08 Thread Ken Krugler

Hi all,

I recently hosted an Intro to Hadoop session at the BigDataCamp  
unconference last week. I later wrote down questions from the audience  
that seemed useful to other Hadoop beginners, and then compared them to  
the Hadoop project FAQ at http://wiki.apache.org/hadoop/FAQ


There was overlap, but not as much as I expected - the Hadoop FAQ has  
more "how do I do X" versus "can I do X" or "why should I do X".


I posted these questions to
http://www.scaleunlimited.com/blog/intro-to-hadoop-at-bigdatacamp , and  
would appreciate any input - e.g. questions you think should be there,  
and answers you think aren't very clear (though mea culpa in advance,  
I jotted these down quickly, so I realize they're pretty rough).


Thanks,

-- Ken


Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g



Re: FAQ for New to Hadoop

2010-07-08 Thread Mark Kerzner
Cool, Ken, thank you, I think it is very useful.

Mark

On Thu, Jul 8, 2010 at 4:35 PM, Ken Krugler kkrugler_li...@transpac.com wrote:

 Hi all,

 I recently hosted an Intro to Hadoop session at the BigDataCamp
 unconference last week. I later wrote down questions from the audience that
 seemed useful to other Hadoop beginners, and then compared them to the Hadoop
 project FAQ at http://wiki.apache.org/hadoop/FAQ

 There was overlap, but not as much as I expected - the Hadoop FAQ has more
 "how do I do X" versus "can I do X" or "why should I do X".

 I posted these questions to
 http://www.scaleunlimited.com/blog/intro-to-hadoop-at-bigdatacamp , and
 would appreciate any input - e.g. questions you think should be there,
 and answers you think aren't very clear (though mea culpa in advance, I jotted
 these down quickly, so I realize they're pretty rough).

 Thanks,

 -- Ken

 
 Ken Krugler
 +1 530-210-6378
 http://bixolabs.com
 e l a s t i c   w e b   m i n i n g




new to hadoop

2010-05-04 Thread jamborta

Hi,

I am trying to set up a small Hadoop cluster with 6 machines. The problem I
have now is that if I set the memory allocated to a task low (e.g. -Xmx512m),
the application does not run; if I set it higher, some machines in the
cluster have only a small amount of memory (1 or 2GB), and when the
computation gets intensive Hadoop creates so many tasks and sends them to these
weaker machines that it brings the whole cluster down.
My question is whether it is possible to specify -Xmx for each machine in
the cluster and to specify how many tasks can run on a machine, or what the
optimal setting in this situation would be.

thanks for your help

Tom

-- 
View this message in context: 
http://old.nabble.com/new-to-hadoop-tp28454028p28454028.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.



Re: new to hadoop

2010-05-04 Thread Ravi Phulari
How much RAM ?
With 6-8GB RAM you can go for 4 mappers and 2 reducers (this is my personal 
guess).

-
Ravi

On 5/4/10 4:33 PM, Tamas Jambor jambo...@googlemail.com wrote:

thank you. so what would be the optimal setting for mapred.map.tasks and 
mapred.reduce.tasks, say, on a dual-core machine?

Tom

On 05/05/2010 00:12, Ravi Phulari wrote:
You can edit the configuration file conf/hadoop-env.sh on each node to 
specify -Xmx values.
You can use conf/mapred-site.xml to configure the default number of mappers 
and reducers running on a node.

<property>
  <name>mapred.map.tasks</name>
  <value>2</value>
  <description>The default number of map tasks per job.
  Ignored when mapred.job.tracker is local.
  </description>
</property>

<property>
  <name>mapred.reduce.tasks</name>
  <value>1</value>
  <description>The default number of reduce tasks per job. Typically set to 99%
  of the cluster's reduce capacity, so that if a node fails the reduces can
  still be executed in a single wave.
  Ignored when mapred.job.tracker is local.
  </description>
</property>
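
Those two are per-job defaults, though. For per-node limits - which is closer
to what the original question asks - a minimal sketch of conf/mapred-site.xml
on each node would be the following (assuming the 0.20-era property names; the
values are only illustrative and should be sized to each machine's RAM):

<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>2</value>
  <description>Maximum number of map task slots this TaskTracker runs at
  once; use a smaller value on the 1-2GB machines.</description>
</property>

<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>1</value>
  <description>Maximum number of reduce task slots on this node.</description>
</property>

<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx512m</value>
  <final>true</final>
  <description>Heap size for each task JVM launched on this node; marking it
  final should stop jobs from overriding it on the weaker machines.</description>
</property>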


-
Ravi

On 5/4/10 3:54 PM, jamborta jambo...@gmail.com wrote:




Hi,

I am trying to set up a small Hadoop cluster with 6 machines. The problem I
have now is that if I set the memory allocated to a task low (e.g. -Xmx512m),
the application does not run; if I set it higher, some machines in the
cluster have only a small amount of memory (1 or 2GB), and when the
computation gets intensive Hadoop creates so many tasks and sends them to these
weaker machines that it brings the whole cluster down.
My question is whether it is possible to specify -Xmx for each machine in
the cluster and to specify how many tasks can run on a machine, or what the
optimal setting in this situation would be.

thanks for your help

Tom

--
View this message in context: 
http://old.nabble.com/new-to-hadoop-tp28454028p28454028.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.




Ravi
--



Re: new to hadoop

2010-05-04 Thread Tamas Jambor
thank you. so what would be the optimal setting for mapred.map.tasks and 
mapred.reduce.tasks, say, on a dual-core machine?


Tom

On 05/05/2010 00:12, Ravi Phulari wrote:
You can edit the configuration file conf/hadoop-env.sh on each 
node to specify -Xmx values.
You can use conf/mapred-site.xml to configure the default number of 
mappers and reducers running on a node.


<property>
  <name>mapred.map.tasks</name>
  <value>2</value>
  <description>The default number of map tasks per job.
  Ignored when mapred.job.tracker is local.
  </description>
</property>

<property>
  <name>mapred.reduce.tasks</name>
  <value>1</value>
  <description>The default number of reduce tasks per job. Typically set to 99%
  of the cluster's reduce capacity, so that if a node fails the reduces can
  still be executed in a single wave.
  Ignored when mapred.job.tracker is local.
  </description>
</property>


-
Ravi

On 5/4/10 3:54 PM, jamborta jambo...@gmail.com wrote:



Hi,

I am trying to set up a small Hadoop cluster with 6 machines. The problem I
have now is that if I set the memory allocated to a task low (e.g. -Xmx512m),
the application does not run; if I set it higher, some machines in the
cluster have only a small amount of memory (1 or 2GB), and when the
computation gets intensive Hadoop creates so many tasks and sends them to
these weaker machines that it brings the whole cluster down.
My question is whether it is possible to specify -Xmx for each machine in
the cluster and to specify how many tasks can run on a machine, or what the
optimal setting in this situation would be.

thanks for your help

Tom

--
View this message in context:
http://old.nabble.com/new-to-hadoop-tp28454028p28454028.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.



Ravi
--



Re: new to hadoop

2010-05-04 Thread Tamas Jambor

great. thank you. I'll set it up that way.

Tom

On 05/05/2010 00:37, Ravi Phulari wrote:

How much RAM ?
With 6-8GB RAM you can go for 4 mappers and 2 reducers (this is my 
personal guess).


-
Ravi

On 5/4/10 4:33 PM, Tamas Jambor jambo...@googlemail.com wrote:

thank you. so what would be the optimal setting for
mapred.map.tasks and mapred.reduce.tasks, say, on a dual-core machine?

Tom

On 05/05/2010 00:12, Ravi Phulari wrote:

You can edit the configuration file conf/hadoop-env.sh on each node
to specify -Xmx values.
You can use conf/mapred-site.xml to configure the default number of
mappers and reducers running on a node.

<property>
  <name>mapred.map.tasks</name>
  <value>2</value>
  <description>The default number of map tasks per job.
  Ignored when mapred.job.tracker is local.
  </description>
</property>

<property>
  <name>mapred.reduce.tasks</name>
  <value>1</value>
  <description>The default number of reduce tasks per job. Typically set to 99%
  of the cluster's reduce capacity, so that if a node fails the reduces can
  still be executed in a single wave.
  Ignored when mapred.job.tracker is local.
  </description>
</property>


-
Ravi

On 5/4/10 3:54 PM, jamborta jambo...@gmail.com wrote:




Hi,

I am trying to set up a small Hadoop cluster with 6 machines. The problem I
have now is that if I set the memory allocated to a task low (e.g. -Xmx512m),
the application does not run; if I set it higher, some machines in the
cluster have only a small amount of memory (1 or 2GB), and when the
computation gets intensive Hadoop creates so many tasks and sends them to
these weaker machines that it brings the whole cluster down.
My question is whether it is possible to specify -Xmx for each machine in
the cluster and to specify how many tasks can run on a machine, or what the
optimal setting in this situation would be.

thanks for your help

Tom

--
View this message in context:
http://old.nabble.com/new-to-hadoop-tp28454028p28454028.html
Sent from the Hadoop core-user mailing list archive at
Nabble.com.




Ravi
--





Re: Advice on new Datacenter Hadoop Cluster?

2009-10-01 Thread Steve Loughran

Kevin Sweeney wrote:

I really appreciate everyone's input. We've been going back and forth on the
server size issue here. There are a few reasons we shot for the $1k price,
one because we wanted to be able to compare our datacenter costs vs. the
cloud costs. Another is that we have spec'd out a fast Intel node with
over-the-counter parts. We have a hard time justifying the dual-processor
costs and really don't see the need for the big server extras like
out-of-band management and redundancy. This is our proposed config, feel
free to criticize :)
Supermicro 512L-260 Chassis   $90
Supermicro X8SIL              $160
Heatsink                      $22
Intel 3460 Xeon               $350
Samsung 7200 RPM SATA2        2 x $85
2GB Non-ECC DIMM              4 x $65

This totals $1052. Doesn't this seem like a reasonable setup? Isn't the
purpose of a Hadoop cluster to build cheap, fast, replaceable nodes?


Disclaimer 1: I work for a server vendor so may be biased. I will 
attempt to avoid this by not pointing you at HP DL180 or SL170z servers.


Disclaimer 2: I probably don't know what I'm talking about. As far as 
Hadoop is concerned, I'm not sure anyone knows what the right 
configuration is.


* I'd consider ECC RAM. On a large cluster, over time, errors occur - you 
either notice them or propagate their effects.


* Worry about power, cooling and rack weight.

* Include network costs, power budget. That's your own switch costs, 
plus bandwidth in and out.


* There are some good arguments in favour of fewer, higher end machines 
over many smaller ones.  Less network traffic, often a higher density.


The cloud-hosted vs. owned question is an interesting one; I suspect the 
spreadsheet there is pretty complex.


* Estimate how much data you will want to store over time. On S3, those 
costs ramp up fast; in your own rack you can maybe plan to stick in 
an extra 2TB HDD a year from now (space, power, cooling and weight 
permitting), paying next year's prices for next year's capacity.


* Virtual machine management costs are different from physical 
management costs, especially if you don't invest time upfront on 
automating your datacentre software provisioning (custom RPMs, PXE 
preboot, kickstart, etc.). With VMMs you can almost hand-manage an image 
(naughty, but possible), as long as you have only a single image or two to 
push out. Even then, I'd automate, but at a higher level, creating 
images on demand as load/availability sees fit.


-Steve




Re: Advice on new Datacenter Hadoop Cluster?

2009-10-01 Thread Ryan Smith
I have a question that I feel I should ask on this thread.  Let's say you
want to build a cluster where you will be doing very little map/reduce -
just storage and replication of data on HDFS.  What would the hardware
requirements be?  No quad core? Less RAM?

Thanks
-Ryan

On Thu, Oct 1, 2009 at 7:36 AM, tim robertson timrobertson...@gmail.com wrote:

 Disclaimer: I am pretty useless when it comes to hardware

 I had a lot of issues with non-ECC memory when running hundreds of millions
 of inserts from MapReduce into HBase on a dev cluster.  The errors were
 checksum errors, the consensus was that the memory was causing the
 issues, and all advice was to ensure ECC memory.  The same cluster ran
 without (any apparent) error for simple counting operations on
 tab-delimited files.

 Cheers,
 Tim

 On Thu, Oct 1, 2009 at 11:49 AM, Steve Loughran ste...@apache.org wrote:
  Kevin Sweeney wrote:
 
  I really appreciate everyone's input. We've been going back and forth on
  the
  server size issue here. There are a few reasons we shot for the $1k
 price,
  one because we wanted to be able to compare our datacenter costs vs. the
  cloud costs. Another is that we have spec'd out a fast Intel node with
  over-the-counter parts. We have a hard time justifying the
 dual-processor
  costs and really don't see the need for the big server extras like
  out-of-band management and redundancy. This is our proposed config, feel
  free to criticize :)
  Supermicro 512L-260 Chassis $90
  Supermicro X8SIL  $160
  Heatsink$22
  Intel 3460 Xeon  $350
  Samsung 7200 RPM SATA2   2x$85
  2GB Non-ECC DIMM  4x$65
 
  This totals $1052. Doesn't this seem like a reasonable setup? Isn't the
  purpose of a hadoop cluster to build cheap,fast, replaceable nodes?
 
  Disclaimer 1: I work for a server vendor so may be biased. I will attempt
 to
  avoid this by not pointing you at HP DL180 or SL170z servers.
 
  Disclaimer 2: I probably don't know what I'm talking about. As far as
  Hadoop is concerned, I'm not sure anyone knows what the right
  configuration is.
 
  * I'd consider ECC RAM. On a large cluster, over time, errors occur -you
  either notice them or propagate the effects.
 
  * Worry about power, cooling and rack weight.
 
  * Include network costs, power budget. That's your own switch costs, plus
  bandwidth in and out.
 
  * There are some good arguments in favour of fewer, higher end machines
 over
  many smaller ones.  Less network traffic, often a higher density.
 
  The  cloud hosted vs owned is an interesting question; I suspect the
  spreadsheet there is pretty complex
 
  * Estimate how much data you will want to store over time. On S3, those
   costs ramp up fast; in your own rack you can maybe plan to stick in an
  extra 2TB HDD a year from now (space, power, cooling and weight
 permitting),
  paying next year's prices for next year's capacity.
 
  * Virtual machine management costs are different from physical management
   costs, especially if you don't invest time upfront on automating your
   datacentre software provisioning (custom RPMs, PXE preboot, kickstart, etc.).
   With VMMs you can almost hand-manage an image (naughty, but possible), as
   long as you have only a single image or two to push out. Even then, I'd
   automate, but at a higher level, creating images on demand as
   load/availability sees fit.
 
  -Steve
 
 
 



Re: Advice on new Datacenter Hadoop Cluster?

2009-10-01 Thread Steve Loughran

Ryan Smith wrote:

I have a question that I feel I should ask on this thread.  Let's say you
want to build a cluster where you will be doing very little map/reduce -
just storage and replication of data on HDFS.  What would the hardware
requirements be?  No quad core? Less RAM?



Servers with more HDD per CPU, and less RAM. CPUs are a big slice not 
just of capital, but of your power budget. If you are running a big 
datacentre, you will care about that electricity bill.


Assuming you go for 6 HDDs in a 1U box, you could have 6 or 12 TB 
per U, with perhaps a 2-core or 4-core server and enough ECC RAM.


* With less M/R work, you could allocate most of that space to HDFS and 
leave a few hundred GB for the OS and logs.


* You'd better estimate external load; if the cluster is storing data 
then total network bandwidth will be 3X the data ingress (for 
replication = 3), while read costs are that of the data itself. Also, 5 
threads on 3 different machines handle the write-and-forward process 
(see the config sketch after these bullets).


* I don't know how much load the datanode JVM would take with, say, 11 TB 
of managed storage underneath; that's memory and CPU time.
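
For what it's worth, both of those knobs live in hdfs-site.xml. A minimal
sketch, assuming the 0.20-era property names (the reserved-space figure is
just an example):

<property>
  <name>dfs.replication</name>
  <value>3</value>
  <description>Default block replication. Each block is written to this many
  datanodes, so ingest traffic on the network is roughly replication x the
  raw data rate.</description>
</property>

<property>
  <name>dfs.datanode.du.reserved</name>
  <value>107374182400</value>
  <description>Bytes per volume reserved for non-HDFS use - here roughly
  100GB kept free for the OS and logs.</description>
</property>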


Is anyone out there running big datanodes? What do they see?

-steve



Re: Advice on new Datacenter Hadoop Cluster?

2009-10-01 Thread Brian Bockelman


On Oct 1, 2009, at 7:13 AM, Steve Loughran wrote:


Ryan Smith wrote:
I have a question that i feel i should ask on this thread.  Lets  
say you
want to build a cluster where you will be doing very little map/ 
reduce,
storage and replication of data only on hdfs.  What would the  
hardware

requirements be?  No quad core? less ram?


Servers with more HDD per CPU, and less RAM. CPUs are a big slice  
not just of capital, but of your power budget. If you are running a  
big datacentre, you will care about that electricity bill.


Assuming you go for 1U with 6 HDD in a 1U box, you could have 6 or  
12 TB per U, then perhaps a 2-core or 4-core server with enough  
ECC RAM.


* with less M/R work, you could allocate most of that TB to work,  
leave a few hundred GB for OS and logs


* you'd better estimate external load; if the cluster is storing  
data then total network bandwidth will be 3X the data ingress (for  
replication = 3), read costs are that of the data itself. Also, 5  
threads on 3 different machines handling the write and forward process.


* I don't know how much load the datanode JVM would take with, say  
11 TB of managed storage underneath; that's memory and CPU time.



Datanode load is a function of the number of IOPS.  Basically, buying  
6 12TB nodes versus 3 24TB nodes, you double the number of IOPS per  
node.


If you're using HDFS solely for backup, then the number of IOPS is so  
small you can assume it's zero.  We use HDFS for a non-mapreduce  
physics application, and our particular application mix is such that I  
target 1 batch system core per usable HDFS TB.




Is anyone out there running big datanodes? What do they see?



Our biggest is 48TB:
* They go offline for 5 minutes during the block reports.  We use rack  
awareness to make sure that both copies are not on big data nodes (see  
the rack-awareness sketch after this list).  Fixed in future releases  
(0.20.0 even, maybe).
* When one disk goes out, the datanode shuts down - meaning that 48  
disks go out.  This is to be fixed in 0.21.0, I think.
* The CPUs (4 cores) are pegged when the system is under full load.   
If I had a chance, I'd give it more CPU horsepower.
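
For anyone who hasn't wired up rack awareness: it hangs off a single property
pointing at a mapping script. A minimal sketch, assuming the 0.20-era property
name in core-site.xml (the script path here is just an example):

<property>
  <name>topology.script.file.name</name>
  <value>/etc/hadoop/topology.sh</value>
  <description>Script that maps each datanode hostname/IP to a rack id such
  as /dc1/rack1; the namenode then spreads block replicas across racks rather
  than piling them onto the big nodes' rack.</description>
</property>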


As usual, everyone's application is different enough that any anecdote  
is possibly not applicable.


Brian







Re: Advice on new Datacenter Hadoop Cluster?

2009-10-01 Thread Patrick Angeles
I wouldn't spec the worker nodes just to facilitate cloud cost comparison.
There's enough variability out there and you'd have to deal with storage,
network bandwidth and I/O. Not to mention that a similarly spec'd virtual cloud
server will never perform as well as a physical server, because you don't get
data locality - unless you have something like Amazon's EBS, but then that
jacks up your costs.
Also, you shouldn't assume that 'big server' will include out-of-band
management or redundancy.

Also take into account performance per watt. Dual-socket machines do better
here. Just like you, I wouldn't go with high-GHz ('faster') Intel procs
because they are power hungry and generate lots of heat for the incremental
speed bump that you get. (After all, you're not building a gaming rig.)
However, you can go dual-socket with lower-speed processors. I think the
lowest-GHz Nehalems that support hyper-threading are good value. For
example, compare the Xeon 3460 @ 2.8GHz ($360) to the 3440 @ 2.53GHz ($240).
That's about a 10% speed bump for a 50% price increase, and that's without
factoring in the power consumption. Granted, you need to take into account
the cost of the entire server, not just the processor.


On Wed, Sep 30, 2009 at 6:46 PM, Kevin Sweeney ke...@yieldex.com wrote:

 I really appreciate everyone's input. We've been going back and forth on
 the
 server size issue here. There are a few reasons we shot for the $1k price,
 one because we wanted to be able to compare our datacenter costs vs. the
 cloud costs. Another is that we have spec'd out a fast Intel node with
 over-the-counter parts. We have a hard time justifying the dual-processor
 costs and really don't see the need for the big server extras like
 out-of-band management and redundancy. This is our proposed config, feel
 free to criticize :)
 Supermicro 512L-260 Chassis   $90
 Supermicro X8SIL              $160
 Heatsink                      $22
 Intel 3460 Xeon               $350
 Samsung 7200 RPM SATA2        2 x $85
 2GB Non-ECC DIMM              4 x $65

 This totals $1052. Doesn't this seem like a reasonable setup? Isn't the
 purpose of a Hadoop cluster to build cheap, fast, replaceable nodes?



 On Wed, Sep 30, 2009 at 9:06 PM, Ted Dunning ted.dunn...@gmail.com
 wrote:

  2TB drives are just now dropping to parity with 1TB on a $/GB basis.
 
  If you want space rather than speed, this is a good option.  If you want
  speed rather than space, more spindles and smaller disks are better.
  Ironically, 500GB drives now often cost more than 1TB drives (that is $,
  not
  $/GB).
 
  On Wed, Sep 30, 2009 at 7:33 AM, Patrick Angeles
  patrickange...@gmail.comwrote:
 
   We went with 2 x Nehalems, 4 x 1TB drives and 24GB RAM. The ram might
 be
   overkill... but it's DDR3 so you get either 12 or 24GB. Each box has 16
   virtual cores so 12GB might not have been enough. These boxes are
 around
   $4k
   each, but can easily outperform any $1K box dollar per dollar (and
   performance per watt).
  
   If you're extremely I/O bound, you can get single-socket configurations
   with
   the same amount of drive spindles for really cheap (~$2k for single
 proc,
   8-12GB RAM, 4x1TB drives).
  
   On Wed, Sep 30, 2009 at 10:19 AM, stephen mulcahy
   stephen.mulc...@deri.orgwrote:
  
Todd Lipcon wrote:
   
Most people building new clusters at this point seem to be leaning
   towards
dual quad core Nehalem with 4x1TB 7200RPM SATA and at least 8G RAM.
   
   
We went with a similar configuration for a recently purchased cluster
  but
opted for qual quad core Opterons (Shanghai) rather than Nehalems and
invested the difference in more memory per node (16GB). Nehalem seem
 to
perform very well on some benchmarks but that performance comes at a
premium. I guess it depends on your planned use of the cluster but in
 a
   lot
of cases more memory may be better spent, especially if you plan on
   running
things like HBase on the cluster also (which we do).
   
-stephen
   
--
Stephen Mulcahy, DI2, Digital Enterprise Research Institute,
NUI Galway, IDA Business Park, Lower Dangan, Galway, Ireland
    http://di2.deri.ie http://webstar.deri.ie http://sindice.com
   
  
 
 
 
  --
  Ted Dunning, CTO
  DeepDyve
 



 --
 Kevin Sweeney
 Systems Engineer
 Yieldex -- www.yieldex.com
 (303) 999-7045



Re: Advice on new Datacenter Hadoop Cluster?

2009-09-30 Thread stephen mulcahy

Todd Lipcon wrote:

Most people building new clusters at this point seem to be leaning towards
dual quad core Nehalem with 4x1TB 7200RPM SATA and at least 8G RAM.


We went with a similar configuration for a recently purchased cluster 
but opted for dual quad-core Opterons (Shanghai) rather than Nehalems 
and invested the difference in more memory per node (16GB). Nehalems seem 
to perform very well on some benchmarks, but that performance comes at a 
premium. I guess it depends on your planned use of the cluster, but in a 
lot of cases more memory may be better spent, especially if you plan on 
running things like HBase on the cluster as well (which we do).


-stephen

--
Stephen Mulcahy, DI2, Digital Enterprise Research Institute,
NUI Galway, IDA Business Park, Lower Dangan, Galway, Ireland
http://di2.deri.ie http://webstar.deri.ie http://sindice.com


Re: Advice on new Datacenter Hadoop Cluster?

2009-09-30 Thread Patrick Angeles
We went with 2 x Nehalems, 4 x 1TB drives and 24GB RAM. The ram might be
overkill... but it's DDR3 so you get either 12 or 24GB. Each box has 16
virtual cores so 12GB might not have been enough. These boxes are around $4k
each, but can easily outperform any $1K box dollar per dollar (and
performance per watt).

If you're extremely I/O bound, you can get single-socket configurations with
the same amount of drive spindles for really cheap (~$2k for single proc,
8-12GB RAM, 4x1TB drives).

On Wed, Sep 30, 2009 at 10:19 AM, stephen mulcahy
stephen.mulc...@deri.org wrote:

 Todd Lipcon wrote:

 Most people building new clusters at this point seem to be leaning towards
 dual quad core Nehalem with 4x1TB 7200RPM SATA and at least 8G RAM.


 We went with a similar configuration for a recently purchased cluster but
 opted for dual quad-core Opterons (Shanghai) rather than Nehalems and
 invested the difference in more memory per node (16GB). Nehalems seem to
 perform very well on some benchmarks, but that performance comes at a
 premium. I guess it depends on your planned use of the cluster, but in a lot
 of cases more memory may be better spent, especially if you plan on running
 things like HBase on the cluster as well (which we do).

 -stephen

 --
 Stephen Mulcahy, DI2, Digital Enterprise Research Institute,
 NUI Galway, IDA Business Park, Lower Dangan, Galway, Ireland
  http://di2.deri.ie http://webstar.deri.ie http://sindice.com



Re: Advice on new Datacenter Hadoop Cluster?

2009-09-30 Thread Ted Dunning
2TB drives are just now dropping to parity with 1TB on a $/GB basis.

If you want space rather than speed, this is a good option.  If you want
speed rather than space, more spindles and smaller disks are better.
Ironically, 500GB drives now often cost more than 1TB drives (that is $, not
$/GB).

On Wed, Sep 30, 2009 at 7:33 AM, Patrick Angeles
patrickange...@gmail.com wrote:

 We went with 2 x Nehalems, 4 x 1TB drives and 24GB RAM. The ram might be
 overkill... but it's DDR3 so you get either 12 or 24GB. Each box has 16
 virtual cores so 12GB might not have been enough. These boxes are around
 $4k
 each, but can easily outperform any $1K box dollar per dollar (and
 performance per watt).

 If you're extremely I/O bound, you can get single-socket configurations
 with
 the same amount of drive spindles for really cheap (~$2k for single proc,
 8-12GB RAM, 4x1TB drives).

 On Wed, Sep 30, 2009 at 10:19 AM, stephen mulcahy
 stephen.mulc...@deri.orgwrote:

  Todd Lipcon wrote:
 
  Most people building new clusters at this point seem to be leaning
 towards
  dual quad core Nehalem with 4x1TB 7200RPM SATA and at least 8G RAM.
 
 
  We went with a similar configuration for a recently purchased cluster but
  opted for qual quad core Opterons (Shanghai) rather than Nehalems and
  invested the difference in more memory per node (16GB). Nehalem seem to
  perform very well on some benchmarks but that performance comes at a
  premium. I guess it depends on your planned use of the cluster but in a
 lot
  of cases more memory may be better spent, especially if you plan on
 running
  things like HBase on the cluster also (which we do).
 
  -stephen
 
  --
  Stephen Mulcahy, DI2, Digital Enterprise Research Institute,
  NUI Galway, IDA Business Park, Lower Dangan, Galway, Ireland
   http://di2.deri.ie http://webstar.deri.ie http://sindice.com
 




-- 
Ted Dunning, CTO
DeepDyve


Re: Advice on new Datacenter Hadoop Cluster?

2009-09-30 Thread Ted Dunning
Depending on your needs and the size of your cluster, the out-of-band
management can be of significant interest.  It is a pretty simple
cost/benefit analysis that trades your sysops time (which is probably about
the equivalent of $50-150 per hour fully loaded and accounting for
opportunity cost) versus the cost of IPMI cards.  If it takes an extra hour
of time to actually go to the data center per event and possibly another
hour of time because the data center is a lousy place to work, then the IPMI
card is probably about break-even.  In our case, it is more than an hour of
inconvenience and our systems guy has lots of things to do, so the boards
are a no-brainer.

You don't say here what size the disks are.  Dual disks are a good idea for
any number of reasons.  I just saw a price this morning of about $170 for a
2TB drive and about half that for a 1TB drive so make sure you are doing at
least that well.

You are specifying only 4GB of RAM.  I would account that as severely
underpowering your machine.  My own preference is to put 4-8x that much RAM
on a machine with one or two quad-core CPUs and four drives.  That still
fits in a 1U chassis and will outperform several of the boxes that you are
describing, although perhaps not exactly on a $/cycle even trade-off.

There are also some very sweet twin setups where you get two beefy machines
in a single 1U slot.  Very nice.  For instance, you can put two dual-CPU,
quad-core Nehalem nodes with 48GB each and a bunch of disks into 1U for about
$14K, including paying somebody to set up the machine and a 3-year
maintenance contract.  You should be able to do this yourself for $12K or
less, and this is equivalent to somewhere between 6 and 30 of the nodes
that you are spec'ing (2 x 2 x 4 cores vs 4 cores = 4x, but round up because
of fancier processors; 96GB vs 4GB = 32x).  Cut off another $1-2K because
this is an older quote and 2TB drives have suddenly become much cheaper
as well.

On Wed, Sep 30, 2009 at 3:46 PM, Kevin Sweeney ke...@yieldex.com wrote:

 I really appreciate everyone's input. We've been going back and forth on
 the
 server size issue here. There are a few reasons we shot for the $1k price,
 one because we wanted to be able to compare our datacenter costs vs. the
 cloud costs. Another is that we have spec'd out a fast Intel node with
 over-the-counter parts. We have a hard time justifying the dual-processor
 costs and really don't see the need for the big server extras like
 out-of-band management and redundancy. This is our proposed config, feel
 free to criticize :)
 Supermicro 512L-260 Chassis   $90
 Supermicro X8SIL              $160
 Heatsink                      $22
 Intel 3460 Xeon               $350
 Samsung 7200 RPM SATA2        2 x $85
 2GB Non-ECC DIMM              4 x $65

 This totals $1052. Doesn't this seem like a reasonable setup? Isn't the
 purpose of a Hadoop cluster to build cheap, fast, replaceable nodes?



 On Wed, Sep 30, 2009 at 9:06 PM, Ted Dunning ted.dunn...@gmail.com
 wrote:

  2TB drives are just now dropping to parity with 1TB on a $/GB basis.
 
  If you want space rather than speed, this is a good option.  If you want
  speed rather than space, more spindles and smaller disks are better.
  Ironically, 500GB drives now often cost more than 1TB drives (that is $,
  not
  $/GB).
 
  On Wed, Sep 30, 2009 at 7:33 AM, Patrick Angeles
  patrickange...@gmail.comwrote:
 
   We went with 2 x Nehalems, 4 x 1TB drives and 24GB RAM. The ram might
 be
   overkill... but it's DDR3 so you get either 12 or 24GB. Each box has 16
   virtual cores so 12GB might not have been enough. These boxes are
 around
   $4k
   each, but can easily outperform any $1K box dollar per dollar (and
   performance per watt).
  
   If you're extremely I/O bound, you can get single-socket configurations
   with
   the same amount of drive spindles for really cheap (~$2k for single
 proc,
   8-12GB RAM, 4x1TB drives).
  
   On Wed, Sep 30, 2009 at 10:19 AM, stephen mulcahy
   stephen.mulc...@deri.orgwrote:
  
Todd Lipcon wrote:
   
Most people building new clusters at this point seem to be leaning
   towards
dual quad core Nehalem with 4x1TB 7200RPM SATA and at least 8G RAM.
   
   
We went with a similar configuration for a recently purchased cluster
  but
opted for qual quad core Opterons (Shanghai) rather than Nehalems and
invested the difference in more memory per node (16GB). Nehalem seem
 to
perform very well on some benchmarks but that performance comes at a
premium. I guess it depends on your planned use of the cluster but in
 a
   lot
of cases more memory may be better spent, especially if you plan on
   running
things like HBase on the cluster also (which we do).
   
-stephen
   
--
Stephen Mulcahy, DI2, Digital Enterprise Research Institute,
NUI Galway, IDA Business Park, Lower Dangan, Galway, Ireland
 http://di2.deri.ie http://webstar.deri.ie http://sindice.com
   
  
 
 

Advice on new Datacenter Hadoop Cluster?

2009-09-29 Thread ylx_admin

Hey all, 

I'm pretty new to Hadoop in general, and I've been tasked with building out a
datacenter cluster of Hadoop servers to process logfiles. We currently use
Amazon, but our heavy usage is starting to justify running our own servers.
I'm aiming for less than $1k per box, and of course trying to economize on
power/rack. Can anyone give me some advice on what to pay attention to when
building these server nodes?

TIA,
Kevin
-- 
View this message in context: 
http://www.nabble.com/Advice-on-new-Datacenter-Hadoop-Cluster--tp25667905p25667905.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.



Re: Advice on new Datacenter Hadoop Cluster?

2009-09-29 Thread Todd Lipcon
Hi Kevin,

Less than $1k/box is unrealistic and won't be your best price/performance.

Most people building new clusters at this point seem to be leaning towards
dual quad core Nehalem with 4x1TB 7200RPM SATA and at least 8G RAM.

You're better off starting with a small cluster of these nicer machines than
3x as many $1k machines, assuming you can afford at least 4-5 of them.

-Todd

On Tue, Sep 29, 2009 at 10:57 AM, ylx_admin nek...@hotmail.com wrote:


 Hey all,

 I'm pretty new to hadoop in general and I've been tasked with building out
 a
 datacenter cluster of hadoop servers to process logfiles. We currently use
 Amazon but our heavy usage is starting to justify running our own servers.
 I'm aiming for less than $1k per box, and of course trying to economize on
 power/rack. Can anyone give me some advice on what to pay attention to when
 building these server nodes?

 TIA,
 Kevin
 --
 View this message in context:
 http://www.nabble.com/Advice-on-new-Datacenter-Hadoop-Cluster--tp25667905p25667905.html
 Sent from the Hadoop core-user mailing list archive at Nabble.com.




Re: Advice on new Datacenter Hadoop Cluster?

2009-09-29 Thread Amandeep Khurana
Also, if you plan to run HBase as well (now or in the future), you'll need
more RAM. Take that into account too.


Amandeep Khurana
Computer Science Graduate Student
University of California, Santa Cruz


On Tue, Sep 29, 2009 at 10:59 AM, Todd Lipcon t...@cloudera.com wrote:

 Hi Kevin,

 Less than $1k/box is unrealistic and won't be your best price/performance.

 Most people building new clusters at this point seem to be leaning towards
 dual quad core Nehalem with 4x1TB 7200RPM SATA and at least 8G RAM.

 You're better off starting with a small cluster of these nicer machines
 than
 3x as many $1k machines, assuming you can afford at least 4-5 of them.

 -Todd

 On Tue, Sep 29, 2009 at 10:57 AM, ylx_admin nek...@hotmail.com wrote:

 
  Hey all,
 
  I'm pretty new to hadoop in general and I've been tasked with building
 out
  a
  datacenter cluster of hadoop servers to process logfiles. We currently
 use
  Amazon but our heavy usage is starting to justify running our own
 servers.
  I'm aiming for less than $1k per box, and of course trying to economize
 on
  power/rack. Can anyone give me some advice on what to pay attention to
 when
  building these server nodes?
 
  TIA,
  Kevin
  --
  View this message in context:
 
 http://www.nabble.com/Advice-on-new-Datacenter-Hadoop-Cluster--tp25667905p25667905.html
  Sent from the Hadoop core-user mailing list archive at Nabble.com.