Re: FAQ for New to Hadoop
Ken,

You can also take a look at the FAQ section in the posts we publish periodically, starting with http://blog.sematext.com/2010/02/16/hadoop-digest-february-2010/. The frequently asked questions are mainly gathered from the project's user mailing lists. We also cover HBase (you can find those posts on http://blog.sematext.com as well).

Alex Baranau
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch - Hadoop - HBase
Hadoop ecosystem search :: http://search-hadoop.com/

On Fri, Jul 9, 2010 at 1:35 AM, Mark Kerzner markkerz...@gmail.com wrote: [...]
FAQ for New to Hadoop
Hi all,

I recently hosted an "Intro to Hadoop" session at the BigDataCamp unconference last week. I later wrote down questions from the audience that seemed useful to other Hadoop beginners, and then compared these to the Hadoop project FAQ at http://wiki.apache.org/hadoop/FAQ. There was overlap, but not as much as I expected - the Hadoop FAQ has more "how do I do X" questions, versus "can I do X" or "why should I do X".

I posted these questions to http://www.scaleunlimited.com/blog/intro-to-hadoop-at-bigdatacamp and would appreciate any input - e.g. questions you think should be there, or answers you think aren't very clear (though mea culpa in advance: I jotted these down quickly, so I realize they're pretty rough).

Thanks,

-- Ken
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g
Re: FAQ for New to Hadoop
Cool, Ken, thank you; I think it is very useful.

Mark

On Thu, Jul 8, 2010 at 4:35 PM, Ken Krugler kkrugler_li...@transpac.com wrote: [...]
new to hadoop
Hi,

I am trying to set up a small Hadoop cluster with 6 machines. The problem I have now is that if I set the memory allocated to a task low (e.g. -Xmx512m), the application does not run; if I set it higher, some machines in the cluster that don't have much memory (1 or 2GB) get overloaded - when the computation gets intensive, Hadoop creates many tasks and sends them to these weaker machines, which brings the whole cluster down.

My question is whether it is possible to specify -Xmx for each machine in the cluster, and to specify how many tasks can run on a machine. Or what is the optimal setting in this situation?

Thanks for your help,
Tom
Re: new to hadoop
How much RAM? With 6-8GB RAM you can go for 4 mappers and 2 reducers (this is my personal guess).

- Ravi

On 5/4/10 4:33 PM, Tamas Jambor jambo...@googlemail.com wrote: [...]
Re: new to hadoop
Thank you. So what would be the optimal setting for mapred.map.tasks and mapred.reduce.tasks, say, on a dual-core machine?

Tom

On 05/05/2010 00:12, Ravi Phulari wrote:

You can edit the configuration files on each node (conf/hadoop-env.sh) to specify -Xmx values. You can use conf/mapred-site.xml to configure the default number of mappers and reducers running on a node:

<property>
  <name>mapred.map.tasks</name>
  <value>2</value>
  <description>The default number of map tasks per job. Ignored when
  mapred.job.tracker is "local".</description>
</property>

<property>
  <name>mapred.reduce.tasks</name>
  <value>1</value>
  <description>The default number of reduce tasks per job. Typically set
  to 99% of the cluster's reduce capacity, so that if a node fails the
  reduces can still be executed in a single wave. Ignored when
  mapred.job.tracker is "local".</description>
</property>

- Ravi

On 5/4/10 3:54 PM, jamborta jambo...@gmail.com wrote: [...]
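A note on the per-machine side of Tom's question: mapred.map.tasks and mapred.reduce.tasks are per-job defaults, while the limits that actually protect a weak machine are the TaskTracker slot counts, which each TaskTracker reads from its own local mapred-site.xml and which can therefore differ per node. A minimal sketch for one of the 1-2GB machines (property names are the Hadoop 0.20-era classic-MapReduce ones; the values are illustrative guesses, not tested recommendations):

<!-- conf/mapred-site.xml on the low-memory nodes only -->
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>2</value>
  <description>Run at most 2 map tasks at once on this node.</description>
</property>

<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>1</value>
  <description>Run at most 1 reduce task at once on this node.</description>
</property>

<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx512m</value>
  <description>Heap per task JVM. Note this value is normally taken from
  the job's configuration at submit time, so a modest job-wide -Xmx plus
  small per-node slot counts is the safer combination.</description>
</property>

With 3 slots at 512MB each, task JVMs stay within roughly 1.5GB of a 2GB node's RAM; the stronger machines would carry their own copy of mapred-site.xml with larger slot counts.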
Re: new to hadoop
Great, thank you. I'll set it up that way.

Tom

On 05/05/2010 00:37, Ravi Phulari wrote: [...]
Re: Advice on new Datacenter Hadoop Cluster?
Kevin Sweeney wrote:

I really appreciate everyone's input. We've been going back and forth on the server size issue here. There are a few reasons we shot for the $1k price: one, we wanted to be able to compare our datacenter costs vs. the cloud costs; another is that we have spec'd out a fast Intel node with over-the-counter parts. We have a hard time justifying the dual-processor costs, and we really don't see the need for the big-server extras like out-of-band management and redundancy. This is our proposed config, feel free to criticize :)

Supermicro 512L-260 Chassis    $90
Supermicro X8SIL               $160
Heatsink                       $22
Intel 3460 Xeon                $350
Samsung 7200 RPM SATA2         2 x $85
2GB Non-ECC DIMM               4 x $65

This totals $1052. Doesn't this seem like a reasonable setup? Isn't the purpose of a Hadoop cluster to build cheap, fast, replaceable nodes?

Disclaimer 1: I work for a server vendor, so I may be biased. I will attempt to avoid this by not pointing you at HP DL180 or SL170z servers.

Disclaimer 2: I probably don't know what I'm talking about. As far as Hadoop is concerned, I'm not sure anyone knows what the right configuration is.

* I'd consider ECC RAM. On a large cluster, over time, errors occur - you either notice them or propagate their effects.
* Worry about power, cooling and rack weight.
* Include network costs and the power budget: that's your own switch costs, plus bandwidth in and out.
* There are some good arguments in favour of fewer, higher-end machines over many smaller ones: less network traffic, often a higher density. The cloud-hosted vs. owned question is an interesting one; I suspect the spreadsheet there is pretty complex.
* Estimate how much data you will want to store over time. On S3, those costs ramp up fast; in your own rack you can maybe plan to stick in an extra 2TB HDD a year from now (space, power, cooling and weight permitting), paying next year's prices for next year's capacity.
* Virtual machine management costs are different from physical management costs, especially if you don't invest time upfront in automating your datacentre software provisioning (custom RPMs, PXE preboot, kickstart, etc). With VMs you can almost hand-manage an image (naughty, but possible), as long as you have only an image or two to push out. Even then, I'd automate, but at a higher level, creating images on demand as load/availability sees fit.

-Steve
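To put a rough number on Steve's point about S3 costs ramping up, a back-of-the-envelope comparison (assuming S3's 2009-era list price of roughly $0.15/GB-month and the ~$170 2TB drive price quoted elsewhere in this thread; servers, power, and S3 transfer fees are all ignored):

  10 TB on S3 for a year              ~ 10,000 GB x $0.15 x 12 = $18,000
  10 TB on own disks, 3x replication  ~ 15 drives x $170       = $2,550 (one-off)

The gap narrows once you add servers, power, cooling and sysops time, which is why the full spreadsheet is, as Steve says, pretty complex.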
Re: Advice on new Datacenter Hadoop Cluster?
I have a question that I feel I should ask on this thread. Let's say you want to build a cluster where you will be doing very little map/reduce - storage and replication of data only, on HDFS. What would the hardware requirements be? No quad core? Less RAM?

Thanks,
-Ryan

On Thu, Oct 1, 2009 at 7:36 AM, tim robertson timrobertson...@gmail.com wrote:

Disclaimer: I am pretty useless when it comes to hardware.

I had a lot of issues with non-ECC memory when running hundreds of millions of inserts from MapReduce into HBase on a dev cluster. The errors were checksum errors, and the consensus was that the memory was causing the issues; all advice was to ensure ECC memory. The same cluster ran without (any apparent) error for simple counting operations on tab-delimited files.

Cheers,
Tim

On Thu, Oct 1, 2009 at 11:49 AM, Steve Loughran ste...@apache.org wrote: [...]
Re: Advice on new Datacenter Hadoop Cluster?
Ryan Smith wrote:

Let's say you want to build a cluster where you will be doing very little map/reduce - storage and replication of data only, on HDFS. What would the hardware requirements be? No quad core? Less RAM?

Servers with more HDD per CPU, and less RAM. CPUs are a big slice not just of capital but of your power budget; if you are running a big datacentre, you will care about that electricity bill. Assuming you go for a 1U box with 6 HDDs, you could have 6 or 12 TB per U, with perhaps a 2-core or 4-core server and enough ECC RAM.

* With less M/R work, you could allocate most of those TB to storage and leave a few hundred GB for OS and logs.
* You'd better estimate external load: if the cluster is storing data, then total network bandwidth will be 3X the data ingress (for replication = 3); read costs are that of the data itself. Also, five threads on three different machines handle the write-and-forward process.
* I don't know how much load the datanode JVM would take with, say, 11 TB of managed storage underneath; that's memory and CPU time.

Is anyone out there running big datanodes? What do they see?

-steve
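To make the 3X figure concrete, a back-of-the-envelope worked example (assuming replication = 3 and clients writing from outside the cluster; the 1 TB/day ingest rate is made up for illustration):

  ingress to first datanode         = 1 TB/day
  pipeline copies (replicas 2 + 3)  = 2 x 1 TB/day
  total write traffic on the LAN    ~ 3 TB/day

If the writer runs on a datanode, the first replica is written locally and the cross-machine traffic drops to roughly 2X the ingress.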
Re: Advice on new Datacenter Hadoop Cluster?
On Oct 1, 2009, at 7:13 AM, Steve Loughran wrote:

I don't know how much load the datanode JVM would take with, say, 11 TB of managed storage underneath; that's memory and CPU time.

Datanode load is a function of the number of IOPS. Basically, buying 6 x 12TB nodes versus 3 x 24TB nodes, you double the number of IOPS per node. If you're using HDFS solely for backup, then the number of IOPS is so small you can assume it's zero. We use HDFS for a non-MapReduce physics application, and our particular application mix is such that I target 1 batch-system core per usable HDFS TB.

Is anyone out there running big datanodes? What do they see?

Our biggest is 48TB:

* They go offline for 5 minutes during block reports. We use rack awareness to make sure that both copies of a block are not on big datanodes. Fixed in future releases (0.20.0 even, maybe).
* When one disk goes out, the datanode shuts down - meaning that 48 disks go out. This is to be fixed in 0.21.0, I think.
* The CPUs (4 cores) are pegged when the system is under full load. If I had the chance, I'd give it more CPU horsepower.

As usual, everyone's application is different enough that any anecdote is possibly not applicable.

Brian
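For reference, the rack awareness Brian mentions is driven by a topology script that maps each datanode to a rack ID; HDFS's default placement policy then spreads a block's replicas across racks, so tagging the big nodes as their own "rack" keeps multiple copies off them. A minimal sketch (the property name is the Hadoop 0.20-era one from core-site.xml; the script path and rack names are made up for illustration):

<property>
  <name>topology.script.file.name</name>
  <value>/etc/hadoop/rack-topology.sh</value>
  <description>Executable that is passed datanode IPs/hostnames as
  arguments and prints one rack ID per argument, e.g. /rack-big48tb
  for the 48TB datanodes and /rack-standard for everything else.
  When unset, all nodes fall into a single default rack.</description>
</property>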
Re: Advice on new Datacenter Hadoop Cluster?
I wouldn't spec the worker nodes just to facilitate cloud cost comparison. There's enough variability out there, and you'd have to deal with storage, network bandwidth and I/O. Not to mention that a similarly spec'd virtual cloud server will never perform as well as a physical server, because you don't get data locality - unless you have something like Amazon's EBS, but then that jacks up your costs. Also, you shouldn't assume that 'big server' will include out-of-band management or redundancy.

Also take into account performance per watt. Dual-socket machines do better here. Just like you, I wouldn't go with high-GHz ('faster') Intel procs, because they are power hungry and generate lots of heat for the incremental speed bump that you get. (After all, you're not building a gaming rig.) However, you can go dual-socket with lower-speed processors. I think the lowest-GHz Nehalems that support hyper-threading are good value. For example, compare the Xeon 3460 @ 2.8GHz ($360) to the 3440 @ 2.53GHz ($240): that's about a 10% speed bump for a 50% price increase, and that's without factoring in the power consumption. Granted, you need to take into account the cost of the entire server, not just the processor.

On Wed, Sep 30, 2009 at 6:46 PM, Kevin Sweeney ke...@yieldex.com wrote: [...]
Re: Advice on new Datacenter Hadoop Cluster?
Todd Lipcon wrote:

Most people building new clusters at this point seem to be leaning towards dual quad-core Nehalem with 4x1TB 7200RPM SATA and at least 8GB RAM.

We went with a similar configuration for a recently purchased cluster, but opted for dual quad-core Opterons (Shanghai) rather than Nehalems and invested the difference in more memory per node (16GB). Nehalems seem to perform very well on some benchmarks, but that performance comes at a premium. I guess it depends on your planned use of the cluster, but in a lot of cases the money may be better spent on more memory, especially if you plan on running things like HBase on the cluster too (which we do).

-stephen

--
Stephen Mulcahy, DI2, Digital Enterprise Research Institute,
NUI Galway, IDA Business Park, Lower Dangan, Galway, Ireland
http://di2.deri.ie  http://webstar.deri.ie  http://sindice.com
Re: Advice on new Datacenter Hadoop Cluster?
We went with 2 x Nehalems, 4 x 1TB drives and 24GB RAM. The RAM might be overkill... but it's DDR3, so you get either 12 or 24GB, and each box has 16 virtual cores, so 12GB might not have been enough. These boxes are around $4k each, but they can easily outperform any $1k box dollar for dollar (and in performance per watt). If you're extremely I/O-bound, you can get single-socket configurations with the same number of drive spindles for really cheap (~$2k for single proc, 8-12GB RAM, 4x1TB drives).

On Wed, Sep 30, 2009 at 10:19 AM, stephen mulcahy stephen.mulc...@deri.org wrote: [...]
Re: Advice on new Datacenter Hadoop Cluster?
2TB drives are just now dropping to parity with 1TB drives on a $/GB basis. If you want space rather than speed, this is a good option. If you want speed rather than space, more spindles and smaller disks are better. Ironically, 500GB drives now often cost more than 1TB drives (that is $, not $/GB).

On Wed, Sep 30, 2009 at 7:33 AM, Patrick Angeles patrickange...@gmail.com wrote: [...]

--
Ted Dunning, CTO
DeepDyve
Re: Advice on new Datacenter Hadoop Cluster?
Depending on your needs and the size of your cluster, out-of-band management can be of significant interest. It is a pretty simple cost/benefit analysis that trades your sysops time (which is probably the equivalent of $50-150 per hour, fully loaded and accounting for opportunity cost) against the cost of IPMI cards. If it takes an extra hour of time per event to actually go to the data center, and possibly another hour because the data center is a lousy place to work, then the IPMI card is probably about break-even. In our case, it is more than an hour of inconvenience, and our systems guy has LOTS of things to do, so the boards are a no-brainer.

You don't say here what size the disks are. Dual disks are a good idea for any number of reasons. I just saw a price this morning of about $170 for a 2TB drive, and about half that for a 1TB drive, so make sure you are doing at least that well.

You are specifying only 8GB of RAM. I would account that as severely underpowering your machine. My own preference is to put 4-8x that much RAM in a machine with one or two quad-core CPUs and four drives. That still fits in a 1U chassis and will outperform several of the boxes you are describing, although perhaps not exactly on an even $/cycle trade-off.

There are also some very sweet twin setups where you get two beefy machines in a single 1U slot. Very nice. For instance, you can put two dual-CPU quad-core Nehalem machines with 48GB each and a bunch of disk into 1U for about $14K, including paying somebody to set up the machines and a 3-year maintenance contract. You should be able to do this yourself for $12K or less, and this is equivalent to something between 4 and 12 of the nodes that you are spec'ing (2 x 2 x 4 cores vs 4 cores = 4x, rounding up for the fancier processors; 96GB vs 8GB = 12x). Cut off another k$ or two, because this is an older quote and 2TB drives are suddenly much cheaper as well.

On Wed, Sep 30, 2009 at 3:46 PM, Kevin Sweeney ke...@yieldex.com wrote: [...]
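A rough worked version of Ted's IPMI break-even (every number below is a placeholder for illustration, not a quote):

  hours avoided per incident   = 2   (travel plus hands-on time)
  incidents per node per year  = 0.5
  sysops cost                  = $100/hour
  value over 3 years           = 2 x 0.5 x 3 x $100 = $300 per node

So under these assumptions, an out-of-band management option costing less than about $300 per node pays for itself.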
Advice on new Datacenter Hadoop Cluster?
Hey all,

I'm pretty new to Hadoop in general, and I've been tasked with building out a datacenter cluster of Hadoop servers to process logfiles. We currently use Amazon, but our heavy usage is starting to justify running our own servers. I'm aiming for less than $1k per box, and of course trying to economize on power/rack. Can anyone give me some advice on what to pay attention to when building these server nodes?

TIA,
Kevin
Re: Advice on new Datacenter Hadoop Cluster?
Hi Kevin,

Less than $1k/box is unrealistic and won't give you the best price/performance. Most people building new clusters at this point seem to be leaning towards dual quad-core Nehalem with 4x1TB 7200RPM SATA and at least 8GB RAM. You're better off starting with a small cluster of these nicer machines than with 3x as many $1k machines, assuming you can afford at least 4-5 of them.

-Todd

On Tue, Sep 29, 2009 at 10:57 AM, ylx_admin nek...@hotmail.com wrote: [...]
Re: Advice on new Datacenter Hadoop Cluster?
Also, if you plan to run HBase as well (now or in the future), you'll need more RAM. Take that into account too.

Amandeep Khurana
Computer Science Graduate Student
University of California, Santa Cruz

On Tue, Sep 29, 2009 at 10:59 AM, Todd Lipcon t...@cloudera.com wrote: [...]