Re: Hadoop Java Versions
We could create an apache hadoop list of product selection discussions. I believe this list is intended to be focused on project governance and similar discussions. Maybe we should simply create a governance list and leave this one to be the free for all? On Jul 2, 2011, at 9:16 PM, Ian Holsman wrote: On Jul 2, 2011, at 7:38 AM, Ted Dunning wrote: On Fri, Jul 1, 2011 at 11:53 AM, Abhishek Mehta abhis...@tresata.comwrote: open and honest conversations on this thread/group, irrespective of whether one represents a company or apache, should be encouraged. Paradoxically, I partially agree with Ian on this. On the one hand, it is important to hear about alternatives in the same eco-system (and alternative approaches from other communities). That is a bit different from Ian's view. While I don't mind that technical alternatives get discussed, I do get PO'd when the conversation goes into why product X is better than Y, or when someone makes claims that are incorrect because some 'customer' told them and stuff like that. If we can keep it at architecture/approaches instead of why a certain product is better than go right ahead. Where I agree with him is that discussions should not be in the form of simple breast beating. Discussions should be driven by data and experience. All data and experience are of equal value if they represent quality thought. The only place that company names should figure in this is as a bookmark so you can tell which product/release has the characteristic under consideration. ... And in that sense we all owe ASF and the hadoop community (and not any one company) an equal amount of gratitude, humility and respect. This doesn't get said nearly enough.
Re: Hadoop Java Versions
On Wed, Jul 13, 2011 at 7:59 AM, Eric Baldeschwieler eri...@hortonworks.com wrote: We could create an apache hadoop list of product selection discussions. I believe this list is intended to be focused on project governance and similar discussions. Maybe we should simply create a governance list and leave this one to be the free for all? If you need to separate the discussions, then taking the discussion with the smaller/more experienced set of participants to the new list is going to be easier than trying to sweep the tide. A list named "general" sounds a lot like general discussions to newcomers (and frankly, to old-timers like me). For that matter, is there a strong reason to segregate the discussions?
Re: Hadoop Java Versions
On Jul 2, 2011, at 7:38 AM, Ted Dunning wrote: On Fri, Jul 1, 2011 at 11:53 AM, Abhishek Mehta abhis...@tresata.comwrote: open and honest conversations on this thread/group, irrespective of whether one represents a company or apache, should be encouraged. Paradoxically, I partially agree with Ian on this. On the one hand, it is important to hear about alternatives in the same eco-system (and alternative approaches from other communities). That is a bit different from Ian's view. While I don't mind that technical alternatives get discussed, I do get PO'd when the conversation goes into why product X is better than Y, or when someone makes claims that are incorrect because some 'customer' told them and stuff like that. If we can keep it at architecture/approaches instead of why a certain product is better than go right ahead. Where I agree with him is that discussions should not be in the form of simple breast beating. Discussions should be driven by data and experience. All data and experience are of equal value if they represent quality thought. The only place that company names should figure in this is as a bookmark so you can tell which product/release has the characteristic under consideration. ... And in that sense we all owe ASF and the hadoop community (and not any one company) an equal amount of gratitude, humility and respect. This doesn't get said nearly enough.
1Gb vs 10Gb eth (WAS: Re: Hadoop Java Versions)
Keeping the amount of disks per node low and the amount of nodes high should keep the impact of dead nodes in control. It keeps the impact of dead nodes in control but I don't think that's long-term cost efficient. As prices of 10GbE go down, the "keep the node small" argument seems less fitting. And on another note, most servers manufactured in the last 10 years have dual 1GbE network interfaces. If one were to go by these calcs: 150 nodes with four 2TB disks each, with HDFS 60% full, it takes around ~32 minutes to recover. It seems like that assumes a single 1GbE interface, why not leverage the second? I don't know how others set up their clusters. We have the tradition that every node in a cluster has at least three interfaces - one for interconnects, one for a management network (only reachable from within our own network and the primary interface for our admins, accessible only through a single management node) and one for ILOM, DRAC or whatever lights out manager is available. This doesn't leave us room for bonding interfaces on off-the-shelf nodes. Plus - you'd need twice as many ports in your switch. In the case of Hadoop we're considering adding a fourth NIC for external connectivity. We don't want people interacting with HDFS from outside while jobs are using the interconnects. Of course the choice for 1 or 10Gb ethernet is a function of price. As 10Gb ethernet prices are approaching that of 1Gb ethernet it gets more attractive. The recovery times scale linearly with ethernet speed, so 1Gb ethernet compared to 2Gb bonded ethernet or 10Gb ethernet makes quite a difference. I'm just saying that since we have other variables to tweak - number of disks and number of nodes - we can limit the impact and keep recovery times in check. Another thing to consider is that as 10Gb ethernet gets cheaper, it gets more attractive to stop using HDFS (or at least, data locality) and start using an external storage cluster. Compute node failure then has no impact, and disk failure is hardly noticed by compute nodes. But this is really still very far from as cheap as many small nodes with relatively few disks - I really like the bang for buck you get with Hadoop :-) Evert On Thu, Jun 30, 2011 at 2:31 PM, Evert Lammerts evert.lamme...@sara.nl wrote: You can get 12-24 TB in a server today, which means the loss of a server generates a lot of traffic -which argues for 10 Gbe. But -big increase in switch cost, especially if you (CoI warning) go with Cisco -there have been problems with things like BIOS PXE and lights out management on 10 Gbe -probably due to the NICs being things the BIOS wasn't expecting and off the mainboard. This should improve. -I don't know how well linux works with ether that fast (field reports useful) -the big threat is still ToR switch failure, as that will trigger a re-replication of every block in the rack. Keeping the amount of disks per node low and the amount of nodes high should keep the impact of dead nodes in control. A ToR switch failing is different - missing 30 nodes (~120TB) at once cannot be fixed by adding more nodes; that actually increases the chance of ToR switch failure. Although such failure is quite rare to begin with, I guess. The back-of-the-envelope calculation I made suggests that ~150 (1U) nodes should be fine with 1Gb ethernet. (e.g., when 6 nodes fail in a cluster with 150 nodes with four 2TB disks each, with HDFS 60% full, it takes around ~32 minutes to recover. 2 nodes failing should take around 640 seconds. Also see the attached spreadsheet.)
This doesn't take ToR switch failure into account though. On the other hand - 150 nodes is only ~5 racks - in such a scenario you might want to shut the system down completely rather than letting it replicate 20% of all data. Cheers, Evert
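For readers without the spreadsheet, here is a minimal sketch of the kind of calculation Evert describes. It assumes 2 TB = 2x10^12 bytes and roughly 100 MB/s of usable re-replication bandwidth per surviving 1 GbE node; those are my assumptions, not necessarily what the spreadsheet uses, but they land close to the ~32 minute and ~640 second figures above.

    # Back-of-the-envelope HDFS re-replication time after node failures.
    # Assumptions: 2 TB = 2e12 bytes, ~100 MB/s usable bandwidth per surviving 1 GbE node.
    def recovery_time_seconds(total_nodes=150, disks_per_node=4, disk_tb=2.0,
                              hdfs_fill=0.6, failed_nodes=6, node_mb_per_s=100.0):
        data_per_node = disks_per_node * disk_tb * 1e12 * hdfs_fill  # bytes to re-replicate per lost node
        survivors = total_nodes - failed_nodes
        aggregate_bw = survivors * node_mb_per_s * 1e6               # bytes/s the survivors can absorb
        return failed_nodes * data_per_node / aggregate_bw

    print(recovery_time_seconds(failed_nodes=6) / 60)   # ~33 minutes (the "~32 minutes" above)
    print(recovery_time_seconds(failed_nodes=2))        # ~650 seconds (the "~640 seconds" above)
    print(recovery_time_seconds(failed_nodes=6, node_mb_per_s=1000.0) / 60)  # ~3 minutes at 10 GbE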
Re: 1Gb vs 10Gb eth (WAS: Re: Hadoop Java Versions)
What's the justification for a management interface? Doesn't that increase complexity? Also you still twice the ports? On Jun 30, 2011 11:54 PM, Evert Lammerts evert.lamme...@sara.nl wrote: Keeping the amount of disks per node low and the amount of nodes high should keep the impact of dead nodes in control. It keeps the impact of dead nodes in control but I don't think thats long-term cost efficient. As prices of 10GbE go down, the keep the node small arguement seems less fitting. And on another note, most servers manufactured in the last 10 years have dual 1GbE network interfaces. If one were to go by these calcs: 150 nodes with four 2TB disks each, with HDFS 60% full, it takes around ~32 minutes to recover It seems like that assumes a single 1GbE interface, why not leverage the second? I don't know how others setup up their clusters. We have the tradition that every node in a cluster has at least three interfaces - one for interconnects, one for a management network (only reachable from within our own network and the primary interface for our admins, accessible only through a single management node) and one for ILOM, DRAC or whatever lights out manager is available. This doesn't leave us room for bonding interfaces on off the shelf nodes. Plus - you'd need twice as many ports in your switch. In the case of Hadoop we're considering adding a fourth NIC for external connectivity. We don't want people interacting with HDFS from outside while jobs are using the interconnects. Of course the choice for 1 or 10Gb ethernet is a function of price. As 10Gb ethernet prices are approaching that of 1Gb ethernet it gets more attractive. The recovery times scale linearly with ethernet speed, so 1Gb ethernet compared to 2Gb bonded ethernet or 10Gb ethernet makes quite a difference. I'm just saying that since we have other variables to tweak - amount of disks and number of nodes - we can limit the impact of minimizing recovery times. Another thing to consider is that as 10Gb ethernet gets cheaper, it gets more attractive to stop using HDFS (or at least, data locality) and start using an external storage cluster. Compute node failure then has no impact, disk failure is hardly noticed by compute nodes. But this is really still very far from as cheap as many small nodes with relatively little disks - I really like the bang for buck you get with Hadoop :-) Evert On Thu, Jun 30, 2011 at 2:31 PM, Evert Lammerts evert.lamme...@sara.nl wrote: You can get 12-24 TB in a server today, which means the loss of a server generates a lot of traffic -which argues for 10 Gbe. But -big increase in switch cost, especially if you (CoI warning) go with Cisco -there have been problems with things like BIOS PXE and lights out management on 10 Gbe -probably due to the NICs being things the BIOS wasn't expecting and off the mainboard. This should improve. -I don't know how well linux works with ether that fast (field reports useful) -the big threat is still ToR switch failure, as that will trigger a re-replication of every block in the rack. Keeping the amount of disks per node low and the amount of nodes high should keep the impact of dead nodes in control. A ToR switch failing is different - missing 30 nodes (~120TB) at once cannot be fixed by adding more nodes; that actually increases ToR switch failure. Although such failure is quite rare to begin with, I guess. The back-of-the-envelope-calculation I made suggests that ~150 (1U) nodes should be fine with 1Gb ethernet. 
(e.g., when 6 nodes fail in a cluster with 150 nodes with four 2TB disks each, with HDFS 60% full, it takes around ~32 minutes to recover. 2 nodes failing should take around 640 seconds. Also see the attached spreadsheet.) This doesn't take ToR switch failure in account though. On the other hand - 150 nodes is only ~5 racks - in such a scenario you might rather want to shut the system down completely rather than letting it replicate 20% of all data. Cheers, Evert
RE: 1Gb vs 10Gb eth (WAS: Re: Hadoop Java Versions)
What's the justification for a management interface? Doesn't that increase complexity? Also you still twice the ports? That's true. It's a tradition that I haven't questioned before. The reasoning, whether right or wrong, is that user jobs (our users are external) can get in the way of admins. If there's a lot of network traffic it takes a lot longer to re-install a node. It also makes security easier - we have a separate SSH daemon listening on the management net accepting root logins - something the default daemon does not do. Of course the same can be done by running a separate daemon on the public interface, on a port that is only accessible from our internal network. Evert On Jun 30, 2011 11:54 PM, Evert Lammerts evert.lamme...@sara.nl wrote: Keeping the amount of disks per node low and the amount of nodes high should keep the impact of dead nodes in control. It keeps the impact of dead nodes in control but I don't think thats long-term cost efficient. As prices of 10GbE go down, the keep the node small arguement seems less fitting. And on another note, most servers manufactured in the last 10 years have dual 1GbE network interfaces. If one were to go by these calcs: 150 nodes with four 2TB disks each, with HDFS 60% full, it takes around ~32 minutes to recover It seems like that assumes a single 1GbE interface, why not leverage the second? I don't know how others setup up their clusters. We have the tradition that every node in a cluster has at least three interfaces - one for interconnects, one for a management network (only reachable from within our own network and the primary interface for our admins, accessible only through a single management node) and one for ILOM, DRAC or whatever lights out manager is available. This doesn't leave us room for bonding interfaces on off the shelf nodes. Plus - you'd need twice as many ports in your switch. In the case of Hadoop we're considering adding a fourth NIC for external connectivity. We don't want people interacting with HDFS from outside while jobs are using the interconnects. Of course the choice for 1 or 10Gb ethernet is a function of price. As 10Gb ethernet prices are approaching that of 1Gb ethernet it gets more attractive. The recovery times scale linearly with ethernet speed, so 1Gb ethernet compared to 2Gb bonded ethernet or 10Gb ethernet makes quite a difference. I'm just saying that since we have other variables to tweak - amount of disks and number of nodes - we can limit the impact of minimizing recovery times. Another thing to consider is that as 10Gb ethernet gets cheaper, it gets more attractive to stop using HDFS (or at least, data locality) and start using an external storage cluster. Compute node failure then has no impact, disk failure is hardly noticed by compute nodes. But this is really still very far from as cheap as many small nodes with relatively little disks - I really like the bang for buck you get with Hadoop :-) Evert On Thu, Jun 30, 2011 at 2:31 PM, Evert Lammerts evert.lamme...@sara.nl wrote: You can get 12-24 TB in a server today, which means the loss of a server generates a lot of traffic -which argues for 10 Gbe. But -big increase in switch cost, especially if you (CoI warning) go with Cisco -there have been problems with things like BIOS PXE and lights out management on 10 Gbe -probably due to the NICs being things the BIOS wasn't expecting and off the mainboard. This should improve.
-I don't know how well linux works with ether that fast (field reports useful) -the big threat is still ToR switch failure, as that will trigger a re-replication of every block in the rack. Keeping the amount of disks per node low and the amount of nodes high should keep the impact of dead nodes in control. A ToR switch failing is different - missing 30 nodes (~120TB) at once cannot be fixed by adding more nodes; that actually increases ToR switch failure. Although such failure is quite rare to begin with, I guess. The back-of-the-envelope-calculation I made suggests that ~150 (1U) nodes should be fine with 1Gb ethernet. (e.g., when 6 nodes fail in a cluster with 150 nodes with four 2TB disks each, with HDFS 60% full, it takes around ~32 minutes to recover. 2 nodes failing should take around 640 seconds. Also see the attached spreadsheet.) This doesn't take ToR switch failure in account though. On the other hand - 150 nodes is only ~5 racks - in such a scenario you might rather want to shut the system down completely rather than letting it replicate 20% of all data. Cheers, Evert
Re: 1Gb vs 10Gb eth (WAS: Re: Hadoop Java Versions)
On 01/07/2011 07:41, Evert Lammerts wrote: Keeping the amount of disks per node low and the amount of nodes high should keep the impact of dead nodes in control. It keeps the impact of dead nodes in control but I don't think thats long-term cost efficient. As prices of 10GbE go down, the keep the node small arguement seems less fitting. And on another note, most servers manufactured in the last 10 years have dual 1GbE network interfaces. If one were to go by these calcs: 150 nodes with four 2TB disks each, with HDFS 60% full, it takes around ~32 minutes to recover It seems like that assumes a single 1GbE interface, why not leverage the second? I don't know how others setup up their clusters. We have the tradition that every node in a cluster has at least three interfaces - one for interconnects, one for a management network (only reachable from within our own network and the primary interface for our admins, accessible only through a single management node) and one for ILOM, DRAC or whatever lights out manager is available. This doesn't leave us room for bonding interfaces on off the shelf nodes. Plus - you'd need twice as many ports in your switch. Yes, I didn't get into ILO. That can be 100Mbps. In the case of Hadoop we're considering adding a fourth NIC for external connectivity. We don't want people interacting with HDFS from outside while jobs are using the interconnects. Of course the choice for 1 or 10Gb ethernet is a function of price. As 10Gb ethernet prices are approaching that of 1Gb ethernet it gets more attractive. The recovery times scale linearly with ethernet speed, so 1Gb ethernet compared to 2Gb bonded ethernet or 10Gb ethernet makes quite a difference. I'm just saying that since we have other variables to tweak - amount of disks and number of nodes - we can limit the impact of minimizing recovery times. 2Gb bonded with 2x ToR removes a single ToR switch as an SPOF for the rack but increases install/debugging costs. More wires, and your network configuration topology has got worse. Another thing to consider is that as 10Gb ethernet gets cheaper, it gets more attractive to stop using HDFS (or at least, data locality) and start using an external storage cluster. Compute node failure then has no impact, disk failure is hardly noticed by compute nodes. But this is really still very far from as cheap as many small nodes with relatively little disks - I really like the bang for buck you get with Hadoop :-) Yeah, but you are a computing facility whose backbone gets data off the LHC tier 1 sites abroad faster than the seek time of the disks in your neighbouring building...
Re: 1Gb vs 10Gb eth (WAS: Re: Hadoop Java Versions)
On 01/07/2011 08:16, Ryan Rawson wrote: What's the justification for a management interface? Doesn't that increase complexity? Also you still twice the ports? ILO reduces ops complexity. You can push things like BIOS updates out, boot machines into known states, instead of the slowly diverging world that RPM or Debian updates get you into (the final state depends on the order, and different machines end up applying them in a different order), and for diagnosing problems when even the root disk doesn't want to come out and play. In a big cluster you need to worry about things like not powering on a quadrant of the site simultaneously as boot-time can be a peak power surge; you may want to bring up slices of every rack gradually, upgrade the BIOS and OS, and gradually ramp things up. This is where HPC admin tools differ from classic "lock down an end-user Windows PC" tooling.
Re: Hadoop Java Versions
On 01/07/2011 01:16, Ted Dunning wrote: You have to consider the long-term reliability as well. Losing an entire set of 10 or 12 disks at once makes the overall reliability of a large cluster very suspect. This is because it becomes entirely too likely that two additional drives will fail before the data on the off-line node can be replicated. For 100 nodes, that can decrease the average time to data loss down to less than a year. There's also Rodrigo's work on alternate block placement that doesn't scatter blocks quite so randomly across a cluster, so a loss of a node or rack doesn't have adverse effects on so many files https://issues.apache.org/jira/browse/HDFS-1094 Given that most HDD failures happen on cluster reboot, it is possible for 10-12 disks not to come up at the same time, if the cluster has been up for a while, but like Todd says -worry. At least a bit. I've heard hints of one FS that actually includes HDD batch data in block placement, to try and scatter data across batches, and be biased towards using new HDDs for temp storage during burn-in. Some research work on doing that to HDFS could be something to keep some postgraduate busy for a while: disk batch-aware block placement. This can only be mitigated in stock hadoop by keeping the number of drives relatively low. Now I'm confused. Do you mean #of HDDs/server, or HDDs/filesystem? Because it seems to me that stock HDFS's use in production makes it one of the filesystems on the planet with the largest number of non-RAIDed HDDs out there -things like Lustre and IBM GPFS go for RAID, as does HP IBRIX (the last two of which have some form of Hadoop support too, if you ask nicely). HDD/server numbers matter in that in a small cluster, it's better to have fewer disks per machine and more servers to spread the data over; you don't really want your 100 TB in three 1U servers. As your cluster grows -and you care more about storage capacity than raw compute- then the appeal of 24+ TB/server starts to look good, and that's when you care about the improvements to datanodes handling the loss of a worker disk better. Even without that, rebooting the DN may fix things, but the impact on ongoing work is the big issue -you don't just lose a replicated block, you lose data. Cascade failures leading to cluster outages are a separate issue and normally triggered more by switch failure/config than anything else. It doesn't matter how reliable the hardware is if it gets the wrong configuration.
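A toy illustration of the disk batch-aware placement idea Steve floats above. Nothing like this exists in HDFS; the batch IDs, the burn-in window and the scoring are all invented for the sketch, which simply prefers replica targets on distinct nodes and distinct drive batches and keeps drives still in burn-in at the back of the queue.

    import random
    from collections import namedtuple

    Disk = namedtuple("Disk", ["node", "disk_id", "batch", "age_days"])

    BURN_IN_DAYS = 30  # assumed burn-in window for new drives (invented number)

    def choose_targets(disks, replicas=3):
        """Pick replica targets on distinct nodes, spread across drive batches."""
        chosen = []
        for _ in range(replicas):
            used_nodes = {d.node for d in chosen}
            used_batches = {d.batch for d in chosen}
            candidates = [d for d in disks if d.node not in used_nodes]
            if not candidates:
                break
            def score(d):
                return ((d.batch in used_batches),    # prefer a batch we haven't used yet
                        (d.age_days < BURN_IN_DAYS),  # prefer drives past burn-in
                        random.random())              # random tie-break
            chosen.append(min(candidates, key=score))
        return chosen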
Re: 1Gb vs 10Gb eth (WAS: Re: Hadoop Java Versions)
You do have twice the ports, but if you know that the management port is not used for serious data, then you can put a SOHO grade switch on those ports at negligible cost. There is a serious conflict of goals here if you have software that can make serious use of more than one NIC. On the one hand, it is nice to use the hardware you have. On the other, it is really nice to have guaranteed bandwidth for management. With traditional switch level link aggregation this is usually not a problem since each flow is committed to one NIC or the other, resulting in poor balancing. The silver lining of the poor balancing is that there is always plenty of bandwidth for administration functions. Though it may be slightly controversial to mention, the way that MapR does application level NIC bonding could cause some backlash from ops staff because it can actually saturate as many NICs as you make available. Generally, management doesn't require much bandwidth and doing things like reloading the BIOS is usually done when a machine is out of service for maintenance, but the potential for surprise is there. I have definitely seen a number of conditions where service access is a complete god-send. With geographically dispersed data centers, it is an absolute requirement because you just can't staff every data center with enough hands-on admins within a few minutes travel time of the data center. ILO (or DRAC as Dell calls it) gives you 5 minute response time to fixing totally horked machines. On Fri, Jul 1, 2011 at 8:47 AM, Steve Loughran ste...@apache.org wrote: On 01/07/2011 08:16, Ryan Rawson wrote: What's the justification for a management interface? Doesn't that increase complexity? Also you still twice the ports? ILO reduces ops complexity. You can push things like BIOS updates out, boot machines into known states, instead of the slowly diverging world that RPM or Debian updates get you into (the final state depends on the order, and different machines end up applying them in a different order), and for diagnosing problems when even the root disk doesn't want to come out and play. In a big cluster you need to worry about things like not powering on a quadrant of the site simultaneously as boot-time can be a peak power surge; you may want to bring up slices of every rack gradually, upgrade the BIOS and OS, and gradually ramp things up. This is where HPC admin tools differ from classic "lock down an end-user Windows PC" tooling.
Re: Hadoop Java Versions
Although this thread is wandering a bit, I disagree strongly that it is inappropriate to discuss other vendor specific features (or competing compute platform features) on general@. The topic has become the factors that influence hardware purchase choices, and one of those is how the system deals with disk failure. Compare/contrast with other platforms is healthy for the Hadoop project! On 6/30/11 9:47 PM, Ian Holsman had...@holsman.net wrote: On Jul 1, 2011, at 2:08 PM, M. C. Srivas wrote: On Thu, Jun 30, 2011 at 5:24 PM, Todd Lipcon t...@cloudera.com wrote: I'd advise you to look at stock hadoop again. This used to be true, but was fixed a long while back by HDFS-457 and several followup JIRAs. If MapR does something fancier, I'm sure we'd be interested to hear about it so we can compare the approaches. -Todd MapR tracks disk responsiveness. In other words, a moving histogram of IO-completion times is maintained internally, and if a disk starts getting really slow, it is pre-emptively taken offline so it does not create long tails for running jobs (and the data on the disk is re-replicated using whatever re-replication policy is in place). One of the benefits of managing the disks directly instead of through ext3 / xfs / or other ... All these stats can be fed into Ganglia (or pushed out centrally via a text file that can be pulled out using NFS) if historical info about disk behavior (and failures) needs to be preserved. - Srivas. While I am intrigued about how MapR performs internally, I don't think this is the forum for it. please keep MapR (and other vendor specific discussions) on their respective support forums. Thanks! Ian.
Re: Hadoop Java Versions
On Fri, Jul 1, 2011 at 9:12 AM, Steve Loughran ste...@apache.org wrote: On 01/07/2011 01:16, Ted Dunning wrote: You have to consider the long-term reliability as well. Losing an entire set of 10 or 12 disks at once makes the overall reliability of a large cluster very suspect. This is because it becomes entirely too likely that two additional drives will fail before the data on the off-line node can be replicated. For 100 nodes, that can decrease the average time to data loss down to less than a year. There's also Rodrigo's work on alternate block placement that doesn't scatter blocks quite so randomly across a cluster, so a loss of a node or rack doesn't have adverse effects on so many files https://issues.apache.org/jira/browse/HDFS-1094 I did calculations based on this as well. The heuristic level of the computation is pretty simple, but to go any deeper, you have a pretty hairy computation. My own approach was to use Markov chain Monte Carlo to sample from the failure mode distribution. The codes that I wrote for this used pluggable placement, replication and failure models. I may have lacked sufficient cleverness at the time, but it was very difficult to come up with structured placement policies that actually improved the failure probabilities. Most such strategies massively decreased average probabilities. My suspicion by analogy with large structured error correction codes is that there are structured placement policies that perform well, but that in reasonably large clusters (number of disks greater than 50, say), random placement will be within epsilon of the best possible strategy with very high probability. Given that most HDD failures happen on cluster reboot, it is possible for 10-12 disks not to come up at the same time, if the cluster has been up for a while, but like Todd says -worry. At least a bit. Indeed. Thank goodness also that disk manufacturers tend to be pessimistic in quoting MTBF. These possibilities of correlated failure seriously complicate these computations, of course. I've heard hints of one FS that actually includes HDD batch data in block placement, to try and scatter data across batches, and be biased towards using new HDDs for temp storage during burn-in. Some research work on doing that to HDFS could be something to keep some postgraduate busy for a while: disk batch-aware block placement. Sadly, I can't comment on my knowledge of this except to say that there are non-obvious solutions to this that are embedded in at least one commercial map-reduce related product. I can't say which without getting chastised. This can only be mitigated in stock hadoop by keeping the number of drives relatively low. Now I'm confused. Do you mean #of HDDs/server, or HDDs/filesystem? Per system. ..with the largest number of non-RAIDed HDDs out there -things like Lustre and IBM GPFS go for RAID, as does HP IBRIX (the last two of which have some form of Hadoop support too, if you ask nicely). HDD/server numbers matter in that in a small cluster, it's better to have fewer disks per machine and more servers to spread the data over; you don't really want your 100 TB in three 1U servers. As your cluster grows -and you care more about storage capacity than raw compute- then the appeal of 24+ TB/server starts to look good, and that's when you care about the improvements to datanodes handling the loss of a worker disk better.
Even without that, rebooting the DN may fix things, but the impact on ongoing work is the big issue -you don't just lose a replicated block, you lose data. Generally, I agree with what you say. The effect of RAID is to squeeze the error distributions around so that partial failures have lower probability. This is complex in the aggregate. Cascade failures leading to cluster outages are a separate issue and normally triggered by switch failure/config than anything else. It doesn't matter how reliable the hardware is if it gets the wrong configuration Indeed.
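A heavily simplified sketch of the kind of experiment Ted describes: plain Monte Carlo rather than MCMC, a one-shot toy failure model, and a pluggable placement function so a structured policy can be compared against random placement. All of the parameters are invented.

    import random

    def random_placement(disks, replicas):
        # Random placement: each block's replicas land on distinct random disks.
        return random.sample(range(disks), replicas)

    def trial(place=random_placement, disks=600, blocks=50_000, replicas=3, failed_disks=12):
        placements = [place(disks, replicas) for _ in range(blocks)]
        dead = set(random.sample(range(disks), failed_disks))  # e.g. one 12-disk node going offline
        # Data loss in this toy model: some block has every replica on a dead disk.
        return any(all(d in dead for d in repl) for repl in placements)

    def loss_probability(trials=100, **kw):
        return sum(trial(**kw) for _ in range(trials)) / trials

    print(loss_probability())  # fraction of trials that lose at least one block

Swapping random_placement for a structured policy and re-running is the basic comparison Ted describes, though a serious version would also model re-replication over time rather than a single snapshot of failures.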
Re: Hadoop Java Versions
i definitely agree with scott. as a user of the hadoop open source stack for building our banking focused big data analytics applications, i speak on behalf of our clients and the emerging hadoop eco-system that open and honest conversations on this thread/group, irrespective of whether one represents a company or apache, should be encouraged. as an instance, with the fact that cloudera, mapR and soon hortonworks are all going to be offering competing hadoop distros for enterprises, it is important for all of us (and prospective users) to understand what they are doing to address critical gaps on the platform, and how the hadoop ecosystem benefits from it. From our perspective, it doesn't matter if one is better than the other (which is not the point i saw ted or mc making), but that companies, startups, apache and everybody else: 1. is thinking of the right issues 2. willing to solve them (and ideally contributing the solutions back) and 3. informing the exploding hadoop userbase of what not to do I see it benefitting all of us, especially as Hadoop rapidly jumps the transom and becomes the platform of choice for data management in industries like banking, retail and healthcare...just as it has for social media and the web... isn't that what we are launching our business plans around anyway... And in that sense we all owe ASF and the hadoop community (and not any one company) an equal amount of gratitude, humility and respect. On Jul 1, 2011, at 1:22 PM, Scott Carey wrote: Although this thread is wandering a bit, I disagree strongly that it is inappropriate to discuss other vendor specific features (or competing compute platform features) on general@. The topic has become the factors that influence hardware purchase choices, and one of those is how the system deals with disk failure. Compare/contrast with other platforms is healthy for the Hadoop project! +1 On 6/30/11 9:47 PM, Ian Holsman had...@holsman.net wrote: On Jul 1, 2011, at 2:08 PM, M. C. Srivas wrote: On Thu, Jun 30, 2011 at 5:24 PM, Todd Lipcon t...@cloudera.com wrote: I'd advise you to look at stock hadoop again. This used to be true, but was fixed a long while back by HDFS-457 and several followup JIRAs. If MapR does something fancier, I'm sure we'd be interested to hear about it so we can compare the approaches. -Todd MapR tracks disk responsiveness. In other words, a moving histogram of IO-completion times is maintained internally, and if a disk starts getting really slow, it is pre-emptively taken offline so it does not create long tails for running jobs (and the data on the disk is re-replicated using whatever re-replication policy is in place). One of the benefits of managing the disks directly instead of through ext3 / xfs / or other ... All these stats can be fed into Ganglia (or pushed out centrally via a text file that can be pulled out using NFS) if historical info about disk behavior (and failures) needs to be preserved. - Srivas. While I am intrigued about how MapR performs internally, I don't think this is the forum for it. please keep MapR (and other vendor specific discussions) on their respective support forums. Thanks! Ian.
Re: Hadoop Java Versions
On Fri, Jul 1, 2011 at 11:53 AM, Abhishek Mehta abhis...@tresata.comwrote: open and honest conversations on this thread/group, irrespective of whether one represents a company or apache, should be encouraged. Paradoxically, I partially agree with Ian on this. On the one hand, it is important to hear about alternatives in the same eco-system (and alternative approaches from other communities). That is a bit different from Ian's view. Where I agree with him is that discussions should not be in the form of simple breast beating. Discussions should be driven by data and experience. All data and experience are of equal value if they represent quality thought. The only place that company names should figure in this is as a bookmark so you can tell which product/release has the characteristic under consideration. ... And in that sense we all owe ASF and the hadoop community (and not any one company) an equal amount of gratitude, humility and respect. This doesn't get said nearly enough.
RE: Hadoop Java Versions
You can get 12-24 TB in a server today, which means the loss of a server generates a lot of traffic -which argues for 10 Gbe. But -big increase in switch cost, especially if you (CoI warning) go with Cisco -there have been problems with things like BIOS PXE and lights out management on 10 Gbe -probably due to the NICs being things the BIOS wasn't expecting and off the mainboard. This should improve. -I don't know how well linux works with ether that fast (field reports useful) -the big threat is still ToR switch failure, as that will trigger a re-replication of every block in the rack. Keeping the amount of disks per node low and the amount of nodes high should keep the impact of dead nodes in control. A ToR switch failing is different - missing 30 nodes (~120TB) at once cannot be fixed by adding more nodes; that actually increases ToR switch failure. Although such failure is quite rare to begin with, I guess. The back-of-the-envelope-calculation I made suggests that ~150 (1U) nodes should be fine with 1Gb ethernet. (e.g., when 6 nodes fail in a cluster with 150 nodes with four 2TB disks each, with HDFS 60% full, it takes around ~32 minutes to recover. 2 nodes failing should take around 640 seconds. Also see the attached spreadsheet.) This doesn't take ToR switch failure in account though. On the other hand - 150 nodes is only ~5 racks - in such a scenario you might rather want to shut the system down completely rather than letting it replicate 20% of all data. Cheers, Evert
RE: Hadoop Java Versions
That's not a question I'm qualified to answer. I do know we're now buying an Arista for a different cluster, but there's probably loads others out there. *forwarded to general@...* From: Abhishek Mehta [abhis...@tresata.com] Sent: Thursday, June 30, 2011 11:38 PM To: Evert Lammerts Subject: Fwd: Hadoop Java Versions what are the other switch options (other than cisco that is?)? cheers Abhishek Mehta (e) abhis...@tresata.commailto:abhis...@tresata.com (v) 980.355.9855 Begin forwarded message: From: Evert Lammerts evert.lamme...@sara.nlmailto:evert.lamme...@sara.nl Date: June 30, 2011 5:31:26 PM EDT To: general@hadoop.apache.orgmailto:general@hadoop.apache.org general@hadoop.apache.orgmailto:general@hadoop.apache.org Subject: RE: Hadoop Java Versions Reply-To: general@hadoop.apache.orgmailto:general@hadoop.apache.org You can get 12-24 TB in a server today, which means the loss of a server generates a lot of traffic -which argues for 10 Gbe. But -big increase in switch cost, especially if you (CoI warning) go with Cisco -there have been problems with things like BIOS PXE and lights out management on 10 Gbe -probably due to the NICs being things the BIOS wasn't expecting and off the mainboard. This should improve. -I don't know how well linux works with ether that fast (field reports useful) -the big threat is still ToR switch failure, as that will trigger a re-replication of every block in the rack. Keeping the amount of disks per node low and the amount of nodes high should keep the impact of dead nodes in control. A ToR switch failing is different - missing 30 nodes (~120TB) at once cannot be fixed by adding more nodes; that actually increases ToR switch failure. Although such failure is quite rare to begin with, I guess. The back-of-the-envelope-calculation I made suggests that ~150 (1U) nodes should be fine with 1Gb ethernet. (e.g., when 6 nodes fail in a cluster with 150 nodes with four 2TB disks each, with HDFS 60% full, it takes around ~32 minutes to recover. 2 nodes failing should take around 640 seconds. Also see the attached spreadsheet.) This doesn't take ToR switch failure in account though. On the other hand - 150 nodes is only ~5 racks - in such a scenario you might rather want to shut the system down completely rather than letting it replicate 20% of all data. Cheers, Evert
Re: Hadoop Java Versions
Keeping the amount of disks per node low and the amount of nodes high should keep the impact of dead nodes in control. It keeps the impact of dead nodes in control but I don't think thats long-term cost efficient. As prices of 10GbE go down, the keep the node small arguement seems less fitting. And on another note, most servers manufactured in the last 10 years have dual 1GbE network interfaces. If one were to go by these calcs: 150 nodes with four 2TB disks each, with HDFS 60% full, it takes around ~32 minutes to recover It seems like that assumes a single 1GbE interface, why not leverage the second? On Thu, Jun 30, 2011 at 2:31 PM, Evert Lammerts evert.lamme...@sara.nlwrote: You can get 12-24 TB in a server today, which means the loss of a server generates a lot of traffic -which argues for 10 Gbe. But -big increase in switch cost, especially if you (CoI warning) go with Cisco -there have been problems with things like BIOS PXE and lights out management on 10 Gbe -probably due to the NICs being things the BIOS wasn't expecting and off the mainboard. This should improve. -I don't know how well linux works with ether that fast (field reports useful) -the big threat is still ToR switch failure, as that will trigger a re-replication of every block in the rack. Keeping the amount of disks per node low and the amount of nodes high should keep the impact of dead nodes in control. A ToR switch failing is different - missing 30 nodes (~120TB) at once cannot be fixed by adding more nodes; that actually increases ToR switch failure. Although such failure is quite rare to begin with, I guess. The back-of-the-envelope-calculation I made suggests that ~150 (1U) nodes should be fine with 1Gb ethernet. (e.g., when 6 nodes fail in a cluster with 150 nodes with four 2TB disks each, with HDFS 60% full, it takes around ~32 minutes to recover. 2 nodes failing should take around 640 seconds. Also see the attached spreadsheet.) This doesn't take ToR switch failure in account though. On the other hand - 150 nodes is only ~5 racks - in such a scenario you might rather want to shut the system down completely rather than letting it replicate 20% of all data. Cheers, Evert
Re: Hadoop Java Versions
You have to consider the long-term reliability as well. Losing an entire set of 10 or 12 disks at once makes the overall reliability of a large cluster very suspect. This is because it becomes entirely too likely that two additional drives will fail before the data on the off-line node can be replicated. For 100 nodes, that can decrease the average time to data loss down to less than a year. This can only be mitigated in stock hadoop by keeping the number of drives relatively low. MapR avoids this by not failing nodes for trivial problems. On Thu, Jun 30, 2011 at 4:18 PM, Aaron Eng a...@maprtech.com wrote: Keeping the amount of disks per node low and the amount of nodes high should keep the impact of dead nodes in control. It keeps the impact of dead nodes in control but I don't think thats long-term cost efficient. As prices of 10GbE go down, the keep the node small arguement seems less fitting. And on another note, most servers manufactured in the last 10 years have dual 1GbE network interfaces. If one were to go by these calcs: 150 nodes with four 2TB disks each, with HDFS 60% full, it takes around ~32 minutes to recover It seems like that assumes a single 1GbE interface, why not leverage the second? On Thu, Jun 30, 2011 at 2:31 PM, Evert Lammerts evert.lamme...@sara.nl wrote: You can get 12-24 TB in a server today, which means the loss of a server generates a lot of traffic -which argues for 10 Gbe. But -big increase in switch cost, especially if you (CoI warning) go with Cisco -there have been problems with things like BIOS PXE and lights out management on 10 Gbe -probably due to the NICs being things the BIOS wasn't expecting and off the mainboard. This should improve. -I don't know how well linux works with ether that fast (field reports useful) -the big threat is still ToR switch failure, as that will trigger a re-replication of every block in the rack. Keeping the amount of disks per node low and the amount of nodes high should keep the impact of dead nodes in control. A ToR switch failing is different - missing 30 nodes (~120TB) at once cannot be fixed by adding more nodes; that actually increases ToR switch failure. Although such failure is quite rare to begin with, I guess. The back-of-the-envelope-calculation I made suggests that ~150 (1U) nodes should be fine with 1Gb ethernet. (e.g., when 6 nodes fail in a cluster with 150 nodes with four 2TB disks each, with HDFS 60% full, it takes around ~32 minutes to recover. 2 nodes failing should take around 640 seconds. Also see the attached spreadsheet.) This doesn't take ToR switch failure in account though. On the other hand - 150 nodes is only ~5 racks - in such a scenario you might rather want to shut the system down completely rather than letting it replicate 20% of all data. Cheers, Evert
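One way to see the shape of that argument is a crude Poisson estimate of the chance that two more disks fail while a dead node's data is still being re-replicated. Every number below (annualized failure rate, recovery window, node size) is an assumption of mine, and turning the result into an actual time to data loss also needs a placement-overlap term, so this illustrates the structure of the reasoning rather than reproducing the specific less-than-a-year figure.

    from math import exp

    def p_two_more_failures(nodes=100, disks_per_node=12, afr=0.05, recovery_hours=6.0):
        """P(at least 2 additional disk failures during one node's re-replication window),
        modelling disk failures as a Poisson process with an assumed AFR and window."""
        other_disks = (nodes - 1) * disks_per_node
        lam = other_disks * afr * recovery_hours / (365 * 24)  # expected failures in the window
        return 1 - exp(-lam) - lam * exp(-lam)

    def node_loss_events_per_year(nodes=100, disks_per_node=12, afr=0.05):
        # Crude: any single disk failure takes its whole node out of service,
        # which is the behaviour being discussed in this thread.
        return nodes * disks_per_node * afr

    p, events = p_two_more_failures(), node_loss_events_per_year()
    print(p, events, p * events)  # expected double-failure-during-recovery windows per year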
Re: Hadoop Java Versions
On Thu, Jun 30, 2011 at 5:16 PM, Ted Dunning tdunn...@maprtech.com wrote: You have to consider the long-term reliability as well. Losing an entire set of 10 or 12 disks at once makes the overall reliability of a large cluster very suspect. This is because it becomes entirely too likely that two additional drives will fail before the data on the off-line node can be replicated. For 100 nodes, that can decrease the average time to data loss down to less than a year. This can only be mitigated in stock hadoop by keeping the number of drives relatively low. MapR avoids this by not failing nodes for trivial problems. I'd advise you to look at stock hadoop again. This used to be true, but was fixed a long while back by HDFS-457 and several followup JIRAs. If MapR does something fancier, I'm sure we'd be interested to hear about it so we can compare the approaches. -Todd On Thu, Jun 30, 2011 at 4:18 PM, Aaron Eng a...@maprtech.com wrote: Keeping the amount of disks per node low and the amount of nodes high should keep the impact of dead nodes in control. It keeps the impact of dead nodes in control but I don't think thats long-term cost efficient. As prices of 10GbE go down, the keep the node small arguement seems less fitting. And on another note, most servers manufactured in the last 10 years have dual 1GbE network interfaces. If one were to go by these calcs: 150 nodes with four 2TB disks each, with HDFS 60% full, it takes around ~32 minutes to recover It seems like that assumes a single 1GbE interface, why not leverage the second? On Thu, Jun 30, 2011 at 2:31 PM, Evert Lammerts evert.lamme...@sara.nl wrote: You can get 12-24 TB in a server today, which means the loss of a server generates a lot of traffic -which argues for 10 Gbe. But -big increase in switch cost, especially if you (CoI warning) go with Cisco -there have been problems with things like BIOS PXE and lights out management on 10 Gbe -probably due to the NICs being things the BIOS wasn't expecting and off the mainboard. This should improve. -I don't know how well linux works with ether that fast (field reports useful) -the big threat is still ToR switch failure, as that will trigger a re-replication of every block in the rack. Keeping the amount of disks per node low and the amount of nodes high should keep the impact of dead nodes in control. A ToR switch failing is different - missing 30 nodes (~120TB) at once cannot be fixed by adding more nodes; that actually increases ToR switch failure. Although such failure is quite rare to begin with, I guess. The back-of-the-envelope-calculation I made suggests that ~150 (1U) nodes should be fine with 1Gb ethernet. (e.g., when 6 nodes fail in a cluster with 150 nodes with four 2TB disks each, with HDFS 60% full, it takes around ~32 minutes to recover. 2 nodes failing should take around 640 seconds. Also see the attached spreadsheet.) This doesn't take ToR switch failure in account though. On the other hand - 150 nodes is only ~5 racks - in such a scenario you might rather want to shut the system down completely rather than letting it replicate 20% of all data. Cheers, Evert -- Todd Lipcon Software Engineer, Cloudera
Re: Hadoop Java Versions
Good point Todd. I was speaking from the experience of people I know who are using 0.20.x On Thu, Jun 30, 2011 at 5:24 PM, Todd Lipcon t...@cloudera.com wrote: On Thu, Jun 30, 2011 at 5:16 PM, Ted Dunning tdunn...@maprtech.com wrote: You have to consider the long-term reliability as well. Losing an entire set of 10 or 12 disks at once makes the overall reliability of a large cluster very suspect. This is because it becomes entirely too likely that two additional drives will fail before the data on the off-line node can be replicated. For 100 nodes, that can decrease the average time to data loss down to less than a year. This can only be mitigated in stock hadoop by keeping the number of drives relatively low. MapR avoids this by not failing nodes for trivial problems. I'd advise you to look at stock hadoop again. This used to be true, but was fixed a long while back by HDFS-457 and several followup JIRAs. If MapR does something fancier, I'm sure we'd be interested to hear about it so we can compare the approaches.
Re: Hadoop Java Versions
On Thu, Jun 30, 2011 at 5:24 PM, Todd Lipcon t...@cloudera.com wrote: I'd advise you to look at stock hadoop again. This used to be true, but was fixed a long while back by HDFS-457 and several followup JIRAs. If MapR does something fancier, I'm sure we'd be interested to hear about it so we can compare the approaches. -Todd MapR tracks disk responsiveness. In other words, a moving histogram of IO-completion times is maintained internally, and if a disk starts getting really slow, it is pre-emptively taken offline so it does not create long tails for running jobs (and the data on the disk is re-replicated using whatever re-replication policy is in place). One of the benefits of managing the disks directly instead of through ext3 / xfs / or other ... All these stats can be fed into Ganglia (or pushed out centrally via a text file that can be pulled out using NFS) if historical info about disk behavior (and failures) needs to be preserved. - Srivas.
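This is not MapR's code, only a sketch of the general mechanism Srivas describes: keep a moving window of I/O completion times per disk and flag any disk whose tail latency drifts well above its peers. The window size, percentile and threshold are invented.

    from collections import deque

    class DiskLatencyTracker:
        def __init__(self, window=1000):
            self.samples = {}   # disk id -> deque of recent I/O completion times (ms)
            self.window = window

        def record(self, disk, completion_ms):
            self.samples.setdefault(disk, deque(maxlen=self.window)).append(completion_ms)

        def p99(self, disk):
            s = sorted(self.samples.get(disk, []))
            return s[int(0.99 * (len(s) - 1))] if s else 0.0

        def slow_disks(self, factor=5.0, min_samples=200):
            """Disks whose p99 completion time is `factor` times the median p99 of their peers."""
            p99s = {d: self.p99(d) for d, s in self.samples.items() if len(s) >= min_samples}
            if len(p99s) < 2:
                return []
            median = sorted(p99s.values())[len(p99s) // 2]
            return [d for d, v in p99s.items() if v > factor * median]

A disk flagged this way would then be taken out of service pre-emptively and its data re-replicated, which is the behaviour Srivas describes.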
Re: Hadoop Java Versions
On Jul 1, 2011, at 2:08 PM, M. C. Srivas wrote: On Thu, Jun 30, 2011 at 5:24 PM, Todd Lipcon t...@cloudera.com wrote: I'd advise you to look at stock hadoop again. This used to be true, but was fixed a long while back by HDFS-457 and several followup JIRAs. If MapR does something fancier, I'm sure we'd be interested to hear about it so we can compare the approaches. -Todd MapR tracks disk responsiveness. In other words, a moving histogram of IO-completion times is maintained internally, and if a disk starts getting really slow, it is pre-emptively taken offline so it does not create long tails for running jobs (and the data on the disk is re-replicated using whatever re-replication policy is in place). One of the benefits of managing the disks directly instead of through ext3 / xfs / or other ... All these stats can be fed into Ganglia (or pushed out centrally via a text file that can be pulled out using NFS) if historical info about disk behavior (and failures) needs to be preserved. - Srivas. While I am intrigued about how MapR performs internally, I don't think this is the forum for it. please keep MapR (and other vendor specific discussions) on their respective support forums. Thanks! Ian.
Re: Hadoop Java Versions
No worries. I read Todd's post as asking for elaboration ... sometimes knowing what another similar system does helps in improving your own. On Thu, Jun 30, 2011 at 9:47 PM, Ian Holsman had...@holsman.net wrote: On Jul 1, 2011, at 2:08 PM, M. C. Srivas wrote: On Thu, Jun 30, 2011 at 5:24 PM, Todd Lipcon t...@cloudera.com wrote: I'd advise you to look at stock hadoop again. This used to be true, but was fixed a long while back by HDFS-457 and several followup JIRAs. If MapR does something fancier, I'm sure we'd be interested to hear about it so we can compare the approaches. -Todd MapR tracks disk responsiveness. In other words, a moving histogram of IO-completion times is maintained internally, and if a disk starts getting really slow, it is pre-emptively taken offline so it does not create long tails for running jobs (and the data on the disk is re-replicated using whatever re-replication policy is in place). One of the benefits of managing the disks directly instead of through ext3 / xfs / or other ... All these stats can be fed into Ganglia (or pushed out centrally via a text file that can be pulled out using NFS) if historical info about disk behavior (and failures) needs to be preserved. - Srivas. While I am intrigued about how MapR performs internally, I don't think this is the forum for it. please keep MapR (and other vendor specific discussions) on their respective support forums. Thanks! Ian.
Re: Hadoop Java Versions
Ian, Good point. Srivas was responding to Todd's question, but there might be better fora as you suggest. We have a good one for specific questions about MapR at http://answers.mapr.com That doesn't, however, really provide a useful forum for questions like Todd's which really spans both domains. Where would you suggest for questions that span the two subjects? On Thu, Jun 30, 2011 at 9:47 PM, Ian Holsman had...@holsman.net wrote: If MapR does something fancier, I'm sure we'd be interested to hear about it so we can compare the approaches. -Todd MapR tracks disk responsiveness. In other words, a moving histogram of IO-completion times is maintained internally, and if a disk starts getting really slow, it is pre-emptively taken offline so it does not create long tails for running jobs (and the data on the disk is re-replicated using whatever re-replication policy is in place). One of the benefits of managing the disks directly instead of through ext3 / xfs / or other ... All these stats can be fed into Ganglia (or pushed out centrally via a text file that can be pulled out using NFS) if historical info about disk behavior (and failures) needs to be preserved. - Srivas. While I am intrigued about how MapR performs internally, I don't think this is the forum for it. please keep MapR (and other vendor specific discussions) on their respective support forums.
Re: Hadoop Java Versions
On 28/06/11 04:49, Segel, Mike wrote: Hmmm. I could have sworn there was a background balancing bandwidth limiter. There is, for the rebalancer; node outages are taken more seriously, though there have been problems in past 0.20.x where there was a risk of a cascade failure on a big switch/rack failure. The risk has been reduced, though we all await field reports to confirm this :) You can get 12-24 TB in a server today, which means the loss of a server generates a lot of traffic -which argues for 10 Gbe. But -big increase in switch cost, especially if you (CoI warning) go with Cisco -there have been problems with things like BIOS PXE and lights out management on 10 Gbe -probably due to the NICs being things the BIOS wasn't expecting and off the mainboard. This should improve. -I don't know how well linux works with ether that fast (field reports useful) -the big threat is still ToR switch failure, as that will trigger a re-replication of every block in the rack. 2x1 Gbe lets you have redundant switches, albeit at the price of more wiring, more things to go wrong with the wiring, etc. The other thing to consider is how well the enterprise switches work in this world -with a Hadoop cluster you can really test those claims about how well the switches handle every port lighting up at full rate. Indeed, I recommend that as part of your acceptance tests for the switch.
Re: Hadoop Java Versions
You're preaching to the choir. :-) With Sandy Bridge, you're going to start seeing 10 GbE on the motherboard. We built our clusters using 1U boxes where you're stuck with four 3.5" drives. With a larger chassis, you can fit an additional controller card and more drives. More drives reduce the bottleneck and mean your performance will be throttled by your network and the amount of memory. I priced out a couple of vendors and when you build out your boxes, the magic number per data node is $10,000.00 USD. (Budget this amount per data node.) Moore's Law doesn't drop the price, but it gets you more bang for your buck. Note that this magic number is pre-discount and YMMV. [This also gets into what is meant by commodity hardware.] I agree that 10GbE is a necessity and I have been looking at it for the past 2 years, only to be shot down by my client's IT group. I agree that Cisco's ToR switches are expensive, however there are Arista and Blade Networks switches that claim to be Cisco-friendly and aren't too pricey. Somewhere around 10K a box. (Again YMMV). If you want to upgrade existing boxes, you will probably want to look at Solarflare cards. JMHO... Sent from a remote device. Please excuse any typos... Mike Segel On Jun 28, 2011, at 4:59 AM, Steve Loughran ste...@apache.org wrote: On 28/06/11 04:49, Segel, Mike wrote: Hmmm. I could have sworn there was a background balancing bandwidth limiter. There is, for the rebalancer; node outages are taken more seriously, though there have been problems in past 0.20.x where there was a risk of a cascade failure on a big switch/rack failure. The risk has been reduced, though we all await field reports to confirm this :) You can get 12-24 TB in a server today, which means the loss of a server generates a lot of traffic -which argues for 10 Gbe. But -big increase in switch cost, especially if you (CoI warning) go with Cisco -there have been problems with things like BIOS PXE and lights out management on 10 Gbe -probably due to the NICs being things the BIOS wasn't expecting and off the mainboard. This should improve. -I don't know how well linux works with ether that fast (field reports useful) -the big threat is still ToR switch failure, as that will trigger a re-replication of every block in the rack. 2x1 Gbe lets you have redundant switches, albeit at the price of more wiring, more things to go wrong with the wiring, etc. The other thing to consider is how well the enterprise switches work in this world -with a Hadoop cluster you can really test those claims about how well the switches handle every port lighting up at full rate. Indeed, I recommend that as part of your acceptance tests for the switch.
Re: Hadoop Java Versions
We at Yahoo are about to deploy code to ensure a disk failure on a datanode is just that -- a disk failure, not a node failure. This really helps avoid replication storms. It's in the 0.20.204 branch for the curious. Arun
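If memory serves, the configuration knob that goes with this feature is dfs.datanode.failed.volumes.tolerated in hdfs-site.xml (default 0, i.e. any volume failure still takes the datanode down) -- worth confirming against the 0.20.204 docs before relying on it:

  <property>
    <name>dfs.datanode.failed.volumes.tolerated</name>
    <!-- number of data volumes allowed to fail before the datanode shuts itself down -->
    <value>1</value>
  </property>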
Re: Hadoop Java Versions
On 26/06/11 20:23, Scott Carey wrote: On 6/23/11 5:49 AM, Steve Loughran ste...@apache.org wrote: what's your HW setup? #cores/server, #servers, underlying OS? CentOS 5.6. 4 cores / 8 threads a server (Nehalem generation Intel processor).
That should be enough to find problems. I've just moved up to a 6-core / 12-thread desktop and that found problems in some non-Hadoop code, which shows that the more threads you have, and the faster the machines are, the more your race conditions show up. With Hadoop, the fact that you can have 10-1000 servers means that in a large cluster the probability of that race condition showing up scales accordingly.
Also run a smaller cluster with 2x quad-core Core 2 generation Xeons. Off topic: the single-proc Nehalem is faster than the dual Core 2s for most use cases -- and much lower power. Looking forward to single-proc 4- or 6-core Sandy Bridge based systems for the next expansion -- testing 4 cores vs 4 cores has these 30% faster than the Nehalem generation systems in CPU-bound tasks, and lower power. Intel prices single-socket Xeons so much lower than the dual-socket ones that the best value for us is to get more single-socket servers rather than fewer dual-socket ones (with a similar processor-to-hard-drive ratio).
Yes, in a large cluster the price of filling the second socket can compare to a lot of storage, and TB of storage is more tangible. I guess it depends on your application. Regarding Sandy Bridge, I've no experience of those, but I worry that 10 Gbps is still bleeding edge, and it shouldn't be needed for code with good locality anyway; it is probably more cost effective to stay at 1 Gbps/server, though the issue there is that the # of HDDs per server generates lots of replication traffic when a single server fails...
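To put a number on why cluster size matters for flushing out races (purely illustrative figures): if a given race has a 0.1% chance of firing on any one node in a day, the chance of seeing it somewhere in the cluster is 1 - (1 - 0.001)^N -- about 2% per day on 20 nodes, but roughly 63% per day on 1000 nodes.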
Re: Hadoop Java Versions
On the subject of GigE vs 10 GigE, I think that we will very shortly be seeing interest in 10gig, since GigE is only ~120 MB/sec -- one hard drive's worth of streaming data. Nodes with 4+ disks are throttled by the network. On a small cluster (20 nodes), the replication traffic can choke the cluster to death. The only way to fix it quickly is to bring that node back up. Perhaps the HortonWorks guys can work on that. -ryan
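A rough sketch of the arithmetic behind that (illustrative numbers, not measurements from any particular cluster):

  # Back-of-the-envelope: time to re-replicate a dead node's data over 1 GbE.
  # All figures are assumptions -- plug in your own.
  node_storage_tb = 12      # data held by the failed node
  nodes_remaining = 19      # 20-node cluster minus the dead node
  nic_mb_per_sec = 120      # usable throughput of one 1 GbE link

  data_mb = node_storage_tb * 1024 * 1024
  # Aggregate re-replication bandwidth is bounded by the surviving NICs,
  # so while this runs, every link in the cluster is saturated.
  aggregate_mb_per_sec = nodes_remaining * nic_mb_per_sec
  hours = data_mb / float(aggregate_mb_per_sec) / 3600
  print("~%.1f hours of fully saturated 1 GbE links" % hours)

Roughly an hour and a half of every link running flat out, which is the "choke the cluster to death" effect on anything else trying to use the network.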
Re: Hadoop Java Versions
Come to Srivas's talk at the Summit.
Re: Hadoop Java Versions
That doesn't seem right. In one of our test clusters (19 data nodes) we found that under heavy loads we were disk I/O bound and not network bound. Of course YMMV depending on your ToR switch. If we had more than 4 disks per node, we would probably see the network become the bottleneck. What did you set your bandwidth settings to in hdfs-site.xml? (Going from memory, not sure of the exact setting...) But the good news: newer hardware will start to have 10 GbE on the motherboard. Mike Segel
Re: Hadoop Java Versions
There are no bandwidth limitations in 0.20.x -- none that I saw, at least. It was basically bandwidth-management-by-PWM: you could adjust how frequently, and how many, files per node were copied. In my case the load was HBase real-time serving, so it was servicing lots of smaller random reads, not a map-reduce job. Everyone has their own use case :-) -ryan
Re: Hadoop Java Versions
For cost reasons, we just bonded two 1G network ports together. A single stream won't go past 1 Gbps, but concurrent ones do -- this is with the Linux built-in bonding. The network is only stressed during 'sort-like' jobs or big replication events. We also removed some disk bottlenecks by tuning the file systems aggressively -- using a separate partition for the M/R temp space and the location that jars unpack into helps tremendously. Ext4 can be configured to delay flushing to disk for this temp space, which for small jobs cuts the I/O dramatically, since many files are deleted before they ever get pushed to disk.
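As an illustration of the kind of ext4 tuning described above (the exact options are a judgment call; data=writeback and a longer commit interval trade crash-safety of file contents for throughput, which is usually acceptable for scratch space only):

  # /etc/fstab -- dedicated MapReduce scratch partition (illustrative device and mount point)
  /dev/sdb1  /data/mapred-local  ext4  noatime,data=writeback,commit=30  0 0

with mapred.local.dir in mapred-site.xml pointed at that mount.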
Re: Hadoop Java Versions
Hmmm. I could have sworn there was a background balancing bandwidth limiter. Haven't tested random reads... The last test we did ended up hitting the cache, and we didn't push it hard enough to hit network bandwidth limitations... not to say they don't exist. Like I said in the other post, if we had more disks we would hit it. We'll have to do more random testing. Mike Segel
Re: Hadoop Java Versions
On 6/23/11 5:49 AM, Steve Loughran ste...@apache.org wrote: On 22/06/2011 21:27, Scott Carey wrote: Problems have been reported with Hadoop, the 64-bit JVM and Compressed Object References (the -XX:+UseCompressedOops option), so use of that option is discouraged. I think the above is dated. It also lacks critical information. What JVM and OS version was the problem seen on? A colleague saw it, intermittent JVM crashes. Unless he's updated I can't say the problem has gone away.
CompressedOops had several issues prior to JRE 6u20, and a few minor ones were fixed in u21. FWIW, I now exclusively use 64-bit with CompressedOops for all Hadoop and non-Hadoop apps and have seen no issues. It is the default in 6u24 and 6u25 on a 64-bit JVM.
What's your HW setup? #cores/server, #servers, underlying OS? CentOS 5.6. 4 cores / 8 threads a server (Nehalem generation Intel processor). Also run a smaller cluster with 2x quad-core Core 2 generation Xeons. Off topic: the single-proc Nehalem is faster than the dual Core 2s for most use cases -- and much lower power. Looking forward to single-proc 4- or 6-core Sandy Bridge based systems for the next expansion -- testing 4 cores vs 4 cores has these 30% faster than the Nehalem generation systems in CPU-bound tasks, and lower power. Intel prices single-socket Xeons so much lower than the dual-socket ones that the best value for us is to get more single-socket servers rather than fewer dual-socket ones (with a similar processor-to-hard-drive ratio). We are power constrained per rack either way.
Re: Hadoop Java Versions
On 22/06/2011 21:27, Scott Carey wrote: Problems have been reported with Hadoop, the 64-bit JVM and Compressed Object References (the -XX:+UseCompressedOops option), so use of that option is discouraged. I think the above is dated. It also lacks critical information. What JVM and OS version was the problem seen on?
A colleague saw it, intermittent JVM crashes. Unless he's updated I can't say the problem has gone away.
CompressedOops had several issues prior to JRE 6u20, and a few minor ones were fixed in u21. FWIW, I now exclusively use 64-bit with CompressedOops for all Hadoop and non-Hadoop apps and have seen no issues. It is the default in 6u24 and 6u25 on a 64-bit JVM.
What's your HW setup? #cores/server, #servers, underlying OS?
Re: Hadoop Java Versions
Problems have been reported with Hadoop, the 64-bit JVM and Compressed Object References (the -XX:+UseCompressedOops option), so use of that option is discouraged. I think the above is dated. It also lacks critical information. What JVM and OS version was the problem seen on? CompressedOops had several issues prior to JRE 6u20, and a few minor ones were fixed in u21. FWIW, I now exclusively use 64-bit with CompressedOops for all Hadoop and non-Hadoop apps and have seen no issues. It is the default in 6u24 and 6u25 on a 64-bit JVM. On 6/14/11 5:16 PM, Allen Wittenauer a...@apache.org wrote: While we're looking at the wiki, could folks update http://wiki.apache.org/hadoop/HadoopJavaVersions with whatever versions of Hadoop they are using successfully? Thanks. P.S., yes, I'm thinking about upgrading ours. :p
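For anyone who wants to try the flag themselves, forcing it on (or off, with -XX:-UseCompressedOops) on a 0.20-era install looks roughly like this -- a sketch assuming the stock layout, where HADOOP_OPTS in conf/hadoop-env.sh covers the daemons and mapred.child.java.opts covers the task JVMs:

  # conf/hadoop-env.sh -- applies to NameNode/DataNode/JobTracker/TaskTracker daemons
  export HADOOP_OPTS="$HADOOP_OPTS -XX:+UseCompressedOops"

  <!-- mapred-site.xml -- applies to map/reduce child task JVMs -->
  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx1024m -XX:+UseCompressedOops</value>
  </property>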
Re: Hadoop Java Versions
On Jun 22, 2011, at 1:27 PM, Scott Carey wrote: Problems have been reported with Hadoop, the 64-bit JVM and Compressed Object References (the -XX:+UseCompressedOops option), so use of that option is discouraged. I think the above is dated. It also lacks critical information. What JVM and OS version was the problem seen on?
That is definitely dated. Those were problems from 0.18 and 0.19.
CompressedOops had several issues prior to JRE 6u20, and a few minor ones were fixed in u21. FWIW, I now exclusively use 64-bit with CompressedOops for all Hadoop and non-Hadoop apps and have seen no issues. It is the default in 6u24 and 6u25 on a 64-bit JVM.
We use CompressedOops on 6u21 every so often as well. Feel free to edit that whole section. :) Thanks!
Re: Hadoop Java Versions
On 6/22/11 1:49 PM, Allen Wittenauer a...@apache.org wrote: We use CompressedOops on 6u21 every so often as well. Feel free to edit that whole section. :) Thanks!
Updated the CompressedOops section. I am using 6u25 in production and have used 6u24 as well (64-bit version, CentOS 5.5). We usually roll out JVMs to a subset of nodes, then after a little burn-in, to the whole thing. First on development / QA clusters (which are just as busy as production, but smaller), then production.