Re: Hadoop Java Versions

2011-07-13 Thread Eric Baldeschwieler
We could create an Apache Hadoop list for product selection discussions.  I
believe this list is intended to be focused on project governance and similar
discussions.  Maybe we should simply create a governance list and leave this
one to be the free-for-all?

On Jul 2, 2011, at 9:16 PM, Ian Holsman wrote:

 
 On Jul 2, 2011, at 7:38 AM, Ted Dunning wrote:
 
 On Fri, Jul 1, 2011 at 11:53 AM, Abhishek Mehta abhis...@tresata.com wrote:
 
 open and honest conversations on this thread/group, irrespective of
 whether one represents a company or Apache, should be encouraged.
 
 
 Paradoxically, I partially agree with Ian on this.  On the one hand, it is
 important to hear about alternatives in the same eco-system (and alternative
 approaches from other communities).  That is a bit different from Ian's
 view.
 
 While I don't mind that technical alternatives get discussed, I do get PO'd 
 when
 the conversation goes into why product X is better than Y, or when someone 
 makes claims that are incorrect because some 'customer' told them and stuff 
 like that.
 
 If we can keep it at architecture/approaches instead of why a certain product
 is better, then go right ahead.
 
 
 Where I agree with him is that discussions should not be in the form of
 simple breast beating.  Discussions should be driven by data and experience.
 All data and experience are of equal value if they represent quality
 thought.  The only place that company names should figure in this is as a
 bookmark so you can tell which product/release has the characteristic under
 consideration.
 
 
 ... And in that sense we all owe ASF and the hadoop community (and not any
 one company) an equal amount of gratitude, humility and respect.
 
 
 This doesn't get said nearly enough.
 



Re: Hadoop Java Versions

2011-07-13 Thread Ted Dunning
On Wed, Jul 13, 2011 at 7:59 AM, Eric Baldeschwieler eri...@hortonworks.com
 wrote:

 We could create an Apache Hadoop list for product selection discussions.  I
 believe this list is intended to be focused on project governance and
 similar discussions.  Maybe we should simply create a governance list and
 leave this one to be the free-for-all?


If you need to separate the discussions, then taking the discussion with the
smaller/more experienced set of participants to the new list is going to be
easier than trying to sweep the tide.  A list named "general" sounds a lot
like general discussions to newcomers (and frankly, to old-timers like
me).

For that matter, is there a strong reason to segregate the discussions?


Re: Hadoop Java Versions

2011-07-02 Thread Ian Holsman

On Jul 2, 2011, at 7:38 AM, Ted Dunning wrote:

 On Fri, Jul 1, 2011 at 11:53 AM, Abhishek Mehta abhis...@tresata.com wrote:
 
 open and honest conversations on this thread/group, irrespective of
 whether one represents a company or Apache, should be encouraged.
 
 
 Paradoxically, I partially agree with Ian on this.  On the one hand, it is
 important to hear about alternatives in the same eco-system (and alternative
 approaches from other communities).  That is a bit different from Ian's
 view.

While I don't mind that technical alternatives get discussed, I do get PO'd when
the conversation goes into why product X is better than Y, or when someone 
makes claims that are incorrect because some 'customer' told them and stuff 
like that.

If we can keep it at architecture/approaches instead of why a certain product
is better, then go right ahead.

 
 Where I agree with him is that discussions should not be in the form of
 simple breast beating.  Discussions should be driven by data and experience.
 All data and experience are of equal value if they represent quality
 thought.  The only place that company names should figure in this is as a
 bookmark so you can tell which product/release has the characteristic under
 consideration.
 
 
 ... And in that sense we all owe ASF and the hadoop community (and not any
 one company) an equal amount of gratitude, humility and respect.
 
 
 This doesn't get said nearly enough.



1Gb vs 10Gb eth (WAS: Re: Hadoop Java Versions)

2011-07-01 Thread Evert Lammerts
 Keeping the amount of disks per node low and the amount of nodes high
 should keep the impact of dead nodes in control.
 
 It keeps the impact of dead nodes in control but I don't think that's
 long-term cost efficient.  As prices of 10GbE go down, the "keep the node
 small" argument seems less fitting.  And on another note, most servers
 manufactured in the last 10 years have dual 1GbE network interfaces.  If one
 were to go by these calcs:

 150 nodes with four 2TB disks each, with HDFS 60% full, it takes around 32
 minutes to recover

 It seems like that assumes a single 1GbE interface, why not leverage the
 second?

I don't know how others set up their clusters. We have the tradition that
every node in a cluster has at least three interfaces - one for interconnects,
one for a management network (only reachable from within our own network and
the primary interface for our admins, accessible only through a single
management node) and one for ILOM, DRAC or whatever lights-out manager is
available. This doesn't leave us room for bonding interfaces on off-the-shelf
nodes. Plus - you'd need twice as many ports in your switch.

In the case of Hadoop we're considering adding a fourth NIC for external 
connectivity. We don't want people interacting with HDFS from outside while 
jobs are using the interconnects.

Of course the choice for 1 or 10Gb ethernet is a function of price. As 10Gb
ethernet prices are approaching that of 1Gb ethernet it gets more attractive.
The recovery times scale linearly with ethernet speed, so 1Gb ethernet compared
to 2Gb bonded ethernet or 10Gb ethernet makes quite a difference. I'm just
saying that since we have other variables to tweak - the number of disks and
the number of nodes - we can limit how much minimizing recovery times matters.

Another thing to consider is that as 10Gb ethernet gets cheaper, it gets more 
attractive to stop using HDFS (or at least, data locality) and start using an 
external storage cluster. Compute node failure then has no impact, disk failure 
is hardly noticed by compute nodes. But this is really still very far from as 
cheap as many small nodes with relatively few disks - I really like the bang
for buck you get with Hadoop :-)

Evert


On Thu, Jun 30, 2011 at 2:31 PM, Evert Lammerts evert.lamme...@sara.nl wrote:

  You can get 12-24 TB in a server today, which means the loss of a server
  generates a lot of traffic -which argues for 10 Gbe.
 
  But
-big increase in switch cost, especially if you (CoI warning) go with
  Cisco
-there have been problems with things like BIOS PXE and lights out
  management on 10 Gbe -probably due to the NICs being things the BIOS
  wasn't expecting and off the mainboard. This should improve.
-I don't know how well linux works with ether that fast (field reports
  useful)
-the big threat is still ToR switch failure, as that will trigger a
  re-replication of every block in the rack.

 Keeping the amount of disks per node low and the amount of nodes high
 should keep the impact of dead nodes in control. A ToR switch failing is
 different - missing 30 nodes (~120TB) at once cannot be fixed by adding more
 nodes; that actually increases the chance of a ToR switch failure. Although
 such a failure is quite rare to begin with, I guess. The back-of-the-envelope
 calculation I made suggests that ~150 (1U) nodes should be fine with 1Gb
 ethernet. (e.g., when 6 nodes fail in a cluster with 150 nodes with four 2TB
 disks each, with HDFS 60% full, it takes around 32 minutes to recover. 2 nodes
 failing should take around 640 seconds. Also see the attached spreadsheet.)
 This doesn't take ToR switch failure into account though. On the other hand -
 150 nodes is only ~5 racks - in such a scenario you might want to shut the
 system down completely rather than letting it replicate 20% of all data.

 Cheers,
 Evert


Re: 1Gb vs 10Gb eth (WAS: Re: Hadoop Java Versions)

2011-07-01 Thread Ryan Rawson
What's the justification for a management interface? Doesn't that increase
complexity? Also you still need twice the ports?
On Jun 30, 2011 11:54 PM, Evert Lammerts evert.lamme...@sara.nl wrote:
 Keeping the amount of disks per node low and the amount of nodes high
 should keep the impact of dead nodes in control.

 It keeps the impact of dead nodes in control but I don't think that's
 long-term cost efficient. As prices of 10GbE go down, the "keep the node
 small" argument seems less fitting. And on another note, most servers
 manufactured in the last 10 years have dual 1GbE network interfaces. If one
 were to go by these calcs:

 150 nodes with four 2TB disks each, with HDFS 60% full, it takes around 32
 minutes to recover

 It seems like that assumes a single 1GbE interface, why not leverage the
 second?

 I don't know how others set up their clusters. We have the tradition
that every node in a cluster has at least three interfaces - one for
interconnects, one for a management network (only reachable from within our
own network and the primary interface for our admins, accessible only
through a single management node) and one for ILOM, DRAC or whatever
lights-out manager is available. This doesn't leave us room for bonding
interfaces on off-the-shelf nodes. Plus - you'd need twice as many ports in
your switch.

 In the case of Hadoop we're considering adding a fourth NIC for external
connectivity. We don't want people interacting with HDFS from outside while
jobs are using the interconnects.

 Of course the choice for 1 or 10Gb ethernet is a function of price. As
10Gb ethernet prices are approaching that of 1Gb ethernet it gets more
attractive. The recovery times scale linearly with ethernet speed, so 1Gb
ethernet compared to 2Gb bonded ethernet or 10Gb ethernet makes quite a
difference. I'm just saying that since we have other variables to tweak -
the number of disks and the number of nodes - we can limit how much minimizing
recovery times matters.

 Another thing to consider is that as 10Gb ethernet gets cheaper, it gets
more attractive to stop using HDFS (or at least, data locality) and start
using an external storage cluster. Compute node failure then has no impact,
disk failure is hardly noticed by compute nodes. But this is really still
very far from as cheap as many small nodes with relatively few disks - I
really like the bang for buck you get with Hadoop :-)

 Evert


 On Thu, Jun 30, 2011 at 2:31 PM, Evert Lammerts evert.lamme...@sara.nl
wrote:

  You can get 12-24 TB in a server today, which means the loss of a
server
  generates a lot of traffic -which argues for 10 Gbe.
 
  But
  -big increase in switch cost, especially if you (CoI warning) go with
  Cisco
  -there have been problems with things like BIOS PXE and lights out
  management on 10 Gbe -probably due to the NICs being things the BIOS
  wasn't expecting and off the mainboard. This should improve.
  -I don't know how well linux works with ether that fast (field reports
  useful)
  -the big threat is still ToR switch failure, as that will trigger a
  re-replication of every block in the rack.

 Keeping the amount of disks per node low and the amount of nodes high
 should keep the impact of dead nodes in control. A ToR switch failing is
 different - missing 30 nodes (~120TB) at once cannot be fixed by adding more
 nodes; that actually increases the chance of a ToR switch failure. Although
 such a failure is quite rare to begin with, I guess. The back-of-the-envelope
 calculation I made suggests that ~150 (1U) nodes should be fine with 1Gb
 ethernet. (e.g., when 6 nodes fail in a cluster with 150 nodes with four 2TB
 disks each, with HDFS 60% full, it takes around 32 minutes to recover. 2 nodes
 failing should take around 640 seconds. Also see the attached spreadsheet.)
 This doesn't take ToR switch failure into account though. On the other hand -
 150 nodes is only ~5 racks - in such a scenario you might want to shut the
 system down completely rather than letting it replicate 20% of all data.

 Cheers,
 Evert


RE: 1Gb vs 10Gb eth (WAS: Re: Hadoop Java Versions)

2011-07-01 Thread Evert Lammerts
 What's the justification for a management interface? Doesn't that increase
 complexity? Also you still need twice the ports?

That's true. It's a tradition that I haven't questioned before. The reasoning,
whether right or wrong, is that user jobs (our users are external) can get in
the way of admins. If there's a lot of network traffic it takes a lot longer to
re-install a node. It also makes security easier - we have a separate SSH
daemon listening on the management net accepting root logins - something the
default daemon does not do. Of course the same can be done by running a
separate daemon on the public interface, on a port that is only accessible from
our internal network.

Evert

On Jun 30, 2011 11:54 PM, Evert Lammerts evert.lamme...@sara.nl wrote:
 Keeping the amount of disks per node low and the amount of nodes high
 should keep the impact of dead nodes in control.

 It keeps the impact of dead nodes in control but I don't think that's
 long-term cost efficient. As prices of 10GbE go down, the "keep the node
 small" argument seems less fitting. And on another note, most servers
 manufactured in the last 10 years have dual 1GbE network interfaces. If one
 were to go by these calcs:

 150 nodes with four 2TB disks each, with HDFS 60% full, it takes around 32
 minutes to recover

 It seems like that assumes a single 1GbE interface, why not leverage the
 second?

 I don't know how others set up their clusters. We have the tradition
that every node in a cluster has at least three interfaces - one for
interconnects, one for a management network (only reachable from within our
own network and the primary interface for our admins, accessible only
through a single management node) and one for ILOM, DRAC or whatever
lights-out manager is available. This doesn't leave us room for bonding
interfaces on off-the-shelf nodes. Plus - you'd need twice as many ports in
your switch.

 In the case of Hadoop we're considering adding a fourth NIC for external
connectivity. We don't want people interacting with HDFS from outside while
jobs are using the interconnects.

 Of course the choice for 1 or 10Gb ethernet is a function of price. As
10Gb ethernet prices are approaching that of 1Gb ethernet it gets more
attractive. The recovery times scale linearly with ethernet speed, so 1Gb
ethernet compared to 2Gb bonded ethernet or 10Gb ethernet makes quite a
difference. I'm just saying that since we have other variables to tweak -
the number of disks and the number of nodes - we can limit how much minimizing
recovery times matters.

 Another thing to consider is that as 10Gb ethernet gets cheaper, it gets
more attractive to stop using HDFS (or at least, data locality) and start
using an external storage cluster. Compute node failure then has no impact,
disk failure is hardly noticed by compute nodes. But this is really still
very far from as cheap as many small nodes with relatively few disks - I
really like the bang for buck you get with Hadoop :-)

 Evert


 On Thu, Jun 30, 2011 at 2:31 PM, Evert Lammerts evert.lamme...@sara.nl
wrote:

  You can get 12-24 TB in a server today, which means the loss of a
server
  generates a lot of traffic -which argues for 10 Gbe.
 
  But
  -big increase in switch cost, especially if you (CoI warning) go with
  Cisco
  -there have been problems with things like BIOS PXE and lights out
  management on 10 Gbe -probably due to the NICs being things the BIOS
  wasn't expecting and off the mainboard. This should improve.
  -I don't know how well linux works with ether that fast (field reports
  useful)
  -the big threat is still ToR switch failure, as that will trigger a
  re-replication of every block in the rack.

 Keeping the amount of disks per node low and the amount of nodes high
 should keep the impact of dead nodes in control. A ToR switch failing is
 different - missing 30 nodes (~120TB) at once cannot be fixed by adding more
 nodes; that actually increases the chance of a ToR switch failure. Although
 such a failure is quite rare to begin with, I guess. The back-of-the-envelope
 calculation I made suggests that ~150 (1U) nodes should be fine with 1Gb
 ethernet. (e.g., when 6 nodes fail in a cluster with 150 nodes with four 2TB
 disks each, with HDFS 60% full, it takes around 32 minutes to recover. 2 nodes
 failing should take around 640 seconds. Also see the attached spreadsheet.)
 This doesn't take ToR switch failure into account though. On the other hand -
 150 nodes is only ~5 racks - in such a scenario you might want to shut the
 system down completely rather than letting it replicate 20% of all data.

 Cheers,
 Evert


Re: 1Gb vs 10Gb eth (WAS: Re: Hadoop Java Versions)

2011-07-01 Thread Steve Loughran

On 01/07/2011 07:41, Evert Lammerts wrote:

Keeping the amount of disks per node low and the amount of nodes high
should keep the impact of dead nodes in control.


It keeps the impact of dead nodes in control but I don't think that's
long-term cost efficient.  As prices of 10GbE go down, the "keep the node
small" argument seems less fitting.  And on another note, most servers
manufactured in the last 10 years have dual 1GbE network interfaces.  If one
were to go by these calcs:

150 nodes with four 2TB disks each, with HDFS 60% full, it takes around 32
minutes to recover

It seems like that assumes a single 1GbE interface, why not leverage the
second?


I don't know how others set up their clusters. We have the tradition that
every node in a cluster has at least three interfaces - one for interconnects,
one for a management network (only reachable from within our own network and
the primary interface for our admins, accessible only through a single
management node) and one for ILOM, DRAC or whatever lights-out manager is
available. This doesn't leave us room for bonding interfaces on off-the-shelf
nodes. Plus - you'd need twice as many ports in your switch.


Yes, I didn't get into ILO. That can be 100Mbps.


In the case of Hadoop we're considering adding a fourth NIC for external 
connectivity. We don't want people interacting with HDFS from outside while 
jobs are using the interconnects.

Of course the choice for 1 or 10Gb ethernet is a function of price. As 10Gb
ethernet prices are approaching that of 1Gb ethernet it gets more attractive.
The recovery times scale linearly with ethernet speed, so 1Gb ethernet compared
to 2Gb bonded ethernet or 10Gb ethernet makes quite a difference. I'm just
saying that since we have other variables to tweak - the number of disks and
the number of nodes - we can limit how much minimizing recovery times matters.


2Gb bonded with 2x ToR removes a single ToR switch as an SPOF for the 
rack but increases install/debugging costs. More wires, and your network 
configuration topology has got worse.




Another thing to consider is that as 10Gb ethernet gets cheaper, it gets more 
attractive to stop using HDFS (or at least, data locality) and start using an 
external storage cluster. Compute node failure then has no impact, disk failure 
is hardly noticed by compute nodes. But this is really still very far from as 
cheap as many small nodes with relatively few disks - I really like the bang
for buck you get with Hadoop :-)


Yeah, but you are a computing facility whose backbone gets data off the 
LHC tier 1 sites abroad faster than the seek time of the disks in your 
neighbouring building...


Re: 1Gb vs 10Gb eth (WAS: Re: Hadoop Java Versions)

2011-07-01 Thread Steve Loughran

On 01/07/2011 08:16, Ryan Rawson wrote:

What's the justification for a management interface? Doesn't that increase
complexity? Also you still need twice the ports?


ILO reduces ops complexity. You can push things like BIOS updates out,
boot machines into known states, instead of the slowly diverging world
that RPM or Debian updates get you into (the final state depends on the
order, and different machines end up applying them in a different
order), and for diagnosing problems when even the root disk doesn't want
to come out and play.


In a big cluster you need to worry about things like not powering on a 
quadrant of the site simultaneously as boot-time can be a peak power 
surge; you may want to bring up slices of every rack gradually, upgrade 
the BIOS and OS, and gradually ramp things up. This is where HPC admin 
tools differ from classic "lock down an end-user Windows PC" tooling.




Re: Hadoop Java Versions

2011-07-01 Thread Steve Loughran

On 01/07/2011 01:16, Ted Dunning wrote:

You have to consider the long-term reliability as well.

Losing an entire set of 10 or 12 disks at once makes the overall reliability
of a large cluster very suspect.  This is because it becomes entirely too
likely that two additional drives will fail before the data on the off-line
node can be replicated.  For 100 nodes, that can decrease the average time
to data loss down to less than a year.


There's also Rodrigo's work on alternate block placement that doesn't 
scatter blocks quite so randomly across a cluster, so a loss of a node 
or rack doesn't have adverse effects on so many files


https://issues.apache.org/jira/browse/HDFS-1094

Given that most HDD failures happen on cluster reboot, it is possible 
for 10-12 disks not to come up at the same time, if the cluster has been 
up for a while, but like Todd says -worry. At least a bit.



I've heard hints of one FS that actually includes HDD batch data in 
block placement, to try and scatter data across batches, and be biased 
towards using new HDDs for temp storage during burn-in. Some research 
work on doing that to HDFS could be something to keep some postgraduate 
busy for a while: "disk batch-aware block placement".



This can only be mitigated in stock
hadoop by keeping the number of drives relatively low.


Now I'm confused. Do you mean # of HDDs/server, or HDDs/filesystem? 
Because it seems to me that stock HDFS's use in production makes it 
one of the filesystems on the planet with the largest number of non-RAIDed 
HDDs out there -things like Lustre and IBM GPFS go for RAID, as does HP 
IBRIX (the last two of which have some form of Hadoop support too, if 
you ask nicely). HDD/server numbers matter in that in a small cluster, 
it's better to have fewer disks per machine so as to get more servers to 
spread the data over; you don't really want your 100 TB in three 1U servers. As 
your cluster grows -and you care more about storage capacity than raw 
compute- then the appeal of 24+ TB/server starts to look good, and 
that's when you care about the improvements to datanodes handling loss 
of worker disk better. Even without that, rebooting the DN may fix 
things, but the impact on ongoing work is the big issue -you don't just 
lose a replicated block, you lose data.


Cascade failures leading to cluster outages are a separate issue and 
normally triggered more by switch failure/config than anything else. It 
doesn't matter how reliable the hardware is if it gets the wrong 
configuration.




Re: 1Gb vs 10Gb eth (WAS: Re: Hadoop Java Versions)

2011-07-01 Thread Ted Dunning
You do use twice the ports, but if you know that the management port is not
used for serious data, then you can put a SOHO-grade switch on those ports
at negligible cost.

There is a serious conflict of goals here if you have software that can make
serious use of more than one NIC.  On the one hand, it is nice to use the
hardware you have.  On the other, it is really nice to have guaranteed
bandwidth for management.

With traditional switch level link aggregation this is usually not a problem
since each flow is committed to one NIC or the other, resulting in poor
balancing.  The silver lining of the poor balancing is that there is always
plenty of bandwidth for administration function.
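
The flow-hashing behaviour described here is easy to see with a toy model: a
bonded pair looks like 2 Gb/s in aggregate, but because the switch pins each
flow to one physical link, a single large transfer (one HDFS block copy, say)
still tops out at one link's speed. The sketch below is purely illustrative -
the hash, flow sizes and link count are made up and it is not modelled on any
particular switch.

import java.util.Random;

// Toy illustration of switch-level link aggregation: each flow is pinned to
// one physical link, so a single bulk transfer never exceeds one link's speed.
// Purely illustrative; not how any particular switch computes its hash.
public class FlowHashingDemo {
    public static void main(String[] args) {
        int links = 2;
        Random rnd = new Random(7);
        long[] bytesPerLink = new long[links];

        // 8 concurrent flows of very different sizes, as a shuffle/replication
        // mix might produce. Each flow's bytes all land on one link.
        for (int flow = 0; flow < 8; flow++) {
            // Stand-in for hashing the flow's addresses/ports onto a link.
            int link = Math.floorMod(flow * 31 + rnd.nextInt(1 << 16), links);
            long flowBytes = (flow == 0) ? 8L << 30 : 64L << 20; // one 8 GB flow, small others
            bytesPerLink[link] += flowBytes;
        }
        for (int l = 0; l < links; l++) {
            System.out.printf("link %d carried %d MB%n", l, bytesPerLink[l] >> 20);
        }
        // Whichever link the 8 GB flow hashed to does almost all the work, which
        // is both the "poor balancing" and the guaranteed headroom for
        // management traffic described above.
    }
}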

Though it may be slightly controversial to mention, the way that MapR does
application-level NIC bonding could cause some backlash from ops staff
because it can actually saturate as many NICs as you make available.
 Generally, management doesn't require much bandwidth and doing things like
reloading the BIOS is usually done when a machine is out of service for
maintenance, but the potential for surprise is there.

I have definitely seen a number of conditions where service access is a
complete godsend.  With geographically dispersed data centers, it is an
absolute requirement because you just can't staff every data center with
enough hands-on admins within a few minutes travel time of the data center.
 ILO (or DRAC as Dell calls it) gives you a 5-minute response time for fixing
totally horked machines.

On Fri, Jul 1, 2011 at 8:47 AM, Steve Loughran ste...@apache.org wrote:

 On 01/07/2011 08:16, Ryan Rawson wrote:

 What's the justification for a management interface? Doesn't that increase
 complexity? Also you still need twice the ports?


 ILO reduces ops complexity. You can push things like BIOS updates out, boot
 machines into known states, instead of the slowly diverging world that RPM
 or Debian updates get you into (the final state depends on the order, and
 different machines end up applying them in a different order), and for
 diagnosing problems when even the root disk doesn't want to come out and
 play.

 In a big cluster you need to worry about things like not powering on a
 quadrant of the site simultaneously as boot-time can be a peak power surge;
 you may want to bring up slices of every rack gradually, upgrade the BIOS
 and OS, and gradually ramp things up. This is where HPC admin tools differ
 from classic "lock down an end-user Windows PC" tooling.




Re: Hadoop Java Versions

2011-07-01 Thread Scott Carey
Although this thread is wandering a bit, I disagree strongly that it is
inappropriate to discuss other vendor specific features (or competing
compute platform features) on general@.  The topic has become the factors
that influence hardware purchase choices, and one of those is how the
system deals with disk failure.  Compare/contrast with other platforms is
healthy for the Hadoop project!

On 6/30/11 9:47 PM, Ian Holsman had...@holsman.net wrote:


On Jul 1, 2011, at 2:08 PM, M. C. Srivas wrote:

 On Thu, Jun 30, 2011 at 5:24 PM, Todd Lipcon t...@cloudera.com wrote:
 
 
 I'd advise you to look at stock hadoop again. This used to be true,
but
 was fixed a long while back by HDFS-457 and several followup JIRAs.
 
 If MapR does something fancier, I'm sure we'd be interested to hear
about
 it
 so we can compare the approaches.
 
 -Todd
 
 
 MapR tracks disk responsiveness. In other words, a moving histogram of
 IO-completion times is maintained internally, and if a disk starts
getting
 really slow, it is pre-emptively taken offline so it does not create
long
 tails for running jobs (and the data on the disk is re-replicated using
 whatever re-replication policy is in place).  One of the benefits of
 managing the disks directly instead of through ext3 / xfs / or other ...
 
 All these stats can be fed into Ganglia (or pushed out centrally via a
text
 file that can be pulled out using NFS)  if historical info about disk
 behavior (and failures) needs to be preserved.
 
 - Srivas.

While I am intrigued about how MapR performs internally, I don't think
this is the forum for it.
Please keep MapR (and other vendor specific discussions) on their
respective support forums.

Thanks!

Ian.




Re: Hadoop Java Versions

2011-07-01 Thread Ted Dunning
On Fri, Jul 1, 2011 at 9:12 AM, Steve Loughran ste...@apache.org wrote:

 On 01/07/2011 01:16, Ted Dunning wrote:

 You have to consider the long-term reliability as well.

 Losing an entire set of 10 or 12 disks at once makes the overall
 reliability
 of a large cluster very suspect.  This is because it becomes entirely too
 likely that two additional drives will fail before the data on the
 off-line
 node can be replicated.  For 100 nodes, that can decrease the average time
 to data loss down to less than a year.


 There's also Rodrigo's work on alternate block placement that doesn't
 scatter blocks quite so randomly across a cluster, so a loss of a node or
 rack doesn't have adverse effects on so many files

 https://issues.apache.org/jira/browse/HDFS-1094


I did calculations based on this as well.  The heuristic level of the
computation is pretty simple, but to go any deeper, you have a pretty hairy
computation.  My own approach was to use Monte Carlo Markov Chain to sample
from the failure mode distribution.  The codes that I wrote for this used
pluggable placement, replication and failure models.

I may have lacked sufficient cleverness at the time, but it was very
difficult to come up with structured placement policies that actually
improved the failure probabilities.  Most such strategies made the average
failure probabilities significantly worse.  My suspicion by analogy with large
structured error correction codes is that there are structured placement
policies that perform well, but that in reasonably large clusters (number of
disks > 50, say), random placement will be within epsilon of the best
possible strategy with very high probability.
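
Ted's MCMC machinery isn't reproduced here, but even a plain Monte Carlo sketch
(no Markov chain) shows the shape of the computation he describes: place three
replicas of each block on random distinct nodes, knock out a random set of
nodes, and count the trials in which some block loses every replica. All of the
parameters below - cluster size, block count, number of simultaneous failures,
trial count - are arbitrary illustrations, not Ted's code or his numbers.

import java.util.HashSet;
import java.util.Random;
import java.util.Set;

// Plain Monte Carlo sketch (not MCMC): estimate how often a simultaneous loss
// of 'failedNodes' nodes destroys all three replicas of some block under
// purely random placement. Illustrative parameters only.
public class BlockLossSimulation {
    public static void main(String[] args) {
        int nodes = 100, blocks = 100_000, replication = 3, failedNodes = 3;
        int trials = 500;
        Random rnd = new Random(1);

        int trialsWithLoss = 0;
        for (int t = 0; t < trials; t++) {
            // Pick the set of simultaneously failed nodes.
            Set<Integer> dead = new HashSet<>();
            while (dead.size() < failedNodes) dead.add(rnd.nextInt(nodes));

            boolean lostBlock = false;
            for (int b = 0; b < blocks && !lostBlock; b++) {
                // Random placement: 'replication' distinct nodes per block.
                Set<Integer> replicas = new HashSet<>();
                while (replicas.size() < replication) replicas.add(rnd.nextInt(nodes));
                lostBlock = dead.containsAll(replicas);
            }
            if (lostBlock) trialsWithLoss++;
        }
        System.out.printf("P(data loss | %d nodes down) ~= %.3f%n",
                failedNodes, (double) trialsWithLoss / trials);
    }
}

Swapping in a different placement policy just means changing how the replica
set is chosen, which is the pluggable part mentioned above.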



 Given that most HDD failures happen on cluster reboot, it is possible for
 10-12 disks not to come up at the same time, if the cluster has been up for
 a while, but like Todd says -worry. At least a bit.


Indeed.

Thank goodness also that disk manufacturers tend to be pessimistic in
quoting MTBF.

These possibilities of correlated failure seriously complicate these
computations, of course.


 I've heard hints of one FS that actually includes HDD batch data in block
 placement, to try and scatter data across batches, and be biased towards
 using new HDDs for temp storage during burn-in. Some research work on doing
 that to HDFS could be something to keep some postgraduate busy for a while:
 "disk batch-aware block placement".


Sadly, I can't comment on my knowledge of this except to say that there are
non-obvious solutions to this that are embedded in at least one commercial
map-reduce related product.  I can't say which without getting chastised.


  This can only be mitigated in stock
 hadoop by keeping the number of drives relatively low.


 Now I'm confused. Do you mean # of HDDs/server, or HDDs/filesystem?


Per system.


 ..with the largest number of non-RAIDed HDDs out there -things like Lustre and
 IBM GPFS go for RAID, as does HP IBRIX (the last two of which have some form
 of Hadoop support too, if you ask nicely). HDD/server numbers matter in that in
 a small cluster, it's better to have fewer disks per machine so as to get more
 servers to spread the data over; you don't really want your 100 TB in three 1U
 servers. As your cluster grows -and you care more about storage capacity
 than raw compute- then the appeal of 24+ TB/server starts to look good, and
 that's when you care about the improvements to datanodes handling loss of
 worker disk better. Even without that, rebooting the DN may fix things, but
 the impact on ongoing work is the big issue -you don't just lose a
 replicated block, you lose data.


Generally, I agree with what you say.  The effect of RAID is to squeeze the
error distributions around so that partial failures have lower probability.
  This is complex in the aggregate.



 Cascade failures leading to cluster outages are a separate issue and
 normally triggered more by switch failure/config than anything else. It doesn't
 matter how reliable the hardware is if it gets the wrong configuration.


Indeed.


Re: Hadoop Java Versions

2011-07-01 Thread Abhishek Mehta
I definitely agree with Scott.  As a user of the Hadoop open source stack for 
building our banking-focused big data analytics applications, I speak on behalf 
of our clients and the emerging Hadoop ecosystem in saying that open and honest 
conversations on this thread/group, irrespective of whether one represents a 
company or Apache, should be encouraged.

For instance, given that Cloudera, MapR and soon Hortonworks are all going to 
be offering competing Hadoop distros for enterprises, it is important for all 
of us (and prospective users) to understand what they are doing to address 
critical gaps on the platform, and how the Hadoop ecosystem benefits from it.

From our perspective, it doesn't matter if one is better than the other (which 
is not the point I saw Ted or MC making), but that companies, startups, Apache 
and everybody else:

1.  are thinking about the right issues
2.  are willing to solve them (and ideally contributing the solutions back) and
3.  are informing the exploding Hadoop user base of what not to do

I see it benefitting all of us, especially as Hadoop rapidly jumps the transom 
and becomes the platform of choice for data management in industries like 
banking, retail and healthcare...just as it has for social media and the web...

Isn't that what we are launching our business plans around anyway...

And in that sense we all owe ASF and the hadoop community (and not any one 
company) an equal amount of gratitude, humility and respect.  


On Jul 1, 2011, at 1:22 PM, Scott Carey wrote:

 Although this thread is wandering a bit, I disagree strongly that it is
 inappropriate to discuss other vendor specific features (or competing
 compute platform features) on general@.  The topic has become the factors
 that influence hardware purchase choices, and one of those is how the
 system deals with disk failure.  Compare/contrast with other platforms is
 healthy for the Hadoop project! +1
 
 On 6/30/11 9:47 PM, Ian Holsman had...@holsman.net wrote:
 
 
 On Jul 1, 2011, at 2:08 PM, M. C. Srivas wrote:
 
 On Thu, Jun 30, 2011 at 5:24 PM, Todd Lipcon t...@cloudera.com wrote:
 
 
 I'd advise you to look at stock hadoop again. This used to be true,
 but
 was fixed a long while back by HDFS-457 and several followup JIRAs.
 
 If MapR does something fancier, I'm sure we'd be interested to hear
 about
 it
 so we can compare the approaches.
 
 -Todd
 
 
 MapR tracks disk responsiveness. In other words, a moving histogram of
 IO-completion times is maintained internally, and if a disk starts
 getting
 really slow, it is pre-emptively taken offline so it does not create
 long
 tails for running jobs (and the data on the disk is re-replicated using
 whatever re-replication policy is in place).  One of the benefits of
 managing the disks directly instead of through ext3 / xfs / or other ...
 
 All these stats can be fed into Ganglia (or pushed out centrally via a
 text
 file that can be pulled out using NFS)  if historical info about disk
 behavior (and failures) needs to be preserved.
 
 - Srivas.
 
 While I am intrigued about how MapR performs internally, I don't think
 this is the forum for it.
 Please keep MapR (and other vendor specific discussions) on their
 respective support forums.
 
 Thanks!
 
 Ian.
 
 



Re: Hadoop Java Versions

2011-07-01 Thread Ted Dunning
On Fri, Jul 1, 2011 at 11:53 AM, Abhishek Mehta abhis...@tresata.com wrote:

  open and honest conversations on this thread/group, irrespective of
 whether one represents a company or Apache, should be encouraged.


Paradoxically, I partially agree with Ian on this.  On the one hand, it is
important to hear about alternatives in the same eco-system (and alternative
approaches from other communities).  That is a bit different from Ian's
view.

Where I agree with him is that discussions should not be in the form of
simple breast beating.  Discussions should be driven by data and experience.
 All data and experience are of equal value if they represent quality
thought.  The only place that company names should figure in this is as a
bookmark so you can tell which product/release has the characteristic under
consideration.


 ... And in that sense we all owe ASF and the hadoop community (and not any
 one company) an equal amount of gratitude, humility and respect.


This doesn't get said nearly enough.


RE: Hadoop Java Versions

2011-06-30 Thread Evert Lammerts
 You can get 12-24 TB in a server today, which means the loss of a server
 generates a lot of traffic -which argues for 10 Gbe.
 
 But
   -big increase in switch cost, especially if you (CoI warning) go with
 Cisco
   -there have been problems with things like BIOS PXE and lights out
 management on 10 Gbe -probably due to the NICs being things the BIOS
 wasn't expecting and off the mainboard. This should improve.
   -I don't know how well linux works with ether that fast (field reports
 useful)
   -the big threat is still ToR switch failure, as that will trigger a
 re-replication of every block in the rack.

Keeping the amount of disks per node low and the amount of nodes high should 
keep the impact of dead nodes in control. A ToR switch failing is different - 
missing 30 nodes (~120TB) at once cannot be fixed by adding more nodes; that 
actually increases the chance of a ToR switch failure. Although such a failure 
is quite rare to begin with, I guess. The back-of-the-envelope calculation I 
made suggests that ~150 (1U) nodes should be fine with 1Gb ethernet. (e.g., when 
6 nodes fail in a cluster with 150 nodes with four 2TB disks each, with HDFS 60% 
full, it takes around 32 minutes to recover. 2 nodes failing should take around 
640 seconds. Also see the attached spreadsheet.) This doesn't take ToR switch 
failure into account though. On the other hand - 150 nodes is only ~5 racks - in 
such a scenario you might want to shut the system down completely rather than 
letting it replicate 20% of all data.

Cheers,
Evert
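
The spreadsheet mentioned above isn't included in the archive, but the quoted
figures can be reproduced with a very simple model: each dead node holds
disks x disk size x HDFS fullness worth of block data, and the surviving
cluster re-replicates it at roughly the aggregate usable 1GbE bandwidth. The
sketch below assumes ~100 MB/s of usable throughput per 1GbE node and that the
whole cluster shares the re-replication work evenly - assumptions chosen for
illustration, not values taken from the original spreadsheet.

// Back-of-the-envelope HDFS re-replication time estimate.
// All parameters are illustrative assumptions, not values from the
// original spreadsheet (which is not attached to this archive).
public class RecoveryTimeEstimate {
    public static void main(String[] args) {
        int nodes = 150;            // 1U nodes in the cluster
        int disksPerNode = 4;       // disks per node
        double diskTB = 2.0;        // TB per disk
        double hdfsFullness = 0.6;  // HDFS is 60% full
        double usableMBps = 100.0;  // assumed usable throughput per 1GbE node

        // Block data held on each node, i.e. what must be re-replicated if it dies.
        double tbPerNode = disksPerNode * diskTB * hdfsFullness;   // 4.8 TB
        // Assume the whole cluster shares the re-replication work evenly.
        double aggregateMBps = nodes * usableMBps;                 // ~15,000 MB/s

        for (int failed : new int[] {2, 6}) {
            double mbToCopy = failed * tbPerNode * 1e6;            // TB -> MB (decimal)
            double seconds = mbToCopy / aggregateMBps;
            System.out.printf("%d failed nodes: %.1f TB to re-replicate, ~%.0f s (~%.1f min)%n",
                    failed, failed * tbPerNode, seconds, seconds / 60.0);
        }
    }
}

With those assumptions the model gives ~640 seconds for 2 failed nodes and ~32
minutes for 6, matching the figures above; bonded 2x1GbE or 10GbE simply scales
the per-node throughput, which is the linear relationship discussed elsewhere
in the thread.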

RE: Hadoop Java Versions

2011-06-30 Thread Evert Lammerts
That's not a question I'm qualified to answer. I do know we're now buying an 
Arista for a different cluster, but there are probably loads of others out there.

*forwarded to general@...*


From: Abhishek Mehta [abhis...@tresata.com]
Sent: Thursday, June 30, 2011 11:38 PM
To: Evert Lammerts
Subject: Fwd: Hadoop Java Versions

what are the other switch options (other than cisco that is?)?

cheers


Abhishek Mehta
(e) abhis...@tresata.com
(v) 980.355.9855

Begin forwarded message:

From: Evert Lammerts evert.lamme...@sara.nl
Date: June 30, 2011 5:31:26 PM EDT
To: general@hadoop.apache.org
Subject: RE: Hadoop Java Versions
Reply-To: general@hadoop.apache.org

You can get 12-24 TB in a server today, which means the loss of a server
generates a lot of traffic -which argues for 10 Gbe.

But
 -big increase in switch cost, especially if you (CoI warning) go with
Cisco
 -there have been problems with things like BIOS PXE and lights out
management on 10 Gbe -probably due to the NICs being things the BIOS
wasn't expecting and off the mainboard. This should improve.
 -I don't know how well linux works with ether that fast (field reports
useful)
 -the big threat is still ToR switch failure, as that will trigger a
re-replication of every block in the rack.

Keeping the amount of disks per node low and the amount of nodes high should 
keep the impact of dead nodes in control. A ToR switch failing is different - 
missing 30 nodes (~120TB) at once cannot be fixed by adding more nodes; that 
actually increases the chance of a ToR switch failure. Although such a failure 
is quite rare to begin with, I guess. The back-of-the-envelope calculation I 
made suggests that ~150 (1U) nodes should be fine with 1Gb ethernet. (e.g., when 
6 nodes fail in a cluster with 150 nodes with four 2TB disks each, with HDFS 60% 
full, it takes around 32 minutes to recover. 2 nodes failing should take around 
640 seconds. Also see the attached spreadsheet.) This doesn't take ToR switch 
failure into account though. On the other hand - 150 nodes is only ~5 racks - in 
such a scenario you might want to shut the system down completely rather than 
letting it replicate 20% of all data.

Cheers,
Evert



Re: Hadoop Java Versions

2011-06-30 Thread Aaron Eng
Keeping the amount of disks per node low and the amount of nodes high
should keep the impact of dead nodes in control.

It keeps the impact of dead nodes in control but I don't think that's
long-term cost efficient.  As prices of 10GbE go down, the "keep the node
small" argument seems less fitting.  And on another note, most servers
manufactured in the last 10 years have dual 1GbE network interfaces.  If one
were to go by these calcs:

150 nodes with four 2TB disks each, with HDFS 60% full, it takes around 32
minutes to recover

It seems like that assumes a single 1GbE interface, why not leverage the
second?

On Thu, Jun 30, 2011 at 2:31 PM, Evert Lammerts evert.lamme...@sara.nl wrote:

  You can get 12-24 TB in a server today, which means the loss of a server
  generates a lot of traffic -which argues for 10 Gbe.
 
  But
-big increase in switch cost, especially if you (CoI warning) go with
  Cisco
-there have been problems with things like BIOS PXE and lights out
  management on 10 Gbe -probably due to the NICs being things the BIOS
  wasn't expecting and off the mainboard. This should improve.
-I don't know how well linux works with ether that fast (field reports
  useful)
-the big threat is still ToR switch failure, as that will trigger a
  re-replication of every block in the rack.

 Keeping the amount of disks per node low and the amount of nodes high
 should keep the impact of dead nodes in control. A ToR switch failing is
 different - missing 30 nodes (~120TB) at once cannot be fixed by adding more
 nodes; that actually increases the chance of a ToR switch failure. Although
 such a failure is quite rare to begin with, I guess. The back-of-the-envelope
 calculation I made suggests that ~150 (1U) nodes should be fine with 1Gb
 ethernet. (e.g., when 6 nodes fail in a cluster with 150 nodes with four 2TB
 disks each, with HDFS 60% full, it takes around 32 minutes to recover. 2 nodes
 failing should take around 640 seconds. Also see the attached spreadsheet.)
 This doesn't take ToR switch failure into account though. On the other hand -
 150 nodes is only ~5 racks - in such a scenario you might want to shut the
 system down completely rather than letting it replicate 20% of all data.

 Cheers,
 Evert


Re: Hadoop Java Versions

2011-06-30 Thread Ted Dunning
You have to consider the long-term reliability as well.

Losing an entire set of 10 or 12 disks at once makes the overall reliability
of a large cluster very suspect.  This is because it becomes entirely too
likely that two additional drives will fail before the data on the off-line
node can be replicated.  For 100 nodes, that can decrease the average time
to data loss down to less than a year.  This can only be mitigated in stock
hadoop by keeping the number of drives relatively low.  MapR avoids this by
not failing nodes for trivial problems.

On Thu, Jun 30, 2011 at 4:18 PM, Aaron Eng a...@maprtech.com wrote:

 Keeping the amount of disks per node low and the amount of nodes high
 should keep the impact of dead nodes in control.

  It keeps the impact of dead nodes in control but I don't think that's
  long-term cost efficient.  As prices of 10GbE go down, the "keep the node
  small" argument seems less fitting.  And on another note, most servers
  manufactured in the last 10 years have dual 1GbE network interfaces.  If one
  were to go by these calcs:

  150 nodes with four 2TB disks each, with HDFS 60% full, it takes around 32
  minutes to recover

  It seems like that assumes a single 1GbE interface, why not leverage the
  second?

 On Thu, Jun 30, 2011 at 2:31 PM, Evert Lammerts evert.lamme...@sara.nl
 wrote:

   You can get 12-24 TB in a server today, which means the loss of a
 server
   generates a lot of traffic -which argues for 10 Gbe.
  
   But
 -big increase in switch cost, especially if you (CoI warning) go with
   Cisco
 -there have been problems with things like BIOS PXE and lights out
   management on 10 Gbe -probably due to the NICs being things the BIOS
   wasn't expecting and off the mainboard. This should improve.
 -I don't know how well linux works with ether that fast (field
 reports
   useful)
 -the big threat is still ToR switch failure, as that will trigger a
   re-replication of every block in the rack.
 
  Keeping the amount of disks per node low and the amount of nodes high
  should keep the impact of dead nodes in control. A ToR switch failing is
  different - missing 30 nodes (~120TB) at once cannot be fixed by adding more
  nodes; that actually increases the chance of a ToR switch failure. Although
  such a failure is quite rare to begin with, I guess. The back-of-the-envelope
  calculation I made suggests that ~150 (1U) nodes should be fine with 1Gb
  ethernet. (e.g., when 6 nodes fail in a cluster with 150 nodes with four 2TB
  disks each, with HDFS 60% full, it takes around 32 minutes to recover. 2 nodes
  failing should take around 640 seconds. Also see the attached spreadsheet.)
  This doesn't take ToR switch failure into account though. On the other hand -
  150 nodes is only ~5 racks - in such a scenario you might want to shut the
  system down completely rather than letting it replicate 20% of all data.
 
  Cheers,
  Evert



Re: Hadoop Java Versions

2011-06-30 Thread Todd Lipcon
On Thu, Jun 30, 2011 at 5:16 PM, Ted Dunning tdunn...@maprtech.com wrote:

 You have to consider the long-term reliability as well.

 Losing an entire set of 10 or 12 disks at once makes the overall
 reliability
 of a large cluster very suspect.  This is because it becomes entirely too
 likely that two additional drives will fail before the data on the off-line
 node can be replicated.  For 100 nodes, that can decrease the average time
 to data loss down to less than a year.  This can only be mitigated in stock
 hadoop by keeping the number of drives relatively low.  MapR avoids this by
 not failing nodes for trivial problems.


I'd advise you to look at stock hadoop again. This used to be true, but
was fixed a long while back by HDFS-457 and several followup JIRAs.

If MapR does something fancier, I'm sure we'd be interested to hear about it
so we can compare the approaches.

-Todd



 On Thu, Jun 30, 2011 at 4:18 PM, Aaron Eng a...@maprtech.com wrote:

  Keeping the amount of disks per node low and the amount of nodes high
  should keep the impact of dead nodes in control.
 
  It keeps the impact of dead nodes in control but I don't think that's
  long-term cost efficient.  As prices of 10GbE go down, the "keep the node
  small" argument seems less fitting.  And on another note, most servers
  manufactured in the last 10 years have dual 1GbE network interfaces.  If one
  were to go by these calcs:

  150 nodes with four 2TB disks each, with HDFS 60% full, it takes around 32
  minutes to recover

  It seems like that assumes a single 1GbE interface, why not leverage the
  second?
 
  On Thu, Jun 30, 2011 at 2:31 PM, Evert Lammerts evert.lamme...@sara.nl
  wrote:
 
You can get 12-24 TB in a server today, which means the loss of a
  server
generates a lot of traffic -which argues for 10 Gbe.
   
But
  -big increase in switch cost, especially if you (CoI warning) go
 with
Cisco
  -there have been problems with things like BIOS PXE and lights out
management on 10 Gbe -probably due to the NICs being things the BIOS
wasn't expecting and off the mainboard. This should improve.
  -I don't know how well linux works with ether that fast (field
  reports
useful)
  -the big threat is still ToR switch failure, as that will trigger a
re-replication of every block in the rack.
  
   Keeping the amount of disks per node low and the amount of nodes high
   should keep the impact of dead nodes in control. A ToR switch failing
 is
   different - missing 30 nodes (~120TB) at once cannot be fixed by adding more
   nodes; that actually increases the chance of a ToR switch failure. Although
   such a failure is quite rare to begin with, I guess. The back-of-the-envelope
   calculation I made suggests that ~150 (1U) nodes should be fine with 1Gb
   ethernet. (e.g., when 6 nodes fail in a cluster with 150 nodes with four 2TB
   disks each, with HDFS 60% full, it takes around 32 minutes to recover. 2
   nodes failing should take around 640 seconds. Also see the attached
   spreadsheet.) This doesn't take ToR switch failure into account though. On
   the other hand - 150 nodes is only ~5 racks - in such a scenario you might
   want to shut the system down completely rather than letting it replicate 20%
   of all data.
  
   Cheers,
   Evert
 




-- 
Todd Lipcon
Software Engineer, Cloudera
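
For readers tracing the HDFS-457 reference: the practical upshot in later
Apache Hadoop releases is that a DataNode can be told to tolerate individual
volume (disk) failures instead of treating any disk failure as a node failure.
A minimal illustration using the standard Configuration API is below; the
dfs.datanode.failed.volumes.tolerated property exists in later releases, but
check the documentation of the version you actually run before relying on it.

import org.apache.hadoop.conf.Configuration;

// Minimal illustration of the per-volume failure tolerance that grew out of
// HDFS-457 and its follow-ups. The property name below exists in later
// Apache Hadoop releases; verify availability for the release you run.
public class VolumeFailureConfig {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Allow each DataNode to lose up to one data directory (disk) before
        // it shuts itself down and triggers full re-replication of the node.
        conf.setInt("dfs.datanode.failed.volumes.tolerated", 1);
        System.out.println("failed.volumes.tolerated = "
                + conf.getInt("dfs.datanode.failed.volumes.tolerated", 0));
    }
}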


Re: Hadoop Java Versions

2011-06-30 Thread Ted Dunning
Good point Todd.

I was speaking from the experience of people I know who are using 0.20.x

On Thu, Jun 30, 2011 at 5:24 PM, Todd Lipcon t...@cloudera.com wrote:

 On Thu, Jun 30, 2011 at 5:16 PM, Ted Dunning tdunn...@maprtech.com
 wrote:

  You have to consider the long-term reliability as well.
 
  Losing an entire set of 10 or 12 disks at once makes the overall
  reliability
  of a large cluster very suspect.  This is because it becomes entirely too
  likely that two additional drives will fail before the data on the
 off-line
  node can be replicated.  For 100 nodes, that can decrease the average
 time
  to data loss down to less than a year.  This can only be mitigated in
 stock
  hadoop by keeping the number of drives relatively low.  MapR avoids this
 by
  not failing nodes for trivial problems.
 

 I'd advise you to look at stock hadoop again. This used to be true, but
 was fixed a long while back by HDFS-457 and several followup JIRAs.

 If MapR does something fancier, I'm sure we'd be interested to hear about
 it
 so we can compare the approaches.



Re: Hadoop Java Versions

2011-06-30 Thread M. C. Srivas
On Thu, Jun 30, 2011 at 5:24 PM, Todd Lipcon t...@cloudera.com wrote:


 I'd advise you to look at stock hadoop again. This used to be true, but
 was fixed a long while back by HDFS-457 and several followup JIRAs.

 If MapR does something fancier, I'm sure we'd be interested to hear about
 it
 so we can compare the approaches.

 -Todd


MapR tracks disk responsiveness. In other words, a moving histogram of
IO-completion times is maintained internally, and if a disk starts getting
really slow, it is pre-emptively taken offline so it does not create long
tails for running jobs (and the data on the disk is re-replicated using
whatever re-replication policy is in place).  One of the benefits of
managing the disks directly instead of through ext3 / xfs / or other ...

All these stats can be fed into Ganglia (or pushed out centrally via a text
file that can be pulled out using NFS)  if historical info about disk
behavior (and failures) needs to be preserved.

- Srivas.
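
MapR's actual implementation isn't public here, so the following is only an
illustrative sketch of the general idea Srivas describes: a sliding window of
I/O completion times per disk, with the disk flagged (for offlining and
re-replication elsewhere) once a high percentile of recent latencies crosses a
threshold. The window size, percentile and threshold are arbitrary
placeholder values.

import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;

// Illustrative sketch only: track recent I/O completion times for one disk
// and flag it as "suspect" when the 99th percentile gets too slow.
// Not MapR's implementation; all thresholds are made-up placeholders.
public class DiskLatencyMonitor {
    private final Deque<Double> windowMs = new ArrayDeque<>();
    private final int windowSize;
    private final double p99ThresholdMs;

    public DiskLatencyMonitor(int windowSize, double p99ThresholdMs) {
        this.windowSize = windowSize;
        this.p99ThresholdMs = p99ThresholdMs;
    }

    // Record one completed I/O; returns true if the disk now looks unhealthy.
    public synchronized boolean record(double completionMs) {
        windowMs.addLast(completionMs);
        if (windowMs.size() > windowSize) {
            windowMs.removeFirst();
        }
        return windowMs.size() == windowSize && percentile(0.99) > p99ThresholdMs;
    }

    private double percentile(double p) {
        double[] sorted = windowMs.stream().mapToDouble(Double::doubleValue).toArray();
        Arrays.sort(sorted);
        int idx = (int) Math.ceil(p * sorted.length) - 1;
        return sorted[Math.max(0, idx)];
    }

    public static void main(String[] args) {
        DiskLatencyMonitor disk = new DiskLatencyMonitor(1000, 500.0);
        java.util.Random rnd = new java.util.Random(42);
        for (int i = 0; i < 5000; i++) {
            // Healthy disk for a while, then a degrading one with long tails.
            double latency = (i < 3000) ? 5 + rnd.nextDouble() * 20
                                        : 5 + rnd.nextDouble() * 1500;
            if (disk.record(latency)) {
                System.out.println("I/O #" + i + ": disk exceeded p99 threshold -"
                        + " take it offline and re-replicate its data");
                break;
            }
        }
    }
}

The interesting policy questions - window size, which percentile, and how
aggressively to offline a disk that may just be busy - are exactly the kind of
trade-offs the message above alludes to.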


Re: Hadoop Java Versions

2011-06-30 Thread Ian Holsman

On Jul 1, 2011, at 2:08 PM, M. C. Srivas wrote:

 On Thu, Jun 30, 2011 at 5:24 PM, Todd Lipcon t...@cloudera.com wrote:
 
 
 I'd advise you to look at stock hadoop again. This used to be true, but
 was fixed a long while back by HDFS-457 and several followup JIRAs.
 
 If MapR does something fancier, I'm sure we'd be interested to hear about
 it
 so we can compare the approaches.
 
 -Todd
 
 
 MapR tracks disk responsiveness. In other words, a moving histogram of
 IO-completion times is maintained internally, and if a disk starts getting
 really slow, it is pre-emptively taken offline so it does not create long
 tails for running jobs (and the data on the disk is re-replicated using
 whatever re-replication policy is in place).  One of the benefits of
 managing the disks directly instead of through ext3 / xfs / or other ...
 
 All these stats can be fed into Ganglia (or pushed out centrally via a text
 file that can be pulled out using NFS)  if historical info about disk
 behavior (and failures) needs to be preserved.
 
 - Srivas.

While I am intrigued about how MapR performs internally, I don't think this is 
the forum for it.
Please keep MapR (and other vendor specific discussions) on their respective
support forums.

Thanks!

Ian.



Re: Hadoop Java Versions

2011-06-30 Thread M. C. Srivas
No worries.

I read Todd's post as asking for elaboration ... sometimes knowing what
another similar system does helps in improving your own.


On Thu, Jun 30, 2011 at 9:47 PM, Ian Holsman had...@holsman.net wrote:


 On Jul 1, 2011, at 2:08 PM, M. C. Srivas wrote:

  On Thu, Jun 30, 2011 at 5:24 PM, Todd Lipcon t...@cloudera.com wrote:
 
 
  I'd advise you to look at stock hadoop again. This used to be true,
 but
  was fixed a long while back by HDFS-457 and several followup JIRAs.
 
  If MapR does something fancier, I'm sure we'd be interested to hear
 about
  it
  so we can compare the approaches.
 
  -Todd
 
 
  MapR tracks disk responsiveness. In other words, a moving histogram of
  IO-completion times is maintained internally, and if a disk starts
 getting
  really slow, it is pre-emptively taken offline so it does not create long
  tails for running jobs (and the data on the disk is re-replicated using
  whatever re-replication policy is in place).  One of the benefits of
  managing the disks directly instead of through ext3 / xfs / or other ...
 
  All these stats can be fed into Ganglia (or pushed out centrally via a
 text
  file that can be pulled out using NFS)  if historical info about disk
  behavior (and failures) needs to be preserved.
 
  - Srivas.

 While I am intrigued about how MapR performs internally, I don't think this
 is the forum for it.
 Please keep MapR (and other vendor specific discussions) on their
 respective support forums.

 Thanks!

 Ian.




Re: Hadoop Java Versions

2011-06-30 Thread Ted Dunning
Ian,

Good point.  Srivas was responding to Todd's question, but there might be
better fora as you suggest.

We have a good one for specific questions about MapR at
http://answers.mapr.com

That doesn't, however, really provide a useful forum for questions like
Todd's, which really span both domains.  Where would you suggest for
questions that span the two subjects?

On Thu, Jun 30, 2011 at 9:47 PM, Ian Holsman had...@holsman.net wrote:

  If MapR does something fancier, I'm sure we'd be interested to hear
 about
  it
  so we can compare the approaches.
 
  -Todd
 
 
  MapR tracks disk responsiveness. In other words, a moving histogram of
  IO-completion times is maintained internally, and if a disk starts
 getting
  really slow, it is pre-emptively taken offline so it does not create long
  tails for running jobs (and the data on the disk is re-replicated using
  whatever re-replication policy is in place).  One of the benefits of
  managing the disks directly instead of through ext3 / xfs / or other ...
 
  All these stats can be fed into Ganglia (or pushed out centrally via a
 text
  file that can be pulled out using NFS)  if historical info about disk
  behavior (and failures) needs to be preserved.
 
  - Srivas.

 While I am intrigued about how MapR performs internally, I don't think this
 is the forum for it.
 Please keep MapR (and other vendor specific discussions) on their
 respective support forums.



Re: Hadoop Java Versions

2011-06-28 Thread Steve Loughran

On 28/06/11 04:49, Segel, Mike wrote:

Hmmm. I could have sworn there was a background balancing bandwidth limiter.


There is, for the rebalancer; node outages are taken more seriously, 
though there have been problems in past 0.20.x releases where there was a risk 
of a cascade failure on a big switch/rack failure. The risk has been 
reduced, though we all await field reports to confirm this :)


You can get 12-24 TB in a server today, which means the loss of a server 
generates a lot of traffic -which argues for 10 Gbe.


But
 -big increase in switch cost, especially if you (CoI warning) go with 
Cisco
 -there have been problems with things like BIOS PXE and lights out 
management on 10 Gbe -probably due to the NICs being things the BIOS 
wasn't expecting and off the mainboard. This should improve.
 -I don't know how well linux works with ether that fast (field reports 
useful)
 -the big threat is still ToR switch failure, as that will trigger a 
re-replication of every block in the rack.


2x1 Gbe lets you have redundant switches, albeit at the price of more 
wiring, more things to go wrong with the wiring, etc.


The other thing to consider is how well the enterprise switches work 
in this world -with a Hadoop cluster you can really test those claims about 
how well the switches handle every port lighting up at full rate. 
Indeed, I recommend that as part of your acceptance tests for the switch.
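
For reference, the "background balancing bandwidth limiter" Mike remembers at
the top of this exchange is the balancer's per-datanode cap: in the 0.20.x line
it is the dfs.balance.bandwidthPerSec property in hdfs-site.xml (later renamed
dfs.datanode.balance.bandwidthPerSec), a bytes-per-second ceiling that applies
to balancer traffic only - which is exactly why re-replication after a node
loss is the separate concern described here. A minimal sketch of setting it
through the Configuration API, with an assumed 1 MB/s value:

import org.apache.hadoop.conf.Configuration;

// Sketch: the rebalancer's per-datanode bandwidth cap in the 0.20.x line.
// Property name is from that era; newer releases renamed it to
// dfs.datanode.balance.bandwidthPerSec.
public class BalancerBandwidth {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        conf.setLong("dfs.balance.bandwidthPerSec", 1024L * 1024L); // 1 MB/s per datanode
        System.out.println("balancer cap = "
                + conf.getLong("dfs.balance.bandwidthPerSec", 0) + " bytes/s");
    }
}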





Re: Hadoop Java Versions

2011-06-28 Thread Michel Segel
You're preaching to the choir. :-)

With Sandy Bridge, you're going to start seeing 10 GbE on the motherboard.
We built our clusters using 1U boxes, where you're stuck with four 3.5-inch drives. 
With a larger chassis, you can fit an additional controller card and more drives.

More drives reduce the bottleneck and mean your performance will be 
throttled by your network and the amount of memory.

I priced out a couple of vendors and when you build out your boxes, the magic 
number per data node is $10,000.00 USD. (budget this amount per data node.) 
Moore's Law doesn't drop the price, but it gets you more bang for your buck. 
Note that this magic number is pre-discount and YMMV. [This also gets into 
what is meant by commodity hardware.]

I agree that 10 GbE is a necessity and I have been looking at it for the past 2 
years, only to be shot down by my client's IT group. I agree that Cisco's ToR 
switches are expensive; however, there are Arista and Blade Networks switches 
that claim to be Cisco-friendly and aren't too pricey. Somewhere around $10K a 
box. (Again, YMMV.)

If you want to upgrade existing boxes, you will probably want to look at 
Solarflare cards. 

JMHO...


Sent from a remote device. Please excuse any typos...

Mike Segel

On Jun 28, 2011, at 4:59 AM, Steve Loughran ste...@apache.org wrote:

 On 28/06/11 04:49, Segel, Mike wrote:
 Hmmm. I could have sworn there was a background balancing bandwidth limiter.
 
 There is, for the rebalancer, node outages are taken more seriously, though 
 there have been problems in past 0.20.x where there was a risk of a cascade 
 failure on a big switch/rack failure. The risk has been reduced, though we 
 all await field reports to confirm this :)
 
 You can get 12-24 TB in a server today, which means the loss of a server 
 generates a lot of traffic -which argues for 10 Gbe.
 
 But
 -big increase in switch cost, especially if you (CoI warning) go with Cisco
 -there have been problems with things like BIOS PXE and lights out management 
 on 10 Gbe -probably due to the NICs being things the BIOS wasn't expecting 
 and off the mainboard. This should improve.
 -I don't know how well linux works with ether that fast (field reports useful)
 -the big threat is still ToR switch failure, as that will trigger a 
 re-replication of every block in the rack.
 
 2x1 Gbe lets you have redundant switches, albeit at the price of more wiring, 
 more things to go wrong with the wiring, etc.
 
 The other thing to consider is how well the enterprise switches work in 
 this world -with a Hadoop cluster you can really test those claims how well 
 the switches handle every port lighting up at full rate. Indeed, I recommend 
 that as part of your acceptance tests for the switch.
 
 
 


Re: Hadoop Java Versions

2011-06-28 Thread Arun C Murthy
We at Yahoo are about to deploy code to ensure a disk failure on a datanode is 
just that - a disk failure. Not a node failure. This really helps avoid 
replication storms.

It's in the 0.20.204 branch for the curious.

Arun

Sent from my iPhone
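
For the curious, the knob that exposes this in the 0.20.204 line is, if I 
remember the name right, dfs.datanode.failed.volumes.tolerated -- the number 
of data directories a datanode may lose before it shuts itself down, with 0 
keeping the old all-or-nothing behaviour. A hedged hdfs-site.xml sketch, 
assuming that property name:

  <!-- hdfs-site.xml: let a datanode keep running after one disk failure
       instead of taking the whole node (and its replicas) offline. -->
  <property>
    <name>dfs.datanode.failed.volumes.tolerated</name>
    <value>1</value>
  </property>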

On Jun 28, 2011, at 3:01 AM, Steve Loughran ste...@apache.org wrote:

 On 28/06/11 04:49, Segel, Mike wrote:
 Hmmm. I could have sworn there was a background balancing bandwidth limiter.
 
 There is, for the rebalancer, node outages are taken more seriously, 
 though there have been problems in past 0.20.x where there was a risk of 
 a cascade failure on a big switch/rack failure. The risk has been 
 reduced, though we all await field reports to confirm this :)
 
 You can get 12-24 TB in a server today, which means the loss of a server 
 generates a lot of traffic -which argues for 10 Gbe.
 
 But
  -big increase in switch cost, especially if you (CoI warning) go with 
 Cisco
  -there have been problems with things like BIOS PXE and lights out 
 management on 10 Gbe -probably due to the NICs being things the BIOS 
 wasn't expecting and off the mainboard. This should improve.
  -I don't know how well linux works with ether that fast (field reports 
 useful)
  -the big threat is still ToR switch failure, as that will trigger a 
 re-replication of every block in the rack.
 
 2x1 Gbe lets you have redundant switches, albeit at the price of more 
 wiring, more things to go wrong with the wiring, etc.
 
 The other thing to consider is how well the enterprise switches work 
 in this world -with a Hadoop cluster you can really test those claims 
 how well the switches handle every port lighting up at full rate. 
 Indeed, I recommend that as part of your acceptance tests for the switch.
 
 


Re: Hadoop Java Versions

2011-06-27 Thread Steve Loughran

On 26/06/11 20:23, Scott Carey wrote:



On 6/23/11 5:49 AM, Steve Loughranste...@apache.org  wrote:




what's your HW setup? #cores/server, #servers, underlying OS?


CentOS 5.6.
4 cores / 8 threads a server (Nehalem generation Intel processor).



that should be enough to find problems. I've just moved up to a 6-core 
12 thread desktop and that found problems on some non-Hadoop code, which 
shows that the more threads you have, and the faster the machines are, 
the more your race conditions show up. With Hadoop the fact that you can 
have 10-1000 servers means that in a large cluster the probability of 
that race condition showing up scales well.



Also run a smaller cluster with 2x quad core Core 2 generation Xeons.

Off topic:
The single proc Nehalem is faster than the dual core 2's for most use
cases -- and much lower power.  Looking forward to single proc 4 or 6 core
Sandy Bridge based systems for the next expansion -- testing 4 core vs 4
core has these 30% faster than the Nehalem generation systems in CPU bound
tasks and lower power.  Intel prices single socket Xeons so much lower
than the Dual socket ones that the best value for us is to get more single
socket servers rather than fewer dual socket ones (with similar processor
to hard drive ratio).


Yes, in a large cluster the price of filling the second socket can 
compare to a lot of storage, and TB of storage is more tangible. I guess 
it depends on your application.


Regarding Sandy Bridge, I've no experience of those, but I worry that 10 
Gbps is still bleeding edge, and shouldn't be needed for code with good 
locality anyway; it is probably more cost effective to stay at 
1 Gbps/server, though the issue there is that the number of HDDs per server 
generates lots of replication traffic when a single server fails...


Re: Hadoop Java Versions

2011-06-27 Thread Ryan Rawson
On the subject of gige vs 10-gige, I think that we will very shortly
be seeing interest in 10gig, since gige is only 120MB/sec - 1 hard
drive of streaming data.  Nodes with 4+ disks are throttled by the
network.  On a small cluster (20 nodes), the replication traffic can
choke a cluster to death.  The only way to fix it quickly is to bring
that node back up.  Perhaps the HortonWorks guys can work on that.

-ryan
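
The arithmetic behind that 120MB/sec figure, roughly (approximate, and
assuming ~100MB/s sustained sequential throughput per 7200rpm SATA disk):

  1 GbE line rate:      1 Gbit/s / 8 bits ~= 125 MB/s (110-120 MB/s after overhead)
  one SATA disk:        ~100 MB/s sustained sequential
  4 disks per node:     ~400 MB/s of aggregate local disk bandwidth
  => a 4-disk node can stream roughly 3-4x more from its own disks than a
     single 1 GbE link can carry, so replication traffic off that node hits
     the network wall long before the disks are the limit.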

On Mon, Jun 27, 2011 at 4:38 AM, Steve Loughran ste...@apache.org wrote:
 On 26/06/11 20:23, Scott Carey wrote:


 On 6/23/11 5:49 AM, Steve Loughranste...@apache.org  wrote:


 what's your HW setup? #cores/server, #servers, underlying OS?

 CentOS 5.6.
 4 cores / 8 threads a server (Nehalem generation Intel processor).


 that should be enough to find problems. I've just moved up to a 6-core 12
 thread desktop and that found problems on some non-Hadoop code, which shows
 that the more threads you have, and the faster the machines are, the more
 your race conditions show up. With Hadoop the fact that you can have 10-1000
 servers means that in a large cluster the probability of that race condition
 showing up scales well.

 Also run a smaller cluster with 2x quad core Core 2 generation Xeons.

 Off topic:
 The single proc Nehalem is faster than the dual core 2's for most use
 cases -- and much lower power.  Looking forward to single proc 4 or 6 core
 Sandy Bridge based systems for the next expansion -- testing 4 core vs 4
 core has these 30% faster than the Nehalem generation systems in CPU bound
 tasks and lower power.  Intel prices single socket Xeons so much lower
 than the Dual socket ones that the best value for us is to get more single
 socket servers rather than fewer dual socket ones (with similar processor
 to hard drive ratio).

 Yes, in a large cluster the price of filling the second socket can compare
 to a lot of storage, and TB of storage is more tangible. I guess it depends
 on your application.

 Regarding Sandy Bridge, I've no experience of those, but I worry that 10
 Gbps is still bleeding edge, and shouldn't be needed for code with good
 locality anyway; it is probably more cost effective to stay at 1Gbps/server,
 though the issue there is the #of HDD/s server generates lots of replication
 traffic when a single server fails...



Re: Hadoop Java Versions

2011-06-27 Thread Ted Dunning
Come to Srivas's talk at the Summit.

On Mon, Jun 27, 2011 at 5:10 PM, Ryan Rawson ryano...@gmail.com wrote:

 On the subject of gige vs 10-gige, I think that we will very shortly
 be seeing interest in 10gig, since gige is only 120MB/sec - 1 hard
 drive of streaming data.  Nodes with 4+ disks are throttled by the
 network.  On a small cluster (20 nodes), the replication traffic can
 choke a cluster to death.  The only way to fix quickly it is to bring
 that node back up.  Perhaps the HortonWorks guys can work on that.

 -ryan

 On Mon, Jun 27, 2011 at 4:38 AM, Steve Loughran ste...@apache.org wrote:
  On 26/06/11 20:23, Scott Carey wrote:
 
 
  On 6/23/11 5:49 AM, Steve Loughranste...@apache.org  wrote:
 
 
  what's your HW setup? #cores/server, #servers, underlying OS?
 
  CentOS 5.6.
  4 cores / 8 threads a server (Nehalem generation Intel processor).
 
 
  that should be enough to find problems. I've just moved up to a 6-core 12
  thread desktop and that found problems on some non-Hadoop code, which
 shows
  that the more threads you have, and the faster the machines are, the more
  your race conditions show up. With Hadoop the fact that you can have
 10-1000
  servers means that in a large cluster the probability of that race
 condition
  showing up scales well.
 
  Also run a smaller cluster with 2x quad core Core 2 generation Xeons.
 
  Off topic:
  The single proc Nehalem is faster than the dual core 2's for most use
  cases -- and much lower power.  Looking forward to single proc 4 or 6
 core
  Sandy Bridge based systems for the next expansion -- testing 4 core vs 4
  core has these 30% faster than the Nehalem generation systems in CPU
 bound
  tasks and lower power.  Intel prices single socket Xeons so much lower
  than the Dual socket ones that the best value for us is to get more
 single
  socket servers rather than fewer dual socket ones (with similar
 processor
  to hard drive ratio).
 
  Yes, in a large cluster the price of filling the second socket can
 compare
  to a lot of storage, and TB of storage is more tangible. I guess it
 depends
  on your application.
 
  Regarding Sandy Bridge, I've no experience of those, but I worry that 10
  Gbps is still bleeding edge, and shouldn't be needed for code with good
  locality anyway; it is probably more cost effective to stay at
 1Gbps/server,
  though the issue there is the #of HDD/s server generates lots of
 replication
  traffic when a single server fails...
 



Re: Hadoop Java Versions

2011-06-27 Thread Segel, Mike
That doesn't seem right.
In one of our test clusters (19 data nodes) we found that under heavy loads we 
were disk I/O bound and not network bound. Of course, YMMV depending on your ToR 
switch. If we had more than 4 disks per node, we would probably see the network 
being the bottleneck. What did you set your bandwidth settings to in 
hdfs-site.xml? (Going from memory, I'm not sure of the exact setting...)

But the good news... newer hardware will start to have 10 GbE on the motherboard.

Sent from a remote device. Please excuse any typos...

Mike Segel

On Jun 27, 2011, at 7:11 PM, Ryan Rawson ryano...@gmail.com wrote:

 On the subject of gige vs 10-gige, I think that we will very shortly
 be seeing interest in 10gig, since gige is only 120MB/sec - 1 hard
 drive of streaming data.  Nodes with 4+ disks are throttled by the
 network.  On a small cluster (20 nodes), the replication traffic can
 choke a cluster to death.  The only way to fix quickly it is to bring
 that node back up.  Perhaps the HortonWorks guys can work on that.
 
 -ryan
 
 On Mon, Jun 27, 2011 at 4:38 AM, Steve Loughran ste...@apache.org wrote:
 On 26/06/11 20:23, Scott Carey wrote:
 
 
 On 6/23/11 5:49 AM, Steve Loughranste...@apache.org  wrote:
 
 
 what's your HW setup? #cores/server, #servers, underlying OS?
 
 CentOS 5.6.
 4 cores / 8 threads a server (Nehalem generation Intel processor).
 
 
 that should be enough to find problems. I've just moved up to a 6-core 12
 thread desktop and that found problems on some non-Hadoop code, which shows
 that the more threads you have, and the faster the machines are, the more
 your race conditions show up. With Hadoop the fact that you can have 10-1000
 servers means that in a large cluster the probability of that race condition
 showing up scales well.
 
 Also run a smaller cluster with 2x quad core Core 2 generation Xeons.
 
 Off topic:
 The single proc Nehalem is faster than the dual core 2's for most use
 cases -- and much lower power.  Looking forward to single proc 4 or 6 core
 Sandy Bridge based systems for the next expansion -- testing 4 core vs 4
 core has these 30% faster than the Nehalem generation systems in CPU bound
 tasks and lower power.  Intel prices single socket Xeons so much lower
 than the Dual socket ones that the best value for us is to get more single
 socket servers rather than fewer dual socket ones (with similar processor
 to hard drive ratio).
 
 Yes, in a large cluster the price of filling the second socket can compare
 to a lot of storage, and TB of storage is more tangible. I guess it depends
 on your application.
 
 Regarding Sandy Bridge, I've no experience of those, but I worry that 10
 Gbps is still bleeding edge, and shouldn't be needed for code with good
 locality anyway; it is probably more cost effective to stay at 1Gbps/server,
 though the issue there is the #of HDD/s server generates lots of replication
 traffic when a single server fails...
 




Re: Hadoop Java Versions

2011-06-27 Thread Ryan Rawson
There are no bandwidth limitations in 0.20.x, none that I saw at
least.  It was basically bandwidth management by PWM: you could
adjust how often, and how many, files per node were copied, but not the
actual transfer rate.

In my case, the load was HBase real time serving, so it was servicing
more smaller random reads, not a map-reduce. Everyone has their own
use case :-)

-ryan

On Mon, Jun 27, 2011 at 6:54 PM, Segel, Mike mse...@navteq.com wrote:
 That doesn't seem right.
 In one of our test clusters (19 data nodes) we found that under heavy loads 
 we were disk I/O bound and not network bound. Of course YMMV depending on 
 your ToR switch. If we had more than 4 disks per node, we would probably see 
 the network being the bottleneck. What did you set your bandwidth settings in 
 the hdfs-site.xml? ( going from memory not sure of the exact setting...)

 But the good news... Newer hardware will start to have 10GBe on the 
 motherboard.

 Sent from a remote device. Please excuse any typos...

 Mike Segel

 On Jun 27, 2011, at 7:11 PM, Ryan Rawson ryano...@gmail.com wrote:

 On the subject of gige vs 10-gige, I think that we will very shortly
 be seeing interest in 10gig, since gige is only 120MB/sec - 1 hard
 drive of streaming data.  Nodes with 4+ disks are throttled by the
 network.  On a small cluster (20 nodes), the replication traffic can
 choke a cluster to death.  The only way to fix quickly it is to bring
 that node back up.  Perhaps the HortonWorks guys can work on that.

 -ryan

 On Mon, Jun 27, 2011 at 4:38 AM, Steve Loughran ste...@apache.org wrote:
 On 26/06/11 20:23, Scott Carey wrote:


 On 6/23/11 5:49 AM, Steve Loughranste...@apache.org  wrote:


 what's your HW setup? #cores/server, #servers, underlying OS?

 CentOS 5.6.
 4 cores / 8 threads a server (Nehalem generation Intel processor).


 that should be enough to find problems. I've just moved up to a 6-core 12
 thread desktop and that found problems on some non-Hadoop code, which shows
 that the more threads you have, and the faster the machines are, the more
 your race conditions show up. With Hadoop the fact that you can have 10-1000
 servers means that in a large cluster the probability of that race condition
 showing up scales well.

 Also run a smaller cluster with 2x quad core Core 2 generation Xeons.

 Off topic:
 The single proc Nehalem is faster than the dual core 2's for most use
 cases -- and much lower power.  Looking forward to single proc 4 or 6 core
 Sandy Bridge based systems for the next expansion -- testing 4 core vs 4
 core has these 30% faster than the Nehalem generation systems in CPU bound
 tasks and lower power.  Intel prices single socket Xeons so much lower
 than the Dual socket ones that the best value for us is to get more single
 socket servers rather than fewer dual socket ones (with similar processor
 to hard drive ratio).

 Yes, in a large cluster the price of filling the second socket can compare
 to a lot of storage, and TB of storage is more tangible. I guess it depends
 on your application.

 Regarding Sandy Bridge, I've no experience of those, but I worry that 10
 Gbps is still bleeding edge, and shouldn't be needed for code with good
 locality anyway; it is probably more cost effective to stay at 1Gbps/server,
 though the issue there is the #of HDD/s server generates lots of replication
 traffic when a single server fails...






Re: Hadoop Java Versions

2011-06-27 Thread Scott Carey
For cost reasons, we just bonded two 1G network ports together.  A single
stream won't go past 1Gbps, but concurrent ones do -- this is with the
Linux built-in bonding.  The network is only stressed during 'sort-like'
jobs or big replication events.
We also removed some disk bottlenecks by tuning the file systems
aggressively -- using a separate partition for the M/R temp and the
location that jars may unpack into helps tremendously.  Ext4 can be
configured to delay flushing to disk for this temp space, which for small
jobs decreases the I/O tremendously as many files are deleted before they
get pushed to disk.
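
Both of those are stock Linux features rather than anything Hadoop-specific:
the bonding driver (e.g. balance-alb or 802.3ad mode) aggregates the two 1G
ports, and the reduced flushing comes from how the M/R temp partition is
mounted. A rough /etc/fstab line for such a temp partition -- device, mount
point and values are illustrative only:

  # noatime/nodiratime avoid metadata writes on every read; a longer journal
  # commit interval lets short-lived intermediate files be created and deleted
  # before they are ever pushed to disk.
  /dev/sdb1  /data/mapred-local  ext4  noatime,nodiratime,data=writeback,commit=30  0 0

with mapred.local.dir (mapred-site.xml) pointed at a directory on that
partition.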

On 6/27/11 5:10 PM, Ryan Rawson ryano...@gmail.com wrote:

On the subject of gige vs 10-gige, I think that we will very shortly
be seeing interest in 10gig, since gige is only 120MB/sec - 1 hard
drive of streaming data.  Nodes with 4+ disks are throttled by the
network.  On a small cluster (20 nodes), the replication traffic can
choke a cluster to death.  The only way to fix quickly it is to bring
that node back up.  Perhaps the HortonWorks guys can work on that.

-ryan

On Mon, Jun 27, 2011 at 4:38 AM, Steve Loughran ste...@apache.org wrote:
 On 26/06/11 20:23, Scott Carey wrote:


 On 6/23/11 5:49 AM, Steve Loughranste...@apache.org  wrote:


 what's your HW setup? #cores/server, #servers, underlying OS?

 CentOS 5.6.
 4 cores / 8 threads a server (Nehalem generation Intel processor).


 that should be enough to find problems. I've just moved up to a 6-core
12
 thread desktop and that found problems on some non-Hadoop code, which
shows
 that the more threads you have, and the faster the machines are, the
more
 your race conditions show up. With Hadoop the fact that you can have
10-1000
 servers means that in a large cluster the probability of that race
condition
 showing up scales well.

 Also run a smaller cluster with 2x quad core Core 2 generation Xeons.

 Off topic:
 The single proc Nehalem is faster than the dual core 2's for most use
 cases -- and much lower power.  Looking forward to single proc 4 or 6
core
 Sandy Bridge based systems for the next expansion -- testing 4 core vs
4
 core has these 30% faster than the Nehalem generation systems in CPU
bound
 tasks and lower power.  Intel prices single socket Xeons so much lower
 than the Dual socket ones that the best value for us is to get more
single
 socket servers rather than fewer dual socket ones (with similar
processor
 to hard drive ratio).

 Yes, in a large cluster the price of filling the second socket can
compare
 to a lot of storage, and TB of storage is more tangible. I guess it
depends
 on your application.

 Regarding Sandy Bridge, I've no experience of those, but I worry that 10
 Gbps is still bleeding edge, and shouldn't be needed for code with good
 locality anyway; it is probably more cost effective to stay at
1Gbps/server,
 though the issue there is the #of HDD/s server generates lots of
replication
 traffic when a single server fails...




Re: Hadoop Java Versions

2011-06-27 Thread Segel, Mike
Hmmm. I could have sworn there was a background balancing bandwidth limiter.

Haven't tested random reads... The last test we did ended up hitting the cache, 
but we didn't push it hard enough to hit network bandwidth limitations... Not 
to say they don't exist. Like I said in the other post, if we had more disks... 
we would hit it. 

We'll have to do more random testing.
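
For that testing, the stock tools are probably enough: TestDFSIO ships in the 
Hadoop test jar for raw HDFS throughput, and HBase's PerformanceEvaluation 
covers the random-read case. Rough invocations (exact jar names, defaults and 
options vary by release):

  # HDFS streaming throughput, 20 files of 1000 MB each
  hadoop jar $HADOOP_HOME/hadoop-*test*.jar TestDFSIO -write -nrFiles 20 -fileSize 1000
  hadoop jar $HADOOP_HOME/hadoop-*test*.jar TestDFSIO -read  -nrFiles 20 -fileSize 1000

  # HBase random reads with 5 concurrent clients
  # (load the table first with: ... PerformanceEvaluation sequentialWrite 5)
  hbase org.apache.hadoop.hbase.PerformanceEvaluation randomRead 5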

Sent from a remote device. Please excuse any typos...

Mike Segel

On Jun 27, 2011, at 9:34 PM, Ryan Rawson ryano...@gmail.com wrote:

 There are no bandwidth limitations in 0.20.x.  None that I saw at
 least.  It was basically bandwidth-management-by-pwm.  You could
 adjust the frequency of how many files-per-node were copied.
 
 In my case, the load was HBase real time serving, so it was servicing
 more smaller random reads, not a map-reduce. Everyone has their own
 use case :-)
 
 -ryan
 
 On Mon, Jun 27, 2011 at 6:54 PM, Segel, Mike mse...@navteq.com wrote:
 That doesn't seem right.
 In one of our test clusters (19 data nodes) we found that under heavy loads 
 we were disk I/O bound and not network bound. Of course YMMV depending on 
 your ToR switch. If we had more than 4 disks per node, we would probably see 
 the network being the bottleneck. What did you set your bandwidth settings 
 in the hdfs-site.xml? ( going from memory not sure of the exact setting...)
 
 But the good news... Newer hardware will start to have 10GBe on the 
 motherboard.
 
 Sent from a remote device. Please excuse any typos...
 
 Mike Segel
 
 On Jun 27, 2011, at 7:11 PM, Ryan Rawson ryano...@gmail.com wrote:
 
 On the subject of gige vs 10-gige, I think that we will very shortly
 be seeing interest in 10gig, since gige is only 120MB/sec - 1 hard
 drive of streaming data.  Nodes with 4+ disks are throttled by the
 network.  On a small cluster (20 nodes), the replication traffic can
 choke a cluster to death.  The only way to fix quickly it is to bring
 that node back up.  Perhaps the HortonWorks guys can work on that.
 
 -ryan
 
 On Mon, Jun 27, 2011 at 4:38 AM, Steve Loughran ste...@apache.org wrote:
 On 26/06/11 20:23, Scott Carey wrote:
 
 
 On 6/23/11 5:49 AM, Steve Loughranste...@apache.org  wrote:
 
 
 what's your HW setup? #cores/server, #servers, underlying OS?
 
 CentOS 5.6.
 4 cores / 8 threads a server (Nehalem generation Intel processor).
 
 
 that should be enough to find problems. I've just moved up to a 6-core 12
 thread desktop and that found problems on some non-Hadoop code, which shows
 that the more threads you have, and the faster the machines are, the more
 your race conditions show up. With Hadoop the fact that you can have 
 10-1000
 servers means that in a large cluster the probability of that race 
 condition
 showing up scales well.
 
 Also run a smaller cluster with 2x quad core Core 2 generation Xeons.
 
 Off topic:
 The single proc Nehalem is faster than the dual core 2's for most use
 cases -- and much lower power.  Looking forward to single proc 4 or 6 core
 Sandy Bridge based systems for the next expansion -- testing 4 core vs 4
 core has these 30% faster than the Nehalem generation systems in CPU bound
 tasks and lower power.  Intel prices single socket Xeons so much lower
 than the Dual socket ones that the best value for us is to get more single
 socket servers rather than fewer dual socket ones (with similar processor
 to hard drive ratio).
 
 Yes, in a large cluster the price of filling the second socket can compare
 to a lot of storage, and TB of storage is more tangible. I guess it depends
 on your application.
 
 Regarding Sandy Bridge, I've no experience of those, but I worry that 10
 Gbps is still bleeding edge, and shouldn't be needed for code with good
 locality anyway; it is probably more cost effective to stay at 
 1Gbps/server,
 though the issue there is the #of HDD/s server generates lots of 
 replication
 traffic when a single server fails...
 
 
 
 




Re: Hadoop Java Versions

2011-06-26 Thread Scott Carey


On 6/23/11 5:49 AM, Steve Loughran ste...@apache.org wrote:

On 22/06/2011 21:27, Scott Carey wrote:
 Problems have been reported with Hadoop, the 64-bit JVM and Compressed
 Object References (the -XX:+UseCompressedOops option), so use of that
 option is discouraged.

 I think the above is dated.  It also lacks critical information. What
JVM
 and OS version was the problem seen?

A colleague saw it, intermittent JVM crashes. Unless he's updated I
can't say the problem has gone away.


 CompressedOops had several issues prior to Jre 6u20, and a few minor
ones
 were fixed in u21.  FWIW, I now exclusively use 64 bit w/ CompressedOops
 for all Hadoop and non-Hadoop apps and have seen no issues.  It is the
 default in 6u24 and 6u25 on a 64 bit  JVM.


what's your HW setup? #cores/server, #servers, underlying OS?

CentOS 5.6.  
4 cores / 8 threads a server (Nehalem generation Intel processor).

Also run a smaller cluster with 2x quad core Core 2 generation Xeons.

Off topic:
The single-proc Nehalem is faster than the dual Core 2s for most use
cases -- and much lower power.  Looking forward to single-proc 4 or 6 core
Sandy Bridge based systems for the next expansion -- testing 4 core vs 4
core shows these about 30% faster than the Nehalem-generation systems in
CPU-bound tasks, at lower power.  Intel prices single-socket Xeons so much
lower than the dual-socket ones that the best value for us is to get more
single-socket servers rather than fewer dual-socket ones (with a similar
processor-to-hard-drive ratio).  We are power constrained per rack either way.



Re: Hadoop Java Versions

2011-06-23 Thread Steve Loughran

On 22/06/2011 21:27, Scott Carey wrote:

Problems have been reported with Hadoop, the 64-bit JVM and Compressed
Object References (the -XX:+UseCompressedOops option), so use of that
option is discouraged.

I think the above is dated.  It also lacks critical information. What JVM
and OS version was the problem seen?


A colleague saw it: intermittent JVM crashes. Unless he's updated, I 
can't say the problem has gone away.




CompressedOops had several issues prior to Jre 6u20, and a few minor ones
were fixed in u21.  FWIW, I now exclusively use 64 bit w/ CompressedOops
for all Hadoop and non-Hadoop apps and have seen no issues.  It is the
default in 6u24 and 6u25 on a 64 bit  JVM.



what's your HW setup? #cores/server, #servers, underlying OS?


Re: Hadoop Java Versions

2011-06-22 Thread Scott Carey
Problems have been reported with Hadoop, the 64-bit JVM and Compressed
Object References (the -XX:+UseCompressedOops option), so use of that
option is discouraged.

I think the above is dated.  It also lacks critical information: what JVM
and OS version was the problem seen on?

CompressedOops had several issues prior to Jre 6u20, and a few minor ones
were fixed in u21.  FWIW, I now exclusively use 64 bit w/ CompressedOops
for all Hadoop and non-Hadoop apps and have seen no issues.  It is the
default in 6u24 and 6u25 on a 64 bit JVM.
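
For anyone updating the wiki section: on a 64-bit JVM the flag goes in like
any other JVM option, e.g. via the daemon options in hadoop-env.sh and via
mapred.child.java.opts for the task JVMs. A sketch, assuming 0.20-style
config files (and unnecessary on the 6u24/6u25 JVMs mentioned above, where
it is already the default):

  # hadoop-env.sh: enable compressed oops for the HDFS daemons
  export HADOOP_NAMENODE_OPTS="-XX:+UseCompressedOops $HADOOP_NAMENODE_OPTS"
  export HADOOP_DATANODE_OPTS="-XX:+UseCompressedOops $HADOOP_DATANODE_OPTS"

  <!-- mapred-site.xml: and for the task JVMs -->
  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx1024m -XX:+UseCompressedOops</value>
  </property>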

On 6/14/11 5:16 PM, Allen Wittenauer a...@apache.org wrote:


While we're looking at the wiki, could folks update
http://wiki.apache.org/hadoop/HadoopJavaVersions with whatever versions
of Hadoop they are using successfully?

Thanks.

P.S., yes, I'm thinking about upgrading ours. :p



Re: Hadoop Java Versions

2011-06-22 Thread Allen Wittenauer

On Jun 22, 2011, at 1:27 PM, Scott Carey wrote:

 Problems have been reported with Hadoop, the 64-bit JVM and Compressed
 Object References (the -XX:+UseCompressedOops option), so use of that
 option is discouraged.
 
 I think the above is dated.  It also lacks critical information. What JVM
 and OS version was the problem seen?

That is definitely dated.  Those were problems from 18 and 19.

 
 CompressedOops had several issues prior to Jre 6u20, and a few minor ones
 were fixed in u21.  FWIW, I now exclusively use 64 bit w/ CompressedOops
 for all Hadoop and non-Hadoop apps and have seen no issues.  It is the
 default in 6u24 and 6u25 on a 64 bit JVM.

We use compressedoops on u21 every so often as well.  

Feel free to edit that whole section. :)

Thanks!



Re: Hadoop Java Versions

2011-06-22 Thread Scott Carey


On 6/22/11 1:49 PM, Allen Wittenauer a...@apache.org wrote:


On Jun 22, 2011, at 1:27 PM, Scott Carey wrote:

 Problems have been reported with Hadoop, the 64-bit JVM and Compressed
 Object References (the -XX:+UseCompressedOops option), so use of that
 option is discouraged.
 
 I think the above is dated.  It also lacks critical information. What
JVM
 and OS version was the problem seen?

That is definitely dated.  Those were problems from 18 and 19.

 
 CompressedOops had several issues prior to Jre 6u20, and a few minor
ones
 were fixed in u21.  FWIW, I now exclusively use 64 bit w/ CompressedOops
 for all Hadoop and non-Hadoop apps and have seen no issues.  It is the
 default in 6u24 and 6u25 on a 64 bit JVM.

   We use compressedoops on 21 every-so-often as well.

   Feel free to edit that whole section. :)

   Thanks!

Updated the CompressedOops section.

I am using 6u25 in production and have used 6u24 as well (64-bit
version, CentOS 5.5).
We usually roll out JVMs to a subset of nodes, then, after a little
burn-in, to the whole thing.  First on development / QA clusters (which are
just as busy as production, but smaller), then production.