Re: [R-sig-phylo] Bootstrap values and NJ when there is no genetic distance between samples

Alastair Potts Sat, 07 May 2011 10:08:25 -0700

Hi Emmanuel (Klaus and Joe),

The example data was meant to demonstrate that the tie-breaking in nj is affecting the bootstrap results - or rather the lack of any way to deal with tie breaking. I've noticed that a bunch of identical sequences form a 'polytomy' in my real dataset (but obviously there are still arbitrary relationships hidden within the phylogram) and that within this polytomy the relationship bootstrap supports are 100 between samples (as viewed on the cladogram). That is why I used the example data (even though it is flawed - as you point out).

I would expect the NJ tree of the example data to be a random tree that is determined by the input order of samples. I would also expect the bootstrap values to be very, very low.

These are the results that one gets from PAUP if you input the example data.

Joe said:
"When there are ties in the distance matrix, NJ and UPGMA can lead
to a situation where the ties get resolved in an arbitrary way
dependent only on the input order of species."

It is the problem of tie-breaking that I am focussing on.
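
To make the tie-breaking concrete, here is a minimal sketch (not the example data from this thread): five hypothetical taxa whose pairwise distances are all zero, run through nj() in two different input orders. The labels, the permutation and the object names are assumptions made up for illustration.

```r
library(ape)

# Hypothetical toy case: five taxa, every pairwise distance equal to 0,
# so every choice NJ makes is a tie.
labs <- c("A", "B", "C", "D", "E")
d <- matrix(0, 5, 5, dimnames = list(labs, labs))

# Same distances, two different input orders.
tr1 <- nj(as.dist(d))
perm <- c(3, 5, 1, 4, 2)            # arbitrary reordering of the same taxa
tr2 <- nj(as.dist(d[perm, perm]))

# Topological distance between the two trees, compared by tip label.
# A non-zero value would mean the ties were resolved by input order.
dist.topo(tr1, tr2)
```

If the two topologies differ, the only thing that changed between the runs is the order of the rows, which is the situation Joe describes.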

Joe also said that:
"In PHYLIP we have put code in that, when multiple bootstrap
replicates are being analyzed, automatically defaults to randomizing
the order of species in the data set.  That solves the problem, and
is perhaps what PAUP* does too. "

I tried to replicate this using xx[sample(1:nrow(xx)),] in the following code:

boot.phylo(tr, a, function(xx) nj(dist.dna(xx[sample(1:nrow(xx)),], model="N")), B=100)

But the bootstrap results remain unchanged (all 100 - which is incorrect).

Looking through the boot.phylo function, I believe that this is because the prop.part function is not checking the names of the tips - thus it thinks the input order remains constant (and the same 'random' tree is always being generated - thus the support values of 100).

Randomising the input sample has been included in the bootstrapping for both PAUP and PHYLIP. However, it is not implemented in the boot.phylo() function. I think it needs to be - or am I again missing something?
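
For what it is worth, here is a rough sketch of the kind of bootstrap I have in mind, written by hand rather than through boot.phylo(): each replicate resamples the sites and also shuffles the row order, and the clades are then counted afterwards. The toy alignment (six identical sequences), the helper name one_replicate and the object names are hypothetical. It also assumes an ape version in which prop.clades() matches tips by label rather than by position; if it does not, the counts would suffer from the same problem described above.

```r
library(ape)

set.seed(42)  # only to make the sketch reproducible

# Hypothetical stand-in for the example data: six identical sequences.
seqs <- matrix("a", nrow = 6, ncol = 20,
               dimnames = list(paste0("s", 1:6), NULL))
a <- as.DNAbin(seqs)

# Reference tree estimated from the data in their original order.
tr <- nj(dist.dna(a, model = "N"))

# One bootstrap replicate: resample sites with replacement, then shuffle
# the row (taxon) order before the distances reach nj().
one_replicate <- function(x) {
  cols <- sample(ncol(x), replace = TRUE)
  rows <- sample(nrow(x))
  nj(dist.dna(x[rows, cols], model = "N"))
}

boot_trees <- lapply(seq_len(100), function(i) one_replicate(a))
class(boot_trees) <- "multiPhylo"

# Count how often each clade of tr recurs among the replicates.
# Assumption: prop.clades() matches tips by label in the installed ape.
support <- prop.clades(tr, boot_trees)
```

With tips matched by name and the order shuffled in every replicate, I would expect the counts in support to fall well below 100 for data like these.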


Cheers,
Alastair



On 2011/05/06 09:30 AM, Emmanuel Paradis wrote:

Hi Alastair, Klaus & Joe,

Before doing the tree, you should do some preliminary data explorations, such as:


d <- dist.dna(a)
hist(d)
summary(d)

That'd show you that any tree estimation procedure (not only NJ) has very little meaning here -- just as you would do plot(x, y) before doing lm(y ~ x).
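
As a small extension of the same exploration, the number of zero and tied distances can be counted directly. This is only a sketch on a made-up four-sequence alignment; the sequences and the names n_zero and n_tied are assumptions for illustration, not the data discussed in this thread.

```r
library(ape)

# Hypothetical alignment: four short sequences, two of them identical.
seqs <- rbind(s1 = c("a", "c", "g", "t", "a"),
              s2 = c("a", "c", "g", "t", "a"),
              s3 = c("a", "c", "g", "t", "c"),
              s4 = c("a", "c", "g", "g", "c"))
a <- as.DNAbin(seqs)

# Raw number of differing sites between each pair of sequences.
d <- dist.dna(a, model = "N")

# How many pairs are identical, and how many distances repeat an earlier value.
n_zero <- sum(d == 0)
n_tied <- sum(duplicated(as.vector(d)))

# Full breakdown of the distance values.
table(as.vector(d))
```

Many zeros or repeated values in that table are a sign that the tree, and any support values on it, will depend on how ties happen to be broken.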


Best,

Emmanuel

Alastair Potts wrote on 06/05/2011 08:45:

Hi Klaus and Joe,

Thanks very much for your responses.

 From Klaus:

it is not that surprising. NJ normally does not produce polytomies,
just edge weights of length 0. How these are broken may depend on
the input order (from labels in the distance matrix like in this
implementation) or could be broken randomly.  I added some code below
to highlight it.


>c$edge.length[] = 1

This does convert the tree into what I would expect from PAUP. But how do I include this in the bootstrapping? Surely edge length being set to 0 is not responsible for the observed bootstrap values of 100 for all nodes?


 From Joe:

In PHYLIP we have put code in that, when multiple bootstrap
replicates are being analyzed, automatically defaults to randomizing
the order of species in the data set.  That solves the problem, and
is perhaps what PAUP* does too.

I had thought that this may be a problem, hence I did randomise the order of the DNA data going into the bootstrap function:

boot.phylo(tr, a, function(xx) nj(dist.dna(xx[sample(1:nrow(xx)),], model="N")), B=100)
# the sample in xx[sample(1:nrow(xx)),] should re-order the DNA data each time.

So, the 'polytomy' issue is solved. Just need to set edge.lengths. But what about the bootstrap where we should be getting exceptionally low bootstrap values?


Thanks very much for your time and help,
Cheers,
Alastair

Klaus Schliep wrote --

it is not that surprising. NJ normally does not produce polytomies,
just edge weights of length 0. How these are broken may depend on
the input order (from labels in the distance matrix like in this
implementation) or could be broken randomly.  I added some code below
to highlight it.

Joe Felsenstein wrote --

When there are ties in the distance matrix, NJ and UPGMA can lead
to a situation where the ties get resolved in an arbitrary way
dependent only on the input order of species.

This can lead to excessive support for group AB in a situation
where there are three species ABC and their further resolution is
arbitrary.   See Backeljau et al. MBE 1996.  Farris et al. (Cladistics,
1996) pointed out that this can lead to strong support for AB that
is illusory.  (I commented on this on pages 168-169 of my phylogeny
book).

In PHYLIP we have put code in that, when multiple bootstrap
replicates are being analyzed, automatically defaults to randomizing
the order of species in the data set.  That solves the problem, and
is perhaps what PAUP* does too.

Joe
----
Joe Felsenstein         j...@gs.washington.edu
  Department of Genome Sciences and Department of Biology,
  University of Washington, Box 355065, Seattle, WA 98195-5065 USA


_______________________________________________
R-sig-phylo mailing list
R-sig-phylo@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-phylo


