Hi Emmanuel (Klaus and Joe),

The example data was meant to demonstrate that tie-breaking in nj, or rather the lack of any way to deal with ties, is affecting the bootstrap results. I've noticed that a bunch of identical sequences form a 'polytomy' in my real dataset (though there are obviously still arbitrary relationships hidden within the phylogram), and that within this polytomy the bootstrap supports for the relationships between samples are all 100 (as viewed on the cladogram). That is why I used the example data, even though it is flawed, as you point out.

I would expect the NJ tree of the example data to be a random tree that is determined based on the input order of samples. What I would also expect is that the bootstrap values would be very very low.
These are the results that one gets from PAUP if you input the example data.

Joe said:
"When there are ties in the distance matrix, NJ and UPGMA can lead
to a situation where the ties get resolved in an arbitrary way
dependent only on the input order of species."

It is the problem of tie-breaking that I am focussing on.
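
A minimal sketch of the order dependence Joe describes, assuming the ape package is installed. The toy alignment (six identical sequences named t1 to t6) is hypothetical and is not the example data from this thread; it only guarantees that every pairwise distance is tied.

```r
library(ape)

set.seed(1)

# Hypothetical toy alignment: six identical sequences, so every pairwise
# distance is tied at zero. This is NOT the data discussed in the thread.
tips <- paste0("t", 1:6)
seqs <- matrix(rep(c("a", "c", "g", "t"), each = 5), nrow = 6, ncol = 20,
               byrow = TRUE, dimnames = list(tips, NULL))
a <- as.DNAbin(seqs)

# Same data, two different input orders of the taxa.
tr_original <- nj(dist.dna(a, model = "N"))
tr_shuffled <- nj(dist.dna(a[sample(nrow(a)), ], model = "N"))

# Topological distance between the two trees, matched by tip label.
# A value of 0 would mean the two input orders gave the same topology.
dist.topo(tr_original, tr_shuffled)
```

If the distance is non-zero, the only thing that changed between the two runs is the row order, which is the arbitrary tie resolution in question.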

Joe also said that:
"In PHYLIP we have put code in that, when multiple bootstrap
replicates are being analyzed, automatically defaults to randomizing
the order of species in the data set.  That solves the problem, and
is perhaps what PAUP* does too. "

I tried to replicate this using xx[sample(1:nrow(xx)),] in the following code:

boot.phylo(tr, a, function(xx)nj(dist.dna(xx[sample(1:nrow(xx)),],model="N")),B=100)

But the bootstrap results remain unchanged (all 100 - which is incorrect).

Looking through the boot.phylo function, I believe that this is because the prop.part function is not checking the names of the tips - thus it thinks the input order remains constant (and the same 'random' tree is always being generated - thus the support values of 100).
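
One way to test this suspicion without relying on boot.phylo's internals is to run the bootstrap by hand and count support by tip name. The sketch below assumes the ape package is installed; the toy alignment, the helper one_replicate() and the chosen pair of tips are all hypothetical and only stand in for the real data.

```r
library(ape)

set.seed(42)

# Hypothetical toy alignment (not the thread's data): six identical sequences.
tips <- paste0("t", 1:6)
seqs <- matrix(rep(c("a", "c", "g", "t"), each = 5), nrow = 6, ncol = 20,
               byrow = TRUE, dimnames = list(tips, NULL))
a <- as.DNAbin(seqs)

# One bootstrap replicate (hypothetical helper): resample the sites with
# replacement, shuffle the taxon order, then build the NJ tree.
one_replicate <- function(x) {
  cols <- sample(ncol(x), replace = TRUE)
  rows <- sample(nrow(x))
  nj(dist.dna(x[rows, cols], model = "N"))
}

B <- 100
boot_trees <- lapply(seq_len(B), function(i) one_replicate(a))

# Support for one candidate group, identified by tip NAME rather than by
# tip position, so shuffling the rows cannot be mistaken for a fixed order.
pair <- c("t1", "t2")
support <- mean(vapply(boot_trees, is.monophyletic, logical(1), tips = pair))
support
```

Because the group is looked up by label in every replicate, a support value well below 1 here would suggest that the 100s come from how the clades are compared rather than from the trees themselves.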

Randomising the input sample has been included in the bootstrapping for both PAUP and PHYLIP. However, it is not implemented in the boot.phylo() function. I think it needs to be - or am I again missing something?

Cheers,
Alastair



On 2011/05/06 09:30 AM, Emmanuel Paradis wrote:
Hi Alastair, Klaus & Joe,

Before doing the tree, you should do some preliminary data explorations, such as:

d <- dist.dna(a)
hist(d)
summary(d)

That would show you that any tree estimation procedure (not only NJ) has very little meaning with these data -- just as you would do plot(x, y) before doing lm(y ~ x).

Best,

Emmanuel

Alastair Potts wrote on 06/05/2011 08:45:
Hi Klaus and Joe,

Thanks very much for your responses.

From Klaus:
it is not that surprising. NJ normally does not produce polytomies,
just edge weights of length 0. How these are broken may depend on
the input order (from labels in the distance matrix like in this
implementation) or could be broken randomly.  I added some code below
to highlight it.

>c$edge.length[] = 1

This does convert the tree into what I would expect from PAUP. But how do I include this in the bootstrapping? Surely edge length being set to 0 is not responsible for the observed bootstrap values of 100 for all nodes?
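
As a side note on Klaus's point that NJ returns zero-length edges rather than polytomies: ape's di2multi() collapses such edges into true multifurcations. This is a minimal sketch on a hypothetical toy alignment (not the thread's data), assuming ape is installed; it is a different operation from setting c$edge.length[] = 1, which only changes how the tree is drawn.

```r
library(ape)

# Hypothetical toy alignment (not the thread's data): six identical sequences.
tips <- paste0("t", 1:6)
seqs <- matrix(rep(c("a", "c", "g", "t"), each = 5), nrow = 6, ncol = 20,
               byrow = TRUE, dimnames = list(tips, NULL))
a <- as.DNAbin(seqs)

# NJ returns a fully resolved tree even when all distances are tied.
tr <- nj(dist.dna(a, model = "N"))

# Collapse internal edges shorter than the tolerance into multifurcations.
tr_collapsed <- di2multi(tr, tol = 1e-8)

# Number of internal nodes before and after collapsing.
c(before = Nnode(tr), after = Nnode(tr_collapsed))
```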

From Joe:

In PHYLIP we have put code in that, when multiple bootstrap
replicates are being analyzed, automatically defaults to randomizing
the order of species in the data set.  That solves the problem, and
is perhaps what PAUP* does too.

I had thought that this may be a problem, hence I did randomise the order of the DNA data going into the bootstrap function:

boot.phylo(tr, a, function(xx)nj(dist.dna(xx[sample(1:nrow(xx)),],model="N")),B=100) # the sample in xx[sample(1:nrow(xx)),] should re-order the dna data each time.

So the 'polytomy' issue is solved: I just need to set the edge lengths. But what about the bootstrap, where we should be getting exceptionally low support values?

Thanks very much for your time and help,
Cheers,
Alastair


Klaus Schliep wrote --

it is not that surprising. NJ normally does not produce polytomies,
just edge weights of length 0. How these are broken may depend on
the input order (from labels in the distance matrix like in this
implementation) or could be broken randomly.  I added some code below
to highlight it.

Joe Felsenstein wrote --

When there are ties in the distance matrix, NJ and UPGMA can lead
to a situation where the ties get resolved in an arbitrary way
dependent only on the input order of species.

This can lead to excessive support for group AB in a situation
where there are three species ABC and their further resolution is
arbitrary.   See Backeljau et al. MBE 1996.  Farris et al. (Cladistics,
1996) pointed out that this can lead to strong support for AB that
is illusory.  (I commented on this on pages 168-169 of my phylogeny
book).

In PHYLIP we have put code in that, when multiple bootstrap
replicates are being analyzed, automatically defaults to randomizing
the order of species in the data set.  That solves the problem, and
is perhaps what PAUP* does too.

Joe
----
Joe Felsenstein         j...@gs.washington.edu
  Department of Genome Sciences and Department of Biology,
  University of Washington, Box 355065, Seattle, WA 98195-5065 USA


_______________________________________________
R-sig-phylo mailing list
R-sig-phylo@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-phylo


