Hi Emmanuel (Klaus and Joe),
The example data was meant to demonstrate that the tie-breaking in nj is
affecting the bootstrap results - or rather the lack of any way to deal
with tie breaking.
I've noticed that a bunch of identical sequences form a 'polytomy' in my
real dataset (although arbitrary relationships are obviously still
hidden within the phylogram) and that within this polytomy the bootstrap
supports for relationships between samples are all 100 (as viewed on the
cladogram). That is why I used the example data (even though it is
flawed, as you point out).
I would expect the NJ tree of the example data to be an arbitrary tree
determined by the input order of the samples.
I would also expect the bootstrap values to be very low.
This is what one gets from PAUP if you input the example data.
Joe said:
"When there are ties in the distance matrix, NJ and UPGMA can lead
to a situation where the ties get resolved in an arbitrary way
dependent only on the input order of species."
It is the problem of tie-breaking that I am focussing on.
Joe also said that:
"In PHYLIP we have put code in that, when multiple bootstrap
replicates are being analyzed, automatically defaults to randomizing
the order of species in the data set. That solves the problem, and
is perhaps what PAUP* does too. "
I tried to replicate this using xx[sample(1:nrow(xx)),] in the following
code.
boot.phylo(tr, a,
           function(xx) nj(dist.dna(xx[sample(1:nrow(xx)), ], model = "N")),
           B = 100)
But the bootstrap results remain unchanged (all 100 - which is incorrect).
Looking through the boot.phylo function, I believe that this is because
the prop.part function is not checking the names of the tips - thus it
thinks the input order remains constant (and the same 'random' tree is
always being generated - thus the support values of 100).
Randomising the input sample has been included in the bootstrapping for
both PAUP and PHYLIP. However, it is not implemented in the boot.phylo()
function. I think it needs to be - or am I again missing something?
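To make the idea concrete, here is a minimal sketch of a hand-rolled
bootstrap that shuffles the taxon order in every replicate and then
counts clades with prop.clades(), which as far as I know compares
clades by tip label in current ape versions. The six identical
sequences and the helper name nj_shuffled are hypothetical stand-ins
for the example data, not the real dataset, and this is not a patch to
boot.phylo() itself.

```r
library(ape)

set.seed(42)

# Hypothetical stand-in for the example data: six identical sequences,
# so every pairwise distance is tied at 0.
seqs <- matrix("a", nrow = 6, ncol = 20,
               dimnames = list(paste0("s", 1:6), NULL))
a <- as.DNAbin(seqs)

# NJ on the alignment after shuffling the row (taxon) order.
# nj_shuffled is a name made up for this sketch.
nj_shuffled <- function(xx) {
  nj(dist.dna(xx[sample(nrow(xx)), ], model = "N"))
}

tr <- nj_shuffled(a)   # reference tree

# Resample sites with replacement, then rebuild the tree with a fresh
# random taxon order for every replicate.
B <- 100
boot_trees <- lapply(seq_len(B), function(i) {
  cols <- sample(ncol(a), replace = TRUE)
  nj_shuffled(a[, cols])
})

# Count how often each clade of tr appears among the replicates.
support <- prop.clades(tr, boot_trees)
print(support)
```

If the supports from something like this still come out at 100, the
label matching would be the next thing I would check.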
Cheers,
Alastair
On 2011/05/06 09:30 AM, Emmanuel Paradis wrote:
Hi Alastair, Klaus & Joe,
Before doing the tree, you should do some preliminary data
explorations, such as:
d <- dist.dna(a)
hist(d)
summary(d)
That would show you that any tree estimation procedure (not only NJ)
has very little meaning here. It is the same idea as doing plot(x, y)
before doing lm(y ~ x).
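As a rough sketch of what such an exploration can reveal, the snippet
below tabulates the distance values and reports the share of pairs with
distance 0. The alignment is hypothetical (five identical sequences plus
one that differs at two sites) and only stands in for the real data.

```r
library(ape)

# Hypothetical alignment: five identical sequences and one that
# differs from them at two sites.
seqs <- matrix("a", nrow = 6, ncol = 20,
               dimnames = list(paste0("s", 1:6), NULL))
seqs[6, 1:2] <- "g"
a <- as.DNAbin(seqs)

d  <- dist.dna(a, model = "N")   # number of differing sites per pair
dv <- as.vector(d)               # plain numeric vector of the distances

tab       <- table(dv)           # how often each distance value occurs
prop_zero <- mean(dv == 0)       # share of pairs that are identical

print(tab)
print(prop_zero)
```

A table dominated by one or two values means the distance matrix is
mostly ties, so most of the branching order is decided by tie-breaking
rather than by the data.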
Best,
Emmanuel
Alastair Potts wrote on 06/05/2011 08:45:
Hi Klaus and Joe,
Thanks very much for your responses.
From Klaus:
it is not that surprising. NJ normally does not produce polytomies,
just edge weights of length 0. How these are broken may depend on
the input order (from labels in the distance matrix like in this
implementation) or could be broken randomly. I added some code below
to highlight it.
>c$edge.length[] = 1
This does convert the tree into what I would expect from PAUP. But
how do I include this in the bootstrapping? Surely edge length being
set to 0 is not responsible for the observed bootstrap values of 100
for all nodes?
From Joe:
In PHYLIP we have put code in that, when multiple bootstrap
replicates are being analyzed, automatically defaults to randomizing
the order of species in the data set. That solves the problem, and
is perhaps what PAUP* does too.
I had thought that this may be a problem, hence I did randomise the
order of the DNA data going into the bootstrap function:
boot.phylo(tr, a,
           function(xx) nj(dist.dna(xx[sample(1:nrow(xx)), ], model = "N")),
           B = 100)
# the sample in xx[sample(1:nrow(xx)),] should re-order the dna data each time.
So, the 'polytomy' issue is solved. Just need to set edge.length.
But what about the bootstrap where we should be getting exceptionally
low bootstrap values?
Thanks very much for your time and help,
Cheers,
Alastair
Klaus Schliep wrote --
it is not that surprising. NJ normally does not produce polytomies,
just edge weights of length 0. How these are broken may depend on
the input order (from labels in the distance matrix like in this
implementation) or could be broken randomly. I added some code below
to highlight it.
Joe Felsenstein wrote --
When there are ties in the distance matrix, NJ and UPGMA can lead
to a situation where the ties get resolved in an arbitrary way
dependent only on the input order of species.
This can lead to excessive support for group AB in a situation
where there are three species ABC and their further resolution is
arbitrary. See Backeljau et al. MBE 1996. Farris et al. (Cladistics,
1996) pointed out that this can lead to strong support for AB that
is illusory. (I commented on this on pages 168-169 of my phylogeny
book).
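The three-species case can be sketched in base R. This uses
hclust(method = "average"), i.e. UPGMA, as a stand-in rather than NJ or
any PHYLIP code, and the labels A, B, C and the helper first_pair are
made up for the illustration. All pairwise distances are tied, so the
first pair joined is decided purely by position in the input.

```r
# Three species with all pairwise distances tied at 1.
labs <- c("A", "B", "C")
d1 <- as.dist(matrix(1, 3, 3, dimnames = list(labs, labs)))

# The same distances with the species entered in a rotated order.
rot <- labs[c(2, 3, 1)]
d2 <- as.dist(matrix(1, 3, 3, dimnames = list(rot, rot)))

# Labels of the two species joined in the first UPGMA merge.
# first_pair is a hypothetical helper for this sketch.
first_pair <- function(d) {
  h <- hclust(d, method = "average")
  # Row 1 of merge holds two negative indices: the first two tips joined.
  sort(h$labels[-h$merge[1, ]])
}

print(first_pair(d1))
print(first_pair(d2))
```

The two calls report different pairs even though the distances are
identical, because the tie is broken by input position. If every
bootstrap replicate uses the same input order, the same arbitrary pair
is recovered every time.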
In PHYLIP we have put code in that, when multiple bootstrap
replicates are being analyzed, automatically defaults to randomizing
the order of species in the data set. That solves the problem, and
is perhaps what PAUP* does too.
Joe
----
Joe Felsenstein j...@gs.washington.edu
Department of Genome Sciences and Department of Biology,
University of Washington, Box 355065, Seattle, WA 98195-5065 USA
_______________________________________________
R-sig-phylo mailing list
R-sig-phylo@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-phylo