[R-sig-phylo] Nucleotide diversity - different definitions and algorithms: pegas vs Arlequin
Good day all,

The estimates of nucleotide diversity vary hugely between pegas and Arlequin, yet both cite Nei (1987). For example, with the same dataset:

Arlequin = 0.7158
pegas = 0.003926426

Their definitions of nucleotide diversity are also very different:

Arlequin: It is computed here as the probability that two randomly chosen homologous (nucleotide or RFLP) sites are different.

pegas: The nucleotide diversity is the sum of the number of differences between pairs of sequences divided by the number of comparisons (i.e. n(n - 1)/2, where n is the number of sequences).

Now, which is correct? Or more correct? How are they linked? Which should I use in a paper?

Thanks very much for any help,
Confused Al

___
R-sig-phylo mailing list
R-sig-phylo@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-phylo
Re: [R-sig-phylo] Nucleotide diversity - different definitions and algorithms: pegas vs Arlequin
Hi Alastair,

Alastair Potts wrote on 01/08/2011 14:19:
> Good day all,
>
> The estimate of nucleotide diversity vary hugely between pegas and Arlequin, yet they both cite Nei (1987). For example,
> Arlequin = 0.7158
> pegas = 0.003926426
> With the same dataset.
>
> Their definitions of nucleotide diversity are also very different:
> Arlequin: It is computed here as the probability that two randomly chosen homologous (nucleotide or RFLP) sites are different.
> pegas: The nucleotide diversity is the sum of the number of differences between pairs of sequences divided by the number of comparisons (i.e. n(n - 1)/2, where n is the number of sequences).

This is a way to calculate the above. Here's a function that randomly samples one site and two rows from a set of DNA sequences a large number of times and returns the proportion of cases where the two nucleotides differ:

    f <- function(x, nrep = 1e4) {
        n <- nrow(x)   # number of sequences
        s <- ncol(x)   # number of sites
        count <- 0
        for (i in 1:nrep) {
            # draw two distinct sequences and one site
            y <- x[sample(n, 2), sample(s, 1), drop = TRUE]
            if (y[1] != y[2]) count <- count + 1
        }
        count / nrep   # proportion of draws that differed
    }

It requires data with no ambiguous bases, e.g.:

    library(phangorn)   # provides the Laurasiatherian data
    library(pegas)      # provides nuc.div()
    data(Laurasiatherian)
    X <- as.DNAbin(Laurasiatherian)
    f(X)         # [1] 0.1458 (random, so the value varies between runs)
    nuc.div(X)   # [1] 0.1448687

Can you try this on your data?

Cheers,

Emmanuel

> Now, which is correct? Or more correct? How are they linked? Which should I use in a paper?
>
> Thanks very much for any help,
> Confused Al

--
Emmanuel Paradis
IRD, Jakarta, Indonesia
http://ape.mpl.ird.fr/
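The link between the two definitions can also be checked without sampling. The following is a minimal sketch in base R, assuming a made-up toy alignment stored as a character matrix (not a DNAbin object); the names `aln`, `pairwise_pi` and `mc_pi` are hypothetical and are not part of pegas or Arlequin. It computes the mean proportion of differing sites over all n(n - 1)/2 pairs exactly, and compares it with a sampling estimate in the spirit of f() above.

```r
# Toy alignment: 4 sequences (rows) x 6 sites (columns); made-up data.
aln <- rbind(
  s1 = c("a", "c", "g", "t", "a", "c"),
  s2 = c("a", "c", "g", "t", "a", "t"),
  s3 = c("a", "t", "g", "t", "a", "c"),
  s4 = c("g", "c", "g", "t", "a", "c")
)

# Exact value: proportion of differing sites for each pair of sequences,
# averaged over all n(n - 1)/2 pairs.
pairwise_pi <- function(x) {
  pairs <- combn(nrow(x), 2)   # each column holds one pair of row indices
  p <- apply(pairs, 2, function(ij) mean(x[ij[1], ] != x[ij[2], ]))
  mean(p)
}

# Sampling estimate: probability that two randomly chosen sequences
# differ at one randomly chosen site.
mc_pi <- function(x, nrep = 1e4) {
  hits <- replicate(nrep, {
    rows <- sample(nrow(x), 2)   # two distinct sequences
    site <- sample(ncol(x), 1)   # one site
    x[rows[1], site] != x[rows[2], site]
  })
  mean(hits)
}

set.seed(1)
exact  <- pairwise_pi(aln)   # 0.25 for this toy alignment
approx <- mc_pi(aln)         # close to 0.25, varies with the seed
```

Because every pair and every site is equally likely to be drawn, the sampling estimate converges to the exact pairwise average, which is why the two descriptions refer to the same quantity when expressed per site.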
Re: [R-sig-phylo] Nucleotide diversity - different definitions and algorithms: pegas vs Arlequin
Hi Emmanuel,

Thanks for the function. However, the problem actually lies with my misunderstanding of the Arlequin output: I was mistaking the gene diversity result for nucleotide diversity (nucleotide diversity is the gene diversity divided by L loci). Both of your functions produce results very similar to Arlequin's when I look up the correct nucleotide diversity.

My humblest apologies for wasting your time,

Cheers,
Incompetent Al

On 2011/08/01 02:59 PM, Emmanuel Paradis wrote:
> Can you try this on your data?
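The mix-up described above is easy to reproduce on a small example. This is only a sketch under stated assumptions: it uses a made-up toy alignment, the textbook estimator of gene (haplotype) diversity attributed to Nei (1987), H = n/(n - 1) * (1 - sum(p_i^2)), and the mean per-site pairwise difference for nucleotide diversity. It does not call Arlequin or pegas and is not claimed to reproduce either program's output; the variable names are hypothetical.

```r
# Toy alignment: 5 sequences x 6 sites; s1 and s5 are identical (made-up data).
aln <- rbind(
  s1 = c("a", "c", "g", "t", "a", "c"),
  s2 = c("a", "c", "g", "t", "a", "t"),
  s3 = c("a", "t", "g", "t", "a", "c"),
  s4 = c("g", "c", "g", "t", "a", "c"),
  s5 = c("a", "c", "g", "t", "a", "c")
)
n <- nrow(aln)

# Gene (haplotype) diversity: treats each whole sequence as one allele.
hap  <- apply(aln, 1, paste, collapse = "")   # one string per sequence
freq <- as.numeric(table(hap)) / n            # haplotype frequencies
gene_div <- n / (n - 1) * (1 - sum(freq^2))   # 0.9 here

# Nucleotide diversity: mean proportion of differing sites per pair.
pairs <- combn(n, 2)
nuc_div <- mean(apply(pairs, 2, function(ij) mean(aln[ij[1], ] != aln[ij[2], ])))
# 0.2 here
```

Gene diversity asks whether two sampled sequences are different haplotypes at all, so it is often close to 1, whereas nucleotide diversity is scaled per site and is typically much smaller. Reading one for the other would produce the kind of discrepancy reported at the start of the thread.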