[R-sig-phylo] Nucleotide diversity - different definitions and algorithms: pegas vs Arlequin

2011-08-01 Thread Alastair Potts

Good day all,

The estimates of nucleotide diversity vary hugely between pegas and 
Arlequin, yet both cite Nei (1987).


For example,
Arlequin = 0.7158
pegas = 0.003926426
With the same dataset.

Their definitions of nucleotide diversity are also very different:

Arlequin: It is computed here as the probability that two randomly 
chosen homologous (nucleotide or RFLP) sites are different.


pegas: The nucleotide diversity is the sum of the number of differences 
between pairs of sequences divided by the number of comparisons (i.e. 
n(n - 1)/2, where n is the number of sequences).


Now, which is correct? Or more correct? How are they linked? Which 
should I use in a paper?


Thanks very much for any help,

Confused Al

___
R-sig-phylo mailing list
R-sig-phylo@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-phylo


Re: [R-sig-phylo] Nucleotide diversity - different definitions and algorithms: pegas vs Arlequin

2011-08-01 Thread Emmanuel Paradis

Hi Alastair,

Alastair Potts wrote on 01/08/2011 14:19:

> Good day all,
>
> The estimates of nucleotide diversity vary hugely between pegas and
> Arlequin, yet both cite Nei (1987).
>
> For example,
> Arlequin = 0.7158
> pegas = 0.003926426
> With the same dataset.
>
> Their definitions of nucleotide diversity are also very different:
>
> Arlequin: It is computed here as the probability that two randomly
> chosen homologous (nucleotide or RFLP) sites are different.
>
> pegas: The nucleotide diversity is the sum of the number of differences
> between pairs of sequences divided by the number of comparisons (i.e.
> n(n - 1)/2, where n is the number of sequences).


This is a way to calculate the above. Here's a function that randomly 
samples one site and two rows (sequences) from a set of DNA sequences a 
large number of times and returns the proportion of cases where the two 
nucleotides differ:


f <- function(x, nrep = 1e4)
{
    n <- nrow(x)
    s <- ncol(x)
    count <- 0
    for (i in 1:nrep) {
        y <- x[sample(n, 2), sample(s, 1), drop = TRUE]
        if (y[1] != y[2]) count <- count + 1
    }
    count / nrep
}

It requires data with no ambiguous bases, e.g.:

> library(phangorn)
> library(pegas)  # nuc.div() is provided by pegas
> data(Laurasiatherian)
> X <- as.DNAbin(Laurasiatherian)
> f(X)
[1] 0.1458
> nuc.div(X)
[1] 0.1448687
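
The agreement between the two numbers above is not a coincidence of sampling. As a minimal sketch (not the pegas or Arlequin implementation), the two definitions can be computed exactly on a toy alignment in base R. The matrix `aln` and the helper names `pairwise_pi` and `sitewise_pi` are made up for this illustration, and the sketch assumes unambiguous bases and that the pairwise differences are expressed per site:

```r
# Toy alignment (hypothetical data): 4 sequences (rows) x 6 sites (columns),
# no ambiguous bases.
aln <- rbind(
  s1 = c("a", "c", "g", "t", "a", "c"),
  s2 = c("a", "c", "g", "t", "a", "t"),
  s3 = c("a", "t", "g", "t", "a", "c"),
  s4 = c("g", "c", "g", "t", "a", "c")
)

# pegas-style wording: average, over all n(n - 1)/2 pairs of sequences,
# of the proportion of sites at which the two sequences differ.
pairwise_pi <- function(x) {
  pairs <- combn(nrow(x), 2)  # each column holds one pair (i, j)
  p <- apply(pairs, 2, function(ij) mean(x[ij[1], ] != x[ij[2], ]))
  mean(p)
}

# Arlequin-style wording: for each site, the probability that two sequences
# drawn without replacement carry different nucleotides, averaged over sites.
sitewise_pi <- function(x) {
  n <- nrow(x)
  per_site <- apply(x, 2, function(col) {
    counts <- table(col)
    1 - sum(counts * (counts - 1)) / (n * (n - 1))
  })
  mean(per_site)
}

pairwise_pi(aln)  # 0.25
sitewise_pi(aln)  # 0.25
```

Both functions return the same value because they sum the same set of (pair, site) mismatches in a different order. The function f() above is a Monte Carlo estimate of that quantity, which is why its result fluctuates slightly around the nuc.div() value.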

Can you try this on your data?

Cheers,

Emmanuel


--
Emmanuel Paradis
IRD, Jakarta, Indonesia
http://ape.mpl.ird.fr/



Re: [R-sig-phylo] Nucleotide diversity - different definitions and algorithms: pegas vs Arlequin

2011-08-01 Thread Alastair Potts

Hi Emmanuel,
Thanks for the function. However, the problem actually lies with my 
misunderstanding of the Arlequin output: I was mistaking the gene 
diversity result for nucleotide diversity (nucleotide diversity is the 
gene diversity divided by L loci). Both of your functions produce 
results very similar to those produced by Arlequin when I look up the 
correct nucleotide diversity.
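
The per-site scaling mentioned here can be written out explicitly. This is a sketch in notation introduced only for illustration, not taken from either manual: k_ij is the number of sites at which sequences i and j differ, n is the number of sequences, and L is the number of aligned sites.

```latex
\hat{\pi} \;=\; \frac{1}{L} \cdot \frac{2}{n(n-1)} \sum_{i<j} k_{ij}
```

The factor containing the sum is the mean number of pairwise differences over the n(n - 1)/2 comparisons; dividing it by L puts the value on a per-site scale, which appears to be the scale of the pegas figure (0.003926426) quoted at the start of the thread.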


My humblest apologies for wasting your time,

Cheers,
Incompetent Al

