Re: [R-sig-phylo] read.dna warnings and pitfalls

Emmanuel Paradis Thu, 26 Apr 2012 02:17:42 -0700

Hi Dan,

The reason for this implementation (searching the first 10 IUPAC-coded bases) 
is that the exact formatting is not consistent among different programs. 
Some files have:


0123456789acgt.....

that is, a 10-character name with the sequence starting at the 11th position. I 
think this is typical for Phylip. Other software (e.g., PhyML) accepts longer 
taxon names and requires a space before the start of the sequence.

About your example: it depends on the order of the data. The following file can 
be read:

2 10
xxxxx     AAAAAAAAAA
madagascarAAAAAAAAAA

But if you invert the two sequence lines, it fails.
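
Why the order matters can be illustrated with a minimal sketch. It is written 
in Python and is not ape's actual code; it assumes the start column of the 
sequences is taken from the first run of 10 IUPAC characters on the first 
sequence line and then reused for every line. The function name 
split_by_first_line is hypothetical.

```python
import re

# Character set quoted in this thread: ACGTUMRWSYKVHDBN plus the gap symbol.
IUPAC_RUN = re.compile(r"[ACGTUMRWSYKVHDBN-]{10}", re.IGNORECASE)


def split_by_first_line(lines):
    """Split each line into (name, sequence) at one shared column.

    Assumption (not confirmed against ape's source): the column where the
    first run of 10 IUPAC characters starts on the FIRST line is reused
    for all lines.
    """
    match = IUPAC_RUN.search(lines[0])
    if match is None:
        # Also covers sequences shorter than 10 bases.
        raise ValueError("no run of 10 IUPAC characters found")
    start = match.start()
    return [(line[:start].strip(), line[start:]) for line in lines]


good = ["xxxxx     AAAAAAAAAA", "madagascarAAAAAAAAAA"]
bad = list(reversed(good))

print(split_by_first_line(good))  # names and sequences are separated
print(split_by_first_line(bad))   # "madagascar" is taken as sequence data
```

With "xxxxx" first, the run is found at the 11th position and both names are 
recovered. With "madagascar" first, the name itself is a run of 10 IUPAC 
characters, so the detected start column is the 1st position and the names 
come out empty.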

This is the first time I have heard about this problem in 9 years, maybe 
because it requires a particular combination of circumstances. Another drawback 
of this implementation is that files with fewer than 10 bases cannot be read.

How to solve this? If it were left only to me, I would deprecate the 
interleaved and sequential formats. FASTA is more flexible, more widespread, 
easier to parse, can store exactly the same information, and labels are 
constrained only to be on a single line (but can contain any characters 
including \n, \t, ...). But I guess many programs use the Phylip formats, so 
I'd be glad to read other suggestions.

As for your 2nd problem, it is now fixed in ape.

Best,

Emmanuel
-----Original Message-----
From: Dan Rabosky <drabo...@umich.edu>
Sender: r-sig-phylo-boun...@r-project.org
Date: Wed, 25 Apr 2012 17:51:35 
To: <r-sig-phylo@r-project.org>
Subject: [R-sig-phylo] read.dna warnings and pitfalls


Hi All-

I have spent an inordinate and embarrassing amount of time tracking down an 
excruciatingly cryptic issue with read.dna, which I rarely use. Here are two 
key problems:

1) The function automatically assumes it is reading DNA sequences when it 
encounters a string of 10 contiguous "DNA-like" characters. This includes all 
characters in the set (ACGTUMRWSYKVHDBN-). This function, unlike the phylip 
original, does not have limits on taxon name lengths. Hence, I had - in the 
middle of a large alignment - a species whose name included the string 
"MADAGASCAR", which caused a failure.  To be fair, the documentation warns of 
this, but I think this is extremely easy to overlook, and - moreover - it seems 
unfortunate to have to parse all your taxon names for a potential IUPAC match 
before trying to use the function. Presumably, most users who specify 
sequential spacing will be using whitespace to separate taxon names from DNA 
sequences, and perhaps it is better to exploit this rather than IUPAC matching. 

2) The function is whitespace-sensitive. If you tab-separate the numbers on the 
first line (numbers of taxa, numbers of sites), you'll receive an error with 
the message: "the first line of the file must contain the dimensions of the 
data". It appears that spaces are OK, however. 
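
Both points can be sketched with a whitespace-based reader. This is a Python 
illustration of the suggestion, not ape's code; the function name 
parse_sequential is hypothetical, and it assumes each taxon occupies a single 
line (name, whitespace, then the whole sequence).

```python
def parse_sequential(text):
    """Parse sequential Phylip-style text using whitespace as the delimiter."""
    lines = [ln for ln in text.splitlines() if ln.strip()]

    # str.split() with no argument splits on any run of spaces or tabs,
    # so "2 10" and "2\t10" are treated the same way.
    header = lines[0].split()
    if len(header) != 2:
        raise ValueError("the first line must contain two numbers")
    n_taxa, n_sites = int(header[0]), int(header[1])

    records = {}
    for line in lines[1:]:
        # The name is everything before the first run of whitespace.
        parts = line.split(None, 1)
        if len(parts) != 2:
            raise ValueError(f"no whitespace between name and sequence: {line!r}")
        name = parts[0]
        seq = "".join(parts[1].split())  # drop spaces inside the sequence
        if len(seq) != n_sites:
            raise ValueError(f"{name}: expected {n_sites} sites, got {len(seq)}")
        records[name] = seq

    if len(records) != n_taxa:
        raise ValueError(f"expected {n_taxa} taxa, got {len(records)}")
    return records


example = "2\t10\nMADAGASCAR_sp AAAAAAAAAA\nxxxxx\tCCCCC CCCCC\n"
print(parse_sequential(example))
```

The trade-off is that a strict 10-character Phylip line with no separator, 
such as "madagascarAAAAAAAAAA", is rejected rather than guessed at.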

Hopefully this post will be useful to someone in the future with a similar 
issue. Perhaps these can be addressed in a future update to ape? 

-Dan Rabosky

_______________________________________________
R-sig-phylo mailing list
R-sig-phylo@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-phylo