Re: [R-sig-phylo] read.dna warnings and pitfalls

2012-07-12 Thread Andrés Parada
Hi all,

I obtained a set of sequences using a vector of accession numbers and
read.GenBank (stored as namefile). The species names are there, since I can
obtain a list via attr(namefile, "species"). I couldn't find a way to use
write.dna to save a FASTA file with those species labels instead of accession
numbers.

I noticed seq.names is no longer used in ape.
*Could you tell me how to save a FASTA file with species names as labels?*
Thanks in advance,

a
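
For readers who end up with a FASTA file that still carries accession numbers, the relabelling can also be done outside R. The following is only a sketch in Python, not ape code: the function name relabel_fasta, the accession strings and the species names are all made up for illustration, and it assumes the accession is the first token of each header line.

```python
def relabel_fasta(fasta_text, labels):
    """Replace the first token of each FASTA header with labels[token].

    Headers whose first token is not in `labels` keep that token.
    Anything after the first token on a header line is dropped.
    """
    out = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            parts = line[1:].split()
            key = parts[0] if parts else ""
            # Spaces become underscores so the label stays a single token.
            out.append(">" + labels.get(key, key).replace(" ", "_"))
        else:
            out.append(line)
    return "\n".join(out) + "\n"


# Hypothetical accession -> species mapping, for illustration only.
labels = {"ACC0001": "Genus alpha", "ACC0002": "Genus beta"}
fasta = ">ACC0001\nACGT\n>ACC0002\nACGA\n"
relabelled = relabel_fasta(fasta, labels)
print(relabelled, end="")
```

Replacing spaces with underscores is a design choice here, made so that the labels would also survive a later conversion to a whitespace-delimited format.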

Re: [R-sig-phylo] read.dna warnings and pitfalls

2012-05-03 Thread Emmanuel Paradis
I made some changes in read.dna which, I hope, solve the problems. Taxon 
names can be of any length and must be separated from the sequences 
by at least one space (or tab). write.dna() now follows the same 
rule. Files with fewer than 10 nucleotides can now be read by read.dna 
(bug fixed).
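
To make the rule concrete, here is a minimal sketch in Python of reading a whitespace-delimited sequential file. It is not the read.dna implementation; the function name parse_relaxed_phylip is hypothetical, and the sketch assumes one complete sequence per line (no wrapping across lines).

```python
def parse_relaxed_phylip(text):
    """Parse a sequential file where names and sequences are separated by whitespace.

    Assumes each taxon occupies exactly one line: name, whitespace, sequence.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    ntax, nsites = (int(tok) for tok in lines[0].split())
    records = {}
    for line in lines[1:]:
        # Split on the first run of spaces/tabs only: everything before it
        # is the name, whatever characters the name contains.
        name, seq = line.split(None, 1)
        seq = "".join(seq.split())  # drop any spacing inside the sequence
        if len(seq) != nsites:
            raise ValueError(f"{name}: expected {nsites} sites, found {len(seq)}")
        records[name] = seq
    if len(records) != ntax:
        raise ValueError(f"expected {ntax} taxa, found {len(records)}")
    return records


example = "2 4\nmadagascariensis ACGT\nx\tAC GT\n"
records = parse_relaxed_phylip(example)
print(records)
```

Because the split happens at the first whitespace, a name such as madagascariensis is never mistaken for sequence data, whatever letters it contains.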


I removed the option 'seq.names' of read.dna since it doesn't seem 
particularly useful and this helped to clarify the code.


The new versions are now on ape's SVN:

https://svn.mpl.ird.fr/ape/dev/ape/R/read.dna.R
https://svn.mpl.ird.fr/ape/dev/ape/R/write.dna.R

Tests welcome!

Best,

Emmanuel

Re: [R-sig-phylo] read.dna warnings and pitfalls

2012-05-01 Thread Nick Matzke
...@umich.edu
 Sender: r-sig-phylo-boun...@r-project.org
 Date: Wed, 25 Apr 2012 17:51:35
 To:r-sig-phylo@r-project.org
 Subject: [R-sig-phylo] read.dna warnings and pitfalls


 Hi All-

 I have spent an inordinate and embarrassing amount of time tracking down
 an excruciatingly cryptic issue with read.dna, which I rarely use. Here are
 two key problems:

 1) The function automatically assumes it is reading DNA sequences when it
 encounters a string of 10 continuous DNA-like characters. This includes
 all characters in the set (ACGTUMRWSYKVHDBN-). This function, unlike the
 phylip original, does not have limits on taxon name lengths. Hence, I had -
 in the middle of a large alignment - a species whose name included the
 string MADAGASCAR, which caused a failure.  To be fair, the documentation
 warns of this, but I think this is extremely easy to overlook, and -
 moreover - it seems unfortunate to have to parse all your taxon names for a
 potential IUPAC match before trying to use the function. Presumably, most
 users who specify sequential spacing will be using whitespace to separate
 taxon names from DNA sequences, and perhaps it is better to exploit this
 rather than IUPAC matching.

 2) The function is whitespace-sensitive. If you tab-separate the numbers
 on the first line (number of taxa, number of sites), you'll receive an
 error with the message: the first line of the file must contain the
 dimensions of the data. It appears that spaces are OK, however.

 Hopefully this post will be useful to someone in the future with a
 similar issue. Perhaps these can be addressed in a future update to ape?

 -Dan Rabosky

 ___
 R-sig-phylo mailing list
 R-sig-phylo@r-project.org
 https://stat.ethz.ch/mailman/listinfo/r-sig-phylo
 --
 Emmanuel Paradis
 IRD, Jakarta, Indonesia
 http://ape.mpl.ird.fr/




Re: [R-sig-phylo] read.dna warnings and pitfalls

2012-04-26 Thread Emmanuel Paradis
Hi Dan,

The reason for this implementation (searching the first 10 IUPAC-coded bases) 
is that the exact formatting is not consistent among different programs. 
Some files have:

0123456789acgt.

that is a 10-character name and the sequence starting on the 11th position. I 
think this is typical for Phylip. Other software (e.g., PhyML) accepts longer 
taxon names and requires a space before the start of the sequence.

About your example: it depends on the order of the data. The following file can 
be read:

2 10
x AA
madagascarAA

But if you invert the two sequence lines, it fails.

This is the first time I have heard about this problem in 9 years, maybe because it 
requires a particular combination of circumstances. Another drawback of this 
implementation is that files with fewer than 10 bases cannot be read.

How to solve this? If it were left only to me, I would deprecate the 
interleaved and sequential formats. FASTA is more flexible, more widespread, 
easier to parse, can store exactly the same information, and labels are only 
constrained to be on a single line (but can contain any characters including 
\n, \t, ...) But I guess many programs use the Phylip formats, so I'd be glad 
to read other suggestions.

As for your 2nd problem, it is now fixed in ape.

Best,

Emmanuel


Re: [R-sig-phylo] read.dna warnings and pitfalls

2012-04-26 Thread Dan Rabosky

Hi Emmanuel-

Thanks for fixing the whitespace issue. I think this fix will be useful to many 
users. 

On the issue of recognizing 10 IUPAC characters: I think this is a real 
problem, and may come up again in short order. Maybe it is just that use of 
this function has been limited? In the single dataset with a modest number of 
sequences that caused me problems yesterday, I had the following species and/or 
genus names - all of which contain a run of at least 10 characters drawn from 
the set of IUPAC codes:

brachyurus (x 2)
savannarum
graduacauda
caudacutus
Camarhynchus (x 3)
madagascariensis

I don't suggest deprecating the phylip sequential, but rather, using something 
that is compatible with raxml (surely one of the most widely used phylogenetics 
programs today). I think raxml uses a relaxed sequential version of the phylip 
format with whitespace delimitation. I could read the same alignment in raxml 
with no problems, but I had multiple issues when reading the same file with 
read.dna (including the whitespace character on the first line). My guess is 
that very few people are using the original phylip format, with its limit of 10 
characters per taxon name, and with dna seqs beginning immediately after this. 
So maybe deprecate sequential phylip, but you could use what Stamatakis calls 
relaxed sequential PHYLIP, which appears to be: (1) taxon names cannot 
include spaces but can be up to 100 characters; and (2) names separated from 
sequences by whitespace character (ideally, this should recognize any number of 
spaces or tabs to prevent user confusion).

For users with tab-delimited raxml files (e.g., each taxon name separated from its 
dna sequence by a tab), you can use a regular-expression-enabled text editor 
(like textwrangler) to quickly find potential problems. Just search for 

[ACGTUMRWSYKVHDBN]{10}.+\t

with grep matching enabled. 
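
The same check can be scripted. Below is a minimal sketch in Python that flags taxon names containing 10 or more consecutive IUPAC characters. It differs from the editor pattern above in three ways, all of which are assumptions of this sketch: it is applied to the names alone rather than to whole lines, it ignores case, and it also counts the gap character "-". Note that the editor pattern is case-sensitive unless the editor is told otherwise, and it requires at least one character between the 10-character run and the tab. The function name risky_names is made up.

```python
import re

# Ten consecutive characters from the IUPAC nucleotide codes (plus gap),
# matched regardless of case.
IUPAC_RUN = re.compile(r"[ACGTUMRWSYKVHDBN-]{10}", re.IGNORECASE)


def risky_names(names):
    """Return the names that contain a run of 10 IUPAC-like characters."""
    return [name for name in names if IUPAC_RUN.search(name)]


names = [
    "brachyurus",
    "savannarum",
    "graduacauda",
    "Camarhynchus",
    "madagascariensis",
    "Homo_sapiens",
]
flagged = risky_names(names)
print(flagged)
```

In this example every name from the list above is flagged, while Homo_sapiens is not, since "o", "_", "p", "i" and "e" are not IUPAC nucleotide codes.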

Cheers,
~Dan



[R-sig-phylo] read.dna warnings and pitfalls

2012-04-25 Thread Dan Rabosky

Hi All-

I have spent an inordinate and embarrassing amount of time tracking down an 
excruciatingly cryptic issue with read.dna, which I rarely use. Here are two 
key problems:

1) The function automatically assumes it is reading DNA sequences when it 
encounters a string of 10 consecutive DNA-like characters. This includes all 
characters in the set (ACGTUMRWSYKVHDBN-). This function, unlike the phylip 
original, does not have limits on taxon name lengths. Hence, I had - in the 
middle of a large alignment - a species whose name included the string 
MADAGASCAR, which caused a failure.  To be fair, the documentation warns of 
this, but I think this is extremely easy to overlook, and - moreover - it seems 
unfortunate to have to parse all your taxon names for a potential IUPAC match 
before trying to use the function. Presumably, most users who specify 
sequential spacing will be using whitespace to separate taxon names from DNA 
sequences, and perhaps it is better to exploit this rather than IUPAC matching. 

2) The function is whitespace-sensitive. If you tab-separate the numbers on the 
first line (number of taxa, number of sites), you'll receive an error with 
the message: the first line of the file must contain the dimensions of the 
data. It appears that spaces are OK, however. 
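
As an illustration of what a whitespace-tolerant reading of the first line could look like, here is a short sketch in Python. It is not how read.dna is written; the function name parse_dimensions and the wording of its error message are assumptions of the sketch. It accepts any mix of spaces and tabs between the two numbers.

```python
def parse_dimensions(first_line):
    """Return (number of taxa, number of sites) from the first line of a file.

    The two integers may be separated, preceded or followed by any
    amount of spaces or tabs.
    """
    tokens = first_line.split()  # splits on any run of whitespace
    if len(tokens) != 2 or not all(tok.isdigit() for tok in tokens):
        raise ValueError("first line must hold two integers: taxa and sites")
    return int(tokens[0]), int(tokens[1])


print(parse_dimensions("2\t10"))
print(parse_dimensions("  2   10 "))
```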

Hopefully this post will be useful to someone in the future with a similar 
issue. Perhaps these can be addressed in a future update to ape? 

-Dan Rabosky


