Re: [R-sig-phylo] DNA sequence management for phylogenetics in R

2011-06-27 Thread Nick Matzke
After a fair amount of annoyance from shifting back and
forth between BioPython and R, I also think it would be
useful to have BioPython-like sequence management
capabilities in R. It would even be good to be able to do
things like access and download NCBI GenBank records, run
remote BLAST searches, etc.


My understanding is that the Bioconductor project is
supposed to have some of these capabilities, but


(a) to get their GenBank function to work I had to hack it
myself to update the appropriate URL etc., which suggests
that this part of Bioconductor, at least, is not well maintained


...and...

(b) the Bioconductor set of packages is massive, but most of
it seems to be devoted to microarray analysis, which makes
finding the sequence stuff a bit of a needle-in-a-haystack
search


Has anyone else had experience/success with Bioconductor for
sequence and phylogenetics purposes?


Cheers,
Nick


On 3/17/09 12:06 PM, Christoph Heibl wrote:

Hi Dan, Emmanuel, Brian, Rphyloers ...

Now that Brian pointed towards the phyloch package, I think
I have to add a little more information.

First of all, although it perhaps goes in the direction of
what Dan is looking for, this is not a mature system and is
surely aimed at working on a smaller scale (and tailored to my
specific needs, which include a strong spatial emphasis). But
to let you be the judge, my approach is as follows:

(1) All my own sequences are stored as ASCII files with
their PCR number as unique identifier in a set of
directories. (They could be stored in a database, of course,
but in my opinion the benefits of this don't outweigh the
additional step of work, especially if you work actively on
the electropherograms.)

(2) Attribute data (taxonomy, marker, primers, collector,
acc. no., locality, coordinates, etc.) is stored in a
PostgreSQL database.

(3) Queries of the database generate vectors containing PCR
numbers, which are used to select the corresponding
sequences and bundle them into an alignment object (ape)
with 'make.fasta' (phyloch).

(4) If necessary, additional sequences from GenBank are
retrieved with Emmanuel's 'read.GenBank' function and fused
to my sequences via 'c.alignment' (phyloch).

(5) I assemble partitions separately by calling MAFFT with
'mafft' (phyloch) and then fuse them with 'c.genes' (phyloch).
Thereby I can create alignments where missing sequences are
filled with Ns or choose to delete all those sequences which
are not represented in all of the partitions.

(6) 'c.genes' matches sequences by name. That means that
before I concatenate partitions, I have to set appropriate
taxon names. Once again this is done with the PostgreSQL
database, using the function 'tax.labels' (phyloch). This
allows me to concatenate alignments with different degrees
of specificity. Example: If I want to create an
interspecific sampling covering the geographic range of
species, I can choose taxon names AND locality as sequence
names in order to get an alignment where more than one
accession of each/some species is represented and only those
conspecifics stemming from the same sites are concatenated.
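
Steps (3), (5) and (6) can be sketched in base R as follows. This is
only an illustration of name-matched concatenation, not the code of
phyloch's 'c.genes' or 'tax.labels': the data frame stands in for the
PostgreSQL attribute table, the named vector stands in for the per-PCR
sequence files, and the function names, PCR numbers, taxa and
localities are all invented for the example.

```r
# Stand-in for the attribute table (hypothetical data).
attrs <- data.frame(
  pcr      = c("P001", "P002", "P003", "P004", "P005"),
  taxon    = c("Aus_bus", "Aus_bus", "Aus_cus", "Aus_bus", "Aus_cus"),
  locality = c("siteA", "siteB", "siteA", "siteA", "siteA"),
  marker   = c("ITS", "ITS", "ITS", "matK", "matK"),
  stringsAsFactors = FALSE
)

# Stand-in for the sequence files, assumed already aligned per marker.
seqs <- c(P001 = "acgt", P002 = "acga", P003 = "ccgt",
          P004 = "ttgacc", P005 = "ttgaca")

# Build one partition as a character matrix, one row per sequence,
# labelled by taxon alone or by taxon plus locality.
make_partition <- function(marker, by_locality = FALSE) {
  sel <- attrs[attrs$marker == marker, ]
  lab <- if (by_locality) paste(sel$taxon, sel$locality, sep = "_") else sel$taxon
  m <- do.call(rbind, strsplit(seqs[sel$pcr], ""))
  rownames(m) <- lab
  m
}

# Concatenate partitions by row name. Sequences missing from a
# partition are filled with "n", or dropped when complete_only = TRUE.
concat_by_name <- function(parts, complete_only = FALSE) {
  # Name matching is ambiguous if a partition repeats a label.
  stopifnot(all(vapply(parts, function(p) !anyDuplicated(rownames(p)),
                       logical(1))))
  all_names <- lapply(parts, rownames)
  keep <- if (complete_only) Reduce(intersect, all_names) else Reduce(union, all_names)
  filled <- lapply(parts, function(p) {
    out <- matrix("n", nrow = length(keep), ncol = ncol(p),
                  dimnames = list(keep, NULL))
    hit <- intersect(keep, rownames(p))
    out[hit, ] <- p[hit, , drop = FALSE]
    out
  })
  do.call(cbind, filled)
}

its  <- make_partition("ITS",  by_locality = TRUE)
matk <- make_partition("matK", by_locality = TRUE)

# Aus_bus_siteB has no matK sequence, so its matK columns become "n".
supermatrix <- concat_by_name(list(its, matk))
apply(supermatrix, 1, paste, collapse = "")
```

Labelling by taxon alone would give two rows named Aus_bus in the ITS
partition, which is why the sketch refuses duplicated labels; adding
the locality to the label is what makes the match unambiguous. The
resulting character matrix could presumably be passed to ape's
as.DNAbin() for further analysis.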

I admit that this is a very crude patchwork of functions,
but up to a certain scale it serves its purpose. I
think in your endeavor, Dan, SQL is your friend, but the
main task will be how to automate the extraction of the
sequences' attributes from varying sources. For GenBank this
could be done by a more sophisticated version of
'read.GenBank'. Some time ago I tried to build a function
'search.GenBank', but was not successful. Perhaps Emmanuel
could help here. His class 'DNAbin' might also prove
important if you plan to handle really big datasets, as he
just pointed out. In this case, it would be desirable to
extend the binary format to unaligned sequences to speed up
data assembly prior to alignment.
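
As a rough illustration of the attribute extraction mentioned above,
the sketch below collects what ape's 'read.GenBank' already attaches
to its result (accession numbers as names, and a "species" attribute
when species.names = TRUE) into a data frame. The wrapper name
fetch_with_species is made up for this example, the behaviour
described is that of recent ape releases and may differ from the 2009
version, and the call needs an internet connection, so it is left
commented out.

```r
library(ape)

# Hypothetical helper: fetch sequences by accession number and tabulate
# the attributes that read.GenBank() returns alongside them.
fetch_with_species <- function(accessions) {
  seqs <- read.GenBank(accessions, species.names = TRUE)
  data.frame(
    accession = names(seqs),
    species   = attr(seqs, "species"),
    length    = vapply(seqs, length, integer(1)),
    stringsAsFactors = FALSE
  )
}

# Example call (requires network access; accession numbers as used in
# the ape documentation example for read.GenBank):
# fetch_with_species(c("U15717", "U15718"))
```

Anything beyond these fields (locality, collector, coordinates) is not
covered by this sketch and would still need the more sophisticated
parsing described above.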

Best wishes,

Christoph

PS: Parts of phyloch are poorly documented. Anyone
interested, please do not hesitate to ask.





On Mar 17, 2009, at 5:46 PM, Brian O'Meara wrote:


Christoph Heibl has some R code that calls mafft for
alignment (which I currently like better than Clustal,
btw) and others that can interact with a PostgreSQL
database for storing info [according to the software
description -- I haven't tried this]. See
http://www.christophheibl.de/Rpackages.html.

Brian

On Mar 17, 2009, at 12:09 PM, Emmanuel Paradis wrote:


Dan,

It seems that the way DNA sequences are coded in ape with
the class DNAbin meets some of the criteria you list
below. Sequences are stored in vectors, lists of vectors,
or matrices. The usual methods for extracting and
subsetting ([, [[, $) have been written for this class.
There are also methods for rbind and cbind. I have
modified them recently so that super-matrices can be
built, if needed filling some columns/rows with gaps.
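
A minimal sketch of these DNAbin methods, using invented toy data. In
current ape releases the gap-filling behaviour of cbind is controlled
by the arguments check.names and fill.with.gaps; the argument names in
the version discussed here may have been different.

```r
library(ape)

# Two small alignments (invented data) built from character matrices.
gene1 <- as.DNAbin(rbind(taxA = c("a", "c", "g", "t"),
                         taxB = c("a", "c", "g", "a"),
                         taxC = c("c", "c", "g", "t")))
gene2 <- as.DNAbin(rbind(taxA = c("t", "t", "g"),
                         taxC = c("t", "t", "a")))

# Subsetting with "[" works by row name and column index and keeps
# the DNAbin class.
sub <- gene1[c("taxA", "taxC"), 2:3]

# cbind() matches rows by name when check.names = TRUE. taxB is absent
# from gene2, so with fill.with.gaps = TRUE its gene2 columns are gaps.
super <- cbind(gene1, gene2, check.names = TRUE, fill.with.gaps = TRUE)
dim(super)
as.character(super)["taxB", ]
```

Without fill.with.gaps = TRUE, rows that are not present in every
matrix are dropped instead of padded.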

There is no way to do sequence alignment directly in R
at the moment, but Clustal can be called with the
system() function and read.dna() can read Clustal
alignment files, so this

Re: [R-sig-phylo] DNA sequence management for phylogenetics in R

2011-06-27 Thread Jombart, Thibaut
Dear Nick, 

I think ape has a read.GenBank function, which seemed to work decently the last
time I used it. You may also want to check the seqinr package, which is devoted
to interacting with databases and retrieving sequence data. Both packages are on
CRAN. As far as phylogenetics is concerned, I would say the most useful packages
(ape, phangorn) are definitely on CRAN and not Bioconductor.

Cheers

Thibaut


From: r-sig-phylo-boun...@r-project.org [r-sig-phylo-boun...@r-project.org] on 
behalf of Nick Matzke [mat...@berkeley.edu]
Sent: 27 June 2011 21:12
To: r-sig-phylo@r-project.org
Subject: Re: [R-sig-phylo] DNA sequence management for phylogenetics in R
