Re: [R-sig-phylo] DNA sequence management for phylogenetics in R

Nick Matzke Mon, 27 Jun 2011 13:14:01 -0700

After a fair amount of annoyment involving in shifting backand forth between BioPython and R, I also think it would beuseful to have BioPython-like sequence managementcapabilities in R. It would even be good to be able to dosome things like access NCBI genbank records and downloadthem, remote BLAST, etc.

My understanding is that the bioconductor package issupposed to have some of these capabilities, but

(a) to get their genbank function to work I had to hack itmyself to update the appropriate URL etc., which indicatesthat this part of bioconductor, at least, is not well-maintained


...and...

(b) the bioconductor set of packages is massive, but most ofit seems to be devoted to microarray analysis, which makesfinding the sequence stuff a bit of a needle-in-a-haystack

Has anyone else had experience/success with bioconductor forsequence & phylogenetics purposes?


Cheers,
Nick


On 3/17/09 12:06 PM, Christoph Heibl wrote:

Hi Dan, Emmanuel, Brian, Rphyloers ...

Now that Brian pointed towards the phyloch package, I think
I have to add a little more information.

First of all, although it goes perhaps into the direction of
what Dan is looking for, this not a mature system and surely
aimed to work on a smaller scale (and tailored towards my
specific needs which include a strong spatial emphasis). But
to let you be the judge - my approach is as follows:

(1) All my own sequences are stored as ASCII files with
their PCR number as unique identifier in a set of
directories. (They could be stored in database, of course,
but in my opinion the benefits of this don't outweigh the
additional step of work, especially if you work actively on
the electropherograms.)

(2) Attribute data (taxonomy, marker, primers, collector,
acc no., locality, coordinates, etc) is stored in a
postgreSQL database.

(3) Queries of the database generate vectors containing PCR
numbers, which are used to select the corresponding
sequences and bundle them into an alignment object (ape)
with 'make.fasta' (phyloch).

(4) If necessary, additional sequences from GenBank are
retrieved with Emmanuel´s 'read.GenBank' function and fused
to my sequences via 'c.alignment' (phyloch).

(5) I assemble partitions separately by calling MAFFT with
'mafft' (phyloch) and then fuse them with 'c.genes' (phylo).
Thereby I can create alignments where missing sequences are
filled with Ns or choose to delete all those sequences which
are not represented in all of the partitions.

(6) 'c.genes' matches sequences via their name. That means
before I concatenate partitions, I have to set appropriate
taxon names. Once again this is done with the postgreSQL
database using the function 'tax.labels' (phyloch). This
allows me to concatenate alignments with different degrees
of specificity. Example: If I want to create an
interspecific sampling covering geographic range of species,
I can choose taxonnames AND locality as sequence names in
order to get an alignment where more than one accession of
each/some species is represented and only those conspecifics
stemming from the same sites will be concatenated.

I admit that this is a very crude patchwork of functions,
but up to a certain dimension it serves its purpose. If
think in your endeavor, Dan, SQL is your friend, but the
main task will be: How to automate the extraction of the
sequences' attributes from varying sources. For Genbank this
could be done by more sophisticated version of
'read.GenBank'. Some time ago I tried to build a function
'search.GenBank', but was not successful. Perhaps Emmanuel
could help here. His class 'DNAbin' might also prove
important if you plan to handle real big datasets, as he
just pointed out. In this case, it would be desirable to
extend the binary format to unaligned sequences to speed up
data assembly prior to alignment.

Best wishes,

Christoph

PS: Parts of phyloch are poorly documented. Anyone
interested, please do not hesitate to ask.





On Mar 17, 2009, at 5:46 PM, Brian O'Meara wrote:

Christoph Heibl has some R code that calls mafft for
alignment (which I currently like better than Clustal,
btw) and others that can interact with a postgreSQL
database for storing info [according to the software
description -- I haven't tried this]. See
<http://www.christophheibl.de/Rpackages.html>.

Brian

On Mar 17, 2009, at 12:09 PM, Emmanuel Paradis wrote:

Dan,

It seems that the way DNA sequences are coded in ape with
the class "DNAbin" meets some of the criteria you list
below. Sequences are stored in vectors, lists of vectors,
or matrices. The usual methods for extracting and
subsetting ([, [[, $) have been written for this class.
There are also methods for rbind and cbind. I have
modified them recently so that "super-matrices" can be
built eventually filling some columns/rows with gaps.

There is no way to do sequence alignment directly into R
at the moment, but Clustal can be called with the
system() function and read.dna() can read clustal
alignment files, so this can be scripted easily.

About seqinr: I found it very useful for various things.
I used it recently to translate DNA into AA sequences. It
can also return the complement of a DNA sequence (ape
cannot for the moment).

HTH

EP

Le 06.03.2009 16:26, Dan Rabosky a écrit :

Howdy-
Maybe this is the wrong approach altogether, but I have
been using R to manage large sets of DNA sequences, to
generate input files for various phylogeny programs, to
keep track of output from large sequencing projects, and
so on. I have written a bit of code for this, but it
would be nice to get thoughts/input from others on what
an ideal approach might look like.
I can see where my current approach is probably going to
hit some walls. It also seems to me that this could be
useful for many folks. So, before I redesign my own
system, perhaps it would be nice to collectively come up
with something useful in this area. I would be very
interested in any brainstorming, suggestions, or
thoughts on what has and has not worked for you, if you
have been wrestling with something along these lines.
Disclaimer: I haven't used the seqinr package (just
noticed it), so I would also be interested in opinions
on whether people find this to be an adequate base package.
As a starting point, I'm assuming that you are working
with cleaned sequences, that you have thousands of
individual sequences from many loci, potentially
thousands of individuals - each of which may been
sequenced for some, but not necessarily all, loci.
As I've done this so far, I store all individual
sequences as in a general class of object which contains
the relevant taxonomic info, genbank ID (if available),
and other info. I can then feed character vectors of
individual ID's plus a second vector of locus names and
I can generate a range of files in the appropriate
format for various phylogeny software. If taxa cannot be
found, they receive the relevant missing data code in
that block of the input file.
Anyway, ideally, one would want:
(1) the ability to seamlessly integrate new sequences
into the existing database from a variety of sources
(new user-generated sequences, large fasta files from
genbank, etc).
(2) Some way of dealing with alignments. Suppose I have
500 individuals for some particular locus; everything is
aligned and the aligned seqs go into the database. But
if I add more individuals, I need to re-align everything
with the new individuals and update the database
accordingly. I think it might be useful to keep track of
three pieces of information here: (a) the raw sequence,
(b) the aligned sequence, and (c) the 'version' or date
of alignment. Thus, if you added new sequences to the
database or merged several databases, you could check
item (c) to see if they were of the same alignment
version. If not, then you export everything for
alignment, align, import, and update the database
accordingly.
(3) A manner of storing the actual data that has some
transparency. Meaning that if I create some bizarre file
format for storing all of this info, it might be really
difficult for someone other than myself to interpret. As
I do it now, I use an xml-type format is generally
unintelligible. Because others who do not use R may wish
to access these data the hard way, this is problematic.
Perhaps I would be better off with SQL, or maybe these
tools are so obviously available in perl, bioperl, etc
that it will seem odd that I even attempted this. If so,
I'd like to be pointed in the right direction!
Thanks!
~Dan
Dan Rabosky
Department of Ecology and Evolutionary Biology &
Fuller Evolutionary Biology Program
Cornell Lab of Ornithology
Cornell University
Ithaca, NY 14853-2701
new website:
http://www.eeb.cornell.edu/Rabosky/dan/main.html
[[alternative HTML version deleted]]
_______________________________________________
R-sig-phylo mailing list
R-sig-phylo@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-phylo


--
Emmanuel Paradis
IRD, Montpellier, France
ph: +33 (0)4 67 16 64 47
fax: +33 (0)4 67 16 64 40
http://ape.mpl.ird.fr/

_______________________________________________
R-sig-phylo mailing list
R-sig-phylo@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-phylo


_______________________________________________
R-sig-phylo mailing list
R-sig-phylo@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-phylo


_______________________________________________
R-sig-phylo mailing list
R-sig-phylo@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-phylo


--
====================================================
Nicholas J. Matzke
Ph.D. Candidate, Graduate Student Researcher

Huelsenbeck Lab
Center for Theoretical Evolutionary Genomics
4151 VLSB (Valley Life Sciences Building)
Department of Integrative Biology
University of California, Berkeley

Graduate Student Instructor, IB200B
Principles of Phylogenetics: Ecology and Evolution
http://ib.berkeley.edu/courses/ib200b/
http://phylo.wikidot.com/


Lab websites:
http://ib.berkeley.edu/people/lab_detail.php?lab=54
http://fisher.berkeley.edu/cteg/hlab.html

Dept. personal page:http://ib.berkeley.edu/people/students/person_detail.php?person=370Lab personal page:http://fisher.berkeley.edu/cteg/members/matzke.html

Lab phone: 510-643-6299
Dept. fax: 510-643-6264

Cell phone: 510-301-0179
Email: mat...@berkeley.edu

Mailing address:
Department of Integrative Biology
3060 VLSB #3140
Berkeley, CA 94720-3140

-----------------------------------------------------

"[W]hen people thought the earth was flat, they were wrong.When people thought the earth was spherical, they werewrong. But if you think that thinking the earth is sphericalis just as wrong as thinking the earth is flat, then yourview is wronger than both of them put together."

Isaac Asimov (1989). "The Relativity of Wrong." TheSkeptical Inquirer, 14(1), 35-44. Fall 1989.

http://chem.tufts.edu/AnswersInScience/RelativityofWrong.htm

_______________________________________________
R-sig-phylo mailing list
R-sig-phylo@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-phylo

Re: [R-sig-phylo] DNA sequence management for phylogenetics in R

Reply via email to