Re: [R-sig-phylo] DNA sequence management for phylogenetics in R
After a fair amount of annoyance from shifting back and forth between BioPython and R, I also think it would be useful to have BioPython-like sequence management capabilities in R. It would even be good to be able to do things like access NCBI GenBank records and download them, run remote BLAST, etc.

My understanding is that Bioconductor is supposed to have some of these capabilities, but

(a) to get their GenBank function to work I had to hack it myself to update the appropriate URL etc., which indicates that this part of Bioconductor, at least, is not well maintained

...and...

(b) the Bioconductor set of packages is massive, but most of it seems to be devoted to microarray analysis, which makes finding the sequence stuff a bit of a needle in a haystack.

Has anyone else had experience/success with Bioconductor for sequence phylogenetics purposes?

Cheers,
Nick

On 3/17/09 12:06 PM, Christoph Heibl wrote:

Hi Dan, Emmanuel, Brian, Rphyloers ...

Now that Brian pointed towards the phyloch package, I think I have to add a little more information. First of all, although it perhaps goes in the direction of what Dan is looking for, this is not a mature system and is surely aimed at working on a smaller scale (and tailored towards my specific needs, which include a strong spatial emphasis). But to let you be the judge - my approach is as follows:

(1) All my own sequences are stored as ASCII files with their PCR number as unique identifier in a set of directories. (They could be stored in a database, of course, but in my opinion the benefits of this don't outweigh the additional step of work, especially if you work actively on the electropherograms.)

(2) Attribute data (taxonomy, marker, primers, collector, acc. no., locality, coordinates, etc.) is stored in a postgreSQL database.

(3) Queries of the database generate vectors containing PCR numbers, which are used to select the corresponding sequences and bundle them into an alignment object (ape) with 'make.fasta' (phyloch).

(4) If necessary, additional sequences from GenBank are retrieved with Emmanuel's 'read.GenBank' function and fused to my sequences via 'c.alignment' (phyloch).

(5) I assemble partitions separately by calling MAFFT with 'mafft' (phyloch) and then fuse them with 'c.genes' (phyloch). Thereby I can create alignments where missing sequences are filled with Ns, or choose to delete all those sequences which are not represented in all of the partitions.

(6) 'c.genes' matches sequences via their name. That means that before I concatenate partitions, I have to set appropriate taxon names. Once again this is done with the postgreSQL database, using the function 'tax.labels' (phyloch). This allows me to concatenate alignments with different degrees of specificity. Example: If I want to create an interspecific sampling covering the geographic range of species, I can choose taxon names AND locality as sequence names in order to get an alignment where more than one accession of each/some species is represented and only those conspecifics stemming from the same sites will be concatenated.

I admit that this is a very crude patchwork of functions, but up to a certain dimension it serves its purpose. I think that in your endeavor, Dan, SQL is your friend, but the main task will be: how to automate the extraction of the sequences' attributes from varying sources. For GenBank this could be done by a more sophisticated version of 'read.GenBank'. Some time ago I tried to build a function 'search.GenBank', but was not successful. Perhaps Emmanuel could help here. His class 'DNAbin' might also prove important if you plan to handle really big datasets, as he just pointed out. In this case, it would be desirable to extend the binary format to unaligned sequences to speed up data assembly prior to alignment.
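
The name-matched concatenation described in steps (5) and (6) can be sketched roughly as follows. This is only a minimal illustration using ape's cbind method for 'DNAbin' matrices, not phyloch's 'c.genes', whose interface is not shown in this thread. The two toy partitions and their taxon names are made up, and note that ape fills missing sequences with gaps ("-") rather than with Ns.

```r
library(ape)

# Two toy partitions (hypothetical data); rows are sequences named by taxon.
gene1 <- as.DNAbin(rbind(
  sp_A = c("a", "c", "g", "t"),
  sp_B = c("a", "c", "g", "a"),
  sp_C = c("a", "t", "g", "t")
))
gene2 <- as.DNAbin(rbind(
  sp_A = c("g", "g", "c"),
  sp_C = c("g", "a", "c")
))

# Option 1: keep every taxon; rows are matched by name and partitions
# that lack a taxon are padded with gaps for that taxon.
supermatrix <- cbind(gene1, gene2, fill.with.gaps = TRUE)
dim(supermatrix)  # 3 rows, 7 columns

# Option 2: keep only taxa represented in all partitions,
# by subsetting on the shared row names before binding.
shared <- intersect(rownames(gene1), rownames(gene2))
complete <- cbind(gene1[shared, ], gene2[shared, ])
rownames(complete)  # "sp_A" "sp_C"
```

Because matching is done purely on row names, the choice of labels (taxon name only, or taxon name plus locality) decides which sequences end up on the same row.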

Best wishes,
Christoph

PS: Parts of phyloch are poorly documented. Anyone interested, please do not hesitate to ask.

On Mar 17, 2009, at 5:46 PM, Brian O'Meara wrote:

Christoph Heibl has some R code that calls mafft for alignment (which I currently like better than Clustal, btw) and other code that can interact with a postgreSQL database for storing info [according to the software description -- I haven't tried this]. See http://www.christophheibl.de/Rpackages.html.

Brian

On Mar 17, 2009, at 12:09 PM, Emmanuel Paradis wrote:

Dan,

It seems that the way DNA sequences are coded in ape with the class DNAbin meets some of the criteria you list below. Sequences are stored in vectors, lists of vectors, or matrices. The usual methods for extracting and subsetting ([, [[, $) have been written for this class. There are also methods for rbind and cbind. I have modified them recently so that super-matrices can be built, filling some columns/rows with gaps where needed. There is no way to do sequence alignment directly in R at the moment, but Clustal can be called with the system() function and read.dna() can read clustal alignment files, so this
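
A rough sketch of the system() plus read.dna() route mentioned above might look like this. The helper name 'align_with_clustal' is hypothetical, and the executable name "clustalw2" is an assumption that depends on the local installation; the command-line options are those of ClustalW and may differ for other Clustal programs.

```r
library(ape)

# Hypothetical helper: align sequences with an external Clustal binary.
# 'exec' is an assumption; set it to the executable name on your system.
align_with_clustal <- function(seqs, exec = "clustalw2") {
  infile  <- tempfile(fileext = ".fas")
  outfile <- tempfile(fileext = ".aln")

  # Write the unaligned sequences (a 'DNAbin' object) in FASTA format.
  write.dna(seqs, infile, format = "fasta")

  # Build and run the command line; paths containing spaces would need quoting.
  cmd <- paste(exec,
               paste0("-INFILE=", infile),
               "-ALIGN",
               paste0("-OUTFILE=", outfile))
  status <- system(cmd)
  if (status != 0) stop("Clustal call failed with exit status ", status)

  # Read the Clustal alignment file back in as a 'DNAbin' matrix.
  read.dna(outfile, format = "clustal")
}

# Usage (requires Clustal to be installed and on the PATH):
# aligned <- align_with_clustal(my_sequences)
```

Going through temporary files keeps the R side independent of the aligner, so a different program could be swapped in by changing only the command line and the format passed to read.dna().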
Re: [R-sig-phylo] DNA sequence management for phylogenetics in R
Dear Nick,

I think ape has a read.GenBank function which seemed to work decently last time I used it. You may also want to check the seqinr package, which is especially devoted to interacting with databases and retrieving sequence data. Both packages are on CRAN.

As far as phylogenetics is concerned, I would say the most useful packages (ape, phangorn) are on CRAN and not Bioconductor.

Cheers
Thibaut

From: r-sig-phylo-boun...@r-project.org [r-sig-phylo-boun...@r-project.org] on behalf of Nick Matzke [mat...@berkeley.edu]
Sent: 27 June 2011 21:12
To: r-sig-phylo@r-project.org
Subject: Re: [R-sig-phylo] DNA sequence management for phylogenetics in R
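
For reference, a minimal sketch of the read.GenBank route suggested in Thibaut's reply. The wrapper name 'fetch_genbank' is hypothetical, the call needs an internet connection, and the accession numbers in the usage lines are the ones used in the example on ape's read.GenBank help page.

```r
library(ape)

# Hypothetical wrapper around ape's read.GenBank().
# With species.names = TRUE the species names are stored
# in the "species" attribute of the returned 'DNAbin' list.
fetch_genbank <- function(accessions) {
  read.GenBank(accessions, species.names = TRUE)
}

# Usage (requires an internet connection):
# acc  <- c("U15717", "U15718", "U15719", "U15720")
# seqs <- fetch_genbank(acc)
# attr(seqs, "species")  # species names, in the same order as 'acc'
# names(seqs)            # sequences are named by accession number
```

The result is an unaligned list of sequences, so it still has to go through an aligner before being bound to other partitions.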