Re: [Bioc-devel] Regarding the function: oligonucleotideFrequency for k-mers > 11 bps

Ulrich Bodenhofer Fri, 28 Nov 2014 01:25:56 -0800

Dear Rodrigo,

We recently released our 'kebabs' package as part of the Bioconductor3.0 release. Though this package is aimed at providing sequence kernelsfor classification, regression, and other tasks such as similarity-basedclustering, the package includes functionality that exactly matches yourneeds. See the following example:


   s1 <- DNAString("ATCGATCGATCGATCGATCGATCGACTGACTAGCTAGCTACGATCGACTG")
   s1
   s2 <- DNAString(paste0(rep(s, 200), collate=""))
   s2

   library(kebabs)
   sk13 <- spectrumKernel(k=13, normalized=FALSE)
   system.time(kmerFreq <- drop(getExRep(s1, sk13)))
   kmerFreq
   system.time(kmerFreq <- drop(getExRep(s2, sk13)))
   kmerFreq

So you see that the k-mer frequencies are obtained as the explicitfeature vector of the standard (unnormalized) spectrum kernel with k=13.This function is implemented in highly efficient C++ code that builds upa prefix tree and only considers k-mers that actually occur in thesequence (as you requested). You see that even for k=13 and a sequencewith tens of thousands of bases, the computations only take fractions ofa second (19 msecs on our 5-year-old Dell server). The above functionalso works for DNAStringSets, but, in this case, you should remove thedrop() to get a matrix of k-mer frequencies. The matrix is by defaultsparse (class 'dgRMatrix'), but you can also enforce the result to be instandard dense matrix format (however, still omitting k-mers that do notoccur at all in any of the sequences):


   sv <- c(DNAStringSet(s), DNAStringSet(s2))
   system.time(kmerFreq <- getExRep(sv, sk13))
   kmerFreq
   system.time(kmerFreq <- getExRep(sv, sk13, sparse=FALSE))
   kmerFreq

How long the k-mers may be, may depend on your system. On our system,the limit seems to be k=22 for DNA sequences. The same works for RNA andamino acid sequences. For the latter, however, the limits in terms of kare significantly lower, since the feature space is obviously muchlarger for the same k.


I hope that helps. If you have any further questions, please let us know.

Best regards,
Ulrich



Message-ID: <bay179-w424d25c87634bdebad7393c8...@phx.gbl>
From: Rodrigo Bertollo de Alexandre <rodrigodealexan...@hotmail.com>
To: "bioc-devel@r-project.org" <bioc-devel@r-project.org>
Date: Thu, 27 Nov 2014 21:38:39 -0200
Subject: [Bioc-devel] Regarding the function: oligonucleotideFrequency for
    k-mers > 11 bps

I've seen that it is almost impossible to work with k-mers as big as 13with this function. This is mainly because this function doesn't createa list of k-mers from the sequence but from all possible combinations.This is basically a bug, since in a big sequence of 1000 bps the maximumnumber of 13-mers is L-k+1 = 988. While the number of possible 13-mersis 4^k = 28561.This means that the code is basically analyzing 27573nonexistent k-mers.I'm wondering if there could have a modification in the packageregarding this issue...I did my own function for this (which it runs ok). However, having allyou need in a unique package would be even better...(I posted my code onthe stackoverflow: http://stackoverflow.com/a/27178731/4004499)

Sincerely,Rodrigo
    [[alternative HTML version deleted]]

_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel

_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel

Re: [Bioc-devel] Regarding the function: oligonucleotideFrequency for k-mers > 11 bps

Reply via email to