Dear Rodrigo,

We recently released our 'kebabs' package as part of the Bioconductor 3.0 release. Though this package is aimed at providing sequence kernels for classification, regression, and other tasks such as similarity-based clustering, the package includes functionality that exactly matches your needs. See the following example:

   s1 <- DNAString("ATCGATCGATCGATCGATCGATCGACTGACTAGCTAGCTACGATCGACTG")
   s1
   s2 <- DNAString(paste0(rep(s, 200), collate=""))
   s2

   library(kebabs)
   sk13 <- spectrumKernel(k=13, normalized=FALSE)
   system.time(kmerFreq <- drop(getExRep(s1, sk13)))
   kmerFreq
   system.time(kmerFreq <- drop(getExRep(s2, sk13)))
   kmerFreq

So you see that the k-mer frequencies are obtained as the explicit feature vector of the standard (unnormalized) spectrum kernel with k=13. This function is implemented in highly efficient C++ code that builds up a prefix tree and only considers k-mers that actually occur in the sequence (as you requested). You see that even for k=13 and a sequence with tens of thousands of bases, the computations only take fractions of a second (19 msecs on our 5-year-old Dell server). The above function also works for DNAStringSets, but, in this case, you should remove the drop() to get a matrix of k-mer frequencies. The matrix is by default sparse (class 'dgRMatrix'), but you can also enforce the result to be in standard dense matrix format (however, still omitting k-mers that do not occur at all in any of the sequences):

   sv <- c(DNAStringSet(s), DNAStringSet(s2))
   system.time(kmerFreq <- getExRep(sv, sk13))
   kmerFreq
   system.time(kmerFreq <- getExRep(sv, sk13, sparse=FALSE))
   kmerFreq

How long the k-mers may be, may depend on your system. On our system, the limit seems to be k=22 for DNA sequences. The same works for RNA and amino acid sequences. For the latter, however, the limits in terms of k are significantly lower, since the feature space is obviously much larger for the same k.

I hope that helps. If you have any further questions, please let us know.

Best regards,
Ulrich



Message-ID: <bay179-w424d25c87634bdebad7393c8...@phx.gbl>
From: Rodrigo Bertollo de Alexandre <rodrigodealexan...@hotmail.com>
To: "bioc-devel@r-project.org" <bioc-devel@r-project.org>
Date: Thu, 27 Nov 2014 21:38:39 -0200
Subject: [Bioc-devel] Regarding the function: oligonucleotideFrequency for
    k-mers > 11 bps

I've seen that it is almost impossible to work with k-mers as big as 13 with this function. This is mainly because this function doesn't create a list of k-mers from the sequence but from all possible combinations. This is basically a bug, since in a big sequence of 1000 bps the maximum number of 13-mers is L-k+1 = 988. While the number of possible 13-mers is 4^k = 28561.This means that the code is basically analyzing 27573 nonexistent k-mers. I'm wondering if there could have a modification in the package regarding this issue... I did my own function for this (which it runs ok). However, having all you need in a unique package would be even better...(I posted my code on the stackoverflow: http://stackoverflow.com/a/27178731/4004499)
Sincerely,Rodrigo
    [[alternative HTML version deleted]]

_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel

_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel

Reply via email to