Re: [Bioc-sig-seq] a question about trimLRPatterns

Harris A. Jaffee Wed, 31 Aug 2011 13:47:06 -0700

On Aug 31, 2011, at 3:20 PM, wang peter wrote:
dear Harris A. Jaffee,

h...@jhu.edu


thank u very much for your explaination.
but i am still confused because of not knowing its algorithm
could u tell me if i can read them from lowlevel_matching.c

let me try to concluded what u have told me:

   1. the algorithm will change every max.Rmismatch to a vector.
by adding -1 if it is not 0.
      e.g.

if max.Lmismatch=0 it will do : max.Lmismatch = rep(0, nchar(Lpattern))

   if max.Lmismatch=c(2,2)   it will get 2 2 -1 -1 -1
    how does the algorithm use -1 or 0 to control the alignment??

See ?trimLRPatterns. Yes, 0 is a special value, as is any scalar in[0,1).These are interpreted as "rates", relative to the length of thesubstringof the pattern to be tested. (See ?agrep) For Lpattern, such ascalar s

is converted internally to

        as.integer(s * 1:nchar(Lpattern))

The order of this vector of max.mismatch limits is counter-intuitive(andwasn't my idea). A sanity check is that it should be monotoneincreasing,not necessarily strictly, as with the above construction using therate s.

The suffixes of the Lpattern are tested in the order longest first, with

the first match winning. [They originally were tested shortestfirst, andthis explains the ordering of the vector, at least from theimplementationperspective. But in this scenario, you cannot know when you aredone, soyou inevitably waste time.] As I said, the mismatch limit for thetest ofthe pattern suffix of length i is max.Lmismatch[i], hence theelements of

the vector max.Lmismatch are used from the top down.

As I said, max.Lmismatch values shorter than nchar(Lpattern) are padded

with -1's at the low end. E.g., contrary to what you wrote above, c(2,2)

will be expanded into -1, -1, .... 2, 2.

See ?isMatchingAt -- especially the Details section. 0 as amax.mismatch

limit means that only perfect matches are acceptable (which rules out

matches involving edits of indels); -1 as a max.mismatch limit*prevents*

any match, since any edit distance is non-negative.

   2. the algorithm will try any suffix segment on  Lmismatch by
   substr(Lpattern, i, nchar(Lpattern)), i = 1:nchar(Lpattern)
      e.g. Lpattern = "TTTAACGT"
   it will try "T","GT",......

Yes, but in the opposite order: "TTTAACGT", "TTAACGT", .... "GT", "T".

3. during each try, if the length of segment is equal to thelenght of
   subject, the algoritm will stop, except it can allow indel.

I'm not sure what you mean with the stopping.

   4. and indel should be counted as a mismatch

If indels are permitted and used, each letter added or removed iscounted

as 1 toward the edit distance.  See the examples on ?isMatchingAt .



>  subject = "TTTACGT"
> Lpattern = "TTTAACGT"         # pattern has an extra 'A'

# need to allow for 1 "edit", to remove the extra A
> trimLRPatterns(Lpattern = Lpattern, subject = subject,max.Lmismatch=1,
        with.Lindels=TRUE)
[1] ""
trimLRPatterns(Lpattern = Lpattern, subject = subject,max.Lmismatch=4)
[1] ""
i think when the algorithm take "TTAACGT", never try "TTTAACGT",
because no indel allowed.


_______________________________________________
Bioc-sig-sequencing mailing list
Bioc-sig-sequencing@r-project.org
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing

Re: [Bioc-sig-seq] a question about trimLRPatterns

Reply via email to