On Aug 31, 2011, at 3:20 PM, wang peter wrote:
dear Harris A. Jaffee,
h...@jhu.edu

thank u very much for your explaination.
but i am still confused because of not knowing its algorithm
could u tell me if i can read them from lowlevel_matching.c

let me try to concluded what u have told me:

   1. the algorithm will change every max.Rmismatch to a vector.
by adding -1 if it is not 0.
      e.g.
if max.Lmismatch=0 it will do : max.Lmismatch = rep(0, nchar (Lpattern))
   if max.Lmismatch=c(2,2)   it will get 2 2 -1 -1 -1
    how does the algorithm use -1 or 0 to control the alignment??

See ?trimLRPatterns. Yes, 0 is a special value, as is any scalar in [0,1). These are interpreted as "rates", relative to the length of the substring of the pattern to be tested. (See ?agrep) For Lpattern, such a scalar s
is converted internally to

        as.integer(s * 1:nchar(Lpattern))

The order of this vector of max.mismatch limits is counter-intuitive (and wasn't my idea). A sanity check is that it should be monotone increasing, not necessarily strictly, as with the above construction using the rate s.
The suffixes of the Lpattern are tested in the order longest first, with
the first match winning. [They originally were tested shortest first, and this explains the ordering of the vector, at least from the implementation perspective. But in this scenario, you cannot know when you are done, so you inevitably waste time.] As I said, the mismatch limit for the test of the pattern suffix of length i is max.Lmismatch[i], hence the elements of
the vector max.Lmismatch are used from the top down.

As I said, max.Lmismatch values shorter than nchar(Lpattern) are padded
with -1's at the low end. E.g., contrary to what you wrote above, c (2,2)
will be expanded into -1, -1, .... 2, 2.

See ?isMatchingAt -- especially the Details section. 0 as a max.mismatch
limit means that only perfect matches are acceptable (which rules out
matches involving edits of indels); -1 as a max.mismatch limit *prevents*
any match, since any edit distance is non-negative.
   2. the algorithm will try any suffix segment on  Lmismatch by
   substr(Lpattern, i, nchar(Lpattern)), i = 1:nchar(Lpattern)
      e.g. Lpattern = "TTTAACGT"
   it will try "T","GT",......

Yes, but in the opposite order: "TTTAACGT", "TTAACGT", .... "GT", "T".

3. during each try, if the length of segment is equal to the lenght of
   subject, the algoritm will stop, except it can allow indel.

I'm not sure what you mean with the stopping.
   4. and indel should be counted as a mismatch

If indels are permitted and used, each letter added or removed is counted
as 1 toward the edit distance.  See the examples on ?isMatchingAt .



>  subject = "TTTACGT"
> Lpattern = "TTTAACGT"         # pattern has an extra 'A'

# need to allow for 1 "edit", to remove the extra A
> trimLRPatterns(Lpattern = Lpattern, subject = subject, max.Lmismatch=1,
        with.Lindels=TRUE)
[1] ""

trimLRPatterns(Lpattern = Lpattern, subject = subject, max.Lmismatch=4)
[1] ""
i think when the algorithm take "TTAACGT", never try "TTTAACGT",
because no indel allowed.



_______________________________________________
Bioc-sig-sequencing mailing list
Bioc-sig-sequencing@r-project.org
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing

Reply via email to