Wow, thank you very much, Andrej!

Tal
----------------Contact Details:-------------------------------------------------------
Contact me: tal.gal...@gmail.com | 972-52-7275845
Read me: www.talgalili.com (Hebrew) | www.biostatistics.co.il (Hebrew) | www.r-statistics.com (English)
----------------------------------------------------------------------------------------------


2010/3/17 Andrej Blejec <andrej.ble...@nib.si>

> A version using regular expressions, with a lot of regexpr() and substr()
> calls, is attached.
> Finally, everything is packed into the splitSeq() function.
>
> Andrej
>
> --
> Andrej Blejec
> National Institute of Biology
> Vecna pot 111 POB 141
> SI-1000 Ljubljana
> SLOVENIA
> e-mail: andrej.ble...@nib.si
> URL: http://ablejec.nib.si
> tel: + 386 (0)59 232 789
> fax: + 386 1 241 29 80
> --------------------------
> Local Organizer of ICOTS-8
> International Conference on Teaching Statistics
> http://icots8.org
>
> > -----Original Message-----
> > From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of Gabor Grothendieck
> > Sent: Tuesday, March 16, 2010 3:24 PM
> > To: Tal Galili
> > Cc: r-help@r-project.org; seqinr-fo...@r-forge.wu-wien.ac.at
> > Subject: Re: [R] How to parse a string (by a "new" markup) with R ?
> >
> > We show how to use the gsubfn package to parse this.
> >
> > The rules are not entirely clear so we will assume the following:
> >
> > - there is a fixed template for the output which is the same as your
> >   output but possibly with different character strings filled in. This
> >   implies, for example, that there are exactly Stem0, Stem1, Stem2 and
> >   Stem3 and no fewer or more stems.
> >
> > - the sequence always starts with the open of Stem0, at least one dot
> >   and the open of Stem1. There are no dots prior to the open of Stem0.
> >   This seems to be implicit in your sample output since there is no zero
> >   length string in your sample output corresponding to dots prior to
> >   Stem0.
> >
> > - Stem0 closes with the same number of < as there are > to open it
> >
> > You can modify this yourself to take into account the actual rules
> > whatever they are.
> >
> > We first calculate k, the number of leading >'s, using strapply.
> >
> > Then we replace the leading k >'s with }'s and the trailing k <'s with
> > {'s giving us Str3:
> >
> > "}}}}}}}..>>>>........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<{{{{{{{."
> >
> > We again use strapply, this time to get the lengths of the runs. Note
> > that zero length runs are possible so we cannot, for example, use rle
> > for this. For example there is a zero length run of dots between the
> > last < and the first {. read.fwf is used to actually parse out the
> > strings using the lengths we just calculated.
> >
> > Finally we fill in the template using relist.
> >
> > # inputs
> >
> > Seq <- "GCCTCGATAGCTCAGTTGGGAGAGCGTACGACTGAAGATCGTAAGGtCACCAGTTCGATCCTGGTTCGGGGCA"
> > Str <- ">>>>>>>..>>>>........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<<."
> > template <-
> >   list(
> >     "Stem 0 opening" = "",
> >     "before Stem 1" = "",
> >     "Stem 1" = list(opening = "",
> >                     inside = "",
> >                     closing = ""
> >     ),
> >     "between Stem 1 and 2" = "",
> >     "Stem 2" = list(opening = "",
> >                     inside = "",
> >                     closing = ""
> >     ),
> >     "between Stem 2 and 3" = "",
> >     "Stem 3" = list(opening = "",
> >                     inside = "",
> >                     closing = ""
> >     ),
> >     "After Stem 3" = "",
> >     "Stem 0 closing" = ""
> >   )
> >
> > # processing
> >
> > # create string made by repeating string s k times followed by more
> > reps <- function(s, k, more = "") {
> >   paste(paste(rep(s, k), collapse = ""), more, sep = "")
> > }
> >
> > library(gsubfn)
> > k <- nchar(strapply(Str, "^>+", c)[[1]])
> > Str2 <- sub("^>+", reps("}", k), Str)
> > Str3 <- sub(reps("<", k, "([^<]*)$"), reps("{", k, "\\1"), Str2)
> >
> > pat <- "^(}*)([.]*)(>*)([.]*)(<*)([.]*)(>*)([.]*)(<*)([.]*)(>*)([.]*)(<*)([.]*)({*)([.]*)$"
> > lens <- sapply(strapply(Str3, pat, c)[[1]], nchar)
> > tokens <- unlist(read.fwf(textConnection(Seq), lens, as.is = TRUE))
> > closeAllConnections()
> > tokens[is.na(tokens)] <- ""
> > out <- relist(tokens, template)
> > out
> >
> > Here is the str of the output for your sample input:
> >
> > > str(out)
> > List of 9
> >  $ Stem 0 opening      : chr "GCCTCGA"
> >  $ before Stem 1       : chr "TA"
> >  $ Stem 1              :List of 3
> >   ..$ opening: chr "GCTC"
> >   ..$ inside : chr "AGTTGGGA"
> >   ..$ closing: chr "GAGC"
> >  $ between Stem 1 and 2: chr "G"
> >  $ Stem 2              :List of 3
> >   ..$ opening: chr "TACGA"
> >   ..$ inside : chr "CTGAAGA"
> >   ..$ closing: chr "TCGTA"
> >  $ between Stem 2 and 3: chr "AGGtC"
> >  $ Stem 3              :List of 3
> >   ..$ opening: chr "ACCAG"
> >   ..$ inside : chr "TTCGATC"
> >   ..$ closing: chr "CTGGT"
> >  $ After Stem 3        : chr ""
> >  $ Stem 0 closing      : chr "TCGGGGC"
> >
> > On Tue, Mar 16, 2010 at 6:10 AM, Tal Galili <tal.gal...@gmail.com> wrote:
> > > Hello all,
> > >
> > > For some work I am doing on RNA, I want to use R to do string parsing that
> > > (I think) is like a simplistic HTML parsing.
> > >
> > > For example, let's say we have the following two variables:
> > >
> > > Seq <- "GCCTCGATAGCTCAGTTGGGAGAGCGTACGACTGAAGATCGTAAGGtCACCAGTTCGATCCTGGTTCGGGGCA"
> > > Str <- ">>>>>>>..>>>>........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<<."
> > >
> > > Say that I want to parse "Seq" according to "Str", by using the legend here:
> > >
> > > Seq: GCCTCGATAGCTCAGTTGGGAGAGCGTACGACTGAAGATCGTAAGGtCACCAGTTCGATCCTGGTTCGGGGCA
> > > Str: >>>>>>>..>>>>........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<<.
> > >      |     |  |              | |               |     |               ||     |
> > >      +-----+  +--------------+ +---------------+     +---------------++-----+
> > >         |          Stem 1           Stem 2                Stem 3         |
> > >         |                                                                |
> > >         +----------------------------------------------------------------+
> > >                                       Stem 0
> > >
> > > Assume that we always have 4 stems (0 to 3), but that the number of letters
> > > before and after each of them can vary.
> > >
> > > The output should be something like the following list structure:
> > >
> > > list(
> > >   "Stem 0 opening" = "GCCTCGA",
> > >   "before Stem 1" = "TA",
> > >   "Stem 1" = list(opening = "GCTC",
> > >                   inside = "AGTTGGGA",
> > >                   closing = "GAGC"
> > >   ),
> > >   "between Stem 1 and 2" = "G",
> > >   "Stem 2" = list(opening = "TACGA",
> > >                   inside = "CTGAAGA",
> > >                   closing = "TCGTA"
> > >   ),
> > >   "between Stem 2 and 3" = "AGGtC",
> > >   "Stem 3" = list(opening = "ACCAG",
> > >                   inside = "TTCGATC",
> > >                   closing = "CTGGT"
> > >   ),
> > >   "After Stem 3" = "",
> > >   "Stem 0 closing" = "TCGGGGC"
> > > )
> > >
> > > I don't have any experience with programming a parser, and would like
> > > advice as to what strategy to use when programming something like this
> > > (and any recommended R commands to use).
> > >
> > > What I was thinking of is to first get rid of "Stem 0", then go through
> > > the inner string with a recursive function (let's call it "seperate.stem")
> > > that each time will split the string into:
> > > 1. before stem
> > > 2. opening stem
> > > 3. inside stem
> > > 4. closing stem
> > > 5. after stem
> > >
> > > The "after stem" part will then be recursively passed to the same
> > > function ("seperate.stem").
> > >
> > > The thing is that I am not sure how to do this coding without using
> > > a loop.
> > >
> > > Any advice would be most welcome.
> > >
> > >         [[alternative HTML version deleted]]
> > >
> > > ______________________________________________
> > > R-help@r-project.org mailing list
> > > https://stat.ethz.ch/mailman/listinfo/r-help
> > > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> > > and provide commented, minimal, self-contained, reproducible code.
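
For readers without gsubfn installed, the steps described in Gabor's reply in this thread (count the leading >'s, relabel Stem 0, measure the runs, slice Seq) can also be sketched in base R. This is only a sketch under the same assumptions as that reply (exactly four stems, no dots before Stem 0). It is not Andrej's attached splitSeq() function, which is not reproduced in this thread. The names runs, starts, ends and parts, the flat naming of the result, and the use of the [{] and [}] character classes are illustrative choices, not taken from the thread.

```r
# Base-R sketch of the run-length approach; assumes exactly four stems
# (Stem 0 to Stem 3) and no dots before Stem 0, as in the thread.
Seq <- paste0("GCCTCGATAGCTCAGTTGGGAGAGCGTACGACTGAAGATCGTAAGGtCACCAGTTCGATCCTGGTTCGGG",
              "GCA")
Str <- paste0(">>>>>>>..>>>>........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<",
              "<<.")

# k = number of leading '>' characters, i.e. the width of Stem 0
k <- attr(regexpr("^>+", Str), "match.length")

# Relabel Stem 0: the leading k '>' become '}', the last k '<' become '{'
Str2 <- sub("^>+", strrep("}", k), Str)
Str3 <- sub(paste0(strrep("<", k), "([^<]*)$"),
            paste0(strrep("{", k), "\\1"), Str2)

# One capture group per run; '*' allows a run to be empty
pat <- paste0("^([}]*)([.]*)(>*)([.]*)(<*)([.]*)(>*)([.]*)(<*)([.]*)",
              "(>*)([.]*)(<*)([.]*)([{]*)([.]*)$")
runs <- regmatches(Str3, regexec(pat, Str3))[[1]][-1]  # drop the full match
lens <- nchar(runs)

# Slice Seq at the cumulative run boundaries (vectorised, no explicit loop)
ends   <- cumsum(lens)
starts <- ends - lens + 1
parts  <- substring(Seq, starts, ends)
names(parts) <- c("Stem 0 opening", "before Stem 1",
                  "Stem 1 opening", "Stem 1 inside", "Stem 1 closing",
                  "between Stem 1 and 2",
                  "Stem 2 opening", "Stem 2 inside", "Stem 2 closing",
                  "between Stem 2 and 3",
                  "Stem 3 opening", "Stem 3 inside", "Stem 3 closing",
                  "After Stem 3", "Stem 0 closing", "trailing")
print(parts)
```

The result here is a flat named character vector rather than the nested list; the first 15 elements could be passed to relist() with the template from Gabor's message to recover the nested form. Because substring() is vectorised over starts and ends, no explicit loop or recursion is needed for the slicing step.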