Re: [R] Is there a better way to parse strings than this?
Thanks for the explanation, I think I understand it now.

So, to paraphrase all your explanations:

To match "." in a regular expression, the regular expression \.\.\. needs to be passed to it. The backslash escapes the special meaning of ".". But to get the \ into the string being passed to the function, I also need to escape its special meaning inside an R string, so I need to use "\\.\\.\\."

Chris Howden
Founding Partner
Tricky Solutions
Tricky Solutions 4 Tricky Problems
Evidence Based Strategic Development, IP Commercialisation and Innovation, Data Analysis, Modelling and Training
(mobile) 0410 689 945
(fax / office) (+618) 8952 7878
ch...@trickysolutions.com.au

-Original Message-
From: h.wick...@gmail.com [mailto:h.wick...@gmail.com] On Behalf Of Hadley Wickham
Sent: Friday, 15 April 2011 11:07 AM
To: Chris Howden
Cc: r-help@r-project.org
Subject: Re: [R] Is there a better way to parse strings than this?

> I was trying strsplit(string,"\.\.\.") as per the suggestion in Venables
> and Ripleys book to "(use '\.' to match '.')", which is in the Regular
> expressions section.
>
> I noticed that in the suggestions sent to me people used:
> strsplit(test,"\\.\\.\\.")
>
> Could anyone please explain why I should have used "\\.\\.\\." rather than
> "\.\.\."?

Basically,

* you want to match .
* so the regular expression you need is \.
* and the way you represent that in a string in R is \\.

Hadley

--
Assistant Professor / Dobelman Family Junior Chair
Department of Statistics / Rice University
http://had.co.nz/

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Re: [R] Is there a better way to parse strings than this?
Not everything has to be done in R. awk and sed are some of the best tools on a Linux/Unix box.

Quick refs:
http://www.pement.org/awk/awk1line.txt
http://sed.sourceforge.net/sed1line.txt

-Whit

On Wed, Apr 13, 2011 at 12:07 AM, Chris Howden wrote:
> Hi Everyone,
>
> I needed to parse some strings recently.
>
> The code I've wound up using seems rather clunky, and I was wondering if
> anyone had any suggestions on a better way?
Re: [R] Is there a better way to parse strings than this?
On Thu, Apr 14, 2011 at 8:28 PM, Chris Howden wrote:
> Thanks for the suggestions, they were all exactly what I was looking for.
> (I knew that had to be a more elegant way then my brute force method)
>
> One question though.
>
> I was playing around with strsplit but couldn't get it to work, I realised
> my problem was that I was using "." as the string.
>
> I was trying strsplit(string,"\.\.\.") as per the suggestion in Venables
> and Ripleys book to "(use '\.' to match '.')", which is in the Regular
> expressions section.
>
> I noticed that in the suggestions sent to me people used:
> strsplit(test,"\\.\\.\\.")
>
> Could anyone please explain why I should have used "\\.\\.\\." rather than
> "\.\.\."?

"\\.\\.\\." is the string \.\.\.

For example, try this:

> cat("\\.\\.\\.\n")
\.\.\.

--
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com
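To see the two levels of escaping side by side, here is a minimal sketch. The variable names are made up for illustration, and the raw-string line assumes R 4.0.0 or later, which is much newer than this thread.

```r
# "\\." in R source is a two-character string: a backslash followed by a dot.
pattern <- "\\.\\.\\."
n_chars <- nchar(pattern)   # counts the characters actually stored
writeLines(pattern)         # shows the string as the regex engine receives it

# R >= 4.0.0 only: a raw string avoids the doubled backslashes.
raw_pattern <- r"(\.\.\.)"
same_string <- identical(pattern, raw_pattern)
```

The doubled backslash exists only in the source code; what reaches strsplit() is the shorter string that writeLines() prints.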
Re: [R] Is there a better way to parse strings than this?
> I was trying strsplit(string,"\.\.\.") as per the suggestion in Venables
> and Ripleys book to "(use '\.' to match '.')", which is in the Regular
> expressions section.
>
> I noticed that in the suggestions sent to me people used:
> strsplit(test,"\\.\\.\\.")
>
> Could anyone please explain why I should have used "\\.\\.\\." rather than
> "\.\.\."?

Basically,

* you want to match .
* so the regular expression you need is \.
* and the way you represent that in a string in R is \\.

Hadley

--
Assistant Professor / Dobelman Family Junior Chair
Department of Statistics / Rice University
http://had.co.nz/
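The three steps above can be checked directly. The sketch below is only an illustration on one of the strings from the thread; it compares an unescaped pattern with three ways of asking for three literal dots. The character-class form "[.]{3}" is an alternative that nobody in the thread proposed.

```r
x <- "A5.Brands.bought...Dulux"

# Unescaped: "." matches any character, so every run of three characters
# is treated as a separator and only empty strings come back.
unescaped <- strsplit(x, "...")[[1]]

# Three ways to ask for three literal dots:
escaped   <- strsplit(x, "\\.\\.\\.")[[1]]          # backslash-escaped regex
bracketed <- strsplit(x, "[.]{3}")[[1]]             # dot inside a character class
literal   <- strsplit(x, "...", fixed = TRUE)[[1]]  # no regex at all
```

When the separator is a plain literal, fixed = TRUE sidesteps the escaping question entirely.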
Re: [R] Is there a better way to parse strings than this?
Thanks for the suggestions, they were all exactly what I was looking for. (I knew there had to be a more elegant way than my brute-force method.)

One question though.

I was playing around with strsplit but couldn't get it to work. I realised my problem was that I was using "." as the string.

I was trying strsplit(string,"\.\.\.") as per the suggestion in Venables and Ripley's book to "(use '\.' to match '.')", which is in the Regular expressions section.

I noticed that in the suggestions sent to me people used:
strsplit(test,"\\.\\.\\.")

Could anyone please explain why I should have used "\\.\\.\\." rather than "\.\.\."?

Chris Howden

-Original Message-
From: Gabor Grothendieck [mailto:ggrothendi...@gmail.com]
Sent: Wednesday, 13 April 2011 10:55 PM
To: Chris Howden
Cc: r-help@r-project.org
Subject: Re: [R] Is there a better way to parse strings than this?
Re: [R] Is there a better way to parse strings than this?
On Wed, Apr 13, 2011 at 12:07 AM, Chris Howden wrote:
> Hi Everyone,
>
> I needed to parse some strings recently.
>
> The code I've wound up using seems rather clunky, and I was wondering if
> anyone had any suggestions on a better way?
>
> Basically I do the following:
>
> 1) Use substr() to do the parsing
> 2) Use regexpr() to find the location of the string I want to parse on, I
> then pass this onto substr()
> 3) Use nchar() as the stop input to substr() where necessary
>
> I've got a simple example of the parsing code I used below. It takes
> questionnaire variable names that includes the question and the brand it
> was answered for and then parses it so the variable name and the brand are
> in separate columns. I then use this to restructure the data from
> unstacked to stacked, but that's another story.
>
>> # this is the data set
>> test
> [1] "A5.Brands.bought...Dulux"
> [2] "A5.Brands.bought...Haymes"
> [3] "A5.Brands.bought...Solver"
> [4] "A5.Brands.bought...Taubmans.or.Bristol"
> [5] "A5.Brands.bought...Wattyl"
> [6] "A5.Brands.bought...Other"
>
>> # Where do I want to parse?
>> break1 <- regexpr('...',test, fixed=TRUE)
>> break1
> [1] 17 17 17 17 17 17
> attr(,"match.length")
> [1] 3 3 3 3 3 3
>
>> # Put Variable name in a variable
>> str1 <- substr(test,1,break1-1)
>> str1
> [1] "A5.Brands.bought" "A5.Brands.bought" "A5.Brands.bought"
> "A5.Brands.bought"
> [5] "A5.Brands.bought" "A5.Brands.bought"
>
>> # Put Brand name in a variable
>> str2 <- substr(test,break1+3, nchar(test))
>> str2
> [1] "Dulux" "Haymes" "Solver"
> [4] "Taubmans.or.Bristol" "Wattyl" "Other"
>

Try this:

> x <- c("A5.Brands.bought...Dulux", "A5.Brands.bought...Haymes",
+ "A5.Brands.bought...Solver")
>
> do.call(rbind, strsplit(x, "...", fixed = TRUE))
     [,1]               [,2]
[1,] "A5.Brands.bought" "Dulux"
[2,] "A5.Brands.bought" "Haymes"
[3,] "A5.Brands.bought" "Solver"
>
> # or
> xa <- sub("...", "\1", x, fixed = TRUE)
> read.table(textConnection(xa), sep = "\1", as.is = TRUE)
                V1     V2
1 A5.Brands.bought  Dulux
2 A5.Brands.bought Haymes
3 A5.Brands.bought Solver

--
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com
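A possible variation on the second approach, offered only as a sketch: read.table() also accepts a text= argument, so the textConnection() call can be dropped. A tab is used here as the stand-in separator, on the assumption that the data never contains one. The column names question and brand are made up for illustration.

```r
x <- c("A5.Brands.bought...Dulux",
       "A5.Brands.bought...Haymes",
       "A5.Brands.bought...Taubmans.or.Bristol")

# Replace the first "..." in each string with a tab, then parse the result
# as tab-separated text. Assumes the data itself contains no tab characters.
xa <- sub("...", "\t", x, fixed = TRUE)
parsed <- read.table(text = xa, sep = "\t", as.is = TRUE,
                     col.names = c("question", "brand"))
```

Because sub() replaces only the first match, the single dots inside "Taubmans.or.Bristol" are left alone.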
Re: [R] Is there a better way to parse strings than this?
On Wed, Apr 13, 2011 at 5:18 AM, Dennis Murphy wrote:
> Hi:
>
> Here's one approach:
>
> strings <- c(
> "A5.Brands.bought...Dulux",
> "A5.Brands.bought...Haymes",
> "A5.Brands.bought...Solver",
> "A5.Brands.bought...Taubmans.or.Bristol",
> "A5.Brands.bought...Wattyl",
> "A5.Brands.bought...Other")
>
> slist <- strsplit(strings, '\\.\\.\\.')

Or with stringr:

library(stringr)
str_split_fixed(strings, fixed("..."), n = 2)

# or maybe
str_match(strings, "(..).*\\.\\.\\.(.*)")

Hadley

--
Assistant Professor / Dobelman Family Junior Chair
Department of Statistics / Rice University
http://had.co.nz/
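For anyone without stringr installed, a rough base-R analogue of the capture-group idea can be sketched with regexec() and regmatches(). This is a swapped-in technique rather than what str_match() does internally, and the object names are illustrative.

```r
strings <- c("A5.Brands.bought...Dulux",
             "A5.Brands.bought...Taubmans.or.Bristol")

# Same pattern as in the str_match() call: group 1 is the first two
# characters, group 2 is everything after the three literal dots.
m <- regmatches(strings, regexec("(..).*\\.\\.\\.(.*)", strings))

# Each list element holds the full match followed by the capture groups,
# so columns 2 and 3 of the bound matrix are the two groups.
groups <- do.call(rbind, m)[, 2:3, drop = FALSE]
```

Here the first column of groups holds the question code and the second holds the brand.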
Re: [R] Is there a better way to parse strings than this?
Hi:

Here's one approach:

strings <- c(
"A5.Brands.bought...Dulux",
"A5.Brands.bought...Haymes",
"A5.Brands.bought...Solver",
"A5.Brands.bought...Taubmans.or.Bristol",
"A5.Brands.bought...Wattyl",
"A5.Brands.bought...Other")

slist <- strsplit(strings, '\\.\\.\\.')

# Conversion to data frame:
library(plyr)
ldply(slist, rbind)
                V1                  V2
1 A5.Brands.bought               Dulux
2 A5.Brands.bought              Haymes
3 A5.Brands.bought              Solver
4 A5.Brands.bought Taubmans.or.Bristol
5 A5.Brands.bought              Wattyl
6 A5.Brands.bought               Other

# Conversion to matrix:
laply(slist, rbind)
do.call(rbind, slist)

...and one can subselect from there.

HTH,
Dennis

On Tue, Apr 12, 2011 at 9:07 PM, Chris Howden wrote:
> Hi Everyone,
>
> I needed to parse some strings recently.
>
> The code I've wound up using seems rather clunky, and I was wondering if
> anyone had any suggestions on a better way?
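As a sketch of the "subselect from there" step using base R only (the object names are illustrative, and plyr is not required for this part):

```r
strings <- c("A5.Brands.bought...Dulux",
             "A5.Brands.bought...Taubmans.or.Bristol",
             "A5.Brands.bought...Other")
slist <- strsplit(strings, "\\.\\.\\.")

# Pull the n-th piece out of every element of the list.
questions <- vapply(slist, `[`, character(1), 1)
brands    <- vapply(slist, `[`, character(1), 2)

# The same selection from the matrix form.
mat <- do.call(rbind, slist)
brands_from_matrix <- mat[, 2]
```

vapply() stops with an error if any piece is not a single string, which can be a useful check when the input is messy.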
[R] Is there a better way to parse strings than this?
Hi Everyone,

I needed to parse some strings recently.

The code I've wound up using seems rather clunky, and I was wondering if anyone had any suggestions on a better way?

Basically I do the following:

1) Use substr() to do the parsing
2) Use regexpr() to find the location of the string I want to parse on, which I then pass on to substr()
3) Use nchar() as the stop input to substr() where necessary

I've got a simple example of the parsing code I used below. It takes questionnaire variable names that include the question and the brand it was answered for, and parses them so that the variable name and the brand are in separate columns. I then use this to restructure the data from unstacked to stacked, but that's another story.

> # this is the data set
> test
[1] "A5.Brands.bought...Dulux"
[2] "A5.Brands.bought...Haymes"
[3] "A5.Brands.bought...Solver"
[4] "A5.Brands.bought...Taubmans.or.Bristol"
[5] "A5.Brands.bought...Wattyl"
[6] "A5.Brands.bought...Other"

> # Where do I want to parse?
> break1 <- regexpr('...',test, fixed=TRUE)
> break1
[1] 17 17 17 17 17 17
attr(,"match.length")
[1] 3 3 3 3 3 3

> # Put Variable name in a variable
> str1 <- substr(test,1,break1-1)
> str1
[1] "A5.Brands.bought" "A5.Brands.bought" "A5.Brands.bought" "A5.Brands.bought"
[5] "A5.Brands.bought" "A5.Brands.bought"

> # Put Brand name in a variable
> str2 <- substr(test,break1+3, nchar(test))
> str2
[1] "Dulux"               "Haymes"              "Solver"
[4] "Taubmans.or.Bristol" "Wattyl"              "Other"

Thanks for any and all suggestions

Chris Howden
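The regexpr()/substr() steps above can also be wrapped in one small helper. This is only a sketch: split_once and its column names are hypothetical, and the handling of strings with no separator (keep the whole string, return NA for the second part) is an assumption rather than something the original code does.

```r
# Hypothetical helper: split each string at the first literal occurrence of sep.
split_once <- function(x, sep) {
  # regexpr() returns -1 where sep is not found; as.integer() drops the
  # match.length attribute so it does not leak into the results.
  pos <- as.integer(regexpr(sep, x, fixed = TRUE))
  found <- pos > 0
  before <- ifelse(found, substr(x, 1, pos - 1), x)
  after <- ifelse(found, substr(x, pos + nchar(sep), nchar(x)), NA_character_)
  data.frame(before = before, after = after, stringsAsFactors = FALSE)
}

res <- split_once(c("A5.Brands.bought...Dulux", "NoSeparatorHere"), "...")
```

Using nchar(sep) instead of a hard-coded 3 means the same function works for separators of any length.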