Re: [R] Is there a better way to parse strings than this?

2011-04-18 Thread Chris Howden
Thanks for the explanation,

I think I understand it now. So, to paraphrase all your explanations:

To match a literal "..." the regular expression engine needs to receive the
pattern \.\.\. because the backslash removes the special meaning of ".". But
to get a \ into the R string that is passed to the function I also need to
escape the backslash itself, so I have to type "\\.\\.\\.".
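
A minimal sketch of the two escaping levels described above, using only base R. The input x is a made-up stand-in modelled on the questionnaire names in this thread, not code posted by anyone on the list.

```r
# Hypothetical input, modelled on the variable names in the original post
x <- "A5.Brands.bought...Dulux"

# An unescaped "." is a regex wildcard, so "..." matches any three characters
any_three <- grepl("...", "abc")            # TRUE

# "\\." in R source is the two-character string \. which the regex engine
# reads as "a literal dot", so "abc" no longer matches
literal_abc <- grepl("\\.\\.\\.", "abc")    # FALSE
literal_x   <- grepl("\\.\\.\\.", x)        # TRUE

# Splitting on the escaped pattern separates question and brand
parts <- strsplit(x, "\\.\\.\\.")[[1]]
parts
```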



Chris Howden
Founding Partner
Tricky Solutions
Tricky Solutions 4 Tricky Problems
Evidence Based Strategic Development, IP Commercialisation and Innovation,
Data Analysis, Modelling and Training
(mobile) 0410 689 945
(fax / office) (+618) 8952 7878
ch...@trickysolutions.com.au


-----Original Message-----
From: h.wick...@gmail.com [mailto:h.wick...@gmail.com] On Behalf Of Hadley
Wickham
Sent: Friday, 15 April 2011 11:07 AM
To: Chris Howden
Cc: r-help@r-project.org
Subject: Re: [R] Is there a better way to parse strings than this?

> I was trying strsplit(string,"\.\.\.") as per the suggestion in Venables
> and Ripley's book to "(use '\.' to match '.')", which is in the Regular
> expressions section.
>
> I noticed that in the suggestions sent to me people used:
> strsplit(test,"\\.\\.\\.")
>
>
> Could anyone please explain why I should have used "\\.\\.\\." rather than
> "\.\.\."?

Basically,

 * you want to match .
 * so the regular expression you need is \.
 * and the way you represent that in a string in R is \\.

Hadley

-- 
Assistant Professor / Dobelman Family Junior Chair
Department of Statistics / Rice University
http://had.co.nz/

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Is there a better way to parse strings than this?

2011-04-14 Thread Whit Armstrong
not everything has to be done in R.

awk and sed are some of the best tools on a linux/unix box.

quick refs:
http://www.pement.org/awk/awk1line.txt
http://sed.sourceforge.net/sed1line.txt

-Whit


On Wed, Apr 13, 2011 at 12:07 AM, Chris Howden
 wrote:
> Hi Everyone,
>
>
> I needed to parse some strings recently.
>
> The code I've wound up using seems rather clunky, and I was wondering if
> anyone had any suggestions on a better way?
>
> Basically I do the following:
>
> 1) Use substr() to do the parsing
> 2) Use regexpr() to find the location of the string I want to parse on, I
> then pass this onto substr()
> 3) Use nchar() as the stop input to substr() where necessary
>
>
>
> I've got a simple example of the parsing code I used below. It takes
> questionnaire variable names that includes the question and the brand it
> was answered for and then parses it so the variable name and the brand are
> in separate columns. I then use this to restructure the data from
> unstacked to stacked, but that's another story.
>
>> # this is the data set
>> test
> [1] "A5.Brands.bought...Dulux"
> [2] "A5.Brands.bought...Haymes"
> [3] "A5.Brands.bought...Solver"
> [4] "A5.Brands.bought...Taubmans.or.Bristol"
> [5] "A5.Brands.bought...Wattyl"
> [6] "A5.Brands.bought...Other"
>
>> # Where do I want to parse?
>> break1 <-  regexpr('...',test, fixed=TRUE)
>> break1
> [1] 17 17 17 17 17 17
> attr(,"match.length")
> [1] 3 3 3 3 3 3
>
>> # Put Variable name in a variable
>> str1 <- substr(test,1,break1-1)
>> str1
> [1] "A5.Brands.bought" "A5.Brands.bought" "A5.Brands.bought"
> "A5.Brands.bought"
> [5] "A5.Brands.bought" "A5.Brands.bought"
>
>> # Put Brand name in a variable
>> str2 <- substr(test,break1+3, nchar(test))
>> str2
> [1] "Dulux"               "Haymes"              "Solver"
> [4] "Taubmans.or.Bristol" "Wattyl"              "Other"
>
>
>
> Thanks for any and all suggestions
>
>
> Chris Howden
> Founding Partner
> Tricky Solutions
> Tricky Solutions 4 Tricky Problems
> Evidence Based Strategic Development, IP Commercialisation and Innovation,
> Data Analysis, Modelling and Training
> (mobile) 0410 689 945
> (fax / office) (+618) 8952 7878
> ch...@trickysolutions.com.au
>
>



Re: [R] Is there a better way to parse strings than this?

2011-04-14 Thread Gabor Grothendieck
On Thu, Apr 14, 2011 at 8:28 PM, Chris Howden
 wrote:
> Thanks for the suggestions, they were all exactly what I was looking for.
> (I knew there had to be a more elegant way than my brute force method.)
>
> One question though.
>
> I was playing around with strsplit but couldn't get it to work; I realised
> my problem was that I was using "." as the string.
>
> I was trying strsplit(string,"\.\.\.") as per the suggestion in Venables
> and Ripley's book to "(use '\.' to match '.')", which is in the Regular
> expressions section.
>
> I noticed that in the suggestions sent to me people used:
> strsplit(test,"\\.\\.\\.")
>
>
> Could anyone please explain why I should have used "\\.\\.\\." rather than
> "\.\.\."?
>

"\\.\\.\\." is the string \.\.\.   For example, try this

> cat("\\.\\.\\.\n")
\.\.\.
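
A short sketch extending the cat() demonstration above: it contrasts what print() shows (the escaped source form) with the characters the string really holds. The variable name s is only for illustration.

```r
# The pattern as typed in R source
s <- "\\.\\.\\."

# Six characters: three backslashes and three dots
n_chars <- nchar(s)

print(s)        # displays the escaped form, with doubled backslashes
cat(s, "\n")    # displays the actual characters held in the string
writeLines(s)   # same characters as cat(), followed by a newline
```

print() re-escapes the backslashes for display, which is why the string looks longer at the console than it is.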



-- 
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com



Re: [R] Is there a better way to parse strings than this?

2011-04-14 Thread Hadley Wickham
> I was trying strsplit(string,"\.\.\.") as per the suggestion in Venables
> and Ripley's book to "(use '\.' to match '.')", which is in the Regular
> expressions section.
>
> I noticed that in the suggestions sent to me people used:
> strsplit(test,"\\.\\.\\.")
>
>
> Could anyone please explain why I should have used "\\.\\.\\." rather than
> "\.\.\."?

Basically,

 * you want to match .
 * so the regular expression you need is \.
 * and the way you represent that in a string in R is \\.
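
For comparison, a sketch of a few ways to write the same split without doubling backslashes. The raw-string form assumes R 4.0.0 or later, which is much newer than this thread; the input x is a hypothetical example.

```r
x <- "A5.Brands.bought...Dulux"   # hypothetical input

# 1. Escaped regex, as discussed above
a <- strsplit(x, "\\.\\.\\.")[[1]]

# 2. A character class: inside [] the dot has no special meaning
b <- strsplit(x, "[.]{3}")[[1]]

# 3. No regex at all: treat the separator as a literal string
c3 <- strsplit(x, "...", fixed = TRUE)[[1]]

# 4. Raw string (assumes R >= 4.0.0): backslashes are taken literally
d <- strsplit(x, r"(\.\.\.)")[[1]]

a
```

All four should give the same two pieces; fixed = TRUE is probably the simplest choice when the separator is a plain string.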

Hadley

-- 
Assistant Professor / Dobelman Family Junior Chair
Department of Statistics / Rice University
http://had.co.nz/



Re: [R] Is there a better way to parse strings than this?

2011-04-14 Thread Chris Howden
Thanks for the suggestions, they were all exactly what I was looking for.
(I knew there had to be a more elegant way than my brute force method.)

One question though.

I was playing around with strsplit but couldn't get it to work; I realised
my problem was that I was using "." as the string.

I was trying strsplit(string,"\.\.\.") as per the suggestion in Venables
and Ripley's book to "(use '\.' to match '.')", which is in the Regular
expressions section.

I noticed that in the suggestions sent to me people used:
strsplit(test,"\\.\\.\\.")


Could anyone please explain why I should have used "\\.\\.\\." rather than
"\.\.\."?



Chris Howden
Founding Partner
Tricky Solutions
Tricky Solutions 4 Tricky Problems
Evidence Based Strategic Development, IP Commercialisation and Innovation,
Data Analysis, Modelling and Training
(mobile) 0410 689 945
(fax / office) (+618) 8952 7878
ch...@trickysolutions.com.au


-----Original Message-----
From: Gabor Grothendieck [mailto:ggrothendi...@gmail.com]
Sent: Wednesday, 13 April 2011 10:55 PM
To: Chris Howden
Cc: r-help@r-project.org
Subject: Re: [R] Is there a better way to parse strings than this?

On Wed, Apr 13, 2011 at 12:07 AM, Chris Howden
 wrote:
> Hi Everyone,
>
>
> I needed to parse some strings recently.
>
> The code I've wound up using seems rather clunky, and I was wondering if
> anyone had any suggestions on a better way?
>
> Basically I do the following:
>
> 1) Use substr() to do the parsing
> 2) Use regexpr() to find the location of the string I want to parse on, I
> then pass this onto substr()
> 3) Use nchar() as the stop input to substr() where necessary
>
>
>
> I've got a simple example of the parsing code I used below. It takes
> questionnaire variable names that includes the question and the brand it
> was answered for and then parses it so the variable name and the brand are
> in separate columns. I then use this to restructure the data from
> unstacked to stacked, but that's another story.
>
>> # this is the data set
>> test
> [1] "A5.Brands.bought...Dulux"
> [2] "A5.Brands.bought...Haymes"
> [3] "A5.Brands.bought...Solver"
> [4] "A5.Brands.bought...Taubmans.or.Bristol"
> [5] "A5.Brands.bought...Wattyl"
> [6] "A5.Brands.bought...Other"
>
>> # Where do I want to parse?
>> break1 <-  regexpr('...',test, fixed=TRUE)
>> break1
> [1] 17 17 17 17 17 17
> attr(,"match.length")
> [1] 3 3 3 3 3 3
>
>> # Put Variable name in a variable
>> str1 <- substr(test,1,break1-1)
>> str1
> [1] "A5.Brands.bought" "A5.Brands.bought" "A5.Brands.bought"
> "A5.Brands.bought"
> [5] "A5.Brands.bought" "A5.Brands.bought"
>
>> # Put Brand name in a variable
>> str2 <- substr(test,break1+3, nchar(test))
>> str2
> [1] "Dulux"               "Haymes"              "Solver"
> [4] "Taubmans.or.Bristol" "Wattyl"              "Other"
>
>

Try this:

> x <- c("A5.Brands.bought...Dulux", "A5.Brands.bought...Haymes",
+ "A5.Brands.bought...Solver")
>
> do.call(rbind, strsplit(x, "...", fixed = TRUE))
     [,1]               [,2]
[1,] "A5.Brands.bought" "Dulux"
[2,] "A5.Brands.bought" "Haymes"
[3,] "A5.Brands.bought" "Solver"
>
> # or
> xa <- sub("...", "\1", x, fixed = TRUE)
> read.table(textConnection(xa), sep = "\1", as.is = TRUE)
                V1     V2
1 A5.Brands.bought  Dulux
2 A5.Brands.bought Haymes
3 A5.Brands.bought Solver


--
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com



Re: [R] Is there a better way to parse strings than this?

2011-04-13 Thread Gabor Grothendieck
On Wed, Apr 13, 2011 at 12:07 AM, Chris Howden
 wrote:
> Hi Everyone,
>
>
> I needed to parse some strings recently.
>
> The code I've wound up using seems rather clunky, and I was wondering if
> anyone had any suggestions on a better way?
>
> Basically I do the following:
>
> 1) Use substr() to do the parsing
> 2) Use regexpr() to find the location of the string I want to parse on, I
> then pass this onto substr()
> 3) Use nchar() as the stop input to substr() where necessary
>
>
>
> I've got a simple example of the parsing code I used below. It takes
> questionnaire variable names that includes the question and the brand it
> was answered for and then parses it so the variable name and the brand are
> in separate columns. I then use this to restructure the data from
> unstacked to stacked, but that's another story.
>
>> # this is the data set
>> test
> [1] "A5.Brands.bought...Dulux"
> [2] "A5.Brands.bought...Haymes"
> [3] "A5.Brands.bought...Solver"
> [4] "A5.Brands.bought...Taubmans.or.Bristol"
> [5] "A5.Brands.bought...Wattyl"
> [6] "A5.Brands.bought...Other"
>
>> # Where do I want to parse?
>> break1 <-  regexpr('...',test, fixed=TRUE)
>> break1
> [1] 17 17 17 17 17 17
> attr(,"match.length")
> [1] 3 3 3 3 3 3
>
>> # Put Variable name in a variable
>> str1 <- substr(test,1,break1-1)
>> str1
> [1] "A5.Brands.bought" "A5.Brands.bought" "A5.Brands.bought"
> "A5.Brands.bought"
> [5] "A5.Brands.bought" "A5.Brands.bought"
>
>> # Put Brand name in a variable
>> str2 <- substr(test,break1+3, nchar(test))
>> str2
> [1] "Dulux"               "Haymes"              "Solver"
> [4] "Taubmans.or.Bristol" "Wattyl"              "Other"
>
>

Try this:

> x <- c("A5.Brands.bought...Dulux", "A5.Brands.bought...Haymes",
+ "A5.Brands.bought...Solver")
>
> do.call(rbind, strsplit(x, "...", fixed = TRUE))
     [,1]               [,2]
[1,] "A5.Brands.bought" "Dulux"
[2,] "A5.Brands.bought" "Haymes"
[3,] "A5.Brands.bought" "Solver"
>
> # or
> xa <- sub("...", "\1", x, fixed = TRUE)
> read.table(textConnection(xa), sep = "\1", as.is = TRUE)
                V1     V2
1 A5.Brands.bought  Dulux
2 A5.Brands.bought Haymes
3 A5.Brands.bought Solver
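
A possible follow-on to the strsplit() approach above, sketched under the assumption that every name contains the "..." separator exactly once. It adds a brand that itself contains single dots and gives the columns names; the column names question and brand are invented for illustration.

```r
# Subset of the names from the original post, including one with single dots
x <- c("A5.Brands.bought...Dulux",
       "A5.Brands.bought...Haymes",
       "A5.Brands.bought...Taubmans.or.Bristol")

# fixed = TRUE splits only on the literal three-dot separator,
# so the single dots inside each piece are left alone
m <- do.call(rbind, strsplit(x, "...", fixed = TRUE))

# Hypothetical column names, chosen for readability
out <- data.frame(question = m[, 1], brand = m[, 2],
                  stringsAsFactors = FALSE)
out
```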


-- 
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com



Re: [R] Is there a better way to parse strings than this?

2011-04-13 Thread Hadley Wickham
On Wed, Apr 13, 2011 at 5:18 AM, Dennis Murphy  wrote:
> Hi:
>
> Here's one approach:
>
> strings <- c(
> "A5.Brands.bought...Dulux",
> "A5.Brands.bought...Haymes",
> "A5.Brands.bought...Solver",
> "A5.Brands.bought...Taubmans.or.Bristol",
> "A5.Brands.bought...Wattyl",
> "A5.Brands.bought...Other")
>
> slist <- strsplit(strings, '\\.\\.\\.')

Or with stringr:

library(stringr)
str_split_fixed(strings, fixed("..."), n = 2)

# or maybe
str_match(strings, "(..).*\\.\\.\\.(.*)")
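
For readers without stringr, a rough base-R counterpart of the str_match() idea can be sketched with regexec() and regmatches(). Note that the pattern here is an assumption of mine and differs from the one above: it captures the whole question name rather than only its first two characters.

```r
strings <- c("A5.Brands.bought...Dulux",
             "A5.Brands.bought...Taubmans.or.Bristol")

# Two capture groups around the literal three-dot separator
pattern <- "^(.*)\\.\\.\\.(.*)$"

# regexec() records the full match plus each group;
# regmatches() extracts them as one character vector per input string
hits <- regmatches(strings, regexec(pattern, strings))

# Column 1 is the full match, columns 2 and 3 are the groups
m <- do.call(rbind, hits)
m[, 2:3]
```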

Hadley

-- 
Assistant Professor / Dobelman Family Junior Chair
Department of Statistics / Rice University
http://had.co.nz/



Re: [R] Is there a better way to parse strings than this?

2011-04-13 Thread Dennis Murphy
Hi:

Here's one approach:

strings <- c(
"A5.Brands.bought...Dulux",
"A5.Brands.bought...Haymes",
"A5.Brands.bought...Solver",
"A5.Brands.bought...Taubmans.or.Bristol",
"A5.Brands.bought...Wattyl",
"A5.Brands.bought...Other")

slist <- strsplit(strings, '\\.\\.\\.')

# Conversion to data frame:
library(plyr)
ldply(slist, rbind)
                V1                  V2
1 A5.Brands.bought               Dulux
2 A5.Brands.bought              Haymes
3 A5.Brands.bought              Solver
4 A5.Brands.bought Taubmans.or.Bristol
5 A5.Brands.bought              Wattyl
6 A5.Brands.bought               Other

# Conversion to matrix:
laply(slist, rbind)
do.call(rbind, slist)

...and one can subselect from there.
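
One way the subselection might look in base R, as a sketch that does not need plyr. The shortened strings vector and the names brands and kept are assumptions for illustration.

```r
strings <- c("A5.Brands.bought...Dulux",
             "A5.Brands.bought...Haymes",
             "A5.Brands.bought...Other")

slist <- strsplit(strings, "\\.\\.\\.")

# Pull the second piece (the brand) out of every list element
brands <- vapply(slist, function(p) p[2], character(1))

# Or build the matrix first and index its rows and columns
m <- do.call(rbind, slist)
kept <- m[m[, 2] != "Other", , drop = FALSE]
kept
```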

HTH,
Dennis


On Tue, Apr 12, 2011 at 9:07 PM, Chris Howden
wrote:

> Hi Everyone,
>
>
> I needed to parse some strings recently.
>
> The code I've wound up using seems rather clunky, and I was wondering if
> anyone had any suggestions on a better way?
>
> Basically I do the following:
>
> 1) Use substr() to do the parsing
> 2) Use regexpr() to find the location of the string I want to parse on, I
> then pass this onto substr()
> 3) Use nchar() as the stop input to substr() where necessary
>
>
>
> I've got a simple example of the parsing code I used below. It takes
> questionnaire variable names that includes the question and the brand it
> was answered for and then parses it so the variable name and the brand are
> in separate columns. I then use this to restructure the data from
> unstacked to stacked, but that's another story.
>
> > # this is the data set
> > test
> [1] "A5.Brands.bought...Dulux"
> [2] "A5.Brands.bought...Haymes"
> [3] "A5.Brands.bought...Solver"
> [4] "A5.Brands.bought...Taubmans.or.Bristol"
> [5] "A5.Brands.bought...Wattyl"
> [6] "A5.Brands.bought...Other"
>
> > # Where do I want to parse?
> > break1 <-  regexpr('...',test, fixed=TRUE)
> > break1
> [1] 17 17 17 17 17 17
> attr(,"match.length")
> [1] 3 3 3 3 3 3
>
> > # Put Variable name in a variable
> > str1 <- substr(test,1,break1-1)
> > str1
> [1] "A5.Brands.bought" "A5.Brands.bought" "A5.Brands.bought"
> "A5.Brands.bought"
> [5] "A5.Brands.bought" "A5.Brands.bought"
>
> > # Put Brand name in a variable
> > str2 <- substr(test,break1+3, nchar(test))
> > str2
> [1] "Dulux"               "Haymes"              "Solver"
> [4] "Taubmans.or.Bristol" "Wattyl"              "Other"
>
>
>
> Thanks for any and all suggestions
>
>
> Chris Howden
> Founding Partner
> Tricky Solutions
> Tricky Solutions 4 Tricky Problems
> Evidence Based Strategic Development, IP Commercialisation and Innovation,
> Data Analysis, Modelling and Training
> (mobile) 0410 689 945
> (fax / office) (+618) 8952 7878
> ch...@trickysolutions.com.au
>
>



[R] Is there a better way to parse strings than this?

2011-04-12 Thread Chris Howden
Hi Everyone,


I needed to parse some strings recently.

The code I've wound up using seems rather clunky, and I was wondering if
anyone had any suggestions on a better way?

Basically I do the following:

1) Use substr() to do the parsing
2) Use regexpr() to find the location of the string I want to parse on, I
then pass this onto substr()
3) Use nchar() as the stop input to substr() where necessary



I've got a simple example of the parsing code I used below. It takes
questionnaire variable names that includes the question and the brand it
was answered for and then parses it so the variable name and the brand are
in separate columns. I then use this to restructure the data from
unstacked to stacked, but that's another story.

> # this is the data set
> test
[1] "A5.Brands.bought...Dulux"
[2] "A5.Brands.bought...Haymes"
[3] "A5.Brands.bought...Solver"
[4] "A5.Brands.bought...Taubmans.or.Bristol"
[5] "A5.Brands.bought...Wattyl"
[6] "A5.Brands.bought...Other"

> # Where do I want to parse?
> break1 <-  regexpr('...',test, fixed=TRUE)
> break1
[1] 17 17 17 17 17 17
attr(,"match.length")
[1] 3 3 3 3 3 3

> # Put Variable name in a variable
> str1 <- substr(test,1,break1-1)
> str1
[1] "A5.Brands.bought" "A5.Brands.bought" "A5.Brands.bought"
"A5.Brands.bought"
[5] "A5.Brands.bought" "A5.Brands.bought"

> # Put Brand name in a variable
> str2 <- substr(test,break1+3, nchar(test))
> str2
[1] "Dulux"               "Haymes"              "Solver"
[4] "Taubmans.or.Bristol" "Wattyl"              "Other"
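
The regexpr()/substr()/nchar() steps above could be wrapped in a small helper, sketched here under a few assumptions: the name split_at and its column names are invented, the separator is treated as a literal string, and strings without the separator get NA on the right-hand side.

```r
# Hypothetical helper wrapping the regexpr() + substr() + nchar() steps
split_at <- function(x, sep) {
  # as.integer() drops the match.length attribute that regexpr() attaches
  pos <- as.integer(regexpr(sep, x, fixed = TRUE))
  found <- pos > 0   # regexpr() returns -1 where there is no match

  left  <- ifelse(found, substr(x, 1, pos - 1), x)
  right <- ifelse(found, substr(x, pos + nchar(sep), nchar(x)), NA_character_)

  data.frame(left = left, right = right, stringsAsFactors = FALSE)
}

res <- split_at(c("A5.Brands.bought...Dulux", "NoSeparator"), "...")
res
```

Using nchar(sep) instead of the hard-coded 3 keeps the offset correct if the separator changes.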



Thanks for any and all suggestions


Chris Howden
Founding Partner
Tricky Solutions
Tricky Solutions 4 Tricky Problems
Evidence Based Strategic Development, IP Commercialisation and Innovation,
Data Analysis, Modelling and Training
(mobile) 0410 689 945
(fax / office) (+618) 8952 7878
ch...@trickysolutions.com.au
