Re: [R] extracting characters from a string
Hi David,

It could be related to spaces in the data or something else. Suppose the data has some spaces at the end or the beginning:

pub1 <- c('Brown DK, Santos R, Rome DF, Don Juan X')
pub2 <- c('Benigni D')
pub3 <- c('Arstra SD, Van den Hoops DD, lamarque D ')
pubnew <- rbind(pub1, pub2, pub3)
# dat1 is assumed to be rebuilt from pubnew, as before:
dat1 <- read.table(text=pubnew, sep=",", fill=TRUE, stringsAsFactors=F)
res <- as.data.frame(do.call(cbind, lapply(dat1, function(x) gsub("^ | $", "", gsub("[A-Za-z]+$", "", gsub(" $", "", x))))), stringsAsFactors=F)
str(res)
#'data.frame': 3 obs. of 4 variables:
# $ V1: chr "Brown" "Benigni" "Arstra"
# $ V2: chr "Santos" "" "Van den Hoops"
# $ V3: chr "Rome" "" "lamarque"
# $ V4: chr "Don Juan" "" ""

# If I used the previous solution:
as.data.frame(do.call(cbind, lapply(dat1, function(x) gsub(" $", "", gsub("^ |\\w+$", "", x)))), stringsAsFactors=F)
       V1            V2         V3       V4
1   Brown        Santos       Rome Don Juan
2 Benigni
3  Arstra Van den Hoops lamarque D            # initial present

I tried this case with Rui's solution:

fun2(pubnew)
#[[1]]
#[1] " Brown" "Santos" "Rome" "Don Juan"
#
#[[2]]
#[1] "Benigni"
#
#[[3]]
#[1] "Arstra" "Van den Hoops" "lamarque D" # initials present

As Rui's solution works for you, the problem might be something else.

A.K.

From: Biau David
To: arun
Sent: Thursday, January 24, 2013 12:40 AM
Subject: Re: [R] extracting characters from a string

Thanks a lot. It doesn't entirely work well yet, probably because of the format of the data I import. I have to look into it and, thanks to your explanation, I should be able to find the problem in the data.

David

> From: arun
> To: Biau David
> Sent: Wednesday, 23 January 2013, 19:06
> Subject: Re: [R] extracting characters from a string
>
> Hi David,
>
> I forgot about the explanation part.
>
> dat1 <- read.table(text=pub, sep=",", fill=TRUE, stringsAsFactors=F)
> # Here, I converted it to a data frame, delimited by ",". I used fill=TRUE
> # because you have an unequal number of authors in each line.
> as.data.frame(do.call(cbind, lapply(dat1, function(x) gsub(" $", "", gsub("^ |\\w+$", "", x)))), stringsAsFactors=F)
>
> # Splitting the code into smaller pieces:
> lapply(dat1, function(x) gsub("^ |\\w+$", "", x))
> # lapply() applies the function to each column of the data frame and returns
> # a list with one element per column. The pattern in the first pair of double
> # quotes matches a single space at the start of the string, or the last run
> # of word characters at the end of the string, and removes whatever matched
> # (the second pair of double quotes, the replacement, is empty).
> $V1
> [1] "Brown "   "Benigni " "Arstra "
>
> $V2
> [1] "Santos "        ""               "Van den Hoops "
>
> $V3
> [1] "Rome "     ""          "lamarque "
>
> $V4
> [1] "Don Juan " ""          ""
>
> lapply(dat1, function(x) gsub(" $", "", gsub("^ |\\w+$", "", x)))
> # I used a second gsub because there are some spaces left at the end, e.g. "Brown "
> $V1
> [1] "Brown"   "Benigni" "Arstra"
>
> $V2
> [1] "Santos"        ""              "Van den Hoops"
>
> $V3
> [1] "Rome"     ""         "lamarque"
>
> $V4
> [1] "Don Juan" ""         ""
>
> do.call(cbind, lapply(dat1, function(x) gsub(" $", "", gsub("^ |\\w+$", "", x))))
> # bind by columns
>      V1        V2              V3         V4
> [1,] "Brown"   "Santos"        "Rome"     "Don Juan"
> [2,] "Benigni" ""              ""         ""
> [3,] "Arstra"  "Van den Hoops" "lamarque" ""
>
> Hope it helps.
> A.K.
>
> ----- Original Message -----
> From: Biau David
> To: r help list
> Sent: Wednesday, January 23, 2013 12:38 PM
> Subject: [R] extracting characters from a string
>
> [...]

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
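A possible variation on the read.table() approach discussed above, offered only as a sketch: read.table() has a strip.white argument that trims leading and trailing whitespace from each field when sep is given, so stray spaces such as the one after "lamarque D " are gone before any regular expression runs. The sample data mirrors pubnew from the message; the single sub() pattern is an assumption of this sketch, not code from the thread.

```r
# Sample data with a stray trailing space in the last entry (as in pubnew).
pubnew <- c("Brown DK, Santos R, Rome DF, Don Juan X",
            "Benigni D",
            "Arstra SD, Van den Hoops DD, lamarque D ")

# strip.white = TRUE trims spaces around each comma-separated field;
# fill = TRUE pads short rows with empty fields.
dat1 <- read.table(text = pubnew, sep = ",", fill = TRUE,
                   strip.white = TRUE, stringsAsFactors = FALSE)

# Remove the final run of letters (the initials) plus any space before it.
res <- as.data.frame(
  lapply(dat1, function(x) sub("\\s*[[:alpha:]]+$", "", x)),
  stringsAsFactors = FALSE
)
print(res)
```

Note that this pattern, like the ones in the thread, assumes every author entry ends in initials; an entry consisting of a last name only would be emptied.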
Re: [R] extracting characters from a string
Thanks, it works well. I have to work on Arun's previous answer to make it work too.

David

> From: Rui Barradas
> To: Biau David
> Cc: r help list
> Sent: Wednesday, 23 January 2013, 19:57
> Subject: Re: [R] extracting characters from a string
>
> [...]
Re: [R] extracting characters from a string
Hi,

You could try this:

dat1 <- read.table(text=pub, sep=",", fill=TRUE, stringsAsFactors=F)
dat2 <- as.data.frame(do.call(cbind, lapply(dat1, function(x) gsub(" $", "", gsub("^ |\\w+$", "", x)))), stringsAsFactors=F)
dat2
#       V1            V2       V3       V4
#1   Brown        Santos     Rome Don Juan
#2 Benigni
#3  Arstra Van den Hoops lamarque

A.K.

----- Original Message -----
From: Biau David
To: r help list
Sent: Wednesday, January 23, 2013 12:38 PM
Subject: [R] extracting characters from a string

[...]
Re: [R] extracting characters from a string
Hello,

I've just noticed that my first solution would only return the first set of alphabetic characters, such as "Van", not "Van den Hoops". The following will solve that problem.

fun2 <- function(x, sep = ", "){
    x <- strsplit(x, sep)
    m <- lapply(x, function(y) gregexpr(" [[:alpha:]]*$", y))
    res <- lapply(seq_along(x), function(i)
        regmatches(x[[i]], m[[i]], invert = TRUE))
    res <- lapply(res, unlist)
    lapply(res, function(y) y[nchar(y) > 0])
}
fun2(pub)

Hope this helps,

Rui Barradas

On 23-01-2013 18:33, Rui Barradas wrote:
> [...]
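Since the original question asked for a data frame and fun2() returns a list, one way to bridge the two might look like the sketch below. The list `authors` is a hand-written stand-in for the result of fun2(pub), and padding with empty strings is an assumption of this sketch rather than anything proposed in the thread.

```r
# Stand-in for the list returned by fun2(pub); values copied from the thread.
authors <- list(c("Brown", "Santos", "Rome", "Don Juan"),
                "Benigni",
                c("Arstra", "Van den Hoops", "lamarque"))

# Pad every element with "" up to the length of the longest one.
width <- max(lengths(authors))
padded <- lapply(authors, function(a) c(a, rep("", width - length(a))))

# Bind row-wise: one publication per row, one author per column.
res <- as.data.frame(do.call(rbind, padded), stringsAsFactors = FALSE)
print(res)
```

Padding with NA instead of "" would work the same way and may be preferable if missing authors should be distinguishable from empty names.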
Re: [R] extracting characters from a string
Hello,

Try the following.

fun <- function(x, sep = ", "){
    s <- unlist(strsplit(x, sep))
    regmatches(s, regexpr("[[:alpha:]]*", s))
}

fun(pub)

Hope this helps,

Rui Barradas

On 23-01-2013 17:38, Biau David wrote:
> [...]
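A small illustration of why the function above stops at the first word of a multi-word name: regexpr() reports only the first match in each string. The sketch below contrasts that with removing the trailing initials instead. The vector `s` and the use of "+" rather than "*" in the patterns are assumptions made for this example.

```r
# Two author entries, one with a multi-word last name.
s <- c("Brown DK", "Van den Hoops DD")

# regexpr() finds only the first run of letters in each string.
first_word <- regmatches(s, regexpr("[[:alpha:]]+", s))
print(first_word)

# Deleting the final " <letters>" block keeps multi-word names intact.
last_name <- sub(" [[:alpha:]]+$", "", s)
print(last_name)
```

Here `first_word` holds "Brown" and "Van", while `last_name` holds "Brown" and "Van den Hoops".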
Re: [R] extracting characters from a string
1. Study a regular expression tutorial on the web to learn how to do this.
2. ?regex in R summarizes (tersely! -- but clearly) R's regular expressions.
3. ?grep tells you about R's regular expression manipulation functions.

-- Bert

On Wed, Jan 23, 2013 at 9:38 AM, Biau David wrote:
> [...]

--
Bert Gunter
Genentech Nonclinical Biostatistics
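As a rough companion to the pointers above, the following sketch shows a few of the constructs documented in ?regex, applied with functions from ?grep. The sample string and variable names are made up for illustration.

```r
x <- "Van den Hoops DD"

# "^" anchors a pattern at the start of the string, "$" at the end.
starts_with_van <- grepl("^Van", x)
ends_with_dd    <- grepl("DD$", x)

# "[[:alpha:]]" is the POSIX class for letters; "+" means one or more.
# sub() replaces the first match only.
no_initials <- sub("[[:alpha:]]+$", "", x)

# "|" is alternation; gsub() replaces every match.
trimmed <- gsub("^ | $", "", " padded ")

print(list(starts_with_van, ends_with_dd, no_initials, trimmed))
```

After the sub() call, `no_initials` still ends in a space ("Van den Hoops "), which is why the solutions in this thread apply a second substitution to trim it.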
[R] extracting characters from a string
Dear All,

I have a data frame of vectors of publication names such as 'pub':

pub1 <- c('Brown DK, Santos R, Rome DF, Don Juan X')
pub2 <- c('Benigni D')
pub3 <- c('Arstra SD, Van den Hoops DD, lamarque D')

pub <- rbind(pub1, pub2, pub3)

I would like to construct a data frame containing only the authors' last names, with each last name in its own column and each publication in its own row. Basically, I want to get rid of the initials (max 2, always before a comma) and the spaces surrounding each last name. I would like to avoid a loop.

PS: If I could have even a short explanation of the code that extracts the values from the character string, that would also be great!

David