Re: [R] extracting characters from a string

2013-01-23 Thread arun


Hi David,


It could be related to stray spaces in the data, or to something else.
Suppose, for example, that the data has spaces at the beginning or the end of a string:
pub1 <- c('Brown DK, Santos R, Rome DF, Don Juan X')
pub2 <- c('Benigni D')
pub3 <- c('Arstra SD, Van den Hoops DD, lamarque D ')

pubnew<-rbind(pub1, pub2, pub3)
dat1<-read.table(text=pubnew,sep=",",fill=TRUE,stringsAsFactors=F)
res<-as.data.frame(do.call(cbind,lapply(dat1,function(x) gsub("^ | $","",gsub("[A-Za-z]+$","",gsub(" $","",x))))),stringsAsFactors=F)
str(res)
#'data.frame':    3 obs. of  4 variables:
# $ V1: chr  "Brown" "Benigni" "Arstra"
# $ V2: chr  "Santos" "" "Van den Hoops"
# $ V3: chr  "Rome" "" "lamarque"
# $ V4: chr  "Don Juan" "" ""



# If I used the previous solution:
as.data.frame(do.call(cbind,lapply(dat1,function(x) gsub(" $","",gsub("^ |\\w+$","",x)))),stringsAsFactors=F)
#       V1            V2         V3       V4
#1   Brown        Santos       Rome Don Juan
#2 Benigni
#3  Arstra Van den Hoops lamarque D           # initial still present

I tried this case with Rui's solution:
fun2(pubnew)
#[[1]]
#[1] "Brown"    "Santos"   "Rome"     "Don Juan"
#
#[[2]]
#[1] "Benigni"
#
#[[3]]
#[1] "Arstra"        "Van den Hoops" "lamarque D"    # initial still present

As Rui's solution works for you, the problem might be something else.
A.K.
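
One way to make the result insensitive to stray whitespace is to trim every entry before removing the initials. The sketch below assumes that trimws() is available (base R 3.2.0 or later) and that initials are one or two capital letters at the end of each entry. The helper name strip_initials is hypothetical and does not come from the thread.

```r
# Same data as above, including the trailing space in pub3.
pub1 <- "Brown DK, Santos R, Rome DF, Don Juan X"
pub2 <- "Benigni D"
pub3 <- "Arstra SD, Van den Hoops DD, lamarque D "
pubnew <- c(pub1, pub2, pub3)

# Hypothetical helper: split one publication on commas, trim each entry,
# then drop a trailing block of one or two capital letters (the initials).
strip_initials <- function(x) {
  parts <- trimws(strsplit(x, ",", fixed = TRUE)[[1]])
  sub("[[:space:]]+[[:upper:]]{1,2}$", "", parts)
}

res <- lapply(pubnew, strip_initials)
res[[3]]
# [1] "Arstra"        "Van den Hoops" "lamarque"
```

Because the trimming happens before the pattern is applied, the trailing space in pub3 no longer protects the final initial.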


   




From: Biau David 
To: arun  
Sent: Thursday, January 24, 2013 12:40 AM
Subject: Re: [R] extracting characters from a string


Thanks a lot. It doesn't entirely work yet, probably because of the format 
of the data I import. I have to look into it and, thanks to your explanation, I 
should be able to find the problem in the data.



David


>
> From: arun 
>To: Biau David  
>Sent: Wednesday, January 23, 2013 19:06
>Subject: Re: [R] extracting characters from a string
> 
>Hi David,
>
>I forgot about the explanation part.
>dat1<-read.table(text=pub,sep=",",fill=TRUE,stringsAsFactors=F)
># Here, I converted it to a data frame, delimited by ",". I used fill=TRUE
># because the lines have unequal numbers of authors.
>as.data.frame(do.call(cbind,lapply(dat1,function(x) gsub(" $","",gsub("^ |\\w+$","",x)))),stringsAsFactors=F)
>
># Splitting the code into smaller pieces:
>lapply(dat1,function(x) gsub("^ |\\w+$","",x))
># lapply() applies the function to each column of the data frame and returns
># the results as list elements. Here, the pattern in the first pair of double
># quotes matches a space at the start of the string, or the last run of word
># characters at the end of the string, and gsub() removes them (the second
># pair of double quotes, the replacement, is empty).
>$V1
>[1] "Brown "   "Benigni " "Arstra " 
>
>$V2
>[1] "Santos "    ""   "Van den Hoops "
>
>$V3
>[1] "Rome " ""  "lamarque "
>
>$V4
>[1] "Don Juan " ""  "" 
>lapply(dat1,function(x) gsub(" $","",gsub("^ |\\w+$","",x)))
># I used a second gsub because a space is left at the end, e.g. "Brown "
>$V1
>[1] "Brown"   "Benigni" "Arstra" 
>
>$V2
>[1] "Santos"    ""  "Van den Hoops"
>
>$V3
>[1] "Rome"     ""         "lamarque"
>
>$V4
>[1] "Don Juan" "" ""    
>
>do.call(cbind,lapply(dat1,function(x) gsub(" $","",gsub("^ |\\w+$","",x)))) # bind by columns
>     V1        V2              V3         V4
>[1,] "Brown"   "Santos"        "Rome"     "Don Juan"
>[2,] "Benigni" ""              ""         ""
>[3,] "Arstra"  "Van den Hoops" "lamarque" ""
>
>Hope it helps.
>A.K.
>
>- Original Message -
>From: Biau David 
>To: r help list 
>Cc: 
>Sent: Wednesday, January 23, 2013 12:38 PM
>Subject: [R] extracting characters from a string
>
>Dear All,
>
>I have a data frame of vectors of publication names such as 'pub':
>
>pub1 <- c('Brown DK, Santos R, Rome DF, Don Juan X')
>pub2 <- c('Benigni D')
>pub3 <- c('Arstra SD, Van den Hoops DD, lamarque D')
>
>pub <- rbind(pub1, pub2, pub3)
>
>
>I would like to construct a data frame with only the authors' last names, with each 
>last name in a column and each publication in a row. Basically I want to get rid 
>of the initials (max 2, always before a comma) and the spaces surrounding the last 
>name. I would like to avoid a loop.
>
>PS: If I could also have a short explanation of the code that extracts the 
>values from the character string, that would be great!
>
> 
>David
>
>    [[alternative HTML version deleted]]
>
>
>__
>R-help@r-project.org mailing list
>https://stat.ethz.ch/mailman/listinfo/r-help
>PLEASE do read the posting guide http://www.r-project.org/posting-guide.html
>and provide commented, minimal, self-contained, reproducible code.
>
>
>
> 



Re: [R] extracting characters from a string

2013-01-23 Thread Biau David
Thanks, it works well. I have to work on Arun's previous answer to make it work 
too.


 
David


>
> From: Rui Barradas 
>To: Biau David  
>Cc: r help list  
>Sent: Wednesday, January 23, 2013 19:57
>Subject: Re: [R] extracting characters from a string
> 
>Hello,
>
>I've just noticed that my first solution would only return the first set 
>of alphabetic characters, such as "Van", not "Van den Hoops".
>The following will solve that problem.
>
>
>fun2 <- function(x, sep = ", "){
>    x <- strsplit(x, sep)
>    m <- lapply(x, function(y) gregexpr(" [[:alpha:]]*$", y))
>    res <- lapply(seq_along(x), function(i)
>        regmatches(x[[i]], m[[i]], invert = TRUE))
>    res <- lapply(res, unlist)
>    lapply(res, function(y) y[nchar(y) > 0])
>}
>fun2(pub)
>
>
>Hope this helps,
>
>Rui Barradas
>
>On 23-01-2013 18:33, Rui Barradas wrote:
>> Hello,
>>
>> Try the following.
>>
>> fun <- function(x, sep = ", "){
>>      s <- unlist(strsplit(x, sep))
>>      regmatches(s, regexpr("[[:alpha:]]*", s))
>> }
>>
>> fun(pub)
>>
>>
>> Hope this helps,
>>
>> Rui Barradas


Re: [R] extracting characters from a string

2013-01-23 Thread arun
Hi,
You could try this:
dat1<-read.table(text=pub,sep=",",fill=TRUE,stringsAsFactors=F)
dat2<- as.data.frame(do.call(cbind,lapply(dat1,function(x) gsub(" $","",gsub("^ |\\w+$","",x)))),stringsAsFactors=F)


 dat2
#       V1            V2       V3       V4
#1   Brown        Santos     Rome Don Juan
#2 Benigni
#3  Arstra Van den Hoops lamarque
A.K.
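
For comparison, the two nested gsub() calls can be collapsed into a single sub() with a capture group. This is only a sketch: the helper name clean is hypothetical, and it assumes the initials are one or two capital letters at the end of each field.

```r
pub <- c("Brown DK, Santos R, Rome DF, Don Juan X",
         "Benigni D",
         "Arstra SD, Van den Hoops DD, lamarque D")

# One row per publication, one column per author; short rows are padded.
dat1 <- read.table(text = pub, sep = ",", fill = TRUE, stringsAsFactors = FALSE)

# Hypothetical helper: keep everything between the optional leading
# whitespace and the trailing initials. Fields that do not match the
# pattern (for example the empty padding fields) are returned unchanged.
clean <- function(x) {
  sub("^\\s*(.*?)\\s+[[:upper:]]{1,2}\\s*$", "\\1", x, perl = TRUE)
}

dat2 <- as.data.frame(lapply(dat1, clean), stringsAsFactors = FALSE)
dat2
```

The lazy group (.*?) stops at the last whitespace before the initials, so multi-word names such as "Van den Hoops" are kept whole, and the optional \\s* before $ also tolerates a trailing space.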



Re: [R] extracting characters from a string

2013-01-23 Thread Rui Barradas

Hello,

I've just noticed that my first solution would only return the first set 
of alphabetic characters, such as "Van", not "Van den Hoops".

The following will solve that problem.


fun2 <- function(x, sep = ", "){
    x <- strsplit(x, sep)
    m <- lapply(x, function(y) gregexpr(" [[:alpha:]]*$", y))
    res <- lapply(seq_along(x), function(i)
        regmatches(x[[i]], m[[i]], invert = TRUE))
    res <- lapply(res, unlist)
    lapply(res, function(y) y[nchar(y) > 0])
}
fun2(pub)
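
fun2() returns a list with one character vector per publication, whereas the original question asked for a data frame with one last name per column. A minimal sketch of that last step follows. The list authors is typed in by hand to mimic the shape of fun2(pub), and padding with empty strings is an assumption about the desired layout.

```r
# Hand-written stand-in for the list returned by fun2(pub).
authors <- list(c("Brown", "Santos", "Rome", "Don Juan"),
                "Benigni",
                c("Arstra", "Van den Hoops", "lamarque"))

# Pad every vector with "" up to the length of the longest one,
# then bind the padded vectors as rows of a character matrix.
n <- max(lengths(authors))
padded <- lapply(authors, function(a) c(a, rep("", n - length(a))))
df <- as.data.frame(do.call(rbind, padded), stringsAsFactors = FALSE)
df
```

Using NA instead of "" for the padding would work equally well if missing authors should be distinguishable from empty names.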


Hope this helps,

Rui Barradas



Re: [R] extracting characters from a string

2013-01-23 Thread Rui Barradas

Hello,

Try the following.

fun <- function(x, sep = ", "){
    s <- unlist(strsplit(x, sep))
    regmatches(s, regexpr("[[:alpha:]]*", s))
}

fun(pub)
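
As Rui's follow-up in this thread points out, regexpr() with "[[:alpha:]]*" returns only the first run of letters, so multi-word last names are cut short. The small comparison below illustrates this on three entries taken from the example data. The variable names are illustrative only.

```r
s <- c("Brown DK", "Van den Hoops DD", "Don Juan X")

# First run of alphabetic characters only: multi-word names are truncated.
first_run <- regmatches(s, regexpr("[[:alpha:]]*", s))
first_run
# [1] "Brown" "Van"   "Don"

# Removing the last space-separated block instead keeps the whole name.
whole_name <- sub(" [[:alpha:]]*$", "", s)
whole_name
# [1] "Brown"         "Van den Hoops" "Don Juan"
```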


Hope this helps,

Rui Barradas



Re: [R] extracting characters from a string

2013-01-23 Thread Bert Gunter
1. Study a regular expression tutorial on the web to learn how to do this.

2. ?regex in R summarizes (tersely! -- but clearly) R's regular expressions.

3. ?grep tells you about R's regular expression manipulation functions.
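
As a small illustration of the functions documented under ?grep and ?regmatches, the sketch below applies one pattern to three entries from the question. The pattern assumes that initials are one or two capital letters at the end of the string, and the variable names are hypothetical.

```r
x <- c("Brown DK", "Benigni D", "Van den Hoops DD")

# grepl(): does each string end in a space plus one or two capitals?
has_initials <- grepl(" [[:upper:]]{1,2}$", x)

# regexpr() + regmatches(): extract the matched initials themselves.
initials <- regmatches(x, regexpr("[[:upper:]]{1,2}$", x))

# sub(): replace the match (space plus initials) with nothing.
last_names <- sub(" [[:upper:]]{1,2}$", "", x)
last_names
# [1] "Brown"         "Benigni"       "Van den Hoops"
```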

-- Bert

>



-- 

Bert Gunter
Genentech Nonclinical Biostatistics

Internal Contact Info:
Phone: 467-7374
Website:
http://pharmadevelopment.roche.com/index/pdb/pdb-functional-groups/pdb-biostatistics/pdb-ncb-home.htm



[R] extracting characters from a string

2013-01-23 Thread Biau David
Dear All,

I have a data frame of vectors of publication names such as 'pub':

pub1 <- c('Brown DK, Santos R, Rome DF, Don Juan X')
pub2 <- c('Benigni D')
pub3 <- c('Arstra SD, Van den Hoops DD, lamarque D')

pub <- rbind(pub1, pub2, pub3)


I would like to construct a data frame with only the authors' last names, with each 
last name in a column and each publication in a row. Basically I want to get rid 
of the initials (max 2, always before a comma) and the spaces surrounding the last name. 
I would like to avoid a loop.

PS: If I could also have a short explanation of the code that extracts the 
values from the character string, that would be great!

 
David

[[alternative HTML version deleted]]
