Hi Dimitri,

No problem. I noticed that the read.table() approach slows down sharply as the number of rows grows. You could use the data.table package instead.

## 1e6 rows
l1 <- letters[1:10]
# build "aaa", "bbb", ..., "jjj"
s1 <- sapply(seq_along(l1), function(i) paste(rep(l1[i], 3), collapse = ""))
set.seed(24)
x2 <- data.frame(
  x = paste(
    paste0(sample(s1, 1e6, replace = TRUE), sample(1:15, 1e6, replace = TRUE)),
    paste0(sample(s1, 1e6, replace = TRUE), sample(1:15, 1e6, replace = TRUE)),
    paste0(sample(s1, 1e6, replace = TRUE), sample(1:15, 1e6, replace = TRUE)),
    sep = "_"
  ),
  stringsAsFactors = FALSE
)
# read.table() approach from my earlier message
system.time(resNew2 <- data.frame(x = x2, read.table(text = gsub("[A-Za-z]", "", x2[, 1]), sep = "_", header = FALSE), stringsAsFactors = FALSE))
#    user  system elapsed
# 363.383   0.036 364.153
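
As a base-R point of comparison (a sketch only, not something timed in this thread): the digits can also be extracted with a single vectorized strsplit() call and reshaped into a matrix. This assumes every string has exactly three "_"-separated fields; the helper name split_digits() and the objects demo and res_demo are made up for illustration.

```r
# Hypothetical helper: strip letters, split on "_", return an integer matrix.
# Assumes each element of x has exactly n_fields "_"-separated fields.
split_digits <- function(x, n_fields = 3L) {
  digits <- gsub("[A-Za-z]", "", x)              # "aaa1_bbb1_ccc3" -> "1_1_3"
  parts  <- strsplit(digits, "_", fixed = TRUE)  # list of character vectors
  stopifnot(all(lengths(parts) == n_fields))     # guard the assumption above
  out <- matrix(as.integer(unlist(parts)), ncol = n_fields, byrow = TRUE)
  colnames(out) <- paste0("V", seq_len(n_fields))
  out
}

# Small illustrative input (same shape as the strings in this thread)
demo <- data.frame(x = c("aaa1_bbb1_ccc3", "aaa2_bbb3_ccc2", "aaa3_bbb2_ccc1"),
                   stringsAsFactors = FALSE)
res_demo <- data.frame(demo, split_digits(demo$x), stringsAsFactors = FALSE)
res_demo
```

Unlike the data.table code that follows, this returns V1 to V3 as integers. Whether it is faster on 1e6 rows would need to be checked with system.time().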

## data.table approach
library(data.table)
dt2 <- data.table(x2)
system.time({
  # strip the letters, leaving e.g. "12_3_8"
  dt2[, xNew := gsub("[A-Za-z]", "", x)]
  # grouping by xNew splits each distinct string only once
  dt2[, V1 := unlist(strsplit(xNew, split = "_"))[[1]], by = xNew]
  dt2[, V2 := unlist(strsplit(xNew, split = "_"))[[2]], by = xNew]
  dt2[, V3 := unlist(strsplit(xNew, split = "_"))[[3]], by = xNew]
  # drop the helper column xNew (column 2)
  dt3 <- subset(dt2, select = -2)
})
#   user  system elapsed
#  3.076   0.004   3.085
# Note: V1 to V3 are character here; wrap in as.integer() if numbers are needed.

dim(resNew2)
#[1] 1000000       4
dim(dt3)
#[1] 1000000       4

head(resNew2)
#                  x V1 V2 V3
#1   ccc12_ccc3_ggg8 12  3  8
#2   ccc8_ccc1_fff11  8  1 11
#3  hhh15_ggg2_hhh13 15  2 13
#4    fff9_bbb3_ccc9  9  3  9
#5   ggg4_eee2_jjj14  4  2 14
#6   jjj7_ddd9_bbb15  7  9 15

head(dt3)
#                   x V1 V2 V3
#1:   ccc12_ccc3_ggg8 12  3  8
#2:   ccc8_ccc1_fff11  8  1 11
#3:  hhh15_ggg2_hhh13 15  2 13
#4:    fff9_bbb3_ccc9  9  3  9
#5:   ggg4_eee2_jjj14  4  2 14
#6:   jjj7_ddd9_bbb15  7  9 15

A.K.

________________________________
From: Dimitri Liakhovitski <dimitri.liakhovit...@gmail.com>
To: arun <smartpink...@yahoo.com>
Cc: R help <r-help@r-project.org>
Sent: Saturday, June 8, 2013 5:59 PM
Subject: Re: [R] splitting a string column into multiple columns faster

Thanks again, guys! Arun's method worked. I have over 270,000 rows and it took me 1 min.

Dimitri

On Sat, Jun 8, 2013 at 7:47 AM, Dimitri Liakhovitski <dimitri.liakhovit...@gmail.com> wrote:

Thank you so much, Jorge and Arun - I'll give it a try!
>Dimitri
>
>On Fri, Jun 7, 2013 at 11:27 PM, arun <smartpink...@yahoo.com> wrote:
>
>Hi,
>>I tried it on a 1e5-row dataset:
>>
>>l1 <- letters[1:10]
>>s1 <- sapply(seq_along(l1), function(i) paste(rep(l1[i], 3), collapse = ""))
>>set.seed(24)
>>x1 <- data.frame(
>>  x = paste(
>>    paste0(sample(s1, 1e5, replace = TRUE), sample(1:15, 1e5, replace = TRUE)),
>>    paste0(sample(s1, 1e5, replace = TRUE), sample(1:15, 1e5, replace = TRUE)),
>>    paste0(sample(s1, 1e5, replace = TRUE), sample(1:15, 1e5, replace = TRUE)),
>>    sep = "_"
>>  ),
>>  stringsAsFactors = FALSE
>>)
>>system.time(resNew <- data.frame(x = x1, read.table(text = gsub("[A-Za-z]", "", x1[, 1]), sep = "_", header = FALSE), stringsAsFactors = FALSE))
>>#   user  system elapsed
>>#  2.712   0.016   2.732
>>
>>head(resNew)
>>#                   x V1 V2 V3
>>#1   ccc12_ggg2_jjj14 12  2 14
>>#2   ccc7_ddd15_aaa11  7 15 11
>>#3  hhh12_ddd14_fff12 12 14 12
>>#4   fff11_bbb15_aaa6 11 15  6
>>#5    ggg12_ccc9_ggg8 12  9  8
>>#6    jjj8_eee12_eee4  8 12  4
>>
>>A.K.
>>
>>----- Original Message -----
>>From: arun <smartpink...@yahoo.com>
>>To: Dimitri Liakhovitski <dimitri.liakhovit...@gmail.com>
>>Cc: R help <r-help@r-project.org>
>>Sent: Friday, June 7, 2013 11:00 PM
>>Subject: Re: [R] splitting a string column into multiple columns faster
>>
>>Hi,
>>Maybe this helps:
>>
>>res <- data.frame(x = x, read.table(text = gsub("[A-Za-z]", "", x[, 1]), sep = "_", header = FALSE), stringsAsFactors = FALSE)
>>res
>>#               x V1 V2 V3
>>#1 aaa1_bbb1_ccc3  1  1  3
>>#2 aaa2_bbb3_ccc2  2  3  2
>>#3 aaa3_bbb2_ccc1  3  2  1
>>A.K.
>>
>>----- Original Message -----
>>From: Dimitri Liakhovitski <dimitri.liakhovit...@gmail.com>
>>To: r-help <r-help@r-project.org>
>>Cc:
>>Sent: Friday, June 7, 2013 9:24 PM
>>Subject: [R] splitting a string column into multiple columns faster
>>
>>Hello!
>>
>>I have a column in my data frame that I have to split: I have to distill
>>the numbers from the text. Below is my example and my solution.
>>
>>x <- data.frame(x = c("aaa1_bbb1_ccc3", "aaa2_bbb3_ccc2", "aaa3_bbb2_ccc1"))
>>x
>>library(stringr)
>>out <- as.data.frame(str_split_fixed(x$x, "aaa", 2))
>>out2 <- as.data.frame(str_split_fixed(out$V2, "_bbb", 2))
>>out3 <- as.data.frame(str_split_fixed(out2$V2, "_ccc", 2))
>>result <- cbind(x, out2[1], out3)
>>result
>>
>>My problem is:
>>str_split_fixed is relatively slow. My real data frame has over
>>80,000 rows, and running just one line (like out<-... above) takes
>>almost 30 seconds. It gets even slower because I have to do it
>>step by step, many times.
>>
>>Is there any way to do it by specifying all 3 delimiters at once
>>("aaa","_bbb","_ccc") and then split it in one swoop into a data frame with
>>several columns?
>>
>>Thanks a lot for any pointers!
>>
>>--
>>Dimitri Liakhovitski
>>
>>______________________________________________
>>R-help@r-project.org mailing list
>>https://stat.ethz.ch/mailman/listinfo/r-help
>>PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>and provide commented, minimal, self-contained, reproducible code.