Marvelous! Thanks guys for your hints and time! Very smooth now! Joh
On Wednesday 26 November 2008 03:41:49 Henrik Bengtsson wrote: > Alright, here are another $.02: using 'use.names=FALSE' in unlist() is > much faster than the default 'use.names=TRUE'. /Henrik > > On Tue, Nov 25, 2008 at 6:40 PM, Henrik Bengtsson <[EMAIL PROTECTED]> wrote: > > My $.02: Using argument 'fixed=TRUE' in strsplit() is much faster than > > the default 'fixed=FALSE'. /Henrik > > > > On Tue, Nov 25, 2008 at 1:02 PM, William Dunlap <[EMAIL PROTECTED]> wrote: > >>> -----Original Message----- > >>> From: William Dunlap > >>> Sent: Tuesday, November 25, 2008 9:16 AM > >>> To: '[EMAIL PROTECTED]' > >>> Subject: Re: [R] Efficient passing through big data.frame and > >>> modifying select fields > >>> > >>> > Johannes Graumann johannes_graumann at web.de > >>> > Tue Nov 25 15:16:01 CET 2008 > >>> > > >>> > Hi all, > >>> > > >>> > I have relatively big data frames (> 10000 rows by 80 columns) > >>> > that need to be exposed to "merge". Works marvelously well in > >>> > general, but some fields of the data frames actually contain > >>> > multiple ";"-separated values encoded as a character string without > >>> > defined order, which makes the fields not match each other. > >>> > > >>> > Example: > >>> > > frame1[1,1] > >>> > > >>> > [1] "some;thing" > >>> > > >>> > >frame2[2,1] > >>> > > >>> > [2] "thing;some" > >>> > > >>> > In order to enable merging/duplicate identification of columns > >>> > containing these strings, I wrote the following function, which > >>> > passes through the rows one by one, identifies ";"-containing cells, > >>> > splits and resorts them. > >>> > > >>> > ResortCombinedFields <- function(dframe){ > >>> > if(!is.data.frame(dframe)){ > >>> > stop("\"ResortCombinedFields\" input needs to be a data frame.") > >>> > } > >>> > for(row in seq(nrow(dframe))){ > >>> > for(mef in grep(";",dframe[row,])){ > >>> > >>> I needed to add drop=TRUE to the above dframe[row,] for this to work. > >>> > >>> > dframe[row,mef] <- > >>> > >>> paste(sort(unlist(strsplit(dframe[row,mef],";"))),collapse=";") > >>> > >>> > } > >>> > } > >>> > return(dframe) > >>> > } > >>> > > >>> > works fine, but is horribly inefficient. How might this be > >>> > >>> tackled more elegantly? > >>> > >>> > Thanks for any input, Joh > >>> > >>> It is usually faster to loop over columns of an data frame and use row > >>> subscripting, if needed, on individual columns. E.g., the following > >>> 2 are much quicker on a sample 1000 by 4 dataset I made with > >>> > >>> dframe<-data.frame(lapply(c(One=1,Two=2,Three=3), > >>> function(i)sapply(1:1000, > >>> function(i) > >>> > >>> paste(sample(LETTERS[1:5],size=sample(3,size=1),repl=FALSE), > >>> collapse=";"))), > >>> stringsAsFactors=FALSE) > >>> dframe$Four<-sample(LETTERS[1:5], size=nrow(dframe), > >>> replace=TRUE) # no ;'s in column Four > >>> > >>> The first function, f1, doesn't try to find which rows may > >>> need adjusting > >>> and the second, f2, does. > >>> > >>> f1 <- function(dframe){ > >>> if(!is.data.frame(dframe)){ > >>> stop("\"ResortCombinedFields\" input needs to be a data frame.") > >>> } > >>> for(icol in seq_len(ncol(dframe))){ > >>> dframe[,icol] <- unlist(lapply(strsplit(dframe[,icol], > >>> ";"), function(parts) paste(sort(parts), collapse=";"))) > >>> } > >>> return(dframe) > >>> } > >>> > >>> f2 <- > >>> function(dframe){ > >>> if(!is.data.frame(dframe)){ > >>> stop("\"ResortCombinedFields\" input needs to be a data frame.") > >>> } > >>> for(icol in seq_len(ncol(dframe))){ > >>> col <- dframe[,icol] > >>> irow <- grep(";", col) > >>> if (length(irow)) { > >>> col[irow] <- unlist(lapply(strsplit(col[irow], ";"), > >>> function(parts) paste(sort(parts), collapse=";"))) > >>> dframe[,icol] <- col > >>> } > >>> } > >>> return(dframe) > >>> } > >>> > >>> Times were > >>> > >>> > unix.time(z<-ResortCombinedFields(dframe)) > >>> > >>> user system elapsed > >>> 2.526 0.022 2.559 > >>> > >>> > unix.time(f1z<-f1(dframe)) > >>> > >>> user system elapsed > >>> 0.509 0.000 0.508 > >>> > >>> > unix.time(f2z<-f2(dframe)) > >>> > >>> user system elapsed > >>> 0.259 0.004 0.264 > >>> > >>> > identical(z, f1z) > >>> > >>> [1] TRUE > >>> > >>> > identical(z, f2z) > >>> > >>> [1] TRUE > >> > >> In R 2.7.0 (April 2008) f1() and f2() both take time proportional > >> to nrow(dframe), while your original ResortCombinedFields() takes > >> time proportional to the square of nrow(dframe). E.g., for 50,000 > >> rows ResortCombinedFields takes 4252 seconds while f2 takes 14 seconds > >> It looks like 2.9 acts about the same. > >> > >> Bill Dunlap > >> TIBCO Software Inc - Spotfire Division > >> wdunlap tibco.com > >> > >> ______________________________________________ > >> R-help@r-project.org mailing list > >> https://stat.ethz.ch/mailman/listinfo/r-help > >> PLEASE do read the posting guide > >> http://www.R-project.org/posting-guide.html and provide commented, > >> minimal, self-contained, reproducible code.
signature.asc
Description: This is a digitally signed message part.
______________________________________________ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.