[EMAIL PROTECTED] writes:

> What I've found, however, is that it is not easy (or I have not found the
> easy way) to split a named vector into a list that retains the vector names.
> For example, splitting an unnamed vector (70,000+) based on the chain
> numbers takes very little time:
>
> > system.time(actTimeList <- split(actTime, chainId))
> [1] 0.16 0.00 0.15 NA NA
>
> But if the vector is named, R will work for minutes and still not complete
> the job:
>
> > names(actTime) <- zoneNames
> > system.time(actTimeList <- split(actTime, chainId))
> Timing stopped at: 83.22 0.12 84.49 NA NA
>
> The same thing happens with using tapply with a named vector such as:
> tapply(actTime, chainId, function(x) x)
>
> Using the following function with a for loop accomplishes the job in a few
> seconds for all 70,000+ records:
>
> > splitWithNames <- function(dataVector, nameVector, factorVector){
> +   dataList <- split(dataVector, factorVector)
> +   nameList <- split(nameVector, factorVector)
> +   listLength <- length(dataList)
> +   namedDataList <- list(NULL)
> +   for(i in 1:listLength){
> +     x <- dataList[[i]]
> +     names(x) <- nameList[[i]]
> +     namedDataList[[i]] <- x
> +   }
> +   namedDataList
> + }
> > system.time(actTimeList <- splitWithNames(actTime, zoneNames, chainId))
> [1] 8.04 0.00 9.03 NA NA
>
> However if I rewrite the function to use mapply instead of a for loop, it
> again takes a long (undetermined) amount of time to complete. Here are the
> results for just 5000 and 10000 records. You can see that there is a
> scaling issue:
>
> > testfun <- function(dataVector, nameVector, factorVector){
> +   dataList <- split(dataVector, factorVector)
> +   nameList <- split(nameVector, factorVector)
> +   nameFun <- function(x, xNames){
> +     names(x) <- xNames
> +     x
> +   }
> +   mapply(nameFun, dataList, nameList, SIMPLIFY=TRUE)
> + }
> > system.time(actTimeList <- testfun(actTime[1:5000], zoneNames[1:5000],
> chainId[1:5000]))
> [1] 2.99 0.00 2.98 NA NA
> > system.time(actTimeList <- testfun(actTime[1:10000], zoneNames[1:10000],
> chainId[1:10000]))
> [1] 10.64 0.00 10.64 NA NA
>
> My problem is solved for now with the home-brew splitWithNames function, but
> I'm curious about why named vectors slow down split and tapply so much and
> why a function using mapply is so much slower than a function that uses a
> for loop?

If you look inside split.default, you'll see that it only uses fast
internal code in simple cases:

    if (is.null(attr(x, "class")) && is.null(names(x)))
        return(.Internal(split(x, f)))

in the other cases, we use

    for (k in lf) y[[k]] <- x[f %in% k]

and if lf is large, we get a large number of calls to %in%. This wasn't
really designed for that case, but I suppose we could be smarter about
it.

Wouldn't know about mapply, but are you sure you want SIMPLIFY=TRUE in
there???

-- 
   O__  ---- Peter Dalgaard             Blegdamsvej 3
  c/ /'_ --- Dept. of Biostatistics     2200 Cph. N
 (*) \(*) -- University of Copenhagen   Denmark      Ph: (+45) 35327918
~~~~~~~~~~ - ([EMAIL PROTECTED])             FAX: (+45) 35327907
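
One way to sidestep the slow branch of split.default quoted above, as a rough sketch rather than a tested fix: split the positions of the vector instead of the vector itself. The positions are a plain unnamed integer vector, so split() should be able to take the internal path, and ordinary subsetting then carries the names along. The helper name splitKeepNames and the toy data are made up for illustration, and I have not timed this on the 70,000+ record case. The last few lines show the mapply variant with SIMPLIFY = FALSE, since SIMPLIFY = TRUE asks mapply to try to collapse the results into a vector or matrix, which is probably not wanted when a list is the goal.

```r
# Hypothetical helper (not part of base R): split a named vector by
# splitting its positions, so the names never reach split() itself.
splitKeepNames <- function(x, f) {
  idx <- split(seq_along(x), f)   # unnamed integer vector of positions
  lapply(idx, function(i) x[i])   # subsetting keeps the names
}

# Small made-up data standing in for actTime / zoneNames / chainId
actTime <- c(a = 1.5, b = 2.0, c = 0.5, d = 3.0)
chainId <- c(1, 2, 1, 2)

res <- splitKeepNames(actTime, chainId)

# mapply variant of the original testfun, but keeping a list result
nameFun <- function(x, xNames) {
  names(x) <- xNames
  x
}
dataList <- split(unname(actTime), chainId)
nameList <- split(names(actTime), chainId)
res2 <- mapply(nameFun, dataList, nameList, SIMPLIFY = FALSE)
```

Both res and res2 are lists with one named vector per chain; the first form avoids the second split and the per-group renaming altogether.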