On Thu, 2007-02-01 at 23:34 +0000, Prof Brian Ripley wrote: > On Thu, 1 Feb 2007, Marc Schwartz wrote: > > > Christos, > > > > Hmmmm....according to the Value section in ?merge: > > > > A data frame. The rows are by default lexicographically sorted on the > > common columns, but for sort=FALSE are in an unspecified order. > > There is also a sort in the .Internal code. But I am not buying > that this is a major part of the time without detailed evidence from > profiling. Sorting 35k numbers should take a few milliseconds, and > less if they are already sorted. > > > x <- rnorm(35000) > > system.time(y <- sort(x, method="quick")) > [1] 0.003 0.001 0.004 0.000 0.000 > > system.time(sort(y, method="quick")) > [1] 0.002 0.000 0.001 0.000 0.000
Having had a chance to mock up some examples, I would have to agree with Prof. Ripley on this point. Presuming that we are not missing something about the nature of Christos' data sets, here are 4 examples, with rows sorted in ascending order, descending order, reversed sort order and random order. In theory, the descending order example should, I believe, represent a worst cast scenario, since reverse sorting a sorted list is typically slowest. However, note that there is not much time variation below and running each of the examples several times resulted in material differences across runs. 1. Ascending order DF.X <- data.frame(X = 1:35000, Y = runif(35000)) DF.Y <- data.frame(X = 1:35000, Y = runif(35000)) > system.time(DF.XY <- merge(DF.X, DF.Y, by = "X", all = TRUE)) [1] 0.249 0.004 0.264 0.000 0.000 2. Descending order DF.X <- data.frame(X = 35000:1, Y = runif(35000)) DF.Y <- data.frame(X = 35000:1, Y = runif(35000)) > system.time(DF.XY <- merge(DF.X, DF.Y, by = "X", all = TRUE)) [1] 0.300 0.007 0.309 0.000 0.000 3. Reversed sort order DF.X <- data.frame(X = 35000:1, Y = runif(35000)) DF.Y <- data.frame(X = 1:35000, Y = runif(35000)) > system.time(DF.XY <- merge(DF.X, DF.Y, by = "X", all = TRUE)) [1] 0.236 0.008 0.245 0.000 0.000 4. Random order DF.X <- data.frame(X = sample(35000), Y = runif(35000)) DF.Y <- data.frame(X = sample(35000), Y = runif(35000)) > system.time(DF.XY <- merge(DF.X, DF.Y, by = "X", all = TRUE)) [1] 0.339 0.016 0.357 0.000 0.000 Spending some time looking at profiling the descending order example, we get: > summaryRprof() $by.self self.time self.pct total.time total.pct "duplicated.default" 0.16 38.1 0.16 38.1 "match" 0.08 19.0 0.08 19.0 "sort.list" 0.08 19.0 0.08 19.0 "[.data.frame" 0.04 9.5 0.24 57.1 "merge.data.frame" 0.02 4.8 0.42 100.0 "names.default" 0.02 4.8 0.02 4.8 "seq_len" 0.02 4.8 0.02 4.8 "merge" 0.00 0.0 0.42 100.0 "[" 0.00 0.0 0.24 57.1 "any" 0.00 0.0 0.18 42.9 "duplicated" 0.00 0.0 0.18 42.9 "cbind" 0.00 0.0 0.04 9.5 "data.frame" 0.00 0.0 0.04 9.5 "data.row.names" 0.00 0.0 0.02 4.8 "names" 0.00 0.0 0.02 4.8 "row.names<-" 0.00 0.0 0.02 4.8 "row.names<-.data.frame" 0.00 0.0 0.02 4.8 $by.total total.time total.pct self.time self.pct "merge.data.frame" 0.42 100.0 0.02 4.8 "merge" 0.42 100.0 0.00 0.0 "[.data.frame" 0.24 57.1 0.04 9.5 "[" 0.24 57.1 0.00 0.0 "any" 0.18 42.9 0.00 0.0 "duplicated" 0.18 42.9 0.00 0.0 "duplicated.default" 0.16 38.1 0.16 38.1 "match" 0.08 19.0 0.08 19.0 "sort.list" 0.08 19.0 0.08 19.0 "cbind" 0.04 9.5 0.00 0.0 "data.frame" 0.04 9.5 0.00 0.0 "names.default" 0.02 4.8 0.02 4.8 "seq_len" 0.02 4.8 0.02 4.8 "data.row.names" 0.02 4.8 0.00 0.0 "names" 0.02 4.8 0.00 0.0 "row.names<-" 0.02 4.8 0.00 0.0 "row.names<-.data.frame" 0.02 4.8 0.00 0.0 $sampling.time [1] 0.42 The above suggests that a meaningful amount of time is spent in checking for and dealing with duplicates in the common ('by') columns. To that end: DF.X <- data.frame(X = sample(10000, 35000, replace = TRUE), Y = runif(35000)) DF.Y <- data.frame(X = sample(10000, 35000, replace = TRUE), Y = runif(35000)) > system.time(DF.XY <- merge(DF.X, DF.Y, by = "X", all = TRUE)) [1] 3.316 0.148 3.502 0.000 0.000 So, it would seem that introducing duplicate values in the same sized vector space does indeed materially affect the time required: > summaryRprof() $by.self self.time self.pct total.time total.pct "duplicated.default" 0.86 27.6 0.86 27.6 "make.unique" 0.76 24.4 0.76 24.4 "[.data.frame" 0.72 23.1 1.70 54.5 "data.frame" 0.18 5.8 0.38 12.2 "row.names<-.data.frame" 0.14 4.5 0.36 11.5 "names<-.default" 0.14 4.5 0.14 4.5 "merge.data.frame" 0.08 2.6 3.12 100.0 "order" 0.06 1.9 0.06 1.9 "rbind" 0.04 1.3 0.44 14.1 "[[<-.data.frame" 0.04 1.3 0.04 1.3 "match" 0.04 1.3 0.04 1.3 "unclass" 0.04 1.3 0.04 1.3 "unlist" 0.02 0.6 0.02 0.6 "merge" 0.00 0.0 3.12 100.0 "[" 0.00 0.0 1.70 54.5 "any" 0.00 0.0 0.86 27.6 "duplicated" 0.00 0.0 0.86 27.6 "cbind" 0.00 0.0 0.38 12.2 "row.names<-" 0.00 0.0 0.36 11.5 "do.call" 0.00 0.0 0.18 5.8 "names<-" 0.00 0.0 0.14 4.5 "data.row.names" 0.00 0.0 0.10 3.2 "[[<-" 0.00 0.0 0.04 1.3 $by.total total.time total.pct self.time self.pct "merge.data.frame" 3.12 100.0 0.08 2.6 "merge" 3.12 100.0 0.00 0.0 "[.data.frame" 1.70 54.5 0.72 23.1 "[" 1.70 54.5 0.00 0.0 "duplicated.default" 0.86 27.6 0.86 27.6 "any" 0.86 27.6 0.00 0.0 "duplicated" 0.86 27.6 0.00 0.0 "make.unique" 0.76 24.4 0.76 24.4 "rbind" 0.44 14.1 0.04 1.3 "data.frame" 0.38 12.2 0.18 5.8 "cbind" 0.38 12.2 0.00 0.0 "row.names<-.data.frame" 0.36 11.5 0.14 4.5 "row.names<-" 0.36 11.5 0.00 0.0 "do.call" 0.18 5.8 0.00 0.0 "names<-.default" 0.14 4.5 0.14 4.5 "names<-" 0.14 4.5 0.00 0.0 "data.row.names" 0.10 3.2 0.00 0.0 "order" 0.06 1.9 0.06 1.9 "[[<-.data.frame" 0.04 1.3 0.04 1.3 "match" 0.04 1.3 0.04 1.3 "unclass" 0.04 1.3 0.04 1.3 "[[<-" 0.04 1.3 0.00 0.0 "unlist" 0.02 0.6 0.02 0.6 $sampling.time [1] 3.12 I would also point out the following: > str(DF.XY) 'data.frame': 124704 obs. of 3 variables: $ X : int 1 1 1 1 1 1 1 1 1 1 ... $ Y.x: num 0.7233 0.7233 0.7233 0.0577 0.0577 ... $ Y.y: num 0.805 0.742 0.324 0.805 0.742 ... Note that the resultant data frame is NOT 35,000 rows as a consequence of the multiple matches. Presumably, this gets worse with each subsequent merge back to the prior result. For example: > system.time(DF.XY2 <- merge(DF.XY, DF.Y, by = "X", all = TRUE)) [1] 12.270 1.173 13.624 0.000 0.000 > str(DF.XY2) 'data.frame': 557116 obs. of 4 variables: $ X : int 1 1 1 1 1 1 1 1 1 1 ... $ Y.x: num 0.00213 0.00213 0.00213 0.85017 0.85017 ... $ Y.y: num 0.324 0.324 0.324 0.324 0.324 ... $ Y : num 0.742 0.324 0.805 0.742 0.324 ... I may be extrapolating beyond known data here, but Christos, the above suggests that your data sets have some proportion of duplicate values in the columns that you are using for matching. HTH, Marc Schwartz ______________________________________________ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.