Those interested in faster sorting may want to look at the merge sort implemented in pqR (see pqR-project.org). It's often used as the default, because it is stable, and does different collations, while being faster than shell sort (except for small vectors).
Here are examples, with timings, for pqR-2020-07-23 and R-4.0.2, compiled identically: ----------------------------- pqR-2020-07-23 in C locale: > set.seed(1) > N <- 1000000 > x <- as.character (sample(N,N,replace=TRUE)) > print(system.time (os <- order(x,method="shell"))) user system elapsed 1.332 0.000 1.334 > print(system.time (or <- order(x,method="radix"))) user system elapsed 0.092 0.004 0.096 > print(system.time (om <- order(x,method="merge"))) user system elapsed 0.363 0.000 0.363 > print(identical(os,or)) [1] TRUE > print(identical(os,om)) [1] TRUE > > x <- c("a","~") > print(order(x,method="shell")) [1] 1 2 > print(order(x,method="radix")) [1] 1 2 > print(order(x,method="merge")) [1] 1 2 ----------------------------- R-4.0.2 in C locale: > set.seed(1) > N <- 1000000 > x <- as.character (sample(N,N,replace=TRUE)) > print(system.time (os <- order(x,method="shell"))) user system elapsed 2.381 0.004 2.387 > print(system.time (or <- order(x,method="radix"))) user system elapsed 0.138 0.000 0.137 > #print(system.time (om <- order(x,method="merge"))) > print(identical(os,or)) [1] TRUE > #print(identical(os,om)) > > x <- c("a","~") > print(order(x,method="shell")) [1] 1 2 > print(order(x,method="radix")) [1] 1 2 > #print(order(x,method="merge")) ------------------------------------ pqR-2020-07-23 in fr_CA.utf8 locale: > set.seed(1) > N <- 1000000 > x <- as.character (sample(N,N,replace=TRUE)) > print(system.time (os <- order(x,method="shell"))) utilisateur système écoulé 2.960 0.000 2.962 > print(system.time (or <- order(x,method="radix"))) utilisateur système écoulé 0.083 0.008 0.092 > print(system.time (om <- order(x,method="merge"))) utilisateur système écoulé 1.143 0.000 1.142 > print(identical(os,or)) [1] TRUE > print(identical(os,om)) [1] TRUE > > x <- c("a","~") > print(order(x,method="shell")) [1] 2 1 > print(order(x,method="radix")) [1] 1 2 > print(order(x,method="merge")) [1] 2 1 ------------------------------------ R-4.0.2 in fr_CA.utf8 locale: > set.seed(1) > N <- 1000000 > x <- as.character (sample(N,N,replace=TRUE)) > print(system.time (os <- order(x,method="shell"))) utilisateur système écoulé 4.222 0.016 4.239 > print(system.time (or <- order(x,method="radix"))) utilisateur système écoulé 0.136 0.000 0.137 > #print(system.time (om <- order(x,method="merge"))) > print(identical(os,or)) [1] TRUE > #print(identical(os,om)) > > x <- c("a","~") > print(order(x,method="shell")) [1] 2 1 > print(order(x,method="radix")) [1] 1 2 > #print(order(x,method="merge")) pqR is faster in all the tests, but more relevant to this discussion is that the "merge" method is substantially faster than "shell" for these character vectors with a million elements, while retaining the stability and collation properties of "shell" (whereas "radix" only does C collation). It would probably not be too hard to take the merge sort code from pqR and use it in R core's implementation. The merge sort in pqR doesn't exploit parallelism at the moment, but merge sort is potentially quite parallelizable (though I think the storage allocation strategy I use would have to be modified). Regards, Radford Neal ______________________________________________ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel