Thank you Neal. This is interesting. I will have a look at pqR. Indeed radix only does C collation, I believe that is why it is not the default choice for character ordering and sorting. Not sure but I believe it can help address the following bugzilla item: https://bugs.r-project.org/bugzilla/show_bug.cgi?id=17400
On the same topic of collation, there is an experimental sorting function "psort" in package kit that might help address this issue. > library(kit) Attaching kit 0.0.7 (OPENMP enabled using 1 thread) > x <- c("b","A","B","a","\xe4") > Encoding(x) <- "latin1" > identical(psort(x, c.locale=FALSE), sort(x)) [1] TRUE > identical(psort(x, c.locale=TRUE), sort(x, method="radix")) [1] TRUE Coming back to the topic of fsort, I have just finished the implementation for double, integer, factor and logical. The implementation takes into account NA, Inf.. values. Values can be sorted in a decreasing order or increasing order. Comparing benchmark with the current implementation in data.table, it is currently over 30% faster. There might bugs but I am sure performance can be further improved as I did not really try hard. If there is interest in both the implementation and cross community sharing, please let know.... Best regards, Morgan On Wed, 17 Mar 2021, 00:37 Radford Neal, <radf...@cs.toronto.edu> wrote: > Those interested in faster sorting may want to look at the merge sort > implemented in pqR (see pqR-project.org). It's often used as the > default, because it is stable, and does different collations, while > being faster than shell sort (except for small vectors). > > Here are examples, with timings, for pqR-2020-07-23 and R-4.0.2, > compiled identically: > > ----------------------------- > pqR-2020-07-23 in C locale: > > > set.seed(1) > > N <- 1000000 > > x <- as.character (sample(N,N,replace=TRUE)) > > print(system.time (os <- order(x,method="shell"))) > user system elapsed > 1.332 0.000 1.334 > > print(system.time (or <- order(x,method="radix"))) > user system elapsed > 0.092 0.004 0.096 > > print(system.time (om <- order(x,method="merge"))) > user system elapsed > 0.363 0.000 0.363 > > print(identical(os,or)) > [1] TRUE > > print(identical(os,om)) > [1] TRUE > > > > x <- c("a","~") > > print(order(x,method="shell")) > [1] 1 2 > > print(order(x,method="radix")) > [1] 1 2 > > print(order(x,method="merge")) > [1] 1 2 > > ----------------------------- > R-4.0.2 in C locale: > > > set.seed(1) > > N <- 1000000 > > x <- as.character (sample(N,N,replace=TRUE)) > > print(system.time (os <- order(x,method="shell"))) > user system elapsed > 2.381 0.004 2.387 > > print(system.time (or <- order(x,method="radix"))) > user system elapsed > 0.138 0.000 0.137 > > #print(system.time (om <- order(x,method="merge"))) > > print(identical(os,or)) > [1] TRUE > > #print(identical(os,om)) > > > > x <- c("a","~") > > print(order(x,method="shell")) > [1] 1 2 > > print(order(x,method="radix")) > [1] 1 2 > > #print(order(x,method="merge")) > > ------------------------------------ > pqR-2020-07-23 in fr_CA.utf8 locale: > > > set.seed(1) > > N <- 1000000 > > x <- as.character (sample(N,N,replace=TRUE)) > > print(system.time (os <- order(x,method="shell"))) > utilisateur système écoulé > 2.960 0.000 2.962 > > print(system.time (or <- order(x,method="radix"))) > utilisateur système écoulé > 0.083 0.008 0.092 > > print(system.time (om <- order(x,method="merge"))) > utilisateur système écoulé > 1.143 0.000 1.142 > > print(identical(os,or)) > [1] TRUE > > print(identical(os,om)) > [1] TRUE > > > > x <- c("a","~") > > print(order(x,method="shell")) > [1] 2 1 > > print(order(x,method="radix")) > [1] 1 2 > > print(order(x,method="merge")) > [1] 2 1 > > ------------------------------------ > R-4.0.2 in fr_CA.utf8 locale: > > > set.seed(1) > > N <- 1000000 > > x <- as.character (sample(N,N,replace=TRUE)) > > print(system.time (os <- order(x,method="shell"))) > utilisateur système écoulé > 4.222 0.016 4.239 > > print(system.time (or <- order(x,method="radix"))) > utilisateur système écoulé > 0.136 0.000 0.137 > > #print(system.time (om <- order(x,method="merge"))) > > print(identical(os,or)) > [1] TRUE > > #print(identical(os,om)) > > > > x <- c("a","~") > > print(order(x,method="shell")) > [1] 2 1 > > print(order(x,method="radix")) > [1] 1 2 > > #print(order(x,method="merge")) > > > pqR is faster in all the tests, but more relevant to this discussion > is that the "merge" method is substantially faster than "shell" for > these character vectors with a million elements, while retaining the > stability and collation properties of "shell" (whereas "radix" only > does C collation). > > It would probably not be too hard to take the merge sort code from pqR > and use it in R core's implementation. > > The merge sort in pqR doesn't exploit parallelism at the moment, but > merge sort is potentially quite parallelizable (though I think the > storage allocation strategy I use would have to be modified). > > Regards, > > Radford Neal > > ______________________________________________ > R-devel@r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-devel > [[alternative HTML version deleted]] ______________________________________________ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel