My apologies to Professor Neal. Thank you for correcting me. Best regards Morgan
On Mon, 22 Mar 2021, 05:05 , <frede...@ofb.net> wrote: > I think it is "Professor Neal" :) > > I also appreciate the pqR comparisons. > > On Wed, Mar 17, 2021 at 09:23:15AM +0000, Morgan Morgan wrote: > >Thank you Neal. This is interesting. I will have a look at pqR. > >Indeed radix only does C collation, I believe that is why it is not the > >default choice for character ordering and sorting. > >Not sure but I believe it can help address the following bugzilla item: > >https://bugs.r-project.org/bugzilla/show_bug.cgi?id=17400 > > > >On the same topic of collation, there is an experimental sorting function > >"psort" in package kit that might help address this issue. > > > >> library(kit) > >Attaching kit 0.0.7 (OPENMP enabled using 1 thread) > >> x <- c("b","A","B","a","\xe4") > >> Encoding(x) <- "latin1" > >> identical(psort(x, c.locale=FALSE), sort(x)) > >[1] TRUE > >> identical(psort(x, c.locale=TRUE), sort(x, method="radix")) > >[1] TRUE > > > >Coming back to the topic of fsort, I have just finished the implementation > >for double, integer, factor and logical. > >The implementation takes into account NA, Inf.. values. Values can be > >sorted in a decreasing order or increasing order. > >Comparing benchmark with the current implementation in data.table, it is > >currently over 30% faster. > >There might bugs but I am sure performance can be further improved as I > did > >not really try hard. > >If there is interest in both the implementation and cross community > >sharing, please let know.... > > > >Best regards, > >Morgan > > > >On Wed, 17 Mar 2021, 00:37 Radford Neal, <radf...@cs.toronto.edu> wrote: > > > >> Those interested in faster sorting may want to look at the merge sort > >> implemented in pqR (see pqR-project.org). It's often used as the > >> default, because it is stable, and does different collations, while > >> being faster than shell sort (except for small vectors). > >> > >> Here are examples, with timings, for pqR-2020-07-23 and R-4.0.2, > >> compiled identically: > >> > >> ----------------------------- > >> pqR-2020-07-23 in C locale: > >> > >> > set.seed(1) > >> > N <- 1000000 > >> > x <- as.character (sample(N,N,replace=TRUE)) > >> > print(system.time (os <- order(x,method="shell"))) > >> user system elapsed > >> 1.332 0.000 1.334 > >> > print(system.time (or <- order(x,method="radix"))) > >> user system elapsed > >> 0.092 0.004 0.096 > >> > print(system.time (om <- order(x,method="merge"))) > >> user system elapsed > >> 0.363 0.000 0.363 > >> > print(identical(os,or)) > >> [1] TRUE > >> > print(identical(os,om)) > >> [1] TRUE > >> > > >> > x <- c("a","~") > >> > print(order(x,method="shell")) > >> [1] 1 2 > >> > print(order(x,method="radix")) > >> [1] 1 2 > >> > print(order(x,method="merge")) > >> [1] 1 2 > >> > >> ----------------------------- > >> R-4.0.2 in C locale: > >> > >> > set.seed(1) > >> > N <- 1000000 > >> > x <- as.character (sample(N,N,replace=TRUE)) > >> > print(system.time (os <- order(x,method="shell"))) > >> user system elapsed > >> 2.381 0.004 2.387 > >> > print(system.time (or <- order(x,method="radix"))) > >> user system elapsed > >> 0.138 0.000 0.137 > >> > #print(system.time (om <- order(x,method="merge"))) > >> > print(identical(os,or)) > >> [1] TRUE > >> > #print(identical(os,om)) > >> > > >> > x <- c("a","~") > >> > print(order(x,method="shell")) > >> [1] 1 2 > >> > print(order(x,method="radix")) > >> [1] 1 2 > >> > #print(order(x,method="merge")) > >> > >> ------------------------------------ > >> pqR-2020-07-23 in fr_CA.utf8 locale: > >> > >> > set.seed(1) > >> > N <- 1000000 > >> > x <- as.character (sample(N,N,replace=TRUE)) > >> > print(system.time (os <- order(x,method="shell"))) > >> utilisateur système écoulé > >> 2.960 0.000 2.962 > >> > print(system.time (or <- order(x,method="radix"))) > >> utilisateur système écoulé > >> 0.083 0.008 0.092 > >> > print(system.time (om <- order(x,method="merge"))) > >> utilisateur système écoulé > >> 1.143 0.000 1.142 > >> > print(identical(os,or)) > >> [1] TRUE > >> > print(identical(os,om)) > >> [1] TRUE > >> > > >> > x <- c("a","~") > >> > print(order(x,method="shell")) > >> [1] 2 1 > >> > print(order(x,method="radix")) > >> [1] 1 2 > >> > print(order(x,method="merge")) > >> [1] 2 1 > >> > >> ------------------------------------ > >> R-4.0.2 in fr_CA.utf8 locale: > >> > >> > set.seed(1) > >> > N <- 1000000 > >> > x <- as.character (sample(N,N,replace=TRUE)) > >> > print(system.time (os <- order(x,method="shell"))) > >> utilisateur système écoulé > >> 4.222 0.016 4.239 > >> > print(system.time (or <- order(x,method="radix"))) > >> utilisateur système écoulé > >> 0.136 0.000 0.137 > >> > #print(system.time (om <- order(x,method="merge"))) > >> > print(identical(os,or)) > >> [1] TRUE > >> > #print(identical(os,om)) > >> > > >> > x <- c("a","~") > >> > print(order(x,method="shell")) > >> [1] 2 1 > >> > print(order(x,method="radix")) > >> [1] 1 2 > >> > #print(order(x,method="merge")) > >> > >> > >> pqR is faster in all the tests, but more relevant to this discussion > >> is that the "merge" method is substantially faster than "shell" for > >> these character vectors with a million elements, while retaining the > >> stability and collation properties of "shell" (whereas "radix" only > >> does C collation). > >> > >> It would probably not be too hard to take the merge sort code from pqR > >> and use it in R core's implementation. > >> > >> The merge sort in pqR doesn't exploit parallelism at the moment, but > >> merge sort is potentially quite parallelizable (though I think the > >> storage allocation strategy I use would have to be modified). > >> > >> Regards, > >> > >> Radford Neal > >> > >> ______________________________________________ > >> R-devel@r-project.org mailing list > >> https://stat.ethz.ch/mailman/listinfo/r-devel > >> > > > > [[alternative HTML version deleted]] > > > >______________________________________________ > >R-devel@r-project.org mailing list > >https://stat.ethz.ch/mailman/listinfo/r-devel > > > [[alternative HTML version deleted]] ______________________________________________ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel