Re: [R] What "method" does sort() use?
I did look at the Comparison help but totally overlooked this part; Comparison of strings in character vectors is lexicographic within the strings using the collating sequence of the locale in use: see ‘locales’. The collating sequence of locales such as ‘en_US’ is normally different from ‘C’ (which should use ASCII) and can be surprising. I've recently changed to a different Linux distribution and was trying to work out why I was getting a different order of the factor levels even though it was the same code. I thought I'd inadvertantly changed something somehow. Much clearer now -- even if still confusing. Thanks for the pointer. On Fri, 18-Mar-2016 at 10:19AM +0100, peter dalgaard wrote: |> Ooops, that was answering the question you actually asked. The one you meant to ask is answered by this part: |> |> The sort order for character vectors will depend on the collating sequence of the locale in use: see Comparison. |> |> ...and collating sequences is a weird and woolly subject, where you cannot even be sure that locales of the same name on two different platforms sort strings in the same order. |> |> -pd |> |> |> |> On 18 Mar 2016, at 10:13 , peter dalgaardwrote: |> |> > |> > On 18 Mar 2016, at 10:02 , Patrick Connolly wrote: |> > |> >> I don't follow why this happens: |> >> |> >>> sort(c(LETTERS[1:5], letters[1:5])) |> >> [1] "a" "A" "b" "B" "c" "C" "d" "D" "e" "E" |> >> |> >> The help for sort() says: |> >> |> >> method: character string specifying the algorithm used. Not |> >> available for partial sorting. Can be abbreviated. |> >> |> >> But what are the methods available? The help mentions xtfrm but that |> >> doesn't illuminate, I'd have thought that at least by default it would |> >> have something to do with ASCII codes. But that's not the case since |> >> all the uppercase ones would be before the lowercase ones. |> >> |> >> I know something different is happening but I don't know what it is |> >> (do you, Mr Jones?). Apologies to Bob Dylan. |> >> |> > |> > |> > Um, read _all_ of the help file? |> > |> > sort.int(x, partial = NULL, na.last = NA, decreasing = FALSE, |> > method = c("shell", "quick"), index.return = FALSE) |> > |> > [snip] |> > |> > Method "shell" uses Shellsort (an O(n^{4/3}) variant from Sedgewick (1986)). If x has names a stable modification is used, so ties are not reordered. (This only matters if names are present.) |> > |> > Method "quick" uses Singleton (1969)'s implementation of Hoare's Quicksort method and is only available when x is numeric (double or integer) and partial is NULL. (For other types of x Shellsort is used, silently.) It is normally somewhat faster than Shellsort (perhaps 50% faster on vectors of length a million and twice as fast at a billion) but has poor performance in the rare worst case. (Peto's modification using a pseudo-random midpoint is used to make the worst case rarer.) This is not a stable sort, and ties may be reordered. |> > |> > Factors with less than 100,000 levels are sorted by radix sorting when method is not supplied: see sort.list. |> > |> > -pd |> > |> > |> > -- |> > Peter Dalgaard, Professor, |> > Center for Statistics, Copenhagen Business School |> > Solbjerg Plads 3, 2000 Frederiksberg, Denmark |> > Phone: (+45)38153501 |> > Office: A 4.23 |> > Email: pd@cbs.dk Priv: pda...@gmail.com |> |> -- |> Peter Dalgaard, Professor, |> Center for Statistics, Copenhagen Business School |> Solbjerg Plads 3, 2000 Frederiksberg, Denmark |> Phone: (+45)38153501 |> Office: A 4.23 |> Email: pd@cbs.dk Priv: pda...@gmail.com |> -- ~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~. ___Patrick Connolly {~._.~} Great minds discuss ideas _( Y )_ Average minds discuss events (:_~*~_:) Small minds discuss people (_)-(_) . Eleanor Roosevelt ~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~. __ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] What "method" does sort() use?
Ooops, that was answering the question you actually asked. The one you meant to ask is answered by this part: The sort order for character vectors will depend on the collating sequence of the locale in use: see Comparison. ...and collating sequences is a weird and woolly subject, where you cannot even be sure that locales of the same name on two different platforms sort strings in the same order. -pd On 18 Mar 2016, at 10:13 , peter dalgaardwrote: > > On 18 Mar 2016, at 10:02 , Patrick Connolly > wrote: > >> I don't follow why this happens: >> >>> sort(c(LETTERS[1:5], letters[1:5])) >> [1] "a" "A" "b" "B" "c" "C" "d" "D" "e" "E" >> >> The help for sort() says: >> >> method: character string specifying the algorithm used. Not >> available for partial sorting. Can be abbreviated. >> >> But what are the methods available? The help mentions xtfrm but that >> doesn't illuminate, I'd have thought that at least by default it would >> have something to do with ASCII codes. But that's not the case since >> all the uppercase ones would be before the lowercase ones. >> >> I know something different is happening but I don't know what it is >> (do you, Mr Jones?). Apologies to Bob Dylan. >> > > > Um, read _all_ of the help file? > > sort.int(x, partial = NULL, na.last = NA, decreasing = FALSE, > method = c("shell", "quick"), index.return = FALSE) > > [snip] > > Method "shell" uses Shellsort (an O(n^{4/3}) variant from Sedgewick (1986)). > If x has names a stable modification is used, so ties are not reordered. > (This only matters if names are present.) > > Method "quick" uses Singleton (1969)'s implementation of Hoare's Quicksort > method and is only available when x is numeric (double or integer) and > partial is NULL. (For other types of x Shellsort is used, silently.) It is > normally somewhat faster than Shellsort (perhaps 50% faster on vectors of > length a million and twice as fast at a billion) but has poor performance in > the rare worst case. (Peto's modification using a pseudo-random midpoint is > used to make the worst case rarer.) This is not a stable sort, and ties may > be reordered. > > Factors with less than 100,000 levels are sorted by radix sorting when method > is not supplied: see sort.list. > > -pd > > > -- > Peter Dalgaard, Professor, > Center for Statistics, Copenhagen Business School > Solbjerg Plads 3, 2000 Frederiksberg, Denmark > Phone: (+45)38153501 > Office: A 4.23 > Email: pd@cbs.dk Priv: pda...@gmail.com -- Peter Dalgaard, Professor, Center for Statistics, Copenhagen Business School Solbjerg Plads 3, 2000 Frederiksberg, Denmark Phone: (+45)38153501 Office: A 4.23 Email: pd@cbs.dk Priv: pda...@gmail.com __ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
[R] What "method" does sort() use?
I don't follow why this happens: > sort(c(LETTERS[1:5], letters[1:5])) [1] "a" "A" "b" "B" "c" "C" "d" "D" "e" "E" The help for sort() says: method: character string specifying the algorithm used. Not available for partial sorting. Can be abbreviated. But what are the methods available? The help mentions xtfrm but that doesn't illuminate, I'd have thought that at least by default it would have something to do with ASCII codes. But that's not the case since all the uppercase ones would be before the lowercase ones. I know something different is happening but I don't know what it is (do you, Mr Jones?). Apologies to Bob Dylan. -- ~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~. ___Patrick Connolly {~._.~} Great minds discuss ideas _( Y )_ Average minds discuss events (:_~*~_:) Small minds discuss people (_)-(_) . Eleanor Roosevelt ~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~. __ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] What "method" does sort() use?
On 18 Mar 2016, at 10:02 , Patrick Connollywrote: > I don't follow why this happens: > >> sort(c(LETTERS[1:5], letters[1:5])) > [1] "a" "A" "b" "B" "c" "C" "d" "D" "e" "E" > > The help for sort() says: > > method: character string specifying the algorithm used. Not > available for partial sorting. Can be abbreviated. > > But what are the methods available? The help mentions xtfrm but that > doesn't illuminate, I'd have thought that at least by default it would > have something to do with ASCII codes. But that's not the case since > all the uppercase ones would be before the lowercase ones. > > I know something different is happening but I don't know what it is > (do you, Mr Jones?). Apologies to Bob Dylan. > Um, read _all_ of the help file? sort.int(x, partial = NULL, na.last = NA, decreasing = FALSE, method = c("shell", "quick"), index.return = FALSE) [snip] Method "shell" uses Shellsort (an O(n^{4/3}) variant from Sedgewick (1986)). If x has names a stable modification is used, so ties are not reordered. (This only matters if names are present.) Method "quick" uses Singleton (1969)'s implementation of Hoare's Quicksort method and is only available when x is numeric (double or integer) and partial is NULL. (For other types of x Shellsort is used, silently.) It is normally somewhat faster than Shellsort (perhaps 50% faster on vectors of length a million and twice as fast at a billion) but has poor performance in the rare worst case. (Peto's modification using a pseudo-random midpoint is used to make the worst case rarer.) This is not a stable sort, and ties may be reordered. Factors with less than 100,000 levels are sorted by radix sorting when method is not supplied: see sort.list. -pd -- Peter Dalgaard, Professor, Center for Statistics, Copenhagen Business School Solbjerg Plads 3, 2000 Frederiksberg, Denmark Phone: (+45)38153501 Office: A 4.23 Email: pd@cbs.dk Priv: pda...@gmail.com __ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.