Hello! data.table diverged over 7 years ago from this code when Matt
rewrote forder.
> This code was originally contributed by data.table. I believe Michael
> Lawrence handled the integration at the time. There were a number of
> issues like this early on that were resolved on the R side and I
> believe contributed back to data.table. If you have the energy it
> might be good to compare the two now and see if there are things that
> should be ported from one to the other.
The original code had the following note which is also mentioned
multiple times but not explicitly part of radixsort.c
>> Note on challenges implementing `nalast=NA` so that things are under the
>> same perspective when looked at later. To go to the point, there are four
>> challenges in implementing `nalast=NA` with the current form of forder.c
>> (which is excellent!):
>>
>> 1) Handling the i|d|csorted routines for nalast=NA:
>> The current implementation of nalast=NA in *sorted routine is that if
>> all values are NAs, then it returns -2, a different value, so that all
>> values of o/osub can be replacedw with 0. If any values are NA, then we
>> assume that this vector is unsorted and return 0 to be taken care of by the
>> corresponding sort routine. Now, this may very well be questioned/improved.
>> For ex: we could detect if the vector is sorted and also if the first value
>> is NA, and if so, get the group size of these NAs and replace that many
>> indices with 0. This is not yet done, but could very well be implemented
>> later. If there are no NAs in the vector, then things are the same.
>>
>> 2) Handling cases where n==1 and n==2:
>> The current sorting routines don't take care of n<=2, and rightly so,
>> as they can be dealt with directly. But it's not the case for nalast=NA
>> because, when n==2 and all are NAs, point 1 will handle it. But when n==1
>> and it is NA and the column is not the first column to order upon, with the
>> current implementation, "thisgrpn" is checked for "== 1" and if so, just
>> pushes that group size and continues.. But we've to replace it with 0 in our
>> case. So, nalast==0 is checked for inside that if statement and if so, if
>> the value is NA is checked for all types and if so, replaced with 0. For
>> n==2 and "any" is NA, we've to introduce a special case for this in each
>> sort routine to take care of and not result in an error (which tells this
>> should be already taken care of in *sorted).
>>
>> 3) o[0] and newo[0] = -1:
>> Since nalast already populates NAs with 0 values in o, we can't use the
>> old value of 0 to check if it's already populated or not. Therefore this has
>> been changed to -1.
>>
>> 4) default cases are untouched as much as possible:
>> In order to not compromise in speed for default case (`setkey`),
>> iinsert and dinsert are not touched at all, which means, we've to make sure
>> that they are not run when n<200 and nalast==0, as they don't replace
>> o[x=NA] with 0. And, 'icount', 'iradix', 'dradix' are modified to take care
>> of nalast=NA, to my knowledge.
However, case 2) does not cater for Ivans identified corner case
example of order(NA_character_, "c", method = "radix", na.last = NA),
which is not tied to n==1 or n==2 but rather to thisgrpn==1 and
x=c(NA_character_, another_string).
Hence, the missing case is indeed to cater for o[i] = 0 with Ivans
1-liner or more explicitly via
--- src/main/radixsort.c
@@ -1766,7 +1766,12 @@
// this edge case had to be taken care of
// here.. (see the bottom of this file for
// more explanation)
- switch (TYPEOF(x)) {
+ if (o[i] == 0) {
+ isSorted = false;
+ i++;
+ push(1);
+ continue;
+ } switch (TYPEOF(x)) {
case INTSXP:
if (INTEGER(x)[o[i] - 1] == NA_INTEGER) {
isSorted = false;
Best, Ben
[[alternative HTML version deleted]]
______________________________________________
[email protected] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel