tag 17188 notabug
thanks

On 04/04/2014 08:07 PM, Nikos Balkanas wrote:
> Hi,
>
> Sort is seriously bugged. This is the output from:
>
> sort -d -t \t -k1 input > out

-d says to do a dictionary sort, which considers only blanks and
alphanumeric characters.  But it is still up to your current locale
whether the characters that remain are collated case-insensitively.
Also, '-k1' is almost always wrong - you generally want '-k1,1' if you
want to sort by JUST the first field, rather than by the whole line.
See the FAQ:

https://www.gnu.org/software/coreutils/faq/coreutils-faq.html#Sort-does-not-sort-in-normal-order_0021

>
> 0009rN2S3cohd2DGH6yuTWBoeuq6DwWZhCBDEnFzYqpw984FfALy7NUhEZH1.YEbiq/
> 000EMQeKUjtyXIOaUkT.XE6SaBIdOqTA0nffF394V6tkcVdup2c3ihi7yhbuRof2Y5agTG
> 000p8kXIz5Tc1GaxYYXjAfgm7YJOZvyBJxVXMi0lhaJXT22IdDbE6vVhWXW9FkRBxQ
> 00/0QwzaXrqGHXW7mE9Le8IIVgHoZvccgGydKdzJgh8.SZenbULmIWMtrGShz24W7T
> 000R2cnZ8.khe1eXDERclkbXASRQeKvcNBaCJRLX617Xvmff0KaoZSSFBNhNG1OiIyr
>
> Shouldn't 00/0 be first according to Ascii code?

Only if you are asking for a full ASCII sort.  Using --debug can
sometimes help show where you are asking sort to do something different
from what you expected, even though sort is behaving correctly given
what you asked it to do.  In the runs below I'm also adding -s, so that
the debug output has fewer lines.
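
To illustrate the '-k1' versus '-k1,1' point above, here is a minimal sketch. The two-line input is made up for the example and is not taken from the report. It builds a real tab with printf, because in a POSIX shell an unquoted \t reaches sort as the plain letter t, which is probably not the separator that was intended.

```shell
# A literal tab character to use as the field separator.
tab=$(printf '\t')

# Hypothetical input: both lines share the first field "a".
rows=$(printf 'a\t2\na\t1\n')

# -k1 means "from field 1 to the end of the line", so the second
# field takes part in the comparison and the lines get reordered.
whole_line=$(printf '%s\n' "$rows" | LC_ALL=C sort -s -t "$tab" -k1)

# -k1,1 restricts the key to field 1 only. The keys are equal, and
# -s (stable) keeps equal keys in their input order.
first_only=$(printf '%s\n' "$rows" | LC_ALL=C sort -s -t "$tab" -k1,1)

printf 'with -k1:\n%s\n\nwith -k1,1:\n%s\n' "$whole_line" "$first_only"
```

With -k1 the line ending in 1 moves ahead of the line ending in 2; with -k1,1 the input order is preserved, which shows that only the first field was compared.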

I'm guessing your default locale is en_US.UTF-8 - because I get the same
results as you in that mode:

$ sort --debug -s -d -t \t -k1 input
sort: using ‘en_US.UTF-8’ sorting rules
0009rN2S3cohd2DGH6yuTWBoeuq6DwWZhCBDEnFzYqpw984FfALy7NUhEZH1.YEbiq/
___________________________________________________________________
000EMQeKUjtyXIOaUkT.XE6SaBIdOqTA0nffF394V6tkcVdup2c3ihi7yhbuRof2Y5agTG
______________________________________________________________________
000p8kXIz5Tc1GaxYYXjAfgm7YJOZvyBJxVXMi0lhaJXT22IdDbE6vVhWXW9FkRBxQ
__________________________________________________________________
00/0QwzaXrqGHXW7mE9Le8IIVgHoZvccgGydKdzJgh8.SZenbULmIWMtrGShz24W7T
__________________________________________________________________
000R2cnZ8.khe1eXDERclkbXASRQeKvcNBaCJRLX617Xvmff0KaoZSSFBNhNG1OiIyr
___________________________________________________________________

In this mode, '000p' collates case-insensitively before '000Q', so the
sort is correct (the collation was on '000Q' and not '00/0Q' because you
used -d).  Furthermore, if you omit -d:

$ sort --debug -s -t \t -k1 input
sort: using ‘en_US.UTF-8’ sorting rules
0009rN2S3cohd2DGH6yuTWBoeuq6DwWZhCBDEnFzYqpw984FfALy7NUhEZH1.YEbiq/
___________________________________________________________________
000EMQeKUjtyXIOaUkT.XE6SaBIdOqTA0nffF394V6tkcVdup2c3ihi7yhbuRof2Y5agTG
______________________________________________________________________
000p8kXIz5Tc1GaxYYXjAfgm7YJOZvyBJxVXMi0lhaJXT22IdDbE6vVhWXW9FkRBxQ
__________________________________________________________________
00/0QwzaXrqGHXW7mE9Le8IIVgHoZvccgGydKdzJgh8.SZenbULmIWMtrGShz24W7T
__________________________________________________________________
000R2cnZ8.khe1eXDERclkbXASRQeKvcNBaCJRLX617Xvmff0KaoZSSFBNhNG1OiIyr
___________________________________________________________________

No change, because the en_US.UTF-8 locale implicitly does a dictionary
collation even without you requesting -d.

Now, compare to the C locale, which forces sorting by byte value for
more traditional ASCII sorting:

$ LC_ALL=C sort --debug -s -d -t \t -k1 input
sort: using simple byte comparison
0009rN2S3cohd2DGH6yuTWBoeuq6DwWZhCBDEnFzYqpw984FfALy7NUhEZH1.YEbiq/
___________________________________________________________________
000EMQeKUjtyXIOaUkT.XE6SaBIdOqTA0nffF394V6tkcVdup2c3ihi7yhbuRof2Y5agTG
______________________________________________________________________
00/0QwzaXrqGHXW7mE9Le8IIVgHoZvccgGydKdzJgh8.SZenbULmIWMtrGShz24W7T
__________________________________________________________________
000R2cnZ8.khe1eXDERclkbXASRQeKvcNBaCJRLX617Xvmff0KaoZSSFBNhNG1OiIyr
___________________________________________________________________
000p8kXIz5Tc1GaxYYXjAfgm7YJOZvyBJxVXMi0lhaJXT22IdDbE6vVhWXW9FkRBxQ
__________________________________________________________________

'000Q' now sorts before '000R' which sorts before '000p' as expected.
And toss out the -d, and you get:

$ LC_ALL=C sort --debug -s -t \t -k1 input
sort: using simple byte comparison
00/0QwzaXrqGHXW7mE9Le8IIVgHoZvccgGydKdzJgh8.SZenbULmIWMtrGShz24W7T
__________________________________________________________________
0009rN2S3cohd2DGH6yuTWBoeuq6DwWZhCBDEnFzYqpw984FfALy7NUhEZH1.YEbiq/
___________________________________________________________________
000EMQeKUjtyXIOaUkT.XE6SaBIdOqTA0nffF394V6tkcVdup2c3ihi7yhbuRof2Y5agTG
______________________________________________________________________
000R2cnZ8.khe1eXDERclkbXASRQeKvcNBaCJRLX617Xvmff0KaoZSSFBNhNG1OiIyr
___________________________________________________________________
000p8kXIz5Tc1GaxYYXjAfgm7YJOZvyBJxVXMi0lhaJXT22IdDbE6vVhWXW9FkRBxQ
__________________________________________________________________

Now '00/' sorts before '000'.
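
The two C-locale orderings above can be reproduced without the original input file. This is a small sketch that uses only the first five characters of each line from the report as stand-in data (an assumption made to keep the example short; the full lines order the same way because they already differ within those prefixes).

```shell
# Stand-in data: the leading characters of the five reported lines.
sample='0009r
000EM
000p8
00/0Q
000R2'

# Plain byte comparison: '/' (0x2F) is lower than '0' (0x30),
# so 00/0Q sorts ahead of everything that starts with 000.
bytes=$(printf '%s\n' "$sample" | LC_ALL=C sort)

# Byte comparison with -d: '/' is ignored, so that line is compared
# as 000Q and lands between 000E and 000R; lowercase 'p' comes last.
dict=$(printf '%s\n' "$sample" | LC_ALL=C sort -d)

printf 'byte order:\n%s\n\ndictionary order:\n%s\n' "$bytes" "$dict"
```

Running the same two commands without LC_ALL=C shows what the current locale does with the same data; that result depends on which locales are installed, so it is left out of the sketch.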

It might be a nice improvement to the --debug output to avoid putting _
under any character that sort ignored due to -d before calling strcoll()
(which would help the output of the LC_ALL=C case, but not the
en_US.UTF-8 case) - but that's probably difficult to implement.

>
> Plz fix.

There's nothing to fix but your usage pattern.  So I'm closing this as
not a bug.  But feel free to reply further if you still have questions.

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org