Your message dated Mon, 23 Sep 2013 13:12:07 -0600
with message-id <[email protected]>
and subject line Re: Bug#724326: [coreutils] sort program does not sort . (dot 
0x2E) and - (hyphen 0x2D)
has caused the Debian Bug report #724326,
regarding [coreutils] sort program does not sort . (dot 0x2E) and - (hyphen 
0x2D)
to be marked as done.

This means that you claim that the problem has been dealt with.
If this is not the case it is now your responsibility to reopen the
Bug report if necessary, and/or fix the problem forthwith.

(NB: If you are a system administrator and have no idea what this
message is talking about, this may indicate a serious mail system
misconfiguration somewhere. Please contact [email protected]
immediately.)


-- 
724326: http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=724326
Debian Bug Tracking System
Contact [email protected] with problems
--- Begin Message ---
Package: coreutils
Version: 8.13-3.5
Severity: important

--- Please enter the report below this line. ---

The 'sort' program does not sort . (dot 0x2E) and - (hyphen 0x2D)

For example: a file containing . and - interleaved.
-------------------------------
$ cat sort_bug.txt
. 1
- 2
. 3
- 4
$ od -cbd sort_bug.txt
0000000   .       1  \n   -       2  \n   .       3  \n   -       4  \n
        056 040 061 012 055 040 062 012 056 040 063 012 055 040 064 012
           8238    2609    8237    2610    8238    2611    8237    2612
0000020
-------------------------------

The above file's content can not be sorted by 'sort' program
-------------------------------
$ sort sort_bug.txt
. 1
- 2
. 3
- 4
-------------------------------

Since the ASCII value of '-' is less than '.', the lines beginning with '-'
should appear before the lines beginning with '.'.

A C program for sorting strings sorts the content of the above file
correctly.
-------------------------------
$ cat sort_bug.txt | ./sort_strs 
- 2
- 4
. 1
. 3
-------------------------------


PS: A similar issue is observed on Ubuntu 12.04 (LTS x86_64) and RHEL 6.2
(x86_64) too.


System Information
-------------------------------
$ uname -a
Linux reenu-pc 3.2.5Mitesh #2 SMP Mon Aug 5 03:12:23 IST 2013 x86_64 GNU/Linux
-------------------------------


--- System information. ---
Architecture: amd64
Kernel:       Linux 3.2.5Mitesh

Debian Release: 7.0

--- Package information. ---
Package's Depends field is empty.

Package's Recommends field is empty.

Package's Suggests field is empty.


-- 
Thanks and Regards,
Mitesh Singh Jat

--- End Message ---
--- Begin Message ---
Mitesh Singh Jat wrote:
> The 'sort' program does not sort . (dot 0x2E) and - (hyphen 0x2D)

This is one of the FAQs.

  http://www.gnu.org/software/coreutils/faq/#Sort-does-not-sort-in-normal-order_0021

> For example: a file containing . and - interleaved.

Thank you for the excellent test case!  It was very nicely done.

> The above file's content can not be sorted by 'sort' program
> -------------------------------
> $ sort sort_bug.txt
> . 1
> - 2
> . 3
> - 4

The sort program is affected by your locale setting.  What is your
current locale setting?  It appears to be one of the human language
locales such as en_US.UTF-8 or similar.  In those locales case is
folded and punctuation is ignored.

> Since the ASCII value of '-' is less than '.', the lines beginning
> with '-' should appear before the lines beginning with '.'.

You don't like it and I don't like it but the powers that be who set
up the libc locale tables confused data with language.  You want to
work with data but they defaulted everything to language.  They folded
case and ignored punctuation.  Therefore the dot and dash in your
example are irrelevant since they are ignored in dictionary sorting.

The coreutils sort documentation includes this text:

     (1) If you use a non-POSIX locale (e.g., by setting `LC_ALL' to
  `en_US'), then `sort' may produce output that is sorted differently
  than you're accustomed to.  In that case, set the `LC_ALL' environment
  variable to `C'.  Note that setting only `LC_COLLATE' has two problems.
  First, it is ineffective if `LC_ALL' is also set.  Second, it has
  undefined behavior if `LC_CTYPE' (or `LANG', if `LC_CTYPE' is unset) is
  set to an incompatible value.  For example, you get undefined behavior
  if `LC_CTYPE' is `ja_JP.PCK' but `LC_COLLATE' is `en_US.UTF-8'.

The sort man page says:

       ***  WARNING  ***  The locale specified by the environment affects sort
       order.  Set LC_ALL=C to get the traditional sort order that uses native
       byte values.

If you want sort to take punctuation and case into account then you
must set LC_ALL=C.  That selects the standard sort ordering, standard
meaning US-ASCII byte values, and sort will operate as you expect.
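
A minimal sketch of applying that to a single command, assuming a
POSIX shell.  The temporary directory and the variable names are only
illustrative and are not part of the original report.  It recreates
the four-line test case and sorts it by byte value:

```shell
# Recreate the reporter's four-line test file in a temporary directory.
tmpdir=$(mktemp -d)
printf '%s\n' '. 1' '- 2' '. 3' '- 4' > "$tmpdir/sort_bug.txt"

# Set LC_ALL=C for this one command only.  Byte values then decide the
# order, so '-' (0x2D) sorts before '.' (0x2E).
sorted=$(LC_ALL=C sort "$tmpdir/sort_bug.txt")
printf '%s\n' "$sorted"

# Clean up the temporary directory.
rm -r "$tmpdir"
```

The two '-' lines should come out ahead of the two '.' lines.  Setting
the variable on the command line like this leaves the rest of the
session in its normal locale.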

I think setting LC_ALL=C everywhere is too broad a control.  It
would disable Unicode for example.  Personally in my $HOME/.profile I
set the following combination of variables.  This sets the collation
sequence to C but leaves everything else en_US.UTF-8 so that Unicode
character sets and accented characters and so forth still work as
intended.

  export LANG=en_US.UTF-8
  export LC_COLLATE=C

That works well for en_US.UTF-8 but I have no idea how setting
LC_COLLATE would interact with some settings such as Chinese Big5 for
example.  But for many locales it is a good compromise.
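
A short sketch of how the two variables interact, again assuming a
POSIX shell.  The variable names are illustrative, and en_US.UTF-8 may
not be installed on every system.  It shows that LC_COLLATE=C takes
effect only when LC_ALL is unset, and that LC_ALL wins when both are
set:

```shell
input=$(printf '%s\n' '. 1' '- 2' '. 3' '- 4')

# LC_ALL unset: LC_COLLATE=C decides the collation order while LANG
# covers the remaining categories.
collate_only=$(
  unset LC_ALL
  printf '%s\n' "$input" | LANG=en_US.UTF-8 LC_COLLATE=C sort
)

# LC_ALL set: it takes precedence, so the LC_COLLATE value given here
# is not consulted.
with_lc_all=$(printf '%s\n' "$input" | LC_ALL=C LC_COLLATE=en_US.UTF-8 sort)

printf '%s\n' "$collate_only"
printf '%s\n' "$with_lc_all"
```

Both runs should print the '-' lines before the '.' lines.  If some
startup file exports LC_ALL, the LC_COLLATE setting in .profile is
silently overridden, which is the first problem the coreutils
documentation warns about.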

This discussion comes up periodically.  The FAQ entry dates back to
2001.  This used to be discussed more often when the change to locale
based sorting was new.  But now it has been this way for more than a
decade.

Since this is intentional behavior I am marking the bug as closed.
But please feel free to follow-up and continue the discussion.  We
will all receive the messages and the information would be useful in
the archive for others reading it later.

Bob

--- End Message ---
