[1003.1(2013)/Issue7+TC1 0001070]: Collation issues in XCU (changes for Issue 8)

Austin Group Bug Tracker Thu, 25 Aug 2016 04:13:15 -0700

The following issue has been SUBMITTED. 
====================================================================== 
http://austingroupbugs.net/view.php?id=1070 
====================================================================== 
Reported By:                geoffclare
Assigned To:                
====================================================================== 
Project:                    1003.1(2013)/Issue7+TC1
Issue ID:                   1070
Category:                   Shell and Utilities
Type:                       Error
Severity:                   Objection
Priority:                   normal
Status:                     New
Name:                       Geoff Clare 
Organization:               The Open Group 
User Reference:              
Section:                    2.13.3, awk, comm, localedef, ls, sort, uniq 
Page Number:                2356, 2459, 2559, 2874, 2888, 3210, 3309, and more 
Line Number:                75082, 78745, 82755, 94650, 95164, 107544, 111067,
and more 
Interp Status:              --- 
Final Accepted Text:         
====================================================================== 
Date Submitted:             2016-08-25 11:11 UTC
Last Modified:              2016-08-25 11:11 UTC
====================================================================== 
Summary:                    Collation issues in XCU (changes for Issue 8)
Description: 
A discussion on the mailing list identified some issues related to
collation for locales that do not define a collation sequence with
a total ordering of all characters.  It is proposed that these issues
are addressed in Issue 8 by requiring implementation-provided locales
that do not have an '@' modifier in their name to define a collation
sequence that has a total ordering of all characters (thus reducing
the problem to "special" locales and user-defined locales), and by
modifying the requirements for regular expressions and affected
utilities so that they cope better with such locales.  As an
intermediate step, it is proposed that the new requirements slated
for Issue 8 are recommended (or at least allowed) in TC2.


The necessary changes will be split across four Mantis bugs, targeting
XBD TC2, XCU TC2, XBD Issue 8, and XCU Issue 8.  This bug contains the
changes proposed for XCU in Issue 8.

Desired Action: 
After applying the changes from bug 963
(http://austingroupbugs.net/view.php?id=963) at each of the following
locations, make further changes to the new text as noted below.
(There is also a change to <i>localedef</i> inserted among the changes
derived from bug 963.)

On Page: 2356 Line: 75082 Section: 2.13.3 Patterns Used for Filename
Expansion

In the updated list item 3, change from:

any filenames or pathnames that collate equally should be further compared
byte-by-byte using the collating sequence for the POSIX locale.

to:

any filenames or pathnames that collate equally shall be further compared
byte-by-byte using the collating sequence for the POSIX locale.

and delete the small-font note:

<small>Note: a future version of this standard may require the byte-by-byte
further comparison described above.</small>

On Page: 2459 Line: 78745 Section: awk

In the updated text, change from:

For the "!=" and "==" operators, the strings should be compared to check if
they are identical but may be compared using the locale-specific collation
sequence to check if they collate equally.

to:

For the "!=" and "==" operators, the strings shall be compared to check if
they are identical (not to check if they collate equally).
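
As an illustration only (not part of the proposed wording), the following sketch contrasts the identity check required of "==" with the collation-based idiom <tt>a <= b && a >= b</tt>. The helper name <tt>compare_strings</tt> is hypothetical. In the POSIX locale the two checks always agree; they can differ only in a locale whose collating sequence lacks a total ordering. Note that <i>awk</i> processes backslash escapes in <tt>-v</tt> assignments, so the sketch assumes arguments without backslashes.

```shell
# Hypothetical helper: report whether two strings are identical and
# whether they collate equally in the current locale.
compare_strings() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        # Concatenating "" forces string (not numeric) comparison.
        s = a ""; t = b ""
        identical = (s == t)
        # Idiom for "collates equally": neither string sorts before the other.
        collate_equal = ((s <= t) && (s >= t))
        print "identical=" identical, "collate_equal=" collate_equal
    }'
}
```

For example, <tt>compare_strings "$x" "$y"</tt> writes a single line of the form <tt>identical=N collate_equal=N</tt>, where each N is 0 or 1.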

On Page: 2478 Line: 79587 Section: awk

Change the two new APPLICATION USAGE paragraphs from:

On implementations where the "==" operator checks if strings collate
equally, applications needing to check whether strings are identical can
use:<blockquote><pre>length(a) == length(b) && index(a,b) == 1</pre></blockquote>
On implementations where the "==" operator checks if strings are identical,
applications needing to check whether strings collate equally can
use:<blockquote><pre>a <= b && a >= b</pre></blockquote>

to:

Since the "==" operator checks whether strings are identical, not whether
they collate equally, applications needing to check whether strings
collate equally can use:<blockquote><pre>a <= b && a >= b</pre></blockquote>

On Page: 2486 Line: 79914 Section: awk

Change the updated FUTURE DIRECTIONS section from:

A future version of this standard may require the "!=" and "==" operators
to perform string comparisons by checking if the strings are identical (and
not by checking if they collate equally).

to:

None.

On Page: 2559 Line: 82755 Section: comm

Change the new DESCRIPTION paragraph from:

If the collating sequence of the current locale does not have a total
ordering of all characters (see [xref to XBD 7.3.2]) and any lines from the
input files collate equally but are not identical, <i>comm</i> should treat
them as different lines but may treat them as being the same.  If it treats
them as different, <i>comm</i> should expect them to be ordered according
to a further byte-by-byte comparison using the collating sequence for the
POSIX locale and if they are not ordered in this way, the output of
<i>comm</i> can identify such lines as being both unique to <i>file1</i>
and unique to <i>file2</i> instead of being in both files.

to:

If the collating sequence of the current locale does not have a total
ordering of all characters (see [xref to XBD 7.3.2]) and any lines from the
input files collate equally but are not identical, <i>comm</i> shall treat
them as different lines and shall expect them to be ordered according to a
further byte-by-byte comparison using the collating sequence for the POSIX
locale; if they are not ordered in this way, the output of <i>comm</i> can
identify such lines as being both unique to <i>file1</i> and unique to
<i>file2</i> instead of being in both files.

On Page: 2560 Line: 82810 Section: comm

In the updated text, change from:

If the input files contained any lines that collated equally but were not
identical and within each file those lines were ordered according to a
further byte-by-byte comparison using the collating sequence for the POSIX
locale, and <i>comm</i> treated them as different lines, then lines written
that collate equally but are not identical should be ordered according to a
further byte-by-byte comparison using the collating sequence for the POSIX
locale.

to:

If the input files contained any lines that collated equally but were not
identical and within each file those lines were ordered according to a
further byte-by-byte comparison using the collating sequence for the POSIX
locale, then lines written that collate equally but are not identical shall
be ordered according to a further byte-by-byte comparison using the
collating sequence for the POSIX locale.

On Page: 2561 Line: 82825 Section: comm

Change the new APPLICATION USAGE paragraphs from:

If the collating sequence of the current locale does not have a total
ordering of all characters, this can affect the behavior of <i>comm</i> in
the following ways:<blockquote>* If <i>comm</i> treats lines as being the
same only if they are identical, some lines can be misleadingly identified
as being both unique to <i>file1</i> and unique to <i>file2</i>.

* If <i>comm</i> treats lines as being the same if they collate equally and
a line from <i>file1</i> collates equally with a line from <i>file2</i> but
is not identical to it, one of the lines is misleadingly identified as
being in both files and the other is not written to the output at
all.</blockquote>Such problems can be avoided by forcing the use of the
POSIX locale, for example the following identifies lines in both
<i>file1</i> and <i>file2</i>:<blockquote><pre>LC_ALL=POSIX sort file1 > file1.posix
LC_ALL=POSIX sort file2 > file2.posix
LC_ALL=POSIX comm -12 file1.posix file2.posix | sort
</pre></blockquote>The final <i>sort</i> re-sorts the output of <i>comm</i>
according to the collating sequence of the original locale.  Doing this
might be difficult if more than one column is output and leading blanks
cannot be ignored.

to:

If the collating sequence of the current locale does not have a total
ordering of all characters, since <i>comm</i> treats lines as being the
same only if they are identical, some lines can be misleadingly identified
as being both unique to <i>file1</i> and unique to <i>file2</i> if lines
that collate equally but are not identical are not ordered in the way that
<i>comm</i> expects.  If the input does not come from utilities (such as
<i>ls</i> and <i>sort</i>) which provide this ordering, the problem can be
avoided by pre-sorting the input files using <i>sort</i>.
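
As an illustration only (not part of the proposed wording), pre-sorting in the current locale could be wrapped as below. The function name <tt>common_lines</tt> and the temporary file names are hypothetical, and the sketch assumes a <i>mktemp</i> utility that accepts <tt>-d</tt>, which is not guaranteed by Issue 7.

```shell
# Hypothetical helper: write the lines common to two unsorted files.
# Both files are pre-sorted with sort in the current locale, so comm
# sees its input in the order it expects.
common_lines() {
    tmpdir=$(mktemp -d) || return 1    # assumption: mktemp -d is available
    sort "$1" > "$tmpdir/file1.sorted" &&
    sort "$2" > "$tmpdir/file2.sorted" &&
    comm -12 "$tmpdir/file1.sorted" "$tmpdir/file2.sorted"
    status=$?
    rm -rf "$tmpdir"
    return "$status"
}
```

Usage would be <tt>common_lines file1 file2</tt>; unlike the earlier <tt>LC_ALL=POSIX</tt> workaround, no final re-sort is needed because the output is already in the order of the current locale.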

On Page: 2561 Line: 82842 Section: comm

Change the updated FUTURE DIRECTIONS section from:

A future version of this standard may require that if any lines from the
input files collate equally but are not identical, then <i>comm</i> treats
them as different lines and expects them to be ordered according to a
further byte-by-byte comparison using the collating sequence for the POSIX
locale.

A future version of this standard may require that if the input files
contained any lines that collated equally but were not identical and within
each file those lines were ordered according to a further byte-by-byte
comparison using the collating sequence for the POSIX locale, then lines
written that collate equally but are not identical are ordered according to
a further byte-by-byte comparison using the collating sequence for the
POSIX locale.

to:

None.

On Page: 2874 Line: 94650 Section: localedef

Add a new paragraph to the DESCRIPTION section:

If the LC_COLLATE category defines a collation sequence that does not have
a total ordering of all characters, <i>localedef</i> shall write a warning
message to standard error and, if the exit status would otherwise have been
zero, shall exit with status 1.

On Page: 2888 Line: 95164 Section: ls

In the new DESCRIPTION paragraph change from:

any filenames or pathnames that collate equally should be further compared
byte-by-byte using the collating sequence for the POSIX locale.

to:

any filenames or pathnames that collate equally shall be further compared
byte-by-byte using the collating sequence for the POSIX locale.

On Page: 2896 Line: 95520 Section: ls

In the FUTURE DIRECTIONS section, delete the new paragraph:

A future version of this standard may require that if the collating
sequence for the current locale does not have a total ordering of all
characters, any filenames or pathnames that collate equally are further
compared byte-by-byte using the collating sequence for the POSIX locale.

On Page: 3210 Line: 107544 Section: sort

In the updated text, change from:

any lines of input that collate equally should be further compared
byte-by-byte using the collating sequence for the POSIX locale.

to:

any lines of input that collate equally shall be further compared
byte-by-byte using the collating sequence for the POSIX locale.
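
As an illustration only (not part of the proposed wording), the effect of a last-resort byte-by-byte comparison can be observed in the POSIX locale by using <tt>-f</tt> as a stand-in for a collating sequence without a total ordering: with <tt>-f</tt>, lines that differ only in case compare equal as keys, and <i>sort</i> (without <b>-u</b> or an extension such as <b>-s</b>) is then expected to order them by comparing all bytes. The function name <tt>fold_sort</tt> is hypothetical.

```shell
# Hypothetical helper: sort standard input ignoring case, in the POSIX
# locale. Lines that compare equal after case folding (such as "apple"
# and "Apple") are then ordered by the last-resort byte comparison.
fold_sort() {
    LC_ALL=C sort -f
}

printf '%s\n' banana apple Banana Apple | fold_sort
```

With the tie-break in effect, each uppercase-initial line is written before its lowercase counterpart, since "A" has a lower byte value than "a".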

On Page: 3214 Line: 107719 Section: sort

In the updated APPLICATION USAGE text, change from:

If the collating sequence of the current locale does not have a total
ordering of all characters, this can affect the behavior of <i>sort</i> in
the following ways:<blockquote>* As <tt>sort -u</tt> suppresses lines with
duplicate keys, it suppresses lines that collate equally but are not
identical.

* The output of <i>sort</i> (without <b>-u</b>) can contain identical
lines that are not adjacent, if it does not implement the recommended
further byte-by-byte comparison of lines that collate equally.  This
affects the use of <i>sort</i> with <i>comm</i> and <i>uniq</i>; see
the APPLICATION USAGE for those utilities.</blockquote>

to:

If the collating sequence of the current locale does not have a total
ordering of all characters, since <tt>sort -u</tt> suppresses lines
with duplicate keys, it suppresses lines that collate equally but are
not identical.

On Page: 3215 Line: 107783 Section: sort

In the new RATIONALE paragraph change from:

Implementations are encouraged to perform the recommended further
byte-by-byte comparison of lines that collate equally, even though this may
affect efficiency.  The impact on efficiency can be mitigated by only
performing the additional comparison if the current locale's collating
sequence does not have a total ordering of all characters (if the
implementation provides a way to query this) or by only performing the
additional comparison if the locale name associated with the LC_COLLATE
category has an '@' modifier in the name (since locales without an '@'
modifier should have a total ordering of all characters - see [xref to XBD
7.3.2]).  Note that if the implementation provides a <i>stable sort</i>
option as an extension (usually -<b>s</b>), the additional comparison
should not be performed when this option has been specified.

to:

The required further byte-by-byte comparison of lines that collate equally
may have an impact on efficiency, but this can be mitigated by only
performing the additional comparison if the current locale's collating
sequence does not have a total ordering of all characters (if the
implementation provides a way to query this) or by only performing the
additional comparison if the locale name associated with the LC_COLLATE
category has an '@' modifier in the name (since implementation-supplied
locales without an '@' modifier have a total ordering of all characters -
see [xref to XBD 7.3.2] - and <i>localedef</i> users are warned to follow
the same convention).  Note that if the implementation provides a <i>stable
sort</i> option as an extension (usually -<b>s</b>), the additional
comparison should not be performed when this option has been specified.
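
As an illustration only (not part of the proposed wording), the '@' modifier check described above might look like the following at shell level. Both function names are hypothetical. The sketch assumes the usual precedence of LC_ALL, then LC_COLLATE, then LANG, with empty values treated as unset, and it assumes the POSIX locale when none of them is set (the real default is implementation-defined).

```shell
# Hypothetical helper: name of the locale in effect for LC_COLLATE.
collate_locale_name() {
    if [ -n "${LC_ALL:-}" ]; then
        printf '%s\n' "$LC_ALL"
    elif [ -n "${LC_COLLATE:-}" ]; then
        printf '%s\n' "$LC_COLLATE"
    elif [ -n "${LANG:-}" ]; then
        printf '%s\n' "$LANG"
    else
        printf '%s\n' POSIX    # assumption: default locale is POSIX
    fi
}

# Hypothetical helper: succeed if the collation locale name contains an
# '@' modifier, i.e. the additional byte-by-byte comparison may be needed.
needs_tie_break() {
    case $(collate_locale_name) in
        *@*) return 0 ;;
        *)   return 1 ;;
    esac
}
```

An implementation would perform the equivalent check internally; the shell form is only meant to make the heuristic concrete.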

On Page: 3215 Line: 107785 Section: sort

Change the updated FUTURE DIRECTIONS section from:

A future version of this standard may require that if the collating
sequence of the current locale does not have a total ordering of all
characters, any lines of input that collate equally when comparing them as
whole lines are further compared byte-by-byte using the collating sequence
for the POSIX locale.

to:

None.

On Page: 3310 Line: 111099 Section: uniq

In the updated APPLICATION USAGE section, change from:

If the collating sequence of the current locale has a total ordering of all
characters, the <i>sort</i> utility can be used to cause repeated lines to
be adjacent in the input file.  If the collating sequence does not have a
total ordering of all characters, the <i>sort</i> utility should still do
this but it might not.  To ensure that all duplicate lines are eliminated,
and have the output sorted according to the collating sequence of the current
locale, applications should use:<blockquote><pre>LC_ALL=C sort -u | sort</pre></blockquote>
instead of:<blockquote><pre>sort | uniq</pre></blockquote>
To remove duplicate lines based on whether they collate equally instead of
whether they are identical, applications should
use:<blockquote><pre>sort -u</pre></blockquote>
instead of:<blockquote><pre>sort | uniq</pre></blockquote>

to:

The <i>sort</i> utility can be used to cause repeated lines to be adjacent
in the input file.

If the collating sequence of the current locale does not have a total
ordering of all characters, the behavior of <tt>sort | uniq</tt> differs
from <tt>sort -u</tt>, as <i>uniq</i> treats lines as duplicates only if
they are identical, whereas <tt>sort -u</tt> treats lines as duplicates if
they collate equally.
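
As an illustration only (not part of the proposed wording), the difference can be reproduced in the POSIX locale by using <tt>-f</tt> as a stand-in for a collating sequence without a total ordering, since it makes lines that differ only in case compare equal. The function names below are hypothetical.

```shell
# Hypothetical helpers: count the lines that survive each approach.
# uniq drops a line only if it is identical to the previous one;
# sort -u drops a line if it compares equal to another (here: after
# case folding).
count_sort_uniq() {
    LC_ALL=C sort -f | uniq | wc -l
}

count_sort_u() {
    LC_ALL=C sort -f -u | wc -l
}

printf '%s\n' apple Apple apple | count_sort_uniq
printf '%s\n' apple Apple apple | count_sort_u
```

For this input the first pipeline keeps both "Apple" and "apple", while the second keeps only one of them (which one is unspecified).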

====================================================================== 

Issue History 
Date Modified    Username       Field                    Change               
====================================================================== 
2016-08-25 11:11 geoffclare     New Issue                                    
2016-08-25 11:11 geoffclare     Name                      => Geoff Clare     
2016-08-25 11:11 geoffclare     Organization              => The Open Group  
2016-08-25 11:11 geoffclare     Section                   => 2.13.3, awk, comm,
localedef, ls, sort, uniq
2016-08-25 11:11 geoffclare     Page Number               => 2356, 2459, 2559,
2874, 2888, 3210, 3309, and more
2016-08-25 11:11 geoffclare     Line Number               => 75082, 78745,
82755, 94650, 95164, 107544, 111067, and more
2016-08-25 11:11 geoffclare     Interp Status             => ---             
======================================================================
