The following issue has been SUBMITTED.
======================================================================
http://austingroupbugs.net/view.php?id=1070
======================================================================
Reported By:                geoffclare
Assigned To:
======================================================================
Project:                    1003.1(2013)/Issue7+TC1
Issue ID:                   1070
Category:                   Shell and Utilities
Type:                       Error
Severity:                   Objection
Priority:                   normal
Status:                     New
Name:                       Geoff Clare
Organization:               The Open Group
User Reference:
Section:                    2.13.3, awk, comm, localedef, ls, sort, uniq
Page Number:                2356, 2459, 2559, 2874, 2888, 3210, 3309, and more
Line Number:                75082, 78745, 82755, 94650, 95164, 107544, 111067, and more
Interp Status:              ---
Final Accepted Text:
======================================================================
Date Submitted:             2016-08-25 11:11 UTC
Last Modified:              2016-08-25 11:11 UTC
======================================================================
Summary:                    Collation issues in XCU (changes for Issue 8)
Description:
A discussion on the mailing list identified some issues related to collation for locales that do not define a collation sequence with a total ordering of all characters.

It is proposed that these issues are addressed in Issue 8 by requiring implementation-provided locales that do not have an '@' modifier in their name to define a collation sequence that has a total ordering of all characters (thus reducing the problem to "special" locales and user-defined locales), and by modifying the requirements for regular expressions and affected utilities so that they cope better with such locales.

As an intermediate step, it is proposed that the new requirements slated for Issue 8 are recommended (or at least allowed) in TC2.
The necessary changes will be split across four Mantis bugs, targeting XBD TC2, XCU TC2, XBD Issue 8, and XCU Issue 8. This bug contains the changes proposed for XCU in Issue 8.

Desired Action:
After applying the bug http://austingroupbugs.net/view.php?id=963 changes at each of the following locations, make further changes to the new text as noted below. (There is also a change to <i>localedef</i> inserted among the changes derived from bug 963.)

On Page: 2356 Line: 75082 Section: 2.13.3 Patterns Used for Filename Expansion

In the updated list item 3, change from:

any filenames or pathnames that collate equally should be further compared byte-by-byte using the collating sequence for the POSIX locale.

to:

any filenames or pathnames that collate equally shall be further compared byte-by-byte using the collating sequence for the POSIX locale.

and delete the small-font note:

<small>Note: a future version of this standard may require the byte-by-byte further comparison described above.</small>

On Page: 2459 Line: 78745 Section: awk

In the updated text, change from:

For the "!=" and "==" operators, the strings should be compared to check if they are identical but may be compared using the locale-specific collation sequence to check if they collate equally.

to:

For the "!=" and "==" operators, the strings shall be compared to check if they are identical (not to check if they collate equally).
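
As an illustration only (not part of the proposed changes), the difference between checking whether strings are identical and checking whether they collate equally can be sketched with a small shell wrapper around awk. The function name awk_compare is hypothetical, and LC_ALL=C is forced so that the result is reproducible. In the POSIX locale the two checks always agree; they can differ only in a locale whose collating sequence does not have a total ordering of all characters.

```shell
# Hypothetical helper: report whether two strings are identical and
# whether they collate equally, as seen by awk.
awk_compare() {
    LC_ALL=C awk -v a="$1" -v b="$2" 'BEGIN {
        # Appending "" forces string (not numeric) comparison.
        a = a ""; b = b ""
        identical = (a == b)                # identity check
        collate_equal = (a <= b && a >= b)  # collation-equality idiom
        printf "identical=%d collate_equal=%d\n", identical, collate_equal
    }'
}

awk_compare abc abc    # identical=1 collate_equal=1
awk_compare abc abd    # identical=0 collate_equal=0
```

Removing the LC_ALL=C prefix lets the comparison use the collating sequence of the current locale instead.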

On Page: 2478 Line: 79587 Section: awk

Change the two new APPLICATION USAGE paragraphs from:

On implementations where the "==" operator checks if strings collate equally, applications needing to check whether strings are identical can use:<blockquote><pre>length(a) == length(b) && index(a,b) == 1</pre></blockquote>On implementations where the "==" operator checks if strings are identical, applications needing to check whether strings collate equally can use:<blockquote><pre>a <= b && a >= b</pre></blockquote>
to:

Since the "==" operator checks whether strings are identical, not whether they collate equally, applications needing to check whether strings collate equally can use:<blockquote><pre>a <= b && a >= b</pre></blockquote>

On Page: 2486 Line: 79914 Section: awk

Change the updated FUTURE DIRECTIONS section from:

A future version of this standard may require the "!=" and "==" operators to perform string comparisons by checking if the strings are identical (and not by checking if they collate equally).

to:

None.

On Page: 2559 Line: 82755 Section: comm

Change the new DESCRIPTION paragraph from:

If the collating sequence of the current locale does not have a total ordering of all characters (see [xref to XBD 7.3.2]) and any lines from the input files collate equally but are not identical, <i>comm</i> should treat them as different lines but may treat them as being the same. If it treats them as different, <i>comm</i> should expect them to be ordered according to a further byte-by-byte comparison using the collating sequence for the POSIX locale and if they are not ordered in this way, the output of <i>comm</i> can identify such lines as being both unique to <i>file1</i> and unique to <i>file2</i> instead of being in both files.

to:

If the collating sequence of the current locale does not have a total ordering of all characters (see [xref to XBD 7.3.2]) and any lines from the input files collate equally but are not identical, <i>comm</i> shall treat them as different lines and shall expect them to be ordered according to a further byte-by-byte comparison using the collating sequence for the POSIX locale; if they are not ordered in this way, the output of <i>comm</i> can identify such lines as being both unique to <i>file1</i> and unique to <i>file2</i> instead of being in both files.

On Page: 2560 Line: 82810 Section: comm

In the updated text, change from:

If the input files contained any lines that collated equally but were not identical and within each file those lines were ordered according to a further byte-by-byte comparison using the collating sequence for the POSIX locale, and <i>comm</i> treated them as different lines, then lines written that collate equally but are not identical should be ordered according to a further byte-by-byte comparison using the collating sequence for the POSIX locale.

to:

If the input files contained any lines that collated equally but were not identical and within each file those lines were ordered according to a further byte-by-byte comparison using the collating sequence for the POSIX locale, then lines written that collate equally but are not identical shall be ordered according to a further byte-by-byte comparison using the collating sequence for the POSIX locale.

On Page: 2561 Line: 82825 Section: comm

Change the new APPLICATION USAGE paragraphs from:

If the collating sequence of the current locale does not have a total ordering of all characters, this can affect the behaviour of <i>comm</i> in the following ways:<blockquote>* If <i>comm</i> treats lines as being the same only if they are identical, some lines can be misleadingly identified as being both unique to <i>file1</i> and unique to <i>file2</i>.
* If <i>comm</i> treats lines as being the same if they collate equally and a line from <i>file1</i> collates equally with a line from <i>file2</i> but is not identical to it, one of the lines is misleadingly identified as being in both files and the other is not written to the output at all.</blockquote>Such problems can be avoided by forcing the use of the POSIX locale, for example the following identifies lines in both <i>file1</i> and <i>file2</i>:<blockquote><pre>LC_ALL=POSIX sort file1 > file1.posix
LC_ALL=POSIX sort file2 > file2.posix
LC_ALL=POSIX comm -12 file1.posix file2.posix | sort
</pre></blockquote>The final <i>sort</i> re-sorts the output of <i>comm</i> according to the collating sequence of the original locale. Doing this might be difficult if more than one column is output and leading blanks cannot be ignored.

to:

If the collating sequence of the current locale does not have a total ordering of all characters, since <i>comm</i> treats lines as being the same only if they are identical, some lines can be misleadingly identified as being both unique to <i>file1</i> and unique to <i>file2</i> if lines that collate equally but are not identical are not ordered in the way that <i>comm</i> expects. If the input does not come from utilities (such as <i>ls</i> and <i>sort</i>) which provide this ordering, the problem can be avoided by pre-sorting the input files using <i>sort</i>.

On Page: 2561 Line: 82842 Section: comm

Change the updated FUTURE DIRECTIONS section from:

A future version of this standard may require that if any lines from the input files collate equally but are not identical, then <i>comm</i> treats them as different lines and expects them to be ordered according to a further byte-by-byte comparison using the collating sequence for the POSIX locale.

A future version of this standard may require that if the input files contained any lines that collated equally but were not identical and within each file those lines were ordered according to a further byte-by-byte comparison using the collating sequence for the POSIX locale, then lines written that collate equally but are not identical are ordered according to a further byte-by-byte comparison using the collating sequence for the POSIX locale.

to:

None.

On Page: 2874 Line: 94650 Section: localedef

Add a new paragraph to the DESCRIPTION section:

If the LC_COLLATE category defines a collation sequence that does not have a total ordering of all characters, <i>localedef</i> shall write a warning message to standard error and, if the exit status would otherwise have been zero, shall exit with status 1.

On Page: 2888 Line: 95164 Section: ls

In the new DESCRIPTION paragraph change from:

any filenames or pathnames that collate equally should be further compared byte-by-byte using the collating sequence for the POSIX locale.

to:

any filenames or pathnames that collate equally shall be further compared byte-by-byte using the collating sequence for the POSIX locale.

On Page: 2896 Line: 95520 Section: ls

In the FUTURE DIRECTIONS section, delete the new paragraph:

A future version of this standard may require that if the collating sequence for the current locale does not have a total ordering of all characters, any filenames or pathnames that collate equally are further compared byte-by-byte using the collating sequence for the POSIX locale.

On Page: 3210 Line: 107544 Section: sort

In the updated text, change from:

any lines of input that collate equally should be further compared byte-by-byte using the collating sequence for the POSIX locale.

to:

any lines of input that collate equally shall be further compared byte-by-byte using the collating sequence for the POSIX locale.
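
A rough way to observe this kind of tie-break without a special locale is to let <tt>sort -f</tt> in the POSIX locale stand in for a collating sequence that lacks a total ordering: case folding makes "a" and "A" compare equal as keys, much as two distinct lines can collate equally. This is only an analogy, not the locale behavior itself, and the function names below are hypothetical.

```shell
# Sample input: "a"/"A" and "b"/"B" compare equal once case is folded.
sample() { printf '%s\n' b A B a; }

# Without -u, lines whose keys compare equal are ordered by a final
# byte-by-byte comparison of the whole line (POSIX locale here).
sort_with_tie_break() { LC_ALL=C sort -f; }

# With -u, only one line from each group of equal keys is kept.
sort_unique_keys() { LC_ALL=C sort -f -u; }

sample | sort_with_tie_break         # A, a, B, b (one per line)
sample | sort_unique_keys | wc -l    # 2
```

The second pipeline shows why <tt>sort -u</tt> can drop lines that are not identical: it compares keys only, so lines that are merely equal as keys count as duplicates.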

On Page: 3214 Line: 107719 Section: sort

In the updated APPLICATION USAGE text, change from:

If the collating sequence of the current locale does not have a total ordering of all characters, this can affect the behavior of <i>sort</i> in the following ways:<blockquote>* As <tt>sort -u</tt> suppresses lines with duplicate keys, it suppresses lines that collate equally but are not identical.
* The output of <i>sort</i> (without <b>-u</b>) can contain identical lines that are not adjacent, if it does not implement the recommended further byte-by-byte comparison of lines that collate equally. This affects the use of <i>sort</i> with <i>comm</i> and <i>uniq</i>; see the APPLICATION USAGE for those utilities.</blockquote>
to:

If the collating sequence of the current locale does not have a total ordering of all characters, since <tt>sort -u</tt> suppresses lines with duplicate keys, it suppresses lines that collate equally but are not identical.

On Page: 3215 Line: 107783 Section: sort

In the new RATIONALE paragraph change from:

Implementations are encouraged to perform the recommended further byte-by-byte comparison of lines that collate equally, even though this may affect efficiency. The impact on efficiency can be mitigated by only performing the additional comparison if the current locale's collating sequence does not have a total ordering of all characters (if the implementation provides a way to query this) or by only performing the additional comparison if the locale name associated with the LC_COLLATE category has an '@' modifier in the name (since locales without an '@' modifier should have a total ordering of all characters - see [xref to XBD 7.3.2]). Note that if the implementation provides a <i>stable sort</i> option as an extension (usually -<b>s</b>), the additional comparison should not be performed when this option has been specified.

to:

The required further byte-by-byte comparison of lines that collate equally may have an impact on efficiency, but this can be mitigated by only performing the additional comparison if the current locale's collating sequence does not have a total ordering of all characters (if the implementation provides a way to query this) or by only performing the additional comparison if the locale name associated with the LC_COLLATE category has an '@' modifier in the name (since implementation-supplied locales without an '@' modifier have a total ordering of all characters - see [xref to XBD 7.3.2] - and <i>localedef</i> users are warned to follow the same convention). Note that if the implementation provides a <i>stable sort</i> option as an extension (usually -<b>s</b>), the additional comparison should not be performed when this option has been specified.

On Page: 3215 Line: 107785 Section: sort

Change the updated FUTURE DIRECTIONS section from:

A future version of this standard may require that if the collating sequence of the current locale does not have a total ordering of all characters, any lines of input that collate equally when comparing them as whole lines are further compared byte-by-byte using the collating sequence for the POSIX locale.

to:

None.

On Page: 3310 Line: 111099 Section: uniq

In the updated APPLICATION USAGE section, change from:

If the collating sequence of the current locale has a total ordering of all characters, the <i>sort</i> utility can be used to cause repeated lines to be adjacent in the input file. If the collating sequence does not have a total ordering of all characters, the <i>sort</i> utility should still do this but it might not.

To ensure that all duplicate lines are eliminated, and have the output sorted according to the collating sequence of the current locale, applications should use:<blockquote><pre>LC_ALL=C sort -u | sort</pre></blockquote>instead of:<blockquote><pre>sort | uniq</pre></blockquote>To remove duplicate lines based on whether they collate equally instead of whether they are identical, applications should use:<blockquote><pre>sort -u</pre></blockquote>instead of:<blockquote><pre>sort | uniq</pre></blockquote>
to:

The <i>sort</i> utility can be used to cause repeated lines to be adjacent in the input file. If the collating sequence of the current locale does not have a total ordering of all characters, the behavior of <tt>sort | uniq</tt> differs from <tt>sort -u</tt>, as <i>uniq</i> treats lines as duplicates only if they are identical, whereas <tt>sort -u</tt> treats lines as duplicates if they collate equally.
======================================================================

Issue History
Date Modified    Username       Field                    Change
======================================================================
2016-08-25 11:11 geoffclare     New Issue
2016-08-25 11:11 geoffclare     Name                      => Geoff Clare
2016-08-25 11:11 geoffclare     Organization              => The Open Group
2016-08-25 11:11 geoffclare     Section                   => 2.13.3, awk, comm, localedef, ls, sort, uniq
2016-08-25 11:11 geoffclare     Page Number               => 2356, 2459, 2559, 2874, 2888, 3210, 3309, and more
2016-08-25 11:11 geoffclare     Line Number               => 75082, 78745, 82755, 94650, 95164, 107544, 111067, and more
2016-08-25 11:11 geoffclare     Interp Status             => ---
======================================================================