A NOTE has been added to this issue.
======================================================================
https://www.austingroupbugs.net/view.php?id=1942
======================================================================
Reported By:                dwheeler
Assigned To:                ajosey
======================================================================
Project:                    1003.1(2024)/Issue8
Issue ID:                   1942
Category:                   Shell and Utilities
Type:                       Enhancement Request
Severity:                   Objection
Priority:                   normal
Status:                     Under Review
Name:                       David A. Wheeler
Organization:
User Reference:             diff
Section:                    diff
Page Number:                1
Line Number:                1
Interp Status:              ---
Final Accepted Text:
======================================================================
Date Submitted:             2025-08-31 22:16 UTC
Last Modified:              2025-10-05 00:19 UTC
======================================================================
Summary:                    Add common options to diff
======================================================================
----------------------------------------------------------------------
 (0007283) dwheeler (reporter) - 2025-10-05 00:19
 https://www.austingroupbugs.net/view.php?id=1942#c7283
----------------------------------------------------------------------
Here are the questions and my attempted answers, followed by a partially updated proposal for diff. This is NOT a completed proposal; I just wanted to share my progress so far. That said, any comments on this work in progress are appreciated.

My baseline is the POSIX 2024 diff specification visible at:
https://pubs.opengroup.org/onlinepubs/9799919799/utilities/diff.html
and the proposals are here at
https://www.austingroupbugs.net/view.php?id=1942

My proposal adds the following options to diff:
<pre>
-a (all files as text),
-B (blank lines ignored),
-d (aggressive diffing),
-i (ignore case difference),
-I regexp (ignore these lines),
-N (missing is new),
-q (brief report if differ),
-s (report if same),
-T (tabs),
-w (whitespace),
-x pattern (exclude files),
-X file (exclude patternfile).
</pre>

I've learned that the -T option has some fiddly details that are not fully addressed here (yet).

Regarding the questions raised:

1. If standard input is an input (e.g., "-"), what filename is reported? Is it "(standard input)" everywhere, or something else? Ideally there would be a standard answer, and if not, it should be clearly noted that it may vary by implementation.

I'll need to see what implementations do in this case. The current spec already says for -c or -C, "The pathname written for standard input is unspecified." It also already says that for -u or -U, "Each <file> field shall be the pathname of the corresponding file being compared, or the single character '-' if standard input is being compared." I don't intend to change that. So I think the key issue is the new -q and -s options for "-".

I determined that "-" appears to be the answer in implementations. It's also important to determine if locale matters.
I also installed the French locale (fr_FR.UTF-8) on my Debian system to further test the GNU and Busybox versions, and I saw no differences.

Here is my justification. I ran this to test -q on Linux (GNU), Busybox, and FreeBSD:
<pre>
seq 1 30 > seq1-30.txt
export LANG=fr_FR.UTF-8
export LC_ALL=fr_FR.UTF-8
seq 1 25 | diff -q - seq1-30.txt 2> /dev/null
</pre>
This produced exactly "Files - and seq1-30.txt differ" on GNU diff, Busybox diff ("busybox diff..."), and FreeBSD diff, even in the French locale.

To test -s, I ran:
<pre>
seq 1 30 | diff -s - seq1-30.txt 2> /dev/null
</pre>
This produced exactly "Files - and seq1-30.txt are identical" on GNU diff, Busybox diff, and FreeBSD diff, even in the French locale.

I did find a Busybox bug. The command "diff -s seq1-30.txt seq1-30.txt" produced the expected results on GNU diff and FreeBSD, namely "Files seq1-30.txt and seq1-30.txt are identical". However, when the *same* filename is given twice to Busybox diff, it produces NOTHING. I have no objections to optimizing computations (like noticing that the same file was given twice), but I think this is a clear bug in Busybox. "Tell me if the files are the same" should do exactly that. Busybox has no problems providing the expected output when given *different* files:
<pre>
$ cp seq1-30.txt seq1-30-dup.txt
$ busybox diff -s seq1-30.txt seq1-30-dup.txt
Files seq1-30.txt and seq1-30-dup.txt are identical
</pre>
I think we should standardize common and expected behavior, and report a bug to Busybox. Normally you don't compare files to <i>exactly</i> themselves anyway, so this is probably not a bug anyone has encountered in real life. It would be <i>possible</i> to make this case implementation-defined, but I think this is a bug and should be treated as such.

2. What's the impact of -q and -s? Do they send to standard output? How do they interact? Are they locale-dependent?
Let me answer the question as I originally understood it, and then reply to the later explanation, in the hope that I fully answer the question.

Yes, their outputs are sent to standard output (I checked by redirecting stderr on multiple implementations). No, they're not locale-dependent; the output is the same (presumably the intention is to aid scripts). I tried this on GNU and FreeBSD with the French locale fr_FR.UTF-8.

The -q and -s flags interact in the "obvious" way required by their definitions. Basically, if you use both, you always get output when comparing 2 files, and it indicates whether they differ or are the same. I don't think that needs special documentation. I think the problem is that my earlier description wasn't clear enough, so I rewrote it to hopefully be clearer. Here's what they do:
<pre>
$ diff -qs seq1-30.txt seq1-30-dup.txt
Files seq1-30.txt and seq1-30-dup.txt are identical
$ diff -qs seq1-30.txt seq1-25.txt
Files seq1-30.txt and seq1-25.txt differ
</pre>
The "-s" only does anything different when two compared files are considered the same. The files may have <i>some</i> differences, e.g., per the -B flag, but what matters is whether or not they're considered different by diff. I'll clarify that in the options text.

geoffclare later clarified: "Re 0001942:0007257 item 2, the desired action has statements about what is written to standard output for -q and -s. The point I made in the meeting was that those details should be in the STDOUT section, not in the option descriptions. Also, specific English text should only be required for the POSIX locale."

Sorry I misunderstood. Sure, I'll do that.

3. The "-i" (Ignore Case) should refer to XPD 4.2 - general concepts (case-insensitive) (spelling?). Review the similar references to use the same format.

Agreed. Done. I used grep as my template.

4. Does "-w" ignore space and tab, or ignore whitespace? The initial draft was inconsistent. There's other whitespace than space or tab.
Is it locale-specific?

I tested on GNU, Busybox, and FreeBSD. The "-w" consistently ignores *only* these whitespace characters: U+9 (TAB), U+B (VT), U+C (FF), U+D (CR), U+20 (SPACE). It does not ignore *any* of the other Unicode whitespace characters (there are 25 in the current spec). I can't say I tried it on all locales, but I tried it with French (fr_FR.UTF-8) and I saw no evidence of locale dependence.

The "-B" by itself (portably) only ignores *fully* blank lines. The *combination* of -Bw has an annoying implementation difference:
- GNU diff: Treats lines with only "-w" whitespace as equivalent to an empty line (and thus ignored)
- FreeBSD diff: Does NOT treat lines with only "-w" whitespace as equivalent to an empty line (and thus they are still compared)

This difference only seems to happen when the options are *combined*, so I documented the combination as implementation-defined behavior with the two possibilities identified.

See this test script if you want to investigate this:
http://dwheeler.com/misc/diff-whitespace-test.sh

Here are a few additional notes.

The GNU documentation for -q (--brief) says, "report only when files differ". That text is misleading, because diff *normally* only reports when files differ, yet the text implies otherwise. The FreeBSD documentation is a little clearer: "Just print a line when the files differ. Does not output a list of changes."

GNU and FreeBSD differ on what -T does when there's no change. FreeBSD outputs space-tab (no change, then indent). GNU outputs a tab (merging the 'no change' space and the indent). To see this you need to use od -c or similar, since visually they look the same. I think that could be "implementation-defined" without serious issue. It's a little messy, but it's also reality, and it really isn't hard to handle programmatically once you know it can happen.
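As a rough illustration of "handling it programmatically", here is a minimal sketch. It is not part of the proposal. The function name normalize_unchanged_prefix is hypothetical, and the sample input is hand-written rather than captured from any implementation. It assumes only what is described above: under -T an unchanged line starts with a tab on GNU and with space-tab on FreeBSD.

```shell
#!/bin/sh
# Sketch only: normalize_unchanged_prefix is a hypothetical helper name.
# Assumption (taken from the observations above): for an unchanged line
# under -T, GNU diff writes "<tab>text" and FreeBSD diff writes
# "<space><tab>text". This filter rewrites the first form into the second
# so that later processing only has to deal with one prefix.
normalize_unchanged_prefix() {
    tab=$(printf '\t')
    # A line that starts with a tab gets a space put in front of it.
    # Lines that start with "+", "-", "@", or a space pass through as-is.
    sed "s/^${tab}/ ${tab}/"
}

# Demonstration on hand-written sample lines (not real diff output);
# od -c makes the otherwise invisible prefixes visible.
printf '\tunchanged, tab only\n \tunchanged, space-tab\n+\tadded\n' |
    normalize_unchanged_prefix | od -c
```

In real use the filter would sit in a pipeline after something like "diff -u -T file1 file2", on implementations that support -T.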
* * *

<b>INCOMPLETE proposed changes to POSIX diff specification:</b>

Synopsis Section:

Change from:
<pre>diff [-c|-e|-f|-u|-C n|-U n] [-br] file1 file2</pre>
To:
<pre>
diff [-c|-e|-f|-u|-C n|-U n] [-aBbdiqsNrTw] [-I regexp] [-X file] [-x pattern] file1 file2
</pre>

Description Section:

Change "This list should be minimal." to "This list should be reasonably minimal." because really, that's all you can hope for.

Options Section:

Add the following options in alphabetical order in addition to existing options:

<b>-a</b> Treat all files as text. Files that would otherwise be identified as binary files shall be treated as text files.

<b>-B</b> Ignore lines that are blank. A blank line is a line that is empty (contains no characters).

<b>-d</b> Use a more aggressive algorithm to minimize the number of changes in the output. This may require significantly more time and memory.

<b>-i</b> Compare lines in a case-insensitive manner (using LC_CTYPE); see XBD 9.2 Regular Expression General Requirements.

<b>-I regexp</b> Ignore lines in both files that match the Extended Regular Expression regexp. Multiple -I options may be specified; lines matching any of the patterns shall be ignored. Perform pattern matching in a case-insensitive manner; see XBD 9.2 Regular Expression General Requirements.

<b>-N</b> If file1 or file2 is a directory and the other is not, or if one file is missing during directory comparison, treat the missing file as an empty file.

<b>-q</b> If files have a reportable difference, output only that they differ instead of the details about their differences. By default all differences in files are reported, but options can change this (see -i, -I, -B, and -w).

<b>-s</b> If files are considered the same (do not have a reportable difference), report that they are the same instead of being silent.

<b>-T</b> Write a tab instead of a space before the line information about differences (to make tab alignment consistent).
<b>-w</b> Ignore differences in sequences of equivalent whitespace when comparing lines. The following characters are treated as equivalent whitespace: <space> (U+20), <tab> (U+9), vertical tab (U+B), form feed (U+C), and carriage return (U+D). Any sequence of one or more of these characters shall be considered equivalent to any other sequence of one or more of these characters. Other whitespace characters are not treated as equivalent.

<b>-x pattern</b> During recursive directory comparison, exclude files and directories whose basename matches the shell pattern specified by pattern. Multiple -x options may be specified. Pattern matching follows the rules specified in XBD Pattern Matching Notation.

<b>-X file</b> During recursive directory comparison, exclude files and directories whose basenames match any pattern in file. Each line in file shall be treated as a shell pattern following the same matching rules as -x.

Note: The interaction between the -B and -w options when applied together (-Bw) is implementation-defined. An implementation may or may not consider lines containing only the whitespace characters of -w as blank lines when both options are used together.

<b>In the later "STDOUT" section:</b>

BEFORE the subsection "Diff Default Output Format" add this text and these two subsections:

By default "indent" is a space character; with -T it becomes a tab character.

<b>Diff brief considered different form</b> (added)

If the -q option is specified and the compared files are considered different (have reportable differences), a diagnostic line is written to standard output to note that there are differences instead of describing those differences.
In the POSIX locale, the following format is written in this case:
"Files %s and %s differ\n", <filename1>, <filename2>

<b>Diff considered same form</b> (added)

If the -s option is specified and the compared files are considered the same (have no reportable differences), then instead of no output, a diagnostic line is written to standard output to note that they are the same. In the POSIX locale, the following format is written in this case:
"Files %s and %s are identical\n", <filename1>, <filename2>

<b>Diff Default Output Format</b>

Change:
"The default (without -e, -f, -c, -C, -u, or -U options) diff utility output shall contain lines of these forms:"
to:
"The default (without -e, -f, -q, -c, -C, -u, or -U options) diff utility output shall contain lines of these forms where there are reportable differences:"

...

TODO: the output section must be modified to handle -T. It shouldn't add much length, but there will be many small changes.

Again, this is a work in progress, not a complete proposal, but I wanted to share what I've learned so far.

Issue History
Date Modified    Username       Field                    Change
======================================================================
2025-08-31 22:16 dwheeler       New Issue
2025-08-31 22:16 dwheeler       Status                   New => Under Review
2025-08-31 22:16 dwheeler       Assigned To               => ajosey
2025-09-01 15:08 dwheeler       Note Added: 0007248
2025-09-11 15:50 geoffclare     Project                  1003.1(2008)/Issue 7 => 1003.1(2024)/Issue8
2025-09-12 14:43 dwheeler       Note Added: 0007257
2025-09-12 14:44 dwheeler       Note Edited: 0007257
2025-09-12 14:45 dwheeler       Note Edited: 0007257
2025-09-18 16:00 geoffclare     Note Added: 0007268
2025-09-18 16:01 geoffclare     Note Edited: 0007268
2025-10-05 00:19 dwheeler       Note Added: 0007283
======================================================================
