[content warning: yet another long software engineering rant]

At 2022-08-02T13:38:22+0200, Alejandro Colomar wrote:
> On 7/27/22 15:23, Douglas McIlroy wrote:
> >
> > Incidentally, I personally don't use NULL. Why, when C provides a
> > crisp notation, 0, should one want to haul in an extra include file
> > to activate a shouty version of it?
While I don't endorse the shoutiness of the name selected for the macro (and also don't endorse the use of a macro over making the null pointer constant a bona fide language feature as was done in Pascal[1]), as stronger typing has percolated into the C language--slowly, and against great resistance from those who defend its original character as a portable assembly language--distinctions have emerged between ordinal types, reference types (pointers), and Booleans. I feel this is unequivocally a good thing.

Yes, people will say that a zero (false) value in all of these types has exactly the same machine representation, even when it's not true[2], so a zero literal somehow "tells you more". But let me point out two C programming practices I avoid and advocate against, and explain why writing code that way tells us less than many of us suspect.

(1) Setting Booleans like this:

    nflag++;

First let me point out how screamingly unhelpful this variable name is. I know the practice came from someone in Bell Labs and was copied with slavish dedication by many others, so I'm probably slandering a deity here, but this convention is _terrible_. It tells you almost _nothing_. What does the "n" flag _mean_? You have to look it up. It is most useful for those who already know the program's manual inside out. Why not throw a bone to people who don't, who just happen across the source code?

    numbering++;

Uh-oh, that's more typing. And if I got my way you'd type even more.

    want_numbering++;

There, now you have something that actually _communicates_ when tested in an `if` statement.

But I'm not done. The above exhibit abuses the incrementation operator. First, this makes your data type conceptually murky. What if `nflag` (or whatever you called it) was already `1`? Now it's `2`. Does this matter? Are you sure? Have you checked every path through your code? (Does your programming language help you to do this? Not if it's C. <bitter laugh>) Is a Boolean `2` _meaningful_, without programming language semantics piled on to coerce it? No. Nobody answers "true or false" questions in any other domain of human experience with "2". Nor, really, with "0" or "1", except junior programmers who think they're being witty. This is why the addition of a standard Boolean type to C99 was an unalloyed good.

Kernighan and Plauger tried to convince people to select variable names with a high information content in their book _The Elements of Programming Style_. Kernighan tried again with Pike in _The Practice of Programming_. It appears to me that some people they knew well flatly rejected this advice, and _bad_ programmers have more diligently aped poor examples than good ones. (To be fair, fighting this is like fighting entropy. Ultimately, you will lose.)

I already hear another objection. "But the incrementation operator encodes efficiently!" This is a reference to how an incrementation operation is a common feature of instruction set architectures, e.g., "INC A". By contrast, assigning an explicit value, as in "ADD A, 1", is in some machine languages a longer instruction because it needs to encode not just the operation and a destination register but an immediate operand. There are at least two reasons to stop beating this drum.

(A) Incredibly, a compiler can be written such that it recognizes when a subexpression in an addition operation equals one, and can branch in its own code to generate "INC A" instead of "ADD A, 1". Yes, in the days when you struggled to fit your compiler into machine memory, this sort of straightforward conditional might have been skipped. (I guess.) Those days are past.

(B) If you want to talk about the efficiency of instruction encoding, you need to talk about fetching. Before that, actually, you need to talk about whether the ISA you're targeting uses constant- or variable-length instructions.
The concept of a constant-length instruction encoding should ALREADY give the promulgators of the above quoted doctrine serious pause. How much efficiency are you gaining in such a case, Tex? Quantify how much "tighter" the machine code is. Tell the whole class.

Let's assume we're on a variable-length instruction machine, like the horrible, and horribly popular, x86. Let's get back to fetching. Do you know the memory access impact of (the x86 equivalent of) "INC A" vs. "ADD A, 1"?[3] Are you taking instruction cache into account? Pipelining? Speculative execution, that brilliant source of a million security exploits? (Just understanding the basic principle of how speculative execution works should, again, give pause to the C programmer concerned with the efficiency of instruction encoding. Today's processors cheerfully choke their cache lines with _instructions whose effects they know they will throw away_.[4])

If you can't say "yes" to all of these, have the empirical measurements to back them up, had those measurements' precision and _validity_ checked by critical peers, _and_ put all this information in your code comments, then STOP.

As a person authoring a program, the details of how a compiler translates your code are seldom your business. Yes, knowledge of computer architecture and organization is a tremendous benefit, and something any serious programmer should acquire--that's why we teach it in school--but _assuming_ that you know with precision how the compiler is going to translate your code is a bad idea. If you're concerned about it, you must _check_ your assumption about what is happening, and _measure_ whether it's really that important before doing anything clever in your code to leverage what you find out.
If it _is_ important, then apply your knowledge the right way: acknowledge in a comment that you're working around an issue, state what it is, explain how what you're doing instead is effective, and as part of that comment STATE CLEARLY THE CONDITIONS UNDER WHICH THE WORKAROUND CAN BE REMOVED.

The same thing goes for optimizations. Oh, let me rant about optimizations. First I'll point the reader to this debunking[5] of the popularly misconstrued claim of Tony Hoare about premature optimization (the "root of all evil" thing), which also got stuffed into Donald Knuth's mouth, I guess by people who not only found the math in TAOCP daunting to follow, but also struggled with the English. (More likely, they thoughtlessly repeated what some rock star cowboy programmer said to them in the break room or in electronic chat.)

In my own experience the best optimizations I've done are those which dropped code that wasn't doing anything useful at all. Eliminating _unnecessary_ work is _never_ a "premature" optimization. It removes sites for bugs to lurk and gives your colleagues less code to read. (...says the guy who thinks nothing of dropping a 20-page email treatise on them at regular intervals.)

And another thing! I'd say that our software engineering curriculum is badly deficient in teaching its students about linkage and loading. As a journeyman programmer I'd even argue that it's more important to learn about this than comp arch & org. Why? Because every _successful_ program you run will be linked and loaded. And because you're more likely to have problems with linkage or loading than with a compiler that translates your code into the wrong assembly instructions. Do you need both? Yes! But I think in pedagogy we tend to race from "high-level" languages to assembly and machine languages without considering the question of how, in a hosted environment, programs _actually run_.

(2) Initializing pointers like this:

    char *myptr = 0;

Yes, we're back to the original topic at last. A zero isn't just a zero!
There is something visible in assembly language that isn't necessarily so in C code, and that is the addressing mode that gets applied to the operand! Consider the instruction "JP (HL)". We know from the assembler syntax that this is going to jump to the address stored in the HL register. So if HL contains zero, it will jump to address zero. As it happens, this was a perfectly valid instruction and operation to perform back when I was a kid (it would simulate a machine reset).

So people talk even to this day about C being a portable assembly language, but I think they aren't fully frank about what gets concealed in the process. Addressing modes are inherently architecture-specific but also fundamental. I submit that using `0` as a null pointer constant, whether explicitly or behind the veil of a preprocessor macro, hides _necessary_ information from the programmer. For me personally, this fact alone is enough to justify a within-language null pointer constant.

Maybe people will find the point easier to swallow if they're asked to defend why they ever use an enumeration constant that is equal to zero instead of a literal zero. I will grant that some of them--particularly users of the Unix system call interface--often don't. But, if you do, and you can justify it, then you can also justify "false" and "nullptr". (I will leave my irritation with the scoping of enumeration constants for another time. Why, why, why, was the '.' operator not used as a value selector within an enum type or object, and 'typedef' employed where it could have done some good for once?)

I don't think it is an accident that there are no function pointer literals in the language either (you can of course construct one through casting, a delightfully evil bit of business). The lack made it easier to paper over the deficiency. Remember that C came out of the Labs without fully-fledged struct literals ("compound literals"), either.
If I'd been at the CSRC--who among us hasn't had that fantasy?--I'd have climbed the walls over this lack of orthogonality.

I will grant that the inertia against the above two points was, and remains, strong. C++, never a language to reject a feature[6], resisted the simple semantics and obvious virtues of a Boolean data type for nearly the first 20 years of its existence, and only acquired `nullptr` in C++11. Prior to that point, C++ was militant about `0` being the null pointer constant, in line with Doug's preference--none of this shouty "NULL" nonsense. I cynically suspect that the feature-resistance on these specific two points was a means of reinforcing people's desires, prejudices, or sales pitches regarding C++ as a language that remained "close to the machine", because these points were easy to conceptualize and talk about. And of course anything C++ didn't need, C didn't need either, because leanness. Eventually, I guess, WG21 decided that the battle front over that claim was better defended elsewhere. Good!

> Because I don't know what foo(a, b, 0, 0) is, and I don't know from
> memory the position of all parameters to the functions I use (and
> checking them every time would be cumbersome, although I normally do,
> just because it's easier to just check, but I don't feel forced to do
> it so it's nicer).

Kids these days will tell you to use an IDE that pops up a tool tip with the declaration of 'foo'. That this is necessary, or even as useful as it is, discloses problems with API design, in my opinion.

> Was the third parameter to foo() the pointer and the fourth a length,
> or was it the other way around? bcopy(3) vs memcpy(3) come to mind,
> although of course no-one would memcpy(dest, NULL, 0) with hardcoded
> 0s in their Right Mind (tm) (although another story is to call
> memcpy(dest, src, len) with src and len both 0).

The casual inconsistency of the standard C library has many more manifestations than that.
> Knowing with 100% certainty that something is a pointer or an integer
> just by reading the code and not having to go to the definition
> somewhere far from it, improves readability.

I entirely agree.

> Even with a program[...] that finds the definitions in a big tree of
> code, it would be nice if one wouldn't need to do it so often.

Or like Atom[7], which has been a big hit among Millennial programmers. Too bad that, thanks to the company that acquired GitHub, it's being killed off and replaced by Visual Studio Code now. We call this market efficiency.

> Kind of the argument that some use to object to my separation of
> man3type pages, but for C code.

Here I disagree, because while we are clearly aligned on the null pointer literal...er, point, I will continue to defend the position that a man page for a header file--where an API has been designed rather than accreted--is a preferable place to document the data types and constants it defines, as well as either an overview or a detailed presentation of its functions, depending on its size and/or complexity. I think the overwhelming importance of the C library to your documentary mission in the Linux man-pages project is obscuring the potential of elegant man pages for _other_ libraries. What passes for the C standard library on _any_ system is quite a crumbcake. The fact that you can't cram the C library into the shape I espouse without making the pages huge and unwieldy is the fault of the library, I humbly(!) suggest, not the fault of the principle. But I'd be a long way from the first person to note that the C library is schizophrenic upon even cursory inspection.

Is it true that Plauger never wrote another book after his (dense but incredibly illuminating, A+++, would read again) _The Standard C Library_? Did writing his own libc and an exposition of it "break" him in some way?
I imagine that he got perhaps as good a look as anyone's ever had at how much better a standard library C could have had, if only attention had been paid at the right historical moments. He saw how entrenched the status quo was, and decided to go sailing...forever. Just an irresponsible guess.

I would like to find a good place to state the recommendation that documenters of _other_ libraries should not copy this approach of trifurcating presentations of constants, data types, and functions. In a well-designed API, these things will be clearly related, mechanically coupled, and should be considered together.

What do you think? As always, spirited contention of any of my claims is welcome.

Regards,
Branden

[1] I say that, but I don't see it in the 1973 "Revised Report" on the language. I guess "nil" came in later. Another Algol descendant, Ada 83, had "null". And boy am I spoiling to fight with someone over the awesomeness of Ada, the most unfairly maligned of any language known to me.

[2] Like some CDC Cyber machines, I gather. Before my time.

[3] Pop quiz: what assembly language did Branden grow up on? It's hard to escape our roots in some ways.

[4] I won't go so far as to say speculative execution is _stupid_, though I wonder if it can ever actually be implemented without introducing side channels for attack. Spec ex is performed, as I understand it, because Moore's Law is dead, and CPU manufacturers are desperately trying to satisfy sales engineers and certain high-volume customers who are only willing to pay for cores at the premium prices (profit margins) that the industry has been accustomed to for 50 years. Purchasing decisions are made by suits, and suits like to get to the bottom line, like "number go up".
    https://davidgerard.co.uk/blockchain/2019/05/27/the-origin-of-number-go-up-in-bitcoin-culture/

[5] https://ubiquity.acm.org/article.cfm?id=1513451

[6] Stroustrup would disagree and can support his argument; see his writings on the history of the language.

[7] https://en.wikipedia.org/wiki/Atom_(text_editor)