[content warning: yet another long software engineering rant]

At 2022-08-02T13:38:22+0200, Alejandro Colomar wrote:
> On 7/27/22 15:23, Douglas McIlroy wrote:
> >
> > Incidentally, I personally don't use NULL. Why, when C provides a
> > crisp notation, 0, should one want to haul in an extra include file
> > to activate a shouty version of it?
While I don't endorse the shoutiness of the name selected for the macro (and also don't endorse the use of a macro over making the null pointer constant a bona fide language feature as was done in Pascal[1]), as stronger typing has percolated into the C language--slowly, and against great resistance from those who defend its original character as a portable assembly language--distinctions have emerged between ordinal types, reference types (pointers), and Booleans. I feel this is unequivocally a good thing.

Yes, people will say that a zero (false) value in all of these types has exactly the same machine representation, even when it's not true[2], so a zero literal somehow "tells you more". But let me point out two C programming practices I avoid and advocate against, and explain why writing code that way tells us less than many of us suspect.

(1) Setting Booleans like this:

    nflag++;

First let me point out how screamingly unhelpful this variable name is. I know the practice came from someone in Bell Labs and was copied with slavish dedication by many others, so I'm probably slandering a deity here, but this convention is _terrible_. It tells you almost _nothing_. What does the "n" flag _mean_? You have to look it up. It is most useful for those who already know the program's manual inside out. Why not throw a bone to people who don't, who just happen across the source code?

    numbering++;

Uh-oh, that's more typing. And if I got my way you'd type even more.

    want_numbering++;

There, now you have something that actually _communicates_ when tested in an `if` statement.

But I'm not done. The above exhibit abuses the incrementation operator. First, this makes your data type conceptually murky. What if `nflag` (or whatever you called it) was already `1`? Now it's `2`. Does this matter? Are you sure? Have you checked every path through your code? (Does your programming language help you to do this? Not if it's C. <bitter laugh>) Is a Boolean `2` _meaningful_, without programming language semantics piled on to coerce it? No. Nobody answers "true or false" questions in any other domain of human experience with "2". Nor, really, with "0" or "1", except junior programmers who think they're being witty. This is why the addition of a standard Boolean type to C99 was an unalloyed good.

Kernighan and Plauger tried to convince people to select variable names with a high information content in their book _The Elements of Programming Style_. Kernighan tried again with Pike in _The Practice of Programming_. It appears to me that some people they knew well flatly rejected this advice, and _bad_ programmers have more diligently aped poor examples than good ones. (To be fair, fighting this is like fighting entropy. Ultimately, you will lose.)

I already hear another objection. "But the incrementation operator encodes efficiently!" This is a reference to how an incrementation operation is a common feature of instruction set architectures, e.g., "INC A". By contrast, assigning an explicit value, as in "ADD A, 1", is in some machine languages a longer instruction because it needs to encode not just the operation and a destination register but an immediate operand. There are at least two reasons to stop beating this drum.

(A) Incredibly, a compiler can be written such that it recognizes when a subexpression in an addition operation equals one, and can branch in its own code to generate "INC A" instead of "ADD A, 1". Yes, in the days when you struggled to fit your compiler into machine memory, this sort of straightforward conditional might have been skipped. (I guess.) Those days are past.

(B) If you want to talk about the efficiency of instruction encoding, you need to talk about fetching. Before that, actually, you need to talk about whether the ISA you're targeting uses constant- or variable-length instructions.
The concept of a constant-length instruction encoding should ALREADY give the promulgators of the above quoted doctrine serious pause. How much efficiency are you gaining in such a case, Tex? Quantify how much "tighter" the machine code is. Tell the whole class.

Let's assume we're on a variable-length instruction machine, like the horrible, and horribly popular, x86. Let's get back to fetching. Do you know the memory access impact of (the x86 equivalent of) "INC A" vs. "ADD A, 1"?[3] Are you taking instruction cache into account? Pipelining? Speculative execution, that brilliant source of a million security exploits? (Just understanding the basic principle of how speculative execution works should, again, give pause to the C programmer concerned with the efficiency of instruction encoding. Today's processors cheerfully choke their cache lines with _instructions whose effects they know they will throw away_.[4])

If you can't say "yes" to all of these, have the empirical measurements to back them up, had those measurements' precision and _validity_ checked by critical peers, _and_ put all this information in your code comments, then STOP.

As a person authoring a program, the details of how a compiler translates your code are seldom your business. Yes, knowledge of computer architecture and organization is a tremendous benefit, and something any serious programmer should acquire--that's why we teach it in school--but _assuming_ that you know with precision how the compiler is going to translate your code is a bad idea. If you're concerned about it, you must _check_ your assumption about what is happening, and _measure_ whether it's really that important before doing anything clever in your code to leverage what you find out.
If it _is_ important, then apply your knowledge the right way: acknowledge in a comment that you're working around an issue, state what it is, explain how what you're doing instead is effective, and as part of that comment STATE CLEARLY THE CONDITIONS UNDER WHICH THE WORKAROUND CAN BE REMOVED.

The same thing goes for optimizations. Oh, let me rant about optimizations. First I'll point the reader to this debunking[5] of the popularly misconstrued claim of Tony Hoare about premature optimization (the "root of all evil" thing), which also got stuffed into Donald Knuth's mouth, I guess by people who not only found the math in TAOCP daunting to follow, but also struggled with the English. (More likely, they thoughtlessly repeated what some rock star cowboy programmer said to them in the break room or in electronic chat.)

In my own experience the best optimizations I've done are those which dropped code that wasn't doing anything useful at all. Eliminating _unnecessary_ work is _never_ a "premature" optimization. It removes sites for bugs to lurk and gives your colleagues less code to read. (...says the guy who thinks nothing of dropping a 20-page email treatise on them at regular intervals.)

And another thing! I'd say that our software engineering curriculum is badly deficient in teaching its students about linkage and loading. As a journeyman programmer I'd even argue that it's more important to learn about this than comp arch & org. Why? Because every _successful_ program you run will be linked and loaded. And because you're more likely to have problems with linkage or loading than with a compiler that translates your code into the wrong assembly instructions. Do you need both? Yes! But I think in pedagogy we tend to race from "high-level" languages to assembly and machine languages without considering the question of how, in a hosted environment, programs _actually run_.

(2) Initializing pointers like this:

    char *myptr = 0;

Yes, we're back to the original topic at last. A zero isn't just a zero!
There is something visible in assembly language that isn't necessarily so in C code, and that is the addressing mode that gets applied to the operand! Consider the instruction "JP (HL)". We know from the assembler syntax that this is going to jump to the address stored in the HL register. So if HL contains zero, it will jump to address zero. As it happens, this was a perfectly valid instruction and operation to perform back when I was a kid (it would simulate a machine reset).

So people talk even to this day about C being a portable assembly language, but I think they aren't fully frank about what gets concealed in the process. Addressing modes are inherently architecture-specific but also fundamental. I submit that using `0` as a null pointer constant, whether explicitly or behind the veil of a preprocessor macro, hides _necessary_ information from the programmer. For me personally, this fact alone is enough to justify a within-language null pointer constant.

Maybe people will find the point easier to swallow if they're asked to defend why they ever use an enumeration constant that is equal to zero instead of a literal zero. I will grant that some of them--particularly users of the Unix system call interface--often don't. But, if you do, and you can justify it, then you can also justify "false" and "nullptr". (I will leave my irritation with the scoping of enumeration constants for another time. Why, why, why, was the '.' operator not used as a value selector within an enum type or object, and 'typedef' employed where it could have done some good for once?)

I don't think it is an accident that there are no function pointer literals in the language either (you can of course construct one through casting, a delightfully evil bit of business). The lack made it easier to paper over the deficiency. Remember that C came out of the Labs without fully-fledged struct literals ("compound literals"), either.
If I'd been at the CSRC--who among us hasn't had that fantasy?--I'd have climbed the walls over this lack of orthogonality.

I will grant that the inertia against the above two points was, and remains, strong. C++, never a language to reject a feature[6], resisted the simple semantics and obvious virtues of a Boolean data type for nearly the first 20 years of its existence, and only acquired `nullptr` in C++11. Prior to that point, C++ was militant about `0` being the null pointer constant, in line with Doug's preference--none of this shouty "NULL" nonsense. I cynically suspect that the feature-resistance on these specific two points was a means of reinforcing people's desires, prejudices, or sales pitches regarding C++ as a language that remained "close to the machine", because these points were easy to conceptualize and talk about. And of course anything C++ didn't need, C didn't need either, because leanness. Eventually, I guess, WG21 decided that the battle front over that claim was better defended elsewhere. Good!

> Because I don't know what foo(a, b, 0, 0) is, and I don't know from
> memory the position of all parameters to the functions I use (and
> checking them every time would be cumbersome, although I normally do,
> just because it's easier to just check, but I don't feel forced to do
> it so it's nicer).

Kids these days will tell you to use an IDE that pops up a tool tip with the declaration of 'foo'. That this is necessary, or even as useful as it is, discloses problems with API design, in my opinion.

> Was the third parameter to foo() the pointer and the fourth a length,
> or was it the other way around? bcopy(3) vs memcpy(3) come to mind,
> although of course no-one would memcpy(dest, NULL, 0) with hardcoded
> 0s in their Right Mind (tm) (although another story is to call
> memcpy(dest, src, len) with src and len both 0).

The casual inconsistency of the standard C library has many more manifestations than that.
> Knowing with 100% certainty that something is a pointer or an integer
> just by reading the code and not having to go to the definition
> somewhere far from it, improves readability.

I entirely agree.

> Even with a program[...] that finds the definitions in a big tree of
> code, it would be nice if one wouldn't need to do it so often.

Or like Atom[7], which has been a big hit among Millennial programmers. Too bad that, thanks to the company that acquired GitHub, it's being killed off and replaced by Visual Studio Code now. We call this market efficiency.

> Kind of the argument that some use to object to my separation of
> man3type pages, but for C code.

Here I disagree, because while we are clearly aligned on the null pointer literal...er, point, I will continue to defend the position that a man page for a header file--where an API has been designed rather than accreted--is a preferable place to document the data types and constants it defines, as well as either an overview or a detailed presentation of its functions, depending on its size and/or complexity. I think the overwhelming importance of the C library to your documentary mission in the Linux man-pages project is obscuring the potential of elegant man pages for _other_ libraries. What passes for the C standard library on _any_ system is quite a crumbcake. The fact that you can't cram the C library into the shape I espouse without making the pages huge and unwieldy is the fault of the library, I humbly(!) suggest, not the fault of the principle. But I'd be a long way from the first person to note that the C library is schizophrenic upon even cursory inspection.

Is it true that Plauger never wrote another book after his (dense but incredibly illuminating, A+++, would read again) _The Standard C Library_? Did writing his own libc and an exposition of it "break" him in some way?
I imagine that he got perhaps as good a look as anyone's ever had at how much better a standard library C could have had, if only attention had been paid at the right historical moments. He saw how entrenched the status quo was, and decided to go sailing...forever. Just an irresponsible guess.

I would like to find a good place to state the recommendation that documenters of _other_ libraries should not copy this approach of trifurcating presentations of constants, data types, and functions. In a well-designed API, these things will be clearly related, mechanically coupled, and should be considered together.

What do you think? As always, spirited contention of any of my claims is welcome.

Regards,
Branden

[1] I say that, but I don't see it in the 1973 "Revised Report" on the language. I guess "nil" came in later. Another Algol descendant, Ada 83, had "null". And boy am I spoiling to fight with someone over the awesomeness of Ada, the most unfairly maligned of any language known to me.

[2] Like some CDC Cyber machines, I gather. Before my time.

[3] Pop quiz: what assembly language did Branden grow up on? It's hard to escape our roots in some ways.

[4] I won't go so far as to say speculative execution is _stupid_, though I wonder if it can ever actually be implemented without introducing side channels for attack. Spec ex is performed, as I understand it, because Moore's Law is dead, and CPU manufacturers are desperately trying to satisfy sales engineers and certain high-volume customers who are only willing to pay for cores at the premium prices (profit margins) that the industry has been accustomed to for 50 years. Purchasing decisions are made by suits, and suits like to get to the bottom line, like "number go up".
    https://davidgerard.co.uk/blockchain/2019/05/27/the-origin-of-number-go-up-in-bitcoin-culture/

[5] https://ubiquity.acm.org/article.cfm?id=1513451

[6] Stroustrup would disagree and can support his argument; see his writings on the history of the language.

[7] https://en.wikipedia.org/wiki/Atom_(text_editor)