As some readers may know, texindex is a program distributed with Texinfo
that sorts the index files produced by texinfo.tex, the format file for
processing Texinfo input files with TeX.  texindex is run by the texi2pdf
program that also runs TeX.

texindex was rewritten several years ago (in 2015, in fact), from its original
C implementation (the copyright notice on that implementation dated all the
way to 1987).  The major reason at the time for this rewrite was a difficulty 
with
the previous implementation in the indexing of entries beginning with brace
characters.  There may also have been some frustration with texindex as a C
program, requiring manual memory management as C programs do and overall being
somewhat long and complicated for the task that it did.

This new implementation was carried out by long-time GNU Awk maintainer Arnold
Robbins, and implemented in a "literate programming" system of his own design,
built on top of Awk.  "Literate programming" is an approach to programming
utilised by TeX creator Donald Knuth, which is probably the reason that most
people have heard of it.  However, it is not an approach that has caught on,
generally speaking.

The new Awk implementation weathered changes, accommodating new possibilities
in the Texinfo language for index subentries and explicitly provided sort keys.
However, a couple of weeks ago, I converted the literate programming
implementation of texindex to plain Awk.  The literate programming
implementation is no more.

The reasons that I had for doing this were manifold.  One reason was that it
is possible that at some point in the future we will want to rewrite texindex
yet again in another language, Awk being somewhat underpowered for the task
and being limited in its handling of non-ASCII character encodings.

However, I also wanted to make a non-trivial change to the program
(change the rules for when separate lines in the input index file should be
considered as indexing the "same" term, thus producing a single index entry
in the result with a list of multiple page numbers).  In my experience, I
struggled to make headway with this change with the program in its "literate
programming" state.

I thought it might be of interest to myself or others to make a few notes
about the difficulties I had with literate programming, before the details
of this slip from my memory.

One difficulty I had was that I didn't know which format to be studying the
source code in.  This literate programming system ("Texiweb Jr.") had a source
file (with a .twjr extension), from which the program file (with a .awk
extension) and document file (with a .texi extension, which could further be
processed into an Info or HTML hyperlinked document).

I supposed that the most appropriate format to be studying the source code
in was the hyperlinked document.  However, this just did not match how I am
used to program.  I'm used to scrolling up and down in my text editor of
choice, not reading source code in an Info browser.

I often wanted to look at the produced .awk file, but this was lacking
in the commentary that was in the literate source, some of which was relevant
to understanding the source code.  So I started moving text into comments,
and some point just kept on going and moved everything into comments, so to
jettison the literate source altogether.

I feel that the difficulty I had was often similar to the difficulty
of understanding a computer program that uses many global variables.
It is hard to understand code that uses global variables, as it is
hard to reason about what these variables mean.  They could have acquired
their values in many different parts of the code from the part that you
are currently looking at.  You can't just look at one part of the code
and understand how it works, you have to look all over the program wherever
that global variable is set to get an idea of what it means.

In literate programming, code is defined in "chunks" that likewise
reference other chunks.  Each chunk can set or read variables that
are set or read in other chunks.

I will try to illustrate this with an example.  Here is the chunk
definition in literate texindex for reading a line from an input
file:

    <Process a record> ≡
         {
             <Remove duplicates>
             <Remove leading ‘\entry’ or ‘@entry’>
             <Get the initial>
             <Set up ‘fields’ array with the data>
             <Name the fields>
             <Store the data for this line in the various arrays>
             <Check for more than one initial>
         }

This looks alright as an overview of the process, but computer code
is a functional object that you have to understand the workings of
in detail.  Each one of these lines can set variables that are accessed
inside other lines.  The top-level overview doesn't show what these
variables are or what the data flow is between the chunks.

Here is part of the literate document containing the definition for one
of these chunks, <Store the data for this line in the various arrays>:

    ...

       In addition to all the previously described arrays, the key is stored
    in the ‘Keys’ array the first time it is seen; this array is sorted
    later on.  Its indices are just incremented integers, stored in the
    global ‘Entries’ variable.  The ‘Allkeys’ associative array lets us
    easily track if we have seen a key before.
    
    <Store the data for this line in the various arrays> ≡
         if (! (key in Allkeys)) {
             # first time we've seen this full line
    ...

To understand this, you have to understand what the "key" variable is.
This variable is defined in the previous chunk, which appears in a
different node in the Info file.

    <Name the fields> ≡
         key = fields[1]
         pagenum = fields[2]
         primary_text = fields[3]
         secondary_text = (numfields > 3 ? fields[4] : "")
         tertiary_text = (numfields > 4 ? fields[5] : "")

As the assignation and the use of the variable are split up across
nodes, it is harder to follow.

I feel like this could cause a problem when trying to track down bugs in
a program.  Often you are trying to narrow down the problem and you would
do this by reasoning about where variables may have got their values and
where the problematic values were introduced - this is not so easy when
it's hard to tell where variables could be assigned to as the code is
put together via textual inclusion.

Crucial to the change I wanted to make was the "Allkeys" variable.  This
was used for detecting duplicate index entries.  So it was important for me
to understand how this variable was set and how it was used in the program.

In my favourite text editor, I would search for "Allkeys" in the file and
scroll between all the matches.  This does not work as well in the Info
browser where the functional parts of the program are split across different
nodes.

I was able to make the change I wanted to once I could get all the code
in a single file with relevant comments, so I could work on the program as
I am used to.  As the "Allkeys" variable (and others) are global variables
(due to limitations of the Awk language, I believe, as this language is
lacking in data structures), I had to look at the whole program in its
totality, which would have been hard for me to do with the "peephole"
approach of navigating the code in an Info browser.

I also found that the literate source, would add an extra layer of
description that would not be necessary in a conventionally written
computer program with comments.  A few excerpts from the literate source
may illustrate this.

    "The initial setup sets up some constants, including the version of the
    program."
    
    "Argument processing is straightforward, though manual."
    
    "Upon end of input, the processing is straightforward: sort the entries
    and write them out.  Additionally, if we are printing the initial,
    handle that.  (That printing task is delegated to a small function.)"

Such information often does not need to be explicitly stated: when a
programmer is studying source code to understand how a program works or to
make modifications, they can often see for themselves how straightforward
the code is, or what the structure of the code is.

This is not intended as a criticism of the program itself (or the
programmer) as I believe this verbosity is encouraged by the format.  The
writer feels obliged to add some commentary around blocks of code in
the literate source.


Reply via email to