One of the threats to a project's health is bloat, or scope creep. I
feel like it is an appropriate moment, following the Texinfo release,
to report on some ways that the Texinfo project could be seen as bloated.
(You could also say that considering big-picture issues is a pleasure
we can allow ourselves after the time spent considering small-picture issues.)
This could be of use to current or future developers of the project.
I've come to think of the Unix philosophy of "do one thing, and do it
well" as a maxim of project management, as much as it is a criterion
for the beauty or utility of a technical product. If too much work
is funnelled through a single person, they will be spread too thin,
and drop things on the floor (to mix three metaphors).
A bloated project can die because the current developers feel daunted
by the prospect of making new changes, which become increasingly
difficult and time-consuming to make, while potential new developers
fail to get to grips with the code well enough to contribute within the
time and effort they are willing to put in.
Responsibilities and tasks grow while the enthusiasm, time and ability
of workers to fulfil them diminish.
A few years ago, Richard Stallman wrote the following on a private
mailing list, in response to another email I wrote about the problems
of project maintenance:
> An idea that occurs to me is to make it easy to carve out a niche
> within the project which is modular enough that a few contributors can
> do a lot of work inside it without having to coordinate very often
> about things outside that niche.
(2021-01-02)
I think that's worth thinking about.
Readers may have had the experience themselves of attempting to work on
a code base that appears to have been abandoned by its original developers,
and failing to make headway on it in any reasonable time.
The Texinfo project is not dead at the moment, but it may still become
more and more bloated over time.
Hence, I have been keeping a list of aspects of the project that could be
considered bloat.
* The infog/ document browser embedding a WebKitGTK web browser, for viewing
local HTML documentation. The existing developers clearly don't have time to
work on this, and it seems unlikely that it would ever be used by many
people. Instead, we should work on defining standards for Texinfo HTML
documentation and promote the installation of local HTML manuals and
reading them with existing help programs (see TODO.HTML in the Texinfo
repository for some notes).
* The browsing interface under js/, including the info.js file (enabled
with the INFO_JS_DIR option to Texinfo). The existing Texinfo developers
very likely don't have the time or interest to become expert JavaScript
programmers or web developers, and again, this part of the code base sees
minimal development. Although this code does have a small number of users,
it would be better developed outside of the Texinfo project by JavaScript
enthusiasts. Again, defining the format of the HTML output of texi2any
would help in making a separation here.
* The Texinfo language reference card. It is one more thing to keep
updated, although there's no indication that many people look at it.
The reference card is actually five cards, as the document runs across
five pages.
* The pod2texi program. I suppose the purpose of this program is clear:
it converts POD to Texinfo. I've never used it and never touched the
sources. It does not seem to be getting in the way of other development,
so can probably be left alone.
* In the Texinfo language, there are several commands that are unnecessary.
@definfoenclose is the worst one, as it allows defining new Texinfo
commands, which adds significant complexity. (When I was writing the
Parsetexi C code and reading Texinfo/Parser.pm for the first time, I
kept asking myself: what is this definfoenclose feature I keep seeing
referred to all over the code?) Unfortunately, we don't want to break
the building of old versions of documentation unnecessarily, so we
should only remove commands that were never used by anybody in the
first place (I believe @clickstyle may have been in this category).
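To illustrate the complexity, @definfoenclose lets a manual define its
own markup commands, so the parser must be prepared for command names
it has never seen before. The sketch below follows the example in the
Texinfo manual (the command name @phoo is the manual's invented
example, not a real command):

```texinfo
@c Define a new command @phoo whose argument is enclosed
@c between // and \\ in Info output.
@definfoenclose phoo, //, \\

Here is a @phoo{sample} of the user-defined command;
in Info output it appears as //sample\\.
```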
Many of my comments relate to conversion of Texinfo:
* First, there are two main converters: texinfo.tex with TeX, and texi2any.
Implementing significant new changes in syntax using TeX is often difficult.
TeX is simply an inappropriate system for many aspects of text processing.
On the plus side, conversion of Texinfo manuals with TeX is very well
understood, and I usually find it quick to fix problems with texinfo.tex.
It helps that TeX is a stable, limited and self-contained system, and
eventually somebody working with a system built on top of TeX will get
to a sufficient level of understanding (which probably can't be said of
the LaTeX ecosystem or JavaScript frameworks).
It's not clear what would replace texinfo.tex (as, indeed, it is not
clear what could replace TeX). The LaTeX output from texi2any is not
a sufficient replacement in my opinion: for one thing, it has more
dependencies. Moreover, it adds an extra layer of indirection to the
output, making it harder to fine-tune the final result.
One idea is to make texi2any output plain TeX code that doesn't use
TeX macros, which would eliminate the macro-programming side of TeX
(no more \expandafter). This way we would still avoid having to output
PDF directly from texi2any, implementing our own line-breaking
algorithm, and so on.
I am not advocating for this change, and there isn't any imminent need
for it, but I mention it here as a possibility for the future.
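To sketch the idea (purely hypothetically; no such output mode
exists), texi2any would do all command expansion itself and emit only
low-level typesetting instructions, leaving TeX's paragraph builder to
do what it is good at:

```tex
% Hypothetical macro-free output: every Texinfo command has
% already been expanded by texi2any, leaving only primitive-level
% plain TeX typesetting instructions.
\font\chapfont=cmbx12
{\chapfont 1 Overview}\par
\vskip\baselineskip
Body text with an {\it emphasized\/} phrase, line-broken by
TeX's own paragraph-breaking algorithm.\par
\bye
```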
* texi2any. This is the component of the code that sees the most
development, and which has the most complexity.
- First, it could be its own project, separate from the other programs
in Texinfo like info and install-info. It already has its own
configure.ac and gnulib checkout so would cleanly separate from
the rest of the code.
(I don't see any need for such a separation, but am mentioning it
as a possibility as that is something that could potentially lead to
more focused development.)
However, it's coupled to texinfo.tex in terms of the Texinfo language.
texinfo.tex is then coupled to the texindex and texi2dvi programs.
The info program is coupled to the output of texi2any (e.g. supporting
changes to the Info output like INFO_MATH_IMAGES that we recently
implemented). So it's probably not sensible to split things up.
- It has two versions for much of its code - C and Perl. This has
been discussed previously. Needless to say, this more than doubles
the maintenance burden.
- Features of texi2any that could be considered bloat:
* The XML format output with --xml. This is not actually useful
for anything.
The Texinfo manual suggests that users might want to use this as a
starting point for conversion to other formats, but I'd actually
rather they didn't, as it means that we have to maintain the XML
converter; it would be better to have such converters built into
texi2any like the other output formats.
It's not a huge maintenance burden, except remembering to update
the DTD when the language changes, and the time spent running
the tests for XML output, of course.
* The IXIN output format. This format is not likely to see any
further development, especially following the sad death of
its creator Thien-Thi Nguyen. This may not be an issue as the
IXIN conversion code is now specifically in an "Example"
subdirectory.
* In HTML output, the option to create a special "About" page
with the DO_ABOUT variable does not seem useful.
* The SORT_ELEMENT_COUNT variable does not seem useful. The manual
describes it as follows:
> If set, the name of a file to which a list of elements (nodes or
> sections, depending on the output format) is dumped, sorted by the
> number of lines they contain after removal of @-commands; default
> unset. This is used by the program ‘texi-elements-by-size’ in the
> ‘util/’ directory of the Texinfo source distribution (*note
> texi-elements-by-size::).
* The converter spends time and uses memory building "source marks"
with details of expanded macros and included source files. This
information is not used in the output conversion, unless converting
back to the Texinfo source it started with. It's possible that
some of this processing could be made optional for efficiency
(I haven't investigated in detail how this could be accomplished).
* The Texinfo::Reader interface (new in 7.3). tta/README in the
Texinfo sources explains:
> The modules in perl/Texinfo/Example are not developped anymore. Docbook
> conversion modules in this directory were developped using an interface
> consisting of Texinfo::Reader, Texinfo::TreeElement and
> Texinfo::Example::TreeElementConverter as a proof of concept. However, this
> interface proved to be too slow in Perl and difficult to implement with XS
> code. The Reader and TreeElement interface (except for one function) are not
> used from Perl anymore. Going forward, the SWIG interface based on the
> Reader, Parser, Structuring and Texinfo Document C codes should
> be used. The SWIG interface is in the swig directory. Texinfo::Reader and
> Texinfo::TreeElement (except for the 'new' function) should not be used
> anymore.
* The --transliterate-file-names feature. This feature was only just
turned off by default in the recent release. It entails bundling
the Text::Unidecode Perl module with Texinfo, which, although only
1.3MB when extracted, bloats the directory listing (e.g. the output
of "tar tf") with 289 files - you may or may not agree that this
is a major problem.
* The Unicode::CollateStub replacement in perl/Texinfo/Indices.pm
is only needed on Red Hat-like systems where Unicode::Collate is not
installed. During recent pre-release testing, this code had a problem,
which we fixed. It isn't tested regularly.
- The texi2any API is a major source of potential bloat.
Here's what I wrote in a private mail to Patrice on 2025-06-29:
> As you know my concern about using different languages for texi2any within
> the Texinfo project is one of simplicity and long-term maintainability.
>
> My concern about user code is one of maintainability. How well can we change
> the internals of texi2any if there is a lot of user code relying on internal
> details?
>
> For example, Linux famously does not have a stable ABI for device
> drivers:
>
> > You think you want a stable kernel interface, but you really do not,
> > and you don’t even know it. What you want is a stable running driver,
> > and you get that only if your driver is in the main kernel tree.
> >
> > ...
> >
> > As such, the kernel developers find bugs in current interfaces, or
> > figure out a better way to do things. If they do that, they then fix the
> > current interfaces to work better. When they do so, function names may
> > change, structures may grow or shrink, and function parameters may be
> > reworked. If this happens, all of the instances of where this interface
> > is used within the kernel are fixed up at the same time, ensuring that
> > everything continues to work properly.
>
> https://www.kernel.org/doc/html/latest/process/stable-api-nonsense.html
>
> (Of course, texi2any could be different for some reason.)
>
> Here's another argument against function call interfaces as a stable
> interface:
>
> > Remote Procedure Calls
> >
> > ...
> >
> > As a related issue, interfaces that have richer type signatures also
> > tend to be more complex, therefore more brittle. Over time, they tend
> > to succumb to ontology creep as the inventory of types that get passed
> > across interfaces grows steadily larger and the individual types more
> > elaborate. Ontology creep is a problem because structs are more likely
> > to mismatch than strings; if the ontologies of the programs on each side
> > don't exactly match, it can be very hard to teach them to communicate
> > at all, and fiendishly difficult to resolve bugs. The most successful
> > RPC applications, such as the Network File System, are those in which
> > the application domain naturally has only a few simple data types.
> >
> > The usual argument for RPC is that it permits “richer” interfaces
> > than methods like text streams — that is, interfaces with a more
> > elaborate and application-specific ontology of data types. But the Rule
> > of Simplicity applies! We observed in Chapter 4 that one of the functions
> > of interfaces is as choke points that prevent the implementation details
> > of modules from leaking into each other. Therefore, the main argument
> > in favor of RPC is also an argument that it increases global complexity
> > rather than minimizing it.
> >
> > With classical RPC, it's too easy to do things in a complicated and
> > obscure way instead of keeping them simple. RPC seems to encourage the
> > production of large, baroque, over-engineered systems with obfuscated
> > interfaces, high global complexity, and serious version-skew and
> > reliability problems — a perfect example of thick glue layers run amok.
>
> http://www.catb.org/esr/writings/taoup/html/ch07s03.html
The SWIG interface, in providing an API for more programming languages,
seems to make the potential problems with API stability worse.
As far as I know, there are only two or three projects which use the
texi2any API (Lilypond and ffmpeg among them). It seems that with every
release there are changes to the API which need fixes in these other
packages.
For example, after the Texinfo 7.2 release, the ffmpeg build broke:
https://www.linuxquestions.org/questions/slackware-14/texinfo-7-2-looks-to-have-broken-texinfo-convert-html-4175745581/
I wrote at the time:
> Such breakages seem inevitable as extension code could rely on many
> details of internal texi2any code. The new version of Texinfo is
> then flagged as responsible for breaking compatibility.
>
> This only stays manageable as long as the number of packages relying
> on the Perl customization API stays low.
If more packages start using the texi2any API, it will be a further
source of breakage and even more work to go and find these packages
to fix their customization code when a new release is made.
A lesser problem is that the API documentation takes quite a long time
to build and upload to the GNU website when doing a new release,
because there is so much of it.
Instead of promoting and expanding the use of API programming
facilities, I think it would be better to find out what users are
using the API for and design built-in features that support what they
want to do.