bug#79824: fmt not correctly process text with UTF-8 characters encoding

Martin D Kealey Sun, 16 Nov 2025 18:11:34 -0800

On Thu, 13 Nov 2025 at 04:47, Pádraig Brady <[email protected]> wrote:


> Yes this is a known issue which we're gradually getting to.
>

Dealing with *just* alphabetic scripts is relatively easy, but in general
the rules for flowing Unicode text into paragraphs are considerably more
complicated than for plain ASCII.

What's the plan for handling double-width, zero-width, and combining
characters?
"Shy" (soft) hyphens?
Scripts that don't put spaces between words?
Non-breaking and non-joiner code points, additional line and paragraph
terminators, etc.?

Combining characters follow rather than precede the principal character in
the data stream, so scanning would need to continue even after the line is
apparently "full" to ensure that they're included.
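
To illustrate the point about scanning past a "full" line, here is a
minimal Python sketch (not how fmt does or will do it). It assumes that
every non-combining code point occupies one cell and that "combining"
means a non-zero canonical combining class; take_cells is a hypothetical
helper name.

```python
import unicodedata

def take_cells(text, width):
    """Split text into (head, rest), where head fills at most `width` cells.

    Assumption: each non-combining code point is one cell wide, and
    combining marks (non-zero canonical combining class) are zero cells.
    """
    used = 0
    i = 0
    while i < len(text):
        ch = text[i]
        w = 0 if unicodedata.combining(ch) else 1
        # A zero-width mark never overflows, so the scan keeps going
        # after the line is "full" and picks up trailing marks.
        if used + w > width:
            break
        used += w
        i += 1
    return text[:i], text[i:]

head, rest = take_cells("cafe\u0301 au lait", 4)
print(repr(head), repr(rest))
```

With a width of 4 the head still ends with U+0301, so the accent stays
attached to the "e" it modifies.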

I guess this should be coordinated with bug#79631 (UTF-8 support in the
"cut" utility), at least in terms of documenting whether the count applies
to code-points, to composed characters, or to cells (0, 1 or 2 per composed
character); if counting cells, it should document that the number of cells
would be rounded down because double-width characters can't be split.
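
The three possible units can be compared with a rough Python sketch.
It is illustrative only: "composed character" is approximated here as
a base code point plus its combining marks (not full grapheme-cluster
segmentation), width is guessed from the East Asian Width property,
and cell_width, counts and truncate_to_cells are hypothetical names.

```python
import unicodedata

def cell_width(ch):
    """Approximate display cells for one code point: 0, 1 or 2."""
    if unicodedata.combining(ch):
        return 0
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2
    return 1

def counts(text):
    """Return (code points, composed characters, cells) for text."""
    code_points = len(text)
    # Approximation: every non-combining code point starts a new character.
    composed = sum(1 for ch in text if not unicodedata.combining(ch))
    cells = sum(cell_width(ch) for ch in text)
    return code_points, composed, cells

def truncate_to_cells(text, limit):
    """Keep as many leading characters as fit in `limit` cells.

    A double-width character that would straddle the limit is dropped,
    so the cells actually used can be one less than `limit`.
    """
    used = 0
    out = []
    for ch in text:
        w = cell_width(ch)
        if used + w > limit:
            break
        out.append(ch)
        used += w
    return "".join(out), used

print(counts("e\u0301\u65e5\u672c"))                # e + combining acute + two CJK
print(truncate_to_cells("\u65e5\u672c\u8a9e", 5))   # three double-width characters
```

In the second call a limit of 5 cells yields only 4 cells of output,
which is the rounding-down behaviour that would need documenting.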

-Martin