On Thu, 13 Nov 2025 at 04:47, Pádraig Brady <[email protected]> wrote:
> Yes this is a known issue which we're gradually getting to.

Dealing with *just* alphabetic scripts is relatively easy, but in
general the rules for flowing Unicode text into paragraphs are
considerably more complicated than for plain ASCII.

What's the plan for handling double-width, zero-width, and combining
characters? "Shy" hyphens? Scripts that don't put spaces between words?
Non-breaking and non-joiner codepoints, additional line & paragraph
terminators, etc.

Combining characters follow rather than precede the principal character
in the data stream, so scanning would need to continue even after the
line is apparently "full" to ensure that they're included.

I guess this should be coordinated with bug#79631 (UTF-8 support in the
"cut" utility), at least in terms of documenting whether the count
applies to code-points, to composed characters, or to cells (0, 1 or 2
per composed character); if counting cells, it should document that the
number of cells would be rounded down because double-width characters
can't be split.

-Martin
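
As a rough illustration of the three counting units mentioned above
(code points, composed characters, cells), here is a minimal Python
sketch. It is not how coreutils does or would do it; the helper names
(cell_width, clusters, counts, take_cells) are made up for this example,
and the width rules are a deliberate simplification: real grapheme
segmentation (UAX #29) and terminal width tables are more involved.

```python
import unicodedata

def cell_width(ch):
    """Approximate display cells for one code point: 0, 1 or 2.

    Assumption: combining marks and format characters (ZWJ, ZWNJ,
    soft hyphen, ...) take no cell; East Asian Wide/Fullwidth take two.
    """
    if unicodedata.combining(ch) or unicodedata.category(ch) in ("Mn", "Me", "Cf"):
        return 0
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2
    return 1

def clusters(text):
    """Attach each zero-width code point to the character before it."""
    out = []
    for ch in text:
        if out and cell_width(ch) == 0:
            out[-1] += ch      # combining marks FOLLOW their base
        else:
            out.append(ch)
    return out

def counts(text):
    """Return (code points, composed characters, cells) for text."""
    return (len(text), len(clusters(text)), sum(cell_width(c) for c in text))

def take_cells(text, limit):
    """Longest prefix of whole clusters fitting in `limit` cells.

    Returns (prefix, cells_used). cells_used may be less than limit,
    because a double-width character cannot be split.
    """
    kept, used = [], 0
    for cl in clusters(text):
        w = sum(cell_width(c) for c in cl)
        if used + w > limit:
            break
        kept.append(cl)
        used += w
    return "".join(kept), used
```

With this sketch, counts("e\u0301") distinguishes 2 code points from
1 composed character and 1 cell; take_cells("e\u0301x", 1) keeps the
combining accent even though the line was already "full" after the "e";
and take_cells("日本語", 5) uses only 4 cells, i.e. the count is rounded
down.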
