Follow-up Comment #4, bug #66675 (group groff):

At 2025-01-16T21:05:03-0500, Dave wrote:
> Follow-up Comment #2, bug #66675 (group groff):
>
> [comment #1 comment #1:]
>> \[u...] means an unicode character.
>
> It was permitted to mean other things for the past at least 20 years
> and I bet a lot longer.
I've always wondered why they weren't spelled exactly as Unicode does,
like `\[U+002D]`, but that's water under the bridge. groff's naming
scheme for Unicode code points indeed has a long history.
> $ nroff --version
> GNU nroff (groff) version 1.19.2
> $ printf '.char \[unhappy] :-(\nI feel \[unhappy] today.\n' | nroff | cat -s
> I feel :‐( today.
>
> If this is an intentionally breaking change (a la bug #66673), it
> needs to be documented as such.
Nope, not intentional. Getting back to the original report...
At 2025-01-16T16:54:20-0500, Dave wrote:
> Date: Thu 16 Jan 2025 03:54:17 PM CST By: Dave <barx>
> From at least groff 1.19.2 through 1.23, this command produced nothing
> on stdout or stderr:
>
> echo '.char \[unhappy] :-(' | groff
>
> This is as it should be: the code merely defines a perfectly
> legitimate character and does nothing with it.
Agreed.
> The latest groff build produces two diagnostics on stderr:
>
> troff:<standard input>:1: error: special character 'unhappy' is
> invalid: Unicode special character sequence has non-hexadecimal digit
> 'n'
>
> troff:<standard input>:1: error: bad character definition
Uh-oh. Mea culpa.
> This erroneous error was introduced sometime after August 11. I blame an
> overzealous [http://git.savannah.gnu.org/cgit/groff.git/commit/?id=d29abf70a
> commit d29abf70a].
That's near to but not exactly where I'd place the blame.
What you link to is some new/refactored code associated with my titanic
struggle to land Unicode-rich PDF bookmarks (and device extension
command arguments generally).
At the point `valid_unicode_code_sequence()` is called, the caller
_knows_ -- or is supposed to know -- that it is expecting a Unicode
code sequence.
Possibly even a composite one.
groff_char(7):
Unicode code points can be composed as well; when they are, GNU
troff requires NFD (Normalization Form D), where all Unicode
glyphs are maximally decomposed. (Exception: precomposed
characters in the Latin‐1 supplement described above are also
accepted. Do not count on this exception remaining in a future
GNU troff that accepts UTF‐8 input directly.) Thus, GNU troff
accepts “caf\['e]”, “caf\[e aa]”, and “caf\[u0065_0301]”, as
ways to input “café”.  (Due to its legacy 8‐bit encoding
compatibility, at present it also accepts “caf\[u00E9]” on ISO
Latin‐1 systems.)
\[ubase‐char[_combining‐component]...]
constructs a composite glyph from Unicode numeric special
character escape sequences. The code points of the base
glyph and the combining components are each expressed in
hexadecimal, with an underscore (_) separating each
component. Thus, \[u006E_0303] produces “ñ”.
The problem, I suspect, is that I neglected to add logic to the
character definition request handlers. They should call that function
only if they truly require a valid Unicode code point sequence, and, of
course, they don't. You can `.char \[unhappy] :-(` all day long as far
as they're concerned.
The action item for this ticket is to check all of
`valid_unicode_code_sequence()`'s call sites. Ensure that all are
necessary and/or guarded by appropriate conditionals.
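For what the guard might look like, here is a hypothetical sketch in Python (groff itself is C++; all names here are illustrative, and the hex-digit pattern is simplified): the strict validator should run only when a name actually has the shape of a Unicode special character sequence, so an ordinary name like `unhappy` -- a `u` followed by a non-hex digit -- is never handed to it.

```python
import re

# Illustrative pattern: 'u' followed by underscore-separated runs of
# hex digits.  (This simplifies groff's exact rules on digit case and
# code point length.)
_UNICODE_SEQ = re.compile(r"u[0-9A-Fa-f]{4,6}(?:_[0-9A-Fa-f]{4,6})*\Z")

def looks_like_unicode_sequence(name: str) -> bool:
    """True only for names shaped like Unicode code sequences."""
    return _UNICODE_SEQ.match(name) is not None

def define_char(name: str) -> bool:
    """Toy model of a .char request handler: validate only if needed."""
    if looks_like_unicode_sequence(name):
        # ...only here would the equivalent of
        # valid_unicode_code_sequence() be consulted...
        pass
    # Any well-formed name, Unicode-shaped or not, defines a character.
    return True
```

Under this model, `looks_like_unicode_sequence("unhappy")` is false (the `n` after `u` is not a hex digit), so `.char \[unhappy]` sails through, while `u006E_0303` still gets the strict check.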
_______________________________________________________
Reply to this item at:
<https://savannah.gnu.org/bugs/?66675>
_______________________________________________
Message sent via Savannah
https://savannah.gnu.org/