Re: uppercase german umlaut
On 2/5/24, hoh...@posteo.de wrote:
> On Tue, 9 Jan 2024 01:13:45 -0600
> Dave Kemper wrote:
>
>> In the message to which I was replying, you were speaking of the
>> sequence of bytes that were part of the input to gpic; in this realm,
>> ECMA-48 is irrelevant. And in any case, the 0x84 byte in question is
>> part of the UTF-8 encoding of Unicode character U+00C4 LATIN CAPITAL
>> LETTER A WITH DIAERESIS; if it's being interpreted by a terminal
>> somewhere as ECMA-48, something is going wrong.
>>
>> What seems to be going wrong in this instance is that you're passing
>> UTF-8 directly to gpic without first running it through preconv or
>> iconv, resulting in a byte sequence gpic doesn't recognize. You
>> haven't said whether you've tried converting the input before sending
>> it to gpic, or why you're avoiding preconv.
>
> I quote myself:
> "The character emerges from an input file name. So it is missed by
> preconv somewhere, ..."

Since you haven't said what your pipeline is, I can't debug what
preconv is missing or why. But in general if you're doing something
like:

    someprog | gpic

where "someprog" is outputting UTF-8, then you should change the
pipeline to:

    someprog | preconv -eutf8 | gpic

Like all groff tools, gpic will not recognize UTF-8 input. The
encoding has to be converted before gpic sees it.

> You completely miss the point of the utf8 sequence "ä" passes while
> "Ä" issues.

I didn't miss this. Lennart explained this in his December 28 reply
in this thread, and I reiterated it in my December 29 reply, and again
in my January 2 reply. In short: UTF-8 "ä" in a Latin-1 context is
interpreted as two Latin-1 characters, whereas UTF-8 "Ä" in a Latin-1
context is one Latin-1 character and one invalid (to groff tools)
control character.
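The two-characters-versus-one-plus-invalid distinction can be illustrated outside groff. This is a rough Python sketch, not groff's own logic: it assumes, as stated in this thread, that bytes 0x80 through 0x9f are the ones rejected, and `classify_latin1` is a hypothetical helper name.

```python
import unicodedata

def classify_latin1(data):
    """Describe each byte the way a Latin-1 reader would see it."""
    out = []
    for b in data:
        if 0x80 <= b <= 0x9F:
            # Assumption from this thread: this range is rejected as invalid.
            out.append("0x%02x: invalid (C1 control range)" % b)
        else:
            # In Latin-1, byte value and Unicode code point coincide.
            out.append("0x%02x: %s" % (b, unicodedata.name(chr(b), "UNNAMED")))
    return out

for text in ("ä", "Ä"):
    raw = text.encode("utf-8")
    print(raw.hex(" "), "->", classify_latin1(raw))
```

Both inputs start with 0xc3; only the second byte decides whether a complaint is raised, and in neither case is the intended umlaut what a Latin-1 reader sees.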
Re: uppercase german umlaut
On 1/8/24, hoh...@posteo.de wrote:
> On Tue, 2 Jan 2024 11:04:25 -0600
> Dave Kemper wrote:
>
>> > ECMA-48 says for 0x84:
>>
>> Also irrelevant to groff, as it doesn't use ECMA-48. Groff tools
>> (including gpic) take input in Latin-1, period.
>
> I don't think so. ECMA-48 may be interpreted by terminals.

In the message to which I was replying, you were speaking of the
sequence of bytes that were part of the input to gpic; in this realm,
ECMA-48 is irrelevant. And in any case, the 0x84 byte in question is
part of the UTF-8 encoding of Unicode character U+00C4 LATIN CAPITAL
LETTER A WITH DIAERESIS; if it's being interpreted by a terminal
somewhere as ECMA-48, something is going wrong.

What seems to be going wrong in this instance is that you're passing
UTF-8 directly to gpic without first running it through preconv or
iconv, resulting in a byte sequence gpic doesn't recognize. You
haven't said whether you've tried converting the input before sending
it to gpic, or why you're avoiding preconv.

> In the case of terminal output, those characters if interpreted as
> control sequences would throw the output into disarray. Therefore,
> if I'm right, it's rejected as invalid but not passed through.

Correct, gpic won't pass through bytes it considers invalid.

    $ echo Ä | od -t x1
    000 c3 84 0a
    003
    $ echo Ä | pic | grep -av '^\.' | od -t x1
    pic::1: invalid input character code 132
    000 c3 0a
    002

gpic strips the 0x84 (decimal 132) byte, leaving you with invalid
UTF-8, or valid but erroneous Latin-1.
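The "invalid UTF-8, or valid but erroneous Latin-1" outcome can be checked independently of groff. The Python sketch below is only a model of the filtering, assuming that the dropped bytes are those in the range 0x80 through 0x9f; `strip_c1` and `is_valid_utf8` are hypothetical helper names, not groff functions.

```python
def strip_c1(data):
    """Drop bytes 0x80-0x9f (assumed model of the filtering, not groff code)."""
    return bytes(b for b in data if not 0x80 <= b <= 0x9F)

def is_valid_utf8(data):
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

original = "Ä\n".encode("utf-8")   # bytes c3 84 0a
remaining = strip_c1(original)     # bytes c3 0a

print(remaining.hex(" "))
print("valid UTF-8:", is_valid_utf8(remaining))
# Every byte string decodes as Latin-1; here the result is U+00C3, not U+00C4.
print("as Latin-1:", ascii(remaining.decode("latin-1")))
```

The lone 0xc3 is a UTF-8 lead byte with no continuation byte after it, which is why a UTF-8 decoder rejects what is left.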
Re: uppercase german umlaut
On Tue, 2 Jan 2024 11:04:25 -0600
Dave Kemper wrote:

> > ECMA-48 says for 0x84:
>
> Also irrelevant to groff, as it doesn't use ECMA-48. Groff tools
> (including gpic) take input in Latin-1, period.

I don't think so. ECMA-48 may be interpreted by terminals.

In the case of terminal output, those characters if interpreted as
control sequences would throw the output into disarray. Therefore,
if I'm right, it's rejected as invalid but not passed through.
Re: uppercase german umlaut
[moving this back to the thread where it belongs]

On 1/2/24, hoh...@posteo.de wrote:
> If gpic gets Ä (0xc3 0x84) it complains about 0x84.
> If gpic gets ä (0xc3 0xa4) it does not complain about 0xa4.

True, but irrelevant, because *in neither case will the character be
interpreted the way you intend*.

gpic will consider 0xc3 0x84 a valid Latin-1 character (LATIN CAPITAL
LETTER A WITH TILDE) and an invalid character.

gpic will consider 0xc3 0xa4 two valid Latin-1 characters (LATIN
CAPITAL LETTER A WITH TILDE and CURRENCY SIGN).

What you're trying to send to gpic in your two examples is LATIN
CAPITAL LETTER A WITH DIAERESIS and LATIN SMALL LETTER A WITH
DIAERESIS. But if those are sent as UTF-8 to gpic, it will not
interpret them as you want. To get what you want, you need to convert
your input to Latin-1, or run it through preconv before gpic.

> ECMA-48 says for 0x84:

Also irrelevant to groff, as it doesn't use ECMA-48. Groff tools
(including gpic) take input in Latin-1, period. (Pure ASCII, being a
subset of Latin-1, is also valid.) Any bytes that aren't Latin-1
characters are illegal input to all groff tools. The only exception
is preconv, which recognizes various encodings and converts them to
pure ASCII, with all non-ASCII characters being converted to groff
escape sequences.

> If you want to know why I ignore preconv, read the last mail.)

I don't recall a previous message giving a reason for this, but if you
don't use preconv (or convert input to Latin-1 by some means), you're
not going to get what you want.
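To make the "non-ASCII characters become escape sequences" idea concrete, here is a minimal Python sketch of that kind of conversion. It is not preconv's implementation and ignores everything else preconv does (encoding detection, the leading `.lf` line, and so on); `to_groff_escapes` is a hypothetical name, and the only escape form assumed is the `\[uXXXX]` one that preconv is shown emitting elsewhere in this thread.

```python
def to_groff_escapes(text):
    # Keep ASCII as-is; write every other character as a \[uXXXX] escape.
    # Illustrative only: this is not how preconv itself is written.
    parts = []
    for ch in text:
        cp = ord(ch)
        if cp < 0x80:
            parts.append(ch)
        else:
            parts.append("\\[u%04X]" % cp)
    return "".join(parts)

# The input must first be decoded from its real encoding (UTF-8 here).
decoded = b"\xc3\x84".decode("utf-8")
print(to_groff_escapes(decoded))
```

The point of the sketch is the ordering: decoding from UTF-8 happens first, so the tool downstream only ever sees pure ASCII.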
Re: uppercase german umlaut
On 12/28/23, holger.herrl...@posteo.de wrote:
> echo ä | gpic | hexStream
> 0x2e 0x69 0x66 0x20 0x21 0x64 0x50 0x53 | .if !dPS
> 0x20 0x2e 0x64 0x73 0x20 0x50 0x53 0x0a | .ds PS.
> 0x2e 0x69 0x66 0x20 0x21 0x64 0x50 0x45 | .if !dPE
> 0x20 0x2e 0x64 0x73 0x20 0x50 0x45 0x0a | .ds PE.
> 0x2e 0x6c 0x66 0x20 0x31 0x20 0x2d 0x0a | .lf 1 -.
> 0xc3 0xa4 0x0a | ...
>
> echo Ä | gpic | hexStream
> gpic::1: invalid input character code 132
> 0x2e 0x69 0x66 0x20 0x21 0x64 0x50 0x53 | .if !dPS
> 0x20 0x2e 0x64 0x73 0x20 0x50 0x53 0x0a | .ds PS.
> 0x2e 0x69 0x66 0x20 0x21 0x64 0x50 0x45 | .if !dPE
> 0x20 0x2e 0x64 0x73 0x20 0x50 0x45 0x0a | .ds PE.
> 0x2e 0x6c 0x66 0x20 0x31 0x20 0x2d 0x0a | .lf 1 -.
> 0xc3 0x0a| ..
>
> The character emerges from an input file name. So it is missed by
> preconv somewhere,

As Lennart points out, the above pipelines don't invoke preconv at
all. But also the above examples don't come from a filename, so I
suspect your example is too simplified from your actual use case to
illustrate the problem. Do you have a command sequence that DOES
invoke preconv where UTF-8 characters are not being correctly handled?

> however why is 'ä' working properly/ just passed through?

It's not "working properly" in a sense that groff can handle. The
output above shows the ä is coming out as 0xc3 0xa4, which is the
UTF-8 encoding of the character. But were this to go into a groff
pipeline, it would interpret those two bytes as two Latin-1
characters, neither of which is ä.

(In the example you posted at the start of this thread, where the
0xc3 0xa4 went to the terminal, your terminal interpreted that
sequence as UTF-8 and displayed an ä. So it only looked "right"
because your input and output encodings matched.)

Your second example shows that pic is discarding the byte of Ä's
encoding it doesn't recognize as valid Latin-1. You can see this in
two ways: this byte is missing from your hexStream output, and pic
throws an error.

The only byte left, 0xc3, is a Latin-1 Ã, which is how groff would
interpret it. But your terminal, expecting UTF-8, would be unable to
output anything meaningful for this.
Re: uppercase german umlaut
Quoth holger.herrl...@posteo.de:
> echo ä | gpic | hexStream
> 0x2e 0x69 0x66 0x20 0x21 0x64 0x50 0x53 | .if !dPS
> 0x20 0x2e 0x64 0x73 0x20 0x50 0x53 0x0a | .ds PS.
> 0x2e 0x69 0x66 0x20 0x21 0x64 0x50 0x45 | .if !dPE
> 0x20 0x2e 0x64 0x73 0x20 0x50 0x45 0x0a | .ds PE.
> 0x2e 0x6c 0x66 0x20 0x31 0x20 0x2d 0x0a | .lf 1 -.
> 0xc3 0xa4 0x0a | ...
>
> echo Ä | gpic | hexStream
> gpic::1: invalid input character code 132
> 0x2e 0x69 0x66 0x20 0x21 0x64 0x50 0x53 | .if !dPS
> 0x20 0x2e 0x64 0x73 0x20 0x50 0x53 0x0a | .ds PS.
> 0x2e 0x69 0x66 0x20 0x21 0x64 0x50 0x45 | .if !dPE
> 0x20 0x2e 0x64 0x73 0x20 0x50 0x45 0x0a | .ds PE.
> 0x2e 0x6c 0x66 0x20 0x31 0x20 0x2d 0x0a | .lf 1 -.
> 0xc3 0x0a| ..
>
> The character emerges from an input file name. So it is missed by
> preconv somewhere, however why is 'ä' working properly/ just passed
> through?

You don’t seem to be running preconv. Are you?

gpic is reading from standard input the bytes c3 a4 (ä) or c3 84 (Ä).
It interprets those as Latin 1: c3 a4 is Ã ¤. c3 84 is Ã followed by
a control character. The control characters 80–9f are invalid.
Re: uppercase german umlaut
echo ä | gpic | hexStream
0x2e 0x69 0x66 0x20 0x21 0x64 0x50 0x53 | .if !dPS
0x20 0x2e 0x64 0x73 0x20 0x50 0x53 0x0a | .ds PS.
0x2e 0x69 0x66 0x20 0x21 0x64 0x50 0x45 | .if !dPE
0x20 0x2e 0x64 0x73 0x20 0x50 0x45 0x0a | .ds PE.
0x2e 0x6c 0x66 0x20 0x31 0x20 0x2d 0x0a | .lf 1 -.
0xc3 0xa4 0x0a | ...

echo Ä | gpic | hexStream
gpic::1: invalid input character code 132
0x2e 0x69 0x66 0x20 0x21 0x64 0x50 0x53 | .if !dPS
0x20 0x2e 0x64 0x73 0x20 0x50 0x53 0x0a | .ds PS.
0x2e 0x69 0x66 0x20 0x21 0x64 0x50 0x45 | .if !dPE
0x20 0x2e 0x64 0x73 0x20 0x50 0x45 0x0a | .ds PE.
0x2e 0x6c 0x66 0x20 0x31 0x20 0x2d 0x0a | .lf 1 -.
0xc3 0x0a| ..

The character emerges from an input file name. So it is missed by
preconv somewhere, however why is 'ä' working properly/ just passed
through?

On Wed, 27 Dec 2023 01:29:53 -0600
Dave Kemper wrote:

> On 12/26/23, holger.herrl...@posteo.de
> wrote:
> > echo Ä | gpic
> > .if !dPS .ds PS
> > .if !dPE .ds PE
> > .lf 1 -
> > gpic::1: invalid input character code 132
> > �
>
> Hi Holger,
>
> The paste above doesn't reveal what sequences of bytes your "echo" is
> outputting, but I deduce it's UTF-8, since "U+00C4 LATIN CAPITAL
> LETTER A WITH DIAERESIS" is encoded in UTF-8 as the two-byte hex
> sequence c3 84, the latter byte of which is 132 decimal, which is the
> number in your error message. This is what I get in a UTF-8
> environment:
>
> $ echo Ä | od -t u1
> 000 195 132 10
> 003
>
> Unfortunately, the groff toolchain doesn't speak UTF-8, only Latin-1
> (and expanding this is a longstanding wish-list item:
> http://savannah.gnu.org/bugs/?40720). So before pic sees the input,
> you'll have to convert it to a form pic understands.
>
> The most flexible way to do this is with groff's preconv tool, because
> this will convert a wide range of Unicode input into escapes that the
> groff tools understand.
>
> $ echo Ä | preconv -eutf-8
> .lf 1 -
> \[u00C4]
>
> If all your input falls into the Latin-1 range, you can instead use
> the system iconv command to convert everything to Latin-1 (a.k.a. ISO
> 8859-1).
>
> $ echo Ä | iconv -futf-8 -tiso-8859-1 | od -t u1
> 000 196 10
> 002
Re: uppercase german umlaut
On 12/26/23, holger.herrl...@posteo.de wrote:
> echo Ä | gpic
> .if !dPS .ds PS
> .if !dPE .ds PE
> .lf 1 -
> gpic::1: invalid input character code 132
> �

Hi Holger,

The paste above doesn't reveal what sequences of bytes your "echo" is
outputting, but I deduce it's UTF-8, since "U+00C4 LATIN CAPITAL
LETTER A WITH DIAERESIS" is encoded in UTF-8 as the two-byte hex
sequence c3 84, the latter byte of which is 132 decimal, which is the
number in your error message. This is what I get in a UTF-8
environment:

    $ echo Ä | od -t u1
    000 195 132 10
    003

Unfortunately, the groff toolchain doesn't speak UTF-8, only Latin-1
(and expanding this is a longstanding wish-list item:
http://savannah.gnu.org/bugs/?40720). So before pic sees the input,
you'll have to convert it to a form pic understands.

The most flexible way to do this is with groff's preconv tool, because
this will convert a wide range of Unicode input into escapes that the
groff tools understand.

    $ echo Ä | preconv -eutf-8
    .lf 1 -
    \[u00C4]

If all your input falls into the Latin-1 range, you can instead use
the system iconv command to convert everything to Latin-1 (a.k.a. ISO
8859-1).

    $ echo Ä | iconv -futf-8 -tiso-8859-1 | od -t u1
    000 196 10
    002
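For anyone without iconv at hand, the same re-encoding can be sketched in Python. This only mirrors the byte arithmetic of the od and iconv transcripts above and is no substitute for preconv; `utf8_to_latin1` is a hypothetical helper name.

```python
def utf8_to_latin1(data):
    """Re-encode UTF-8 bytes as ISO 8859-1 (Latin-1).

    Raises UnicodeEncodeError if a character falls outside Latin-1,
    in which case escapes of the kind preconv produces are needed.
    """
    return data.decode("utf-8").encode("iso-8859-1")

utf8_bytes = "Ä\n".encode("utf-8")
print(list(utf8_bytes))                  # decimal byte values, like od -t u1
print(list(utf8_to_latin1(utf8_bytes)))  # the same text as Latin-1 bytes
```

The two-byte sequence 195 132 collapses to the single byte 196, which is the form the groff tools expect for this character.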
uppercase german umlaut
gpic --version
GNU pic (groff) version 1.22.4

echo ä | gpic
.if !dPS .ds PS
.if !dPE .ds PE
.lf 1 -
ä

echo Ä | gpic
.if !dPS .ds PS
.if !dPE .ds PE
.lf 1 -
gpic::1: invalid input character code 132
�

This gpic error message first appeared after updating to Debian 12.
The system is not a fresh install.