Re: uppercase german umlaut

2024-02-05 Thread Dave Kemper
On 2/5/24, hoh...@posteo.de  wrote:
> On Tue, 9 Jan 2024 01:13:45 -0600
> Dave Kemper  wrote:
>
>> In the message to which I was replying, you were speaking of the
>> sequence of bytes that were part of the input to gpic; in this realm,
>> ECMA-48 is irrelevant.  And in any case, the 0x84 byte in question is
>> part of the UTF-8 encoding of Unicode character U+00C4 LATIN CAPITAL
>> LETTER A WITH DIAERESIS; if it's being interpreted by a terminal
>> somewhere as ECMA-48, something is going wrong.
>>
>> What seems to be going wrong in this instance is that you're passing
>> UTF-8 directly to gpic without first running it through preconv or
>> iconv, resulting in a byte sequence gpic doesn't recognize.  You
>> haven't said whether you've tried converting the input before sending
>> it to gpic, or why you're avoiding preconv.
>
> I quote myself:
> "The character emerges from an input file name. So it is missed by
> preconv somewhere, ..."

Since you haven't said what your pipeline is, I can't debug what
preconv is missing or why.  But in general if you're doing something
like:

someprog | gpic

where "someprog" is outputting UTF-8, then you should change the pipeline to:

someprog | preconv -eutf8 | gpic

Like all groff tools other than preconv, gpic will not recognize UTF-8
input.  The encoding has to be converted before gpic sees it.
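
For what it's worth, that conversion step can be wrapped in a small
shell function.  This is only a sketch: the name groff_safe is made up
for illustration, and the iconv fallback is an assumption that only
helps when every character in the input fits in Latin-1.

```shell
# groff_safe: hypothetical helper that turns UTF-8 on stdin into something
# gpic accepts. It prefers preconv (which emits \[uXXXX] escapes) and falls
# back to a plain Latin-1 conversion if preconv is not installed.
groff_safe() {
    if command -v preconv >/dev/null 2>&1; then
        preconv -e utf-8
    else
        iconv -f UTF-8 -t ISO-8859-1
    fi
}

# Intended use: someprog | groff_safe | gpic
# Here the UTF-8 bytes for "Ä" (octal 303 204) stand in for someprog.
printf '\303\204\n' | groff_safe
```

With preconv installed, the output should contain the \[u00C4] escape;
with the fallback it is the single Latin-1 byte 0xc4.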

> You completely miss the point that the UTF-8 sequence "ä" passes while
> "Ä" raises an error.

I didn't miss this.  Lennart explained this in his December 28 reply
in this thread, and I reiterated it in my December 29 reply, and again
in my January 2 reply.  In short: UTF-8 "ä" in a Latin-1 context is
interpreted as two Latin-1 characters whereas UTF-8 "Ä" in a Latin-1
context is one Latin-1 character and one invalid (to groff tools)
control character.
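
One way to see this without gpic is to let iconv treat the UTF-8 bytes
as Latin-1 and print the resulting code points.  A sketch, assuming an
iconv that knows the names ISO-8859-1 and UTF-16BE (glibc's does);
as_latin1 is just an illustrative name.

```shell
# as_latin1: read raw bytes on stdin, decode them as Latin-1, and print the
# resulting code points as big-endian 16-bit hex (four hex digits each).
as_latin1() {
    iconv -f ISO-8859-1 -t UTF-16BE | od -An -v -tx1 | tr -d ' \n'
    echo
}

printf '\303\244' | as_latin1   # UTF-8 "ä": 00c3 then 00a4, both printable
printf '\303\204' | as_latin1   # UTF-8 "Ä": 00c3 then 0084, a control code
```

In both cases the first code point is U+00C3 (A WITH TILDE); only the
second one differs, and 0x84 is the one that lands in the control range.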



Re: uppercase german umlaut

2024-01-08 Thread Dave Kemper
On 1/8/24, hoh...@posteo.de  wrote:
> On Tue, 2 Jan 2024 11:04:25 -0600
> Dave Kemper  wrote:
>
>> > ECMA-48 says for 0x84:
>>
>> Also irrelevant to groff, as it doesn't use ECMA-48.  Groff tools
>> (including gpic) take input in Latin-1, period.
>
> I don't think so. ECMA-48 may be interpreted by terminals.

In the message to which I was replying, you were speaking of the
sequence of bytes that were part of the input to gpic; in this realm,
ECMA-48 is irrelevant.  And in any case, the 0x84 byte in question is
part of the UTF-8 encoding of Unicode character U+00C4 LATIN CAPITAL
LETTER A WITH DIAERESIS; if it's being interpreted by a terminal
somewhere as ECMA-48, something is going wrong.

What seems to be going wrong in this instance is that you're passing
UTF-8 directly to gpic without first running it through preconv or
iconv, resulting in a byte sequence gpic doesn't recognize.  You
haven't said whether you've tried converting the input before sending
it to gpic, or why you're avoiding preconv.

> In the case of terminal output, those characters, if interpreted as
> control sequences, would throw the output into disarray. Therefore,
> if I'm right, it's rejected as invalid rather than passed through.

Correct, gpic won't pass through bytes it considers invalid.

$ echo Ä | od -t x1
000 c3 84 0a
003
$ echo Ä | pic | grep -av '^\.' | od -t x1
pic::1: invalid input character code 132
000 c3 0a
002

gpic strips the 0x84 (decimal 132) byte, leaving you with invalid
UTF-8, or valid but erroneous Latin-1.
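
Both halves of that statement can be checked with iconv alone.  A
sketch, assuming an iconv that exits non-zero on malformed input
(glibc's does); is_valid_utf8 is an illustrative name.

```shell
# is_valid_utf8: succeed only if stdin decodes cleanly as UTF-8.
is_valid_utf8() {
    iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1
}

# What went in: c3 84 0a is well-formed UTF-8.
printf '\303\204\n' | is_valid_utf8 && echo "c3 84 0a: valid UTF-8"

# What came out: c3 0a has a lead byte with no continuation byte.
printf '\303\n' | is_valid_utf8 || echo "c3 0a: not valid UTF-8"

# Read as Latin-1 instead, the same two bytes decode without complaint
# (0xc3 is A WITH TILDE), which is the "valid but erroneous" case.
printf '\303\n' | iconv -f ISO-8859-1 -t UTF-8
```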



Re: uppercase german umlaut

2024-01-08 Thread hohe72

On Tue, 2 Jan 2024 11:04:25 -0600
Dave Kemper  wrote:

> > ECMA-48 says for 0x84:  
> 
> Also irrelevant to groff, as it doesn't use ECMA-48.  Groff tools
> (including gpic) take input in Latin-1, period.

I don't think so. ECMA-48 may be interpreted by terminals.

In the case of terminal output, those characters, if interpreted as
control sequences, would throw the output into disarray. Therefore,
if I'm right, it's rejected as invalid rather than passed through.




Re: uppercase german umlaut

2024-01-02 Thread Dave Kemper
[moving this back to the thread where it belongs]

On 1/2/24, hoh...@posteo.de  wrote:
> If gpic gets Ä (0xc3 0x84) it complains about 0x84.
> If gpic gets ä (0xc3 0xa4) it does not complain about 0xa4.

True, but irrelevant, because *in neither case will the character be
interpreted the way you intend*.

gpic will consider 0xc3 0x84 one valid Latin-1 character (LATIN CAPITAL
LETTER A WITH TILDE) followed by one invalid control character.

gpic will consider 0xc3 0xa4 two valid Latin-1 characters (LATIN
CAPITAL LETTER A WITH TILDE and CURRENCY SIGN).

What you're trying to send to gpic in your two examples is LATIN
CAPITAL LETTER A WITH DIAERESIS and LATIN SMALL LETTER A WITH
DIAERESIS.  But if those are sent as UTF-8 to gpic, it will not
interpret them as you want.  To get what you want, you need to convert
your input to Latin-1, or run it through preconv before gpic.

> ECMA-48 says for 0x84:

Also irrelevant to groff, as it doesn't use ECMA-48.  Groff tools
(including gpic) take input in Latin-1, period.  (Pure ASCII, being a
subset of Latin-1, is also valid.)  Any bytes that aren't Latin-1
characters are illegal input to all groff tools.  The only exception
is preconv, which recognizes various encodings and converts them to
pure ASCII, with all non-ASCII characters being converted to groff
escape sequences.

> (If you want to know why I ignore preconv, read the last mail.)

I don't recall a previous message giving a reason for this, but if you
don't use preconv (or convert input to Latin-1 by some means), you're
not going to get what you want.



Re: uppercase german umlaut

2023-12-29 Thread Dave Kemper
On 12/28/23, holger.herrl...@posteo.de  wrote:
> echo ä | gpic | hexStream
> 0x2e 0x69 0x66 0x20 0x21 0x64 0x50 0x53  | .if !dPS
> 0x20 0x2e 0x64 0x73 0x20 0x50 0x53 0x0a  |  .ds PS.
> 0x2e 0x69 0x66 0x20 0x21 0x64 0x50 0x45  | .if !dPE
> 0x20 0x2e 0x64 0x73 0x20 0x50 0x45 0x0a  |  .ds PE.
> 0x2e 0x6c 0x66 0x20 0x31 0x20 0x2d 0x0a  | .lf 1 -.
> 0xc3 0xa4 0x0a   | ...
>
> echo Ä | gpic | hexStream
> gpic::1: invalid input character code 132
> 0x2e 0x69 0x66 0x20 0x21 0x64 0x50 0x53  | .if !dPS
> 0x20 0x2e 0x64 0x73 0x20 0x50 0x53 0x0a  |  .ds PS.
> 0x2e 0x69 0x66 0x20 0x21 0x64 0x50 0x45  | .if !dPE
> 0x20 0x2e 0x64 0x73 0x20 0x50 0x45 0x0a  |  .ds PE.
> 0x2e 0x6c 0x66 0x20 0x31 0x20 0x2d 0x0a  | .lf 1 -.
> 0xc3 0x0a| ..
>
> The character emerges from an input file name. So it is missed by
> preconv somewhere,

As Lennart points out, the above pipelines don't invoke preconv at
all.  But also the above examples don't come from a filename, so I
suspect your example is too simplified from your actual use case to
illustrate the problem.  Do you have a command sequence that DOES
invoke preconv where UTF-8 characters are not being correctly handled?

> however why is 'ä' working properly/ just passed through?

It's not "working properly" in a sense that groff can handle.  The
output above shows the ä is coming out as 0xc3 0xa4, which is the UTF-8
encoding of the character.  But were this to go into a groff pipeline,
groff would interpret those two bytes as two Latin-1 characters, neither
of which is ä.

(In the example you posted at the start of this thread, where the 0xc3
0xa4 went to the terminal, your terminal interpreted that sequence as
UTF-8 and displayed an ä.  So it only looked "right" because your
input and output encodings matched.)

Your second example shows that pic is discarding the byte of Ä's
encoding it doesn't recognize as valid Latin-1.  You can see this in
two ways: this byte is missing from your hexStream output, and pic
throws an error.  The only byte left, 0xc3, is a Latin-1 Ã, which is how
groff would interpret it.  But your terminal, expecting UTF-8, would
be unable to output anything meaningful for this.



Re: uppercase german umlaut

2023-12-28 Thread Lennart Jablonka

Quoth holger.herrl...@posteo.de:

> echo ä | gpic | hexStream
> 0x2e 0x69 0x66 0x20 0x21 0x64 0x50 0x53  | .if !dPS
> 0x20 0x2e 0x64 0x73 0x20 0x50 0x53 0x0a  |  .ds PS.
> 0x2e 0x69 0x66 0x20 0x21 0x64 0x50 0x45  | .if !dPE
> 0x20 0x2e 0x64 0x73 0x20 0x50 0x45 0x0a  |  .ds PE.
> 0x2e 0x6c 0x66 0x20 0x31 0x20 0x2d 0x0a  | .lf 1 -.
> 0xc3 0xa4 0x0a                           | ...
>
> echo Ä | gpic | hexStream
> gpic::1: invalid input character code 132
> 0x2e 0x69 0x66 0x20 0x21 0x64 0x50 0x53  | .if !dPS
> 0x20 0x2e 0x64 0x73 0x20 0x50 0x53 0x0a  |  .ds PS.
> 0x2e 0x69 0x66 0x20 0x21 0x64 0x50 0x45  | .if !dPE
> 0x20 0x2e 0x64 0x73 0x20 0x50 0x45 0x0a  |  .ds PE.
> 0x2e 0x6c 0x66 0x20 0x31 0x20 0x2d 0x0a  | .lf 1 -.
> 0xc3 0x0a                                | ..
>
> The character emerges from an input file name. So it is missed by
> preconv somewhere; however, why is 'ä' working properly / just passed
> through?


You don’t seem to be running preconv.  Are you?

gpic is reading from standard input the bytes c3 a4 (ä) or
c3 84 (Ä).  It interprets those as Latin 1: c3 a4 is Ã ¤.
c3 84 is Ã followed by a control character.  The control
characters 80–9f are invalid.
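
A quick way to find such bytes before gpic complains is to scan the
input for values in that range.  A sketch using od and awk; c1_bytes is
an illustrative name, and the 80-9f range is taken from the statement
above, not checked against the groff sources.

```shell
# c1_bytes: print the decimal value of each byte on stdin that lies in the
# range 128-159 (hex 80-9f); print nothing if there are none.
c1_bytes() {
    od -An -v -tu1 |
        awk '{ for (i = 1; i <= NF; i++) if ($i + 0 >= 128 && $i + 0 <= 159) print $i }'
}

printf '\303\244\n' | c1_bytes   # "ä" as UTF-8: no output
printf '\303\204\n' | c1_bytes   # "Ä" as UTF-8: prints 132
```

The 132 is the same number gpic reports in its error message.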




Re: uppercase german umlaut

2023-12-28 Thread holger.herrlich

echo ä | gpic | hexStream
0x2e 0x69 0x66 0x20 0x21 0x64 0x50 0x53  | .if !dPS
0x20 0x2e 0x64 0x73 0x20 0x50 0x53 0x0a  |  .ds PS.
0x2e 0x69 0x66 0x20 0x21 0x64 0x50 0x45  | .if !dPE
0x20 0x2e 0x64 0x73 0x20 0x50 0x45 0x0a  |  .ds PE.
0x2e 0x6c 0x66 0x20 0x31 0x20 0x2d 0x0a  | .lf 1 -.
0xc3 0xa4 0x0a   | ...

echo Ä | gpic | hexStream
gpic::1: invalid input character code 132
0x2e 0x69 0x66 0x20 0x21 0x64 0x50 0x53  | .if !dPS
0x20 0x2e 0x64 0x73 0x20 0x50 0x53 0x0a  |  .ds PS.
0x2e 0x69 0x66 0x20 0x21 0x64 0x50 0x45  | .if !dPE
0x20 0x2e 0x64 0x73 0x20 0x50 0x45 0x0a  |  .ds PE.
0x2e 0x6c 0x66 0x20 0x31 0x20 0x2d 0x0a  | .lf 1 -.
0xc3 0x0a| ..

The character emerges from an input file name. So it is missed by
preconv somewhere; however, why is 'ä' working properly / just passed
through?



On Wed, 27 Dec 2023 01:29:53 -0600
Dave Kemper  wrote:

> On 12/26/23, holger.herrl...@posteo.de 
> wrote:
> > echo Ä | gpic
> > .if !dPS .ds PS
> > .if !dPE .ds PE
> > .lf 1 -
> > gpic::1: invalid input character code 132
> > �  
> 
> Hi Holger,
> 
> The paste above doesn't reveal what sequences of bytes your "echo" is
> outputting, but I deduce it's UTF-8, since "U+00C4 LATIN CAPITAL
> LETTER A WITH DIAERESIS" is encoded in UTF-8 as the two-byte hex
> sequence c3 84, the latter byte of which is 132 decimal, which is the
> number in your error message.  This is what I get in a UTF-8
> environment:
> 
> $ echo Ä | od -t u1
> 000 195 132  10
> 003
> 
> Unfortunately, the groff toolchain doesn't speak UTF-8, only Latin-1
> (and expanding this is a longstanding wish-list item:
> http://savannah.gnu.org/bugs/?40720).  So before pic sees the input,
> you'll have to convert it to a form pic understands.
> 
> The most flexible way to do this is with groff's preconv tool, because
> this will convert a wide range of Unicode input into escapes that the
> groff tools understand.
> 
> $ echo Ä | preconv -eutf-8
> .lf 1 -
> \[u00C4]
> 
> If all your input falls into the Latin-1 range, you can instead use
> the system iconv command to convert everything to Latin-1 (a.k.a. ISO
> 8859-1).
> 
> $ echo Ä | iconv -futf-8 -tiso-8859-1 | od -t u1
> 000 196  10
> 002
> 





Re: uppercase german umlaut

2023-12-26 Thread Dave Kemper
On 12/26/23, holger.herrl...@posteo.de  wrote:
> echo Ä | gpic
> .if !dPS .ds PS
> .if !dPE .ds PE
> .lf 1 -
> gpic::1: invalid input character code 132
> �

Hi Holger,

The paste above doesn't reveal what sequences of bytes your "echo" is
outputting, but I deduce it's UTF-8, since "U+00C4 LATIN CAPITAL
LETTER A WITH DIAERESIS" is encoded in UTF-8 as the two-byte hex
sequence c3 84, the latter byte of which is 132 decimal, which is the
number in your error message.  This is what I get in a UTF-8
environment:

$ echo Ä | od -t u1
000 195 132  10
003

Unfortunately, the groff toolchain doesn't speak UTF-8, only Latin-1
(and expanding this is a longstanding wish-list item:
http://savannah.gnu.org/bugs/?40720).  So before pic sees the input,
you'll have to convert it to a form pic understands.

The most flexible way to do this is with groff's preconv tool, because
this will convert a wide range of Unicode input into escapes that the
groff tools understand.

$ echo Ä | preconv -eutf-8
.lf 1 -
\[u00C4]

If all your input falls into the Latin-1 range, you can instead use
the system iconv command to convert everything to Latin-1 (a.k.a. ISO
8859-1).

$ echo Ä | iconv -futf-8 -tiso-8859-1 | od -t u1
000 196  10
002
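
Whether input really does fall into the Latin-1 range can be tested up
front.  A sketch, assuming glibc's iconv, which exits non-zero when a
character has no Latin-1 equivalent (other implementations may
substitute a placeholder instead); fits_latin1 is an illustrative name.

```shell
# fits_latin1: succeed only if the UTF-8 text on stdin can be converted to
# ISO 8859-1 without loss. The converted bytes themselves are discarded.
fits_latin1() {
    iconv -f UTF-8 -t ISO-8859-1 >/dev/null 2>&1
}

if printf '\303\204\n' | fits_latin1; then          # "Ä", U+00C4
    echo "Latin-1 is enough; iconv will do"
fi
if ! printf '\342\202\254\n' | fits_latin1; then    # euro sign, U+20AC
    echo "outside Latin-1; use preconv"
fi
```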



uppercase german umlaut

2023-12-26 Thread holger.herrlich

gpic --version
GNU pic (groff) version 1.22.4

echo ä | gpic
.if !dPS .ds PS
.if !dPE .ds PE
.lf 1 -
ä

echo Ä | gpic
.if !dPS .ds PS
.if !dPE .ds PE
.lf 1 -
gpic::1: invalid input character code 132
�


This gpic error message first appeared after updating to Debian 12. The
system is not a fresh install.

