ksh [emacs.c] -- simplify isu8cont()

2020-07-25 Thread ropers
Index: emacs.c
===
RCS file: /cvs/src/bin/ksh/emacs.c,v
retrieving revision 1.87
diff -u -r1.87 emacs.c
--- emacs.c 8 May 2020 14:30:42 -   1.87
+++ emacs.c 25 Jul 2020 16:31:22 -
@@ -269,10 +269,11 @@
{ 0, 0, 0 },
 };

+/* is octet a UTF-8 continuation byte? */
 int
 isu8cont(unsigned char c)
 {
-   return (c & (0x80 | 0x40)) == 0x80;
+   return (c & 0xC0) == 0x80;
 }

 int



Re: ksh [emacs.c] -- simplify isu8cont()

2020-07-25 Thread Martijn van Duren
This function is used throughout the OpenBSD tree and I think it's
fine as it is. This way it's clearer to me that it's about byte
7 and 8 and not have to do the math in my head to check if we
might have botched it.

Also compilers should be smart enough to optimize this out at
compile-time anyway.

martijn@

On Sat, 2020-07-25 at 17:40 +0100, ropers wrote:
> Index: emacs.c
> ===
> RCS file: /cvs/src/bin/ksh/emacs.c,v
> retrieving revision 1.87
> diff -u -r1.87 emacs.c
> --- emacs.c   8 May 2020 14:30:42 -   1.87
> +++ emacs.c   25 Jul 2020 16:31:22 -
> @@ -269,10 +269,11 @@
>   { 0, 0, 0 },
>  };
> 
> +/* is octet a UTF-8 continuation byte? */
>  int
>  isu8cont(unsigned char c)
>  {
> - return (c & (0x80 | 0x40)) == 0x80;
> + return (c & 0xC0) == 0x80;
>  }
> 
>  int
> 



Re: ksh [emacs.c] -- simplify isu8cont()

2020-07-25 Thread Ingo Schwarze
Hi,

Martijn van Duren wrote on Sat, Jul 25, 2020 at 09:54:53PM +0200:

> This function is used throughout the OpenBSD tree and I think it's
> fine as it is. This way it's clearer to me that it's about byte
> 7 and 8 and not have to do the math in my head to check if we
> might have botched it.
> 
> Also compilers should be smart enough to optimize this out at
> compile-time anyway.

Indeed, that's exactly why tedu@ designed it as it is.
  Ingo


> On Sat, 2020-07-25 at 17:40 +0100, ropers wrote:

>> Index: emacs.c
>> ===
>> RCS file: /cvs/src/bin/ksh/emacs.c,v
>> retrieving revision 1.87
>> diff -u -r1.87 emacs.c
>> --- emacs.c  8 May 2020 14:30:42 -   1.87
>> +++ emacs.c  25 Jul 2020 16:31:22 -
>> @@ -269,10 +269,11 @@
>>  { 0, 0, 0 },
>>  };
>> 
>> +/* is octet a UTF-8 continuation byte? */
>>  int
>>  isu8cont(unsigned char c)
>>  {
>> -return (c & (0x80 | 0x40)) == 0x80;
>> +return (c & 0xC0) == 0x80;
>>  }
>> 
>>  int



Re: ksh [emacs.c] -- simplify isu8cont()

2020-07-25 Thread ropers
On 25/07/2020, Martijn van Duren  wrote:
> This function is used throughout the OpenBSD tree and I think it's
> fine as it is. This way it's clearer to me that it's about byte
> 7 and 8

You mean bits 7 and 8 when counting from 1 from the right?

> and not have to do the math in my head to check if we
> might have botched it.
>
> Also compilers should be smart enough to optimize this out at
> compile-time anyway.
>
> martijn@

So the (0x80 | 0x40) was supposed to be *for* legibility?

IMHO it hurts legibility, but admittedly, it depends on who you are
and what you have memorised:
Finding (0x80 | 0x40) easier than 0xC0 assumes that people have nibble
translation between binary and hexadecimal memorised all the way up to
8 but *NOT* up to C.  And that they find dealing with two values and
an extra binary OR still less hassle than remembering that C is 1100.
Of course if that's your decision, that's your decision, but speaking
as a novice C programmer, I think 'C' is easier. ;D [0]

But while you're reading this, would you at least consider committing
the explanatory comment?  Not everybody is already familiar with how
UTF-8 works, and I think this comment above the function is still some
extra hand-holding beginners might find useful:
/* is octet a UTF-8 continuation byte? */

Also, while you're here, can anyone tell me what the zot in x_zots() /
x_zotc() actually stands for?  I thought it was "zero out the
{string|character}" when I looked at x_zots(), but then I doubted
myself once I saw that that's not strictly what x_zotc() actually
does.  Does anyone know?

Ian


footnote:
[0] Which latter thought, incidentally, is also why someone invented BCHS. ;)




> On Sat, 2020-07-25 at 17:40 +0100, ropers wrote:
>> Index: emacs.c
>> ===
>> RCS file: /cvs/src/bin/ksh/emacs.c,v
>> retrieving revision 1.87
>> diff -u -r1.87 emacs.c
>> --- emacs.c  8 May 2020 14:30:42 -   1.87
>> +++ emacs.c  25 Jul 2020 16:31:22 -
>> @@ -269,10 +269,11 @@
>>  { 0, 0, 0 },
>>  };
>>
>> +/* is octet a UTF-8 continuation byte? */
>>  int
>>  isu8cont(unsigned char c)
>>  {
>> -return (c & (0x80 | 0x40)) == 0x80;
>> +return (c & 0xC0) == 0x80;
>>  }
>>
>>  int
>>
>
>



Re: ksh [emacs.c] -- simplify isu8cont()

2020-07-25 Thread Martijn van Duren
On Sun, 2020-07-26 at 04:15 +0100, ropers wrote:
> On 25/07/2020, Martijn van Duren  wrote:
> > This function is used throughout the OpenBSD tree and I think it's
> > fine as it is. This way it's clearer to me that it's about byte
> > 7 and 8
> 
> You mean bits 7 and 8 when counting from 1 from the right?

obvious type-O is obvious.
> 
> > and not have to do the math in my head to check if we
> > might have botched it.
> > 
> > Also compilers should be smart enough to optimize this out at
> > compile-time anyway.
> > 
> > martijn@
> 
> So the (0x80 | 0x40) was supposed to be *for* legibility?
> 
> IMHO it hurts legibility, but admittedly, it depends on who you are
> and what you have memorised:
> Finding (0x80 | 0x40) easier than 0xC0 assumes that people have nibble
> translation between binary and hexadecimal memorised all the way up to
> 8 but *NOT* up to C.  And that they find dealing with two values and
> an extra binary OR still less hassle than remembering that C is 1100.
> Of course if that's your decision, that's your decision, but speaking
> as a novice C programmer, I think 'C' is easier. ;D [0]

I'm lazy, so I only remember the power exponents (1-4) and everthing
in between is just derivative. And since I and a handful of others
maintain this code I think we should do what we think is most readable
to us. If say schwarze@ were to change his mind and come up with this
patch I'd give it more serious thought on how hard it would be for me
to make the switch, but we're not going to change it because some
outsider might find it easier to read.
Don't get me wrong, I appreciate the patch and I do encourage you to
find new things to work on, but this one is just a matter of opinion
and in this case the people who actually maintain this code have a
different opinion and they tend to win, maybe your next patch will
convince us (I actually hope so).
> 
> But while you're reading this, would you at least consider committing
> the explanatory comment?  Not everybody is already familiar with how
> UTF-8 works, and I think this comment above the function is still some
> extra hand-holding beginners might find useful:
> /* is octet a UTF-8 continuation byte? */

Opinions differ, but I find comments like these just take up useless
screen space, because it basically reiterates what it states in the
function name.
> 
> Also, while you're here, can anyone tell me what the zot in x_zots() /
> x_zotc() actually stands for?  I thought it was "zero out the
> {string|character}" when I looked at x_zots(), but then I doubted
> myself once I saw that that's not strictly what x_zotc() actually
> does.  Does anyone know?

I have no idea, but it's been there since it's import in 1996 and
with only a quick search I can't find the original repository, so
no clue there.
> 
> Ian
> 
> 
> footnote:
> [0] Which latter thought, incidentally, is also why someone invented BCHS. ;)
> 
> 
> 
> 
> > On Sat, 2020-07-25 at 17:40 +0100, ropers wrote:
> > > Index: emacs.c
> > > ===
> > > RCS file: /cvs/src/bin/ksh/emacs.c,v
> > > retrieving revision 1.87
> > > diff -u -r1.87 emacs.c
> > > --- emacs.c   8 May 2020 14:30:42 -   1.87
> > > +++ emacs.c   25 Jul 2020 16:31:22 -
> > > @@ -269,10 +269,11 @@
> > >   { 0, 0, 0 },
> > >  };
> > > 
> > > +/* is octet a UTF-8 continuation byte? */
> > >  int
> > >  isu8cont(unsigned char c)
> > >  {
> > > - return (c & (0x80 | 0x40)) == 0x80;
> > > + return (c & 0xC0) == 0x80;
> > >  }
> > > 
> > >  int
> > >