Re: [hackers] [sbase][PATCH 4/5] fold: fix handling of multibyte characters

2020-10-05 Thread Laslo Hunhold
On Fri, 2 Oct 2020 18:10:21 +0200
Richard Ipsum  wrote:

Dear Richard,

> I'm happy to drop this patch from the series but libgrapheme isn't in
> sbase's tree and it doesn't seem reasonable to expect users of sbase
> to install libgrapheme themselves?

this is a good point and I think you should not drop this patch. I see
it like this: sbase has an out-of-tree-copy of libutf in its
repository, and in regard to grapheme cluster handling, nothing happens
between the unicode versions.

One can think about pulling in libgrapheme, porting the
Unicode-sections piece by piece and then dropping libutf. Admittedly,
libgrapheme does not have all of the high-level-functions, but in many
cases they are trivial to replace or not necessary.

One example here are the, well-meant, is*rune() functions, which for
instance are used in tr(1) to map classes. However, the more I think
about it, they are just insufficient to map grapheme clusters (because
they operate on codepoints only). To give an example, if you have a
grapheme cluster of multiple characters, where one is a digit rune, but
the others aren't, does this mean that the entire grapheme cluster
should be dropped when tr(1) is configured to do so? The same
also applies to upperrune.

> I'm not at all familiar with libgrapheme either and I don't know what
> the trivial counterexamples Laslo refers to are, maybe it's better if
> he takes over this part of the fix?

No matter what Michael decides to do (i.e. import libgrapheme and port
or stay with libutf), your patch makes sense, because importing
libgrapheme would not mean immediately dropping libtuf (and it would be
more of a porting process) and they would stay side-by-side for a while.

With best regards

Laslo



Re: [hackers] [sbase][PATCH 4/5] fold: fix handling of multibyte characters

2020-10-02 Thread Michael Forney
On 2020-10-02, Richard Ipsum  wrote:
> I'm happy to drop this patch from the series but libgrapheme isn't in
> sbase's tree and it doesn't seem reasonable to expect users of sbase to
> install libgrapheme themselves?

For now, I think it's fine to keep your patch (it is definitely an
improvement). Maybe you could add an entry to the TODO about using
libgrapheme to count grapheme clusters instead?



Re: [hackers] [sbase][PATCH 4/5] fold: fix handling of multibyte characters

2020-10-02 Thread Richard Ipsum
On Thu, Oct 01, 2020 at 08:52:34AM +0200, Laslo Hunhold wrote:
> On Wed, 30 Sep 2020 22:41:47 -0700
> Michael Forney  wrote:
> 
> Dear Michael,
> 
> > POSIX says we should be counting column positions rather than
> > codepoints, but I think that might be rather difficult to get right
> > and this is probably an improvement already.
> > 
> > I know Laslo has studied this area for libgrapheme, so maybe he has
> > suggestions.
> 
> if you want to do it 100% right, there's no way around using
> libgrapheme (or another library handling grapheme clusters like icu,
> but I bet there's none nearly as lightweight as libgrapheme). Counting
> codepoints is only halfway there and there are trivial counterexamples
> which prove that this is not the complete solution and there are
> discrepancies.
> 
> On the other hand, in the western world, most grapheme clusters are
> emojis and certain cases with more complex writing systems. It's a much
> different matter when you go to asia or africa, where you can't really
> properly implement many very popular writing systems (like Hangul)
> without using grapheme clusters.
> Most importantly in general though are if you're processing
> denormalized input (i.e. where everything is broken down as much as
> possible, for example the single codepoint
> (=1-codepoint-grapheme-cluster) "ä" is turned into the codepoint "a"
> with an umlaut modifier, making it a 2-codepoint-grapheme-cluster),
> leading to a lot of gotchas, inconsistencies and maybe even security
> problems.
> 
> All in all though, codepoint-counting is a step in the right direction,
> but definitely not exhaustive, especially as time moves on and more and
> more people are using the higher unicode planes for data. If you really
> want to do it right, you must handle grapheme clusters, and libgrapheme
> is actually very fast and should even be faster than the Rune-solution
> using libutf.h, because it works on the byte-level rather than the
> Codepoint-level.
> 
> With best regards
> 
> Laslo
> 

I'm happy to drop this patch from the series but libgrapheme isn't in
sbase's tree and it doesn't seem reasonable to expect users of sbase to
install libgrapheme themselves?

I'm not at all familiar with libgrapheme either and I don't know what
the trivial counterexamples Laslo refers to are, maybe it's better if he
takes over this part of the fix?

Thanks,
Richard



Re: [hackers] [sbase][PATCH 4/5] fold: fix handling of multibyte characters

2020-10-01 Thread Laslo Hunhold
On Wed, 30 Sep 2020 22:41:47 -0700
Michael Forney  wrote:

Dear Michael,

> POSIX says we should be counting column positions rather than
> codepoints, but I think that might be rather difficult to get right
> and this is probably an improvement already.
> 
> I know Laslo has studied this area for libgrapheme, so maybe he has
> suggestions.

if you want to do it 100% right, there's no way around using
libgrapheme (or another library handling grapheme clusters like icu,
but I bet there's none nearly as lightweight as libgrapheme). Counting
codepoints is only halfway there and there are trivial counterexamples
which prove that this is not the complete solution and there are
discrepancies.

On the other hand, in the western world, most grapheme clusters are
emojis and certain cases with more complex writing systems. It's a much
different matter when you go to asia or africa, where you can't really
properly implement many very popular writing systems (like Hangul)
without using grapheme clusters.
Most importantly in general though are if you're processing
denormalized input (i.e. where everything is broken down as much as
possible, for example the single codepoint
(=1-codepoint-grapheme-cluster) "ä" is turned into the codepoint "a"
with an umlaut modifier, making it a 2-codepoint-grapheme-cluster),
leading to a lot of gotchas, inconsistencies and maybe even security
problems.

All in all though, codepoint-counting is a step in the right direction,
but definitely not exhaustive, especially as time moves on and more and
more people are using the higher unicode planes for data. If you really
want to do it right, you must handle grapheme clusters, and libgrapheme
is actually very fast and should even be faster than the Rune-solution
using libutf.h, because it works on the byte-level rather than the
Codepoint-level.

With best regards

Laslo



Re: [hackers] [sbase][PATCH 4/5] fold: fix handling of multibyte characters

2020-09-30 Thread Michael Forney
POSIX says we should be counting column positions rather than
codepoints, but I think that might be rather difficult to get right
and this is probably an improvement already.

I know Laslo has studied this area for libgrapheme, so maybe he has
suggestions.

On 2020-06-20, Richard Ipsum  wrote:
> diff --git a/fold.c b/fold.c
> index a5f320b..169064b 100644
> --- a/fold.c
> +++ b/fold.c
> @@ -7,6 +7,7 @@
>
>  #include "text.h"
>  #include "util.h"
> +#include "utf.h"
>
>  static intbflag = 0;
>  static intsflag = 0;
> @@ -15,11 +16,14 @@ static size_t width = 80;
>  static void
>  foldline(struct line *l) {
>   size_t i, col, last, spacesect, len;
> + Rune r;
> + int runelen;
> +
> + for (i = 0, last = 0, col = 0, spacesect = 0; i < l->len; i += runelen) 
> {
>

There's a extraneous blank line here.

> - for (i = 0, last = 0, col = 0, spacesect = 0; i < l->len; i++) {
> - if (!UTF8_POINT(l->data[i]) && !bflag)
> - continue;
>   if (col >= width && (l->data[i] != '\b' || bflag)) {
> + if (bflag && col > width)
> + i -= runelen;   /* never split a character */
>   len = ((sflag && spacesect) ? spacesect : i) - last;
>   if (fwrite(l->data + last, 1, len, stdout) != len)
>   eprintf("fwrite :");
> @@ -29,6 +33,7 @@ foldline(struct line *l) {
>   col = 0;
>   spacesect = 0;
>   }
> + runelen = chartorune(, l->data + i);

I think we should use charntorune(, l->data + i, l->len - i) here.

>   if (sflag && isspace(l->data[i]))
>   spacesect = i + 1;
>   if (!bflag && iscntrl(l->data[i])) {
> @@ -46,7 +51,7 @@ foldline(struct line *l) {
>   break;
>   }
>   } else {
> - col++;
> + col += bflag ? runelen : 1;
>   }
>   }
>   if (l->len - last)