Re: Some generated header files are messed up on Alpine

2022-01-01 Thread Paul Eggert
I'd like to try the approach a bit more (especially as this prompted me 
to simplify it a bit :-). So I installed the first attached patch to 
Gnulib, to work around the bug in BusyBox 'sed'.


This 'sed' bug was new to me, so I installed the second attached patch 
to Autoconf, to document the portability problem.
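
For readers who want to see the failure mode in isolation, here is a minimal C sketch of how opening one file for output twice can lose data. It is not BusyBox code; the function names and the demo file are hypothetical, and it assumes ordinary buffered stdio streams. Each stream starts at file offset 0, so the stream flushed last overwrites the beginning of what the other one wrote, which is consistent with the truncated first lines in Tim's report.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Write HEADER and BODY to PATH through two streams that were opened
   independently, as a 'sed' that opens one stream per 'w' command would.
   Each stream has its own file offset starting at 0, so the stream that
   is flushed last overwrites the beginning of the other one's data.
   Return 0 on success, -1 on failure.  */
static int
write_via_two_streams (const char *path, const char *header, const char *body)
{
  assert (path != NULL && header != NULL && body != NULL);
  FILE *a = fopen (path, "w");
  FILE *b = fopen (path, "w");
  if (a == NULL || b == NULL)
    {
      if (a != NULL)
        fclose (a);
      if (b != NULL)
        fclose (b);
      return -1;
    }
  /* Both writes stay in the stdio buffers until the streams are closed.  */
  int ok = fputs (header, a) != EOF && fputs (body, b) != EOF;
  /* Flush the body first, then the header.  */
  ok &= fclose (b) == 0;
  ok &= fclose (a) == 0;
  return ok ? 0 : -1;
}

/* Read at most SIZE - 1 bytes of PATH into BUF and NUL-terminate it.
   Return the number of bytes read.  */
static size_t
read_back (const char *path, char *buf, size_t size)
{
  assert (size > 0);
  FILE *f = fopen (path, "r");
  size_t n = 0;
  if (f != NULL)
    {
      n = fread (buf, 1, size - 1, f);
      fclose (f);
    }
  buf[n] = '\0';
  return n;
}
```

Calling write_via_two_streams with a short header and a longer body, then read_back, shows a file that is as long as the body but begins with the header; the first bytes of the body are gone.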


Also, I emailed a bug report to the BusyBox maintainers.

From 75541c6adaf6fc45541a35d2c8803b9b68f2a7fc Mon Sep 17 00:00:00 2001
From: Paul Eggert 
Date: Sat, 1 Jan 2022 15:30:38 -0800
Subject: [PATCH] gen-header: port to BusyBox ‘sed’
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Problem reported by Tim Rühsen in:
https://lists.gnu.org/r/bug-gnulib/2022-01/msg4.html
* modules/gen-header (SED_HEADER_NOEDIT): Replace instead of prepend.
(SED_HEADER_STDOUT, SED_HEADER_TO_AT_t): Adjust to that change.
Do not use ‘w foo’ twice in the same script, as BusyBox ‘sed’
mistakenly opens ‘foo’ for output twice, thus losing some output.
---
 ChangeLog  | 10 ++
 modules/gen-header | 17 +
 2 files changed, 15 insertions(+), 12 deletions(-)

diff --git a/ChangeLog b/ChangeLog
index 21390af8fd..3d837b1c18 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -1,3 +1,13 @@
+2022-01-01  Paul Eggert  
+
+	gen-header: port to BusyBox ‘sed’
+	Problem reported by Tim Rühsen in:
+	https://lists.gnu.org/r/bug-gnulib/2022-01/msg4.html
+	* modules/gen-header (SED_HEADER_NOEDIT): Replace instead of prepend.
+	(SED_HEADER_STDOUT, SED_HEADER_TO_AT_t): Adjust to that change.
+	Do not use ‘w foo’ twice in the same script, as BusyBox ‘sed’
+	mistakenly opens ‘foo’ for output twice, thus losing some output.
+
 2022-01-01  Bruno Haible  
 
 	striconveh: Support an error handler that produces a Unicode U+FFFD.
diff --git a/modules/gen-header b/modules/gen-header
index ab1858d65f..feb711b5c6 100644
--- a/modules/gen-header
+++ b/modules/gen-header
@@ -11,24 +11,17 @@ Depends-on:
 configure.ac:
 
 Makefile.am:
-# In 'sed', prepend a "DO NOT EDIT" comment to the pattern space.
-SED_HEADER_NOEDIT = s,^,/* DO NOT EDIT! GENERATED AUTOMATICALLY! */,
+# In 'sed', replace the pattern space with a "DO NOT EDIT" comment.
+SED_HEADER_NOEDIT = s,.*,/* DO NOT EDIT! GENERATED AUTOMATICALLY! */,
 
 # '$(SED_HEADER_STDOUT) -e "..."' runs 'sed' but first outputs "DO NOT EDIT".
-SED_HEADER_STDOUT = sed \
-  -e x \
-  -e '1$(SED_HEADER_NOEDIT)p' \
-  -e x
+SED_HEADER_STDOUT = sed -e 1h -e '1$(SED_HEADER_NOEDIT)' -e 1G
 
 # '$(SED_HEADER_TO_AT_t) FILE' copies FILE to $@-t, prepending a leading
 # "DO_NOT_EDIT".  Although this could be done more simply via:
 #	SED_HEADER_TO_AT_t = $(SED_HEADER_STDOUT) > $@-t
-# the -n and 'w's avoid a fork+exec, at least when GNU Make is used.
-SED_HEADER_TO_AT_t = sed -n \
-  -e x \
-  -e '1$(SED_HEADER_NOEDIT)w $@-t' \
-  -e x \
-  -e 'w $@-t'
+# the -n and 'w' avoid a fork+exec, at least when GNU Make is used.
+SED_HEADER_TO_AT_t = $(SED_HEADER_STDOUT) -n -e 'w $@-t'
 
 # Use $(gl_V_at) instead of $(AM_V_GEN) or $(AM_V_at) on a line that
 # is its recipe's first line if and only if @NMD@ lines are absent.
-- 
2.32.0

From 1953a1461fee16e0fa4502156fb43e941920ca03 Mon Sep 17 00:00:00 2001
From: Paul Eggert 
Date: Sat, 1 Jan 2022 18:08:15 -0800
Subject: [PATCH] doc: document BusyBox sed w bug

---
 doc/autoconf.texi | 24 
 1 file changed, 24 insertions(+)

diff --git a/doc/autoconf.texi b/doc/autoconf.texi
index 17a6326e..04730dcc 100644
--- a/doc/autoconf.texi
+++ b/doc/autoconf.texi
@@ -20170,6 +20170,30 @@ s/.*/deleted/g
 :end
 @end example
 
+@item @command{sed} (@samp{w})
+@c ---
+@prindex @command{sed} (@samp{w})
+
+When a script contains multiple commands to write lines to the same
+output file, BusyBox @command{sed} mistakenly opens a separate output
+stream for each command.  This can cause one of the commands to ``win''
+and the others to ``lose'', in the sense that their output is discarded.
+For example:
+
+@example
+sed -n -e '
+  /a/w xxx
+  /b/w xxx
+' <

Re: Some generated header files are messed up on Alpine

2022-01-01 Thread Jim Meyering
On Sat, Jan 1, 2022 at 12:15 PM Bruno Haible  wrote:
> Paul,
>
> I suggest the attached patch. Objections?

Looks fine to me. Thanks for the speedy fix.



Re: Some generated header files are messed up on Alpine

2022-01-01 Thread Bruno Haible
Paul,

I suggest the attached patch. Objections?

Bruno
From 4e30d9715e44f2aade3927e762a4b1fee340962e Mon Sep 17 00:00:00 2001
From: Bruno Haible 
Date: Sat, 1 Jan 2022 21:12:21 +0100
Subject: [PATCH] gen-header: Fix major bug on Alpine Linux (regression
 2021-12-25).
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Reported by Tim Rühsen  in
.

* modules/gen-header (Makefile.am): Define SED_HEADER_TO_AT_t without an
optimization that does not work with 'sed' from BusyBox.
---
 ChangeLog  |  8 
 modules/gen-header | 12 
 2 files changed, 12 insertions(+), 8 deletions(-)

diff --git a/ChangeLog b/ChangeLog
index 21390af8f..9337c8830 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -1,3 +1,11 @@
+2022-01-01  Bruno Haible  
+
+	gen-header: Fix major bug on Alpine Linux (regression 2021-12-25).
+	Reported by Tim Rühsen  in
+	.
+	* modules/gen-header (Makefile.am): Define SED_HEADER_TO_AT_t without an
+	optimization that does not work with 'sed' from BusyBox.
+
 2022-01-01  Bruno Haible  
 
 	striconveh: Support an error handler that produces a Unicode U+FFFD.
diff --git a/modules/gen-header b/modules/gen-header
index ab1858d65..05a1480d6 100644
--- a/modules/gen-header
+++ b/modules/gen-header
@@ -21,14 +21,10 @@ SED_HEADER_STDOUT = sed \
   -e x
 
 # '$(SED_HEADER_TO_AT_t) FILE' copies FILE to $@-t, prepending a leading
-# "DO_NOT_EDIT".  Although this could be done more simply via:
-#	SED_HEADER_TO_AT_t = $(SED_HEADER_STDOUT) > $@-t
-# the -n and 'w's avoid a fork+exec, at least when GNU Make is used.
-SED_HEADER_TO_AT_t = sed -n \
-  -e x \
-  -e '1$(SED_HEADER_NOEDIT)w $@-t' \
-  -e x \
-  -e 'w $@-t'
+# "DO_NOT_EDIT".
+# Beware of optimizations that use 'w' to avoid a fork+exec; such optimizations
+# don't work on Alpine Linux (which has 'sed' from BusyBox).
+SED_HEADER_TO_AT_t = $(SED_HEADER_STDOUT) > $@-t
 
 # Use $(gl_V_at) instead of $(AM_V_GEN) or $(AM_V_at) on a line that
 # is its recipe's first line if and only if @NMD@ lines are absent.
-- 
2.25.1



Re: Some generated header files are messed up on Alpine

2022-01-01 Thread Bruno Haible
Hi Tim,

> Some generated header files look like
> 
> bash-5.1# head -3 lib/uniwidth.h
> /* DO NOT EDIT! GENERATED AUTOMATICALLY! */
>   2001-2002, 2005, 2007, 2009-2021 Free Software Foundation,
> Inc.
> 
> This obviously won't compile.

I can reproduce it on Alpine Linux 3.14, with a tarball created on a glibc
system:
  ./gnulib-tool --create-testdir --dir=../testdir uniwidth/width

The consequence is severe:
   No tarball created with the current gnulib will build on Alpine Linux!


The problem originates in the commit from 2021-12-25 that introduced
module 'gen-header'. I had tested it on several platforms [1], but Alpine
Linux was not among them.

Alpine Linux ships with many buggy/deficient utilities, and 'sed' is
apparently one of them. You find its source code here: [2].

Paul, can you think of alternate ways to define SED_HEADER_TO_AT_t ?

Bruno

[1] https://lists.gnu.org/archive/html/bug-gnulib/2021-12/msg00147.html
[2] https://git.busybox.net/busybox/tree/editors/sed.c






Re: [striconveh] Error handling and Unicode replacement character

2022-01-01 Thread Bruno Haible
Marc Nieper-Wißkirchen wrote on 2021-12-30:
> The striconveh module and related modules offer an error handler
> argument. The current possible values are:
> 
> iconveh_error
> iconveh_question_mark
> iconveh_escape_sequence
> 
> The second option replaces any unconvertible character with a question mark 
> "?".
> 
> I would like to request to add a fourth option, say,
> iconveh_replacement_character, which is like iconveh_question_mark but
> uses U+FFFD whenever the target codeset is a Unicode codeset.

That's a good suggestion, as nowadays people are frequently converting
to UTF-8 or GB18030. Implemented as follows.
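
As a reading aid for the patch below, here is a minimal, self-contained C sketch of the substitution step when the target is UTF-8. The enum and function names carry a demo_ prefix because they are hypothetical stand-ins, not the gnulib declarations; unlike the real code, the sketch treats every handler other than the replacement-character one as "emit '?'".

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for 'enum iconv_ilseq_handler' from lib/iconveh.h
   (hypothetical names, for illustration only).  */
enum demo_ilseq_handler
{
  demo_error,
  demo_question_mark,
  demo_escape_sequence,
  demo_replacement_character
};

/* Number of bytes the substitute occupies in UTF-8 output:
   3 for U+FFFD, 1 for '?'.  */
static size_t
demo_substitute_size (enum demo_ilseq_handler handler)
{
  return handler == demo_replacement_character ? 3 : 1;
}

/* Append the substitute for one unconvertible character at
   RESULT[*LENGTH], where RESULT has room for ALLOCATED bytes.
   Return 0 on success, -1 if the buffer is too small.  */
static int
demo_append_substitute (char *result, size_t allocated, size_t *length,
                        enum demo_ilseq_handler handler)
{
  size_t need = demo_substitute_size (handler);
  if (*length > allocated || allocated - *length < need)
    return -1;
  if (handler == demo_replacement_character)
    {
      /* U+FFFD in UTF-8 encoding: EF BF BD.  */
      result[*length + 0] = '\357';
      result[*length + 1] = '\277';
      result[*length + 2] = '\275';
    }
  else
    result[*length] = '?';
  *length += need;
  return 0;
}
```

The size check is the reason the patch replaces the constant 1 with a handler-dependent byte count before growing the output buffer.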


2022-01-01  Bruno Haible  

striconveh: Support an error handler that produces a Unicode U+FFFD.
Suggested by Marc Nieper-Wißkirchen in
.
* lib/iconveh.h (iconveh_replacement_character): New enum value.
* lib/striconveh.c (mem_cd_iconveh_internal): When the handler is
iconveh_replacement_character, try to produce U+FFFD when possible,
instead of '?'.
* tests/test-striconveh.c (main): Add GB18030 tests. Test also
iconveh_replacement_character.

diff --git a/lib/iconveh.h b/lib/iconveh.h
index d321d34cb..058f68ca2 100644
--- a/lib/iconveh.h
+++ b/lib/iconveh.h
@@ -29,7 +29,10 @@ enum iconv_ilseq_handler
 {
   iconveh_error,/* return and set errno = EILSEQ */
   iconveh_question_mark,/* use one '?' per unconvertible character */
-  iconveh_escape_sequence   /* use escape sequence \u or \U */
+  iconveh_escape_sequence,  /* use escape sequence \u or \U */
+  iconveh_replacement_character /* use one U+FFFD per unconvertible character
+   if that fits in the target encoding,
+   otherwise one '?' */
 };
 
 
diff --git a/lib/striconveh.c b/lib/striconveh.c
index 4aa8a2f07..612c38c3e 100644
--- a/lib/striconveh.c
+++ b/lib/striconveh.c
@@ -457,13 +457,18 @@ mem_cd_iconveh_internal (const char *src, size_t srclen,
 if (cd2 == (iconv_t)(-1))
   {
 /* TO_CODESET is UTF-8.  */
-/* Error handling can produce up to 1 byte of output.  */
-if (length + 1 + extra_alloc > allocated)
+/* Error handling can produce up to 1 or 3 bytes of
+   output.  */
+size_t extra_need =
+  (handler == iconveh_replacement_character ? 3 : 1);
+if (length + extra_need + extra_alloc > allocated)
   {
 char *memory;
 
 allocated = 2 * allocated;
-if (length + 1 + extra_alloc > allocated)
+if (length + extra_need + extra_alloc > allocated)
+  allocated = 2 * allocated;
+if (length + extra_need + extra_alloc > allocated)
   abort ();
 if (result == initial_result)
   memory = (char *) malloc (allocated);
@@ -482,7 +487,7 @@ mem_cd_iconveh_internal (const char *src, size_t srclen,
 grow = false;
   }
 /* The input is invalid in FROM_CODESET.  Eat up one byte
-   and emit a question mark.  */
+   and emit a replacement character or a question mark.  */
 if (!incremented)
   {
 if (insize == 0)
@@ -490,8 +495,19 @@ mem_cd_iconveh_internal (const char *src, size_t srclen,
 inptr++;
 insize--;
   }
-result[length] = '?';
-length++;
+if (handler == iconveh_replacement_character)
+  {
+/* U+FFFD in UTF-8 encoding.  */
+result[length+0] = '\357';
+result[length+1] = '\277';
+result[length+2] = '\275';
+length += 3;
+  }
+else
+  {
+result[length] = '?';
+length++;
+  }
   }
 else
   goto indirectly;
@@ -594,7 +610,7 @@ mem_cd_iconveh_internal (const char *src, size_t srclen,
   {
 const bool slowly = (offsets != NULL || handler == iconveh_error);
 # define utf8bufsize 4096 /* may also be smaller or larger than tmpbufsize */
-char utf8buf[utf8bufsize + 1];
+char utf8buf[utf8bufsize + 3];
 size_t utf8len = 0;
 const char *in1ptr = src;
 size_t in1size = srclen;
@@ -682,8 +698,8 @@ mem_cd_iconveh_internal (const char *src, size_t srclen,

Some generated header files are messed up on Alpine

2022-01-01 Thread Tim Rühsen

Hi,

I just updated gnulib to commit 2671376bc and have build issues on 
Alpine (using docker; FROM alpine:latest).

I'm not saying this commit is faulty; I haven't updated gnulib since April 2021.

Some generated header files look like

bash-5.1# head -3 lib/uniwidth.h
/* DO NOT EDIT! GENERATED AUTOMATICALLY! */
 2001-2002, 2005, 2007, 2009-2021 Free Software Foundation,
   Inc.

This obviously won't compile.

Another example is:

bash-5.1# head -3 lib/unitypes.h
/* DO NOT EDIT! GENERATED AUTOMATICALLY! */
niString library.
   Copyright (C) 2002, 2005-2006, 2009-2021 Free Software Foundation, Inc.

I tested with ./bootstrap from gnulib/build-aux/.
This issue does not appear on Debian Linux (stable, testing, unstable).

If this is not reproducible or the issue is not obvious, please let me 
know so that I can provide a Dockerfile (or docker image) plus 
instructions on how to reproduce.


Regards, Tim




[PATCH] maint: fix ‘make update-copyright’ on symlinks

2022-01-01 Thread Paul Eggert
After running ‘make update-copyright’ I noticed that it
incorrectly replaced a couple of symlinks with their contents.
* Makefile (update-copyright): Do not update symlinks.
* etc/license-notices/GPL, etc/license-notices/LGPL:
Change these back to symlinks.
---
 ChangeLog|  9 
 Makefile |  3 +++
 etc/license-notices/GPL  | 44 +---
 etc/license-notices/LGPL | 29 +-
 4 files changed, 14 insertions(+), 71 deletions(-)
 mode change 100644 => 120000 etc/license-notices/GPL
 mode change 100644 => 120000 etc/license-notices/LGPL

diff --git a/ChangeLog b/ChangeLog
index e2f467e75b..28d95dedec 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -1,3 +1,12 @@
+2022-01-01  Paul Eggert  
+
+   maint: fix ‘make update-copyright’ on symlinks
+   After running ‘make update-copyright’ I noticed that it
+   incorrectly replaced a couple of symlinks with their contents.
+   * Makefile (update-copyright): Do not update symlinks.
+   * etc/license-notices/GPL, etc/license-notices/LGPL:
+   Change these back to symlinks.
+
 2021-12-31  Bruno Haible  
 
unistdio: Prefer newer version to older, buggy one.
diff --git a/Makefile b/Makefile
index 913407fa78..85362c8c2c 100644
--- a/Makefile
+++ b/Makefile
@@ -173,6 +173,9 @@ update-copyright:
done > $$exempt;\
git ls-files tests/unictype >> $$exempt;\
git ls-files doc/INSTALL* >> $$exempt;  \
+   for file in $$(git ls-files); do\
+ test ! -h $$file || echo $$file;  \
+   done >> $$exempt;   \
git ls-files | grep -vFf $$exempt   \
  | xargs grep -L '^/\*.*GENERATED AUTOMATICALLY'   \
  | UPDATE_COPYRIGHT_MAX_LINE_LENGTH=79 \
diff --git a/etc/license-notices/GPL b/etc/license-notices/GPL
deleted file mode 100644
index f6b0d67689..00
--- a/etc/license-notices/GPL
+++ /dev/null
@@ -1,43 +0,0 @@
-
-   This file is free software: you can redistribute it and/or modify
-   it under the terms of the GNU General Public License as published
-   by the Free Software Foundation; either version 3 of the License,
-   or (at your option) any later version.
-
-   This file is distributed in the hope that it will be useful,
-   but WITHOUT ANY WARRANTY; without even the implied warranty of
-   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
-   GNU General Public License for more details.
-
-   You should have received a copy of the GNU General Public License
-   along with this program.  If not, see .  */
-
-
-
- * This file is free software: you can redistribute it and/or modify
- * it under the terms of the GNU General Public License as published
- * by the Free Software Foundation; either version 3 of the License,
- * or (at your option) any later version.
- *
- * This file is distributed in the hope that it will be useful,
- * but WITHOUT ANY WARRANTY; without even the implied warranty of
- * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
- * GNU General Public License for more details.
- *
- * You should have received a copy of the GNU General Public License
- * along with this program.  If not, see .
-
-
-
-# This file is free software: you can redistribute it and/or modify
-# it under the terms of the GNU General Public License as published
-# by the Free Software Foundation; either version 3 of the License,
-# or (at your option) any later version.
-#
-# This file is distributed in the hope that it will be useful,
-# but WITHOUT ANY WARRANTY; without even the implied warranty of
-# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
-# GNU General Public License for more details.
-#
-# You should have received a copy of the GNU General Public License
-# along with this program.  If not, see .
diff --git a/etc/license-notices/GPL b/etc/license-notices/GPL
new file mode 120000
index 00..fbd0cdcabe
--- /dev/null
+++ b/etc/license-notices/GPL
@@ -0,0 +1 @@
+GPLv3+
\ No newline at end of file
diff --git a/etc/license-notices/LGPL b/etc/license-notices/LGPL
deleted file mode 100644
index 5126fcf819..00
--- a/etc/license-notices/LGPL
+++ /dev/null
@@ -1,28 +0,0 @@
-
-   This file is free software: you can redistribute it and/or modify
-   it under the terms of the GNU Lesser General Public License as
-   published by the Free Software Foundation; either version 3 of the
-   License, or (at your option) any later version.
-
-   This file is distributed in the hope that it will be useful,
-   but WITHOUT ANY WARRANTY; without even the implied warranty of
-   MERCHANTABILITY or FITNESS FOR A PARTICULAR 

Re: Bytewise u??_conv_from_encoding

2022-01-01 Thread Marc Nieper-Wißkirchen
Hi Bruno,

thanks for your insights, valuable as always.

On Sat, Jan 1, 2022 at 13:57, Bruno Haible wrote:
>
> Hi Marc,
>
> > The demand to read a file (in local encoding) and to decode it
> > incrementally seems a typical one.
>
> There are five ways to satisfy this demand.
>
> (A) Using a pipe at the shell level:
>   iconv -t UTF-8 | my-program
>
> (B) Using a programming language that has a coroutines concept.
> This way, both the decoder and the consumer can be programmed in
> a straightforward manner.
>
> (C) In C, with multiple threads.
>
> (D) In C, with a decoder programmed in a straightforward manner
> and a consumer that is written as a callback with state.
>
> (E) In C, with a decoder written as a callback with state
> and a consumer programmed in a straightforward manner.
>
> > Thus, I am wondering whether it makes sense to offer a stateful
> > decoder that takes byte by byte and signals as soon as a decoded byte
> > sequence is ready.
>
> It seems that you are thinking of approach (D).

> I think (D) is the worst, because writing application code in a callback
> style with state is hard and error-prone. I would favour (E) instead,
> if (A) is not possible.

If I understand your classification correctly, I meant something more
like (E) than (D), I think. As an interface, I would propose
something along the following lines:

decoder_t d = decoder_create (cd);  /* cd is an iconveh_t * */
switch (decoder_push (d, byte))
  {
  case DECODER_BYTE_READ:
    {
      char *res = decoder_result (d);
      size_t len = decoder_length (d);
      ...
    }
    break;
  case DECODER_EOF:
    ...
    break;
  case DECODER_INCOMPLETE:
    ...
    break;
  case DECODER_ERROR:
    ...
    break;
  }
...
decoder_destroy (d);
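
To make the push style concrete, here is a minimal, self-contained C sketch of a stateful decoder that is fed one byte at a time. It is only an illustration under assumptions: the proposal above wraps an iconveh_t conversion, whereas this sketch swaps in plain UTF-8 decoding so that it needs no iconv, and all names (utf8_push_decoder, utf8_push, the PUSH_* values) are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

enum push_status { PUSH_READY, PUSH_INCOMPLETE, PUSH_ERROR };

struct utf8_push_decoder
{
  uint32_t code;   /* code point accumulated so far */
  uint32_t min;    /* smallest code point valid for this sequence length */
  int remaining;   /* continuation bytes still expected */
};

static void
utf8_push_init (struct utf8_push_decoder *d)
{
  d->code = 0;
  d->min = 0;
  d->remaining = 0;
}

/* Feed one BYTE.  Return PUSH_READY and store the code point in *OUT
   when a character is complete, PUSH_INCOMPLETE when more bytes are
   needed, PUSH_ERROR on invalid input.  After an error the decoder is
   reset and the offending byte is dropped.  */
static enum push_status
utf8_push (struct utf8_push_decoder *d, unsigned char byte, uint32_t *out)
{
  if (d->remaining == 0)
    {
      if (byte < 0x80)
        {
          *out = byte;
          return PUSH_READY;
        }
      else if (byte >= 0xC2 && byte <= 0xDF)
        { d->code = byte & 0x1F; d->remaining = 1; d->min = 0x80; }
      else if ((byte & 0xF0) == 0xE0)
        { d->code = byte & 0x0F; d->remaining = 2; d->min = 0x800; }
      else if (byte >= 0xF0 && byte <= 0xF4)
        { d->code = byte & 0x07; d->remaining = 3; d->min = 0x10000; }
      else
        return PUSH_ERROR;      /* stray continuation or invalid lead byte */
      return PUSH_INCOMPLETE;
    }
  if ((byte & 0xC0) != 0x80)
    {
      utf8_push_init (d);
      return PUSH_ERROR;        /* expected a continuation byte */
    }
  d->code = (d->code << 6) | (byte & 0x3F);
  if (--d->remaining > 0)
    return PUSH_INCOMPLETE;
  uint32_t c = d->code;
  utf8_push_init (d);
  /* Reject overlong forms, surrogates and values beyond U+10FFFF.  */
  if (c < d->min + 0 || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
    return PUSH_ERROR;
  *out = c;
  return PUSH_READY;
}
```

Note that the range check must use the minimum recorded for the current sequence, so a real implementation would read d->min before resetting the state; see the caveat below. A caller loops over its input, calls utf8_push for each byte, and acts on the code point whenever PUSH_READY comes back.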

> (B) means to use a different programming language. I can't recommend C++ [1].

The main problem I see with C++'s coroutines is that they are
stackless coroutines; their expressiveness is tiny compared to
languages with full coroutine support, to say nothing of programming
languages like Scheme with its first-class continuations.

> (C) is possible, but complex. See e.g. gnulib's pipe-filter-ii.c or
> pipe-filter-gi.c. Generally, threads are overkill when all you need are
> coroutines.

I agree. Unfortunately, POSIX's response to dropping makecontext and
friends seems to be to use threads. It would be great if C had a
lightweight context-swapping mechanism.

> Now, when implementing (E), it will be useful to have some kind of "abstract
> input stream" data type. Such a thing does not exist in C, for historical
> reasons. But it can be done similarly to the "abstract output stream" data
> type that is at the heart of GNU libtextstyle [2][3][4].

I will have to take a closer look at that library.

> > On top of that, a decoding Unicode mbfile interface can be built, say 
> > ucfile.
>
> One of the problems of byte-by-byte decoding is that it's inefficient. It's
> way more efficient to do the same task (decoding, consuming) on an entire
> buffer of, say, at least 1 KiB. Buffering minimizes the context switches and
> time spent in function entry/exit. That needs to be considered in the design.

The mbfile interface tries hard not to read more than necessary in
advance to support interactive streams. That possibility should be
preserved, I think.

In my API proposal above, decoder_push can be redesigned to look as follows:

int decoder_push (decoder_t decoder, char *src, size_t srclen)

By the way, libunistring's u??_conv_from_encoding does not seem to be
adapted to consuming buffers. The problem is that one doesn't know in
advance where boundaries of multi-byte sequences are so
u??_conv_from_encoding will likely signal a decoding error.

What would be more helpful would be a version of
u??_conv_from_encoding that returns the decoded part of the string
before the invalid sequence and that gives the position of the invalid
sequence. For piping purposes, it would still not be very comfortable
because one then would have to copy by hand the undecoded part of the
string to the beginning of the buffer and refill the rest of the
buffer from the source.
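
The "copy by hand" step described above can be sketched in C as follows. This is a hedged illustration, not libunistring code: it assumes the input is UTF-8 so that an incomplete trailing sequence can be recognized without iconv, and the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Return how many bytes at the end of BUF[0..LEN) form the start of a
   UTF-8 sequence whose remaining bytes have not arrived yet (0 to 3).  */
static size_t
utf8_incomplete_tail (const unsigned char *buf, size_t len)
{
  size_t max = len < 3 ? len : 3;
  for (size_t i = 1; i <= max; i++)
    {
      unsigned char b = buf[len - i];
      if ((b & 0xC0) == 0x80)
        continue;               /* continuation byte, keep looking back */
      size_t expected = b >= 0xF0 ? 4 : b >= 0xE0 ? 3 : b >= 0xC0 ? 2 : 1;
      return expected > i ? i : 0;
    }
  return 0;     /* only continuation bytes seen: leave it to the decoder */
}

/* Move the TAIL undecoded bytes at the end of BUF[0..LEN) to the front.
   Return the offset at which the next read should store new input.  */
static size_t
carry_over_tail (unsigned char *buf, size_t len, size_t tail)
{
  assert (tail <= len);
  memmove (buf, buf + (len - tail), tail);
  return tail;
}
```

A reader loop would decode BUF[0..LEN - tail), call carry_over_tail, and then refill the buffer starting at the returned offset.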

Marc



Re: Bytewise u??_conv_from_encoding

2022-01-01 Thread Bruno Haible
Hi Marc,

> The demand to read a file (in local encoding) and to decode it
> incrementally seems a typical one.

There are five ways to satisfy this demand.

(A) Using a pipe at the shell level:
  iconv -t UTF-8 | my-program

(B) Using a programming language that has a coroutines concept.
This way, both the decoder and the consumer can be programmed in
a straightforward manner.

(C) In C, with multiple threads.

(D) In C, with a decoder programmed in a straightforward manner
and a consumer that is written as a callback with state.

(E) In C, with a decoder written as a callback with state
and a consumer programmed in a straightforward manner.

> Thus, I am wondering whether it makes sense to offer a stateful
> decoder that takes byte by byte and signals as soon as a decoded byte
> sequence is ready.

It seems that you are thinking of approach (D).

I think (D) is the worst, because writing application code in a callback
style with state is hard and error-prone. I would favour (E) instead,
if (A) is not possible.

(B) means to use a different programming language. I can't recommend C++ [1].

(C) is possible, but complex. See e.g. gnulib's pipe-filter-ii.c or
pipe-filter-gi.c. Generally, threads are overkill when all you need are
coroutines.

Now, when implementing (E), it will be useful to have some kind of "abstract
input stream" data type. Such a thing does not exist in C, for historical
reasons. But it can be done similarly to the "abstract output stream" data
type that is at the heart of GNU libtextstyle [2][3][4].
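
As a rough idea of what such an "abstract input stream" could look like in C, here is a minimal sketch: a read callback plus private state, with the consumer written as a plain loop in the spirit of (E). The type and function names are hypothetical and are not taken from libtextstyle; the in-memory source merely stands in for a decoder.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* An abstract input stream: a read callback plus its private state.  */
struct istream
{
  /* Store up to SIZE bytes in BUF; return the count, 0 at end of input.  */
  size_t (*read) (void *state, char *buf, size_t size);
  void *state;
};

/* One concrete producer: a stream over an in-memory buffer.  */
struct mem_source
{
  const char *data;
  size_t remaining;
};

static size_t
mem_read (void *state, char *buf, size_t size)
{
  struct mem_source *src = state;
  size_t n = size < src->remaining ? size : src->remaining;
  memcpy (buf, src->data, n);
  src->data += n;
  src->remaining -= n;
  return n;
}

/* A consumer in straightforward style: it pulls up to 1 KiB at a time
   and counts newline characters, without knowing the producer.  */
static size_t
count_lines (struct istream *in)
{
  char buf[1024];
  size_t lines = 0;
  size_t n;
  while ((n = in->read (in->state, buf, sizeof buf)) > 0)
    for (size_t i = 0; i < n; i++)
      lines += buf[i] == '\n';
  return lines;
}
```

A decoding stream would supply its own read callback that converts input internally and hands out decoded bytes, leaving count_lines and similar consumers unchanged.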

> On top of that, a decoding Unicode mbfile interface can be built, say ucfile.

One of the problems of byte-by-byte decoding is that it's inefficient. It's
way more efficient to do the same task (decoding, consuming) on an entire
buffer of, say, at least 1 KiB. Buffering minimizes the context switches and
time spent in function entry/exit. That needs to be considered in the design.

Bruno

[1] https://en.cppreference.com/w/cpp/language/coroutines
[2] 
https://www.gnu.org/software/gettext/libtextstyle/manual/html_node/The-output-stream-hierarchy.html
[3] 
https://git.savannah.gnu.org/gitweb/?p=gettext.git;a=blob;f=libtextstyle/gnulib-local/lib/iconv-ostream.oo.h
[4] 
https://git.savannah.gnu.org/gitweb/?p=gettext.git;a=blob;f=libtextstyle/gnulib-local/lib/iconv-ostream.oo.c






Bytewise u??_conv_from_encoding

2022-01-01 Thread Marc Nieper-Wißkirchen
The demand to read a file (in local encoding) and to decode it
incrementally seems a typical one.

With Gnulib, this can be done by using the mbfile module to read in the
multibyte characters byte-by-byte and then using the striconveh module to
convert them into, say, UTF-8 or UTF-32.

This, however, doesn't seem to be very efficient because the
multibyte characters have to be examined at least twice: once by the mbfile
iterator and once by the striconveh iterator.

Thus, I am wondering whether it makes sense to offer a stateful
decoder that takes byte by byte and signals as soon as a decoded byte
sequence is ready.

On top of that, a decoding Unicode mbfile interface can be built, say ucfile.

Thanks,

Marc