Module Name: src
Committed By: riastradh
Date: Fri Aug 16 23:11:03 UTC 2024
Modified Files:
src/lib/libc/locale: mbrtoc16.3 mbrtoc32.3
Log Message:
mbrtoc16(3), mbrtoc32(3): Work on deturgidifying prose.
Still maybe not great but at least there's less jargon in most of the
text, without really losing any content.
PR lib/52374: <uchar.h> missing
To generate a diff of this commit:
cvs rdiff -u -r1.5 -r1.6 src/lib/libc/locale/mbrtoc16.3 \
src/lib/libc/locale/mbrtoc32.3
Please note that diffs are not public domain; they are subject to the
copyright notices on the relevant files.
Modified files:
Index: src/lib/libc/locale/mbrtoc16.3
diff -u src/lib/libc/locale/mbrtoc16.3:1.5 src/lib/libc/locale/mbrtoc16.3:1.6
--- src/lib/libc/locale/mbrtoc16.3:1.5 Fri Aug 16 13:37:43 2024
+++ src/lib/libc/locale/mbrtoc16.3 Fri Aug 16 23:11:02 2024
@@ -1,4 +1,4 @@
-.\" $NetBSD: mbrtoc16.3,v 1.5 2024/08/16 13:37:43 riastradh Exp $
+.\" $NetBSD: mbrtoc16.3,v 1.6 2024/08/16 23:11:02 riastradh Exp $
.\"
.\" Copyright (c) 2024 The NetBSD Foundation, Inc.
.\" All rights reserved.
@@ -30,7 +30,7 @@
.\"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
.Sh NAME
.Nm mbrtoc16
-.Nd Restartable multibyte to UTF-16 code unit conversion
+.Nd Restartable multibyte to UTF-16 conversion
.\"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
.Sh LIBRARY
.Lb libc
@@ -50,20 +50,37 @@
.Sh DESCRIPTION
The
.Nm
-function attempts to decode a multibyte character sequence at
-.Fa s
-of up to
+decodes multibyte characters in the current locale and converts them to
+UTF-16, keeping state so it can restart after incremental progress.
+.Pp
+Each call to
+.Nm :
+.Bl -enum -compact
+.It
+examines up to
.Fa n
-bytes in the current locale, and yield the content as UTF-16 code
-units via the output parameter
-.Fa pc16 .
-.Fa pc16
-may be null, in which case no output is stored.
+bytes starting at
+.Fa s ,
+.It
+yields a UTF-16 code unit if available by storing it at
+.Li * Ns Fa pc16 ,
+.It
+saves state at
+.Fa ps ,
+and
+.It
+returns either the number of bytes consumed if any or a special return
+value.
+.El
+.Pp
+Specifically:
.Bl -bullet
.It
If the multibyte sequence at
.Fa s
-is invalid or an error occurs in decoding,
+is invalid after any previous input saved at
+.Fa ps ,
+or if an error occurs in decoding,
.Nm
returns
.Li (size_t)-1
@@ -75,7 +92,7 @@ If the multibyte sequence at
.Fa s
is still incomplete after
.Fa n
-bytes, including any previously processed input saved in
+bytes, including any previous input saved in
.Fa ps ,
.Nm
saves its state in
@@ -85,53 +102,33 @@ after all the input so far and returns
.It
If
.Nm
-finds the null scalar value at
-.Fa s ,
-then it stores zero at
+had previously decoded a multibyte character but has not yet yielded
+all the code units of its UTF-16 encoding, it stores the next UTF-16
+code unit at
.Li * Ns Fa pc16
-and returns zero.
+and returns
+.Li "(size_t)-3" .
.It
If
.Nm
-finds a nonnull scalar value in the Basic Multilingual Plane (BMP),
-i.e., a 16-bit scalar value, then it stores the scalar value at
-.Li * Ns Fa pc16 ,
-and returns the number of bytes it read from the input.
+decodes the null multibyte character, then it stores zero at
+.Li * Ns Fa pc16
+and returns zero.
.It
-If
+Otherwise,
.Nm
-finds a scalar value outside the BMP, then it:
-.Bl -dash -compact
-.It
-stores the scalar value's high surrogate code point at
-.Li * Ns Fa pc16 ;
-.It
-stores conversion state in
-.Fa ps
-to remember the rest of the pending scalar value; and
-.It
-returns the number of bytes it read from the input.
+decodes a single multibyte character, stores the first (and possibly
+only) code unit in its UTF-16 encoding at
+.Li * Ns Fa pc16 ,
+and returns the number of bytes consumed to decode the first multibyte
+character.
.El
-.It
+.Pp
If
-.Nm
-had previously found a scalar value outside the BMP, then, instead of
-any of the above options, it:
-.Bl -dash -compact
-.It
-stores the scalar value's low surrogate code point at
-.Li * Ns Fa pc16 ;
-.It
-consumes rest of the pending scalar value from the conversion state
-.Fa ps ;
-and
-.It
-returns
-.Li (size_t)-3
-to indicate that no bytes were consumed but a code unit was yielded
-nevertheless.
-.El
-.El
+.Fa pc16
+is a null pointer, nothing is stored, but the effects on
+.Fa ps
+and the return value are unchanged.
.Pp
If
.Fa s
@@ -174,6 +171,15 @@ and
which is initialized at program startup to the initial conversion
state.
.\"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
+.Sh IMPLEMENTATION NOTES
+On well-formed input, the
+.Nm
+function yields either a Unicode scalar value in the Basic Multilingual
+Plane (BMP), i.e., a 16-bit Unicode code point that is not a surrogate
+code point, or, over two successive calls, yields the high and low
+surrogate code points (in that order) of a Unicode scalar value outside
+the BMP.
+.\"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
.Sh RETURN VALUES
The
.Nm
@@ -197,26 +203,21 @@ if
consumed
.Ar i
bytes of input to decode the next multibyte character, yielding a
-(nonnull) UTF-16 code unit, either a Unicode scalar value in the BMP or
-a high surrogate code point.
+UTF-16 code unit.
.It Li (size_t)-3
.Bq continuation
if
.Nm
-consumed no bytes of input but yielded a (nonnull) UTF-16 code unit, a
-low surrogate code point, because the previous call to
-.Nm
-with
-.Fa ps
-had yielded a high surrogate code point for a Unicode scalar value
-outside the BMP.
+consumed no new bytes of input but yielded a UTF-16 code unit that was
+pending from previous input.
.It Li (size_t)-2
.Bq incomplete
if
.Nm
-found an incomplete multibyte character after all
+found only an incomplete multibyte sequence after all
.Fa n
-bytes of input, and saved its state to restart in the next call with
+bytes of input and any previous input, and saved its state to restart
+in the next call with
.Fa ps .
.It Li (size_t)-1
.Bq error
@@ -262,7 +263,8 @@ while (n) {
.Sh ERRORS
.Bl -tag -width Bq
.It Bq Er EILSEQ
-The multibyte sequence cannot be decoded as a Unicode scalar value.
+The multibyte sequence cannot be decoded in the current locale as a
+Unicode scalar value.
.It Bq Er EIO
An error occurred in loading the locale's character conversions.
.El
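The return-value rules in the revised mbrtoc16.3 text can be sketched as a small decode loop. This is only an illustration, not code from the commit: decode_utf16 is a hypothetical helper name, and the sketch assumes that a high surrogate is always followed by a (size_t)-3 call yielding the low surrogate, as the new IMPLEMENTATION NOTES section describes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <uchar.h>

/*
 * decode_utf16 is a hypothetical helper, not part of libc.
 * It decodes the n bytes at s in the current locale into at most
 * outlen UTF-16 code units at out, stopping early at a null character.
 * Returns the number of code units stored, or (size_t)-1 if the input
 * is invalid or truncated, or if out is too small.
 */
size_t
decode_utf16(const char *s, size_t n, char16_t *out, size_t outlen)
{
	mbstate_t ps;
	size_t nout = 0;
	int pending = 0;	/* nonzero if the last unit was a high surrogate */

	assert(s != NULL && out != NULL);
	memset(&ps, 0, sizeof(ps));	/* initial conversion state */

	/* Keep calling while input remains or a low surrogate is owed. */
	while (n > 0 || pending) {
		char16_t c16;
		size_t len = mbrtoc16(&c16, s, n, &ps);

		if (len == (size_t)-1 || len == (size_t)-2)
			return (size_t)-1;	/* invalid or truncated input */
		if (len == 0)
			break;			/* decoded the null character */
		if (len != (size_t)-3) {	/* -3 consumes no new bytes */
			s += len;
			n -= len;
		}
		if (nout == outlen)
			return (size_t)-1;	/* output buffer full */
		out[nout++] = c16;
		/* Assumption: a high surrogate means a -3 call comes next. */
		pending = (c16 >= 0xD800 && c16 <= 0xDBFF);
	}
	return nout;
}
```

A caller that wants anything beyond the "C" locale would presumably call setlocale(LC_CTYPE, "") first; without that, only the portable character set can be relied on.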
Index: src/lib/libc/locale/mbrtoc32.3
diff -u src/lib/libc/locale/mbrtoc32.3:1.5 src/lib/libc/locale/mbrtoc32.3:1.6
--- src/lib/libc/locale/mbrtoc32.3:1.5 Fri Aug 16 13:37:43 2024
+++ src/lib/libc/locale/mbrtoc32.3 Fri Aug 16 23:11:03 2024
@@ -1,4 +1,4 @@
-.\" $NetBSD: mbrtoc32.3,v 1.5 2024/08/16 13:37:43 riastradh Exp $
+.\" $NetBSD: mbrtoc32.3,v 1.6 2024/08/16 23:11:03 riastradh Exp $
.\"
.\" Copyright (c) 2024 The NetBSD Foundation, Inc.
.\" All rights reserved.
@@ -30,7 +30,7 @@
.\"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
.Sh NAME
.Nm mbrtoc32
-.Nd Restartable multibyte to UTF-32 code unit conversion
+.Nd Restartable multibyte to UTF-32 conversion
.\"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
.Sh LIBRARY
.Lb libc
@@ -50,20 +50,39 @@
.Sh DESCRIPTION
The
.Nm
-function attempts to decode a multibyte character sequence at
-.Fa s
-of up to
+decodes multibyte characters in the current locale and converts them to
+Unicode scalar values (i.e., to UTF-32), keeping state so it can
+restart after incremental progress.
+.Pp
+Each call to
+.Nm :
+.Bl -enum -compact
+.It
+examines up to
.Fa n
-bytes in the current locale, and yield the content as UTF-32 code
-units, i.e., Unicode scalar values, via the output parameter
-.Fa pc32 .
-.Fa pc32
-may be null, in which case no output is stored.
+bytes starting at
+.Fa s ,
+.It
+yields a Unicode scalar value (i.e., a UTF-32 code unit) if available
+by storing it at
+.Li * Ns Fa pc32 ,
+.It
+saves state at
+.Fa ps ,
+and
+.It
+returns either the number of bytes consumed if any or a special return
+value.
+.El
+.Pp
+Specifically:
.Bl -bullet
.It
If the multibyte sequence at
.Fa s
-is invalid or an error occurs in decoding,
+is invalid after any previous input saved at
+.Fa ps ,
+or if an error occurs in decoding,
.Nm
returns
.Li (size_t)-1
@@ -75,7 +94,7 @@ If the multibyte sequence at
.Fa s
is still incomplete after
.Fa n
-bytes, including any previously processed input saved in
+bytes, including any previous input saved in
.Fa ps ,
.Nm
saves its state in
@@ -85,20 +104,26 @@ after all the input so far and returns
.It
If
.Nm
-finds the null scalar value at
-.Fa s ,
-then it stores zero at
+decodes the null multibyte character, then it stores zero at
.Li * Ns Fa pc32
and returns zero.
.It
-If
+Otherwise,
.Nm
-finds a nonnull scalar value, then it stores the scalar value at
+decodes a single multibyte character, stores its Unicode scalar value
+at
.Li * Ns Fa pc32 ,
-and returns the number of bytes it read from the input.
+and returns the number of bytes consumed to decode the first multibyte
+character.
.El
.Pp
If
+.Fa pc32
+is a null pointer, nothing is stored, but the effects on
+.Fa ps
+and the return value are unchanged.
+.Pp
+If
.Fa s
is a null pointer, the
.Nm
@@ -162,14 +187,15 @@ if
consumed
.Ar i
bytes of input to decode the next multibyte character, yielding a
-(nonnull) Unicode scalar value.
+Unicode scalar value.
.It Li (size_t)-2
.Bq incomplete
if
.Nm
-found an incomplete multibyte character after all
+found only an incomplete multibyte sequence after all
.Fa n
-bytes of input, and saved its state to restart in the next call with
+bytes of input and any previous input, and saved its state to restart
+in the next call with
.Fa ps .
.It Li (size_t)-1
.Bq error
@@ -211,10 +237,8 @@ while (n) {
.Sh ERRORS
.Bl -tag -width Bq
.It Bq Er EILSEQ
-A surrogate code point was passed.
-.It Bq Er EILSEQ
-The Unicode scalar value requested cannot be encoded as a multibyte
-sequence in the current locale.
+The multibyte sequence cannot be decoded in the current locale as a
+Unicode scalar value.
.It Bq Er EIO
An error occurred in loading the locale's character conversions.
.El
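The mbrtoc32.3 wording about a null pc32 ("nothing is stored, but the effects on ps and the return value are unchanged") suggests a simple way to count characters without keeping them. The following is a rough sketch under that reading, not code from the commit; count_scalars is a hypothetical helper name. The page documents no (size_t)-3 return for mbrtoc32, so the sketch treats one as a failure rather than guessing at its meaning.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <uchar.h>

/*
 * count_scalars is a hypothetical helper, not part of libc.
 * It counts the multibyte characters in the n bytes at s, decoded in
 * the current locale, stopping early at a null character.
 * Returns the count, or (size_t)-1 if the input is invalid or truncated.
 */
size_t
count_scalars(const char *s, size_t n)
{
	mbstate_t ps;
	size_t count = 0;

	assert(s != NULL);
	memset(&ps, 0, sizeof(ps));	/* initial conversion state */

	while (n > 0) {
		/* Null pc32: decode and advance, but store nothing. */
		size_t len = mbrtoc32(NULL, s, n, &ps);

		if (len == (size_t)-1 || len == (size_t)-2 ||
		    len == (size_t)-3)
			return (size_t)-1;	/* invalid, truncated, or unexpected */
		if (len == 0)
			break;			/* decoded the null character */
		s += len;
		n -= len;
		count++;
	}
	return count;
}
```

As with any use of these functions, the result depends on the LC_CTYPE locale in effect at the time of the call.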