Re: SIGSTKSZ is now a run-time variable

2021-03-09 Thread Eric Blake via austin-group-l at The Open Group
[adding glibc and Austin group lists]

On 3/6/21 12:50 PM, Bruno Haible wrote:
> Hi,
> 
> Carol Bouchard wrote in 
> :
>> A change that was introduced is that the
>> #define SIGSTKSZ is no longer a statically defined variable.  Its value can
>> only be determined at run time.
>>
>> # define SIGSTKSZ sysconf (_SC_SIGSTKSZ)
> 
> This is invalid. POSIX:2018 [1] defines two lists of macros:
> 
>   1) "The <signal.h> header shall define the following macros which shall
>   expand to integer constant expressions that need not be usable in
>   #if preprocessing directives:"
> 
>   2) "The <signal.h> header shall also define the following symbolic
>   constants:"
> 
> SIGSTKSZ is in the second list. This implies that it must expand to a constant
> and that it must be usable in #if preprocessing directives.

The question becomes whether glibc is in violation of POSIX for having
made the change, or whether POSIX needs to be amended to allow SIGSTKSZ
to be non-preprocessor-safe and/or non-constant.

> 
> Besides being invalid, it is also not needed. The alternate signal stack
> needs to be dimensioned according to the CPU and ABI that is in use. For
> example, SPARC processors tend to use much more stack space than x86 per
> function invocation. Similarly, 64-bit execution on a bi-arch CPU tends to
> use more stack space than 32-bit execution, because return addresses and
> other pointers are 64-bit vs. 32-bit large. But once you have fixed the CPU
> and the ABI, there is no ambiguity any more.
> 
>> This affects m4 code since the code assumes a statically defined variable
>> which can be determined at preprocessor time.
> 
> POSIX guarantees this assumption.
> 
>> Please advise how I can get past this.
> 
> Fix your <signal.h>.

https://sourceware.org/git/?p=glibc.git;a=commitdiff;h=6c57d320484988e87e446e2e60ce42816bf51d53
shows where glibc made the change, and I've now seen reports of several
projects failing to build when using glibc with this change included.
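
For projects hit by this, one possible workaround is to stop sizing a
static buffer with SIGSTKSZ and to allocate the alternate stack at run
time instead. What follows is only a minimal sketch under stated
assumptions, not something taken from the glibc commit: the helper names
altstack_size() and install_altstack() are made up for illustration, and
_SC_SIGSTKSZ (a glibc extension, not POSIX) is used only when the headers
define it as a macro.

```c
#define _XOPEN_SOURCE 700   /* expose stack_t and sigaltstack() in strict modes */
#include <signal.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical helper: pick an alternate-stack size without assuming
 * that SIGSTKSZ is a compile-time constant. */
size_t altstack_size(void)
{
#ifdef _SC_SIGSTKSZ
    /* glibc extension, not in POSIX: ask the running system. */
    long n = sysconf(_SC_SIGSTKSZ);
    if (n > 0)
        return (size_t)n;
#endif
    /* Fallback: works whether SIGSTKSZ is a constant or a function
     * call, because it is evaluated here at run time. */
    return (size_t)SIGSTKSZ;
}

/* Hypothetical helper: allocate and register the alternate stack.
 * Returns 0 on success, -1 on failure. */
int install_altstack(void)
{
    stack_t ss;

    ss.ss_size = altstack_size();
    ss.ss_sp = malloc(ss.ss_size);
    if (ss.ss_sp == NULL)
        return -1;
    ss.ss_flags = 0;
    if (sigaltstack(&ss, NULL) != 0) {
        free(ss.ss_sp);
        return -1;
    }
    return 0;
}
```

Code written this way never uses SIGSTKSZ in an #if directive or as an
array bound, so it should build against both the old and the new headers.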

> 
> Bruno
> 
> [1] https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/signal.h.html
> 
> 

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3226
Virtualization:  qemu.org | libvirt.org



Re: SIGSTKSZ is now a run-time variable

2021-03-09 Thread Eric Blake via austin-group-l at The Open Group
On 3/9/21 9:26 AM, Andreas Schwab wrote:
> On Mär 09 2021, Eric Blake via Libc-alpha wrote:
> 
>> The question becomes whether glibc is in violation of POSIX for having
>> made the change, or whether POSIX needs to be amended to allow SIGSTKSZ
>> to be non-preprocessor-safe and/or non-constant.
> 
> POSIX already allows non-preprocessor-safe.

True, but a 'SIGSTKSZ' that expands to 'sysconf (_SC_SIGSTKSZ)' is not a
symbolic constant, as it is not "a compile-time constant expression
with an integer type", per definition 3.380.

Looks like this discussion is happening in parallel in:
https://sourceware.org/bugzilla/show_bug.cgi?id=20305

I can open a defect against POSIX if we decide that is needed, but want
some consensus first on whether it is glibc's change that went too far,
or POSIX's requirements that are too restrictive for what glibc wants to do.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3226
Virtualization:  qemu.org | libvirt.org



Re: SIGSTKSZ is now a run-time variable

2021-03-09 Thread Eric Blake via austin-group-l at The Open Group
On 3/9/21 10:14 AM, shwaresyst wrote:
> 
> To me that looks like a conformance violation and should be reverted. There 
> is no _SC_SIGSTKSZ defined in <unistd.h> by the standard, to begin with, so 
> that use of sysconf() is a non-portable extension on its own.

Portable apps can't use _SC_SIGSTKSZ, but the standard generally permits
implementations to define further constants.  Then again, re-reading XSH
2.2.2:

" Implementations may add symbols to the headers shown in the following
table, provided the identifiers for those symbols either:

Begin with the corresponding reserved prefixes in the table, or
..."

but the table lacks a row for <unistd.h> with _CS_* and _SC_* constants.
Looks like you found an independent defect.

> 
> I could see the definition of SIGSTKSZ being changed to the static minimum a 
> particular processor requires, or is initially allocated as a 'safe' amount, 
> rather than static "default size", and moving SIGSTKSZ to . This 
> would contrast to MINSIGSTKSZ as the lowest value for a platform for all 
> supported processors. Then an application could use sysconf() to query for 
> the maximum size the configuration supports if it wants to use more than 
> that, as a runtime increasable limit.

As I understand it, the concern in glibc is less about runtime
increasability, so much as ABI compatibility with applications compiled
against older headers at a time when the kernel had less state
information to store during a context switch.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3226
Virtualization:  qemu.org | libvirt.org



Re: SIGSTKSZ is now a run-time variable

2021-03-09 Thread Eric Blake via austin-group-l at The Open Group
On 3/9/21 1:34 PM, Eric Blake via austin-group-l at The Open Group wrote:
> On 3/9/21 10:14 AM, shwaresyst wrote:
>>
>> To me that looks like a conformance violation and should be reverted. There 
>> is no _SC_SIGSTKSZ defined in <unistd.h> by the standard, to begin with, so 
>> that use of sysconf() is a non-portable extension on its own.
> 
> Portable apps can't use _SC_SIGSTKSZ, but the standard generally permits
> implementations to define further constants.  Then again, re-reading XSH
> 2.2.2:
> 
> " Implementations may add symbols to the headers shown in the following
> table, provided the identifiers for those symbols either:
> 
> Begin with the corresponding reserved prefixes in the table, or
> ..."
> 
> but the table lacks a row for <unistd.h> with _CS_* and _SC_* constants.
> Looks like you found an independent defect.

Not quite, because later it states "The following identifiers are
reserved regardless of the inclusion of headers: 1. With the exception
of identifiers beginning with the prefix _POSIX_, all identifiers that
begin with an <underscore> and either an uppercase letter or another
<underscore> are always reserved for any use by the implementation.", so
an implementation can blindly add _SC_* constants at will without
violating the standard.

Still, I opened:
https://www.austingroupbugs.net/view.php?id=1456
to try and add some clarification.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3226
Virtualization:  qemu.org | libvirt.org



Question regarding gettext behavior on iconv failure

2021-05-03 Thread Eric Blake via austin-group-l at The Open Group
Hello GNU gettext maintainers,

In today's Austin Group meeting, we developed an example of using the
proposed POSIX standardization of gettext() and encountered a situation
where we felt that GNU gettext may have a bug.  For context, the entire
example is at:
https://posix.rhansen.org/p/gettext_split

The example in question set up several .po files and a specific
environment to test various pluralization/transcoding fallbacks, and
concludes with a snippet where a string with an encoding error in
ISO-8859-1 is output in spite of an iconv failure, rather than the
string passed in to ngettext():


n_recipients = 1;
// The following outputs "1 Empfänger" encoded in UTF-8:
printf("%s\n", ngettext("recipient", "recipients", n_recipients));

bind_textdomain_codeset("mail", "ASCII");

n_recipients = 1;
// The following outputs "recipient" with the same encoding as the
// "recipient" argument to ngettext (remember, the system is assumed
// to not support conversion from ISO/IEC 8859-1 to ASCII):
printf("%s\n", ngettext("recipient", "recipients", n_recipients));
// On GNU gettext, "1 Empfänger" is output in ISO-8859-1 here (i.e.
// no conversion is done). I think we already agreed on considering
// this behavior a bug.

This raises a few questions: does the GNU gettext team agree that this
can be considered a bug, and if so, will a future gettext release behave
differently?  Or if it is intentional and not a bug, can you provide
justification for the behavior as well as tweaks to the proposed
standard wording for gettext requirements and the worked example?

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3226
Virtualization:  qemu.org | libvirt.org



Re: Interpretation starting for a 30 day review (1440)

2021-10-29 Thread Eric Blake via austin-group-l at The Open Group
On Sat, Oct 30, 2021 at 12:46:55AM +0700, Robert Elz via austin-group-l at The 
Open Group wrote:
> Date:Fri, 29 Oct 2021 10:00:04 -0700
> From:Nick Stoughton 
> Message-ID:  
> 
> 
>   | Just for reference, the C standard says:
> 
> Thanks, it was a little hard to imagine just how they would be
> able to (with a straight face) talk about args to "sh" ...
> 
>   | So I agree, we should change the wording here so that for Issue 7 we only
>   | state what implementations should expect to do when Issue 8 comes out, and
>   | give application developers strong warnings about how to work around the
>   | issues caused by the possible (certain?) loss of the '--' in existing
>   | implementations.
> 
> If there was going to be a new Issue 7 rev, before Issue 8, that would
> perhaps be a plausible approach - but unless something has changed, and
> Issue8 is not to be the next version released, that doesn't really work.

Another thing to consider: if enough implementations fix things NOW to
use "--" in system() and popen(), then by the time we actually DO
release Issue 8, it will already be common enough practice to
standardize it.  But I also agree with your argument that at a bare
minimum, we owe the reader some Rationale text explaining that older
versions of the standard did not require sane behavior for arguments
starting with '-' or '+', and that applications can always space-stuff
their commands to ensure desired behavior regardless of whether the
underlying implementation has Issue7 or Issue8 semantics (if we go
ahead and require "--" in Issue8).
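
To illustrate the space-stuffing workaround mentioned above, here is a
minimal sketch. The wrapper name system_stuffed() is made up for this
example, and none of this is proposed standard text. It prepends a
<space> so that the string handed to the shell can never begin with '-'
or '+', whether or not the underlying system() inserts "--":

```c
#include <stdlib.h>
#include <string.h>
#include <sys/wait.h>

/* Hypothetical wrapper: run cmd via system() with a leading space, so
 * the shell cannot mistake the command string for an option.
 * Returns the command's exit status, or -1 on failure or abnormal
 * termination. */
int system_stuffed(const char *cmd)
{
    size_t len = strlen(cmd);
    char *buf = malloc(len + 2);      /* space + cmd + NUL */
    int status;

    if (buf == NULL)
        return -1;
    buf[0] = ' ';
    memcpy(buf + 1, cmd, len + 1);    /* copies the terminating NUL too */
    status = system(buf);
    free(buf);
    if (status == -1 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

With such a wrapper, a call like system_stuffed("-some-tool") hands the
shell a command string that starts with a space rather than with '-'.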

At any rate, I've now filed a glibc bug, so we'll see what other libc
authors think about both the POSIX bug and your reaction about it
being premature to standardize a requirement of "--" (vs. just merely
recommending it and documenting what portable apps must do in the
meantime).

https://sourceware.org/bugzilla/show_bug.cgi?id=28519

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org



Re: [1003.1(2016/18)/Issue7+TC2 0001440]: Calling `system("-some-tool")` fails (although it is a valid `sh` command)

2021-11-01 Thread Eric Blake via austin-group-l at The Open Group
On Sat, Oct 30, 2021 at 08:21:55PM -0400, Wayne Pollock via austin-group-l at 
The Open Group wrote:
> Is it guaranteed that on conforming systems nohup (and friends) must not
> accept or delete the first "--"?  For the example to work, nohup must not
> discard the "--".  But might it?

I'm not sure why you claim nohup would not work if it discards "--".

Just because the standard does not require nohup to accept options
does not mean that implementations cannot have options as an
extension.

> 
> Section 1.4 "Utility Description Defaults" of the Introduction states
> "... Default Behavior: When this section is listed as "None.", it means that
> the implementation need not support any options. Standard utilities that do
> not accept options, but that do accept operands, shall recognize "--" as a
> first argument to be discarded. ..."
> 
> And nohup fits that description; its OPTIONS section is listed as "None".

Correct, and that text does not need changing.  As you correctly
quoted, that means that nohup MUST accept and discard an initial "--",
the same as basename (another utility where I have seen the common bug
of handling -- incorrectly in some implementations).  If you want to
invoke another app that may begin with "-", or if you want to ensure
that a later "--" is passed to the utility itself regardless of
whether nohup has the (non-standard) extension of reordering options
after arguments, you can always write:

nohup -- $utility -- $non_option

And a quick test demonstrates that at least GNU Coreutils' nohup is
compliant (it supports long options, which are already an extension to
the standard, but not short options; but it does honor -- for
attempting to execute $utility that may begin with -):

$ POSIXLY_CORRECT=1 nohup -- printf -- abc 2>/dev/null | cat
abc
$ POSIXLY_CORRECT=1 nohup printf -- abc 2>/dev/null | cat
abc
$ nohup --version | head -n1
nohup (GNU coreutils) 8.32
$ nohup -- --version
nohup: ignoring input and appending output to 'nohup.out'
nohup: failed to run command '--version': No such file or directory
$ rm nohup.out
$ 

> Maybe nohup needs to be among the utilities that do not recognize "--".

No. While we are explicit that echo is one of the few apps needing an
exception to not recognize "--", that exception does NOT need to apply
to nohup.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org



Re: how do to cmd subst with trailing newlines portable (was: does POSIX mandate whether the output…)

2022-02-08 Thread Eric Blake via austin-group-l at The Open Group
On Tue, Feb 08, 2022 at 06:53:50AM +0100, Christoph Anton Mitterer via 
austin-group-l at The Open Group wrote:
> Hey.
> 
> I'm afraid but some more questions came up on my side:
> 
> 
> 1) POSIX says:
> "The encoded values associated with <period>, <slash>, <newline>, and
> <carriage-return> shall be invariant across all locales supported by
> the implementation."
> 
> When now, for example, <period> is encoded as the byte 0x2E ... the
> consequence would be that it had to be 0x2E in all locales and their
> encodings, right?

Yes. And another fallout of that requirement: you cannot have a single
POSIX system supporting both ASCII and EBCDIC locales.  You can have
iconv and dd support for converting files between the two encodings,
but only one of those two encodings can match your current locale (all
syscalls, all filenames, and so forth, are tied to the current
encoding in use by the POSIX locale, whether that encoding be ASCII,
EBCDIC, or something else).  Any means for choosing which of those two
encodings is treated as the basis of the POSIX locale when starting a
subtree of processes that interact as a POSIX environment would be
vendor-specific interfaces outside of POSIX.

> 
> Doesn't that also mean that POSIX effectively forbids UTF16 or UTF32
> and actually any >1-byte fixed-encoding?
> Cause there it would have to be "padded" with 0x00?

Correct - a POSIX environment cannot use UTF16 or UTF32 encodings as
its basis.  Again, iconv and wide-character library calls (such as
wprintf) can support conversion of files into and out of those
encodings, but that is only file contents; all file names, syscalls,
and other aspects of the POSIX environment for cross-process
communication outside of file contents will use multi-byte encodings
where no multi-byte sequence has an embedded 0x00 byte, and NOT wide
character sequences that would represent UTF16 or UTF32 characters
directly.

> 2) When I have a shell script in some encoding, and it contains e.g.:
>   printf '.'
> would POSIX demand that this:
> a) always cause the byte 0x2E to be printed

POSIX states that <period> will be printed.  If that is the byte
0x2E, then your POSIX locale is probably ASCII-based.  But it is also
possible to have a POSIX conforming environment where the POSIX locale
is EBCDIC based, in which case it would print byte 0x4B, but that
would still be <period> for all file names and syscalls observable
from that POSIX environment.

> b) print the character 'x' according to the currently set locale, e.g.
>    if that was using UTF16, it would print the bytes 0x2e 0x00

It is not possible to have a POSIX locale based on the UTF16 encoding.
So this answer is not possible.  While you can write a file with
characters encoded in UTF16, which when recoded to a multibyte locale
form a shell script, it is only after you use iconv or fscanf or
similar to perform that encoding conversion before it actually becomes
a shell script (since sh is documented as being able to reject files
containing NUL bytes as not being a shell script).  POSIX does not
allow you to execute a file encoded in UTF16 as a shell script.

> c) print the character 'x' according to the locale in which the shell
>    parses the script (but there again, if it was UTF16... the bytes
>    0x2e 0x00)

The shell is not required to parse UTF16, because the POSIX locale
cannot be based on UTF16.

> d) Would it in some weird encodings like IBM905 cause the byte 0x4B to
>    be printed?

If you are running on an IBM machine where the POSIX locale is based
on EBCDIC, then it will indeed print the byte 0x4B.  But it will still
be <period>, as detected by all other processes reached from that
POSIX environment (and that system will necessarily be unable to have
an ASCII or UTF8 encoding in any of its locales; you are back to
having to use an extension outside of POSIX if you want to start a new
subtree of processes based on an ASCII base encoding).

> 
> 3) With respect to the command substitution with trailing newlines
> question:
> 
> Because of (2) ... would it be in any way safer to e.g.
>   printf '\056'
> (octal for . in ASCII/etc.)
> and also strip that off... rather than using '.'?

Actually, it is less portable.  \056 is a particular byte value, but
unless you know your POSIX locale is ASCII-based, you don't know
whether that byte value is <period>, or some other character, and
there are some POSIX-feasible locales where some single-byte
characters (such as 'A') may also appear in a multibyte-character
sequence.

> 
> Especially also with respect to a hypothetical UTF16/32 locale?

There is no such locale.

> 
> 4) Doesn't strictly belong here, but maybe someone knows:
> On my Debian (=> glibc) I was trying this:
> /usr/share/i18n/charmaps$ zgrep "[xX]2[eEfF]" * | grep -Ev 
> '[[:space:]](SOLIDUS|FULL STOP)$'
> 
> i.e. searching for any entries that are 0x2E or 0x2f ( . and / ),
> filtering out any who really are considered as that.
> 
> That gave quite some matches:
> BRF.gz: /x2e BRAILLE PATTERN DOTS-46
> BRF.gz: /x2f BRAILLE PA

Re: [Issue 8 drafts 0001556]: clarify meaning of \n used in a bracket expression in a sed context address or s-command

2022-04-25 Thread Eric Blake via austin-group-l at The Open Group
Adding bug-...@gnu.org into this conversation.

On Mon, Apr 25, 2022 at 02:50:22AM +0200, Christoph Anton Mitterer via 
austin-group-l at The Open Group wrote:
> Hey.
> 
> Geoff, I haven't had time yet to look at your updated proposal of
> #1550, not sure whether I manage to do it this night or in the next
> days.
> But I'll definitely reply, so please be a bit more patient. :-)
> 
> 
> However, one thing came to my mind again, which I think needs further
> discussion...
> 
> 
> 
> The current "solution" to a number of previous problems is:
> 
> Inside a bracket expression there cannot be any escape sequences.
> Therefore, there cannot be any \n (in the sense of <newline>) nor any
> \c (in the sense of "un-delimitering" the delimiter character c).
> 
> 
> While this is per se perfectly valid (and solves numerous issues), it
> has one problem:
> 
> (at least) GNU sed breaks it already!
> 
> 
> 
> As you noted yourself in
> https://www.austingroupbugs.net/view.php?id=1556#c5621
> 
> it requires POSIXLY_CORRECT=1 to work as it should.
> 
> $ printf 'a\\b\n' | sed 's/a[\n]b/X/'
> a\b
> $ printf 'a\nb\n' | sed 's/a[\n]b/X/'
> a
> b
> $ printf 'a\nb\n' | sed -z 's/a[\n]b/X/'
> X
> $ printf 'anb\n' | sed 's/a[\n]b/X/'
> anb
> $ export POSIXLY_CORRECT=1
> $ printf 'a\\b\n' | sed 's/a[\n]b/X/'
> X
> $ printf 'a\nb\n' | sed 's/a[\n]b/X/'
> a
> b
> $ printf 'a\nb\n' | sed -z 's/a[\n]b/X/'
> a
> b
> $ printf 'anb\n' | sed 's/a[\n]b/X/'
> X
> $ 
> 
> 
> NOT so for GNU's extension of '\s':
> '\s'
>  Matches whitespace characters (spaces and tabs).  Newlines
>  embedded in the pattern/hold spaces will also match...
> (and I assume neither for any similar such extensions):
> 
> $ printf 'asb\n' | sed 's/a[\s]b/X/'
> X
> $ printf 'a\\b\n' | sed 's/a[\s]b/X/'
> X
> $ printf 'a b\n' | sed 's/a[\s]b/X/'
> a b
> $ export POSIXLY_CORRECT=1
> $ printf 'asb\n' | sed 's/a[\s]b/X/'
> X
> $ printf 'a\\b\n' | sed 's/a[\s]b/X/'
> X
> $ printf 'a b\n' | sed 's/a[\s]b/X/'
> a b
> $
> 
> 
> It also works as expected for escaped delimiter characters:
> $ printf 'aDb\n' | sed 'sDa[\D]bDXD'
> X
> $ printf 'a\\b\n' | sed 'sDa[\D]bDXD'
> X
> 
> even when the delimiter char has also special meaning when escaped (as
> with '\s'):
> $ printf 'asb\n' | sed 'ssa[\s]bsXs'
> X
> $ printf 'a\\b\n' | sed 'ssa[\s]bsXs'
> X
> $ printf 'a b\n' | sed 'ssa[\s]bsXs'
> a b
> 
> 
> (all the above with GNU sed 4.8).
> 
> 
> So the only problematic case seems to be '\n'.
> 
> 
> 
> I don't want to step on anyone's toes... but GNU sed is probably one of
> the (if not the) major implementation of sed, isn't it?
> 
> 
> And regardless of POSIXLY_CORRECT, the standard describes now a
> behaviour (namely that the bracket expression [\n] is the literal
> characters '\' or 'n' and *not* <newline>)... which is not shared by a
> major implementation, at least not with its default settings.
> 
> Anyone who reads the standard would assume that [\n] is not a
> <newline>.
> And of course we could just say "well your implementation is not
> compliant" or "look at its documentation, where it says about
> POSIXLY_CORRECT" ... but that doesn't seem so good to me.
> 
> Usually, implementations extend POSIX rather gracefully, but this is a
> more serious deviation.
> 
> 
> I mean should we just leave it at that?
> 
> Or should we add some hint, e.g. indicating that portable applications
> should not use '\n' but rather 'n\' ... or perhaps even generally place
> '\' last in the bracket expression?
> 
> 
> The best would of course be to get GNU to change its behaviour, though I
> have no idea how likely that is ;-)
> 
> I had tried to reach out to GNU and BusyBox sed maintainers before, and
> while I got replies from BusyBox' I couldn't get in touch with GNU's.
> 
> Is there anyone who's in contact with these people?

The GNU sed developers can be reached at bug-...@gnu.org (per the
output of 'sed --help', and as done in this email).

So if I'm restating your complaint correctly, you are worried that GNU
sed's non-POSIX behavior (what you get by default when POSIXLY_CORRECT
is not set) treats the four-byte sequence '[\n]' in an s-command regex
as a bracket expression for the single character of a literal newline
(that is, interpreting \n as an escape sequence even though it is
inside a bracket expression), instead of as a bracket expression for
either of a literal backslash or literal n; but concur that its
behavior when being POSIX-compliant matches the POSIX rules.

POSIX can't control what GNU sed does when in non-POSIX mode.  But it
can document a recommendation to spell the bracket expression intended
to match either a backslash or an n in the order [n\] to avoid any
potential confusion with [\n] being interpreted as an escape sequence.

Or am I missing something else that you are proposing, either for the
Austin Group to do in its documentation efforts, or for GNU sed to do
in order to comply with the recent Austin Group recommendations?

-- 
Eric 

Latest on POSIX efforts to standardize gettext

2022-05-05 Thread Eric Blake via austin-group-l at The Open Group
Hello GNU and Illumos folks,

The Austin Group (those in charge of the POSIX specification) have
been working on a draft to incorporate the gettext(3) family of
functions and related gettext(1) utilities into the next revision of
POSIX (per https://austingroupbugs.net/view.php?id=1122).  After
several months of near-weekly conference calls, the latest draft of
the work has finally reached the point where it is ready for more
thorough analysis by a wider group of readers.  You can view the
current state of the draft here:

https://posix.rhansen.org/p/gettext_draft

In particular, this draft has an action item to me to reach out to you
on the following question (currently found at line 1138 of that
document, or search for "A.I."):

In the msgfmt(1) utility, there is currently a difference between GNU
and Illumos implementations on detecting duplicate msgid strings, and
which command line switch(es) make detection of duplicates possible.
The question is whether GNU msgfmt would be willing to have the current
-c option (--check) error out on duplicate msgid strings, or even to
add a new command line option (-n appears to be available, for a
mnemonic of 'no dupes') to make the duplicate detection available
without requiring -c.

In addition to answering that question, any review of the rest of the
proposed wording (particularly anything that is still colored and thus
represents edits since the last time we asked for review) is still
appreciated.

--
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org



Re: Latest on POSIX efforts to standardize gettext

2022-05-09 Thread Eric Blake via austin-group-l at The Open Group
On Thu, May 05, 2022 at 09:31:41AM -0500, Eric Blake via austin-group-l at The 
Open Group wrote:
> Hello GNU and Illumos folks,
> 
> The Austin Group (those in charge of the POSIX specification) have
> been working on a draft to incorporate the gettext(3) family of
> functions and related gettext(1) utilities into the next revision of
> POSIX (per https://austingroupbugs.net/view.php?id=1122).  After
> several months of near-weekly conference calls, the latest draft of
> the work has finally reached the point where it is ready for more
> thorough analysis by a wider group of readers.  You can view the
> current state of the draft here:
> 
> https://posix.rhansen.org/p/gettext_draft

Another question came up today (line 1172 in the draft at the time I
wrote this email).  Given the following test file test.c:

#include <libintl.h>
#include <stdio.h>
int main(){
  printf("%s\n",dgettext("foobar","test"));
}

Running "xgettext test.c", on Solaris, the resulting .po file is
called "foobar.po" and contains the msgid "test". Running it on GNU,
the resulting .po file is called "messages.po" and there is no
indication that the msgid belongs to "foobar". According to the Li18nux
specification, the Solaris behavior is intended. Why does GNU xgettext
deviate?

Knowing whether this is considered a bug that future GNU xgettext will
fix, vs. intentional behavior that the standard should purposefully
not constrain, can impact what wording is chosen for the standard here.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org



Re: Can struct sockaddr_un.sun_path be a flexible array member?

2022-07-20 Thread Eric Blake via austin-group-l at The Open Group
On Sun, Jul 17, 2022 at 03:46:52PM -0700, Nick Stoughton via austin-group-l at 
The Open Group wrote:
> Note that a flexible array member is not the same thing as a variable
> length array, and although both entered the standard in C99, previous
> versions allowed the FAM to be specified as an array of length 0.
> 
> The C standard notes that:
> > In most situations, the flexible array member is ignored. In particular,
> the size of the structure is as if the flexible array member were omitted
> ...
> and "sizeof" does just that (omits the flexible array member).
> 
> The normative text does not seem to preclude the use of a flexible array
> member but does not specify any mechanism to obtain the size if it were so.
> I believe that it is a bug in the standard that it is not made clearer that
> the implementation should define the size somehow. I know of no
> implementation that uses a flexible array here. Please feel free to submit
> a bug to austingroupbugs.net with this.

Or better yet, help with amending the existing bug to propose the
desired wording changes:

https://www.austingroupbugs.net/view.php?id=561

Based on an earlier meeting, our current thoughts are:

- Add requirement that sun_path be last member of struct sockaddr_un,
and that it have a constant (although unspecified) size rather than
being an open array

- Add application usage to functions dealing with sockname to
recommend passing memory larger than sizeof(struct sockaddr_un),
preinitialized to 0, when it is desired to ensure NUL termination

- Leave SUN_LEN out of the standard; we don't want variable-length
sun_path
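
As a rough illustration of what a fixed-size sun_path means for
application code, here is a minimal sketch. It is not wording from bug
561, and the helper name fill_sockaddr_un() is invented for this example.
With a constant array size, an application can bound its copy with sizeof
and rely on zero-initialization for NUL termination:

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>

/* Hypothetical helper: store path in *sa with guaranteed NUL
 * termination.  Returns 0 on success, -1 if path does not fit. */
int fill_sockaddr_un(struct sockaddr_un *sa, const char *path)
{
    /* A compile-time constant as long as sun_path is a fixed-size
     * array rather than a flexible array member. */
    size_t cap = sizeof sa->sun_path;
    size_t len = strlen(path);

    if (len >= cap)                   /* keep room for the NUL */
        return -1;
    memset(sa, 0, sizeof *sa);        /* preinitialize to 0 */
    sa->sun_family = AF_UNIX;
    memcpy(sa->sun_path, path, len);  /* NUL comes from the memset */
    return 0;
}
```

If sun_path were a flexible array member, the sizeof expression above
would not compile, which is one practical reason for wanting a constant
size.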

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org



Re: [1003.1(2016/18)/Issue7+TC2 0001457]: Add readlink(1) utility

2022-07-22 Thread Eric Blake via austin-group-l at The Open Group
On Fri, Jul 22, 2022 at 09:26:45AM +0200, Quentin Rameau via austin-group-l at 
The Open Group wrote:
> Hello,
> 
> > == 
> > https://austingroupbugs.net/view.php?id=1457 
> > == 
> 
> > == 
> > Summary:Add readlink(1) utility
> > == 
> 
> > -n  Do not output a trailing <newline>
> > character.
> 
> Out of curiosity, what's a use-case for that?

Good question.  My initial thought was that the construct:

  var=$(readlink -- "$name")

will NOT assign var to the correct contents if $name is a symlink that
resolves to a string containing trailing newlines, as $() would strip
not only the newline added by readlink, but also the newlines from the
link contents.  But using:

  var=$(readlink -n -- "$name")

will not fare any better; it will also strip trailing newlines from
the link content.  The only reliable way to accurately capture the
contents of a symlink in a shell variable is to do something like:

  tmp=$(readlink -n -- "$name"; printf .)
  var=${tmp%.}

at which point the addition of -n doesn't really help, because you
could also do:

  tmp=$(readlink -- "$name"; printf .)
  var=${tmp%?.}

with fewer characters typed.

So the only actual answer I can come up with is "existing practice in
readlink implementations in the wild", where we'd have to ask the
program designers why they thought -n was useful.

[If readlink is implemented as a shell builtin, then you could have an
extension where:

  readlink -v var -n -- "$name"

assigns $var to the full symlink contents, without any extra or
stripped newlines, but such an extension is not what we are proposing
to standardize]

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org



Re: [1003.1(2016/18)/Issue7+TC2 0001457]: Add readlink(1) utility

2022-07-22 Thread Eric Blake via austin-group-l at The Open Group
On Fri, Jul 22, 2022 at 05:04:09PM +0100, Jonathan Wakely wrote:
> On Fri, 22 Jul 2022 at 15:53, Robert Elz via austin-group-l at The
> Open Group  wrote:
> > Aside from that possibility the only reason would seem to be the same
> > as why echo (real ones) have -n (and trashy ones have \c) and why
> > printf(1) needs a \n to print one ... there are times that it is useful
> > to write a partial line to stdout (or wherever) and there's no reason
> > that the output of readlink could not be intended to be a part of such
> > a gradually constructed output line.
> 
> But then shouldn't *every* command that prints output have a -n option?
> 
> If you need to include the output of readlink in gradually constructed
> output you can do what you have to do with other commands:
> 
> printf '%s' "$(readlink foo)"

That strips trailing newlines that may have been important.  The link
contents $'abc' and $'abc\n' are indistinguishable under your approach of
a path through $() and printf.  If you are going to output a
constructed filename to stdout, you really DO want:

readlink -n foo && echo /newfile

to produce the output "link/content/newfile" when foo contains
'link/content', and still handle the case where foo's content is
instead something with a trailing newline.
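
To make the difference concrete, here is a minimal sketch of what $()
does to trailing newlines.  fake_readlink is a hypothetical stand-in
for readlink (it prints its argument followed by one newline), used
only so the example runs without creating symlinks; nl and the other
variable names are likewise just for illustration.

```shell
# Hypothetical stand-in for readlink: prints its argument plus one newline.
fake_readlink() { printf '%s\n' "$1"; }

# A variable holding a single newline (portable, no $'...' needed).
nl='
'

# Plain command substitution strips ALL trailing newlines, so link
# contents "abc" and "abc<newline>" end up as the same string.
plain_a=$(fake_readlink "abc")
plain_b=$(fake_readlink "abc$nl")

# Appending a sentinel "." protects the trailing newlines from $();
# ${var%?.} then removes the sentinel plus the one newline that the
# utility itself appended, leaving the exact contents.
exact_a=$(fake_readlink "abc"; printf .)
exact_a=${exact_a%?.}
exact_b=$(fake_readlink "abc$nl"; printf .)
exact_b=${exact_b%?.}
```

With the sentinel, exact_b keeps the newline that belongs to the link
contents, while plain_a and plain_b can no longer be told apart.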

> 
> The fact that echo and printf have that feature means you don't need
> it everywhere.

You don't need it for utilities that are seldom used in generating
partial file names; but for programs like dirname and readlink,
providing a simpler way to use the utility in the context of building
up a larger file name without losing intermediate trailing newlines
that would be eaten by $() is enough of a worry that adding things
like -n to make it more useful was worthwhile to the implementors.
I'm aware that 'dirname -n' is not common implementation practice, but
since 'readlink -n' does appear to be, there's no harm in
standardizing it that way.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org



Re: [1003.1(2008)/Issue 7 0000561]: NUL-termination of sun_path in Unix sockets

2022-11-30 Thread Eric Blake via austin-group-l at The Open Group
On Mon, Nov 28, 2022 at 07:30:36PM +0100, Steffen Nurpmeso via austin-group-l 
at The Open Group wrote:
> Austin Group Bug Tracker wrote in
>  :
>  ...
>  |https://austingroupbugs.net/view.php?id=561 
>  ...
>  |-- 
>  | (0006085) geoffclare (manager) - 2022-11-28 16:24
>  | https://austingroupbugs.net/view.php?id=561#c6085 
>  |-- 
>  ...
>  |char sun_path[size]   Socket pathname
>  |storage.
>  ...
>  |[.] However, because sun_path is required to be the
>  |last member of the struct, an application can deduce the size by using
>  |sizeof(struct sockaddr_un) - offsetof(struct sockaddr_un,
>  |sun_path).
> 
> I am glued to old habits, but given it is the last field and of
> a known fixed size sizeof(NAME.sun_path) should be all that is
> necessary.  (It definitely is in practice.)
> (And all this different to SUN_LEN(), of course.)

Two comments in response:

First, I chose that wording because 'sizeof(struct
sockaddr_un.sun_path)' doesn't compile.  You are right that 'sizeof
NAME.sun_path' does compile, if NAME is an expression of type struct
sockaddr_un, but the sentence becomes longer to introduce some object
named NAME of the correct type just to get to the shorter sizeof
expression.  However, we can make that edit if it makes sense.

Second, given alignment issues, a choice of an odd size coupled with
other members that require even alignment could permit an
implementation where sizeof(struct sockaddr_un) > offsetof(struct
sockaddr_un, sun_path) + sizeof(NAME.sun_path) due to padding bytes
added for alignment reasons.  I don't know of any such implementations
in practice (the choice of 92, 104, and 108 as the most common sizes
tends to be so that the overall struct sockaddr_un has a size of 128
bytes, which is a nice power-of-two boundary).  Then again,
intentionally forcing struct sockaddr_un to have a padding byte after
sun_path might be an implementation's way of guaranteeing that it can
handle a NUL byte even if the application didn't pass one in.
Therefore, do we need to modify the wording in this proposal to ensure
that struct sockaddr_un is not allowed to have padding bytes after
sun_path to match existing practice?

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org



Re: [1003.1(2008)/Issue 7 0000561]: NUL-termination of sun_path in Unix sockets

2022-11-30 Thread Eric Blake via austin-group-l at The Open Group
On Wed, Nov 30, 2022 at 08:54:03AM -0600, Eric Blake via austin-group-l at The 
Open Group wrote:
> >  ...
> >  |https://austingroupbugs.net/view.php?id=561 

> 
> First, I chose that wording because 'sizeof(struct
> sockaddr_un.sun_path)' doesn't compile.  You are right that 'sizeof
> NAME.sun_path' does compile, if NAME is an expression of type struct
> sockaddr_un, but the sentence becomes longer to introduce some object
> named NAME of the correct type just to get to the shorter sizeof
> expression.  However, we can make that edit if it makes sense.

Having written that, I did test that 'sizeof(((struct
sockaddr_un*)0)->sun_path)' compiles with gcc, although I'm less
certain of whether the C standard permits that (or even if that
permission has changed over time) - the expression argument to sizeof
is unevaluated, which counters the argument that you can't normally
evaluate a dereference of a NULL pointer.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org



Austin Group questions on iconv()

2023-03-09 Thread Eric Blake via austin-group-l at The Open Group
In today's Austin Group meeting, the folks discussing POSIX had a
question for Bruno and/or anyone else with an idea on how the
standards should approach a difference in behavior between Solaris and
GNU iconv() implementations.

For context, today's meeting minutes:
https://posix.rhansen.org/p/2023-03-09 around line 1635

and the bugs leading to the question:

https://austingroupbugs.net/view.php?id=1635
 "0001635: iconv: please be more explicit in input-not-convertible case"
 still open - iconv() resulting in EILSEQ not because of input
 encoding error but because of output being unable to encode the
 transliteration

https://austingroupbugs.net/view.php?id=1007
 "0001007: iconv function not allowed to fail to convert valid sequences"
 resolved at https://austingroupbugs.net/view.php?id=1007#c3330,
 standardizing the //IGNORE, //TRANSLIT, and //NON_IDENTICAL_DISCARD
 modifiers

It seems that bug 1635 is saying that the Solaris implementation
provides a conversion that application writers can use to get reliable
output but does not provide some desired features, and the standard
should change to acknowledge that the GNU implementation provides some
of those desired features.  However, the GNU implementation includes
some ambiguities that make it unreliable.  It seems to ask us to
change the standard to allow a modified version of the GNU iconv()
function that could be reliably interpreted by an application writer.
For example, overloading EILSEQ to mean that there was an invalid
character in the input stream or that there was no transliteration
available in the output codeset to convert that input character makes
it impossible for an application to determine which of those two
problems caused iconv() to fail.

Can we get an explanation on how an application writer is supposed to
write code to reliably use the iconv() in GNU libc, given the above
example?  Can we get help in identifying exactly what changes need to
be made to POSIX (after bugid:1007 has been integrated) to allow GNU
behavior and get reliable results without breaking applications that
currently work with the Solaris iconv() interface?

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org



Re: [PATCH] sockaddr.3type: Document that sockaddr_storage is the API to be used

2023-04-06 Thread Eric Blake via austin-group-l at The Open Group
On Thu, Apr 06, 2023 at 02:05:15PM -0400, Zack Weinberg wrote:
> On Thu, Apr 6, 2023, at 12:31 PM, Alejandro Colomar via Libc-alpha wrote:
> > On 4/6/23 18:24, Eric Blake wrote:
> >> here's the updated wording that the Austin Group tried today (and we
> >> plan on starting a 30-day interpretation feedback window if there are
> >> still adjustments to be made to the POSIX wording):
> >>
> >> https://austingroupbugs.net/view.php?id=1641#c6255
> >
> > Thanks!  That wording (both paragraphs) LGTM.
> 
> If I could suggest an additional change, the focus on aliasing
> _diagnostics_ rather misses the point IMHO.  We don't just want the
> compiler to _not complain_ about accesses to sa_family_t, we want it to
> treat the accesses as _legitimate_.  So, instead of
> 
> # Additionally, the structures shall be defined in such a way that
> # these casts do not cause the compiler to produce diagnostics about
> # aliasing issues in accessing the sa_family_t member of these
> # structures when compiling conforming application (xref to XBD section
> # 2.2) source files.
> 
> may I suggest wording along the lines of
> 
> # Additionally, the structures shall be defined in such a way that
> # the compiler treats an access to the stored value of the sa_family_t
> # member of any of these structures, via an lvalue expression whose type
> # involves any other one of these structures, as permissible, despite the
> # more restrictive rules listed in ISO C section 6.5p7.

I like it as an improvement; I've added your suggestion to the POSIX
bug report as one of the comments received during the 30-day
interpretation window, to see what the other standards developers
think.

Since Issue 7 is tied to C99, and Issue 8 will be tied to C17, both of
which use the same section number despite being different editions of
the C standard, being that specific may work.  Or, we might try
something focusing more on wording instead of document location, as
in:

Additionally, the structures shall be defined in such a way that the
compiler treats an access to the stored value of the sa_family_t
member of any of these structures, via an lvalue expression whose type
involves any other one of these structures, as permissible even if the
types involved would not otherwise be deemed compatible with the
effective type of the object ultimately being accessed.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org



Re: [PATCH] sockaddr.3type: Document that sockaddr_storage is the API to be used

2023-04-21 Thread Eric Blake via austin-group-l at The Open Group
On Fri, Apr 21, 2023 at 05:00:14PM +0200, Alejandro Colomar wrote:
> > 
> > The wording I see in 
> > doesn't seem to cover the case of aliasing a sockaddr_storage as a
> > protocol-specific address for setting other members.
> > 
> > Aliasing rules don't allow one to declare an object of type
> > sockaddr_storage and then fill the structure as if it were another
> > structure, even if alignment and size are correct.  We would need
> > some wording that says something like:
> > 
> > When a pointer to a sockaddr_storage structure is first aliased as a
> > pointer to a protocol-specific address structure, the effective type
> > of the object will be set to the protocol-specific structure.

I'll add that as a comment to the Austin Group page; it seems like a
reasonable statement of intent (POSIX already says that struct
sockaddr_storage is sufficiently sized and aligned; all that remains
is for the compiler to be aware that we intend to use a
more-appropriate effective type once we have the storage allocated).

> > 
> > This is similar to what happens when malloc(3) is assigned to a
> > non-character type.  That's a big hammer, but it does the job.  Maybe
> > we would need some looser language?  I CCd GCC, in case they have
> > concerns about this wording.
> > 
> > Cheers,
> > Alex
> > 
> >>
> >> I quite like this way of putting it.  It subsumes both what I wrote and 
> >> the related potential headache with deciding whether the sa_family_t 
> >> field is considered an object or just a range of bytes within a larger 
> >> object.
> >>
> >> zw
> > 
> 
> For the man pages, I've rewritten it to the following:
> 
> 
> $ git diff
> diff --git a/man3type/sockaddr.3type b/man3type/sockaddr.3type
> index 2fdf56c59..e610aa0f5 100644
> --- a/man3type/sockaddr.3type
> +++ b/man3type/sockaddr.3type
> @@ -117,6 +117,14 @@ .SH HISTORY
>  was invented by POSIX.
>  See also
>  .BR accept (2).
> +.PP
> +These structures were invented before modern ISO C strict-aliasing rules.
> +If aliasing rules are applied strictly,
> +these structures would be impossible to use

Maybe "extremely difficult" instead of "impossible" to use (if I
understand this thread correctly, it is possible to memcpy() from one
struct into different storage of a different effective type where the
memcpy()'s intermediate aliasing through char* avoids the UB).

> +without invoking Undefined Behavior (UB).
> +POSIX Issue 8 will fix this by requiring that implementations
> +make sure that these structures
> +can be safely used as they were designed.
>  .SH NOTES
>  .I socklen_t
>  is also defined in
> 
> 
> I guess this is simple enough that it should work as documentation.

It seems fine from my perspective.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org



Re: encoding question

2023-07-18 Thread Eric Blake via austin-group-l at The Open Group
On Sat, Jul 15, 2023 at 10:41:49PM +, Thorsten Glaser via austin-group-l at 
The Open Group wrote:
> Hi,
> 
> I get that the POSIX locale must be a single-byte character locale
> where all 256 octets are characters. I’ve got a question about the
> wide character representation.
> 
> Assuming my POSIX locale uses ASCII as encoding, I’ve got the whole
> portable character set (and then some) in the first 128 codepoints,
> which have the ASCII code as both octet SBCS value and wchar_t value.
> In this scenario, is it permissible to map the other 128 codepoints
> “high” i.e. to wchar_t values > 0x0100?

You're not the first to ask this question.  Here's a link to a
proposed patch to glibc on the same topic just this month, after
noting that musl has already dealt with it:

https://sourceware.org/pipermail/libc-alpha/2023-July/149588.html
https://sourceware.org/pipermail/libc-alpha/2023-July/150021.html
https://www.openwall.com/lists/musl/2022/11/10/2

The conclusion in those links appears to be that it is compliant to
have the 8-bit characters map to wchar_t codepoints that are not valid
Unicode characters, but which are distinct enough to preserve all
other properties needed to treat the POSIX locale as a single-byte
locale with 256 "characters" and proper collation sequence without
encoding errors.  Whether the mapping is to the 0xdcXX or 0xdfXX range
of reserved codepoints in Unicode is a matter of implementation
choice; both choices exist in implementations already out there.

> 
> I’m reading the standard as yes, but not asking already landed me
> in trouble in the past so I’d rather…

That's a wise course of action.  And while maybe the standard could
make this easier, the fact that two commonly chosen ranges are
already in play is not going to make it easy to mandate a specific
mapping.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org



RFC: changing printf(1) behavior on %b

2023-08-31 Thread Eric Blake via austin-group-l at The Open Group
In today's Austin Group call, we discussed the fact that printf(1) has
mandated behavior for %b (escape sequence processing similar to XSI
echo) that will eventually conflict with C2x's desire to introduce %b
to printf(3) (to produce 0b000... binary literals).

For POSIX Issue 8, we plan to mark the current semantics of %b in
printf(1) as obsolescent (it would continue to work, because Issue 8
targets C17 where there is no conflict with C2x), but with a Future
Directions note that for Issue 9, we could remove %b entirely, or
(more likely) make %b output binary literals just like C.  But that
raises the question of whether the escape-sequence processing
semantics of %b should still remain available under the standard,
under some other spelling, since relying on XSI echo is still not
portable.

One of the observations made in the meeting was that currently, both
the POSIX spec for printf(1) as seen at [1], and the POSIX and C
standard (including the upcoming C2x standard) for printf(3) as seen
at [3] state that both the ' and # flag modifiers are currently
undefined when applied to %s.

[1] https://pubs.opengroup.org/onlinepubs/9699919799/utilities/printf.html
"The format operand shall be used as the format string described in
XBD File Format Notation[2] with the following exceptions:..."

[2] 
https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap05.html#tag_05
"The flag characters and their meanings are: ...
# The value shall be converted to an alternative form. For c, d, i, u,
  and s conversion specifiers, the behavior is undefined.
[and no mention of ']"

[3] https://pubs.opengroup.org/onlinepubs/9699919799/functions/printf.html
"The flag characters and their meanings are:
' [CX] [Option Start] (The <apostrophe>.) The integer portion of the
  result of a decimal conversion ( %i, %d, %u, %f, %F, %g, or %G )
  shall be formatted with thousands' grouping characters. For other
  conversions the behavior is undefined. The non-monetary grouping
  character is used. [Option End]
...
# Specifies that the value is to be converted to an alternative
  form. For o conversion, it shall increase the precision, if and only
  if necessary, to force the first digit of the result to be a zero
  (if the value and precision are both 0, a single 0 is printed). For
  x or X conversion specifiers, a non-zero result shall have 0x (or
  0X) prefixed to it. For a, A, e, E, f, F, g, and G conversion
  specifiers, the result shall always contain a radix character, even
  if no digits follow the radix character. Without this flag, a radix
  character appears in the result of these conversions only if a digit
  follows it. For g and G conversion specifiers, trailing zeros shall
  not be removed from the result as they normally are. For other
  conversion specifiers, the behavior is undefined."

Thus, it appears that both %#s and %'s are available for use for
future standardization.  Typing-wise, %#s as a synonym for %b is
probably going to be easier (less shell escaping needed).  Is there
any interest in a patch to coreutils or bash that would add such a
synonym, to make it easier to leave that functionality in place for
POSIX Issue 9 even when %b is repurposed to align with C2x?

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.
Virtualization:  qemu.org | libguestfs.org



Re: bug#65659: RFC: changing printf(1) behavior on %b

2023-08-31 Thread Eric Blake via austin-group-l at The Open Group
On Thu, Aug 31, 2023 at 03:10:58PM -0400, Chet Ramey wrote:
> On 8/31/23 11:35 AM, Eric Blake wrote:
> > In today's Austin Group call, we discussed the fact that printf(1) has
> > mandated behavior for %b (escape sequence processing similar to XSI
> > echo) that will eventually conflict with C2x's desire to introduce %b
> > to printf(3) (to produce 0b000... binary literals).
> > 
> > For POSIX Issue 8, we plan to mark the current semantics of %b in
> > printf(1) as obsolescent (it would continue to work, because Issue 8
> > targets C17 where there is no conflict with C2x), but with a Future
> > Directions note that for Issue 9, we could remove %b entirely, or
> > (more likely) make %b output binary literals just like C.
> 
> I doubt I'd ever remove %b, even in posix mode -- it's already been there
> for 25 years.

But the longer that printf(3) supports "%b" to output binary values,
the more surprised new shell coders will be that printf(1) %b does not
behave the same.  What's more, other languages have already started
using %b for binary output (python, for example), so it is definitely
gaining in mindshare.

That said, I also agree with your desire to keep the functionality in
place.  The current POSIX says that %b was added so that on a non-XSI
system, you could do:

my_echo() {
  printf %b\\n "$*"
}

and then call my_echo everywhere that a script used to depend on XSI
echo (perhaps by 'alias echo=my_echo' with aliases enabled), for a
much quicker portability hack than a tedious search-and-replace of
every echo call that requires manual inspection of its arguments for
translation of any XSI escape sequences into printf format
specifications.  In particular, code like [var='...\c'; echo "$var"]
cannot be changed to use printf by a mere s/echo/printf %s\\n/.  Thus,
when printf was invented and standardized for the shell, the solution
at the time was to create [printf %b\\n "$var"] as a drop-in
replacement for XSI [echo "$var"], even for platforms without XSI
echo.
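
As a rough illustration of why a plain s/echo/printf %s\\n/ is not
enough, the sketch below feeds the same string to %s and to %b.  The
variable names are made up for the example, and the trailing
"printf ." is only a sentinel so that $() does not hide whether a
newline was written.

```shell
# A string written for XSI echo: \t is a tab, \c suppresses the rest
# of the output, including the trailing newline.
var='one\ttwo\c'

# %s copies the operand literally; the escapes are NOT interpreted.
as_s=$(printf '%s\n' "$var"; printf .)

# %b interprets the escapes; \c stops output, so the \n in the format
# string is never written.
as_b=$(printf '%b\n' "$var"; printf .)
```

Here as_s still contains the backslash sequences followed by a
newline, while as_b holds "one", a tab, and "two" with no newline
before the sentinel.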

Nowadays, I personally have not seen very many scripts like this in
the wild (for example, autoconf scripts prefer to directly use printf,
rather than trying to shoe-horn behavior into echo).  But assuming
such legacy scripts still exist, it is still much easier to rewrite
just the my_echo wrapper to now use %#s\\n instead of %b\\n, than it
would be to find every callsite of my_echo.

Bash already has shopt -s xpg_echo; I could easily see this being a
case where you toggle between the old or new behavior of %b (while
keeping %#s always at the old behavior) by either this or some other
shopt in bash, so that newer script writers that want binary output
for %b can do so with one setting, while scripts that must continue to
run under old semantics can likewise do so.

> 
> > But that
> > raises the question of whether the escape-sequence processing
> > semantics of %b should still remain available under the standard,
> > under some other spelling, since relying on XSI echo is still not
> > portable.
> > 
> > One of the observations made in the meeting was that currently, both
> > the POSIX spec for printf(1) as seen at [1], and the POSIX and C
> > standard (including the upcoming C2x standard) for printf(3) as seen
> > at [3] state that both the ' and # flag modifiers are currently
> > undefined when applied to %s.
> 
> Neither one is a very good choice, but `#' is the better one. It at least
> has a passing resemblance to the desired functionality.

Indeed, that's what the Austin Group settled on today after I first
wrote my initial email, and what I wrote up in a patch to GNU
Coreutils (https://debbugs.gnu.org/65659)

> 
> Why not standardize another character, like %B? I suppose I'll have to look
> at the etherpad for the discussion. I think that came up on the mailing
> list, but I can't remember the details.

Yes, https://austingroupbugs.net/view.php?id=1771 has a good
discussion of the various ideas.

%B is out for the same reason as %b: although the current C2x draft
wording says that % plus an uppercase letter is reserved for implementation use, other
than [AEFGX] which already have a history of use by C (as it was, when
C99 added %A, that caused problems for some folks), it goes on to
_highly_ encourage any implementation that adds %b for "0b0" binary
output also add %B for "0B0" binary output (to match the x/X
dichotomy).  Burning %B to retain the old behavior while repurposing
%b to output lower-case binary values is thus a non-starter, while
burning %#s (which C says is undefined) felt nicer.

The Austin Group also felt that standardizing bash's behavior of %q/%Q
for outputting quoted text, while too late for Issue 8, has a good
chance of success, even though C says %q is reserved for
standardization by C. Our reasoning there is that lots of libc over
the years have used %qi as a synonym for %lli, and C would be foolish
to burn %q for anything that does not match those semantics at the C
language level; which means it will likely never be claimed by C and
thus free for use by shell in the way that bash has already done.

Re: bug#65659: RFC: changing printf(1) behavior on %b

2023-09-01 Thread Eric Blake via austin-group-l at The Open Group
On Fri, Sep 01, 2023 at 08:59:19AM +0100, Stephane Chazelas wrote:
> 2023-08-31 15:02:22 -0500, Eric Blake via austin-group-l at The Open Group:
> [...]
> > The current POSIX says that %b was added so that on a non-XSI
> > system, you could do:
> > 
> > my_echo() {
> >   printf %b\\n "$*"
> > }
> 
> That is dependent on the current value of $IFS. You'd need:
> 
> xsi_echo() (
>   IFS=' '
>   printf '%b\n' "$*"
> )

Let's read the standard in context (Issue 8 draft 3 page 2793 line 92595):

"
The printf utility can be used portably to emulate any of the traditional 
behaviors of the echo
utility as follows (assuming that IFS has its standard value or is unset):
• The historic System V echo and the requirements on XSI implementations in 
this volume of
  POSIX.1-202x are equivalent to:
printf "%b\n" "$*"
"

So yes, the standard does mention the requirement to have a sane IFS,
and I failed to include that in my one-off implementation of
my_echo().  Thank you for pointing out a more robust version.
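
A small sketch of the failure mode, assuming nothing beyond POSIX sh:
naive_echo is my original one-liner under a different (made-up) name,
and xsi_echo is the subshell variant quoted above.  Both are called
with IFS set to ":" inside a command substitution, so the caller's IFS
is left alone.

```shell
# Joins its arguments with the first character of the CURRENT IFS.
naive_echo() {
  printf '%b\n' "$*"
}

# Runs in a subshell and pins IFS, so "$*" always joins with a space.
xsi_echo() (
  IFS=' '
  printf '%b\n' "$*"
)

# Simulate a caller that has changed IFS.
naive=$(IFS=:; naive_echo a b)
robust=$(IFS=:; xsi_echo a b)
```

With the modified IFS, naive ends up as "a:b" while robust is still
"a b", which is what echo would have produced.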

> 
> Or the other alternatives listed at
> https://unix.stackexchange.com/questions/65803/why-is-printf-better-than-echo/65819#65819
> 
> [...]
> > Bash already has shopt -s xpg_echo
> 
> Note that in bash, you need both
> 
> shopt -s xpg_echo
> set -o posix
> 
> To get a XSI echo. Without the latter, options are still
> recognised. You can get a XSI echo without those options with:
> 
> xsi_echo() {
>   local IFS=' ' -
>   set +o posix
>   echo -e "$*\n\c"
> }
> 
> The addition of those \n\c (noop) avoids arguments being treated as
> options if they start with -.

As an extension, Bash (and Coreutils) happen to honor \c always, and
not just for %b.  But POSIX only requires \c handling for %b.

And while Issue 8 has taken steps to allow implementations to support
'echo -e', it is still not standardized behavior; so your xsi_echo()
is bash-specific (which is not necessarily a problem, as long as you
are aware it is not portable).

> [...]
> > The Austin Group also felt that standardizing bash's behavior of %q/%Q
> > for outputting quoted text, while too late for Issue 8, has a good
> > chance of success, even though C says %q is reserved for
> > standardization by C. Our reasoning there is that lots of libc over
> > the years have used %qi as a synonym for %lli, and C would be foolish
> > to burn %q for anything that does not match those semantics at the C
> > language level; which means it will likely never be claimed by C and
> > thus free for use by shell in the way that bash has already done.
> [...]
> 
> Note that %q is from ksh93, not bash and is not portable across
> implementations and with most including bash's gives an output
> that is not safe for reinput in arbitrary locales (as it uses
> $'...' in some cases), not sure  it's a good idea to add it to
> the standard, or at least it should come with fat warnings about
> the risk in using it.

%q is NOT being added to Issue 8, but $'...' is.  Bug 1771 asked if %q
could be added to Issue 8, but it came in past the deadline for
feature requests, so the best we could do is add a FUTURE DIRECTIONS
blurb that mentions the idea.  But since FUTURE DIRECTIONS is
non-normative, we can always change our mind in Issue 9 and delete
that text if it turns out we can't get consensus to standardize some
form of %q/%Q after all.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.
Virtualization:  qemu.org | libguestfs.org



Re: bug#65659: RFC: changing printf(1) behavior on %b

2023-09-01 Thread Eric Blake via austin-group-l at The Open Group
On Fri, Sep 01, 2023 at 07:19:13AM +0200, Phi Debian wrote:
> Well, after reading yet another thread regarding libc_printf(), I have to
> admit that even %B is crossed out (yet already chosen by ksh93).
> 
> The other thread also speaks about libc_printf() documenting %# as
> undefined for things other than a, A, e, E, f, F, g, and G, yet the same
> thread also talks about a and A coming late (citing C99) to the dance, meaning
> what is undefined today becomes defined tomorrow, so %#b is no safer.
>

Caution: The proposal here is for %#s (an alternative string), not %#b
(which C2x wants to be similar to %#x, in that it outputs a '0b'
prefix for all values except bare '0').
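
The %#x half of that analogy can be checked with printf(1) as it
exists today; the sketch below only uses %x and %#x, and does not
assume that any shell already implements %#b or %#s.  The variable
names are just for illustration.

```shell
# The # flag adds the 0x prefix to non-zero values...
nonzero=$(printf '%#x' 255)

# ...but a bare zero is printed without any prefix.
zero=$(printf '%#x' 0)

# Without the flag there is never a prefix.
plain=$(printf '%x' 255)
```

C2x's %#b is meant to follow the same pattern with a '0b' prefix,
which is a different thing from the %#s spelling proposed in this
thread.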

Yes, there is a slight risk that C may decide to define %#s.  But as
the Austin Group includes a member of WG14, we are able to advise the
C committee that such an addition is not wise.

> My guess is that printf(1) is now doomed to follow its route, keep its old
> format exception, and then maybe implement something like c_printf, like
> printf but where the format string follows libc semantics, or maybe a -C
> option to printf(1)...

Adding an option to printf is also a possibility, if there is
wide-spread implementation practice to standardize.  If someone wants
to implement 'printf -C' right now, that could help feed such a future
standardization.  But it is somewhat orthogonal to the request in this
thread, which is how to allow users to still access the old %b
behavior even if %b gets repurposed in the future; if we can get
multiple implementations to add a %#s alias now, it makes the future
decisions easier (even if it is too late for Issue 8 to add any new
features, or for that matter, to make any normative changes other than
marking %b obsolescent as a way to be able to revisit it in the future
for Issue 9).


> 
> Well, in any case %b cannot change semantics in bash scripts, since it has
> been there for so long; even if it departs from python, perl, libc, it is
> unfortunate but that's the way it is. Nobody wants a semantic change, and on
> the next router update, to see the whole internet falling apart :-)

How many scripts in the wild actually use %b, though?  And if there
are such scripts, anything we can do to make it easy to do a drop-in
replacement that still preserves the old behavior (such as changing %b
to %#s) is going to be easier to audit than the only other
currently-portable alternative of actually analyzing the string to see
if it uses any octal or \c escapes that have to be re-written to
portably function as a printf format argument.

POSIX is not mandating %#s at this time, so much as suggesting that if
implementations are willing to implement it now, it will make Issue 9
easier to reason about.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.
Virtualization:  qemu.org | libguestfs.org



Re: Recommendation for POSIX ed consideration

2023-12-11 Thread Eric Blake via austin-group-l at The Open Group
Hello Andrew, I'm forwarding your message on to the full Austin Group.

On Sun, Dec 10, 2023 at 11:37:40PM -0500, Andrew L. Moore wrote:
> Hi,
> I am the author of the original GNU ed and maintain an alternative (and I
> might add, much more robust) version at github.com/slewsys/ed.
> 
> One thing that I'd love to see the POSIX committee explore is the exit
> status of ed.  Per the standard:
> 
> EXIT STATUS
> 
> The following exit values shall be returned:
> 
>  0.  Successful completion without any file or command errors.
>  >0.  An error occurred.
> 
> The problem with this behavior is that, in interactive use, it is common to
> make errors, correct them and then write the corrected file.  But by exiting
> with an error, even after successfully writing, this prevents ed from being
> used as the editor for many utilities, which abort when the editor exits with
> a non-zero error code.
> 
> In the version of GNU ed handed over to Antonio, the behavior was that after
> a successful write, the error status is reset to zero.  This had no impact
> on traditional scripting and merely allowed ed to be much more friendly,
> e.g., for writing git commits. Unfortunately, Antonio updated GNU ed at some
> point to follow POSIX, which is sub-optimal.
> -AM
> 

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.
Virtualization:  qemu.org | libguestfs.org



Re: Re: Questions on strftime vs. POSIX

2024-02-07 Thread Eric Blake via austin-group-l at The Open Group
Widening the scope of this conversation, with Paul's permission.

Context for the Open Group readers: per my Action Item from Monday's
meeting, I emailed Paul regarding
https://austingroupbugs.net/bug_view_page.php?bug_id=1797

On Mon, Feb 05, 2024 at 10:51:34AM -0800, Paul Eggert wrote:
> On 2024-02-05 08:15, Eric Blake wrote:
> 
> > Did you consider the effect of the change on applications that
> > populate struct tm directly (and don't currently set tm_gmtoff, except
> > perhaps by zeroing the structure)?
> 
> Yes. Very few apps do that. (I looked for some in the GNU code I help
> maintain, and found none.) They are greatly outnumbered by the applications
> that call localtime/localtime_r/mktime/gmtime/gmtime_r/etc. and pass the
> result to strftime, which is what this bug report is about.
> 
> 
> > Does the latest tzdata code only use tm_gmtoff in the rare cases when
> > it is necessary for disambiguation, or is it always used (overriding
> > the timezone data)?  The bug description implies the former, but the
> > desired action would allow the latter.
> 
> The former. That is, TZDB 2024a strftime looks only at tm_gmtoff, tm_year,
> tm_mon, tm_mday, tm_hour, tm_min, and tm_sec to determine %s, because that's
> all you need.
> 
> The desired action allows either the TZDB behavior, or the glibc behavior
> which if I recall consults tm_gmtoff only when tm_isdst is ambiguous. The
> TZDB behavior is technically better than the glibc behavior for three
> reasons: (1) it removes a multithreading bottleneck, (2) even in a
> single-threaded platform it's faster because mktime is slower than using
> tm_gmtoff, and (3) when user code mistakenly calls gmtime and then strftime
> then %s does what the user expects. The bug report that caused TZDB to
> behave this way was about (3), but (1) and (2) also play a part.
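
As an illustration of why those fields suffice, here is a rough sketch of
computing the %s value from the broken-down time plus a UTC offset,
without calling mktime.  This is not the TZDB code; the function names
are hypothetical, and the offset is passed as a separate parameter
because tm_gmtoff is not a member of struct tm on every platform.

```c
#include <stdint.h>
#include <time.h>

/* Days from 1970-01-01 to the given proleptic Gregorian date
 * (m is 1..12, d is 1..31). */
static int64_t days_from_civil(int64_t y, int m, int d)
{
    y -= m <= 2;                        /* treat Jan/Feb as end of prior year */
    int64_t era = (y >= 0 ? y : y - 399) / 400;
    int64_t yoe = y - era * 400;                          /* 0..399 */
    int64_t doy = (153 * (m + (m > 2 ? -3 : 9)) + 2) / 5 + d - 1;
    int64_t doe = yoe * 365 + yoe / 4 - yoe / 100 + doy;  /* 0..146096 */
    return era * 146097 + doe - 719468;
}

/* Hypothetical helper: seconds since the Epoch for a broken-down local
 * time whose offset east of UTC is gmtoff seconds.  Only tm_year,
 * tm_mon, tm_mday, tm_hour, tm_min and tm_sec are consulted. */
int64_t epoch_from_tm(const struct tm *tm, long gmtoff)
{
    int64_t days = days_from_civil((int64_t)tm->tm_year + 1900,
                                   tm->tm_mon + 1, tm->tm_mday);
    return days * 86400 + tm->tm_hour * 3600 + tm->tm_min * 60
           + tm->tm_sec - gmtoff;
}
```

Because nothing here consults TZ or shared time zone state, the
computation is thread-safe and does not depend on tm_isdst.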

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.
Virtualization:  qemu.org | libguestfs.org



Re: Questions on strftime vs. POSIX

2024-02-12 Thread Eric Blake via austin-group-l at The Open Group
On Wed, Feb 07, 2024 at 08:14:39AM -0600, Eric Blake via austin-group-l at The 
Open Group wrote:
> Widening the scope of this conversation, with Paul's permission.
> 
> Context for the Open Group readers: per my Action Item from Monday's
> meeting, I emailed Paul regarding
> https://austingroupbugs.net/bug_view_page.php?bug_id=1797
> 
> On Mon, Feb 05, 2024 at 10:51:34AM -0800, Paul Eggert wrote:
> > On 2024-02-05 08:15, Eric Blake wrote:
> > 
> > > Did you consider the effect of the change on applications that
> > > populate struct tm directly (and don't currently set tm_gmtoff, except
> > > perhaps by zeroing the structure)?
> > 
> > Yes. Very few apps do that. (I looked for some in the GNU code I help
> > maintain, and found none.) They are greatly outnumbered by the applications
> > that call localtime/localtime_r/mktime/gmtime/gmtime_r/etc. and pass the
> > result to strftime, which is what this bug report is about.
> > 
> > 
> > > Does the latest tzdata code only use tm_gmtoff in the rare cases when
> > > it is necessary for disambiguation, or is it always used (overriding
> > > the timezone data)?  The bug description implies the former, but the
> > > desired action would allow the latter.
> > 
> > The former. That is, TZDB 2024a strftime looks only at tm_gmtoff, tm_year,
> > tm_mon, tm_mday, tm_hour, tm_min, and tm_sec to determine %s, because that's
> > all you need.
> > 
> > The desired action allows either the TZDB behavior, or the glibc behavior
> > which if I recall consults tm_gmtoff only when tm_isdst is ambiguous. The
> > TZDB behavior is technically better than the glibc behavior for three
> > reasons: (1) it removes a multithreading bottleneck, (2) even in a
> > single-threaded platform it's faster because mktime is slower than using
> > tm_gmtoff, and (3) when user code mistakenly calls gmtime and then strftime
> > then %s does what the user expects. The bug report that caused TZDB to
> > behave this way was about (3), but (1) and (2) also play a part.
> 
> -- 
> Eric Blake, Principal Software Engineer
> Red Hat, Inc.
> Virtualization:  qemu.org | libguestfs.org
> 

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.
Virtualization:  qemu.org | libguestfs.org



Re: [Issue 8 drafts 0001798]: Must posix_getdents remember file offsets across exec?

2024-02-15 Thread Eric Blake via austin-group-l at The Open Group
Adding in Corinna Vinschen, one of the Cygwin developers.  She had
problems trying to post directly on the bug page, so we can use email
replies and summarize the results back to the bug.

On Mon, Jan 22, 2024 at 03:30:20PM +, Austin Group Bug Tracker via 
austin-group-l at The Open Group wrote:
> 
> A NOTE has been added to this issue. 
> == 
> https://austingroupbugs.net/view.php?id=1798 
> == 
> Reported By:         eblake
> Assigned To:
> == 
> Project:             Issue 8 drafts
> Issue ID:            1798
> Category:            System Interfaces
> Type:                Clarification Requested
> Severity:            Objection
> Priority:            normal
> Status:              New
> Name:                Eric Blake
> Organization:        Red Hat
> User Reference:      ebb.posix_getdents
> Section:             XSH posix_getdents
> Page Number:         1567
> Line Number:         52609
> Final Accepted Text:
> == 
> Date Submitted:      2024-01-22 15:13 UTC
> Last Modified:       2024-01-22 15:30 UTC
> == 
> Summary:             Must posix_getdents remember file offsets across exec?
> == 
> 
> -- 
>  (0006632) eblake (manager) - 2024-01-22 15:30
>  https://austingroupbugs.net/view.php?id=1798#c6632 
> -- 
> Correction - I'm told that the attempted Cygwin implementation also has
> problems after dup(); it is unclear whether the states should be linked.
> For example, after reading an entry on one fd, grabbing its offset, and
> then using the other fd to read entries, does the second fd start reading
> from the point where the fd was at the time of dup() or at the shared
> point reached by the first fd, and can the second fd safely lseek() to
> the offset read by the first fd?  Easiest would be to state that dup() has
> the same limitations as fork()/exec - namely, that any mid-stream directory
> traversal in either side of the split is unspecified, and the only portable
> thing is to start a new traversal by lseek'ing back to 0 (at which point,
> the implementation no longer has to worry about sharing a half-read DIR*
> across fd copies or processes). 
> 
> Issue History 
> Date Modified    Username  Field           Change
> == 
> 2024-01-22 15:13 eblake    New Issue
> 2024-01-22 15:13 eblake    Name            => Eric Blake
> 2024-01-22 15:13 eblake    Organization    => Red Hat
> 2024-01-22 15:13 eblake    User Reference  => ebb.posix_getdents
> 2024-01-22 15:13 eblake    Section         => XSH posix_getdents
> 2024-01-22 15:13 eblake    Page Number     => 1567
> 2024-01-22 15:13 eblake    Line Number     => 52609
> 2024-01-22 15:30 eblake    Note Added: 0006632
> ==
> 
> 
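
For reference, the "only portable thing" suggested in the note might look
roughly like the sketch below: duplicate the directory descriptor, then
rewind to offset 0 before starting any new traversal.  The helper name
dup_and_rewind is hypothetical, and the sketch assumes the implementation
permits lseek() on a directory descriptor.

```c
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: duplicate a directory fd and rewind it so that a
 * fresh traversal can start from the beginning.  Returns the new fd, or
 * -1 on failure. */
int dup_and_rewind(int dir_fd)
{
    int copy = dup(dir_fd);
    if (copy < 0)
        return -1;
    /* dup() shares one file offset between both descriptors, so this
     * rewinds the original as well as the copy. */
    if (lseek(copy, 0, SEEK_SET) == (off_t)-1) {
        close(copy);
        return -1;
    }
    return copy;
}
```

After the rewind, neither descriptor relies on a half-finished traversal,
which sidesteps the question of whether the two share directory-reading
state.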

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.
Virtualization:  qemu.org | libguestfs.org