Re: type errors, command length limits, and Awk (was: portability of xargs)

2022-02-15 Thread Dan Kegel
FWIW, commandline length limits are a real thing, I've run into them
with Make, CMake, and Meson.
I did some work to help address them in Meson, see e.g.
https://github.com/mesonbuild/meson/issues/7212

And just for fun, here's a vaguely related changelog entry from long
ago, back when things were much worse:

Tue Jun  8 15:24:14 1993  Paul Eggert  (egg...@twinsun.com)
* inp.c (plan_a): Check that RCS and working files are not the
same.  This check is needed on hosts that do not report file
name length limits and have short limits.

:-)



Re: type errors, command length limits, and Awk (was: portability of xargs)

2022-02-15 Thread Mike Frysinger
On 15 Feb 2022 21:17, Jacob Bachmeyer wrote:
> Mike Frysinger wrote:
> > context: https://bugs.gnu.org/53340
> >   
> Looking at the highlighted line in the context:

thanks for getting into the weeds with me

> > >   echo "$$py_files" | $(am__pep3147_tweak) | $(am__base_list) | \
> It seems that the problem is that am__base_list expects ListOf/File (and 
> produces ChunkedListOf/File) but am__pep3147_tweak emits ListOf/Glob.  
> This works in the usual case because the shell implicitly converts Glob 
> -> ListOf/File and implicitly flattens argument lists, but results in 
> the overall command line being longer than expected if the globs expand 
> to more filenames than expected, as described there.
> 
> It seems that the proper solution to the problem at hand is to have 
> am__pep3147_tweak expand globs itself somehow and thus provide 
> ListOf/File as am__base_list expects.
> 
> Do I misunderstand?  Is there some other use for xargs?

if i did not care about double expansion, this might work.  the pipeline
quoted here handles the arguments correctly (other than whitespace splitting
on the initial input, but that's a much bigger task) before passing them to
the rest of the pipeline.  so the full context:

  echo "$$py_files" | $(am__pep3147_tweak) | $(am__base_list) | \
  while read files; do \
$(am__uninstall_files_from_dir) || st=$$?; \
  done || exit $$?; \
...
am__uninstall_files_from_dir = { \
  test -z "$$files" \
|| { test ! -d "$$dir" && test ! -f "$$dir" && test ! -r "$$dir"; } \
|| { echo " ( cd '$$dir' && rm -f" $$files ")"; \
 $(am__cd) "$$dir" && rm -f $$files; }; \
  }

leveraging xargs would allow me to maintain a single shell expansion.
the pathological situation being:
  bar.py
  __pycache__/
bar.pyc
bar*.pyc
bar**.pyc

py_files="bar.py" which turns into "__pycache__/bar*.pyc" by the pipeline,
and then am__uninstall_files_from_dir will expand it when calling `rm -f`.

if the pipeline expanded the glob, it would be:
  __pycache__/bar.pyc __pycache__/bar*.pyc __pycache__/bar**.pyc
and then when calling rm, those would expand a 2nd time.

i would have to change how the pipeline outputs the list of files such that
the final subshell could safely consume & expand.  since this is portable
shell, i don't have access to arrays & fancy things like readarray.  if the
pipeline switched to newline delimiting, and i dropped $(am__base_list), i
could use positionals to construct an array and safely expand that.  but i
strongly suspect that it's not going to be as performant, and i might as
well just run `rm` once per file :x.

  echo "$$py_files" | $(am__pep3147_tweak) | \
  ( set --
while read file; do
  set -- "$@" "$file"
  if test $# -ge 40; then
rm -f "$@"
set --
  fi
done
if test $# -gt 0; then
  rm -f "$@"
fi
  )

which at this point i've written `xargs -n40`, but not as fast :p.

> > automake jumps through some hoops to try and limit the length of generated
> > command lines, like deleting output objects in a non-recursive build.  it's
> > not perfect -- it breaks arguments up into 40 at a time (akin to xargs -n40)
> > and assumes that it won't have 40 paths with long enough names to exceed the
> > command line length.  it also has some logic where it's deleting paths by
> > globs, but the process to partition the file list into groups of 40 happens
> > before the glob is expanded, so there are cases where it's 40 globs that can
> > expand into many many more files and then exceed the command line length.
> 
> First, I thought that GNU-ish systems were not supposed to have such 
> arbitrary limits,

one person's "arbitrary limits" is another person's "too small limit" :).
i'm most familiar with Linux, so i'll focus on that.

xargs --show-limits on my Linux-5.15 system says:
Your environment variables take up 5934 bytes
POSIX upper limit on argument length (this system): 2089170
POSIX smallest allowable upper limit on argument length (all systems): 4096
Maximum length of command we could actually use: 2083236

2MB ain't too shabby.  but if we consult execve(2), it has more details:
https://man7.org/linux/man-pages/man2/execve.2.html
   On Linux prior to kernel 2.6.23, the memory used to store the
   environment and argument strings was limited to 32 pages (defined
   by the kernel constant MAX_ARG_PAGES).  On architectures with a
   4-kB page size, this yields a maximum size of 128 kB.

i've def seen "Argument list too long" errors in Gentoo from a variety of
packages due to this 128 kB limit (which includes the environ strings).
users are very hostile to packages :p.

   On kernel 2.6.23 and later, most architectures support a size
   limit derived from the soft RLIMIT_STACK resource limit (see
   getrlimit(2)) that is in force at the time of the execve() call.
   (Architectures with no memory management unit are excepted: they
   maintain the limit that was in effect before 

type errors, command length limits, and Awk (was: portability of xargs)

2022-02-15 Thread Jacob Bachmeyer

Mike Frysinger wrote:

context: https://bugs.gnu.org/53340
  

Looking at the highlighted line in the context:

>   echo "$$py_files" | $(am__pep3147_tweak) | $(am__base_list) | \
It seems that the problem is that am__base_list expects ListOf/File (and 
produces ChunkedListOf/File) but am__pep3147_tweak emits ListOf/Glob.  
This works in the usual case because the shell implicitly converts Glob 
-> ListOf/File and implicitly flattens argument lists, but results in 
the overall command line being longer than expected if the globs expand 
to more filenames than expected, as described there.


It seems that the proper solution to the problem at hand is to have 
am__pep3147_tweak expand globs itself somehow and thus provide 
ListOf/File as am__base_list expects.


Do I misunderstand?  Is there some other use for xargs?

I note that the current version of standards.texi also allows configure 
and make rules to use awk(1); could that be useful here instead? (see below)



[...]

automake jumps through some hoops to try and limit the length of generated
command lines, like deleting output objects in a non-recursive build.  it's
not perfect -- it breaks arguments up into 40 at a time (akin to xargs -n40)
and assumes that it won't have 40 paths with long enough names to exceed the
command line length.  it also has some logic where it's deleting paths by
globs, but the process to partition the file list into groups of 40 happens
before the glob is expanded, so there are cases where it's 40 globs that can
expand into many many more files and then exceed the command line length.
  


First, I thought that GNU-ish systems were not supposed to have such 
arbitrary limits, and this issue (the context) originated from Gentoo 
GNU/Linux.  Is this a more fundamental bug in Gentoo or still an issue 
because Automake build scripts are supposed to be portable to foreign 
system that do have those limits?


Second, counting files in the list, as you note, does not necessarily 
actually conform to the system limits, while Awk can track both number 
of elements in the list and the length of the list as a string, allowing 
to break the list to meet both command tail length limits (on Windows or 
total size of block to transfer with execve on POSIX) and argument count 
limits (length of argv acceptable to execve on POSIX).


POSIX Awk should be fairly widely available, although at least Solaris 
10 has a non-POSIX awk in /usr/bin and a POSIX awk in /usr/xpg4/bin; I 
found this while working on DejaGnu.  I ended up using this test to 
ensure that "awk" is suitable:


8<--
# The non-POSIX awk in /usr/bin on Solaris 10 fails this test
if echo | "$awkbin" '1 && 1 {exit 0}' > /dev/null 2>&1 ; then
   have_awk=true
else
   have_awk=false
fi
8<--


Another "gotcha" with Solaris 10 /usr/bin/awk is that it will accept 
"--version" as a valid Awk program, so if you use that to test whether 
"awk" is GNU Awk, you must redirect input from /dev/null or it will hang.


Automake may want to do more extensive testing to find a suitable Awk; 
the above went into a script that remains generic when installed and so 
must run its tests every time the user invokes it, so "quick" was a high 
priority.



-- Jacob