Re: Forwarded: Segmentation Fault via recursive loop in Gawk

2024-03-20 Thread arnold
Note that I said EREs, which don't have to provide backreferences.

Arnold

Paul Eggert  wrote:

> On 3/20/24 01:40, arn...@skeeve.com wrote:
> > It's possible to write a POSIX compliant matcher for EREs that doesn't
> > have such problems; I know someone doing it.
>
> I think matching POSIX regular expressions in polynomial time is 
> equivalent to proving P=NP, i.e., you'll win the Turing Award if you 
> actually pull it off. This is due to back-references.
>
> See, for example:
>
> Câmpeanu C, Santean N. On pattern expression languages. 2007. Proc 
> AuthMathA. https://cs.smu.ca/~nic.santean/art/regex.pdf
>
> For a more programmer-friendly summary of the problem, and the subset of 
> POSIX REs that you can match more quickly, see:
>
> Pinto PED, Lopes JPA, de Brito MAS. A polynomial-time regular 
> expressions implementation. 2017. Cadernos Do IME - Série Informática, 
> 37(Junho), 22–36. https://doi.org/10.12957/cadinf.2016.13778
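
To make the BRE/ERE distinction in this exchange concrete: POSIX basic
regular expressions accept back-references such as \1, whereas the POSIX
ERE grammar does not require them. The following is only a sketch,
assuming a POSIX <regex.h>; the helper name bre_matches is made up for
this illustration.

```c
#include <regex.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: compile PATTERN as a POSIX BRE (cflags 0, so
   no REG_EXTENDED) and report whether it matches somewhere in TEXT.
   Returns false if the pattern does not compile.  */
static bool
bre_matches (const char *pattern, const char *text)
{
  regex_t re;
  if (regcomp (&re, pattern, 0) != 0)
    return false;
  bool ok = regexec (&re, text, 0, NULL, 0) == 0;
  regfree (&re);
  return ok;
}
```

With the BRE ^\(a*\)b\1$ this should accept "aabaa" and reject "aabaaa",
since \1 must repeat exactly what the group matched.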



Re: Forwarded: Segmentation Fault via recursive loop in Gawk

2024-03-20 Thread arnold
Hi.

Paul Eggert wrote:
> In glibc (and Gnulib) the regular-expression code has long been 
> maintained under the philosophy that the code cannot handle arbitrary 
> regular expressions. Any code that lets the user specify an arbitrary 
> regular expression is suspect, and this includes Awk scripts. (This is 
> also true for C libraries other than glibc/Gnulib.)
> 
> It'd be nice if someone could fix regex bugs like these in the glibc 
> regex code, but nobody has stepped forward to do that, and frankly it's 
> low priority. In the meantime, don't write Awk scripts with adversarial 
> regexps.

Thanks. This is more or less what I expected, and it's fine with me.
But I had to do my duty as gawk maintainer and forward the report.

Bruno Haible  wrote:
> Stack overflow inside the regex engine is only one of the problems. The
> other one is quadratic (or even exponential) running time. Such a running
> time can have fatal practical consequences [1]. The RE2 regex syntax [2]
> was designed to avoid such problems. But here, we are using POSIX regexes,
> which will always exhibit worst-case exponential running times.
>
> Bruno
>
> [1] https://blog.cloudflare.com/cloudflare-outage/
> [2] https://en.wikipedia.org/wiki/RE2_(software)

It's possible to write a POSIX compliant matcher for EREs that doesn't
have such problems; I know someone doing it.  In any case, users
get what they ask for, it's up to them to understand what they're doing.

Thanks,

Arnold



Forwarded: Segmentation Fault via recursive loop in Gawk

2024-03-19 Thread arnold
Hello.

Please see this report sent to the gawk list concerning regcomp.c.
I have attached his "POCFILE".

Thanks,

Arnold

> From: ttfish 
> Date: Tue, 19 Mar 2024 21:48:34 +0800
> Subject: Segmentation Fault via recursive loop in Gawk
> To: bug-g...@gnu.org
> Cc: secur...@gnu.org
>
> Content-Type: text/plain; charset="UTF-8"
>
> Dear GNU gawk developers,
>
> Greetings. I am writing to report a recursive loop bug found in gawk.
>
> ## Description
>
> The bug is located in the support/regcomp.c file within the parse_reg_exp
> function. The vulnerability involves the functions "parse_expression",
> "parse_branch" and "parse_sub_exp" and exists in the latest stable release
> (gawk 5.3.0) and the latest master branch
> (ff873ce52bf6a1766935281883b74b49edc7d38f, updated on March 04, 2024). The
> inner variables `preg`, `token`, `syntax` and `nest` keep unchanged
> values across the repeated calls, which leads to a segmentation fault.
>
> ## Proof of Concept
>
> The attached PoC could result in a segmentation fault and subsequent
> program termination.
>
> It could be reproduced by the attached PoC file with input:
>
> ```bash
> gawk -f POC-FILE {anyfile}
> ```
>
> The backtrace log could be found below:
>
> ```bash
> #4  0x006f3121 in parse_expression (regexp=0x75c09b30,
> preg=0x50b1a920, token=0x75aca4e0, syntax=2339405,
> nest=4466, err=0x75c09b20) at ./regcomp.c:2242
> #5  0x006f243d in parse_branch (regexp=0x75c09b30,
> preg=0x50b1a920, token=0x75aca4e0, syntax=2339405,
> nest=4466, err=0x75c09b20) at ./regcomp.c:2169
> #6  0x006ee668 in parse_reg_exp (regexp=0x75c09b30,
> preg=0x50b1a920, token=0x75aca4e0, syntax=2339405,
> nest=4466, err=0x75c09b20) at ./regcomp.c:2121
> #7  0x006f4e72 in parse_sub_exp (regexp=0x75c09b30,
> preg=0x50b1a920, token=0x75aca4e0, syntax=2339405,
> nest=4466, err=0x75c09b20) at ./regcomp.c:2456
> #8  0x006f3121 in parse_expression (regexp=0x75c09b30,
> preg=0x50b1a920, token=0x75aca4e0, syntax=2339405,
> nest=4465, err=0x75c09b20) at ./regcomp.c:2242
> #9  0x006f243d in parse_branch (regexp=0x75c09b30,
> preg=0x50b1a920, token=0x75aca4e0, syntax=2339405,
> nest=4465, err=0x75c09b20) at ./regcomp.c:2169
> #10 0x006ee668 in parse_reg_exp (regexp=0x75c09b30,
> preg=0x50b1a920, token=0x75aca4e0, syntax=2339405,
>
> # repeat ...
>
> #17868 0x006f3121 in parse_expression (regexp=0x75c09b30,
> preg=0x50b1a920, token=0x75aca4e0, syntax=2339405,
> nest=0, err=0x75c09b20) at ./regcomp.c:2242
> #17869 0x006f265a in parse_branch (regexp=0x75c09b30,
> preg=0x50b1a920, token=0x75aca4e0, syntax=2339405,
> nest=0, err=0x75c09b20) at ./regcomp.c:2176
> #17870 0x006ee668 in parse_reg_exp (regexp=0x75c09b30,
> preg=0x50b1a920, token=0x75aca4e0, syntax=2339405,
> nest=0, err=0x75c09b20) at ./regcomp.c:2121
> #17871 0x006e6db2 in parse (regexp=0x75c09b30,
> preg=0x50b1a920, syntax=2339405, err=0x75c09b20)
> at ./regcomp.c:2089
> #17872 0x006dd100 in re_compile_internal (
> preg=0x50b1a920,
> pattern=0x52c10200
> "()\326*()+\\2+()*\\2\277()\326*))*\\W3^\\e<\"\003^*", '('  times>..., length=28345, syntax=2339405)
> at ./regcomp.c:764
> #17873 0x006dc5ca in re_compile_pattern (
> pattern=0x52c10200
> "()\326*()+\\2+()*\\2\277()\326*))*\\W3^\\e<\"\003^*", '('  times>..., length=28345,
> bufp=0x50b1a920) at ./regcomp.c:217
> #17874 0x006a4128 in make_regexp (
> s=0x52c08200
> "()\326*()+\\5342+()*\\5342\277()\326*))*\\W3^\\e<\"\003^*", '('  160 times>..., len=28345, ignorecase=false,
> dfa=true, canfatal=false) at re.c:257
> #17875 0x005944c4 in make_regnode (type=Node_regex,
> exp=0x52609720)
> at /home/ttfish/Project/2024/DSLFuzz/gawk/awkgram.y:5297
> #17876 0x005728a6 in yyparse ()
> at /home/ttfish/Project/2024/DSLFuzz/gawk/awkgram.y:572
> #17877 0x0059fe3d in parse_program (
> pcode=0x113d8a0 , from_eval=false)
> at /home/ttfish/Project/2024/DSLFuzz/gawk/awkgram.y:2803
> #17878 0x006783e8 in main (argc=4, argv=0x7fffd9c8)
> at main.c:504
> ```
>
> ## Impact
>
> This vulnerability allows attackers to cause a denial of service by
> crashing the gawk instance or malicious memory manipulation.
>
> ## Attachments
>
> Please find the attached PoC file in the attachment.
>
> Please feel free to contact me if you have any further questions.
>
> Best regards,
> ttfish
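
The backtrace above appears to show one group of parser frames per level
of '(' nesting. A caller that has to accept untrusted patterns could, as
a stopgap, measure the nesting before handing the pattern to the regex
compiler. This is only a sketch under stated assumptions: the function
name and the idea of a caller-chosen limit are hypothetical and are not
something gawk or Gnulib is known to do.

```c
#include <stddef.h>

/* Hypothetical pre-check: return the deepest nesting of unescaped '('
   in PATTERN[0..LEN).  Deliberately naive: it honours backslash
   escapes but does not treat '(' inside a bracket expression
   specially.  */
static size_t
max_paren_depth (const char *pattern, size_t len)
{
  size_t depth = 0, deepest = 0;
  for (size_t i = 0; i < len; i++)
    {
      if (pattern[i] == '\\')
        i++;                    /* skip the escaped character */
      else if (pattern[i] == '(')
        {
          if (++depth > deepest)
            deepest = depth;
        }
      else if (pattern[i] == ')' && depth > 0)
        depth--;
    }
  return deepest;
}
```

The scan is iterative, so it uses constant stack space no matter how
deeply the pattern nests.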



POCFILE
Description: Binary data


Re: From wchar_t to char32_t

2023-07-02 Thread arnold
Hi.

Bruno Haible  wrote:

> Arnold: I have added '#if GAWK' conditionals, knowing that gawk's build system
> does not use gnulib-tool and you therefore pull from gnulib manually. This
> means the improvements will not land in gawk, since dfa in gawk will continue
> to use wchar_t.

Much thanks.

> Objections?

None at the moment. I am super busy but will eventually get to this.
If you can summarize to me, in private mail, what's going on I'd
appreciate it. If not, I will eventually read the links, but I don't
have spare time right now.

Thanks!

Arnold



Re: gnulib files issue compiling gawk with revived PCC

2023-05-02 Thread arnold
Hi.

Thanks for all this. I will review the changes and integrate
them as works for me.

I appreciate the help.

Arnold

Paul Eggert  wrote:

> On 2023-04-30 11:28, Aharon Robbins wrote:
> > This would seem to be due to the expansion of the INT_MULTIPLY_WRAPV
> > macro.  I tried following its definition, but got lost in the maze
> > of twisty little ifdefs.
>
> It's gotta be a bug in pcc's preprocessor: it's not expanding that 
> INT_MULTIPLY_WRAPV at all, and is simply erroring out. The attached 
> patches fix this not by tracking the bug down, but by working around it. 
> They change dfa.c to use ckd_mul (i.e., C23 style) instead of 
> INT_MULTIPLY_WRAPV (older Gnulib style). We are gradually changing 
> Gnulib to C23 style anyway, so this change is a win regardless of pcc's 
> bugs.
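
For readers unfamiliar with the C23 style mentioned here, a minimal
sketch of an overflow-checked multiplication with ckd_mul follows. The
helper name array_bytes is hypothetical, and the fallback macro assumes
a GCC-compatible compiler that provides __builtin_mul_overflow; it is
not the Gnulib implementation.

```c
#include <stdbool.h>
#include <stddef.h>

#if defined __has_include
# if __has_include (<stdckdint.h>)
#  include <stdckdint.h>
# endif
#endif
#ifndef ckd_mul
/* Fallback for pre-C23 toolchains; assumes a GCC-compatible builtin.  */
# define ckd_mul(r, a, b) __builtin_mul_overflow (a, b, r)
#endif

/* Hypothetical helper: store N * SIZE in *TOTAL and return true, or
   return false if the product does not fit in size_t.  */
static bool
array_bytes (size_t n, size_t size, size_t *total)
{
  /* ckd_mul returns true when the multiplication overflowed.  */
  return !ckd_mul (total, n, size);
}
```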
>
> I found several other problems with pcc and Gnulib as used in Gawk, and 
> made the following changes to Gnulib to port to Ubuntu 23.04 pcc:
>
> * pcc  doesn't define MB_LEN_MAX; fixed by 
> <https://git.savannah.gnu.org/cgit/gnulib.git/commit/?id=98deb4fad3bdc7986274feebac3f0f8a50fdce0a>.
>
> * pcc -E errors out on INT_MULTIPLY_WRAPV in Gnulib modules that Gawk 
> uses; fixed by 
> <https://git.savannah.gnu.org/cgit/gnulib.git/commit/?id=e915c32cc74671a03a4f656bdbbe9b8103a5ff19>,
>  
> <https://git.savannah.gnu.org/cgit/gnulib.git/commit/?id=a1d7a312646ec112140f4a3e112daac2194549df>,
>  
> and 
> <https://git.savannah.gnu.org/cgit/gnulib.git/commit/?id=bdc715b1f7a4eee75214709d4a949bdf65bcc9a2>.
>
> * Even though pcc claims to be GCC 4 and to support C11 extern inline, 
> it doesn't work; fixed by 
> <https://git.savannah.gnu.org/cgit/gnulib.git/commit/?id=20022b888d0da7f927fd18cb8f18d78f8ac03107>.
>
> To test the above with Gawk, I propagated recent Gnulib into my copy of 
> Gawk; see attached patches 0001-0010. pcc also mishandled some of Gawk's 
> own code, so I made five changes to Gawk directly; see patches 
> 0010-0015. Patch 0016 simply regenerates all autogenerated files. With 
> all these patches installed Gawk "./configure CC=pcc; make check" works 
> on Ubuntu 23.04 x86-64.
>
> Although these patches may seem large, almost all of them are simply 
> copies from Gnulib, or autogenerated. The parts I wrote by hand are 
> mostly summarized in the attached patch summary.patch, so I suggest 
> looking at that first. summary.patch is meant for human review; the 
> other patches can be slurped into Gawk simply via "git am 0*.patch".
>
> I'll cc this email to bug-gawk as I think Gnulib is now fixed for pcc, 
> and the attached patches are for Gawk not Gnulib.



Re: Clang-built Gawk 5.2.1 regex oddity

2023-01-06 Thread arnold
Thanks for the update.

Paul, let's leave dfa.c as is, with the modified code.
It's much easier to read anyway.

Thanks,

Arnold

Arsen Arsenović  wrote:

> Hi,
>
> Paul Eggert  writes:
>
> > This is a serious bug in Clang: it generates incorrect machine code.
> >
> > The code that Clang generates for the following (gawk/support/dfa.c lines
> > 1141-1143):
> >
> > ((dfa->syntax.dfaopts & DFA_CONFUSING_BRACKETS_ERROR
> >   ? dfaerror : dfawarn)
> >  (_("character class syntax is [[:space:]], not [:space:]")));
> >
> > is immediately followed by the code generated for the following
> > (gawk/support/dfa.c line 1015):
> >
> > dfaerror (_("invalid character class"));
> >
> > and this is incorrect because the two source code regions are not connected
> > with each other.
>
> This is now fixed in Clang:
> https://reviews.llvm.org/rGcf8fd210a35c8e93136cb8edc5c6a2e818dc1b1d
>
> Happy hacking!
> -- 
> Arsen Arsenović



Re: Clang-built Gawk 5.2.1 regex oddity

2023-01-01 Thread arnold
Hi Sam,

Thanks for the further info.

Looking at both bits of dfa.c code, I don't see how either can be
undefined behavior.

In any case, dfa.c is copied directly from GNULIB, so I am cc-ing
bug-gnulib.

Paul & Jim, for background, please see the thread at
https://lists.gnu.org/archive/html/bug-gawk/2022-12/msg00010.html.

This still smells like "compiler bug" to me, but even if not,
the GNULIB folks need to look at it.

I will take a look at testdfa; it's been a while since I've had to
use it, so maybe something has gotten out of sync.

Thanks,

Arnold

Sam James  wrote:

> > On 30 Dec 2022, at 09:13, arn...@skeeve.com wrote:
> > 
> > Hi.
> > 
> > Thanks for the report.
> > 
> > Although the dfa and regex code changed some between releases,
> > this smells strongly like a compiler issue and not a gawk issue.
> > 
> > I suggest first that you try compiling with clang but without
> > optimization. After running configure, edit the top level Makefile *and*
> > support/Makefile and remove any -O flags.  Then build.
>
> Kenton mentioned to me that with no optimisation, it works okay.
>
> > If the bug goes away, it's definitely a clang issue.
>
> It _probably_ is, but it's also possible it's UB. I tried building with UBSAN
> (as did Kenton) and we both got this when running the command he posted
> when built with Clang:
> ```
> $ ./configure CC=clang CFLAGS="-O2 -fsanitize=undefined -ggdb3" 
> LDFLAGS="-fsanitize=undefined -ggdb3"
> $ make
> $ export UBSAN_OPTIONS=print_stacktrace=1
> $ ./gawk 'BEGIN { RS="[[][:blank:]]" }'
> dfa.c:1141:6: runtime error: execution reached an unreachable program point
> #0 0x5db652 in parse_bracket_exp /tmp/gawk/support/dfa.c:1141:6
> #1 0x5c241a in lex /tmp/gawk/support/dfa.c:1543:37
> #2 0x5dc8f1 in atom /tmp/gawk/support/dfa.c:1888:24
> #3 0x5dc8f1 in closure /tmp/gawk/support/dfa.c:1961:3
> #4 0x5dc022 in branch /tmp/gawk/support/dfa.c:2002:3
> #5 0x5c7082 in regexp /tmp/gawk/support/dfa.c:2014:3
> #6 0x5c0e32 in dfaparse /tmp/gawk/support/dfa.c:2042:3
> #7 0x5c76c2 in dfacomp /tmp/gawk/support/dfa.c:3812:5
> #8 0x5abb33 in make_regexp /tmp/gawk/re.c:272:3
> #9 0x56dffd in set_RS /tmp/gawk/io.c:4092:14
> #10 0x50510b in r_interpret /tmp/gawk/./interpret.h
> #11 0x5754d7 in main /tmp/gawk/main.c:538:3
> #12 0x7f7bb5df464f in __libc_start_call_main 
> /var/tmp/portage/sys-libs/glibc-2.36-r6/work/glibc-2.36/csu/../sysdeps/nptl/libc_start_call_main.h:58:16
> #13 0x7f7bb5df4708 in __libc_start_main@GLIBC_2.2.5 
> /var/tmp/portage/sys-libs/glibc-2.36-r6/work/glibc-2.36/csu/../csu/libc-start.c:381:3
> #14 0x4092a4 in _start 
> /var/tmp/portage/sys-libs/glibc-2.36-r6/work/glibc-2.36/csu/../sysdeps/x86_64/start.S:115
>
> SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior dfa.c:1141:6 in # 
> (yes, this is cut off, I don't know why!)
> ```
>
> If I build with ASAN instead with Clang:
> ```
> $ ./configure CC=clang CFLAGS="-O2 -fsanitize=address -ggdb3" 
> LDFLAGS="-fsanitize=address -ggdb3"
> $ make
> $ ./gawk 'BEGIN { RS="[[][:blank:]]" }'
> =
> ==1517313==ERROR: AddressSanitizer: unknown-crash on address 0x7fa647137000 
> at pc 0x00658214 bp 0x7ffe59482ad0 sp 0x7ffe59482ac8
> READ of size 8 at 0x7fa647137000 thread T0
> #0 0x658213 in setbit /tmp/gawk/support/dfa.c:746:33
> #1 0x658213 in setbit_case_fold_c /tmp/gawk/support/dfa.c:868:7
> #2 0x658213 in parse_bracket_exp /tmp/gawk/support/dfa.c:1095:27
> #3 0x64b6d0 in lex /tmp/gawk/support/dfa.c:1543:37
> #4 0x6588dd in atom /tmp/gawk/support/dfa.c:1888:24
> #5 0x6588dd in closure /tmp/gawk/support/dfa.c:1961:3
> #6 0x64d84c in branch /tmp/gawk/support/dfa.c:2002:3
> #7 0x64d84c in regexp /tmp/gawk/support/dfa.c:2014:3
> #8 0x64aad6 in dfaparse /tmp/gawk/support/dfa.c:2042:3
> #9 0x64dbb7 in dfacomp /tmp/gawk/support/dfa.c:3812:5
> #10 0x6404df in make_regexp /tmp/gawk/re.c:272:3
> #11 0x611b66 in set_RS /tmp/gawk/io.c:4092:14
> #12 0x5c693b in r_interpret /tmp/gawk/./interpret.h
> #13 0x616e6b in main /tmp/gawk/main.c:538:3
> #14 0x7fa646ccc64f in __libc_start_call_main 
> /var/tmp/portage/sys-libs/glibc-2.36-r6/work/glibc-2.36/csu/../sysdeps/nptl/libc_start_call_main.h:58:16
> #15 0x7fa646ccc708 in __libc_start_main@GLIBC_2.2.5 
> /var/tmp/portage/sys-libs/glibc-2.36-r6/work/glibc-2.36/csu/../csu/libc-start.c:381:3
> #16 0x420df4 in _start 
> /var/tmp/portage/sys-libs/glibc-2.36-r6/work/glibc-2.36/csu/../sysdeps/x86_64/start.S:115
>
> Address 0x7fa647137000 is a wild pointer i

intprops.h and friends problem with pcc

2022-12-03 Thread arnold
Hi.

I am trying to compile gawk with the "pcc revived" compiler. You can
get it from https://github.com/arnoldrobbins/pcc-revived.

git clone https://github.com/arnoldrobbins/pcc-revived
cd pcc-revived
git checkout ubuntu-18  # changes for modern linux
./make-tmp.sh   # build and install under /tmp/pcc
export PATH=/tmp/pcc/bin:$PATH

Next, clone the gawk repo and then

cd gawk
./bootstrap.sh && ./configure CC=pcc
make

I get:

| $ make
| make  all-recursive
| make[1]: Entering directory '/home/arnold/Gnu/gawk/gawk.git'
| Making all in support
| make[2]: Entering directory '/home/arnold/Gnu/gawk/gawk.git/support'
| depbase=`echo dfa.o | sed 's|[^/]*$|.deps/&|;s|\.o$||'`;\
| pcc -DGAWK -DHAVE_CONFIG_H -I"./.." -I. -I.. -g -O2 -DNDEBUG -DNDEBUG -g -O2 -DNDEBUG -DNDEBUG -MT dfa.o -MD -MP -MF $depbase.Tpo -c -o dfa.o dfa.c &&\
| mv -f $depbase.Tpo $depbase.Po
| xalloc.h:418: error: wrong arg count

It doesn't like the INT_MULTIPLY_WRAPV macro for some reason.  Other
uses of this macro in other files copied from gnulib die similarly.

These macros are extremely opaque. Please investigate: pcc used to
work OK, but keeping up with gnulib has caused it to break.

Thanks,

Arnold



recent changes break gawk compilation

2022-10-15 Thread Arnold Robbins
Hi.

In trying to keep gawk up to date with gnulib, I find that there are
several recent changes that break compilation. I'm using Ubuntu 22.04.

In the files dfa.h, localeinfo.h, malloc/dynarray.h, and regex_internal.h,
the include of  was removed. I don't understand why.

In localeinfo.c, a static_assert or two are used instead of the previous
verify() macro.  Gawk sticks to c99 compatibility to support VMS which
only has a C99 front end on the compiler.
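
As an illustration of what a C99-compatible compile-time check can look
like (static_assert proper arrived with C11), here is a minimal sketch.
The macro name COMPILE_TIME_CHECK is hypothetical, and this is not how
Gnulib's verify() is implemented.

```c
#include <limits.h>

/* Hypothetical C99-compatible compile-time check: declares an array
   type whose size is negative, and therefore invalid, when COND is
   false.  NAME must be a unique identifier.  */
#define COMPILE_TIME_CHECK(name, cond) \
  typedef char name[(cond) ? 1 : -1]

/* Compiles only if every unsigned char value fits in an int.  */
COMPILE_TIME_CHECK (uchar_max_fits_in_int, UCHAR_MAX <= INT_MAX);
```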

Can all of the above be reverted in gnulib please?

Thanks,

Arnold



Re: bug#20657: Accepting [xyz---abc] - three minus signs to mean one

2022-04-24 Thread arnold
Paul Eggert  wrote:

> On 4/24/22 06:21, arn...@skeeve.com wrote:
> > I plan to add a test to gawk; perhaps grep would benefit from one as well?
>
> That'd need more than just a test, as we'd need to also modify regex.m4 
> to arrange to replace any system regex that has this incompatibility 
> with gnulib regex. And we'd need to document the extension since we 
> shouldn't test undocumented features. Although such work could be done, 
> I expect it'd be a more productive use of our limited time to get this 
> extension into glibc first. I'll add that to my (long) list of things to do.

OK - I agree that getting this into glibc is higher priority.

Thanks,

Arnold



Re: Accepting [xyz---abc] - three minus signs to mean one

2022-04-24 Thread arnold
Hi Paul.

Thanks for this. The patch looks good. I will (eventually) merge it
into gawk instead of my change.

I plan to add a test to gawk; perhaps grep would benefit from one as well?

Thanks,

Arnold

Paul Eggert  wrote:

> On 4/21/22 00:57, Arnold Robbins wrote:
>
> > As far as my testing indicates, dfa.c doesn't need a patch, it seems
> > to accept "---" inside brackets for a single minus.
>
> Yes, a brief perusal of the dfa.c source code suggests you're right. 
> Thanks for looking into this. I tend to agree with you that POSIX is not 
> likely to outlaw this extension.
>
>
> > If there are no objections, can we get this into Gnulib?
>
> Although the basic idea looks good, I see a few places where the patch 
> can be improved.
>
> * The two calls to re_string_peek_byte might go past the end of the 
> pattern (a subscript violation). This is possible because the pattern is 
> not necessarily null-terminated.
>
> * The two calls to re_string_fetch_byte can be simplified into a single 
> call to re_string_skip_bytes.
>
> * No need to assign to token->opr.c, as it already has the correct value.
>
> * Can fall through to the default case to save a bit of duplicate code.
>
> * glibc still uses comments /* like this */ for style reasons, and we 
> should stick to that.
>
> I wrote a patch with these improvements in mind and installed it into 
> Gnulib (see attached); hope it works for Gawk too.
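
The first point in the list above, looking ahead in a buffer that is not
null-terminated, can be sketched as follows. The helper name and its
(buf, len, idx) interface are assumptions made for this example; this is
not the code that was installed in Gnulib.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: report whether BUF[IDX..] begins with three
   '-' bytes, never reading at or beyond BUF + LEN.  */
static bool
triple_minus_at (const char *buf, size_t len, size_t idx)
{
  return (idx <= len && len - idx >= 3
          && buf[idx] == '-' && buf[idx + 1] == '-'
          && buf[idx + 2] == '-');
}
```

The length test comes first, so the byte comparisons are only reached
when all three positions are inside the buffer.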



Re: Accepting [xyz---abc] - three minus signs to mean one

2022-04-21 Thread arnold
Hi.

Bruno Haible  wrote:

> Is there some realistic possibility that the POSIX regex syntax might be
> extended in the future, in such a way that [^0-9---] means something
> different?

That shouldn't happen, as one can point at V7 Unix and Unix awk and
mawk as treating --- as "-" since forever. Existing practice trumps
(or should trump) innovative new interpretations.

> If that happens, and if we opt now to assign a meaning to this
> regex, we would have to choose between POSIX compliance and backward
> compatibility — a bad situation.

I don't think it's a realistic worry.

My two cents of course.  I have already pushed this change in gawk's
copy of regex.

Thanks,

Arnold



Accepting [xyz---abc] - three minus signs to mean one

2022-04-21 Thread Arnold Robbins
Greetings.

Way back in May of 2015, Nelson Beebe submitted the following
bug report for gawk:

> Date: Mon, 25 May 2015 14:21:04 -0600 (MDT)
> From: "Nelson H. F. Beebe" 
> To: "Arnold Robbins" 
> Cc: be...@math.utah.edu
> Subject: gawk-4.1.3 regexp error
> 
> I just ran an old (1996--date) awk program with gawk-4.1.3 and got an
> error that can be exhibited like this:
> 
>   % gawk '/[^0-9---]/ {print}'
>   gawk: cmd. line:1: error: tent of \{\}: /[^0-9---]/
> 
> As far as I can see, that is a perfectly valid range expression, and
> using three hyphens to represent one hyphen is the traditional way
> to incorporate a hyphen in the expression.

The upshot was that regex didn't support this, and I didn't (at the
time) want to tackle trying to fix it.  (I did fix the error message,
at least.)

I submitted a bug report about it. At the time, Paul Eggert said the following:

> Date: Mon, 25 May 2015 23:53:31 -0700
> From: Paul Eggert 
> To: arn...@skeeve.com, 20...@debbugs.gnu.org
> Subject: Re: bug#20657: Traditional range expression not accepted in regex/dfa
> 
> arn...@skeeve.com wrote:
> 
> > The bugaboo here is the "---"; it's
> > a range expression consisting of minus through minus, and apparently long
> > ago was how one got a minus into a bracket expression.
> 
> Actually, long ago expressions like '[^0-9-]' worked just as they do now,
> and it wasn't ever necessary to use trailing "---".  That being said,
> it is true that in 7th Edition Unix '[^0-9---]' meant the same thing as
> '[^0-9-]', so in that sense we have an incompatibility with 7th Edition
> Unix here.
> 
> > $ ./src/grep '[^0-9---]' /dev/null
> > ./src/grep: Invalid range end
> >
> > The underlying regex and, I believe, dfa routines don't accept this.
> 
> Yes, that's correct.  It's not a bug, though, as the regexp is ambiguous
> and does not conform to POSIX, which says the following about RE
> bracket expressions: "To use a <hyphen-minus> as the starting range point,
> it shall either come first in the bracket expression or be specified
> as a collating symbol; for example, "[][.-.]-0]", which matches either
> a <right-square-bracket> or any character or collating element that
> collates between <hyphen-minus> and 0, inclusive."
> <http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html#tag_09_03_05>
>  
> In your correspondent's example, the hyphen is a starting range point
> but is neither first in the bracket expression nor is specified as a
> collating symbol, so the regexp doesn't conform to POSIX.
> 
> Even though it's not a bug I suppose it wouldn't hurt to make the GNU
> matchers compatible with 7th Edition Unix here, if someone really wants
> to take that task on; it's not urgent, though.
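
A small sketch of the conforming form discussed in the quoted message,
with the hyphen placed last in the bracket expression. It assumes a
POSIX <regex.h>; the helper name ere_matches is made up here. Whether
"[^0-9---]" compiles depends on the matcher in use, so the sketch makes
no claim about it.

```c
#include <regex.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: true if the ERE PATTERN compiles and matches
   somewhere in TEXT; false on no match or on a compile error.  */
static bool
ere_matches (const char *pattern, const char *text)
{
  regex_t re;
  if (regcomp (&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
    return false;
  bool ok = regexec (&re, text, 0, NULL, 0) == 0;
  regfree (&re);
  return ok;
}
```

For example, ere_matches ("[^0-9-]", "12a") should be true because of
the 'a', while ere_matches ("[^0-9-]", "12-34") should be false since
every character is a digit or a hyphen.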

I had some time yesterday, and feeling brave and a little stronger in
The Force than usual, I came up with the attached patch. It doesn't
break any of my tests.

As far as my testing indicates, dfa.c doesn't need a patch, it seems
to accept "---" inside brackets for a single minus.

If there are no objections, can we get this into Gnulib?

Thanks,

Arnold
diff --git a/support/regcomp.c b/support/regcomp.c
index b607c853..adfe28e2 100644
--- a/support/regcomp.c
+++ b/support/regcomp.c
@@ -2039,7 +2039,21 @@ peek_token_bracket (re_token_t *token, re_string_t *input, reg_syntax_t syntax)
   switch (c)
 {
 case '-':
-  token->type = OP_CHARSET_RANGE;
+  // Special case. V7 Unix grep and Unix awk and mawk allow
+  // [...---...] (3 minus signs in a bracket expression) to represent
+  // a single minus sign.  Let's try to support that without breaking
+  // anything else.
+  if (re_string_peek_byte (input, 1) == '-' && re_string_peek_byte (input, 2) == '-')
+	{
+	   // advance past the minus signs
+	   (void) re_string_fetch_byte (input);
+	   (void) re_string_fetch_byte (input);
+
+	   token->type = CHARACTER;
+	   token->opr.c = '-';
+	}
+  else
+	token->type = OP_CHARSET_RANGE;
   break;
 case ']':
   token->type = OP_CLOSE_BRACKET;


Re: gawk-5.1.1 bug report

2022-04-06 Thread arnold
Paul Eggert  wrote:

> On 4/6/22 01:24, arn...@skeeve.com wrote:
> > Most people
> > would wonder "Why is there a bitwise and here?" and not think of it
> > as a logical and.
>
> I'm not sure I agree about the "most", as I expect most people won't 
> notice or care about this level of detail. However, for people who 
> wonder like that, how about adding an explanatory comment? That will help 
> people who are unaccustomed to this valid and useful (albeit 
> less-common) programming style. Something like the attached (untested) 
> patch, perhaps?
>
> > & for a logical test can be dangerous since any non-zero
> > value can be true.
>
> Sure, but that's an issue only when using & on types like 'int'. It's 
> not an issue when using & on 'bool'. Similarly, + has rounding issues on 
> 'float' but that doesn't mean we need to worry about +'s rounding issues 
> on 'int'.

I don't care for the diff. It's a lot more work than changing & to &&,
but as I have my own copy of dfa.c I won't worry about it.

Thanks,

Arnold



Re: gawk-5.1.1 bug report

2022-04-06 Thread arnold
Paul Eggert  wrote:

> On 4/6/22 00:04, arn...@skeeve.com wrote:
> > IMHO clear code beats saving a single branch
>
> Sure, but clarity also argues for "&" over "&&" here. Writing "f(x) && 
> f(y)" would incorrectly imply that it's important that f(y) should not 
> be evaluated when f(x) is false, an implication that is incorrect here. 
> Writing "f(x) & f(y)" tells the reader that both sides are safe to 
> evaluate and that they can be evaluated in either order, something I 
> found worth knowing when I read that part of the code.

Only because you have umpteen years of C programming. Most people
would wonder "Why is there a bitwise and here?" and not think of it
as a logical and.

I'll stick to my opinion that && is better here since we're doing
logical tests; the short-circuit nature of && is less important.

In addition, & for a logical test can be dangerous since any non-zero
value can be true.  Even though you're using bool functions, &&
guarantees a logical true/false instead of an accidental one.
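
The two positions in this exchange can be seen side by side in a small
sketch. All names below are hypothetical stand-ins, not the dfa.c code;
the counter exists only so that the number of evaluations is observable.

```c
#include <stdbool.h>

static int calls;   /* counts predicate invocations, for illustration */

/* Stand-in bool predicate.  */
static bool
is_ascii_digit (int c)
{
  calls++;
  return '0' <= c && c <= '9';
}

static bool
both_digits_bitwise (int c, int c2)
{
  /* Both operands are always evaluated.  */
  return is_ascii_digit (c) & is_ascii_digit (c2);
}

static bool
both_digits_logical (int c, int c2)
{
  /* The right operand is skipped when the left one is false.  */
  return is_ascii_digit (c) && is_ascii_digit (c2);
}

/* On plain int "truth" values, & and && can disagree.  */
static int int_bitwise (int a, int b) { return a & b; }
static int int_logical (int a, int b) { return a && b; }
```

On bool operands the two forms give the same result and differ only in
how many calls are made; on int operands such as 2 and 1, the bitwise
form yields 0 while the logical form yields 1.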

Thanks,

Arnold



Re: gawk-5.1.1 bug report

2022-04-06 Thread arnold
> On 4/5/22 22:18, arn...@skeeve.com wrote:
> >   dfa.c:1093:27: warning: use of bitwise '&' with boolean operands 
> > [-Wbitwise-instead-of-logical]

Paul Eggert  wrote:
> It's valid in C to use bitwise '&' on bool, and doing so here eliminates 
> a conditional branch at the machine level, which can be a win.
>
> How about if you disable -Wbitwise-instead-of-logical instead, since 
> it's a false alarm?

IMHO clear code beats saving a single branch, which is something I
doubt you can even measure on modern systems.  As Knuth said,
"Premature optimization is the root of all evil." I think that
applies here.

I (at least) request that you make the change in dfa.c.

Thanks,

Arnold



Re: gawk-5.1.1 bug report

2022-04-05 Thread arnold
Hi.

Thanks for the report.  I am cc'ing the GNULIB guys, as they
are the upstream for dfa.c.  In the meantime, I will make this
change in gawk.

Thanks!

Arnold

David Binderman  wrote:

> Hello there,
>
> I just tried to compile gawk-5.1.1 with new clang-14. It said
>
>  dfa.c:1093:27: warning: use of bitwise '&' with boolean operands 
> [-Wbitwise-instead-of-logical]
>
> Source code is
>
>   || (isasciidigit (c) & isasciidigit (c2)))
>
> Maybe better code is
>
>   || (isasciidigit (c) && isasciidigit (c2)))
>
> Regards
>
> David Binderman
>



Re: non-gnulib bug in dfa.h

2021-08-29 Thread arnold
Paul Eggert  wrote:

> On 8/29/21 8:55 AM, arn...@skeeve.com wrote:
> > Hi.
> > 
> > I had to make the change below to dfa.h to get things to compile
> > in gawk.  Please apply this.
>
> Sorry about my typo, and thanks for the fix; I installed it.

No problem. Much thanks.

Arnold



non-gnulib bug in dfa.h

2021-08-29 Thread arnold
Hi.

I had to make the change below to dfa.h to get things to compile
in gawk.  Please apply this.

Thanks,

Arnold
-
--- /usr/local/src/Gnu/gnulib/lib/dfa.h 2021-08-27 16:50:39.579581132 +0300
+++ support/dfa.h   2021-08-29 18:30:25.101719167 +0300
@@ -50,6 +50,7 @@
 #ifndef _GL_ATTRIBUTE_MALLOC
 # define _GL_ATTRIBUTE_MALLOC
 # define _GL_ATTRIBUTE_DEALLOC_FREE
+# define _GL_ATTRIBUTE_DEALLOC(x,y)
 # define _GL_ATTRIBUTE_RETURNS_NONNULL
 #endif
 



Re: possible bug in regex and dfa

2021-07-18 Thread arnold
Hi.

Bruno Haible  wrote:

>   - if REG_NEWLINE is not set, '.' matches newline but '^' does not match
> after the newline.

This is indeed the desired behavior, but regex isn't following it.

REG_NEWLINE being set gets translated into preg->newline_anchor. 

Starting at line 620, regexec.c relates to it:

|   /* If initial states with non-begbuf contexts have no elements,
|  the regex must be anchored.  If preg->newline_anchor is set,
|  we'll never use init_state_nl, so do not check it.  */
|   if (dfa->init_state->nodes.nelem == 0
|   && dfa->init_state_word->nodes.nelem == 0
|   && (dfa->init_state_nl->nodes.nelem == 0
| || !preg->newline_anchor))
| {
|   if (start != 0 && last_start != 0)
| return REG_NOMATCH;
|   start = last_start = 0;
| }

(As a side note, I don't think the comment matches the code.)

In my case, preg->newline_anchor is zero (correctly), but
dfa->init_state->nodes.nelem is not, so this body isn't executed.
Making the test for preg->newline_anchor the first thing causes my test
case to work correctly but breaks the gawk test suite.

In other words, I think the bug is somewhere in this area, but I
don't understand the regex internals enough to fix it.  dfa will also
need looking at.

Thanks,

Arnold



Re: possible bug in regex and dfa

2021-07-18 Thread arnold
Bruno Haible  wrote:

> Hi Arnold,
>
> > Dot matching newline isn't the issue here.
> > 
> > It's ^ matching in the middle of a string.  For my purposes, ^ should
> > only match at the beginning of a *string* (as $ should only match at
> > the end of a string).  I haven't rechecked POSIX, but this is how awk
> > has behaved since forever.
>
> Hmm. Regarding POSIX: I've read section 9.3.8 and 9.4.9 of [1],
> the description of REG_NOTBOL, REG_NOTEOL in [2], and the description
> of REG_NEWLINE in [3]. If I understand it correctly, within POSIX,
> ".^" should not match a newline because
>   - if REG_NEWLINE is set, '^' matches after the newline but '.' does not
> match the newline,
>   - if REG_NEWLINE is not set, '.' matches newline but '^' does not match
> after the newline.

That makes sense.  This is why I felt that, for gawk, ".^" is an invalid
regexp. (Indeed, the original Unix awk rejects it as such.)

REG_NEWLINE is not included in any of the RE_*_AWK definitions since I
want exactly the behavior you describe: dot matches newline but ^ does
not match after the newline.
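
The REG_NEWLINE behavior described above can be checked with the plain
POSIX interface. This is a minimal sketch assuming a POSIX <regex.h>;
the helper name and its extra_cflags parameter are inventions of this
example. It deliberately uses the simple patterns "^b" and "a.b" rather
than ".^", whose handling is the subject of this thread.

```c
#include <regex.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: compile PATTERN as an ERE with EXTRA_CFLAGS
   (for example 0 or REG_NEWLINE) and report whether it matches
   somewhere in TEXT.  Returns false on a compile error.  */
static bool
ere_matches_flags (const char *pattern, const char *text,
                   int extra_cflags)
{
  regex_t re;
  if (regcomp (&re, pattern, REG_EXTENDED | REG_NOSUB | extra_cflags) != 0)
    return false;
  bool ok = regexec (&re, text, 0, NULL, 0) == 0;
  regfree (&re);
  return ok;
}
```

On the text "a\nb", "^b" should match only with REG_NEWLINE, and "a.b"
should match only without it.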

To me this feels very much like a bug.

> However, GNU regex.h also has a flag RE_CONTEXT_INDEP_ANCHORS; I don't know
> what effect it has.

In this case it makes things worse, causing gawk to match ".^" literally.

> > (And how I've documented things in the manual, also since forever.)
>
> If you want the behaviour of the GNU regex to be stable over time, you
> should contribute unit tests to tests/test-regex.c.

This is a separate issue. It almost sounds like you're saying "it's your
fault there's a bug here, you didn't contribute unit tests".  I hope
that's not your intent; if it is then sorry, I don't buy it.

In any case, I've supplied a regexp, input data, and in the gawk dist,
a test harness, so that debugging can be done if one of the Gnulib
maintainers will look into this particular issue.

Thanks,

Arnold



Re: possible bug in regex and dfa

2021-07-18 Thread arnold
Hi.

Paul Eggert  wrote:

> On 7/15/21 1:48 PM, Arnold Robbins wrote:
> > The regexp used there, ".^", to my mind should be treated as invalid.
>
> No, that regular expression is valid because "." matches newline in 
> POSIX EREs. So the "." matches a newline, and the following "^" matches 
> the start of the next line.

Bruno Haible  wrote:

> > No, that regular expression is valid because "." matches newline in 
> > POSIX EREs.
>
> And if you don't like this, you need to remove the RE_DOT_NEWLINE flag from
> the value that you pass to re_set_syntax.

Dot matching newline isn't the issue here.

It's ^ matching in the middle of a string.  For my purposes, ^ should
only match at the beginning of a *string* (as $ should only match at
the end of a string).  I haven't rechecked POSIX, but this is how awk
has behaved since forever. (And how I've documented things in the manual,
also since forever.)

For RS, gawk treats the concatenation of the input files as one long
string, so ^ should only match at the very beginning, and $ at the
very end.

But even for strings the GNU regex routines seem to get it wrong:

$ cat y.awk
BEGIN {
data = "a.^b\na.^b\n"
gsub(/.^/, ">&<", data)
print data
}

$ mawk -f y.awk # gets it right IMHO
a.^b
a.^b

$ nawk -f y.awk
nawk: syntax error in regular expression .^ at 
 source line number 3 source file y.awk
 context is
gsub(/.^/, ">&<", >>>  data) <<< 

$ ./gawk -f y.awk
a.^b>

<

Is there some way I can get the regex routines (and dfa) to relate
to ^ and $ as relative to the *string* and not the *line*?

Thanks,

Arnold



possible bug in regex and dfa

2021-07-15 Thread Arnold Robbins
Hi.

Please see the thread starting at

https://lists.gnu.org/archive/html/bug-gawk/2021-07/msg00026.html

The regexp used there, ".^", to my mind should be treated as invalid.
Mawk does so, reading the entire file as one record.  Gawk matches a
newline for it:

$ cat data
a.^b
a.^b

$ cat x.awk
BEGIN { RS = ".^" }

{
gsub(/.^/, ">&<")
print NR, $0
print "RT=<" RT ">"
}

$ mawk -f x.awk data
1 a.^b
a.^b

RT=<>

$ ./gawk -f x.awk data
1 a.^b
RT=<
>
2 a.^b
RT=<
>

To make debugging easier, there is a test program in the gawk
git repo that just does regexp matching the way gawk does, called
testdfa.  To use it,

git clone git://git.savannah.gnu.org/gawk.git
cd gawk
./bootstrap && ./configure
## edit Makefile and support/Makefile to remove -O, add -g
make -j
cd helpers
gcc -g -I.. -I../support testdfa.c ../support/libsupport.a -o testdfa

When run:

$ cd helpers
$ ./testdfa -b '.^' < ../data
Ignorecase: false
Syntax: RE_BACKSLASH_ESCAPE_IN_LISTS|RE_CHAR_CLASSES|RE_CONTEXT_INDEP_ANCHORS|RE_DOT_NEWLINE|RE_INTERVALS|RE_NO_BK_BRACES|RE_NO_BK_PARENS|RE_NO_BK_VBAR|RE_NO_EMPTY_RANGES|RE_UNMATCHED_RIGHT_PAREN_ORD|RE_INVALID_INTERVAL_ORD
Pattern: /.^/, len = 2
After setup_pattern(), len = 2
MB_CUR_MAX = 6
Calling dfacomp(.^, 2, 0x55e9d56a5600, true)
re_search returned position 4 (true)
dfaexec returned 5 (a.^)

If this is supposed to match a newline, I'd like to understand why.
If it's not, I'd like to get a fix for regexp and dfa.  Or if
RE_SYNTAX_GNU_AWK needs more or fewer syntax bits[1], I'd like to
know which, and why.

Please cc me on any and all replies, as I'm not subscribed to
this list.

Thanks,

Arnold

[1] I hate the syntax bits. I have hated them for decades. Sigh.



Re: warnings from MacOS clang

2021-05-28 Thread arnold
Paul Eggert  wrote:

> On 5/27/21 1:46 PM, Eric Blake wrote:
>
> > Yet another portable solution is:
> > 
> > static mbstate_t s1;
> > mbstate_t s = s1;
> > 
> > also with its own form of ugliness.
>
> I did that years ago, but compilers complained about it when I made s1 
> 'const', and I vaguely recall complaints even when it wasn't 'const' 
> ("What? You're declaring a static variable that is always zero and never 
> changes? That must be a bug!!").
>
> At this point I wouldn't worry about the older clang and gcc versions 
> that complain about {0} as an initializer. We can either let them die 
> off noisily, or use the appropriate -Wno-whatever option when using them 
> to compile.

I've decided to just not worry about it. It's impossible to compile
without warnings on every single C compiler in the world.

Thanks,

Arnold



warnings from MacOS clang

2021-05-12 Thread arnold
Hi.

I got the below from one of my testers. If y'all feel like updating the
relevant files in GNULIB, that'd be great. If instead you feel like,
well, to heck with that, that's also OK. :-)

Thanks,

Arnold

> From: Pat Rankin 
> Date: Mon, 10 May 2021 18:13:33 -0700
>
> > https://www.skeeve.com/gawk/gawk-5.1.1c.tar.gz
>
> OSX 10.11.6
> Building after using 'touch .developing' for the first time, I get
>
> depbase=`echo dfa.o | sed 's|[^/]*$|.deps/&|;s|\.o$||'`;\
> gcc -DGAWK -DHAVE_CONFIG_H -I"./.." -I. -I..   -I/opt/local/include
> -I/opt/local/include -g -O2 -DARRAYDEBUG -DYYDEBUG -DLOCALEDEBUG
> -DMEMDEBUG -Wall -fno-builtin -g3 -ggdb3 -g -O2 -DARRAYDEBUG -DYYDEBUG
> -DLOCALEDEBUG -DMEMDEBUG -Wall -fno-builtin -g3 -ggdb3 -MT dfa.o -MD
> -MP -MF $depbase.Tpo -c -o dfa.o dfa.c &&\
> mv -f $depbase.Tpo $depbase.Po
>
> dfa.c:1627:19: warning: suggest braces around initialization of subobject
>   [-Wmissing-braces]
>   mbstate_t s = { 0 };
>   ^
>   {}
> 1 warning generated.
>
> Also two similar warnings from localinfo.c, lines 47 and 103.
> [Note: despite being invoked as 'gcc' the compiler is 'clang'
> and not a particularly recent version.]
>
> Even with warnings, the build completed successfully.
> All tests passed.
>



Re: malloc/dynarray-skeleton.c problems on MacOS 10.13.6 and 10.15.7

2021-05-07 Thread arnold
Paul Eggert  wrote:

> On 5/6/21 11:23 PM, arn...@skeeve.com wrote:
> > I'd prefer to see it fixed upstream...
>
> It was fixed upstream a couple of weeks ago. You should be able to fix 
> the Gawk issue by syncing Gawk from Gnulib.

Thanks, will do.

Arnold



Re: malloc/dynarray-skeleton.c problems on MacOS 10.13.6 and 10.15.7

2021-05-07 Thread arnold
Jeffrey Walton  wrote:

> On Fri, May 7, 2021 at 2:13 AM  wrote:
> >
> > Hi Paul & Jim,
> >
> > Please see the report below from Nelson Beebe, attempting to build
> > https://www.skeeve.com/gawk/gawk-5.1.1a.tar.gz on recent MacOS.
> > This is a test release, working towards a real release.
> >
> > Can you please work directly with Nelson in terms of patches etc and
> > then let me know when Gnulib is updated?
>
> I can't speak for others, but I solve it with:
>
> grep -IR nonnull ./* | cut -f 1 -d ':' | sort | uniq
>
> Then, for each file in the list:
>
> sed -e 's/__nonnull ((1))//g' \
> -e 's/__nonnull ((1, 2))//g' \
> "${file}" > "${file}.fixed"
> mv "${file}.fixed" "${file}"
>
> (You will need it for more than Awk. Wget and a couple of others need
> the treatment, too).
>
> Jeff

Thanks. I can do this, but I'd prefer to see it fixed upstream...

Arnold



malloc/dynarray-skeleton.c problems on MacOS 10.13.6 and 10.15.7

2021-05-07 Thread arnold
Hi Paul & Jim,

Please see the report below from Nelson Beebe, attempting to build
https://www.skeeve.com/gawk/gawk-5.1.1a.tar.gz on recent MacOS.
This is a test release, working towards a real release.

Can you please work directly with Nelson in terms of patches etc and
then let me know when Gnulib is updated?

Much thanks,

Arnold

> Date: Thu, 6 May 2021 15:15:08 -0600
> From: "Nelson H. F. Beebe" 
> Subject: Re: Release spiral start
>
> I got successful builds and installations of gawk-5.1.1a on several
> systems, including CentOS 5/6/7/8, Ubuntu 20.04, Debian 7 (MIPS32),
> Oracle 8, Red Hat 8, and Solaris 10, but build attempts on macOS
> 10.13.6 and 10.15.7 with both Apple's /usr/bin/cc (really clang 12)
> and /usr/bin/gcc (clang 12 masquerading as gcc), plus on macOS
> 10.13.6, gcc-{4,5,6,7,8,9,10,11,12}, have all failed.
>
> The killer code is the several instances of __nonnull ((1)) in
> support/malloc/dynarray-skeleton.c.  They produce a cascade of errors
> that start like this:
>
>   /usr/bin/gcc -DGAWK -DHAVE_CONFIG_H -I"./.." -I. -I..\
>  -g -O2 -DNDEBUG -g -O2 -DNDEBUG -MT regex.o \
>  -MD -MP -MF $depbase.Tpo -c -o regex.o  \
>  regex.c &&  \
>   mv -f $depbase.Tpo $depbase.Po
> In file included from regex.c:74:
> In file included from ./regexec.c:1368:
> ./malloc/dynarray-skeleton.c:195:13: error: expected ')'
> __nonnull ((1))
> ^
> ./malloc/dynarray-skeleton.c:195:12: note: to match this '('
> __nonnull ((1))
>^
> ./malloc/dynarray-skeleton.c:195:13: warning: type specifier missing, defaults to 'int' [-Wimplicit-int]
> __nonnull ((1))
> ^
> ./malloc/dynarray-skeleton.c:195:1: warning: type specifier missing, defaults to 'int' [-Wimplicit-int]
> __nonnull ((1))
>
> ... many more ...
>
> ---
> - Nelson H. F. Beebe                    Tel: +1 801 581 5254
> - University of Utah                    FAX: +1 801 581 4148
> - Department of Mathematics, 110 LCB    Internet e-mail: be...@math.utah.edu
> - 155 S 1400 E RM 233                   be...@acm.org  be...@computer.org
> - Salt Lake City, UT 84112-0090, USA    URL: http://www.math.utah.edu/~beebe/
> ---



Re: current gnulib regex breaks in gawk

2021-04-22 Thread arnold
I have pushed fixes for this. Let me know if there are still issues.

Thanks,

Arnold

arn...@skeeve.com wrote:

> "Dmitry V. Levin"  wrote:
>
> > On Sat, Apr 17, 2021 at 01:43:58PM -0600, arn...@skeeve.com wrote:
> > > "Dmitry V. Levin"  wrote:
> > > 
> > > > I've just tried to build the latest commit gawk-5.1.0-260-gde598391 from
> > > > gawk-5.1-stable branch.  Unfortunately, the result executable uses a
> > > > private glibc interface:
> > > > $ nm gawk |grep GLIBC_PRIVATE
> > > >  U __libc_dynarray_resize@GLIBC_PRIVATE
> > > > This makes it unusable at least in GNU/Linux distributions.
> > > 
> > > Can you explain how this makes it unusable?  I see this on Ubuntu
> > > but the gawk executables run just fine.
> > > 
> > > What, really, is the problem here?  I don't understand.
> >
> > Well, GLIBC_PRIVATE is a private glibc interface intended for use by
> > various parts of glibc itself, it can change (and does change from time
> > to time) without providing backwards compatibility, any symbol in
> > GLIBC_PRIVATE can disappear or change its semantics during glibc update.
> > Consequently, packages are not allowed to have dependencies on
> > GLIBC_PRIVATE.
>
> So, the problem is that __libc_dynarray_resize is actually not linked
> into gawk, but the executable runs because the local libc happens to
> supply the symbol.  But since it's "private" to GLIBC, that symbol
> being there can't be relied upon.
>
> OK --- I will work on this.
>
> Thanks,
>
> Arnold



Re: current gnulib regex breaks in gawk

2021-04-17 Thread arnold
"Dmitry V. Levin"  wrote:

> On Sat, Apr 17, 2021 at 01:43:58PM -0600, arn...@skeeve.com wrote:
> > "Dmitry V. Levin"  wrote:
> > 
> > > I've just tried to build the latest commit gawk-5.1.0-260-gde598391 from
> > > gawk-5.1-stable branch.  Unfortunately, the result executable uses a
> > > private glibc interface:
> > > $ nm gawk |grep GLIBC_PRIVATE
> > >  U __libc_dynarray_resize@GLIBC_PRIVATE
> > > This makes it unusable at least in GNU/Linux distributions.
> > 
> > Can you explain how this makes it unusable?  I see this on Ubuntu
> > but the gawk executables run just fine.
> > 
> > What, really, is the problem here?  I don't understand.
>
> Well, GLIBC_PRIVATE is a private glibc interface intended for use by
> various parts of glibc itself, it can change (and does change from time
> to time) without providing backwards compatibility, any symbol in
> GLIBC_PRIVATE can disappear or change its semantics during glibc update.
> Consequently, packages are not allowed to have dependencies on
> GLIBC_PRIVATE.

So, the problem is that __libc_dynarray_resize is actually not linked
into gawk, but the executable runs because the local libc happens to
supply the symbol.  But since it's "private" to GLIBC, that symbol
being there can't be relied upon.

OK --- I will work on this.

Thanks,

Arnold



Re: current gnulib regex breaks in gawk

2021-04-17 Thread arnold
"Dmitry V. Levin"  wrote:

> I've just tried to build the latest commit gawk-5.1.0-260-gde598391 from
> gawk-5.1-stable branch.  Unfortunately, the result executable uses a
> private glibc interface:
> $ nm gawk |grep GLIBC_PRIVATE
>  U __libc_dynarray_resize@GLIBC_PRIVATE
> This makes it unusable at least in GNU/Linux distributions.

Can you explain how this makes it unusable?  I see this on Ubuntu
but the gawk executables run just fine.

What, really, is the problem here?  I don't understand.

Thanks,

Arnold



Re: current gnulib regex breaks in gawk

2021-04-17 Thread arnold
Thanks for the report.  What causes the interface to be marked
as GLIBC_PRIVATE?

I don't have the issue you report on either Ubuntu 18.04 or 20.04,
which are the main systems I develop on.  I will try to look into
this some.

> I wish gawk sources used some gnulib module import automation, e.g.
> gnulib-tool script, like many other gnulib users do, that would make
> updating gnulib modules a relatively straightforward task.

Sorry to disappoint you, but I prefer to keep my project such
that the support infrastructure doesn't overwhelm the actual
project code.

Arnold

"Dmitry V. Levin"  wrote:

> Hi Arnold,
>
> On Sun, Feb 07, 2021 at 11:36:29PM -0700, arn...@skeeve.com wrote:
> > arn...@skeeve.com wrote:
> > 
> > > I still have to have the following change, otherwise I get a linkage
> > > error on the gl_dynarray_* routines. :-(
> > >
> > > So, at least for the nonce, my copy and Gnulib's will be out of sync.
> > > Oh well.
> > 
> > So actually, I've managed to work around this issue too. So the files
> > are back in sync. Whew!
>
> I've just tried to build the latest commit gawk-5.1.0-260-gde598391 from
> gawk-5.1-stable branch.  Unfortunately, the result executable uses a
> private glibc interface:
> $ nm gawk |grep GLIBC_PRIVATE
>  U __libc_dynarray_resize@GLIBC_PRIVATE
> This makes it unusable at least in GNU/Linux distributions.
>
> Such an unfortunate result is due to very unusual method used to integrate
> dynarray module from gnulib into gawk:
> - unlike gnulib's lib/dynarray.h, gawk's support/dynarray.h is empty;
> - gnulib's lib/malloc/dynarray_resize.c is not imported into gawk's
>   support/malloc/ at all.
>
> I was able to make an ad-hoc fix by replacing gawk's support/dynarray.h
> with gnulib's lib/dynarray.h, importing gnulib's
> lib/malloc/dynarray_resize.c as support/malloc/dynarray_resize.c,
> and adding malloc/dynarray_resize.c to libsupport_a_SOURCES of
> support/Makefile.am, hope this helps.
>
> I wish gawk sources used some gnulib module import automation, e.g.
> gnulib-tool script, like many other gnulib users do, that would make
> updating gnulib modules a relatively straightforward task.
>
>
> -- 
> ldv



Re: current gnulib regex breaks in gawk

2021-02-08 Thread arnold
arn...@skeeve.com wrote:

> I still have to have the following change, otherwise I get a linkage
> error on the gl_dynarray_* routines. :-(
>
> So, at least for the nonce, my copy and Gnulib's will be out of sync.
> Oh well.

So actually, I've managed to work around this issue too. So the files
are back in sync. Whew!

Thanks,

Arnold



Re: current gnulib regex breaks in gawk

2021-02-08 Thread arnold
Hi Bruno.

> 1) It chokes on a missing definition of macro _GL_ATTRIBUTE_FALLTHROUGH.
>
> Can you add this piece of text to a common .h file?
>
> #if 201710L < __STDC_VERSION__
> # define _GL_ATTRIBUTE_FALLTHROUGH [[__fallthrough__]]
> #elif _GL_HAS_ATTRIBUTE (fallthrough)
> # define _GL_ATTRIBUTE_FALLTHROUGH __attribute__ ((__fallthrough__))
> #else
> # define _GL_ATTRIBUTE_FALLTHROUGH ((void) 0)
> #endif

Fixed, in a slightly different fashion.
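
For readers unfamiliar with this kind of macro, here is a minimal sketch of how a fallthrough annotation is typically defined and used. It is not the fix that went into gawk (the message does not show that); MY_FALLTHROUGH and sum_down_from are made-up names, and the macro covers only the GCC/clang attribute plus a no-op fallback.

```c
#include <assert.h>

/* MY_FALLTHROUGH is a hypothetical name for this sketch.  Use the
   compiler attribute when the compiler says it has one, otherwise
   expand to a harmless no-op expression.  */
#if defined __has_attribute
# if __has_attribute (fallthrough)
#  define MY_FALLTHROUGH __attribute__ ((__fallthrough__))
# endif
#endif
#ifndef MY_FALLTHROUGH
# define MY_FALLTHROUGH ((void) 0)
#endif

/* Return 1 + ... + N for N in 1..3 by deliberately falling through
   the cases; any other N yields 0.  */
int
sum_down_from (int n)
{
  int total = 0;

  assert (n >= 0);
  switch (n)
    {
    case 3:
      total += 3;
      MY_FALLTHROUGH;
    case 2:
      total += 2;
      MY_FALLTHROUGH;
    case 1:
      total += 1;
      break;
    default:
      break;
    }
  return total;
}
```

The annotation is written as a statement of its own, so the no-op fallback still compiles as valid C on compilers that know nothing about the attribute.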

> It seems you are not using gnulib's cdefs.h? You need both lib/libc-config.h
> and lib/cdefs.h.

That helped a lot.

I still have to have the following change, otherwise I get a linkage
error on the gl_dynarray_* routines. :-(

So, at least for the nonce, my copy and Gnulib's will be out of sync.
Oh well.

Thanks,

Arnold
-
--- /usr/local/src/Gnu/gnulib/lib/regex_internal.h  2021-02-08 07:51:05.352326126 +0200
+++ regex_internal.h  2021-02-08 08:06:15.938934924 +0200
@@ -32,7 +32,7 @@
 #include 
 #include 
 
-#ifndef _LIBC
+#if !defined(_LIBC) && !defined(GAWK)
 # include 
 #endif
 



Re: current gnulib regex breaks in gawk

2021-02-07 Thread arnold




current gnulib regex breaks in gawk

2021-02-07 Thread arnold
Hi.

I happened to notice that regex has been updated with new, er, stuff.

Dropping the code into gawk, including copying over attribute.h,
dynarray.h and malloc/*, doesn't work. Compilation chokes.

I have not yet investigated what the changes are, but I have to wonder
if the churn is really needed?

Running gnulib-tool on gawk isn't the direction I want to go, either...

Thanks,

Arnold



Re: dfa.c change - please revert

2020-07-24 Thread arnold
Thanks.

A few times a week I 'git pull' to see if anything has changed
that affects gawk.

Arnold

Bruno Haible  wrote:

> Hi Arnold,
>
> > Please revert this, as it breaks compilation in gawk.
>
> This patch should do it (keeping the optimized variant of the 3-way 
> comparison).
>
> Btw, how did you notice the breakage so rapidly? Are you scanning the commits
> or the mails, or do you have a continuous integration?
>
>
> 2020-07-24  Bruno Haible  
>
>   dfa: Revert breaking gawk.
>   Reported by Arnold Robbins .
>   * lib/dfa.c (compare): Don't reference the _GL_CMP macro.
>
> diff --git a/lib/dfa.c b/lib/dfa.c
> index 1d2d404..e79d882 100644
> --- a/lib/dfa.c
> +++ b/lib/dfa.c
> @@ -2466,7 +2466,7 @@ static int
>  compare (const void *a, const void *b)
>  {
>position const *p = a, *q = b;
> -  return _GL_CMP (p->index, q->index);
> +  return (p->index > q->index) - (p->index < q->index);
>  }
>  
>  static void



dfa.c change - please revert

2020-07-24 Thread Arnold Robbins
Hi.

| diff --git a/lib/dfa.c b/lib/dfa.c
| index dee7be861..1d2d40457 100644
| --- a/lib/dfa.c
| +++ b/lib/dfa.c
| @@ -2466,7 +2466,7 @@ static int
|  compare (const void *a, const void *b)
|  {
|position const *p = a, *q = b;
| -  return p->index < q->index ? -1 : p->index > q->index;
| +  return _GL_CMP (p->index, q->index);
|  }

Please revert this, as it breaks compilation in gawk.
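
For reference, the line in question is a three-way comparison for qsort(). Below is a minimal self-contained sketch of two ways to write it without any Gnulib macro. The struct pos type and the function names are made up for illustration; only the index member matters.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for the 'position' type in dfa.c.  */
struct pos
{
  ptrdiff_t index;
};

/* Conditional form: -1, 0, or 1.  */
int
compare_conditional (const void *a, const void *b)
{
  struct pos const *p = a, *q = b;
  return p->index < q->index ? -1 : p->index > q->index;
}

/* Difference-of-comparisons form: also -1, 0, or 1.  Unlike
   'p->index - q->index', it cannot overflow.  */
int
compare_branchless (const void *a, const void *b)
{
  struct pos const *p = a, *q = b;
  return (p->index > q->index) - (p->index < q->index);
}

/* Sort N positions in V by ascending index.  */
void
sort_positions (struct pos *v, size_t n)
{
  assert (v != NULL || n == 0);
  if (n > 1)
    qsort (v, n, sizeof *v, compare_branchless);
}
```

Both forms depend only on the language itself, so they compile in a tree that does not carry the Gnulib header defining _GL_CMP.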

Thanks,

Arnold



Re: dfa.c no longer usable if no 64-bit support

2020-02-09 Thread arnold
Just FYI, gawk's dfa.c is now in sync w/Gnulib's. 

There are still some problems on Vax/VMS. I suspect it's environmental
but will let you know if not.

Thanks!

Arnold

arn...@skeeve.com wrote:

> Paul,
>
> Thanks for this.  I will work on reducing the differences between
> what's in Gnulib and what's in gawk.
>
> Vax/VMS is dead as a commercial system, true. But it remains alive as
> a hobbyist system, especially as it's very easy to run in simulation
> under SIMH.
>
> Thanks!
>
> Arnold
>
> Paul Eggert  wrote:
>
> > On 1/29/20 7:34 AM, Bruno Haible wrote:
> > > I would say that it's not worth the effort - except for the person(s)
> > > who care a lot about Vax/VMS.
> >
> > Normally I'd agree, but if Arnold cares about VAX/VMS and if we want 
> > Gnulib dfa.c to match Gawk dfa.c, then in this particular case it makes 
> > some sense to support 32-bit-only platforms, as it's easy to revert the 
> > recent patch that made dfa.c assume 64-bit. So I installed the attached.
> >
> > However, I see some other parts of departure for Gawk dfa.c:
> >
> > * Gawk dfa.c/dfa.h does not use flexible array members or the 
> > portable-to-7th-edition-Unix substitute provided by Gnulib, so I suggest 
> > that Gawk import Gnulib lib/flexmember.h, and either "#define 
> > FLEXIBLE_ARRAY_MEMBER 1" in config.h or (better) import Gnulib 
> > m4/flexmember.m4.
> >
> > * Gawk dfa.c doesn't use isblank, but instead defines its own is_blank 
> > that is hard-coded to the C locale. Isn't [[:blank:]] supposed to be 
> > locale-dependent? Or are you assuming that space and tab are the only 
> > blank characters in all single-byte locales?
> >
> > * Gawk dfa.c includes mbsupport.h if __DJGPP__ is defined. I suggest 
> > moving this to Gawk config.h so that dfa.c need not worry about it.
> >
> > * Gawk dfa.c replaces "#include <stdint.h>" with:
> >
> > #ifndef VMS
> > #include <stdint.h>
> > #else
> > #define SIZE_MAX __INT32_MAX
> > #define PTRDIFF_MAX __INT32_MAX
> > #endif
> >
> > I suppose we could add something like this to Gnulib dfa.c but it's a 
> > bit ugly; is there a cleaner way to do it? Perhaps Gawk could supply its 
> > own little substitute stdint.h on VMS. (Gnulib does this too but I 
> > assume Gnulib's stdint.h is too heavyweight for Gawk.)



Re: dfa.c no longer usable if no 64-bit support

2020-01-30 Thread arnold
Paul,

Thanks for this.  I will work on reducing the differences between
what's in Gnulib and what's in gawk.

Vax/VMS is dead as a commercial system, true. But it remains alive as
a hobbyist system, especially as it's very easy to run in simulation
under SIMH.

Thanks!

Arnold

Paul Eggert  wrote:

> On 1/29/20 7:34 AM, Bruno Haible wrote:
> > I would say that it's not worth the effort - except for the person(s)
> > who care a lot about Vax/VMS.
>
> Normally I'd agree, but if Arnold cares about VAX/VMS and if we want 
> Gnulib dfa.c to match Gawk dfa.c, then in this particular case it makes 
> some sense to support 32-bit-only platforms, as it's easy to revert the 
> recent patch that made dfa.c assume 64-bit. So I installed the attached.
>
> However, I see some other parts of departure for Gawk dfa.c:
>
> * Gawk dfa.c/dfa.h does not use flexible array members or the 
> portable-to-7th-edition-Unix substitute provided by Gnulib, so I suggest 
> that Gawk import Gnulib lib/flexmember.h, and either "#define 
> FLEXIBLE_ARRAY_MEMBER 1" in config.h or (better) import Gnulib 
> m4/flexmember.m4.
>
> * Gawk dfa.c doesn't use isblank, but instead defines its own is_blank 
> that is hard-coded to the C locale. Isn't [[:blank:]] supposed to be 
> locale-dependent? Or are you assuming that space and tab are the only 
> blank characters in all single-byte locales?
>
> * Gawk dfa.c includes mbsupport.h if __DJGPP__ is defined. I suggest 
> moving this to Gawk config.h so that dfa.c need not worry about it.
>
> * Gawk dfa.c replaces "#include <stdint.h>" with:
>
> #ifndef VMS
> #include <stdint.h>
> #else
> #define SIZE_MAX __INT32_MAX
> #define PTRDIFF_MAX __INT32_MAX
> #endif
>
> I suppose we could add something like this to Gnulib dfa.c but it's a 
> bit ugly; is there a cleaner way to do it? Perhaps Gawk could supply its 
> own little substitute stdint.h on VMS. (Gnulib does this too but I 
> assume Gnulib's stdint.h is too heavyweight for Gawk.)
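
A "little substitute" of the kind suggested above might look roughly like the following. This is only a sketch under stated assumptions: it assumes ptrdiff_t is as wide as size_t, it is not what gawk ships, and checked_array_bytes is a made-up helper that exists only to show the limits in use.

```c
#include <assert.h>
#include <stddef.h>

/* Fallback limits for a compiler that lacks <stdint.h>.  The guards
   make these definitions harmless where the real ones exist.  */
#ifndef SIZE_MAX
# define SIZE_MAX ((size_t) -1)
#endif
#ifndef PTRDIFF_MAX
/* Assumption: ptrdiff_t has the same width as size_t.  */
# define PTRDIFF_MAX ((ptrdiff_t) (SIZE_MAX / 2))
#endif

/* Hypothetical helper: store N * SIZE in *OUT and return 1, or return 0
   without touching *OUT if the product would not fit in size_t.  */
int
checked_array_bytes (size_t n, size_t size, size_t *out)
{
  assert (out != NULL);
  if (size != 0 && n > SIZE_MAX / size)
    return 0;
  *out = n * size;
  return 1;
}
```

Because these fallbacks contain casts, they work in ordinary expressions but not in #if directives, where the preprocessor does not know about types.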



dfa.c no longer usable if no 64-bit support

2020-01-29 Thread arnold
Hi.

The gentleman who maintains the gawk port for VMS reports that he
can get dfa.c to compile on Vax/VMS, but that he gets failures when
trying to use it to compile regular expressions.

The Vax/VMS C compiler does not support 64 bit integers at all
(unlike GCC on 32-bit x86, for example).

This may not be a blocker, but even if not, disabling use of dfa.c for
regular expression matching means that gawk will run slower on that
system.

Can dfa.c be made 32-bit compatible in a happy fashion?

Thanks,

Arnold



Re: [PATCH] regex: port to Gawk on nonstandard platforms

2020-01-27 Thread arnold
Paul Eggert  wrote:

> On 1/26/20 1:42 AM, arn...@skeeve.com wrote:
> > And then in places in regcomp.c BITSET_WORD_BITS is tested in
> > several #if/#elif statements.
>
> Ouch, I hadn't noticed that. It's exercised only on non-GCC platforms 
> that don't support INT_WIDTH etc., which is why I didn't see it in my 
> testing. I installed the first attached patch, which should fix it. 
> Thanks for reporting it.
>
> While I was at it I also installed the second attached patch, since the 
> regex code no longer depends on the limits-h module. This second patch 
> shouldn't affect Awk.

Much thanks for the fix. I have pulled it into gawk and we'll see
what my testers report.

Thanks,

Arnold



Re: [PATCH] regex: port to Gawk on nonstandard platforms

2020-01-26 Thread arnold
Hi, Paul.

> diff --git a/lib/regex_internal.h b/lib/regex_internal.h
> index 13e15e21e..6d436fde1 100644
> --- a/lib/regex_internal.h
> +++ b/lib/regex_internal.h
> @@ -141,6 +141,9 @@
>  #ifndef SSIZE_MAX
>  # define SSIZE_MAX ((ssize_t) (SIZE_MAX / 2))
>  #endif
> +#ifndef ULONG_WIDTH
> +# define ULONG_WIDTH (CHAR_BIT * sizeof (unsigned long int))
> +#endif
>  
>  /* The type of indexes into strings.  This is signed, not size_t,
> since the API requires indexes to fit in regoff_t anyway, and using

This change is problematic.  Further on in regex_internal.h we
have

#define BITSET_WORD_BITS ULONG_WIDTH

And then in places in regcomp.c BITSET_WORD_BITS is tested in
several #if/#elif statements.

Thus on systems that don't provide ULONG_WIDTH, we end up with
expressions in #if/#elif that want to use sizeof.

Needless to say, this fails spectacularly. :-(

Can you revert to the original code or to something else that
will compile on systems where ULONG_WIDTH is not defined?
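
To make the failure mode concrete: the preprocessor evaluates #if expressions before types exist, so sizeof cannot appear there. One way around it, sketched below, is to derive the width from ULONG_MAX using integer constants only. This is an illustration, not the patch that was installed in Gnulib. MY_ULONG_WIDTH and the other names are made up, and the sketch assumes unsigned long is exactly 32 or 64 bits with no padding bits.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical replacement that is usable inside #if: only ULONG_MAX
   and integer constants appear, no sizeof and no casts.  */
#if ULONG_MAX == 0xffffffff
# define MY_ULONG_WIDTH 32
#elif (ULONG_MAX >> 31 >> 31 >> 1) == 1
# define MY_ULONG_WIDTH 64
#else
# error "unsigned long is neither 32 nor 64 bits wide"
#endif

/* Because MY_ULONG_WIDTH is a plain integer constant, a test like
   this one is valid for the preprocessor.  */
#if MY_ULONG_WIDTH == 32
# define WORDS_FOR_256_BITS 8
#else
# define WORDS_FOR_256_BITS 4
#endif

/* Return how many unsigned long words are needed to hold 256 bits.  */
int
words_for_256_bits (void)
{
  /* Run-time cross-check against the sizeof-based definition.  */
  assert (MY_ULONG_WIDTH == (int) (CHAR_BIT * sizeof (unsigned long)));
  return WORDS_FOR_256_BITS;
}
```

The shift is split into steps of at most 31 bits so that each step stays well defined even where the preprocessor's widest integer type is small.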

Much thanks,

Arnold



Re: regex.c needs ULONG_WIDTH, not in standard limits.h

2020-01-24 Thread arnold
Paul Eggert  wrote:

> Actually ULONG_WIDTH is part of C standard limits.h. However, not every
> platform conforms to the current standard. The Gnulib stdlib-h module
> works around this portability issue, but unfortunately Awk is not using
> stdlib-h so I installed the attached.

THANK YOU for the quick turnaround time on the fix.  I appreciate it.

Arnold



regex.c needs ULONG_WIDTH, not in standard limits.h

2020-01-24 Thread Arnold Robbins
Hi.

I just pulled a new copy of regex.c and tried to drop it into gawk. It
fails with ULONG_WIDTH undeclared.

It seems to be in Gnulib's limits.h replacement, but I'm not using that,
I rely on the standard limits.h.

Can this be fixed, or the change that uses it be reverted, please?

Thanks,

Arnold



Re: bug#34951: [PATCH] grep: a kwset matcher not work in a grep matcher

2019-12-20 Thread arnold
Paul Eggert  wrote:

> On 12/16/19 2:12 AM, arn...@skeeve.com wrote:
> > What about
> > 
> > typedef ptrdiff_t dfa_size_t;
>
> That declaration would imply that the type is specific to DFAs. However, the
> type is used (with exactly the same meaning) in a lot of other places. This is
> why I used the more-generic name "idx_t" internally dfa.c.

I give up. Leave it ptrdiff_t.  I may submit comment changes for dfa.h
later.

Arnold



Re: dfa.c badly broken when dropped into gawk

2019-12-18 Thread arnold
Paul Eggert  wrote:

> On 12/15/19 10:43 AM, Arnold Robbins wrote:
> > To reproduce:
> > 
> > 1. Checkout the gawk repo
> > 2. Copy gnulib/lib/dfa.[ch] into gawk/support/.
> > 3. Apply the minimal patch below
>
> I looked into that, and the problem was not in Gnulib; it was that your
> minimal patch's dfasyntax didn't clear its destination properly. (Gawk
> master dfa.c diverged from Gnulib in having its dfaalloc use xzalloc
> rather than xmalloc, and the minimal patch didn't capture that
> divergence.)

Ooops. But somehow when I copied the changes from dfa.c into my
copy, it was still failing. I sent the minimal patch to make
it easier for you to debug.
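
The xzalloc-versus-xmalloc divergence mentioned above matters whenever a routine fills in only some members of a freshly allocated object. A miniature sketch of the pattern follows. The struct and function names are made up and much simpler than the real struct dfa, and it assumes (as on mainstream platforms) that all-bits-zero memory reads back as false, 0, and a null pointer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical miniature of an opaque matcher object.  */
struct mini_dfa
{
  bool fast;
  bool simple_locale;
  int syntax_bits;
  void *cached_state;   /* Must start out null.  */
};

/* Allocate a zero-filled object, in the spirit of xzalloc: every
   member starts in a known state, and failure aborts.  */
struct mini_dfa *
mini_alloc (void)
{
  struct mini_dfa *d = calloc (1, sizeof *d);
  if (d == NULL)
    abort ();
  return d;
}

/* Copy only the syntax-related members.  Everything else in TO is
   left untouched, so it must already hold valid values.  */
void
mini_copy_syntax (struct mini_dfa *to, struct mini_dfa const *from)
{
  assert (to != NULL && from != NULL);
  to->fast = from->fast;
  to->simple_locale = from->simple_locale;
  to->syntax_bits = from->syntax_bits;
}
```

Had mini_alloc used plain malloc(), cached_state would be indeterminate after mini_copy_syntax(), which is the kind of bug that shows up as a test suite blowing up.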

> It seems to be error-prone that we're continuing to maintain a separate
> copy of dfa.c for Gawk, so I suggest we unify the two copies. I attempted
> to do that by installing the attached patches into Gnulib. You should
> now be able to use Gnulib dfa.c as follows:
>
> 1. Checkout the gawk repo.
> 2. Copy gnulib/lib/dfa.[ch] and gnulib/lib/localeinfo.[ch] into gawk/support/.
>
> Then build as usual. This works for me on GNU/Linux.

Thanks.  I haven't pushed yet, but my copy of dfa.c is now MUCH
closer to yours and still working ok under Linux.  I have some things
to check out with my porting team. It may be possible to move
my copy even closer to the one in gnulib.  If we can get to the point
where we're only using a single copy, that'd be really good.

I appreciate the help.

Arnold



Re: dfa MT-safe?

2019-12-16 Thread arnold
Paul Eggert  wrote:

> On 12/15/19 4:43 AM, arn...@skeeve.com wrote:
> > On the assumption that setlocale is the only blocker, I would rather
> > see an additional `char *locale_name' parameter added to dfasyntax.
>
> Thanks, this is a good suggestion. Running with it, we can improve it further
> by putting this new flag into struct localeinfo (so no need for a new
> dfasyntax parameter), and also I think I can initialize it without using
> setlocale (so no need to worry about setlocale's lack of thread-safety). I
> installed the attached patches and they work with 'grep'. This way, we
> shouldn't need to pull in any more Gnulib code into Gawk.
> more Gnulib code into Gawk.
>
> I plan to take a look at the more serious crashes soon.

I will manually apply these changes and test, but they look
reasonable to me.

I won't be able to merge from gnulib until the dfa crashes are
dealt with, though.  IMHO those are very high priority.

Thanks,

Arnold



dfa.c badly broken when dropped into gawk

2019-12-15 Thread Arnold Robbins
Hi.

The current dfa.[ch] are badly broken when dropped into gawk.  To
reproduce:

1. Checkout the gawk repo
2. Copy gnulib/lib/dfa.[ch] into gawk/support/.
3. Apply the minimal patch below

Then the usual `./bootstrap.sh && ./configure && make -j && make check'.
You'll see lots of the tests blowing up spectacularly.

Please repair things.

Thanks,

Arnold

diff --git a/lib/dfa.c b/lib/dfa.c
index 8c88c9d..818f58f 100644
--- a/lib/dfa.c
+++ b/lib/dfa.c
@@ -890,6 +890,23 @@ char_context (struct dfa const *dfa, unsigned char c)
   return CTX_NONE;
 }
 
+/* Copy the syntax settings from one dfa instance to another.
+   Saves considerable computation time if compiling many regular expressions
+   based on the same setting.  */
+void
+dfacopysyntax (struct dfa *to, const struct dfa *from)
+{
+  to->dfaexec = from->dfaexec;
+  to->simple_locale = from->simple_locale;
+  to->localeinfo = from->localeinfo;
+
+  to->fast = from->fast;
+
+  to->canychar = from->canychar;
+  to->lex.cur_mb_len = from->lex.cur_mb_len;
+  to->syntax = from->syntax;
+}
+
 /* Set a bit in the charclass for the given wchar_t.  Do nothing if WC
is represented by a multi-byte sequence.  Even for MB_CUR_MAX == 1,
this may happen when folding case in weird Turkish locales where
diff --git a/lib/dfa.h b/lib/dfa.h
index 96c3bf1..c6dc786 100644
--- a/lib/dfa.h
+++ b/lib/dfa.h
@@ -42,7 +42,7 @@ struct dfa;
 /* Allocate a struct dfa.  The struct dfa is completely opaque.
The returned pointer should be passed directly to free() after
calling dfafree() on it. */
-extern struct dfa *dfaalloc (void) _GL_ATTRIBUTE_MALLOC;
+extern struct dfa *dfaalloc (void) /* _GL_ATTRIBUTE_MALLOC */ ;
 
 /* DFA options that can be ORed together, for dfasyntax's 4th arg.  */
 enum
@@ -105,6 +105,11 @@ extern struct dfa *dfasuperset (struct dfa const *d) _GL_ATTRIBUTE_PURE;
 /* The DFA is likely to be fast.  */
 extern bool dfaisfast (struct dfa const *) _GL_ATTRIBUTE_PURE;
 
+/* Copy the syntax settings from one dfa instance to another.
+   Saves considerable computation time if compiling many regular expressions
+   based on the same setting.  */
+extern void dfacopysyntax (struct dfa *to, const struct dfa *from);
+
 /* Free the storage held by the components of a struct dfa. */
 extern void dfafree (struct dfa *);
 



Re: dfa MT-safe?

2019-12-15 Thread arnold
Hi.

Bruno Haible  wrote:

> > In any case, gawk's use of it is (and will remain) single-threaded.
> > It'd be nice if your fix did not pull in more libraries, like libpthread
> > or whatever, since that would considerably complicate things for me,
> > for no actual gain w.r.t. gawk.
>
> If you add these two lines to configure.ac:
>   gl_cv_func_setlocale_null_all_mtsafe=yes
>   gl_cv_func_setlocale_null_one_mtsafe=yes
> no additional libraries will be needed.

How? I don't use gnulib in gawk.

> > I'm curious what is the use case for multithreaded dfa?
>
> One could speed up
>   grep -r PATTERN DIRECTORY_WITH_MANY_FILES
> by a large factor (probably 4x or 5x, on a CPU with 8 threads).
> This would be done by modifying 'grep' to process each file in a
> separate thread. The kernel can feed the data of these files to 'grep'
> in parallel. Only the output phase needs to serialize things.

I suspect that exactly because of the output phase you won't see
such a huge speedup in practice, but it's worth a shot.

On the assumption that setlocale is the only blocker, I would rather
see an additional `char *locale_name' parameter added to dfa_syntax.
That way the caller can get the value and pass it in, and the
dfa code becomes mt-safe at next to no cost.

Thanks,

Arnold
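
The suggestion above, having the caller pass the locale name in instead
of letting the library query it, might look roughly like this. It is a
sketch under assumptions: struct syntax_sketch and syntax_init_sketch
are hypothetical names, and the "C"/"POSIX" comparison only illustrates
the kind of decision the library would make from the name.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical settings derived from the locale name.  */
struct syntax_sketch
{
  bool simple_locale;
};

/* The caller supplies LOCALE_NAME (for example, a value it obtained
   once from setlocale (LC_ALL, NULL) in the main thread), so this
   function touches no global locale state itself.  */
static void
syntax_init_sketch (struct syntax_sketch *s, const char *locale_name)
{
  s->simple_locale = (strcmp (locale_name, "C") == 0
                      || strcmp (locale_name, "POSIX") == 0);
}
```

With this shape the only call that reads process-wide locale state stays
in the application, which can make it before any threads are started.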



Re: dfa MT-safe?

2019-12-15 Thread arnold


On 12/14/19 3:43 AM, Bruno Haible wrote:
> > Is the 'dfa' module supposed to be multithread-safe?

Paul Eggert  wrote:
> Yes it is supposed to be.

News to me.

In any case, gawk's use of it is (and will remain) single-threaded.
It'd be nice if your fix did not pull in more libraries, like libpthread
or whatever, since that would considerably complicate things for me,
for no actual gain w.r.t. gawk.

I'm curious what is the use case for multithreaded dfa?

Thanks,

Arnold



Re: bug#34951: [PATCH] grep: a kwset matcher not work in a grep matcher

2019-12-15 Thread arnold
OK. I skimmed the links.  But why not write the code to say what
we mean?  For example:

#include <stdint.h>
typedef int64_t dfa_size_t;

extern void dfaparse (char const *, dfa_size_t, struct dfa *);
extern void dfacomp (char const *, dfa_size_t, struct dfa *, bool);
extern char *dfaexec (struct dfa *d, char const *begin, char *end,
  bool allow_nl, dfa_size_t *count, bool *backref);

Using ptrdiff_t directly simply because it is defined to be the
largest signed integer remains ugly (and Paul has already moved to
a typedef in the implementation.)

int64_t is just as standard as ptrdiff_t and just as clear.

Thanks,

Arnold

Paul Eggert  wrote:

> >> I see that Paul has made the change to the API over my objections.
>
> I made the change while responding to Bruno's objections, but before 
> seeing yours. Ooops. Sorry about that. However, I hope the followup 
> emails have addressed your comments, at least to some extent.
>
> > Paul, can you point to a link that lists the benefits/tradeoffs? If I
> > had such a link handy, I would have provided it here.
>
> Avoiding unsigned types for indexes and sizes seems to be a growing 
> movement. Admittedly there are arguments for unsigned, but these 
> arguments are getting weaker with time. Here are a couple of links, the 
> first for C and the second for C++:
>
> https://www.gnu.org/software/emacs/manual/html_node/elisp/C-Integer-Types.html
>
> http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2019/p1428r0.pdf
>
> As for ssize_t vs ptrdiff_t: ssize_t is less central to the C language 
> (ptrdiff_t is in the C standard but ssize_t is not). And ssize_t is less 
> convenient: for example, there's no simple, portable way to printf an 
> ssize_t value, as there is with "%td" and ptrdiff_t. So there are 
> technical reasons for preferring ptrdiff_t to ssize_t for this sort of 
> thing (even though "ssize_t" is a shorter and better name). Which is why 
> Emacs, other parts of Gnulib, and other Gnu applications have used 
> ptrdiff_t instead of ssize_t for this sort of thing.
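
The printf point is easy to see in code. The sketch below formats a
ptrdiff_t with "%td" and an int64_t with the PRId64 macro from
<inttypes.h>; both are standard C99. The helper names are made up for
the example. ssize_t has no conversion specifier of its own, so code
that prints one typically casts it to another type first.

```c
#include <inttypes.h>
#include <stddef.h>
#include <stdio.h>

/* Format a ptrdiff_t using the C99 't' length modifier.  */
static int
format_ptrdiff (char *buf, size_t buflen, ptrdiff_t n)
{
  return snprintf (buf, buflen, "%td", n);
}

/* Format an int64_t using the PRId64 macro from <inttypes.h>.  */
static int
format_int64 (char *buf, size_t buflen, int64_t n)
{
  return snprintf (buf, buflen, "%" PRId64, n);
}
```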



Re: bug#34951: [PATCH] grep: a kwset matcher not work in a grep matcher

2019-12-13 Thread arnold
arn...@skeeve.com wrote:

> But I really don't want ptrdiff_t in the API.

I see that Paul has made the change to the API over my objections.

Jim --- do you have an opinion on this?

Thanks,

Arnold



Re: bug#34951: [PATCH] grep: a kwset matcher not work in a grep matcher

2019-12-13 Thread arnold
Hi Paul.

Paul Eggert  wrote:

> On 12/11/19 11:31 PM, arn...@skeeve.com wrote:
>
> > 1,$s/ptrdiff_t/ssize_t/g
>
> ssize_t can be narrower than ptrdiff_t, so it's not a good type to use 
> for this notion. Its original motivation was "the type that 'read' 
> returns", and on systems where 'read' can return at most INT_MAX, 
> ssize_t can be 32 bits even if size_t is 64 bits.

In practice, how many systems are there where ssize_t is 32 bits and size_t
is 64? If that number is <= 5 then I wouldn't worry about using ssize_t.

In any case, as I said, I can live with ptrdiff_t in the implementation,
even though I don't like it that much.  (A nice block comment at the
top of dfa.c explaining why ptrdiff_t is used would be appropriate.)

But I really don't want ptrdiff_t in the API.

Thanks,

Arnold
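
Whether ssize_t really is narrower than ptrdiff_t on a given system can
be checked directly. A minimal sketch, assuming a POSIX system where
<sys/types.h> provides ssize_t; the function names are invented for the
example.

```c
#include <limits.h>
#include <stddef.h>
#include <sys/types.h>          /* ssize_t (POSIX, not ISO C) */

/* Nonzero if ssize_t occupies fewer bytes than ptrdiff_t here.  */
static int
ssize_is_narrower (void)
{
  return sizeof (ssize_t) < sizeof (ptrdiff_t);
}

/* Storage width of ssize_t in bits on this platform.  */
static int
ssize_bits (void)
{
  return (int) (sizeof (ssize_t) * CHAR_BIT);
}
```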



Re: bug#34951: [PATCH] grep: a kwset matcher not work in a grep matcher

2019-12-11 Thread arnold
arn...@skeeve.com wrote:

> Other than this, I think internally too, I'd prefer that you
>
>   1,$s/ptrdiff_t/ssize_t/g

I did this, just to see. gawk passes its test suite, both in
64- and 32-bit mode.

FWIW.

Thanks,

Arnold



Re: bug#34951: [PATCH] grep: a kwset matcher not work in a grep matcher

2019-12-11 Thread arnold
Hi Paul.

Paul Eggert  wrote:

> https://lists.gnu.org/r/bug-gnulib/2019-12/msg00058.html
> https://lists.gnu.org/r/bug-gnulib/2019-12/msg00059.html

Looking at this:

| @@ -1733,11 +1733,11 @@ add_utf8_anychar (struct dfa *dfa)
|  /* f0-f7: 4-byte sequence.  */
|  CHARCLASS_INIT (0, 0, 0, 0, 0, 0, 0, 0xff)
|};
| -  const unsigned int n = sizeof (utf8_classes) / sizeof (utf8_classes[0]);
| +  int n = sizeof utf8_classes / sizeof *utf8_classes;

Why are you throwing away const here?

Other than this, I think internally too, I'd prefer that you

1,$s/ptrdiff_t/ssize_t/g

(and fix any printf calls).  It just feels like an abuse of
the type, which is for representing differences between pointers,
and not regular large signed integers.

However, I'm not going to insist about it internally, whereas
I would object strongly to the use of ptrdiff_t in the API.

Thanks!

Arnold
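
For reference, the two spellings of the element-count idiom in the hunk
above compute the same value, and either can be stored in a const
object. A small sketch; the table contents are illustrative
placeholders, not the real utf8_classes.

```c
#include <stddef.h>

/* Placeholder table: first and last lead byte of each sequence length.  */
static const unsigned char classes_sketch[][2] = {
  { 0x00, 0x7f },
  { 0xc2, 0xdf },
  { 0xe0, 0xef },
  { 0xf0, 0xf7 }
};

/* Terse form: sizeof applied to expressions needs no parentheses.  */
static size_t
class_count (void)
{
  const size_t n = sizeof classes_sketch / sizeof *classes_sketch;
  return n;
}

/* Verbose form: the same computation written with [0] and parentheses.  */
static size_t
class_count_verbose (void)
{
  const size_t n = sizeof (classes_sketch) / sizeof (classes_sketch[0]);
  return n;
}
```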



Re: bug#34951: [PATCH] grep: a kwset matcher not work in a grep matcher

2019-12-11 Thread arnold
Hi Paul.

Paul Eggert  wrote:

> On 3/22/19 7:49 PM, Norihiro Tanaka wrote:
> > Missing a patch for dfa.  Re-send correct patch file.
>
> Thanks, I installed the DFA-relevant parts of your proposed fix into 
> Gnulib. (The grep parts still need doing.) I also installed the attached 
> commentary followup.
>
> While I was at it I installed a patch to fix an unlikely integer 
> overflow that I noticed while reviewing your fix. I also installed some 
> internal changes to prefer signed to unsigned integers for indexes, as 
> this should make future integer overflows easier to catch. See:
>
> https://lists.gnu.org/r/bug-gnulib/2019-12/msg00058.html
> https://lists.gnu.org/r/bug-gnulib/2019-12/msg00059.html

I am reviewing these. In general using signed integers internally
looks OK to me.

> I'd also like to change dfa.h's API to prefer ptrdiff_t to size_t, for 
> the same integer-overflow reason. This would be a (minor) API change so 
> I thought I'd ask first. Any objections?

Yes. I object. Strongly.

We're passing length and count values and those are supposed
to be size_t.  If you REALLY want signed values, then I could
live with ssize_t (as returned by read(2), for example), but I
would find ptrdiff_t to be ugly and unintuitive.

> PS. Arnold, the above discusses all the changes I know about for dfa.c 
> and dfa.h. The proposed API change (size_t->ptrdiff_t) could be 
> installed either before or after the next Gawk release.

Thanks. I'm skimming the other changes now.

Arnold
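
The overflow argument for signed indexes comes down to what happens at
the edges. A minimal sketch with made-up helper names: stepping back
from index 0 wraps silently with size_t, while with ptrdiff_t it gives
-1, which an ordinary range check rejects.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Unsigned: 0 - 1 wraps to SIZE_MAX, a huge but valid-looking value.  */
static size_t
prev_index_unsigned (size_t i)
{
  return i - 1;
}

/* Signed: 0 - 1 is -1, which is visibly out of range.  */
static ptrdiff_t
prev_index_signed (ptrdiff_t i)
{
  return i - 1;
}

/* Range check that catches a negative index.  */
static bool
index_ok (ptrdiff_t i, ptrdiff_t len)
{
  return 0 <= i && i < len;
}
```

Unsigned wraparound is well defined in C, so tools have no grounds to
flag it; signed overflow is undefined, which lets checkers such as
-fsanitize=undefined report it.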



Re: Adapting changes for MSYS2?

2019-11-16 Thread arnold
Bruno Haible  wrote:

> What I could see
>   - from https://github.com/msys2/msys2/wiki/How-does-MSYS2-differ-from-Cygwin
>   - from analysis of a couple of gnulib test failures
> is that MSYS2, compared to Cygwin, has problems in the area of file
> permissions, symbolic links, and signals (at least).
>
> Since anyone can compile for mingw32, mingw64, MSVC 32-bit, and MSVC 64-bit
> using Cygwin [1], I don't see the point of investing effort into making GNU
> packages build fine on MSYS2.

OTOH, if it's not a lot of work to upstream a few patches, why not?
The more environments that I can easily support out of the box, the
better for my users.

My two cents,

Arnold



Re: Adapting changes for MSYS2?

2019-11-10 Thread arnold
Hi Paul.

Much thanks! I have pulled in the changes to gawk and pushed to git.
I await news from the original reporter (Hi Peter!) as to whether that
does the trick on msys.

W.R.T. your question about config.guess, perhaps Alexey can answer.

Thanks!

Arnold

Paul Eggert  wrote:

> On 11/9/19 10:40 AM, arn...@skeeve.com wrote:
>
> > A gawk user recently called my attention to:
> > 
> > https://github.com/msys2/MSYS2-packages/tree/master/gawk
> > 
> >> The patch file there is named gawk-4.2.1-msysize.patch.  The patches mainly
> >> seem to add an "msys*" option in several build scripts just after the
> >> "cygwin*" system identity alternatives.
> > 
> > I have adapted the patches for gawk's test/Makefile.am and will be
> > pushing shortly.
> > 
> > Can we get the changes for compile, config.guess, config.rpath, and
> > ar-lib integrated directly into GNULIB, so that I can then pull them
> > from upstream?
>
> I installed most of those changes to Gnulib. The exception is the proposed
> change to config.guess, which is upstream from Gnulib and is part of GNU
> config, so I'll CC: this message to config-patc...@gnu.org.
>
> As far as I can see, the proposed config.guess change has no effect, as the
> existing config.guess treats MSYS the same on all architectures in an earlier
> branch of that big 'case' statement. config.guess has done so since this
> commit in 2014, which specifically caused config.guess to treat MSYS the same
> on all machines, not just i* machines:
>
> https://git.savannah.gnu.org/cgit/config.git/commit/?id=f4ebd3ed097771a729b68e688236aea665e7c1f3
>
> so I'm puzzled as to why the config.guess change would be needed even if it
> were effective.



Adapting changes for MSYS2?

2019-11-09 Thread arnold
Hi.

A gawk user recently called my attention to:

https://github.com/msys2/MSYS2-packages/tree/master/gawk

> The patch file there is named gawk-4.2.1-msysize.patch.  The patches mainly
> seem to add an "msys*" option in several build scripts just after the
> "cygwin*" system identity alternatives.

I have adapted the patches for gawk's test/Makefile.am and will be
pushing shortly.

Can we get the changes for compile, config.guess, config.rpath, and
ar-lib integrated directly into GNULIB, so that I can then pull them
from upstream?

They may be small enough that you don't need paperwork.

I'm cc-ing Alexey Pawlow, the author of the changes.

Thanks!

Arnold



Re: bug#34951: [PATCH] grep: a kwset matcher not work in a grep matcher

2019-03-29 Thread arnold
Hi.

Norihiro Tanaka  wrote:

> Missing a patch for dfa.  Re-send correct patch file.

Paul - is this going to be merged into GNULIB? If so, I'll put it into
gawk now; I want to make a release soon.

Thanks,

Arnold



Re: [Grep-devel] Changed behavior in sed 4.6

2018-12-21 Thread arnold
Jim Meyering  wrote:

> On Thu, Dec 20, 2018 at 9:13 PM  wrote:
> > > I expect to revert the mentioned gnulib commits, and then to
> > > make new releases of both grep and sed.
> >
> > Please add a test case ...
>
> You should know me better by now.
> I didn't mention the required NEWS update either.

Indeed, I knew I was requesting the obvious, but sometimes people forget ...

In any case, much thanks!

Arnold



Re: [Grep-devel] Changed behavior in sed 4.6

2018-12-20 Thread arnold
Jim Meyering  wrote:

> On Thu, Dec 20, 2018 at 2:49 PM Jan Palus  wrote:
> > I've just happened to notice a difference in behavior between sed 4.5
> > and 4.6 when building VirtualBox. It seems to be locale dependent:
> >
> > $ echo 'foo(bar '|LC_ALL=C sed -e 's/\([^*] *\)\bbar\b/\1foo */g'
> > foo(bar
> >
> > $ echo 'foo(bar '|LC_ALL=C.UTF-8 sed -e 's/\([^*] *\)\bbar\b/\1foo */g'
> > foo(foo *
> >
> > In 4.5 both results are the same -- same as the second output with
> > LC_ALL=C.UTF-8.
>
> Thanks a lot for that report.
> This is indeed a regression. It also affects the just-released
> grep-3.2, since the source is in a file used by both: gnulib's dfa.c.
> I tracked it down to this gnulib/lib/dfa.c commit: v0.1-2213-gae4b73e28
> To back that out, I must first revert part of this fix-up patch:
> v0.1-2281-g95cd86dd7
>
> Here's a demonstrator with grep: (it should match, but with 3.2, does not):
>
> $ echo 123-x|LC_ALL=C grep '.\bx'
> $
>
> To avoid the failure, one can:
> - specify -P (for PCRE, a different matcher), or
> - don't use the C locale, but rather use a multi-byte locale like the
> one you chose, which inhibits use of the DFA matcher, because \b's
> definition requires multi-byte aware machinery not present in the DFA
> matcher.
>
> I expect to revert the mentioned gnulib commits, and then to
> make new releases of both grep and sed.

Please add a test case ...

Thanks,

Arnold



Re: [Grep-devel] handling of non-BMP characters

2018-12-19 Thread arnold
Paul Eggert  wrote:

> On 12/18/18 11:51 PM, Bruno Haible wrote:
> >2) change those gnulib modules that don't behave well with beyond-BMP
> >   characters on Windows and AIX to use char32_t instead of wchar_t.
>
> This sounds good to me. I assume the regexp code will need to be changed 
> accordingly, and if so I can volunteer to coordinate that with glibc 
> (we're close to a freeze in Glibc, but we can install into Gnulib first).
>

I assume you'll make parallel changes in dfa.c at the same time?

Thanks,

Arnold



Re: __builtin_expect used in regex

2018-10-17 Thread arnold
Hi.

Paul Eggert  wrote:

> On 10/1/18 11:31 AM, arn...@skeeve.com wrote:
> > Those changes look really excessive to me. I prefer to not have to
> > keep including more and more files from gnulib just to compile regex
> > or dfa.
>
> Sorry, I didn't read your message (I had misfiled it) until just now, 

Oops.

> after I propagated the patch into glibc. So now I will have to go into 
> repair mode

Is the patch also in gnulib?

> I would rather minimize the difference from glibc. Is this the only 
> place where the Gawk regex code departs from the Gnulib copy? If so, 
> let's try to come up with a way to keep the source identical, if only by 
> using "#ifdef _LIBC" or "#ifdef GAWK" or whatever.

In my custom.h, I have added

/* This keeps regex happy on non-GCC compilers */
#ifndef __GNUC__
#ifndef __builtin_expect
#define __builtin_expect(expr, val) (expr)
#endif
#endif

I did not actually change the regex files.

> > (As a side point, does all the __builtin_expect / __glibc_unlikely
> > stuff *really* make that much difference?  It sure clutters up
> > the code unmercifully.)
>
> I agree. I don't think they make much performance difference nowadays. I 
> plan to time them and see if we're right; if so, let's get rid of them 
> (in glibc regex, Gnulib, and in Gawk).

So, let's wait until the results of all this. Once you update regex
in Gnulib I will sync with it.

Thanks,

Arnold



Re: __builtin_expect used in regex

2018-10-01 Thread arnold
Hi Paul,

Paul Eggert  wrote:

> Thanks for reporting the problem. Please try the attached patch against
> Gawk master. The ChangeLog entry is a bit optimistic, as it is assuming
> that the patch works (and if it works, I would like to install the
> relevant changes into Gnulib and into glibc, so that at that point the
> ChangeLog entry will be correct).

Those changes look really excessive to me.  I prefer to not have to
keep including more and more files from gnulib just to compile regex
or dfa.

(As a side point, does all the __builtin_expect / __glibc_unlikely
stuff *really* make that much difference?  It sure clutters up
the code unmercifully.)

> Alternatively, if you want a smaller patch you can arrange for
> __builtin_expect to be a no-op on compilers that do not support it,
> as Bruno suggested.

That is what I will do, either in my custom.h or directly in
regex_internal.h.

Thanks,

Arnold



__builtin_expect used in regex

2018-09-30 Thread arnold
Hello GNULIB guys.

Please see the patch below which Nelson needs in order to compile
gawk on several of his systems.  This comes from the use of the BE
macro in regex.

Nelson, please chime in with a list of the system + compiler combinations
where gawk needs this patch.  As I mentioned, this is really a gnulib
issue and thus I'm reporting it there.

I will apply this patch, probably later this week, unless the GNULIB
guys, with your help, can patch regex directly.

Thanks,

Arnold

> Date: Sat, 29 Sep 2018 16:05:35 -0600
> From: "Nelson H. F. Beebe" 
> To: "Arnold Robbins" 
> Cc: be...@math.utah.edu
>
> I propose the following patch to fix the __builtin_expect() problem:
>
> % cat /home/gnu/src/gawk/gawk-2018-09-29.patch.p1
> *** custom.h.org  Sat Sep 29 14:22:37 2018
> --- custom.h      Sat Sep 29 14:56:04 2018
> ***************
> *** 53,58 ****
> --- 53,64 ----
>   #endif
>   #endif
>   
> + #ifndef __GNUC__
> + #ifndef __builtin_expect
> + #define __builtin_expect(expr, val) (expr)
> + #endif
> + #endif
> + 
>   /* For QNX, based on submission from Michael Hunter, mphun...@qnx.com */
>   #ifdef __QNX__
>   #define GETPGRP_VOID  1
>
>
> I have applied it on a half-dozen systems that failed to build today
> from the gawk-2018-09-29 snapshot that I pulled earlier this afternoon;
> all of those systems then had successful builds.
>
> I'm still uncomfortable with the fact that a compiler-specific
> feature, __builtin_expect, was used without adequate fallback
> for non-gcc compiler builds.  My patch remedies that problem.
>
> -----------------------------------------------------------------------------
> - Nelson H. F. Beebe                  Tel: +1 801 581 5254                  -
> - University of Utah                  FAX: +1 801 581 4148                  -
> - Department of Mathematics, 110 LCB  Internet e-mail: be...@math.utah.edu  -
> - 155 S 1400 E RM 233                 be...@acm.org  be...@computer.org     -
> - Salt Lake City, UT 84112-0090, USA  URL: http://www.math.utah.edu/~beebe/ -
> -----------------------------------------------------------------------------
>



Re: Small patch to regex_internal.h for z/OS

2018-08-22 Thread arnold
Paul Eggert  wrote:

> Thanks, I installed that into Gnulib and into glibc.

Most excellent! Thanks.



Small patch to regex_internal.h for z/OS

2018-08-22 Thread Arnold Robbins
Hi.

I have applied the following patch to my copy of regex_internal.h; it's
needed for compilation in the POSIX environment on z/OS.

Thanks,

Arnold
--
--- /usr/local/src/Gnu/gnulib/lib/regex_internal.h  2018-07-18 21:16:31.670542200 +0300
+++ support/regex_internal.h  2018-08-22 18:46:06.006186098 +0300
@@ -149,7 +149,10 @@
 /* Rename to standard API for using out of glibc.  */
 #ifndef _LIBC
 # undef __wctype
+# undef __iswalnum
 # undef __iswctype
+# undef __towlower
+# undef __towupper
 # define __wctype wctype
 # define __iswalnum iswalnum
 # define __iswctype iswctype
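
The pattern in the patch, #undef before #define, matters whenever a
name may already be a macro: C does not allow redefining a macro with a
different body unless it is undefined first. Whether that is exactly
what happens in the z/OS headers is an assumption here. The sketch uses
a hypothetical name, sketch_iswalnum, rather than the reserved
double-underscore identifiers.

```c
#include <wctype.h>

/* Pretend an earlier header already defined this name as a macro
   (hypothetical; stands in for a system-provided definition).  */
#define sketch_iswalnum(wc) (0)

/* Redefining it with a different body without the #undef would be a
   constraint violation, so drop the old definition first.  */
#undef sketch_iswalnum
#define sketch_iswalnum iswalnum

/* Uses the remapped name; ends up calling the standard iswalnum.  */
static int
is_alnum_sketch (wint_t wc)
{
  return sketch_iswalnum (wc) != 0;
}
```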



Re: [bug-gawk] [PATCH] Avoid left-shifting a negative value (by a positive value)

2018-08-19 Thread arnold
Paul Eggert  wrote:

> arn...@skeeve.com wrote:
> > The file under discussion came from GNULIB (I
> > believe) so I'm adding bug-gnulib and will let that team comment on
> > this.
>
> We fixed that long ago in a different way in Gnulib, so presumably the next
> time Gawk syncs mktime.c from Gnulib it'll fix the problem then.

I have pulled the latest into gawk-4.2-stable; waiting to hear from
my porting team.

Thanks.

Arnold



Re: [bug-gawk] [PATCH] Avoid left-shifting a negative value (by a positive value)

2018-08-16 Thread arnold
Hi.

Thanks for the note.  The file under discussion came from GNULIB (I
believe) so I'm adding bug-gnulib and will let that team comment on
this.

Given that it's not an issue on commonly used CPUs, I don't see this
as a high priority issue either way.

Thanks,

Arnold

Samy Mahmoudi  wrote:

> Hello,
>
> Compiling with the option -Wshift-negative-value outputs the following
> warning:
>
> missing_d/mktime.c:82:22: warning: left shift of negative value
> [-Wshift-negative-value]
>: ~ (time_t) 0 << (sizeof (time_t) * CHAR_BIT - 1))
>
> In relation to gcc PR c/65179, Martin Sebor wrote:
>
> "Shifting a negative value by a positive number of bits does have a natural
> meaning (i.e., shifting the bit pattern the same way as unsigned). The
> reason why it's undefined in C and C++ is because some processors don't
> shift the sign bit out and may raise an overflow when a one bit is shifted
> into the sign position (typically those that provide an arithmetic left
> shift). But most processors implement a logical left shift and behave the
> same way for signed operands as for unsigned. The result of a left shift of
> a negative number computed by GCC matches that of hardware that doesn't
> differentiate between arithmetic and logical left shifts (which is all the
> common CPUs, including ARM, MIPS, PowerPC, x86), so the only value in
> diagnosing it is portability to rare CPUs or to compilers that behave
> differently than GCC (if there are any)."
>
> On most platforms, the attached patch does not provide any functional
> change.
>
> Besides, do you think using intmax_t and uintmax_t could result in a
> portability loss?
>
> Best regards,
> Samy Mahmoudi

---
diff --git a/missing_d/mktime.c b/missing_d/mktime.c
index d394ef17..16a944d3 100644
--- a/missing_d/mktime.c
+++ b/missing_d/mktime.c
@@ -79,7 +79,7 @@

 #ifndef TIME_T_MIN
 #define TIME_T_MIN (0 < (time_t) -1 ? (time_t) 0 \
-   : ~ (time_t) 0 << (sizeof (time_t) * CHAR_BIT - 1))
+   : (time_t) (intmax_t) ~ ((uintmax_t) ~ (time_t) 0 >> 1))
 #endif
 #ifndef TIME_T_MAX
 #define TIME_T_MAX (~ (time_t) 0 - TIME_T_MIN)
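
One way to get the extreme values of a signed type without ever
shifting a negative number is to build the maximum from positive values
and derive the minimum from it. The macros below are an illustrative
sketch only, not the code from this patch and not claimed to be what
Gnulib does; they assume a two's-complement type with no padding bits,
and the names SIGNED_MAX_OF and SIGNED_MIN_OF are made up.

```c
#include <limits.h>
#include <stdint.h>

/* Largest value of signed type T: set the bit below the top value
   bit, subtract one, then double and add one.  Every intermediate
   value is positive and representable in T.  */
#define SIGNED_MAX_OF(t) \
  ((t) ((((t) 1 << (sizeof (t) * CHAR_BIT - 2)) - 1) * 2 + 1))

/* Smallest value of signed type T on a two's-complement machine,
   computed as -MAX - 1 so that no step overflows.  */
#define SIGNED_MIN_OF(t) ((t) (-SIGNED_MAX_OF (t) - 1))
```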



Re: Rational Ranges [was Re: gnulib regex lib]

2018-08-13 Thread arnold
Paul Eggert  wrote:

> arn...@skeeve.com wrote:
> > The only FIXMEs I see are both in the _LIBC part of the code, and
> > there's only two: one in regexec.c and one in regcomp.c.
>
> In that case I guess there isn't a problem.
>
> I am a little concerned that unibyte locales use bytes whereas multibyte
> locales use characters for range expressions. As I understand it, this means
> Turkish range expressions are interpreted differently depending on whether
> the locale uses UTF-8 or ISO/IEC 8859-9. Is that really what Turkish-speakers
> want?

It's a sad fact of life that users have to be aware of their character set /
locale and understand the consequences of what they choose to use (or
what their OS has chosen for them upon installation).  This is just
another aspect of that.

> That being said, it doesn't matter all that much nowadays now that UTF-8 has
> taken over, so it's probably not worth much of our time to worry about this
> discrepancy. For what it's worth,
> https://w3techs.com/technologies/details/en-iso885909/all/all says that only
> 0.06% of websites still use ISO/IEC 8859-9, down from 0.09% a year ago (and
> down from 0.7% in 2010, so this is a factor-of-10 decline in 8 years).

I totally agree that it's not worth worrying about. It's a too small
tail to be wagging such a big dog.

Thanks,

Arnold



Re: Rational Ranges [was Re: gnulib regex lib]

2018-08-12 Thread arnold
Paul Eggert  wrote:

> arn...@skeeve.com wrote:
> > Can you elaborate? Is it mainly in the LIBC part of the code that it's
> > not implemented correctly?
>
> Sorry, I haven't followed that part of the code closely. There are some
> FIXMEs there, as I recall. I'd be surprised if RRI were fully implemented
> even in the !_LIBC part of the code.

The only FIXMEs I see are both in the _LIBC part of the code, and
there's only two: one in regexec.c and one in regcomp.c.

Thanks,

Arnold



Re: Rational Ranges [was Re: gnulib regex lib]

2018-08-12 Thread arnold
Paul Eggert  wrote:

> arn...@skeeve.com wrote:
> > Can you elaborate? Is it mainly in the LIBC part of the code that it's
> > not implemented correctly?
>
> Sorry, I haven't followed that part of the code closely. There are some
> FIXMEs there, as I recall. I'd be surprised if RRI were fully implemented
> even in the !_LIBC part of the code.

I find this statement surprising and discouraging. I would like to see
a test case to prove/disprove it one way or the other, particularly
for multibyte locales.  As the original RRI code came from gawk, I
am pretty sure that the ! _LIBC part of the code does get it right.
Or at least did in my version.

Thanks,

Arnold



Rational Ranges [was Re: gnulib regex lib]

2018-08-12 Thread arnold
Hi.

Paul Eggert  wrote:

> Rather than spend much time worrying about this little comment, it'd
> probably be more helpful to document the intended behavior of rational
> ranges. As I understand it, Arnold wants them to use byte values in
> unibyte locales and wide character values in multibyte locales,

Yes, that's right. It's particularly important for the single
byte locales.

> and this
> intent is worth mentioning somewhere central, particularly since there
> are multiple places in the code where it is not implemented properly.

Can you elaborate? Is it mainly in the LIBC part of the code that it's
not implemented correctly?

Thanks,

Arnold
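
The intended rule, byte values in unibyte locales and wide character
values in multibyte locales, amounts to a plain numeric comparison in
both cases. A minimal sketch with hypothetical function names; real
matcher code is considerably more involved.

```c
#include <stdbool.h>
#include <wchar.h>

/* Unibyte locale: a range [LO-HI] matches by raw byte value.  */
static bool
range_match_byte (unsigned char lo, unsigned char hi, unsigned char c)
{
  return lo <= c && c <= hi;
}

/* Multibyte locale: the same test on wide character values.  */
static bool
range_match_wide (wchar_t lo, wchar_t hi, wchar_t wc)
{
  return lo <= wc && wc <= hi;
}
```

Under this rule [a-z] never matches an accented letter such as byte
0xE9 in ISO 8859-1, because collation order plays no part in the test.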



Re: [gawk-devel] changing regex lib

2018-08-12 Thread arnold
Paul Eggert  wrote:

> arn...@skeeve.com wrote:
> > I tried out Paul's change and it works for me.
>
> Thanks for checking. I installed the regcomp.c change into glibc and gnulib
> so we should now have the same source there as we have in Gawk.

Thanks Paul!

I will likely merge that into the gawk mainline this week.

Arnold



Re: [PATCH 07/17] Regex: Additional memory management checks.

2017-12-20 Thread arnold
Hi Paul.

Paul Eggert <egg...@cs.ucla.edu> wrote:

> On 12/08/2017 01:16 AM in 
> <https://sourceware.org/ml/libc-alpha/2017-12/msg00242.html> Arnold 
> Robbins wrote:
> > +  /* some malloc()-checkers don't like zero allocations */
>
> Which checkers are these?

Lord only knows. That change has been in gawk's regex for years and
years and I don't remember. So:

> Can they be told that 'malloc (0)' is OK? 

Practically speaking, no.

> > +   * ADR: valgrind says size can be 0, which then doesn't
> > +   * free the block of size 0.  Harumph. This seems
> > +   * to work ok, though.
> > +   */
> > +  if (size == 0)
> > +{
> > +   memset(set, 0, sizeof(*set));
> > +   return REG_NOERROR;
> > +}
> > set->alloc = size;
> > set->nelem = 0;
> > set->elems = re_malloc (int, size);
>
> For this, how about if we use the corresponding Gnulib fix instead? An 
> advantage of the Gnulib fix is that it doesn't introduce runtime 
> overhead when glibc is used. Here is a URL:

I think that the runtime overhead is so small that it cannot be
measured.  I don't want to pull in more gnulib m4 goop for this.
The GLIBC guys can do as they wish, of course. :-)

> > diff --git a/posix/regexec.c b/posix/regexec.c
> > index 2d2bc46..8573765 100644
> > --- a/posix/regexec.c
> > +++ b/posix/regexec.c
> > @@ -605,7 +605,7 @@ re_search_internal (const regex_t *preg, const char *string, int length,
> > nmatch -= extra_nmatch;
> >   
> > /* Check if the DFA haven't been compiled.  */
> > -  if (BE (preg->used == 0 || dfa->init_state == NULL
> > +  if (BE (preg->used == 0 || dfa == NULL || dfa->init_state == NULL
> >   || dfa->init_state_word == NULL || dfa->init_state_nl == NULL
> >   || dfa->init_state_begbuf == NULL, 0))
> >   return REG_NOMATCH;
>
> Why is this change needed? I couldn't see a code path that would trigger 
> it.

I managed once while doing some changes to cause dfa to be NULL. So
I added the check.  I don't remember what I did.

Thanks,

Arnold
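
The zero-size guard discussed above can be shown in isolation. This is
a sketch modeled on the quoted hunk, with hypothetical names
(struct node_set_sketch, node_set_alloc_sketch) and plain malloc in
place of the regex allocation macros.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a growable set of node indexes.  */
struct node_set_sketch
{
  int alloc;                    /* number of slots allocated */
  int nelem;                    /* number of slots in use */
  int *elems;
};

/* Prepare SET to hold SIZE elements.  Returns 0 on success,
   -1 on allocation failure.  A request for zero elements never
   reaches malloc; the set is simply cleared.  */
static int
node_set_alloc_sketch (struct node_set_sketch *set, int size)
{
  if (size == 0)
    {
      memset (set, 0, sizeof *set);
      return 0;
    }
  set->alloc = size;
  set->nelem = 0;
  set->elems = malloc ((size_t) size * sizeof *set->elems);
  return set->elems == NULL ? -1 : 0;
}
```

The cost is one integer comparison per call, and an empty set ends up
with a null elems pointer, which is safe to pass to free.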




Re: [PATCH 06/17] Regex: Use re_malloc / re_free consistently.

2017-12-20 Thread arnold
Thanks, I have merged this in to gawk's version.

Paul --- I think that  you have permission to push the patches you approve
to glibc. Please do so.

Thanks,

Arnold

Paul Eggert <egg...@cs.ucla.edu> wrote:

> On 12/08/2017 01:16 AM, Arnold Robbins wrote:
> > This patch changes several calls to malloc/free into re_malloc/re_free,
> > bringing consistency to the code.
>
> Thanks, that patch makes sense, but it misses three opportunities to 
> bring consistency. regcomp.c has one call each to malloc and free, which 
> should be consistent too. Also, regexec.c has a call to realloc that 
> should be changed to re_realloc. A minor formatting issue: one 
> newly-introduced re_malloc call doesn't need to appear on the next line.
>
> (Possibly we should be adding consistency in the opposite way, by 
> removing the macros re_free, re_malloc, and re_realloc, and simply using 
> the underlying C functions. These macros are tricky since they are 
> function-like but (aside from re_free) cannot be implemented as 
> functions, and they don't buy much. But that'd be a bigger change.)
>
> I installed the attached patch into Gnulib; it contains the originally 
> proposed patch 06/17 along with the abovementioned fixups. Something 
> like this should be easily installable into glibc.
>
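
The remark that these macros cannot be written as functions is easier
to see with an example. The macros below are hypothetical look-alikes
(sketch_malloc and friends), written in the general shape such helpers
take; the real definitions in regex_internal.h may differ. Because the
first argument is a type name, no ordinary C function could accept it.

```c
#include <stdlib.h>

/* Allocate room for N objects of type T; T is a type name, which is
   why this has to be a macro.  */
#define sketch_malloc(t, n) ((t *) malloc ((n) * sizeof (t)))

/* Resize P to hold N objects of type T.  */
#define sketch_realloc(p, t, n) ((t *) realloc (p, (n) * sizeof (t)))

/* Only this one could just as well be a function.  */
#define sketch_free(p) free (p)

/* Example use: an array holding 0 .. N-1, or NULL on failure.  */
static int *
make_counts (size_t n)
{
  int *v = sketch_malloc (int, n);
  if (v != NULL)
    for (size_t i = 0; i < n; i++)
      v[i] = (int) i;
  return v;
}
```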



Re: [PATCH 01/17] Regex: Fix spelling errors / typos.

2017-12-20 Thread arnold
Thanks!

One down, 16 to go... :-)

"Carlos O'Donell" <car...@redhat.com> wrote:

> On 12/19/2017 02:03 PM, Paul Eggert wrote:
> > These typo changes are all in Gnulib, and would be fine to install in glibc.
> > 
> > Adding bug-gnulib to the CC list. For reference, the original email is here:
> > https://sourceware.org/ml/libc-alpha/2017-12/msg00243.html
>
> Done. I've pushed these for Arnold.
>
> commit 5069ff32842c60c55f8b573ee66fe43f9ec364af
> Author: Arnold Robbins <arn...@skeeve.com>
> Date:   Tue Dec 19 19:26:08 2017 -0800
>
> regex: Fix spelling in comments.
> 
> Fix the spelling in various comments throughout the
> regex implementation. These changes are also present
> in gnulib and will be integrated there also, see:
> https://sourceware.org/ml/libc-alpha/2017-12/msg00688.html
>
>
> -- 
> Cheers,
> Carlos.



Re: [PING] [PATCH 00/17] Regex: Make libc regex more usable outside GLIBC

2017-12-19 Thread arnold
Hello.

Thanks for cluing me into the discussion.

> As I understand it, Arnold's patches are against glibc. Arnold, would it
> be too much trouble to rebase them against gnulib instead?

Absolutely too much trouble. Sorry.

I think that most or all of the changes are in gnulib's regex, but
the gnulib regex has too many changes (Idx instead of int everywhere,
to name the main one) for me to be willing to try and figure it out.

I think it will be actually easier to merge my changes in and then
compare to gnulib, but that's up to you.

> I agree that syncing first with gnulib is the way to go.

Not in my humble opinion, but I'm not doing the work. (:-)

> Also, Arnold, I think you should be aware that glibc is coming up
> on a release freeze, so reviewers' time is going to be focused on
> higher-urgency stuff for the next month or so.  I will try to find time
> to assist with these patches but it won't be till January.

OK, thanks for letting me know what things are like on the glibc team.
I won't ping about this again until mid- or late January.

I do appreciate that the trend is to merge with gnulib; that will
ultimately be a good thing so I am encouraged by it.

Thanks,

Arnold



Re: [Grep-devel] patches for removing DFA_CASE_FOLD

2016-12-13 Thread arnold
Paul Eggert <egg...@cs.ucla.edu> wrote:

> On 12/13/2016 12:26 PM, Arnold Robbins wrote:
> > -  dfa->syntax.case_fold = (dfaopts & DFA_CASE_FOLD) != 0;
> > +  dfa->syntax.case_fold = (bits & RE_ICASE) != 0
>
> I'm afraid that didn't work, due to a missing semicolon. I fixed that 
> up, fiddled with the commit messages, updated grep submodules, and 
> installed the result into gnulib and grep.

Wonderful, thanks!

Arnold



Re: bug#22357: grep -f not only huge memory usage, but also huge time cost

2016-12-11 Thread arnold
Bruno Haible <br...@clisp.org> wrote:

> Finally, code this formula into the 'grep' program.

I'm sure that Paul and Jim would welcome patches.

Arnold