Re: Fwd: odd behavior of length(), match() and field splitting with multi-byte characters

Ed Morton via Cygwin Tue, 20 Aug 2024 04:00:13 -0700

Is there any more information I can provide for someone to be able to look into this bug?

Ed.


On 7/6/2024 7:26 AM, Ed Morton wrote:

I posted the below bug report to the GNU awk bugs mailing list, https://lists.gnu.org/archive/html/bug-gawk/2024-07/msg00000.html, and the feedback there is that it's a Cygwin or MSYS2 port issue. Could you please take a look? I'm also posting this at https://github.com/msys2/mingw-packages/issues per the advice from the GNU bug list.
Regards,

    Ed Morton.

-------- Forwarded Message --------
Subject: odd behavior of length(), match() and field splitting with multi-byte characters
Date:   Mon, 1 Jul 2024 05:56:02 -0500
From:   Ed Morton
To:     bug-g...@gnu.org <bug-g...@gnu.org>



Configuration Information [Automatically generated, do not change]:
Machine: x86_64
OS: cygwin
Compiler: gcc
Compilation CFLAGS: -ggdb -O2 -pipe -Wall -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -fstack-protector-strong --param=ssp-buffer-size=4 -fdebug-prefix-map=/cygdrive/d/a/scallywag/gawk/gawk-5.3.0-1.x86_64/build=/usr/src/debug/gawk-5.3.0-1 -fdebug-prefix-map=/cygdrive/d/a/scallywag/gawk/gawk-5.3.0-1.x86_64/src/gawk-5.3.0=/usr/src/debug/gawk-5.3.0-1 -DNDEBUG
uname output: CYGWIN_NT-10.0-22631 TournaMart_2023 3.5.3-1.x86_64 2024-04-03 17:25 UTC x86_64 Cygwin
Machine Type: x86_64-pc-cygwin

Gawk Version: 5.3.0

Attestation 1:
        I have read https://www.gnu.org/software/gawk/manual/html_node/Bugs.html.
        Yes

Attestation 2:
        I have not modified the sources before building gawk.
        True

Description:
        gawk is reporting odd lengths and matches of strings
        when multi-byte characters are involved.

Repeat-By:
        Someone on StackOverflow asked about a couple of issues they saw that, so far at least, no-one there can explain and that seem to just be bugs.

        1) https://stackoverflow.com/questions/78676444/conditional-replacement-of-arbitrarily-large-strings-that-occur-at-arbitrary-dis#comment138715434_78676444 and https://stackoverflow.com/questions/78676444/conditional-replacement-of-arbitrarily-large-strings-that-occur-at-arbitrary-dis#comment138720207_78676444:
        If we output 4 characters (two ASCII letters around two 4-byte UTF-8 characters) as 10 bytes using:

            $ echo '61F09F948DF09F948E62' | xxd -r -p > file1
            $

        and run the following gawk command on it we get the output shown:

            $ LC_ALL=en_US.utf8 gawk '{print(length($0))}' file1
            6
            $

        i.e. 6 instead of 4. If we run

            $ printf 'F0989A9F' | xxd -r -p | LC_ALL=en_US.utf8 awk -F'' '{print NF, length(); for (i=1; i<=NF; i++) print $i}' | cat -A
            2 2$
            M-pM-^XM-^Z$
            M-^_$
            $

        it shows that what is intended to be a single 4-byte character is being treated as 2 characters, one 3 bytes and the other 1 byte.
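
As a locale-independent cross-check of the expected counts above (4 characters in the 10-byte input, 1 character in the 4-byte input), here is a minimal sketch that counts UTF-8 characters as the bytes that are not continuation bytes (0x80..0xBF). The helper names `hex_to_bytes` and `count_utf8_chars` are hypothetical and not part of the original report, and the count is only meaningful for well-formed UTF-8 input.

```shell
#!/bin/sh
# Hypothetical helper (not from the report): hex string -> raw bytes, no xxd needed.
hex_to_bytes() (
    hex=$1
    while [ -n "$hex" ]; do
        rest=${hex#??}            # everything after the first two hex digits
        pair=${hex%"$rest"}       # the first two hex digits
        hex=$rest
        # %b understands \0ddd octal escapes, so emit the byte that way.
        printf '%b' "\\0$(printf '%o' "0x$pair")"
    done
)

# Hypothetical helper: count UTF-8 characters on stdin by counting every byte
# that is NOT a continuation byte (0x80..0xBF). Assumes well-formed UTF-8.
count_utf8_chars() {
    od -An -v -tx1 | tr -s ' ' '\n' | grep -c '^[0-7c-f][0-9a-f]$'
}

for hex in 61F09F948DF09F948E62 F0989A9F; do
    bytes=$(hex_to_bytes "$hex" | wc -c | tr -d ' ')
    chars=$(hex_to_bytes "$hex" | count_utf8_chars)
    echo "$hex: bytes=$bytes chars=$chars"
done
# prints:
#   61F09F948DF09F948E62: bytes=10 chars=4
#   F0989A9F: bytes=4 chars=1
```

Because the count is done on raw bytes, it does not depend on the locale or on the awk implementation, so it can serve as a reference value when comparing gawk builds.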

        2) https://stackoverflow.com/questions/78690533/why-does-the-match-function-not-work-in-this-particular-situation
        If we create some input using:

            $ echo '3C6469763E3C6469763E5F3C2F6469763E5F3C68313E6162636465665F3C2F68313E5F3C2F6469763E3C6469763EF09F93853C2F6469763E0A' | xxd -r -p > file2

        and then run this on it we get the expected output shown:

            $ LC_ALL=en_US.utf8 gawk '{match($0,/^.*_<h1>(.*)_<\/h1>.*$/,a); print a[1]}' file2
            abcdef
            $

        but if we add the `IGNORECASE` flag we get a blank line output:

            $ LC_ALL=en_US.utf8 gawk -vIGNORECASE=1 '{match($0,/^.*_<h1>(.*)_<\/h1>.*$/,a); print a[1]}' file2
            $

        unless we also remove the end-of-string anchor, `$`, from the end of the regexp:

            $ LC_ALL=en_US.utf8 gawk -vIGNORECASE=1 '{match($0,/^.*_<h1>(.*)_<\/h1>.*/,a); print a[1]}' file2
            abcdef
            $
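
To make the hex input for file2 easier to read, the following sketch decodes it and replaces every non-ASCII byte with `?`. The helper `hex_to_bytes` is hypothetical (it stands in for `xxd -r -p`) and is not part of the original report. The output shows that the captured text `abcdef` is plain ASCII and that the only multi-byte content is one 4-byte sequence in the part of the line covered by the trailing `.*$`. That is an observation about the input only; it does not establish the cause of the behavior.

```shell
#!/bin/sh
# Hypothetical helper (not from the report): hex string -> raw bytes, no xxd needed.
hex_to_bytes() (
    hex=$1
    while [ -n "$hex" ]; do
        rest=${hex#??}            # everything after the first two hex digits
        pair=${hex%"$rest"}       # the first two hex digits
        hex=$rest
        # %b understands \0ddd octal escapes, so emit the byte that way.
        printf '%b' "\\0$(printf '%o' "0x$pair")"
    done
)

# Same hex string as used to create file2 in the report.
file2_hex='3C6469763E3C6469763E5F3C2F6469763E5F3C68313E6162636465665F3C2F68313E5F3C2F6469763E3C6469763EF09F93853C2F6469763E0A'

# In the C locale every byte >= 0x80 is non-printable, so each byte of the
# 4-byte UTF-8 sequence F0 9F 93 85 is shown as one '?'.
hex_to_bytes "$file2_hex" | LC_ALL=C tr -c '[:print:]\n' '?'
# prints: <div><div>_</div>_<h1>abcdef_</h1>_</div><div>????</div>
```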


