Public bug reported:

In theory the regular expression ^.*$ should match any and every string, including the empty string. However, the Korean character U+D56D (항), which one of my scripts was unlucky enough to come across, breaks the expected behavior in egrep:

$ echo '' | egrep '^.*$'; echo $?

0
$ echo 'foo' | egrep '^.*$'; echo $?
foo
0
$ echo 'bar' | egrep '^.*$'; echo $?
bar
0
$ echo 'の名' | egrep '^.*$'; echo $?
の名
0
$ echo '항' | egrep '^.*$'; echo $?
1

Have I lost my mind...or should I go buy a lottery ticket?

Here are some rambling one-liners to illustrate the behavior further.

# An attempt to match the pattern ^.*$ (beginning of string, anything, end of string) against this Korean character fails:
$ echo '항' | egrep '^.*$'; echo $?
1

# As you can see here, a match works when the $ is dropped from the pattern:
$ echo '항' | egrep '^.*'; echo $?
항
0

# Also, using the -P flag of grep instead of -E correctly matches the original pattern:
$ echo '항' | grep -P '^.*$'; echo $?
항
0

# Sending a different Korean character (U+C720) to the same original pattern works as expected as well:
$ echo '유' | egrep '^.*$'; echo $?
유
0

# Combining the two leads to the original failure mentioned:
$ echo '항유' | egrep '^.*$'; echo $?
1

# And reversing the order of the combination does not affect the outcome:
$ echo '유항' | egrep '^.*$'; echo $?
1

# But dropping the $ from the pattern gives the expected match:
$ echo '유항' | egrep '^.*'; echo $?
유항
0

# Dropping the ^ from the pattern also gives the expected match:
$ echo '유항' | egrep '.*$'; echo $?
유항
0

# Surrounding U+D56D with U+C720 does not alter the behavior:
$ echo '유항유' | egrep '^.*$'; echo $?
1

# But again, dropping U+D56D (항) from the input string returns egrep to the expected behavior:
$ echo '유유' | egrep '^.*$'; echo $?
유유
0

# And to make it very clear what the input is, here I'm using python to give a raw dump of the input:
$ echo '유항유' | python -c 'import sys; print(repr(sys.stdin.read().encode("unicode-escape")))'
b'\\uc720\\ud56d\\uc720\\n'

# My grep/egrep version:
$ grep --version
grep (GNU grep) 3.4
Copyright (C) 2020 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <https://gnu.org/licenses/gpl.html>.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Written by Mike Haertel and others; see
<https://git.sv.gnu.org/cgit/grep.git/tree/AUTHORS>.

$ egrep --version
grep (GNU grep) 3.4
Copyright (C) 2020 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <https://gnu.org/licenses/gpl.html>.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Written by Mike Haertel and others; see
<https://git.sv.gnu.org/cgit/grep.git/tree/AUTHORS>.

# My bash version:
$ bash --version
GNU bash, version 5.0.17(1)-release (x86_64-pc-linux-gnu)
Copyright (C) 2019 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>

This is free software; you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

===========================

If somebody could explain this behavior, I would appreciate it. If it could be fixed, even better. In the meantime I think I will prefer 'grep -P' over 'egrep' when I expect strings to contain Korean text.

In this contrived example the '^' and '$' don't make a lot of sense, but I thought it would be best to provide the simplest possible reproduction case rather than spell out my full use case.

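For anyone triaging this, the cases above can be rerun in one pass. What follows is only a minimal sketch and not part of the original session: the helper names hexbytes and check are hypothetical, and the two characters are written as octal escapes of their UTF-8 bytes on the assumption that the terminal and locale use UTF-8. It prints each input's bytes next to the result of each pattern under grep -E, so the results can be compared across machines without relying on how the terminal renders the text. No expected results are given for the pattern matrix, since those are exactly what is in question.

```shell
#!/bin/sh
# Sketch only: hexbytes and check are hypothetical helpers, not from the report.

# Print the bytes of a string as space-separated hex.
hexbytes() {
    printf '%s' "$1" | od -An -tx1 | xargs
}

# check PATTERN INPUT: print "match" if grep -E matches the line, else "no match".
check() {
    if printf '%s\n' "$2" | grep -E -q -- "$1"; then
        echo "match"
    else
        echo "no match"
    fi
}

# The two characters from the report, as UTF-8 bytes in octal, so the
# script does not depend on the encoding of the script file itself.
hang=$(printf '\355\225\255')   # U+D56D
yu=$(printf '\354\234\240')     # U+C720

hexbytes "$hang"   # ed 95 ad
hexbytes "$yu"     # ec 9c a0

# Run every pattern from the report against every input combination.
for input in "$hang" "$yu" "$yu$hang" "$yu$hang$yu"; do
    for pattern in '^.*$' '^.*' '.*$'; do
        printf '%s  %-5s  %s\n' "$(hexbytes "$input")" "$pattern" "$(check "$pattern" "$input")"
    done
done
```

Each output row shows the input bytes, the pattern, and whether grep -E matched. Running the same script under a different LC_ALL setting may help narrow down whether multibyte handling is involved, though that is a guess rather than something the report establishes.
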
ProblemType: Bug
DistroRelease: Ubuntu 20.04
Package: grep 3.4-1
ProcVersionSignature: Ubuntu 5.4.0-65.73-generic 5.4.78
Uname: Linux 5.4.0-65-generic x86_64
NonfreeKernelModules: zfs zunicode zavl icp zcommon znvpair wl
ApportVersion: 2.20.11-0ubuntu27.16
Architecture: amd64
CasperMD5CheckResult: skip
Date: Mon Feb 15 17:10:42 2021
InstallationDate: Installed on 2020-01-22 (389 days ago)
InstallationMedia: Ubuntu 18.04.3 LTS "Bionic Beaver" - Release amd64 (20190805)
SourcePackage: grep
UpgradeStatus: Upgraded to focal on 2021-02-01 (13 days ago)

** Affects: grep (Ubuntu)
     Importance: Undecided
         Status: New

** Tags: amd64 apport-bug focal

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1915738

Title:
  egrep: U+D56D (항) breaks ^/$ matching

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/grep/+bug/1915738/+subscriptions

-- 
ubuntu-bugs mailing list
ubuntu-bugs@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs