Public bug reported:

In theory the regular expression ^.*$ should match every line, including
an empty one. However, one specific Korean character, U+D56D (항), which
one of my scripts happened to come across, breaks that expectation in
egrep:

$ echo '' | egrep '^.*$'; echo $?

0
$ echo 'foo' | egrep '^.*$'; echo $?
foo
0
$ echo 'bar' | egrep '^.*$'; echo $?
bar
0
$ echo 'の名' | egrep '^.*$'; echo $?
の名
0
$ echo '항' | egrep '^.*$'; echo $?
1

Have I lost my mind...or should I go buy a lottery ticket? Here are some
rambling one-liners to illustrate the behavior further.

# An attempt to match the pattern ^.*$ (beginning of string, anything, end of
# string) against this Korean character fails:
$ echo '항' | egrep '^.*$'; echo $?
1

# As you can see here a match works when the $ is dropped from the pattern:
$ echo '항' | egrep '^.*'; echo $?
항
0

# Also, using grep -P instead of grep -E correctly matches the original
# pattern:
$ echo '항' | grep -P '^.*$'; echo $?
항
0

# Sending a different Korean character (U+C720) to the same original pattern
# works as expected as well:
$ echo '유' | egrep '^.*$'; echo $?
유
0

# Combining the two leads to the original failure mentioned:
$ echo '항유' | egrep '^.*$'; echo $?
1

# And reversing the order of the combination does not affect the outcome:
$ echo '유항' | egrep '^.*$'; echo $?
1

# But dropping the $ from the pattern gives the expected match:
$ echo '유항' | egrep '^.*'; echo $?
유항
0

# Dropping the ^ from the pattern also gives the expected match:
$ echo '유항' | egrep '.*$'; echo $?
유항
0

# Surrounding U+D56D with U+C720 does not alter the behavior:
$ echo '유항유' | egrep '^.*$'; echo $?
1

# But again dropping U+D56D (항) from the input string returns egrep to the
# expected behavior:
$ echo '유유' | egrep '^.*$'; echo $?
유유
0
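
The cases above can be collected into one loop, which makes it easier to
compare patterns and inputs side by side. This is only a sketch: the helper
name try_match is made up for illustration, and it uses 'grep -E', which
should behave the same as egrep. The result column depends on the grep build
and the locale, so no expected output is shown.

```shell
#!/bin/sh
# try_match PATTERN INPUT: print "match" if grep -E finds PATTERN in INPUT,
# otherwise "no match" (this also covers grep exiting with an error).
# Hypothetical helper for this report, not part of grep.
try_match() {
    if printf '%s\n' "$2" | grep -E -q -- "$1"; then
        echo "match"
    else
        echo "no match"
    fi
}

# Every input from the one-liners above against every pattern variant.
for input in 'foo' '유' '항' '항유' '유항' '유항유' '유유'; do
    for pattern in '^.*$' '^.*' '.*$'; do
        printf '%s\t%s\t%s\n' "$input" "$pattern" "$(try_match "$pattern" "$input")"
    done
done
```

Each output line is input, pattern, result, separated by tabs.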

# And to make it very clear what the input is, here I'm using python to give a
# raw dump of the input:
$ echo '유항유' | python -c 'import sys; print(repr(sys.stdin.read().encode("unicode-escape")))'
b'\\uc720\\ud56d\\uc720\\n'
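
For a byte-level view that does not depend on Python, the same characters can
be dumped with od. This is a sketch that assumes the terminal and the script
file are UTF-8; utf8_hex is an illustrative helper name. In UTF-8, U+D56D
encodes as the three bytes ed 95 ad and U+C720 as ec 9c a0.

```shell
#!/bin/sh
# utf8_hex STRING: print the bytes of STRING as space-separated hex.
# Hypothetical helper; od prints the bytes, xargs joins them on one line.
utf8_hex() {
    printf '%s' "$1" | od -An -tx1 | xargs echo
}

utf8_hex '항'   # ed 95 ad
utf8_hex '유'   # ec 9c a0
```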

# My grep/egrep version:
$ grep --version
grep (GNU grep) 3.4
Copyright (C) 2020 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <https://gnu.org/licenses/gpl.html>.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Written by Mike Haertel and others; see
<https://git.sv.gnu.org/cgit/grep.git/tree/AUTHORS>.
$ egrep --version
grep (GNU grep) 3.4
Copyright (C) 2020 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <https://gnu.org/licenses/gpl.html>.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Written by Mike Haertel and others; see
<https://git.sv.gnu.org/cgit/grep.git/tree/AUTHORS>.

# My bash version:
$ bash --version
GNU bash, version 5.0.17(1)-release (x86_64-pc-linux-gnu)
Copyright (C) 2019 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>

This is free software; you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.


===========================

If somebody could explain this behavior I would appreciate it. If it
could be fixed, even better. In the meantime I think I will prefer 'grep
-P' over 'egrep' when I expect strings to contain Korean text. In this
contrived example the '^' and '$' didn't make a lot of sense, but I
thought it would be best to provide the simplest possible reproduction
case rather than spell out my full use case.
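
One way to narrow this down in the meantime is to run the same pattern through
'grep -E' in the current locale, 'grep -E' in the C locale, and 'grep -P'.
This is a diagnostic sketch, not an explanation of the bug: that the locale
is relevant at all is an assumption on my part, and compare_engines is an
illustrative helper name. 'grep -P' needs a grep built with PCRE support.

```shell
#!/bin/sh
# compare_engines INPUT: report whether '^.*$' matches INPUT under three
# configurations. Hypothetical helper; the locale comparison is a guess at
# where to look, not a confirmed cause.
compare_engines() {
    input=$1
    pattern='^.*$'

    printf 'grep -E (current locale): '
    printf '%s\n' "$input" | grep -E -q -- "$pattern" && echo "match" || echo "no match"

    printf 'grep -E (LC_ALL=C):       '
    printf '%s\n' "$input" | LC_ALL=C grep -E -q -- "$pattern" && echo "match" || echo "no match"

    printf 'grep -P (current locale): '
    printf '%s\n' "$input" | grep -P -q -- "$pattern" && echo "match" || echo "no match"
}

compare_engines '항'
```

If the three lines disagree for the same input, that at least shows which
combination of matcher and locale is involved.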

ProblemType: Bug
DistroRelease: Ubuntu 20.04
Package: grep 3.4-1
ProcVersionSignature: Ubuntu 5.4.0-65.73-generic 5.4.78
Uname: Linux 5.4.0-65-generic x86_64
NonfreeKernelModules: zfs zunicode zavl icp zcommon znvpair wl
ApportVersion: 2.20.11-0ubuntu27.16
Architecture: amd64
CasperMD5CheckResult: skip
Date: Mon Feb 15 17:10:42 2021
InstallationDate: Installed on 2020-01-22 (389 days ago)
InstallationMedia: Ubuntu 18.04.3 LTS "Bionic Beaver" - Release amd64 (20190805)
SourcePackage: grep
UpgradeStatus: Upgraded to focal on 2021-02-01 (13 days ago)

** Affects: grep (Ubuntu)
     Importance: Undecided
         Status: New


** Tags: amd64 apport-bug focal

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1915738

Title:
  egrep: U+D56D (항) breaks ^/$ matching

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/grep/+bug/1915738/+subscriptions

-- 
ubuntu-bugs mailing list
ubuntu-bugs@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
