Package: gawk
Version: 1:5.2.1-2 1:5.1.0-1

Hi,

I initially ran into this issue on Debian 11 Bullseye, but I can also
reproduce it in Debian Unstable as of now:

We have logs that separate fields with "||", i.e. two pipe
characters. (Yeah, likely not ideal, but that's a given. :-)

With mawk I can parse them easily:

  $ echo 'a||b' | mawk -F'\|\|' '{print $1"X"$2}'
  aXb

(Backslashes because a multi-character FS is treated as a regular
expression and hence the special character pipe needs to be
escaped. mawk also complains otherwise, IMHO correctly.)

gawk, though, behaves strangely and, above all, inconsistently:

  $ echo 'a||b' | gawk -F'\|\|' '{print $1"X"$2}'
  gawk: warning: escape sequence `\|' treated as plain `|'
  a||bX

Ok, so '\|' should be written as just '|'? Unexpected, but ok. Let's do
that:

  $ echo 'a||b' | gawk -F'||' '{print $1"X"$2}'
  a||bX

No more warnings, but the output is as wrong as before. It also doesn't
look as if it treated the pipes as a regular expression (in which case it
would probably match any empty string twice and should probably output
something like "a|").

I would, though, have kind of expected "||" to be treated as a
regular expression and hence to require the backslashes.

Using e.g.

  $ echo 'a||b' | gawk 'FS="\|\|" {print $1"X"$2}'
  gawk: cmd. line:1: warning: escape sequence `\|' treated as plain `|'
  a||bX

seems to make no difference.
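
One possible confounder in this last test: per POSIX, a record is split
with the FS value that was current when the record was read, so an FS
assigned in a pattern should only affect the following records. A
minimal sketch, assuming some POSIX awk is installed as "awk" and using
the bracket form so that escaping plays no role (the function names are
made up for illustration):

```shell
# Hypothetical helpers, not part of the original report.

# Set FS in BEGIN, i.e. before the first record is read.
fs_in_begin() {
  printf '%s\n' "$@" | awk 'BEGIN { FS = "[|][|]" } { print $1 "X" $2 }'
}

# Assign FS in a pattern, i.e. after the current record has been read.
fs_in_pattern() {
  printf '%s\n' "$@" | awk '(FS = "[|][|]") { print $1 "X" $2 }'
}

fs_in_begin   'a||b' 'c||d'
fs_in_pattern 'a||b' 'c||d'
```

Under that reading, the second call would leave the first record
unsplit, so a one-line input cannot show whether a pattern-assigned FS
works at all.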

What does work as expected with gawk (and mawk), though, is this:

  $ echo 'a||b' | gawk -F'[|][|]' '{print $1"X"$2}'
  aXb
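
As a slightly larger check of this workaround, the following sketch
prints the field count and every field for a line with more than one
delimiter. It assumes some POSIX awk is installed as "awk"; the helper
name show_fields is hypothetical:

```shell
# Hypothetical helper: print the field count, then each field on its own line.
show_fields() {
  printf '%s\n' "$1" |
    awk -F'[|][|]' '{ print NF; for (i = 1; i <= NF; i++) print $i }'
}

show_fields 'a||b||c'   # 3 fields: a, b, c
show_fields 'a|x||b'    # 2 fields: a|x, b (a single pipe stays inside the field)
```

Since each bracket expression matches exactly one literal pipe, only the
two-character sequence acts as a delimiter.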

Interestingly, if only a single pipe character is used as delimiter, it
works as expected again:

  $ echo 'a|b' | gawk -F'\|' '{print $1"X"$2}'
  gawk: warning: escape sequence `\|' treated as plain `|'
  aXb
  $ echo 'a|b' | gawk -F'|' '{print $1"X"$2}'
  aXb

So the bug seems to only appear if at least two pipes are used as
delimiter. (It behaves the same way with three pipes as with two pipes.)
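
To compare the one-, two- and three-pipe cases across implementations in
one go, something like the following sketch could be used. The helper
try_fs is hypothetical, warnings on stderr are discarded to keep one
line per case, and no particular output is claimed for the plain "||"
and "|||" cases:

```shell
# Hypothetical reproduction helper: run one FS candidate through one awk.
# Prints "<impl> <fs> => <output>", or a note if the implementation is missing.
try_fs() {
  impl=$1
  fs=$2
  input=$3
  if ! command -v "$impl" >/dev/null 2>&1; then
    printf '%s %s => (not installed)\n' "$impl" "$fs"
    return 0
  fi
  # stderr is dropped so that warnings do not clutter the comparison
  out=$(printf '%s\n' "$input" | "$impl" -F"$fs" '{ print $1 "X" $2 }' 2>/dev/null)
  printf '%s %s => %s\n' "$impl" "$fs" "$out"
}

for impl in gawk mawk; do
  try_fs "$impl" '|'         'a|b'
  try_fs "$impl" '||'        'a||b'
  try_fs "$impl" '|||'       'a|||b'
  try_fs "$impl" '[|][|]'    'a||b'
  try_fs "$impl" '[|][|][|]' 'a|||b'
done
```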

Part of the bug, or a separate bug, might be that it warns even in the
two-character version (hence expected to be a regexp) about "\|" being
interpreted as plain "|", which from my point of view is only correct in
the one-character (plus escaping) variant '\|', but not for '\|\|'.

Counter examples:

  $ echo 'afbgc' | awk -F 'f|g' '{print $1, $2, $3}'
  a b c
  $ echo 'afbgc' | awk -F 'f\|g' '{print $1, $2, $3}'
  awk: warning: escape sequence `\|' treated as plain `|'
  a b c
  $ echo 'af|gc' | awk -F 'f\|g' '{print $1, $2}'
  awk: warning: escape sequence `\|' treated as plain `|'
  a |

In the last example it IMHO should not have replaced the "\|" with just
a "|", which is also not "plain" but a special character that was meant
to be escaped. The wanted output was "a c".
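
For comparison, a bracket expression expresses the literal pipe without
any backslash, so the escape handling is not involved. A minimal sketch,
assuming some POSIX awk is installed as "awk" (the helper name is made
up for illustration):

```shell
# Hypothetical helper: split on the literal three-character string "f|g".
split_on_literal_f_pipe_g() {
  printf '%s\n' "$1" | awk -F 'f[|]g' '{ print $1, $2 }'
}

split_on_literal_f_pipe_g 'af|gc'   # prints: a c
```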

-- System Information:
Debian Release: trixie/sid
  APT prefers unstable
  APT policy: (990, 'unstable'), (600, 'testing'), (500, 'unstable-debug'), (500, 'buildd-unstable'), (110, 'experimental'), (1, 'experimental-debug'), (1, 'buildd-experimental')
Architecture: amd64 (x86_64)
Foreign Architectures: i386

Kernel: Linux 6.5.0-4-amd64 (SMP w/8 CPU threads; PREEMPT)
Kernel taint flags: TAINT_WARN
Locale: LANG=C.UTF-8, LC_CTYPE=C.UTF-8 (charmap=UTF-8), LANGUAGE not set
Shell: /bin/sh linked to /bin/dash
Init: sysvinit (via /sbin/init)
LSM: AppArmor: enabled

Versions of packages gawk depends on:
ii  libc6         2.37-13
ii  libgmp10      2:6.3.0+dfsg-2
ii  libmpfr6      4.2.1-1
ii  libreadline8  8.2-3
ii  libsigsegv2   2.14-1

gawk recommends no packages.

Versions of packages gawk suggests:
pn  gawk-doc  <none>

-- no debconf information
