Hi Pedro,

Pedro <[email protected]> writes:

> Hello.
> I'm running tac compiled from the canonical 9.11 release.

Apologies for the delayed response. Your questions/observations are
good. However, I rarely use 'tac --regex', so it wasn't immediately
obvious to me what was occurring.

> - Issue #1: '$' seems to be ignored if preceded by certain patterns:
>
> $ printf 1234 | tac -rs '.$'
> Expected output: 1234
> Actual output: 4321
>
> $ printf 1234 | tac -rs '\w$'
> Expected output: 1234
> Actual output: 4321
>
> $ printf 1234 | tac -brs '.$'
> Expected output: 4123
> Actual output: 4321

In these cases "$" is not being ignored. Instead all of the characters
are being treated as separators.

The 'tac' program operates by reading fixed-size buffers, which is
relevant to your later examples, and then scanning backwards for
separators. The "$" character matches the end of the string. In the
above examples the entire string fits into one buffer. The first
separator matched is "4", which is at the end of the string. Therefore,
it is output first. After that, the end of the string is "3", which is
matched in turn, and so on.

It might help you to visualize it like this:

     "" 4 "" 3 "" 2 "" 1 ""

This is sort of how 'tac' sees it. The numbers are separators, which
have empty strings in between them.
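
A quick way to check this model, as a sketch rather than anything
authoritative: use a pattern that plainly matches every character, so
that "$" is not involved at all. Assuming GNU tac, I would expect the
same reversal as in your examples:

```shell
# Every digit is a separator, so each record is an empty string
# followed by one separator character.
out=$(printf 1234 | tac -rs '[0-9]')
printf '%s\n' "$out"   # 4321
```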

> - Issue #2: input is split into 8192 bytes records if '^' is used:
>
> $ for i in 1 2 3; do head -c8192 /dev/zero | tr \\0 $i; done |
>   tac -rs '^' | od -Ax -tx1z
> Expected output:
> 0000 31 31 31 31 31 31 31 31 31 31 31 31 31 31 31 31  >1111111111111111<
> *
> 2000 32 32 32 32 32 32 32 32 32 32 32 32 32 32 32 32  >2222222222222222<
> *
> 4000 33 33 33 33 33 33 33 33 33 33 33 33 33 33 33 33  >3333333333333333<
> *
> 6000
> Actual output:
> 0000 33 33 33 33 33 33 33 33 33 33 33 33 33 33 33 33  >3333333333333333<
> *
> 2000 32 32 32 32 32 32 32 32 32 32 32 32 32 32 32 32  >2222222222222222<
> *
> 4000 31 31 31 31 31 31 31 31 31 31 31 31 31 31 31 31  >1111111111111111<
> *
> 6000
>
> Note that '$' behaves as expected here:
>
> $ for i in 1 2 3; do head -c8192 /dev/zero | tr \\0 $i; done |
>   tac -rs '$' | od -Ax -tx1z
> 0000 31 31 31 31 31 31 31 31 31 31 31 31 31 31 31 31  >1111111111111111<
> *
> 2000 32 32 32 32 32 32 32 32 32 32 32 32 32 32 32 32  >2222222222222222<
> *
> 4000 33 33 33 33 33 33 33 33 33 33 33 33 33 33 33 33  >3333333333333333<
> *
> 6000
>
> The exact same issues apply to \' and \` respectively.
>
> I've been using tac -rs \\\` as a convenient "delaying" filter (a filter 
> that behaves like cat but consumes all of its input before writing any 
> output) until it broke for inputs exceeding 8192 bytes. I wasn't using \' 
> because it is much slower.

Yep, 'tac' reads 8192 bytes at a time. In this case, "^" matches the
beginning of the string, and therefore the start of each 8192-byte
buffer.
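
If all you need is the delaying filter, here is a minimal sketch that
does not depend on how 'tac' buffers its input. The function name
'delay' is made up for illustration, and it assumes 'mktemp' is
available:

```shell
# Hypothetical helper: behaves like cat, but reads all of stdin into a
# temporary file before writing anything to stdout.
delay() {
  tmp=$(mktemp) || return 1
  cat >"$tmp" && cat "$tmp"
  status=$?
  rm -f "$tmp"
  return "$status"
}

# Usage: 24576 bytes in, the same 24576 bytes out, in the original order.
for i in 1 2 3; do head -c8192 /dev/zero | tr \\0 $i; done | delay | wc -c
```

The trade-off is a temporary file on disk, which may or may not be
acceptable for your use.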

> If not bugs, there seems to be at least some kind of inconsistencies in the 
> current behavior that needs to be documented, but let me know if I'm 
> missing something.

It seems like you thought that "^" and "$" operate on lines of text
instead of buffers, is that correct? I think there was some attempt to
address that here [1]:

    Records are separated by instances of a string (newline by default).
    By default, this separator string is attached to the end of the record
    that it follows in the file.

Perhaps a note is needed under the description of '--regex' that "^" and
"$" operate on records instead of lines. What do you think of that idea?

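To make "records, not lines" concrete, here is a small sketch with a
literal separator instead of a regex (again assuming GNU tac). The input
contains no newlines at all, yet it is still split into records:

```shell
# "," ends each record, so the records are "1,", "2," and "3,".
attached_after=$(printf '1,2,3,' | tac -s ,)
printf '%s\n' "$attached_after"    # 3,2,1,

# With -b the separator starts each record: ",1", ",2" and ",3".
attached_before=$(printf ',1,2,3' | tac -b -s ,)
printf '%s\n' "$attached_before"   # ,3,2,1
```
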
Thanks,
Collin

[1] https://www.gnu.org/software/coreutils/manual/html_node/tac-invocation.html
