Re: [hackers] [sbase] [PATCH 11/11] tail: Process bytes with -c option, and add -m option for runes

2017-01-02 Thread Laslo Hunhold
On Tue, 27 Dec 2016 18:03:24 -0800
Michael Forney wrote:

Hey Michael,

>

yes, the reasoning makes sense. Could you please update your patch so it
applies on git HEAD?

Cheers

Laslo

-- 
Laslo Hunhold 



Re: [hackers] [sbase] [PATCH 11/11] tail: Process bytes with -c option, and add -m option for runes

2016-12-27 Thread Michael Forney
On 12/27/16, Evan Gates wrote:
> On Tue, Dec 27, 2016 at 5:55 AM, Laslo Hunhold wrote:
>> well-spotted! Still, it's _very_ counterintuitive to call the flag
>> "-c". Instead of adding a non-portable m-flag, it would even sound
>> better to me to add a b-flag for byte-offsets.

Yes, it's a bit counter-intuitive, but conflicting with POSIX for this
alone seems like a really bad idea. I always consult POSIX when
writing shell scripts to ensure that they will run on any conforming
system. If sbase redefined -c because it decided the option character
was not the best choice, then reasonable, valid, and portable scripts
may start misbehaving with no indication as to why.

Also, wc(1) (even sbase's implementation) uses -c to refer to bytes,
and -m to refer to characters. It wouldn't be self-consistent to make
tail use -b for bytes and -c for characters. (Just to clarify, I also
think it would be a really bad idea to make wc use -b for bytes and -c
for characters).
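
To make the bytes-versus-characters distinction concrete, here is a
minimal sketch in Python. It is not taken from sbase; the sample string
and the helper names count_bytes and count_chars are assumptions chosen
for illustration, and they only approximate what `wc -c` and `wc -m`
report in a UTF-8 locale.

```python
# Illustrative only: the sample text and helper names are assumptions.
SAMPLE = "na\u00efve caf\u00e9\n"  # contains two 2-byte UTF-8 characters


def count_bytes(data: bytes) -> int:
    """Approximates `wc -c`: the number of bytes in the input."""
    return len(data)


def count_chars(data: bytes) -> int:
    """Approximates `wc -m` in a UTF-8 locale: the number of characters."""
    return len(data.decode("utf-8"))


if __name__ == "__main__":
    encoded = SAMPLE.encode("utf-8")
    print(count_bytes(encoded), count_chars(encoded))
```

The two counts agree for pure ASCII input and diverge as soon as a
multi-byte character appears, which is why the two options cannot be
swapped silently.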

>> It all depends on how many scripts rely on this behaviour. Can you give
>> an example?

Sure. gcc's build system uses tail to skip the first 16 bytes of the
binaries to check that stage2 and stage3 are the same. Granted, it
does use non-standard syntax tail +16c, and I don't know that there
are any bytes in there with the high bit set, but still, tail *does*
get invoked on binary files, and treating the byte offsets as
characters will break things in strange ways that are difficult to
debug.
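
A small Python sketch of how this can go wrong, under stated
assumptions: the 16-byte header and the payload are made up, and
skip_chars assumes a decoder that counts each invalid byte as one
character (Python's "surrogateescape" handler is used to model that).
It is not the sbase implementation.

```python
# Hypothetical data: a 16-byte header with some bytes that have the
# high bit set and happen to form valid 2-byte UTF-8 sequences.
HEADER = b"\xc3\xa9" * 4 + b"\x00" * 8
PAYLOAD = b"PAYLOAD"
DATA = HEADER + PAYLOAD


def skip_bytes(data: bytes, n: int) -> bytes:
    """Drop the first n bytes."""
    return data[n:]


def skip_chars(data: bytes, n: int) -> bytes:
    """Drop the first n characters, decoding the input as UTF-8.

    Invalid bytes are each treated as one character (an assumption).
    """
    text = data.decode("utf-8", errors="surrogateescape")
    return text[n:].encode("utf-8", errors="surrogateescape")


if __name__ == "__main__":
    print(skip_bytes(DATA, 16))  # the payload, intact
    print(skip_chars(DATA, 16))  # part of the payload is eaten as well
```

Here the header is 16 bytes but only 12 characters, so counting
characters consumes the first four bytes of the payload too.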

>> I thought cut(1) was the tool of choice for extracting
>> headers and such things.

How do you use cut(1) to strip the first 512 bytes of a binary file?
It operates on lines.
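
To illustrate the difference, here is a rough Python model of a
per-line byte selection in the style of `cut -b N-` next to a plain
prefix strip. The function names and sample data are hypothetical, and
the model ignores details of real cut(1) implementations.

```python
# Hypothetical sample: a short "header" line followed by two body lines.
DATA = b"HEADER\nbody one\nbody two\n"


def strip_prefix(data: bytes, n: int) -> bytes:
    """Drop the first n bytes of the whole stream."""
    return data[n:]


def cut_from_byte(data: bytes, start: int) -> bytes:
    """Model of `cut -b start-`: keep bytes from position start
    (1-based) onward in every line, one output line per input line."""
    lines = data.split(b"\n")
    if lines and lines[-1] == b"":
        lines.pop()  # a trailing newline does not start another line
    return b"".join(line[start - 1:] + b"\n" for line in lines)


if __name__ == "__main__":
    print(strip_prefix(DATA, 4))
    print(cut_from_byte(DATA, 5))
```

The per-line selection trims every line rather than only the start of
the stream, so it is not a substitute for a byte offset into a file.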

> I think deviating from POSIX here is a bad idea. Every deviation from
> POSIX means that our tools cannot be used in another situation and
> pushes prospective users away. If the user wants characters instead of
> bytes we have tools to do that, don't surprise the user by doing
> something different than every other implementation.
>
> P.S. I too found -c confusing the first time I expected utf8
> characters, but remembering these tools were created with ascii in
> mind, I think of -c as char and it all works out...

Agreed.



Re: [hackers] [sbase] [PATCH 11/11] tail: Process bytes with -c option, and add -m option for runes

2016-12-27 Thread Evan Gates
On Tue, Dec 27, 2016 at 5:55 AM, Laslo Hunhold wrote:
> well-spotted! Still, it's _very_ counterintuitive to call the flag
> "-c". Instead of adding a non-portable m-flag, it would even sound
> better to me to add a b-flag for byte-offsets.
>
> It all depends on how many scripts rely on this behaviour. Can you give
> an example? I thought cut(1) was the tool of choice for extracting
> headers and such things.

I think deviating from POSIX here is a bad idea. Every deviation from
POSIX means that our tools cannot be used in another situation and
pushes prospective users away. If the user wants characters instead of
bytes we have tools to do that, don't surprise the user by doing
something different than every other implementation.

P.S. I too found -c confusing the first time I expected utf8
characters, but remembering these tools were created with ascii in
mind, I think of -c as char and it all works out...



Re: [hackers] [sbase] [PATCH 11/11] tail: Process bytes with -c option, and add -m option for runes

2016-12-27 Thread Laslo Hunhold
On Tue,  6 Dec 2016 02:17:03 -0800
Michael Forney wrote:

Hey Michael,

> POSIX says that -c specifies a number of bytes, not characters. This
> flag is commonly used by scripts that operate on binary files to do
> things like extract a header. Treating the offsets as character
> offsets will break things in mysterious ways.
> 
> Instead, add a -m option (chosen to match `wc -m`, which also operates
> on characters) to handle character offsets.
> ---
> I'm tempted to just delete the character functionality instead of
> introducing a new non-standard option. I can see the use of tail with
> codepoints, but we definitely need to make -c work on bytes so that we
> don't break scripts.
> 
> I'm also open to changing the option flag to something else. I just
> chose -m because that's what wc uses for characters.

well-spotted! Still, it's _very_ counterintuitive to call the flag
"-c". Instead of adding a non-portable m-flag, it would even sound
better to me to add a b-flag for byte-offsets.

It all depends on how many scripts rely on this behaviour. Can you give
an example? I thought cut(1) was the tool of choice for extracting
headers and such things.

Cheers

Laslo

-- 
Laslo Hunhold