Re: [hackers] [sbase] [PATCH 11/11] tail: Process bytes with -c option, and add -m option for runes
On Tue, 27 Dec 2016 18:03:24 -0800
Michael Forney wrote:

Hey Michael,

yes, the reasoning makes sense. Could you please update your patch so
it applies on git HEAD?

Cheers

Laslo

--
Laslo Hunhold
Re: [hackers] [sbase] [PATCH 11/11] tail: Process bytes with -c option, and add -m option for runes
On 12/27/16, Evan Gates wrote:
> On Tue, Dec 27, 2016 at 5:55 AM, Laslo Hunhold wrote:
>> well-spotted! Still, it's _very_ counterintuitive to call the flag
>> "-c". Instead of adding a non-portable m-flag, it would even sound
>> better to me to add a b-flag for byte-offsets.

Yes, it's a bit counter-intuitive, but conflicting with POSIX for this
alone seems like a really bad idea. I always consult POSIX when writing
shell scripts to ensure that they will run on any conforming system. If
sbase decided that the option character name was not the best choice,
then reasonable, valid, and portable scripts may start operating
unexpectedly with no indication as to why.

Also, wc(1) (even sbase's implementation) uses -c to refer to bytes,
and -m to refer to characters. It wouldn't be self-consistent to make
tail use -b for bytes and -c for characters. (Just to clarify, I also
think it would be a really bad idea to make wc use -b for bytes and -c
for characters).

>> It all depends on how many scripts rely on this behaviour. Can you give
>> an example?

Sure. gcc's build system uses tail to skip the first 16 bytes of the
binaries to check that stage2 and stage3 are the same. Granted, it does
use non-standard syntax tail +16c, and I don't know that there are any
bytes in there with the high bit set, but still, tail *does* get
invoked on binary files, and treating the byte offsets as characters
will break things in strange ways that are difficult to debug.

>> I thought cut(1) was the tool of choice for extracting
>> headers and such things.

How do you use cut(1) to strip the first 512 bytes of a binary file? It
operates on lines.

> I think deviating from POSIX here is a bad idea. Every deviation from
> POSIX means that our tools cannot be used in another situation and
> pushes prospective users away. If the user wants characters instead of
> bytes we have tools to do that, don't surprise the user by doing
> something different than every other implementation.
>
> P.S. I too found -c confusing the first time I expected utf8
> characters, but remembering these tools were created with ascii in
> mind, I think of -c as char and it all works out...

Agreed.
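
A small C sketch of the breakage described above, under stated
assumptions. skip_bytes and skip_chars are hypothetical helpers, not
code from sbase or from the patch, and the "character" model used here
(every 10xxxxxx byte is folded into the preceding character) is a
simplification rather than sbase's actual rune decoder. It only shows
that the two interpretations of an offset can land on different bytes
once binary data contains bytes with the high bit set.

```c
#include <stddef.h>

/* Byte offset reached after skipping the first n bytes. */
size_t
skip_bytes(size_t len, size_t n)
{
	return n < len ? n : len;
}

/*
 * Byte offset reached after skipping the first n "characters".
 * Assumption: any byte of the form 10xxxxxx is treated as a UTF-8
 * continuation byte and absorbed into the preceding character.
 */
size_t
skip_chars(const unsigned char *buf, size_t len, size_t n)
{
	size_t i = 0;

	while (i < len && n > 0) {
		i++; /* consume the lead byte */
		while (i < len && (buf[i] & 0xC0) == 0x80)
			i++; /* consume continuation bytes */
		n--;
	}
	return i;
}
```

For the 8-byte buffer 7f 45 4c 46 80 81 00 01, skipping 4 bytes stops
at offset 4, while skipping 4 "characters" under this model stops at
offset 6, so a comparison of the remainder would start two bytes late.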
Re: [hackers] [sbase] [PATCH 11/11] tail: Process bytes with -c option, and add -m option for runes
On Tue, Dec 27, 2016 at 5:55 AM, Laslo Hunhold wrote:
> well-spotted! Still, it's _very_ counterintuitive to call the flag
> "-c". Instead of adding a non-portable m-flag, it would even sound
> better to me to add a b-flag for byte-offsets.
>
> It all depends on how many scripts rely on this behaviour. Can you give
> an example? I thought cut(1) was the tool of choice for extracting
> headers and such things.

I think deviating from POSIX here is a bad idea. Every deviation from
POSIX means that our tools cannot be used in another situation and
pushes prospective users away. If the user wants characters instead of
bytes we have tools to do that, don't surprise the user by doing
something different than every other implementation.

P.S. I too found -c confusing the first time I expected utf8
characters, but remembering these tools were created with ascii in
mind, I think of -c as char and it all works out...
Re: [hackers] [sbase] [PATCH 11/11] tail: Process bytes with -c option, and add -m option for runes
On Tue, 6 Dec 2016 02:17:03 -0800
Michael Forney wrote:

Hey Michael,

> POSIX says that -c specifies a number of bytes, not characters. This
> flag is commonly used by scripts that operate on binary files to do
> things like extract a header. Treating the offsets as character
> offsets will break things in mysterious ways.
>
> Instead, add a -m option (chosen to match `wc -m`, which also operates
> on characters) to handle character offsets.
> ---
> I'm tempted to just delete the character functionality instead of
> introducing a new non-standard option. I can see the use of tail with
> codepoints, but we definitely need to make -c work on bytes so that we
> don't break scripts.
>
> I'm also open to changing the option flag to something else. I just
> chose -m because that's what wc uses for characters.

well-spotted! Still, it's _very_ counterintuitive to call the flag
"-c". Instead of adding a non-portable m-flag, it would even sound
better to me to add a b-flag for byte-offsets.

It all depends on how many scripts rely on this behaviour. Can you give
an example? I thought cut(1) was the tool of choice for extracting
headers and such things.

Cheers

Laslo

--
Laslo Hunhold
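
To make the byte/character distinction in the patch description
concrete, here is a minimal C sketch of where the last n bytes and the
last n UTF-8 characters of a buffer begin. It is not taken from the
patch or from sbase: tail_bytes_start and tail_chars_start are
hypothetical names, and the character version assumes the input is
well-formed UTF-8.

```c
#include <stddef.h>

/* Offset at which the last n bytes of a len-byte buffer begin. */
size_t
tail_bytes_start(size_t len, size_t n)
{
	return n >= len ? 0 : len - n;
}

/*
 * Offset at which the last n UTF-8 characters of buf begin.
 * Assumption: buf holds valid UTF-8, so every byte of the form
 * 10xxxxxx is a continuation byte and every other byte starts a
 * character.
 */
size_t
tail_chars_start(const unsigned char *buf, size_t len, size_t n)
{
	size_t i = len;

	while (i > 0 && n > 0) {
		i--;
		if ((buf[i] & 0xC0) != 0x80)
			n--; /* reached the lead byte of a character */
	}
	return i;
}
```

With the 6-byte string "naïve" (6e 61 c3 af 76 65) and n = 3, the byte
version starts at offset 3, in the middle of the two-byte "ï", while
the character version starts at offset 2, on its lead byte.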