On Tue, Jan 17, 2023 at 12:01:35AM +1100, Les Kitchen wrote:
> On Mon, Jan 16, 2023, at 21:42, Craig Sanders via luv-main wrote:
> > On Fri, Jan 13, 2023 at 10:39:02PM +1100, Les Kitchen wrote:
> >> I'd do something like:
> >>
> >> find /Dir1 -type f | perl -lne '$o=$_; s/\.junk\././; print("mv -i $o $_") 
> >> if $_ ne $o;'
>
> Thanks, Craig, for your followup.
>
> > This is quite dangerous, for several reasons.  To start with, there's no
> > protection against renaming files over existing files with the same target
> > name.
> ...
>
> Well, that's the intention of the -i (interactive) option to mv,
> to require user agreement before over-writing existing files.

Which gets tedious real fast with more than a few files to confirm.


> All the other points you raise are valid, especially the dangers of feeding
> unchecked input into the shell, and anybody writing shell code needs to be
> aware of them — although I will say I mentioned pretty much all of them in
> the notes further down in my message, though in less detail than you have,
> and without stressing enough the dangers.

Yeah, I noticed that - I just thought it needed to be emphasised and explained
in more detail. These issues are the source of a lot of really serious bugs in
shell scripts & one-liners.


> And, yes, if you have filenames with arbitrary characters, then you have to
> resort to other contrivances, ultimately to NULL-terminated strings.

Using NUL as the separator isn't a "contrivance".  It's standard good practice
- use a delimiter that ISN'T (and preferably CAN'T be) in your input.  Since
NUL is the only character that can't appear in a pathname, it's the only safe
choice.  It works whether you've got annoying characters in the filenames or
not.  No special cases or special handling required.  It just works.
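
For example, a minimal sketch assuming GNU find and coreutils (/Dir1 borrowed
from the example above):

    # -print0 emits NUL-terminated pathnames; the -0/-z options split on NUL
    find /Dir1 -type f -print0 | sort -z | xargs -0 -r ls -l --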


> And, yes, if you have a huge number of files, then you'd likely want to do
> the rename internal to your scripting language, instead of forking a new
> process for each file rename.  But then you lose the easy ability to review
> the commands before they're executed.

It's not difficult to change a print() statement to a rename() statement, or
to have both and comment out the rename until you've verified the output (i.e.
a simple "dry-run").

> And I could also mention the potential for filenames to contain UTF-8
> (or other encodings) for characters that just happen to look like ASCII
> characters, but aren't, or to contain terminal-control escape sequences.  It
> can get very weird.

While there's a handful of problematic Unicode characters (mostly the extra
whitespace characters), in general Unicode is not a problem.  Especially if
you use NUL and/or proper quoting and/or arrays - e.g. `find` in combination
with the bash/ksh/zsh builtin mapfile/readarray and process substitution is
extremely useful.  mapfile also supports NUL as the delimiter, another great
way of eliminating whitespace & quoting bugs.
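
For example, a minimal sketch (mapfile -d needs bash 4.4+; the quoted
"${files[@]}" expansion keeps each name intact, whitespace and all):

    # slurp NUL-terminated pathnames from find into a bash array
    mapfile -d '' -t files < <(find /Dir1 -type f -print0)

    printf '%s files found\n' "${#files[@]}"
    for f in "${files[@]}"; do
        ls -l -- "$f"
    done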

> In general, there's a big big difference between a simple shell one-liner
> that you use as a work amplifier in situations you know are well-behaved,
> and a piece of robust code that can behave gracefully no matter what weird
> input is thrown at it.  They're different use-cases.

It's not hard to write robust one-liners. It just takes practice - a matter
of developing good habits and stomping on bad habits until it's automatic.

And using tools like shellcheck to highlight common mistakes and bad
practices helps a lot - it's available as a command-line tool and as a
paste-your-code-here web service. https://www.shellcheck.net/

It's packaged for Debian and probably most other distros, and is, IMO,
essential for any shell user, even if (especially if!) you're just dabbling
with the simplest of shell scripts or one-liners.  I wish it had been around
when I was learning shell - I look at some of the shell code I wrote years ago
and just shudder at how awful it is.  I got better with practice, though :)
I made a lot of those mistakes because I simply didn't know they were
mistakes, didn't know how dangerous they were, didn't know any better at the
time.  shellcheck solves that problem.
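
Typical use is just `shellcheck yourscript.sh`.  Output along these lines
(illustrative only - SC2086 is its classic unquoted-variable warning):

    $ shellcheck myscript.sh

    In myscript.sh line 3:
    rm -f $tmpfile
          ^------^ SC2086: Double quote to prevent globbing and word splitting.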

Package: shellcheck
Description-en: lint tool for shell scripts
 The goals of ShellCheck are:
 .
  * To point out and clarify typical beginner's syntax issues,
    that causes a shell to give cryptic error messages.
 .
  * To point out and clarify typical intermediate level semantic problems,
    that causes a shell to behave strangely and counter-intuitively.
 .
  * To point out subtle caveats, corner cases and pitfalls, that may cause an
    advanced user's otherwise working script to fail under future circumstances.


Hastily written one-liners often lead to questions like "WTF happened to my
data?", "How can I reverse this 'sed -i' command I just ran?", and "Is it
possible to undelete files on ext4?"

> > Worse, it will break if any filenames contain whitespace characters 
> > (newlines,
> > tabs, spaces, etc - all of which are completely valid in filenames - the 
> > ONLY
> > characters guaranteed NOT to be in a pathname are / and NUL).
                                         ^^^^^^^^

Oops, I meant filename.  Pathnames can contain /, of course.  Neither
filenames nor paths can contain NUL.


> This should be taped to the screen of every shell user.

yep.

> Actually, files with spaces are pretty common in non-Unix environments, like
> Windows or MacOS (yes, I know it's Unix underneath), but they're pretty
> simple to handle by double-quoting, as I mentioned in my notes — and that
> will handle pretty much everything except for characters that interpolate
> into double-quoted strings, I guess $ ` (backtick), and possibly !.

Backticks have been deprecated for years.  Use $() for command substitution
instead.  Unlike backticks, $() can even be nested (and quotes inside a
nesting level are isolated from quotes outside it).

For example, the following will work:

var="$(grep "$(printf '%q' "$pattern" | tr "a-z" "A-Z")" input.txt)"

You can't do that with backticks.


> It's even messier than this.  Because you're already using single quotes for
> the -e expression to Perl, you can't immediately use them like that.  You
> have to do something like '"'"' to close the single-quoted string,

Or, much quicker to type (no shift key needed - 4 keystrokes instead of 7):

'\'' - i.e. end the single-quoted string, add a backslash-escaped single
quote, then start a new single-quoted string.
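
E.g. to get a literal single quote into a single-quoted -e expression (a
trivial made-up example):

    perl -e 'print "don'\''t panic\n"'    # prints: don't panic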


> then attach a double-quoted single quote, then open a new single-quoted
> string.  I don't even want to think about it.

Yeah, there's so much that can go wrong so easily with shell quoting, with
just one small lapse of focus.

> By then you might as well shift to using NULL-terminated strings.

Or a different language: perl or awk, for example, or C - almost any language
that isn't shell.  Even Python, if you don't mind a language that seems to
make a deliberate point of breaking compatibility with every minor release (I
like Python a lot, but it's so damn frustrating dealing with compatibility
issues all the time).

Shell's great at co-ordinating the execution of other programs to do "work"
(such as text or other data processing) - that's what it's for - but it's
lousy at actually doing that work itself.


> > For safety, if you were to DIY it with a command like yours above (there are
> > far better alternatives), you should use -print0 with find and the -0 option
> > with perl.

> > In fact, you should use NUL as the separator with ANY program dealing with
> > arbitrary filenames on stdin - most standard tools these days have -0 (or
> > -z or -Z) options for using NUL as the separator, including most of GNU
> > coreutils etc (head, tail, cut, sort, grep, sed, etc. For awk, you can use
> > BEGIN {RS="\0"} or similar).
>
> I'm in full agreement with all this, except that life's much easier if you
> know for sure that you're working only with well-behaved files — which is
> often the case — then you can use most of the standard utilities in their
> simple forms.

I have to disagree there.  It's not any harder to use NUL as the separator,
and it's always best practice to program defensively - assume the worst-case,
most maliciously constructed filenames and just deal with them.

In my experience, it's actually easier to write good code (proper quoting, NUL
separators, arrays, etc) in shell because doing so eliminates entire classes
of problems that you'd otherwise have to deal with, requiring more code to
handle all the special cases. Good habits make for good code.
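
E.g. a defensive bash version of the rename job that started this thread - a
sketch, assuming GNU mv for -n (no-clobber) and bash for the ${f/.../...}
substitution:

    find /Dir1 -type f -name '*.junk.*' -print0 |
    while IFS= read -r -d '' f; do
        new=${f/.junk./.}              # same substitution as the perl version
        [ "$f" = "$new" ] && continue  # nothing to do
        mv -n -- "$f" "$new"           # -n: never overwrite; --: dash-safe
    done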


> And I guess if you're worried, it's pretty straightforward to write
> (say) a find command that will go through some directory structure you
> plan to work on, and complain if it contains any ill-behaved file names.
> Alternatively, you could use the -path or -regex options to find to match
> only on well-behaved file-names for processing, or put a regular-expression
> match into the Perl expression.  And "well–behaved" depends on what
> possibilities you want to deal with.

True, you can do stuff like that.  But why bother when it's **easy** to use
NUL as the filename separator, and easy to properly quote your variables and
command-line args?  Not doing so is just more work and more code to write -
and then to rewrite when you realise it's futile to expect that filenames
won't have annoying characters in them.

It takes more energy to work around that - or even just complain about it -
than it takes to just deal with it properly.


> Really, though, my main point is that instead of writing code to do
> something immediately (which might be dangerous), you can write code to
> generate a simple list of shell commands, which you can review before
> executing (paged say through less if you have more than a screenful).

No argument there - generating shell code from perl or awk or another shell
script is something I've done since the 90s: pipe it into less (and nowadays
shellcheck too) to review, then run the generator again and pipe into sh or
bash to execute.

And, btw, using the bash builtin printf's %q format instead of echo or
printf's %s makes this much safer - from "help printf" in bash:

    %q        quote the argument in a way that can be reused as shell input

Unlike old tricks like using sed to add quoting around strings, this actually
works reliably no matter what's in the string.
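
E.g. the same rename job as a command generator - %q quotes each pathname so
the output is safe to review and then feed to a shell (mv -n rather than -i
here, because a prompting mv can't read the terminal when the script arrives
on bash's stdin):

    find /Dir1 -type f -name '*.junk.*' -print0 |
    while IFS= read -r -d '' f; do
        printf 'mv -n -- %q %q\n' "$f" "${f/.junk./.}"
    done | less    # review first; re-run with "| bash" to execute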


For anything even moderately complex though, or more than a screenful or so of
code, I'd rather use perl.

craig