On Mon, Dec 11, 2023 at 02:00:49PM +0000, Albretch Mueller wrote:
> Ach, yes! I forgot echo by default appends a new line character at
> the end of every string it spits out. In order to suppress it you need
> to use the "n" option: "echo -n ..."
>
> _FL_TYPE=" abc á é í ó ú ü ñ Á É Í Ó Ú Ü Ñ 123 birdie🐦here ¿ ¡ §
> ASCII ä ö ü ß Ä Ö Ü Text "
> echo "// __ \$_FL_TYPE: |${_FL_TYPE}|"
> _FL_TYPE=$(echo "${_FL_TYPE}" | xargs)
> echo "// __ \$_FL_TYPE: |${_FL_TYPE}|"
> _FL_TYPE=$(echo -n "${_FL_TYPE}" | tr --complement --squeeze-repeats
> '[A-Za-z0-9.]' '_');
> echo "// __ \$_FL_TYPE: |${_FL_TYPE}|"
>
> // __ $_FL_TYPE: | abc á é í ó ú ü ñ Á É Í Ó Ú Ü Ñ 123 birdie🐦here
> ¿ ¡ § ASCII ä ö ü ß Ä Ö Ü Text |
> // __ $_FL_TYPE: |abc á é í ó ú ü ñ Á É Í Ó Ú Ü Ñ 123 birdie🐦here ¿ ¡
> § ASCII ä ö ü ß Ä Ö Ü Text|
> // __ $_FL_TYPE: |abc_123_birdie_here_ASCII_Text|

OK. Tomas's analysis was better than mine in this case. Looks like CR
was not the issue this time around.

I do have some comments, though.

1) Many implementations of echo will interpret parts of their
   argument(s), in addition to processing options like -n. If you want
   to print a variable's contents to standard output without *any*
   interpretation, use printf.

    printf %s "$myvar"
    printf '%s\n' "$myvar"

2) As tomas already told you, the square brackets in

    tr -c -s '[A-Za-z0-9.]' _

   are literal. You're using a command which will keep left and right
   square brackets in the input, *not* replacing them with underscores.
   This may not be what you want.

3) In locales other than C or POSIX, ranges like A-Z are *not*
   necessarily synonyms for [:upper:]. As I've already mentioned, GNU
   tr is known to contain bugs, so you're getting lucky here. The bugs
   in GNU tr happen to work the way you're expecting, so that A-Z is
   treated like [:upper:] when it should not be. If at some point in
   the future GNU tr is fixed to conform to POSIX, your script may
   break.

   The correct tr command you should be using if you want to retain
   accented letters (as defined in your locale) is:

    tr -c -s '[:alnum:].' _

   If you want to discard accented letters, then either of these is OK:

    LC_COLLATE=C tr -c -s '[:alnum:].' _
    LC_COLLATE=C tr -c -s 'A-Za-z0-9.' _

4) The xargs command, which you used above, uses quotation mark
   characters as well as whitespace to define input words. Your example
   worked only because your input does not contain any single or double
   quotes.

Here's a demonstration of A-Z not equating to [:upper:] using GNU sed,
which is behaving correctly:

unicorn:~$ x=' abc á é í ó ú ü ñ Á É Í Ó Ú Ü Ñ 123 birdie🐦here ¿ '
unicorn:~$ printf '%s\n' "$x" | sed 's/[A-Z]//g'
abc á é í ó ú ü ñ 123 birdie🐦here ¿
unicorn:~$ printf '%s\n' "$x" | LC_COLLATE=C sed 's/[A-Z]//g'
abc á é í ó ú ü ñ Á É Í Ó Ú Ü Ñ 123 birdie🐦here ¿

The meaning of [A-Z] in the sed command depends on the locale.
In my locale, which is en_US.utf8, characters like Á are part of the
A-Z range. In the C locale, they aren't, as seen in the last command
above.

The use of [A-Z] in regular expressions and globs is a *very* heavily
debated topic, and I'm only scratching the surface here. Honestly, you
really should avoid using it. It's just too unpredictable.

Here's an example of xargs failing when its input contains a quote:

unicorn:~$ echo 'foo "bar' | xargs
xargs: unmatched double quote; by default quotes are special to xargs unless you use the -0 option
foo

You can't use xargs to normalize whitespace safely. In fact, the proper
way to normalize whitespace is...

unicorn:~$ printf 'foo "bar \t\t \t baz \n' | tr -s ' \t' ' '
foo "bar baz

Thus, we come full circle.
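
To pull the points above together, here is a minimal sketch of how the
original pipeline might be rewritten without echo or xargs. The function
names normalize_ws and sanitize are made up for illustration and do not
come from the thread. Two choices here are assumptions of this sketch:
it sets LC_ALL=C rather than only LC_COLLATE=C, so that tr works on
plain bytes regardless of the caller's locale, and it trims one leading
and one trailing space with parameter expansion to mimic what xargs was
being used for.

```shell
# Hypothetical helper: squeeze runs of spaces and tabs into one space
# with tr, then drop a single leading and trailing space.
normalize_ws() {
    s=$(printf '%s' "$1" | tr -s ' \t' ' ')
    s=${s# }
    s=${s% }
    printf '%s' "$s"
}

# Hypothetical helper: replace each run of bytes outside A-Za-z0-9.
# with one underscore. No square brackets in the set, so literal
# [ and ] in the input are replaced too. LC_ALL=C is this sketch's
# choice; it makes accented letters fall outside the ranges.
sanitize() {
    printf '%s' "$1" | LC_ALL=C tr -c -s 'A-Za-z0-9.' '_'
}

_FL_TYPE=' abc á é í ó ú ü ñ Á É Í Ó Ú Ü Ñ 123 birdie🐦here ¿ ¡ § ASCII ä ö ü ß Ä Ö Ü Text '
_FL_TYPE=$(sanitize "$(normalize_ws "$_FL_TYPE")")
printf '%s\n' "$_FL_TYPE"
```

Unlike the xargs version, normalize_ws passes quotation marks through
untouched, so input such as 'foo "bar' does not cause an error.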