On 2024-06-13 14:59, Jeremy Drake via Cygwin wrote:
> Backstory: rust's test suite makes an oddly-named directory as part of a
> test:
> https://github.com/rust-lang/rust/blob/921645c737f1d6d107a0a10ca5ee129d364dcd7a/tests/run-make/non-unicode-in-incremental-dir/rmake.rs
>
> When trying to clean up after a rust build/test with rm -rf, it results in
> a "Directory not empty" error.

I suggest trying the Rust uutils coreutils for the cleanup, since Rust created the directory:

        https://uutils.github.io/coreutils/
        https://github.com/uutils/coreutils

> Thankfully, this can be simply reproduced with the following two bash
> commands (on cygwin 3.5.3):
>
> mkdir -p foo/$'\uD800'
> rm -rf foo
>
> This fails with: rm: cannot remove 'foo': Directory not empty
> when it should succeed.

Whether it should succeed is questionable, as U+D800 is a reserved Unicode high surrogate: one half of a UTF-16 pair used to encode code points beyond the BMP, not a character in its own right.

$ mkdir -p foo/$'\uD800'
$ rm -rf  foo # /$'\uD800'
/bin/rm: cannot remove 'foo': Directory not empty
$ rm -rf  foo/$'\uD800'
removed directory 'foo/'$'\355\240\200'
$ rm -rf  foo
removed directory 'foo'
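
For reference, the bytes rm prints, \355\240\200 in octal, are just the generic three-byte UTF-8 pattern applied to U+D800, which strict UTF-8 does not allow for surrogates. A minimal sketch of that arithmetic in POSIX shell, where utf8_octal is a hypothetical helper name and not an existing tool:

```sh
# utf8_octal CODEPOINT: print the 3-byte UTF-8 pattern for a code point in
# U+0800..U+FFFF as octal escapes. Hypothetical helper for illustration only;
# it deliberately does not reject surrogates.
utf8_octal() {
    cp=$(( $1 ))
    if [ "$cp" -lt $(( 0x800 )) ] || [ "$cp" -gt $(( 0xFFFF )) ]; then
        echo "utf8_octal: only U+0800..U+FFFF handled" >&2
        return 1
    fi
    b1=$(( 0xE0 | (cp >> 12) ))          # lead byte:  1110xxxx
    b2=$(( 0x80 | ((cp >> 6) & 0x3F) ))  # trail byte: 10xxxxxx
    b3=$(( 0x80 | (cp & 0x3F) ))         # trail byte: 10xxxxxx
    printf '\\%o\\%o\\%o\n' "$b1" "$b2" "$b3"
}

utf8_octal 0xD800    # prints \355\240\200
```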

These reserved surrogate values should probably either be blocked, or encoded into the BMP or supplementary-plane PUAs at the file system interface layer, as is done for the Windows reserved characters, so they can be round-tripped.
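
To make the round-trip idea concrete, here is a rough sketch in POSIX shell. The base offset 0x100000 (mapping U+D800-DFFF onto U+10D800-10DFFF in the plane 16 PUA) and the names PUA_BASE, encode_cp and decode_cp are my assumptions for illustration only; this is not what Cygwin does today.

```sh
# ASSUMPTION: lone surrogates are shifted into the plane 16 PUA by a fixed
# offset. The offset and function names are hypothetical, not Cygwin's.
PUA_BASE=$(( 0x100000 ))

# encode_cp CODEPOINT: map a surrogate to its PUA stand-in, pass others through.
encode_cp() {
    cp=$(( $1 ))
    if [ "$cp" -ge $(( 0xD800 )) ] && [ "$cp" -le $(( 0xDFFF )) ]; then
        cp=$(( cp + PUA_BASE ))
    fi
    printf '0x%X\n' "$cp"
}

# decode_cp CODEPOINT: map a PUA stand-in back to the original surrogate.
decode_cp() {
    cp=$(( $1 ))
    if [ "$cp" -ge $(( PUA_BASE + 0xD800 )) ] && [ "$cp" -le $(( PUA_BASE + 0xDFFF )) ]; then
        cp=$(( cp - PUA_BASE ))
    fi
    printf '0x%X\n' "$cp"
}

encode_cp 0xD800                  # prints 0x10D800
decode_cp "$(encode_cp 0xD800)"   # prints 0xD800
```

A real implementation would also have to decide what to do with file names that genuinely contain code points in the stand-in range, as those would no longer round-trip.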

Reserved surrogate ranges are U+D800-DBFF (high) and U+DC00-DFFF (low).

Reserved noncharacters are U+FDD0-FDEF, plus the last two code points of the BMP, U+FFFE-FFFF, and of each supplementary plane, U+{1-10}FFFE-{1-10}FFFF (plane numbers in hex).

Allowed PUAs are U+E000-F8FF, U+F0000-FFFFD and U+100000-10FFFD.
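
The ranges above can be checked mechanically. A minimal sketch in POSIX shell, where classify_cp is a hypothetical helper name and the class labels are my own:

```sh
# classify_cp CODEPOINT: print which of the classes listed above a code point
# falls into. Accepts decimal or 0x-prefixed hex. Hypothetical helper.
classify_cp() {
    cp=$(( $1 ))
    if [ "$cp" -lt 0 ] || [ "$cp" -gt $(( 0x10FFFF )) ]; then
        echo out-of-range
    elif [ "$cp" -ge $(( 0xD800 )) ] && [ "$cp" -le $(( 0xDFFF )) ]; then
        echo surrogate
    elif [ "$cp" -ge $(( 0xFDD0 )) ] && [ "$cp" -le $(( 0xFDEF )) ]; then
        echo noncharacter
    elif [ $(( cp & 0xFFFE )) -eq $(( 0xFFFE )) ]; then
        # last two code points of every plane: U+xxFFFE and U+xxFFFF
        echo noncharacter
    elif [ "$cp" -ge $(( 0xE000 )) ] && [ "$cp" -le $(( 0xF8FF )) ]; then
        echo private-use
    elif [ "$cp" -ge $(( 0xF0000 )) ] && [ "$cp" -le $(( 0xFFFFD )) ]; then
        echo private-use
    elif [ "$cp" -ge $(( 0x100000 )) ] && [ "$cp" -le $(( 0x10FFFD )) ]; then
        echo private-use
    else
        echo other
    fi
}

classify_cp 0xD800     # prints surrogate
classify_cp 0x1FFFE    # prints noncharacter
classify_cp 0xF0000    # prints private-use
```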

Corinna? Opinions?

It would also be good to avoid the CSUR U+E000-E82F, U+F8A0-F8FF, U+F0000-F16AF:

        https://www.evertype.com/standards/csur/

and UCSUR U+E830-EDFF, U+F4C0-F4EF, U+F16B0-F1C9F, U+F1F00-F289F registry ranges:

        https://www.kreativekorp.com/ucsur/

as some fonts render these code points and some applications use them.

These registry folks are major contributors to Unicode standards, and these efforts bring order to supporting, managing, and using minority, minor historical or ancient, undeciphered, or constructed (e.g. Mormon, Shaw, Tolkien, Le Guin, Star Trek, Star Wars) language scripts or writing systems with glyphs not (yet) officially assigned in Unicode Standards.

--
Take care. Thanks, Brian Inglis              Calgary, Alberta, Canada

La perfection est atteinte                   Perfection is achieved
non pas lorsqu'il n'y a plus rien à ajouter  not when there is no more to add
mais lorsqu'il n'y a plus rien à retirer     but when there is no more to cut
                                -- Antoine de Saint-Exupéry

