Marko Rauhamaa <ma...@pacujo.net> writes:

> David Kastrup <d...@gnu.org>:
>> It's still irrelevant since split does not _use_ the existing file name
>> for constructing new file names.
>
> Split was just an example of a command that concatenates bytes sequences
> to get pathnames, nothing more.
>
> Such concatenation is commonplace in Linux programs of all kinds.
>
> And the point of bringing concatenation into the discussion was that
> remapping byte sequences to byte sequences breaks concatenation
> additivity:
>
>    U(x) + U(y) = U(x + y)

But Emacs' implementation doesn't in any respect "break concatenation
additivity". If you split an arbitrary byte stream (including material
invalid as UTF-8) at an arbitrary point (including in the middle of a
UTF-8 character), decode the resulting pieces as UTF-8 (as one of
several "reversible" encodings Emacs can interpret), concatenate the
resulting Emacs strings and reencode the result as UTF-8 (since you
actually need to provide a byte sequence to open(2) or similar), you
will retain the original byte stream. No ifs and buts.

The _decoded_ concatenated string might differ from the result of
decoding the unsplit byte string: at the concatenation point it might
contain "byte 0xc2, byte 0x80" (represented internally as 0xc1 0x82
0xc0 0x80) rather than "character 0x80" (represented internally as
0xc2 0x80). But the moment you use this concatenation of half-sequences
as a file name, it gets reencoded into the bytes 0xc2 and 0x80 and
works just fine.

-- 
David Kastrup
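
For anyone who wants to poke at the round trip described above without firing up Emacs, here is a minimal sketch in Python. Note the assumption: this is not Emacs' implementation. It uses Python's "surrogateescape" error handler, which is a different but analogous reversible scheme (undecodable bytes become lone surrogates instead of Emacs' raw-byte characters). The helper names decode, encode and roundtrip_split, and the sample bytes, are made up for illustration.

```python
# Sketch only: Python's surrogateescape stands in for Emacs' raw-byte
# characters. It is an analogous reversible decoding, not the Emacs code.

def decode(raw: bytes) -> str:
    # Bytes that are invalid as UTF-8 become lone surrogates U+DC80..U+DCFF.
    return raw.decode("utf-8", errors="surrogateescape")

def encode(text: str) -> bytes:
    # Lone surrogates produced by decode() turn back into the original bytes.
    return text.encode("utf-8", errors="surrogateescape")

def roundtrip_split(data: bytes, cut: int) -> bytes:
    # Split, decode each piece on its own, concatenate strings, reencode.
    return encode(decode(data[:cut]) + decode(data[cut:]))

# Hypothetical sample: valid UTF-8, the character 0x80, and invalid bytes.
data = b"caf\xc3\xa9 \xc2\x80 \xff\xfe"

# Try every cut point, including cuts in the middle of a UTF-8 sequence.
all_cuts_ok = all(roundtrip_split(data, i) == data
                  for i in range(len(data) + 1))

# The decoded strings may differ at the cut, the reencoded bytes do not.
whole = decode(b"\xc2\x80")                   # one character, U+0080
pieces = decode(b"\xc2") + decode(b"\x80")    # two escaped raw bytes
strings_differ = whole != pieces
bytes_agree = encode(whole) == encode(pieces) == b"\xc2\x80"

print(all_cuts_ok, strings_differ, bytes_agree)
```

The point of the last three lines is the same one made above: the decoded strings are not equal, but as soon as either is reencoded for use as a file name, the bytes are identical.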