Stefan Persson wrote:

> Let's say that I have two files, namely file1 & file2, in any Unicode
> encoding, both starting with a BOM, and I compile them into
> one by using
>
> cat file1 file2 > file3
>
> in Unix or
>
> copy file1 + file2 file3
>
> in MS-DOS, file3 will have the following contents:
>
> BOM
> contents from file1
> BOM
> contents from file2
>
> Is this in accordance with the Unicode standard, or do I have
> to remove the second BOM?

IMHO, Unicode should not specify such behavior. What a shell command is supposed to do is for the operating system to decide, not for a text encoding standard.

BTW, consider that both Unix "cat" and DOS "copy" are not limited to Unicode text files. Actually, they are not even limited to text files at all: you could use them to concatenate a bitmap with a font with an HTML document with a spreadsheet... Whether the result makes sense or not is up to you and/or to the applications that will process the resulting file.

Probably, there should be two separate commands (or different options of the same command): one to do a raw byte-by-byte concatenation, and one to do an encoding-aware concatenation of text files.

E.g., imagine a "cat" command with these extensions:

    Synopsis:
        cat [ -... ] [ -R encoding ] { [ -F encoding ] file }

    Description:
        ...
        If neither -R nor any -F is specified, the concatenation is
        done byte by byte.

    Options:
        ...
        -R  specifies the encoding of the resulting *text* file;
        -F  specifies the encoding of the following *text* file.

Your command above would now expand to something like this:

    cat -R UTF-16 -F UTF-16LE file1 -F Big-5 file2 > file3

Provided with information about the input encodings and the expected output encoding, "cat" could now correctly handle BOMs, endianness, new-line conventions, and even perform character set conversions. Without this extra info, "cat" would retain its good ol' byte-by-byte functionality.

Similar options could be added to any Unix command potentially dealing with text files ("cp", "head", "tail", etc.), as well as to their equivalents in DOS or other operating systems.

_ Marco
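
P.S. To make the idea a bit more concrete, here is a minimal sketch in Python of what the encoding-aware mode might do internally. It is only an illustration under stated assumptions: the function name `encoding_aware_cat` is hypothetical, the encoding names are Python's codec names rather than anything the proposed -R/-F options would define, and new-line conversion is left out entirely.

```python
import codecs


def encoding_aware_cat(inputs, out_encoding):
    """Concatenate text files given as (raw_bytes, encoding) pairs.

    Hypothetical helper: each pair plays the role of "-F encoding file",
    and out_encoding plays the role of "-R encoding".
    """
    parts = []
    for data, encoding in inputs:
        text = data.decode(encoding)
        # Some codecs (e.g. "utf-16-le", "utf-8") keep a leading BOM as
        # U+FEFF in the decoded text; drop it so it cannot end up mid-file.
        if text.startswith("\ufeff"):
            text = text[1:]
        parts.append(text)
    # The output codec decides whether a single BOM is written at the start
    # (e.g. "utf-16" and "utf-8-sig" write one, "utf-16-le" does not).
    return "".join(parts).encode(out_encoding)


if __name__ == "__main__":
    file1 = codecs.BOM_UTF16_LE + "first\n".encode("utf-16-le")
    file2 = "second\n".encode("big5")
    # Roughly: cat -R UTF-16 -F UTF-16LE file1 -F Big-5 file2 > file3
    file3 = encoding_aware_cat([(file1, "utf-16-le"), (file2, "big5")], "utf-16")
    print(file3.decode("utf-16"))
```

The point of the sketch is that each input is decoded to characters first, so a BOM is treated as a property of the encoded file and not as content; the raw byte-by-byte mode would instead carry the second BOM into the middle of file3.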