Is the binaryness/textness of a data format a property?
Hello Data Format Experts!

[Definition] Property: an attribute, quality, or characteristic of something.

JPEG is a binary data format. CSV is a text data format.

Question #1: Is the binaryness/textness of a data format a property?

Question #2: If the answer to Question #1 is yes, then what is the name of this binaryness/textness property?

Question #3: Here is another way of asking Question #2. Please fill in the following blanks with the property name (both blanks should be filled with the same thing):

    For the JPEG data format: _ = binary.
    For the CSV data format: _ = text.

/Roger
RE: Why do binary files contain text but text files don't contain binary?
Based on a private correspondence, I now realize that this statement:

> Text files do not contain binary

is not correct. Text files may indeed contain binary (i.e., bytes that are not interpretable as characters). Namely, text files may contain newlines, tabs, and some other invisible things.

Question: "characters" are defined as only the visible things, right?

I conclude: Binary files may contain arbitrary text. Text files may contain binary, but only a restricted set of binary. Do you agree?

/Roger

From: Costello, Roger L.
Sent: Friday, February 21, 2020 7:22 AM
To: unicode@unicode.org
Subject: Why do binary files contain text but text files don't contain binary?

Hi Folks,

There are binary files and there are text files. Binary files often contain portions that are text. For example, the start of Windows executable files is the text MZ. To the best of my knowledge, text files never contain binary, i.e., bytes that cannot be interpreted as characters. (Of course, text files may contain a text-encoding of binary, such as base64-encoded text.)

Why the asymmetry?

/Roger
Why do binary files contain text but text files don't contain binary?
Hi Folks,

There are binary files and there are text files. Binary files often contain portions that are text. For example, the start of Windows executable files is the text MZ. To the best of my knowledge, text files never contain binary, i.e., bytes that cannot be interpreted as characters. (Of course, text files may contain a text-encoding of binary, such as base64-encoded text.)

Why the asymmetry?

/Roger
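The "MZ" example can be checked at the byte level. A minimal sketch (Python; the header bytes are typical illustrative values, not taken from any particular executable) shows that a binary file's bytes may contain a textual portion even though the whole sequence is not interpretable as text:

```python
# A binary header may start with bytes that happen to be ASCII text ("MZ"),
# while other bytes in it (0x90) are not valid in a pure-ASCII text file.
header = b"MZ\x90\x00"             # illustrative start of a Windows EXE
print(header[:2].decode("ascii"))  # prints "MZ" -- the textual portion
try:
    header.decode("ascii")         # 0x90 cannot be interpreted as ASCII
except UnicodeDecodeError as e:
    print("not pure text:", e.reason)
```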
A neat description of encoding characters
From the book titled "Computer Power and Human Reason" by Joseph Weizenbaum, pp. 74-75:

Suppose that the alphabet with which we wish to concern ourselves consists of 256 distinct symbols. Imagine that we have a deck of 256 cards, each of which has a distinct symbol of our alphabet printed on it, and, of course, such that there corresponds one card to each symbol. How many questions that can be answered "yes" or "no" would one have to ask, given one card randomly selected from the deck, in order to be able to decide which character is printed on the card?

We can certainly make the decision by asking at most 256 questions. We can somehow order the symbols and begin by asking if it is the first in our ordering, e.g., "Is it an uppercase A?" If the answer is "no," then we ask if it is the second, and so on. But if our ordering is known both to ourselves and to our respondent, there is a much more economical way of organizing our questioning. We ask whether the character we are seeking is in the first half of the set. Whatever the answer, we will have isolated a set of 128 characters among which the character we seek resides. We again ask whether it is in the first half of that smaller set, and so on. Proceeding in this way, we are bound to discover what character is printed on the selected card by asking exactly eight questions.

We could have recorded the answers we received to our questions by writing "1" whenever the answer was "yes" and "0" whenever it was "no." That record would then consist of eight so-called bits, each of which is either "1" or "0". This eight-bit string is then an unambiguous representation of the character we are seeking. Moreover, each character of the whole set has a unique eight-bit representation within the same ordering.
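Weizenbaum's questioning procedure is exactly a binary search. A minimal sketch (Python; the identify function and the 256-symbol deck are illustrative, not from the book) shows that exactly eight yes/no answers pin down one card out of 256:

```python
# Illustrative sketch of Weizenbaum's eight-question game.
# "1" records the answer "yes, the card is in the first half of the
# remaining set"; "0" records "no".
def identify(card, deck):
    lo, hi = 0, len(deck)
    pos = deck.index(card)
    bits = ""
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if pos < mid:          # "Is it in the first half?"
            bits += "1"
            hi = mid
        else:
            bits += "0"
            lo = mid
    return bits

deck = [chr(i) for i in range(256)]  # an ordered 256-symbol alphabet
print(identify("A", deck))           # eight bits, unique to "A"
```

Since 256 = 2^8, the loop runs exactly eight times for every card, and distinct cards always produce distinct bit strings, just as the passage describes.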
Is the Unicode Standard "The foundation for all modern software and communications around the world"?
Hi Folks,

Today I received an email from the Unicode organization. The email said this (italics and yellow highlighting are mine):

    The Unicode Standard is the foundation for all modern software and communications around the world, including all modern operating systems, browsers, laptops, and smart phones, plus the Internet and Web (URLs, HTML, XML, CSS, JSON, etc.).

That is a remarkable statement! But is it entirely true? Isn't it assuming that everything is text? What about binary information such as JPEG, GIF, MPEG, and WAV; those are pretty core items to the Web, right? The Unicode Standard is silent about them, right? Isn't the above quote a bit misleading?

/Roger
Correct way to express in English that a string is encoded ... using UTF-8 ... with UTF-8 ... in UTF-8?
Hello Unicode experts!

Which is correct?

(a) The input file contains a string. The string is encoded using UTF-8.
(b) The input file contains a string. The string is encoded with UTF-8.
(c) The input file contains a string. The string is encoded in UTF-8.
(d) Something else (what?)

/Roger
Does "endian-ness" apply to UTF-8 characters that use multiple bytes?
Hello Unicode Experts!

As I understand it, endian-ness applies to multi-byte words. Endian-ness does not apply to ASCII characters because each character is a single byte. Endian-ness does apply to UTF-16BE (Big-Endian), UTF-16LE (Little-Endian), UTF-32BE, and UTF-32LE because each character uses multiple bytes.

Clearly endian-ness does not apply to single-byte UTF-8 characters. But what about UTF-8 characters that use multiple bytes, such as the character é, which uses two bytes, C3 and A9; does endian-ness apply? For example, if a file is in Little Endian, would the character é appear in a hex editor as A9 C3, whereas if the file is in Big Endian the character é would appear in a hex editor as C3 A9?

/Roger
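One way to examine the question empirically: UTF-8 fixes its own byte order as part of the encoding, so the two bytes of é always appear in the same order, while the UTF-16 family comes in BE/LE variants that reorder the bytes within each code unit. A minimal sketch (Python):

```python
# UTF-8 byte order is defined by the encoding itself; only the UTF-16/32
# families have BE/LE variants that swap bytes within a code unit.
s = "\u00e9"                            # é, U+00E9
print(s.encode("utf-8").hex(" "))       # c3 a9 -- always this order
print(s.encode("utf-16-be").hex(" "))   # 00 e9
print(s.encode("utf-16-le").hex(" "))   # e9 00
```

So a hex editor shows C3 A9 for é in any UTF-8 file, regardless of the machine that wrote it.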
RE: Base64 encoding applied to different unicode texts always yields different base64 texts ... true or false?
Hi Folks,

Thank you for your outstanding responses! Below is a summary of what I learned. Are there any errors in the summary? Is there anything you would add? Please let me know of anything that is not clear.

/Roger

1. While base64 encoding is usually applied to binary, it is also sometimes applied to text, such as Unicode text. Note: Since base64 encoding may be applied to both binary and text, in the following bullets I use the more generic term "data". For example, "Data d is base64-encoded to yield ..."

2. Neither base64 encoding nor decoding should presume any special knowledge of the meaning of the data or do anything extra based on that presumption. For example, converting Unicode text to and from base64 should not perform any sort of Unicode normalization, convert between UTFs, insert or remove BOMs, etc. This is like saying that converting a JPEG image to and from base64 should not resize or rescale the image, change its color depth, convert it to another graphic format, etc. If you use base64 for encoding MIME content (e.g. emails), the base64 decoding will not transform the content. The email parser must ensure that the content is valid, so the parser might have to transform the content (possibly replacing some invalid sequences or truncating), and then apply Unicode normalization to render the text. These transforms are part of the MIME application and are independent of whether you use base64 or any other encoding or transport syntax.

3. If data d is different than d', then the base64 text resulting from encoding d is different than the base64 text resulting from encoding d'.

4. If base64 text t is different than t', then the data resulting from decoding t is different than the data resulting from decoding t'.

5. For every data d there is exactly one base64 encoding t.

6. Every base64 text t is an encoding of exactly one data d.

7. For all data d, Base64_Decode[Base64_Encode[d]] = d
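Points 2, 3, and 7 can be exercised directly with Python's base64 module (a sketch; the café strings are arbitrary sample data):

```python
import base64

# Two byte sequences that render as the same text ("café") but differ as
# data: base64 must keep them distinct and must not normalize them.
d1 = "caf\u00e9".encode("utf-8")    # ends in C3 A9 (precomposed é)
d2 = "cafe\u0301".encode("utf-8")   # ends in CC 81 (e + combining acute)
t1 = base64.b64encode(d1)
t2 = base64.b64encode(d2)
assert t1 != t2                     # point 3: different d gives different t
assert base64.b64decode(t1) == d1   # point 7: Base64_Decode[Base64_Encode[d]] = d
assert base64.b64decode(t2) == d2   # point 2: no normalization happened
print(t1, t2)
```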
Base64 encoding applied to different unicode texts always yields different base64 texts ... true or false?
Hi Unicode Experts,

Suppose base64 encoding is applied to data m to yield base64 text t. Next, suppose base64 encoding is applied to data m' to yield base64 text t'. If m is not equal to m', then t will not equal t'. In other words, given different inputs, base64 encoding always yields different base64 texts. True or false?

How about the opposite direction: if m is base64-encoded to yield t and then t is base64-decoded to yield n, will it always be the case that m equals n?

/Roger
RE: Split a UTF-8 multi-octet sequence such that it cannot be unambiguously restored?
Hi Folks,

Thank you very much for your fantastic comments! Below I summarized the issue and your comments. At the bottom is a set of proposed requirements (for my clients) on applications that receive iCalendar files. Some questions:

- Have I captured all your comments? Any more comments?
- Are the proposed requirements sensible? Any more requirements?

/Roger

Issue: Folding and unfolding content lines in iCalendar files

The iCalendar specification [RFC 5545] says that a content line should not be longer than 75 octets:

    Lines of text SHOULD NOT be longer than 75 octets, excluding the line break.

The RFC says that long lines should be folded:

    Long content lines SHOULD be split into a multiple line representations using a line "folding" technique. That is, a long line can be split between any two characters by inserting a CRLF immediately followed by a single linear white-space character (i.e., SPACE or HTAB).

The RFC says that, when parsing a content line, folded lines must first be unfolded:

    When parsing a content line, folded lines MUST first be unfolded.

using this technique:

    Unfolding is accomplished by removing the CRLF and the linear white-space character that immediately follows.

The RFC acknowledges that some implementations might do folding in the middle of a multi-octet sequence:

    Note: It is possible for very simple implementations to generate improperly folded lines in the middle of a UTF-8 multi-octet sequence. For this reason, implementations need to unfold lines in such a way to properly restore the original sequence.

Here is an example of folding in the middle of a UTF-8 multi-octet sequence: The iCalendar file contains the Yen sign (U+00A5), which is represented by the byte sequence 0xC2 0xA5 in UTF-8. The content line containing the Yen sign is folded in the middle of the two bytes. The result is 0xC2 0x0D 0x0A 0x20 0xA5, which isn't valid UTF-8 any longer.

Proposed requirements on the behavior of applications that receive iCalendar files:

1. (Bug) The receiving application does not recognize that it has received an iCalendar file.

2. (Bug) The sending application performs the folding process - inserts CRLF plus white-space characters - and the receiving application does the unfolding process but doesn't properly delete all of them.

3. (Non-conformant behavior) The receiving application, after folding and before unfolding, attempts to interpret the partial UTF-8 sequences and convert them into replacement characters or worse.
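The Yen-sign scenario can be reproduced at the octet level, along with the byte-level unfolding that restores the original sequence (a sketch in Python; the PRICE content line is invented sample data, not from RFC 5545):

```python
# Fold an iCalendar content line in the middle of the UTF-8 sequence for
# the Yen sign (C2 A5), then unfold by deleting CRLF + one WSP octet.
line = "PRICE:\u00a5100".encode("utf-8")  # contains ... C2 A5 ...
i = line.index(b"\xc2") + 1               # split between C2 and A5
folded = line[:i] + b"\r\n " + line[i:]   # yields ... C2 0D 0A 20 A5 ...
assert b"\xc2\r\n \xa5" in folded         # not valid UTF-8 at this point
unfolded = folded.replace(b"\r\n ", b"").replace(b"\r\n\t", b"")
assert unfolded == line                   # original octet sequence restored
print(unfolded.decode("utf-8"))           # prints PRICE:¥100
```

This illustrates why unfolding has to operate on octets before any UTF-8 decoding: an application that decodes first hits the invalid sequence and may substitute replacement characters, which is exactly the non-conformant behavior described in item 3.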
Split a UTF-8 multi-octet sequence such that it cannot be unambiguously restored?
Hello Unicode Experts!

Suppose an application splits a UTF-8 multi-octet sequence. The application then sends the split sequence to a client. The client must restore the original sequence.

Question: is it possible to split a UTF-8 multi-octet sequence in such a way that the client cannot unambiguously restore the original sequence?

Here is the source of my question: The iCalendar specification [RFC 5545] says that long lines must be folded:

    Long content lines SHOULD be split into a multiple line representations using a line "folding" technique. That is, a long line can be split between any two characters by inserting a CRLF immediately followed by a single linear white-space character (i.e., SPACE or HTAB).

The RFC says that, when parsing a content line, folded lines must first be unfolded using this technique:

    Unfolding is accomplished by removing the CRLF and the linear white-space character that immediately follows.

The RFC acknowledges that simple implementations might generate improperly folded lines:

    Note: It is possible for very simple implementations to generate improperly folded lines in the middle of a UTF-8 multi-octet sequence. For this reason, implementations need to unfold lines in such a way to properly restore the original sequence.

Can you provide an example of folding a UTF-8 multi-octet sequence such that there is no unambiguous way to restore the original sequence?

/Roger