Is the binaryness/textness of a data format a property?

2020-03-20 Thread Costello, Roger L. via Unicode
Hello Data Format Experts!

[Definition] Property: an attribute, quality, or characteristic of something.

JPEG is a binary data format.
CSV is a text data format.

Question #1: Is the binaryness/textness of a data format a property? 

Question #2: If the answer to Question #1 is yes, then what is the name of this 
binaryness/textness property?

Question #3: Here is another way of asking Question #2: Please fill in the 
following blanks with the property name (both blanks should be filled with the 
same thing):

For the JPEG data format:  _ = binary.
For the CSV data format:  _ = text. 

/Roger



RE: Why do binary files contain text but text files don't contain binary?

2020-02-21 Thread Costello, Roger L. via Unicode
Based on a private correspondence, I now realize that this statement:



> Text files do not contain binary



is not correct.



Text files may indeed contain binary (i.e., bytes that are not interpretable as 
characters). Namely, text files may contain newlines, tabs, and some other 
invisible things.



Question: "characters" are defined as only the visible things, right?



I conclude:



Binary files may contain arbitrary text.

Text files may contain binary, but only a restricted set of binary.



Do you agree?



/Roger


From: Costello, Roger L. 
Sent: Friday, February 21, 2020 7:22 AM
To: unicode@unicode.org
Subject: Why do binary files contain text but text files don't contain binary?

Hi Folks,

There are binary files and there are text files.

Binary files often contain portions that are text. For example, the start of 
Windows executable files is the text MZ.

To the best of my knowledge, text files never contain binary, i.e., bytes that 
cannot be interpreted as characters. (Of course, text files may contain a 
text-encoding of binary, such as base64-encoded text.)

Why the asymmetry?

/Roger


Why do binary files contain text but text files don't contain binary?

2020-02-21 Thread Costello, Roger L. via Unicode
Hi Folks,

There are binary files and there are text files.

Binary files often contain portions that are text. For example, the start of 
Windows executable files is the text MZ.

To the best of my knowledge, text files never contain binary, i.e., bytes that 
cannot be interpreted as characters. (Of course, text files may contain a 
text-encoding of binary, such as base64-encoded text.)
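
The distinction drawn above can be sketched in Python. This is only an illustration: the sample bytes after the leading "MZ" are made up for the example, and UTF-8 is assumed as the text encoding.

```python
import base64

# Hypothetical sample: only the leading "MZ" comes from the post above;
# the remaining bytes are invented for illustration.
blob = b"MZ\x90\x00\xff\xfe"

# The first two bytes are interpretable as the ASCII text "MZ".
prefix = blob[:2].decode("ascii")

# The blob as a whole is not valid UTF-8 (assumed encoding), so decoding fails.
try:
    blob.decode("utf-8")
    decodable = True
except UnicodeDecodeError:
    decodable = False

# A text encoding of the same bytes: base64 output is plain ASCII.
as_text = base64.b64encode(blob).decode("ascii")
restored = base64.b64decode(as_text)
```

Here `prefix` holds the text portion, `decodable` records that the raw bytes are not text under the assumed encoding, and `as_text` is the kind of base64 text a text file could carry.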

Why the asymmetry?

/Roger


A neat description of encoding characters

2019-12-02 Thread Costello, Roger L. via Unicode
From the book titled "Computer Power and Human Reason" by Joseph Weizenbaum,
p.74-75

Suppose that the alphabet with which we wish to concern ourselves consists of 
256 distinct symbols. Imagine that we have a deck of 256 cards, each of which 
has a distinct symbol of our alphabet printed on it, and, of course, such that 
there corresponds one card to each symbol. How many questions that can be 
answered "yes" or "no" would one have to ask, given one card randomly selected 
from the deck, in order to be able to decide which character is printed on the 
card? We can certainly make the decision by asking at most 256 questions. We 
can somehow order the symbols and begin by asking if it is the first in our 
ordering, e.g., "Is it an uppercase A?" If the answer is "no," then we ask if 
it is the second, and so on. But if our ordering is known both to ourselves and 
to our respondent, there is a much more economical way of organizing our 
questioning. We ask whether the character we are seeking is in the first half 
of the set. Whatever the answer, we will have isolated a set of 128
characters among which the character we seek resides. We again ask whether 
it is in the first half of that smaller set, and so on. Proceeding in this way, 
we are bound to discover what character is printed on the selected card by 
asking exactly eight questions. We could have recorded the answers we received 
to our questions by writing "1" whenever the answer was "yes" and "0" whenever 
it was "no." That record would then consist of eight so-called bits each of 
which is either "1" or "0". This eight-bit string is then an unambiguous 
representation of the character we are seeking. Moreover, each character of the 
whole set has a unique eight-bit representation within the same ordering. 
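
The halving procedure in the passage can be sketched in Python. This is a minimal illustration, not anything from the book: the function name `encode_by_halving` is made up, and the 256-symbol "alphabet" is assumed to be the code points 0 through 255 in numeric order.

```python
def encode_by_halving(symbols, target):
    """Record the yes/no answers ("1"/"0") that isolate target in symbols."""
    position = symbols.index(target)   # where the card sits in the agreed ordering
    lo, hi = 0, len(symbols)           # candidates are symbols[lo:hi]
    answers = []
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if position < mid:             # "Is it in the first half?" -> yes
            answers.append("1")
            hi = mid
        else:                          # -> no, so it is in the second half
            answers.append("0")
            lo = mid
    return "".join(answers)

# Assumed ordering: the 256 code points 0..255.
alphabet = [chr(i) for i in range(256)]
code = encode_by_halving(alphabet, "A")
all_codes = {encode_by_halving(alphabet, s) for s in alphabet}
```

With this ordering every symbol takes exactly eight questions, and `all_codes` contains 256 distinct eight-bit strings. Because "yes" is written as "1", the string for a symbol is the bitwise complement of its position in ordinary binary notation.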



Is the Unicode Standard "The foundation for all modern software and communications around the world"?

2019-11-19 Thread Costello, Roger L. via Unicode
Hi Folks,

Today I received an email from the Unicode organization. The email said this: 
(italics and yellow highlighting are mine)

The Unicode Standard is the foundation for all modern software and 
communications around the world, including all modern operating systems, 
browsers, laptops, and smart phones, plus the Internet and Web (URLs, HTML, XML, 
CSS, JSON, etc.).

That is a remarkable statement! But is it entirely true? Isn't it assuming that 
everything is text? What about binary information such as JPEG, GIF, MPEG, WAV; 
those are pretty core items to the Web, right? The Unicode Standard is silent 
about them, right? Isn't the above quote a bit misleading?

/Roger


Correct way to express in English that a string is encoded ... using UTF-8 ... with UTF-8 ... in UTF-8?

2019-05-15 Thread Costello, Roger L. via Unicode
Hello Unicode experts!

Which is correct:

(a) The input file contains a string. The string is encoded using UTF-8.

(b) The input file contains a string. The string is encoded with UTF-8.

(c) The input file contains a string. The string is encoded in UTF-8.

(d) Something else (what?)

/Roger



Does "endian-ness" apply to UTF-8 characters that use multiple bytes?

2019-02-04 Thread Costello, Roger L. via Unicode
Hello Unicode Experts!

As I understand it, endian-ness applies to multi-byte words.

Endian-ness does not apply to ASCII characters because each character is a 
single byte.

Endian-ness does apply to UTF-16BE (Big-Endian), UTF-16LE (Little-Endian), 
UTF-32BE and UTF-32LE because each character uses multiple bytes.

Clearly endian-ness does not apply to single-byte UTF-8 characters. But what 
about UTF-8 characters that use multiple bytes, such as the character é, which 
uses two bytes C3 and A9; does endian-ness apply? For example, if a file is in 
Little Endian would the character é appear in a hex editor as A9 C3 whereas if 
the file is in Big Endian the character é would appear in a hex editor as C3 A9?
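
One way to look at the bytes directly is a short Python sketch like the one below. The helper name `hex_bytes` is made up for the example. UTF-8 is defined as a sequence of single bytes, so its byte order is fixed by the encoding itself; the UTF-16 variants are included for comparison.

```python
def hex_bytes(data: bytes) -> str:
    """Format bytes the way a hex editor would show them."""
    return " ".join(f"{b:02X}" for b in data)

ch = "\u00e9"  # the character é

utf8 = ch.encode("utf-8")         # one fixed byte sequence, no BE/LE variants
utf16be = ch.encode("utf-16-be")  # 16-bit code unit, most significant byte first
utf16le = ch.encode("utf-16-le")  # same code unit, least significant byte first

for name, data in [("UTF-8", utf8), ("UTF-16BE", utf16be), ("UTF-16LE", utf16le)]:
    print(f"{name:9} {hex_bytes(data)}")
```

Python's codec registry has no "utf-8-le" or "utf-8-be" codec, and the UTF-8 bytes for é come out as C3 A9 on every platform, whereas the two UTF-16 encodings differ only in the order of the same two bytes.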

/Roger



RE: Base64 encoding applied to different unicode texts always yields different base64 texts ... true or false?

2018-10-13 Thread Costello, Roger L. via Unicode
Hi Folks,

Thank you for your outstanding responses! 

Below is a summary of what I learned. Are there any errors in the summary? Is 
there anything you would add? Please let me know of anything that is not clear. 
  /Roger

1. While base64 encoding is usually applied to binary, it is also sometimes 
applied to text, such as Unicode text.

Note: Since base64 encoding may be applied to both binary and text, in the 
following bullets I use the more generic term "data". For example, "Data d is 
base64-encoded to yield ..."

2. Neither base64 encoding nor decoding should presume any special knowledge of 
the meaning of the data or do anything extra based on that presumption. 

For example, converting Unicode text to and from base64 should not perform any 
sort of Unicode normalization, convert between UTFs, insert or remove BOMs, 
etc. This is like saying that converting a JPEG image to and from base64 should 
not resize or rescale the image, change its color depth, convert it to another 
graphic format, etc.

If you use base64 for encoding MIME content (e.g. emails), the base64 decoding 
will not transform the content. The email parser must ensure that the content 
is valid, so the parser might have to transform the content (possibly replacing 
some invalid sequences or truncating), and then apply Unicode normalization to 
render the text. These transforms are part of the MIME application and are 
independent of whether you use base64 or any other encoding or transport 
syntax.

3. If data d is different than d', then the base64 text resulting from encoding 
d is different than the base64 text resulting from encoding d'.

4. If base64 text t is different than t', then the data resulting from decoding 
t is different than the data resulting from decoding t'.

5. For every data d there is exactly one base64 encoding t.

6. Every base64 text t is an encoding of exactly one data d.

7. For all data d, Base64_Decode[Base64_Encode[d]] = d
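
A few of the points above can be spot-checked with Python's standard `base64` module. This is a sketch on two sample inputs chosen for the example, not a proof: the inputs are two canonically equivalent spellings of é that differ at the byte level.

```python
import base64

# Two Unicode texts that normalize to the same string but differ as bytes
# (sample inputs chosen for illustration).
d1 = "\u00e9".encode("utf-8")    # precomposed é: C3 A9
d2 = "e\u0301".encode("utf-8")   # e + combining acute accent: 65 CC 81

t1 = base64.b64encode(d1)
t2 = base64.b64encode(d2)

# Point 3: different data yield different base64 text.
different_text = t1 != t2

# Points 2 and 7: decoding returns the original bytes, with no normalization
# or other transformation applied along the way.
round_trip_ok = base64.b64decode(t1) == d1 and base64.b64decode(t2) == d2
```

Because base64 works on bytes, the two spellings of é stay distinct through the round trip; any normalization would have to be a separate step in the application.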



Base64 encoding applied to different unicode texts always yields different base64 texts ... true or false?

2018-10-12 Thread Costello, Roger L. via Unicode
Hi Unicode Experts,

Suppose base64 encoding is applied to m to yield base64 text t. 

Next, suppose base64 encoding is applied to m' to yield base64 text t'.

If m is not equal to m', then t will not equal t'.

In other words, given different inputs, base64 encoding always yields different 
base64 texts.

True or false?

How about the opposite direction: If m is base64 encoded to yield t and then t 
is base64 decoded to yield n, will it always be the case that m equals n?

/Roger



RE: Split a UTF-8 multi-octet sequence such that it cannot be unambiguously restored?

2017-07-24 Thread Costello, Roger L. via Unicode
Hi Folks,

Thank you very much for your fantastic comments!

Below I summarized the issue and your comments. At the bottom is a set of 
proposed requirements (for my clients) on applications that receive iCalendar 
files.

Some questions:
 
- Have I captured all your comments? Any more comments?
- Are the proposed requirements sensible? Any more requirements? 

/Roger

Issue: Folding and unfolding content lines in iCalendar files

The iCalendar specification [RFC 5545] says that a content line should not be 
longer than 75 octets:

Lines of text SHOULD NOT be longer
than 75 octets, excluding the line break.
 
The RFC says that long lines should be folded:

Long content lines SHOULD be split
into a multiple line representations
using a line "folding" technique.
That is, a long line can be split between
any two characters by inserting a CRLF
immediately followed by a single linear
white-space character (i.e., SPACE or HTAB).

The RFC says that, when parsing a content line, folded lines must first be 
unfolded:

When parsing a content line, folded lines MUST
first be unfolded. 

using this technique:

Unfolding is accomplished by removing the
CRLF and the linear white-space character
that immediately follows. 

The RFC acknowledges that some implementations might do folding in the middle 
of a multi-octet sequence:

Note: It is possible for very simple
implementations to generate improperly
folded lines in the middle of a UTF-8
multi-octet sequence.  For this reason,
implementations need to unfold lines
in such a way to properly restore the
original sequence. 

Here is an example of folding in the middle of a UTF-8 multi-octet sequence: 

The iCalendar file contains the Yen sign (U+00A5), which is represented by the 
byte sequence 0xC2 0xA5 in UTF-8. The content line containing the Yen sign is 
folded in the middle of the two bytes. The result is 0xC2 0x0D 0x0A 0x20 0xA5, 
which isn't valid UTF-8 any longer.
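
The Yen sign example can be reproduced with a short Python sketch. The `fold` and `unfold` helpers below are hypothetical and deliberately naive: they work on raw octets, and the fold width of 9 is chosen only so that the split lands between 0xC2 and 0xA5 (a real implementation would fold at 75 octets).

```python
def fold(line: bytes, width: int) -> bytes:
    """Naive octet-based folding: insert CRLF + SPACE every `width` octets."""
    pieces = [line[i:i + width] for i in range(0, len(line), width)]
    return b"\r\n ".join(pieces)

def unfold(data: bytes) -> bytes:
    """Remove each CRLF together with the SPACE or HTAB that follows it."""
    return data.replace(b"\r\n ", b"").replace(b"\r\n\t", b"")

original = "SUMMARY:\u00a5".encode("utf-8")   # ends with C2 A5 (Yen sign)
folded = fold(original, width=9)              # split falls between C2 and A5

# Decoding before unfolding fails: C2 0D is not a valid UTF-8 sequence.
try:
    folded.decode("utf-8")
    decodes_while_folded = True
except UnicodeDecodeError:
    decodes_while_folded = False

# Unfolding on octets first, then decoding, restores the original text.
restored = unfold(folded).decode("utf-8")
```

In this sketch, unfolding at the octet level before any character decoding is what restores the original sequence.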

Proposed requirements on the behavior of applications that receive iCalendar 
files:

1. (Bug) The receiving application does not recognize that it has received an 
iCalendar file.

2. (Bug) The sending application performs the folding process - inserts CRLF 
plus white space characters - and the receiving application does the unfolding 
process but doesn't properly delete all of them.

3. (Non-conformant behavior) The receiving application, after folding and 
before unfolding, attempts to interpret the partial UTF-8 sequences and convert 
them into replacement characters or worse.



Split a UTF-8 multi-octet sequence such that it cannot be unambiguously restored?

2017-07-24 Thread Costello, Roger L. via Unicode
Hello Unicode Experts!

Suppose an application splits a UTF-8 multi-octet sequence. The application 
then sends the split sequence to a client. The client must restore the original 
sequence. 

Question: is it possible to split a UTF-8 multi-octet sequence in such a way 
that the client cannot unambiguously restore the original sequence?

Here is the source of my question:

The iCalendar specification [RFC 5545] says that long lines should be folded:

Long content lines SHOULD be split
into a multiple line representations
using a line "folding" technique.
That is, a long line can be split between
any two characters by inserting a CRLF
immediately followed by a single linear
white-space character (i.e., SPACE or HTAB).

The RFC says that, when parsing a content line, folded lines must first be 
unfolded using this technique:

Unfolding is accomplished by removing
the CRLF and the linear white-space
character that immediately follows.

The RFC acknowledges that simple implementations might generate improperly 
folded lines:

Note: It is possible for very simple
implementations to generate improperly
folded lines in the middle of a UTF-8
multi-octet sequence.  For this reason,
implementations need to unfold lines
in such a way to properly restore the
original sequence.

Can you provide an example of folding a UTF-8 multi-octet sequence such that 
there is no unambiguous way to restore the original sequence? 

/Roger