On 7/15/2018 7:37 AM, Marko Rauhamaa wrote:

> One of the classic Unix and Internet tenets is that text is bytes is
> text.

Tenets of a faith may be wrong ;-). An informatic paradigm from more than 45 years ago may be outdated and in need of revision.

On byte storage and on the Internet, **everything** is (encoded) bytes, so saying 'text is bytes' says nothing because it is trivially true. On the other hand, 'bytes is text' is wrong unless one uses a character encoding that assigns a visible character (including <space>) to every byte. I believe both PCs and Macs had one or more such encodings. (I am only uncertain as to whether b'\x00' was mapped.)
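A minimal sketch of how one might check this in Python 3 for a few of the standard codecs. The helper names decodes_every_byte and printable_count are mine, and str.isprintable() is only a rough stand-in for 'visible character'. Python's codecs also need not match what the old PC and Mac display hardware actually drew for each byte, so this says something about the codecs, not about those machines.

```python
def decodes_every_byte(encoding):
    """Return True if each of the 256 byte values decodes on its own."""
    for value in range(256):
        try:
            bytes([value]).decode(encoding)
        except UnicodeDecodeError:
            return False
    return True


def printable_count(encoding):
    """Count byte values that decode on their own to a printable character.

    str.isprintable() is used as an approximation of 'visible'; it treats
    the ASCII space as printable but control characters as not.
    """
    count = 0
    for value in range(256):
        try:
            text = bytes([value]).decode(encoding)
        except UnicodeDecodeError:
            continue  # this byte is not text at all in this encoding
        if text.isprintable():
            count += 1
    return count


for name in ("ascii", "latin-1", "cp437", "utf-8"):
    print(name, decodes_every_byte(name), printable_count(name))
```

A codec can accept all 256 byte values and still fall short of 'bytes is text' in the sense above, because some of the resulting characters are control characters rather than visible ones.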

Images are bytes as much as text is. I suggest that 'bytes is image' is more true than 'bytes is text'. Every byte can be mapped, for instance, into an 8 x 1 or 1 x 8 pixel image after deciding which end gets the high and low bits. Bit mapping is likely older than Unix. Bar codes and QR codes are commonplace as international machine-readable images of bytes.
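To make the 8 x 1 mapping concrete, here is a small sketch. The function names byte_to_pixels and render_row are hypothetical, and the msb_first flag is my way of expressing the 'which end gets the high and low bits' decision.

```python
def byte_to_pixels(value, msb_first=True):
    """Map one byte to a row of 8 pixels, each 0 (off) or 1 (on)."""
    if not 0 <= value <= 255:
        raise ValueError("value must fit in one byte")
    # Extract bits from the high bit (shift 7) down to the low bit (shift 0).
    bits = [(value >> shift) & 1 for shift in range(7, -1, -1)]
    return bits if msb_first else bits[::-1]


def render_row(value, msb_first=True):
    """Draw the 8 x 1 image as text: '#' for an on pixel, '.' for off."""
    return "".join("#" if bit else "." for bit in byte_to_pixels(value, msb_first))


for byte in b"AZ\x00\xff":
    print(f"{byte:#04x}", render_row(byte))
```

Unlike decoding to text, this mapping is defined for every one of the 256 byte values, whichever bit order is chosen.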

In a context where 'everything is bytes', then 'bytes is everything' or 'bytes can be anything' are the proper reverses.

> Of course, much of it was naïve, but UTF-8 has miraculously given
> it a new life.

UTF-8 makes 'bytes is text' even less true. Not only are some leading bytes not text, but some byte sequences are illegal. Bytes are not UTF-8 text. As n increases, the probability that a string of n random bytes is valid UTF-8 text approaches 0, and it does so faster than the probability that the same bytes are Latin-1 text.
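A rough way to see the UTF-8 half of this is to count, exhaustively, how many of the 256**n byte strings of length n decode as UTF-8. This sketch assumes Python 3's strict utf-8 codec as the definition of 'valid'; the names is_valid_utf8 and count_valid_utf8 are mine. It says nothing by itself about Latin-1, since Python's latin-1 codec accepts every byte and the comparison depends on which Latin-1 characters one counts as text.

```python
from itertools import product


def is_valid_utf8(data):
    """Return True if data decodes under Python's strict utf-8 codec."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def count_valid_utf8(n):
    """Count the byte strings of length n that are valid UTF-8.

    Exhaustive, so only practical for very small n (256**n candidates).
    """
    return sum(
        is_valid_utf8(bytes(combo)) for combo in product(range(256), repeat=n)
    )


for n in (1, 2):
    valid = count_valid_utf8(n)
    print(n, valid, valid / 256 ** n)
```

For n = 1 only the 128 ASCII bytes qualify, so half of all single bytes are already rejected, and the valid fraction is smaller again for n = 2.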

--
Terry Jan Reedy


--
https://mail.python.org/mailman/listinfo/python-list
