On Mon, Sep 2, 2019 at 9:56 PM Steven D'Aprano <st...@pearwood.info> wrote:
>
> On Sun, Sep 01, 2019 at 12:24:24PM +1000, Chris Angelico wrote:
>
> > Older versions of Python had text and bytes be the same things.
>
> Whether a string object is *text* is a semantic question, and
> independent of what data format you use. 'Hello world!' is text, whether
> you are using Python 1.5 or Python 3.8. '\x01\x06\x13\0' is not text,
> whether you are using Python 1.5 or Python 3.8.

Okay, so "string" and "text" are completely different concepts. Hold
that thought.

> > That means that, for backward compatibility, they have some common
> > methods. But does that really mean that bytes can be uppercased?
>
> I'm curious what you think that b'chris angelico'.upper() is doing, if
> it is not uppercasing the byte-string b'chris angelico'. Is it a mere
> accident that the result happens to be b'CHRIS ANGELICO'?
>
> Unicode strings are sequences of code-points, abstract integers between
> 0 and 1114111 inclusive. When you uppercase the Unicode string 'chris
> angelico', you're transforming the sequence of integers:
>
> U+0063,0068,0072,0069,0073,0020,0061,006e,0067,0065,006c,0069,0063,006f
>
> to this sequence of integers:
>
> U+0043,0048,0052,0049,0053,0020,0041,004e,0047,0045,004c,0049,0043,004f
>
> If you are prepared to call that "uppercasing", you should be prepared
> to do the same for the byte-string equivalent.
>
> (For the avoidance of doubt: this is independent of the encoding used to
> store those code points in memory or on disk. Encodings have nothing to
> do with this.)

No, they're not decoded. What happens is an *assumption* that certain
bytes represent uppercaseable characters, and others do not. I
specifically chose my example such that the corresponding code points
both represented letters, and that the uppercased versions of each
land inside the first 256 Unicode codepoints; yet uppercasing the
bytestring changes one and not the other. Is it uppercasing the number
0x61 to create the number 0x41? No, it's assuming that it means "a"
and uppercasing it to "A".

> The formal definition of a string is a sequence of symbols from an
> alphabet. That is precisely what bytes objects are: the alphabet in this
> case is the 8-bit numbers 0 to 255 inclusive, which for usefulness,
> convenience and backwards compatibility can be optionally interpreted as
> the 7-bit ASCII character set plus another 128 abstract "characters".

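A small sketch of the upper() behaviour being argued about, using the
two bytes 0xE7 and 0x61 (the b"\xe7\x61" value that comes up again
further down). Reading those bytes as Latin-1 is only an assumption
made for illustration; nothing in the bytes object itself says which
encoding, if any, applies.

```python
# Illustrative sketch: the Latin-1 interpretation is an assumption.
raw = b"\xe7\x61"               # the integers 0xE7 and 0x61
text = raw.decode("latin-1")    # U+00E7 (c with cedilla) followed by 'a'

# bytes.upper() only touches bytes in the ASCII range a-z.
upper_bytes = raw.upper()

# str.upper() applies Unicode case mapping to every code point.
upper_text = text.upper()

print(upper_bytes)                   # b'\xe7A'  (0xE7 is left alone)
print(upper_text.encode("latin-1"))  # b'\xc7A'  (both letters change)
```

So the bytes method treats 0x61 as "a" but makes no attempt to treat
0xE7 as a letter, while the str method uppercases both.
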
> > > I said they were *strings*. Strings are not necessarily text, although
> > > they often are. Formally, a string is a finite sequence of symbols that
> > > are chosen from a set called an alphabet. See:
> > >
> > > https://en.wikipedia.org/wiki/String_%28computer_science%29
> >
> > A finite sequence of symbols... you mean like a list of integers
> > within the range [0, 255]? Nothing in that formal definition says that
> > a "string" of anything other than characters should be meaningfully
> > treated as text.
>
> Sure. If your bytes don't represent text, then methods like upper()
> probably won't do anything meaningful. It's still a string though.

I specifically said a *list* of integers. Like what you'd get if you
call list() on a bytestring. There's nothing in the formal definition
you gave that precludes this from being considered a string, yet it is
somehow, by your own words, fundamentally different.

> > > > I don't think it's necessary to be too adamant about "must be some
> > > > sort of thing-we-call-string" here. Let practicality rule, since
> > > > purity has already waved a white flag at us.
> > >
> > > It is because of *practicality* that we should prefer that things that
> > > look similar should be similar. Code is read far more often than it is
> > > written, and if you read two pieces of code that look similar, we should
> > > strongly prefer that they should actually be similar.
> >
> > And you have yet to prove that this similarity is actually a thing.
>
> I'm not sure the onus is on me to prove this. "Status quo wins a
> stalemate." And surely the onus is on those proposing the new syntax to
> demonstrate that it will be fine to use string delimiters as function
> calls.

Actually it is, because YOU are the one who said that quoted strings
should be restricted to "string-like" things. Would a Path literal be
sufficiently string-like to be blessed with double quotes? A regex
literal? An IP header, represented as a bytestring?
What's a string and what's not? Why are you trying to draw a line?

> You could make a good start by finding other languages, reasonably
> conventional languages with syntax based on the Algol or C tradition,
> that use quotes '' or "" to return arbitrary types.

I gave an example wherein a list/array is represented as
";foo;bar;quux" - does that count? (VX-REXX, if you're curious.)

> Anyway, the bottom line is this:
>
> I have no objection to using prefixed quotes to represent Unicode
> strings, or byte strings, or Andrew's hypothetical UTF-16 strings, or
> EBCDIC strings, or TRON strings.
>
> https://en.wikipedia.org/wiki/TRON_(encoding)
>
> But I think that any API that would allow z"..." to represent (let's
> say) a socket, or a float, or a HTTP_Server instance, or a list, would
> be a deeply flawed API.

What if it represents a "connectable endpoint"? Is that a string? It'd
be kinda like a pathlib.Path but with a bit more flexibility, allowing
it to define a variety of information including the method of
connection and perhaps some credentials. IOW a URI.

> > Let's look at regular expressions. JavaScript has a syntax for them
> > involving leading and trailing slashes, borrowed from Perl, but I
> > can't figure out whether a regex is a first-class object in Perl. So
> > you can do something like this:
> >
> > > findme = /spa*m/
> > > "This has spaaaaaam in it".match(findme)
> > [ 'spaaaaaam', index: 9, input: 'This has spaaaaaam in it' ]
> >
> > In Python, I can do the exact same thing, only using double quotes as
> > the delimiter.
> >
> > >>> re.search("spa*m", "This has spaaaaam in it")
> > <re.Match object; span=(9, 17), match='spaaaaam'>
>
> Sure. As a convenience, the re module has functions which accepts
> regular expression patterns as well as compiled regular expression
> objects.

Exactly. To the re module, strings and compiled regexes are
interchangeable.

> > So what do you mean by "non-string" exactly?
> > In what way is a regular expression "not a string",
>
> That question is ambiguous. Are you asking about regular expression
> patterns, or regular expression objects?

Both at once. We're discussing the possibility of a "regex literal"
concept that may or may not use double quotes. To most human beings, a
regular expression IS a text string. Is a compiled regex allowed to
have a literal form that uses double quotes, based on your definition
of "string-like"? YOU are the one who is trying to draw a line in the
sand here.

> > yet the byte-encoded form of an integer somehow is?
>
> If your bytes represent an integer, then uppercasing them isn't
> meaningful. If your bytes represent ASCII text then uppercasing them
> may be meaningful.

Right, but even if they represent an integer, you're fine with them
using double quotes. Or am I mistaken here, and you would prefer to
see it represented as bytes((0xe7, 0x61))?

> > Yet when you encode the string as bytes, it gains an upper() method,
> > and when you encode a regex as a compiled regex object, it loses one.
> > Why do you insist that a regex is somehow not a string, but
> > b"\xe7\x61" is?
>
> Because a byte-string matches the definition of strings, while compiled
> regex objects do not.

And [0xe7, 0x61] also matches the definition of a string.

ChrisA
_______________________________________________
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-le...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at https://mail.python.org/archives/list/python-ideas@python.org/message/JB67DYD5BOFE4YG3SDCMY4XXNAVOVZ3Q/
Code of Conduct: http://python.org/psf/codeofconduct/