> On Oct 27, 2019, at 05:38, Steven D'Aprano <st...@pearwood.info> wrote:
>
>> On Sun, Oct 27, 2019 at 12:10:22AM -0700, Andrew Barnert via Python-ideas
>> wrote:
>>
>> If you redesign your find, re.search, etc. APIs to not return
>> character indexes, then I think you can get away with not having
>> character-indexable strings.
>
> If string.index(c) doesn't return the index of c in string, then what
> does it return?
>
> I think you are conflating the public API based on characters (to be
> precise: code points) for some underlying implementation based on bytes.

No, what I’m doing is avoiding conflating the public API based on characters with the underlying representation based on code points, and treating code points as no more fundamental than code units. You can still iterate the code points if you want to, because that’s occasionally useful. And you can also iterate the UTF-8 code units, because that’s also occasionally useful.

> Given zero-based indexing, and the string:
>
> "abÇÐεф"
>
> the index of "ф" better damn well be 5 rather than 8 (UTF-8), 10
> (UTF-16) or 20 (UTF-32) or I'll be knocking on the API designer's door
> with a pitchfork and a flaming torch *wink*

Really? Even if the string is in NFD, as it would be if this were, say, the name of a file on a standard Mac file system? Then that Ç character is stored as the code point U+0043 followed by the code point U+0327, rather than the single code point U+00C7. So had it still better be 5, not 6? If so, Python 3 is broken, and always has been; where’s your pitchfork?

And what were you going to do with that 5 anyway that it has to be an integer? Without a use case, you’re just demanding infinite flexibility regardless of what the cost might be.

You _could_ make this work by building a grapheme cluster index at construction time for every string, or by storing strings as an array of grapheme clusters that are themselves arrays of code points rather than as a flat array, or by normalizing every string at construction time. But do you actually want to do any of those things, or is guaranteeing 5 rather than 6 there not worth the cost?

Also, have you ever used seek and tell on a text file? What do you think tell gives you? According to the language spec, it could be anything and you have to treat it as an abstract index; I think in current CPython it’s a byte index. Where’s your pitchfork there?

> And returning <AbstractIndex object at 0xb7ce1bf0> is even worse.

Why? That object can be used to index/slice/start finding at/etc.

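As a quick check of the 5-versus-6 point, here is a minimal sketch using only the stdlib unicodedata module. The escapes just spell out the example string so that the starting (precomposed) form is unambiguous; the variable names are illustrative.

```python
import unicodedata

# "abÇÐεф" written with escapes so the literal is unambiguously precomposed.
s = "ab\u00c7\u00d0\u03b5\u0444"
nfd = unicodedata.normalize("NFD", s)  # Ç becomes U+0043 followed by U+0327

target = "\u0444"  # ф

# Code point indexes, which is what str.index reports today.
composed_index = s.index(target)      # 5
decomposed_index = nfd.index(target)  # 6

# Byte offsets of ф in the precomposed string, per encoding.
prefix = s[:composed_index]
offsets = {
    enc: len(prefix.encode(enc))
    for enc in ("utf-8", "utf-16-le", "utf-32-le")
}

print(composed_index, decomposed_index)  # 5 6
print(offsets)  # {'utf-8': 8, 'utf-16-le': 10, 'utf-32-le': 20}
```

So str.index already answers 6 rather than 5 for the decomposed form of what a reader would call the same string.
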
I suggested earlier that it could also have attributes that give you the integer character, code unit (byte), and, if you really want it, code point index. If you have a use for one of those, you use the one you have a use for. If not, why do you need it to be equal to any of those three integers, much less the least useful of them?

If you’re just concerned about the REPL, then it can be <CharIndex(5) at 0xb7ce1bf0>, or even something eval-able like CharIndex(chars=5, units=6, bytes=10). Which isn’t as nice as a number I can just spot a few lines back and retype (as I mentioned before, this is occasionally an annoyance when dealing with Swift), but that’s a tradeoff that allows you to see the number 5 that you’re insisting you’d better be able to get even though you can’t actually use the number 5.

> Strings might not be implemented as an array of characters. They could
> be a rope, a linked list, a piece table, a gap buffer, or something
> else. The public API which operates on code points should not depend on
> the implementation. Regardless of how your string is implemented, it is
> conceptually a sequential array of N code points indexed from 0 to N-1.

If you want a public API that’s independent of implementation, where a string could be a linked list, then you want a public API that doesn’t include indexing. If your language comes with fundamental builtin types where the [] operator takes linear time, then your language doesn’t have a [] operator like Python’s, or C++’s or most other languages with the same syntax; it has something that looks misleadingly like [] in other languages but has to be used differently.

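To make the index-object idea a bit more concrete, here is a rough sketch of what such a thing could look like. Everything in it is hypothetical: CharIndex, find and slice_from are not existing Python APIs, the field names are illustrative, and the character count uses a crude stand-in for real grapheme cluster segmentation (it only folds combining marks into the preceding character).

```python
import unicodedata
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CharIndex:
    """Hypothetical opaque index; not an existing Python API."""
    chars: int   # user-perceived characters before this position (approximate)
    points: int  # code points before this position
    units: int   # UTF-8 code units (bytes) before this position


def find(s: str, sub: str) -> Optional[CharIndex]:
    """Like str.find, but returns a CharIndex, or None if sub is absent."""
    points = s.find(sub)
    if points < 0:
        return None
    prefix = s[:points]
    # Assumption: a code point with a nonzero combining class belongs to the
    # preceding character. Real segmentation rules are more involved.
    chars = sum(1 for ch in prefix if unicodedata.combining(ch) == 0)
    return CharIndex(chars=chars, points=points,
                     units=len(prefix.encode("utf-8")))


def slice_from(s: str, idx: CharIndex) -> str:
    """Slice from the index without the caller choosing an integer."""
    return s[idx.points:]


# The example string in decomposed form.
nfd = unicodedata.normalize("NFD", "ab\u00c7\u00d0\u03b5\u0444")
idx = find(nfd, "\u0444")
print(idx)  # CharIndex(chars=5, points=6, units=9)
print(slice_from(nfd, idx) == "\u0444")  # True
```

The repr is eval-able and shows the 5 for anyone who wants to see it, while code that slices or searches only ever passes the object back in.
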
_______________________________________________
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-le...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at https://mail.python.org/archives/list/python-ideas@python.org/message/MHOYCKINBLZKEITIAQVDP46U2RTWJ7US/
Code of Conduct: http://python.org/psf/codeofconduct/