> On Oct 27, 2019, at 05:38, Steven D'Aprano <st...@pearwood.info> wrote:
> 
>> On Sun, Oct 27, 2019 at 12:10:22AM -0700, Andrew Barnert via Python-ideas 
>> wrote:
>> 
>> If you redesign your find, re.search, etc. APIs to not return 
>> character indexes, then I think you can get away with not having 
>> character-indexable strings.
> 
> If string.index(c) doesn't return the index of c in string, then what 
> does it return?
> 
> I think you are conflating the public API based on characters (to be 
> precise: code points) for some underlying implementation based on bytes. 

No, what I’m doing is avoiding conflating the public API based on characters 
with the underlying representation based on code points, and treating code 
points as no more fundamental than the code units.

You can still iterate the code points if you want to, because that’s 
occasionally useful. And you can also iterate the UTF-8 code units, because 
that’s also occasionally useful.
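
As a small illustration using today's Python (where str exposes code points 
and bytes exposes UTF-8 code units), both views of the same string are one 
call away. The string is written with escapes only so that its normalization 
is unambiguous:

```python
s = "ab\u00C7\u00D0\u03B5\u0444"  # "abÇÐεф", precomposed (NFC)

# Iterating a str yields code points (as length-1 strings).
code_points = [f"U+{ord(c):04X}" for c in s]

# Iterating the UTF-8 encoding yields code units (bytes, as ints).
code_units = list(s.encode("utf-8"))

print(len(code_points), code_points)
print(len(code_units), code_units)
```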

> Given zero-based indexing, and the string:
> 
>    "abÇÐεф"
> 
> the index of "ф" better damn well be 5 rather than 8 (UTF-8), 10 
> (UTF-16) or 20 (UTF-32) or I'll be knocking on the API designer's door 
> with a pitchfork and a flaming torch *wink*

Really? Even if the string is in NFD, as it would be if this were, say, the 
name of a file on a standard Mac file system? Then that Ç character is stored 
as the code point U+0043 followed by the code point U+0327, rather than the 
single code point U+00C7. So had it still better be 5, not 6? If so, Python 3 
is broken, and always has been; where’s your pitchfork?
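
To make that concrete, here is a minimal check with the stdlib unicodedata 
module. It only shows what str.index reports today for the composed and 
decomposed forms of the same text (NFD and NFKD happen to agree for this 
particular string):

```python
import unicodedata

s_nfc = "ab\u00C7\u00D0\u03B5\u0444"          # "abÇÐεф", precomposed
s_nfd = unicodedata.normalize("NFD", s_nfc)   # Ç becomes U+0043 U+0327

print(s_nfc.index("\u0444"))  # 5
print(s_nfd.index("\u0444"))  # 6
print(s_nfc == s_nfd)         # False: same text to a reader, different code points
```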

And what were you going to do with that 5 anyway that it has to be an integer? 
Without a use case, you’re just demanding infinite flexibility regardless of 
what the cost might be. You _could_ make this work by building a grapheme 
cluster index at construction time for every string, or by storing strings as 
an array of grapheme clusters that are themselves arrays of code points rather 
than as a flat array, or by normalizing every string at construction time. But 
do you actually want to do any of those things, or is guaranteeing 5 rather 
than 6 there not worth the cost?

Also, have you ever used seek and tell on a text file? What do you think tell 
gives you? According to the language spec, it could be anything and you have to 
treat it as an abstract index; I think in current CPython it’s a byte index. 
Where’s your pitchfork there?
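
A small sketch of that usage pattern, with an in-memory file standing in for a 
real one. The value tell returns is deliberately not interpreted here; the only 
thing the code relies on is handing it back to seek:

```python
import io

raw = io.BytesIO("ab\u00C7\u00D0\u03B5\u0444".encode("utf-8"))
f = io.TextIOWrapper(raw, encoding="utf-8")

f.read(5)           # consume five code points: a, b, Ç, Ð, ε
cookie = f.tell()   # opaque position; do not assume it equals 5
rest = f.read()     # everything after that position

f.seek(cookie)      # the supported use: give the value back to seek()
assert f.read() == rest
print(cookie, repr(rest))
```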

> And returning <AbstractIndex object at 0xb7ce1bf0> is even worse.

Why?

That object can be used to index/slice/start finding at/etc.

I suggested earlier that it could also have attributes that give you the 
integer character, code unit (byte), and, if you really want it, code point 
index. If you have a use for one of those, you use the one you have a use for. 
If not, why do you need it to be equal to any of those three integers, much 
less the least useful of them?
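
Here is a rough sketch of what I mean. Everything in it (CharIndex, find, 
slice_from, the field names) is hypothetical and made up for illustration, not 
an existing or proposed API, and the "character" count is a crude 
approximation that just skips combining marks rather than doing real grapheme 
cluster segmentation:

```python
import unicodedata
from dataclasses import dataclass


@dataclass(frozen=True)
class CharIndex:
    """Hypothetical abstract index; not an existing Python type."""
    chars: int        # index in (approximate) characters
    code_points: int  # index in code points, i.e. today's str index
    utf8_units: int   # index in UTF-8 code units (bytes)


def find(s: str, needle: str) -> CharIndex:
    """Like str.index, but returns a CharIndex instead of a bare int."""
    cp = s.index(needle)  # raises ValueError if absent, like str.index
    prefix = s[:cp]
    # Assumption: a code point starts a new character unless it is a
    # combining mark. Real segmentation would follow UAX #29.
    chars = sum(1 for c in prefix if unicodedata.combining(c) == 0)
    return CharIndex(chars, cp, len(prefix.encode("utf-8")))


def slice_from(s: str, idx: CharIndex) -> str:
    """Use the index without the caller touching its integer fields."""
    return s[idx.code_points:]


nfd = unicodedata.normalize("NFD", "ab\u00C7\u00D0\u03B5\u0444")
idx = find(nfd, "\u0444")
print(idx)
print(slice_from(nfd, idx))
```

The caller passes idx around and only reaches for one of the integers when 
there is an actual use for that particular integer.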

If you’re just concerned about the REPL, then it can be <CharIndex(5) at 
0xb7ce1bf0>, or even something eval-able like CharIndex(chars=5, units=6, 
bytes=10). Which isn’t as nice as a number I can just spot a few lines back and 
retype (as I mentioned before, this is occasionally an annoyance when dealing 
with Swift), but that’s a tradeoff that allows you to see the number 5 that 
you’re insisting you’d better be able to get even though you can’t actually use 
the number 5.

> Strings might not be implemented as an array of characters. They could 
> be a rope, a linked list, a piece table, a gap buffer, or something 
> else. The public API which operates on code points should not depend on 
> the implementation. Regardless of how your string is implemented, it is 
> conceptually a sequential array of N code points indexed from 0 to N-1.

If you want a public API that’s independent of implementation, where a string 
could be a linked list, then you want a public API that doesn’t include 
indexing. If your language comes with fundamental builtin types where the [] 
operator takes linear time, then your language doesn’t have a [] operator like 
Python’s, or C++’s, or that of most other languages with the same syntax; it 
has something that looks misleadingly like [] in other languages but has to be 
used differently.
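
For illustration only, this is roughly what code point indexing costs when the 
storage is plain UTF-8 with no side index. The helper name code_point_at is 
made up, and it assumes the input is valid UTF-8:

```python
def code_point_at(utf8: bytes, n: int) -> str:
    """Return the n-th code point of valid UTF-8 data by scanning from the start."""
    seen = 0
    for pos, byte in enumerate(utf8):
        # Continuation bytes look like 0b10xxxxxx; any other byte starts
        # a new code point.
        if byte & 0xC0 != 0x80:
            if seen == n:
                end = pos + 1
                while end < len(utf8) and utf8[end] & 0xC0 == 0x80:
                    end += 1
                return utf8[pos:end].decode("utf-8")
            seen += 1
    raise IndexError(n)


data = "ab\u00C7\u00D0\u03B5\u0444".encode("utf-8")
print(code_point_at(data, 5))
```

A loop that calls this for every i in range(n) is quadratic, which is exactly 
the kind of surprise a []-shaped spelling hides.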

_______________________________________________
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-le...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/MHOYCKINBLZKEITIAQVDP46U2RTWJ7US/
Code of Conduct: http://python.org/psf/codeofconduct/
