On Wed, Jun 04, 2014 at 11:17:18AM +1000, Steven D'Aprano wrote:

> There is a discussion over at MicroPython about the internal
> representation of Unicode strings. Micropython is aimed at embedded
> devices, and so minimizing memory use is important, possibly even
> more important than performance.

[...]

Wow! I'm amazed at the response here. I expected a fairly brief "Yes"
or "No", not this long thread.

Here is a summary (as best as I am able) of a few points which I think
are important:

(1) I asked if it would be okay for MicroPython to *optionally* use
nominally Unicode strings limited to ASCII. Pretty much the only
response to this has been Guido saying "That would be a pretty lousy
option", and since nobody has really defended the suggestion, I think
we can assume that it's off the table.

(2) I asked if it would be okay for µPy to use a UTF-8 implementation
even though it would lead to O(N) indexing operations instead of O(1).
There's been some opposition to this, including Guido's:

    Then again the UTF-8 option would be pretty devastating
    too for anything manipulating strings (especially since
    many Python APIs are defined using indexes, e.g. the re
    module).

but unless Guido wants to say different, I think the consensus is that
a UTF-8 implementation is allowed, even at the cost of O(N) indexing
operations. Trading time for memory is allowed, assuming that UTF-8
does save memory, which I think is an assumption and not proven.

(3) It seems to me that there's been a lot of theorizing about which
implementation will obviously be more efficient. Folks, how about some
benchmarks before making claims about code efficiency? :-)

(4) Similarly, there have been many suggestions, more suited in my
opinion to python-ideas or even python-list, for ways to implement
O(1) indexing on top of UTF-8. Some of them involve per-string mutable
state (e.g. the last index seen), or complicated int subclasses that
need to know what string they come from. Remember your Zen please:

    Simple is better than complex.
    Complex is better than complicated.
    ...
    If the implementation is hard to explain, it's a bad idea.

(5) I'm not convinced that UTF-8 internally is *necessarily* more
efficient, but look forward to seeing the result of benchmarks.
The rationale for internal UTF-8 is that any other internal encoding
will be inefficient, since those strings will need to be transcoded to
UTF-8 before they can be written or printed, so keeping them as UTF-8
in the first place saves the transcoding step. Well, yes, but many
strings may never be written out:

    print(prefix + s[1:].strip().lower().center(80) + suffix)

creates five strings that are never written out and one that is. So if
the internal encoding of strings is more efficient than UTF-8, and
most of them never need transcoding to UTF-8, a non-UTF-8 internal
format might be a nett win.

So I'm looking forward to seeing the results of µPy's experiments with
it.

Thanks to all who have commented.


-- 
Steven
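
P.S. For anyone wondering where the O(N) in point (2) comes from, here
is a minimal sketch of looking up the i-th code point in a UTF-8 byte
buffer. The function name utf8_index is made up for illustration; this
is not MicroPython's code, and it says nothing about how fast a real
implementation would be. It only shows that, without extra per-string
state, finding a code point means scanning from the start of the
buffer, because characters occupy between one and four bytes.

```python
def utf8_index(buf: bytes, i: int) -> str:
    """Return the i-th code point of UTF-8 encoded `buf`.

    Hypothetical helper: scans from the start, so the cost grows
    with i (O(N) in the worst case).
    """
    if i < 0:
        raise IndexError("negative indexes are not handled in this sketch")
    count = -1    # index of the code point currently being scanned
    start = None  # byte offset where code point i begins
    for pos, byte in enumerate(buf):
        # Continuation bytes look like 0b10xxxxxx; any other byte
        # starts a new code point.
        if byte & 0xC0 != 0x80:
            count += 1
            if count == i:
                start = pos
            elif count == i + 1:
                return buf[start:pos].decode("utf-8")
    if start is None:
        raise IndexError("string index out of range")
    # Code point i was the last one in the buffer.
    return buf[start:].decode("utf-8")


s = "naïve µPy €"
buf = s.encode("utf-8")
# Every index agrees with ordinary str indexing.
assert all(utf8_index(buf, i) == s[i] for i in range(len(s)))
```

Whether the memory saved is worth that scan is exactly the sort of
question the benchmarks in point (3) would need to answer.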