On Friday, March 6, 2015 at 8:20:22 PM UTC+5:30, Steven D'Aprano wrote:
> Rustom Mody wrote:
>
> > On Friday, March 6, 2015 at 10:50:35 AM UTC+5:30, Chris Angelico wrote:
>
> [snip example of an analogous situation with NULs]
>
> > Strawman.
>
> Sigh. If I had a dollar for every time somebody cried "Strawman!" when what
> they really should say is "Yes, that's a good argument, I'm afraid I can't
> argue against it, at least not without considerable thought", I'd be a
> wealthy man...

Missed my addition? Here it is again, grammar slightly corrected.

===========
Ah well, if you insist on pursuing the nul-char example...
- No, the Unicode consortium (or ASCII equivalent) is not wrong in
  allocating codepoint 0
- No, the code that "can't cope with a perfectly normal character" is not
  wrong
- It is C that is wrong for designing a buggy string data structure that
  cannot contain a valid char.
===========

In fact Chris' nul-char example supports my argument (the bugginess of
UTF-16) so strongly that it is perhaps too strong even for me.

To elaborate: take the buggy-plane analogy I gave in
http://blog.languager.org/2015/03/whimsical-unicode.html
If a plane model crashes once in 10,000 flights compared to others that
crash once in one million flights, we can call it bug-prone though not
strictly buggy: it does fly 9999 times safely! OTOH if a plane is
guaranteed to crash, we can call it a buggy plane.

C's string is not bug-prone, it's plain buggy, as it cannot represent
strings with nulls.

I would not go that far for UTF-16. It is bug-inviting, but it can also be
implemented correctly.

>
> > Lets please stick to UTF-16 shall we?
> >
> > Now tell me:
> > - Is it broken or not?
>
> The UTF-16 standard is not broken. It is a perfectly adequate variable-width
> encoding, and considerably better than most other variable-width encodings.
>
> However, many implementations of UTF-16 are faulty, and assume a
> fixed-width. *That* is broken, not UTF-16.
>
> (The difference between specification and implementation is critical.)
>
> > - Is it widely used or not?
>
> It's quite widely used.
>
> > - Should programmers be careful of it or not?
>
> Programmers should be aware whether or not any specific language uses UTF-16
> and whether the implementation is buggy. That will help them decide whether
> or not to use that language.
>
> > - Should programmers be warned about it or not?
>
> I'm in favour of people having more knowledge rather than less.
> I don't believe that ignorance is bliss, except perhaps in the case that a
> giant asteroid the size of Texas is heading straight for us.
>
> Programmers should be aware of the limitations or bugs in any UTF-16
> implementation they are likely to run into. Hence my general
> recommendation:
>
> - For transmission over networks or storage on permanent media (e.g. the
>   content of text files), use UTF-8. It is well-implemented by nearly all
>   languages that support Unicode, as far as I know.
>
> - If you are designing your own language, your implementation of Unicode
>   strings should use something like Python's FSR, or UTF-8 with tweaks to
>   make string indexing O(1) rather than O(N), or correctly-implemented
>   UTF-16, or even UTF-32 if you have the memory. (Choices, choices.)

FSR is possible in Python for very specific Pythonic reasons:
- dynamicness
- immutable strings

Drop either and FSR becomes impossible.

> If, in 2015, you design your Unicode implementation as if UTF-16 is a fixed
> 2-byte per code point format, you fail.

Seems obvious enough.
So let's see...
Here's a 2-line Python program that runs well enough when run as a command.
Program:
=========
pp = "💩"
print(pp)
=========
Try opening it in idle3 and you get (at least I get):

$ idle3 ff.py
Traceback (most recent call last):
  File "/usr/bin/idle3", line 5, in <module>
    main()
  File "/usr/lib/python3.4/idlelib/PyShell.py", line 1562, in main
    if flist.open(filename) is None:
  File "/usr/lib/python3.4/idlelib/FileList.py", line 36, in open
    edit = self.EditorWindow(self, filename, key)
  File "/usr/lib/python3.4/idlelib/PyShell.py", line 126, in __init__
    EditorWindow.__init__(self, *args)
  File "/usr/lib/python3.4/idlelib/EditorWindow.py", line 294, in __init__
    if io.loadfile(filename):
  File "/usr/lib/python3.4/idlelib/IOBinding.py", line 236, in loadfile
    self.text.insert("1.0", chars)
  File "/usr/lib/python3.4/idlelib/Percolator.py", line 25, in insert
    self.top.insert(index, chars, tags)
  File "/usr/lib/python3.4/idlelib/UndoDelegator.py", line 81, in insert
    self.addcmd(InsertCommand(index, chars, tags))
  File "/usr/lib/python3.4/idlelib/UndoDelegator.py", line 116, in addcmd
    cmd.do(self.delegate)
  File "/usr/lib/python3.4/idlelib/UndoDelegator.py", line 219, in do
    text.insert(self.index1, self.chars, self.tags)
  File "/usr/lib/python3.4/idlelib/ColorDelegator.py", line 82, in insert
    self.delegate.insert(index, chars, tags)
  File "/usr/lib/python3.4/idlelib/WidgetRedirector.py", line 148, in __call__
    return self.tk_call(self.orig_and_operation + args)
_tkinter.TclError: character U+1f4a9 is above the range (U+0000-U+FFFF) allowed by Tcl

So who/what is broken?

>
> - If you are using an existing language, be aware of any bugs and
>   limitations in its Unicode implementation. You may or may not be able to
>   work around them, but at least you can decide whether or not you wish to
>   try.
>
> - If you are writing your own file system layer, it's 2015 fer fecks sake,
>   file names should be Unicode strings, not bytes! (That's one part of the
>   Unix model that needs to die.)
>   You can use UTF-8 or UTF-16 in the file system, whichever you please,
>   but again remember that both are variable-width formats.

Correct.
Windows is broken for using UTF-16.
Linux is broken for conflating UTF-8 and byte strings.

Lot of breakage out here, don't you think?
Maybe related to the equation

UTF-16 = UCS-2 + Duct-tape

??
--
https://mail.python.org/mailman/listinfo/python-list
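
For anyone who wants to see the "duct-tape" concretely, here is a minimal sketch of what happens to the character from the 2-line program when it is encoded as UTF-16. It assumes Python 3.3 or later (so that len() counts code points), and the helper names utf16_code_units and non_bmp_chars are made up for this illustration; they are not part of any library. It only shows the surrogate-pair split and a check for characters above U+FFFF, the range named in the TclError; it does not reproduce the idle3 failure itself.

```python
import struct


def utf16_code_units(text):
    """Return the 16-bit UTF-16 code units of text as a list of ints."""
    raw = text.encode("utf-16-le")  # little-endian, no byte-order mark
    return list(struct.unpack("<%dH" % (len(raw) // 2), raw))


def non_bmp_chars(text):
    """Return (index, code point) pairs for characters above U+FFFF."""
    return [(i, ord(ch)) for i, ch in enumerate(text) if ord(ch) > 0xFFFF]


pp = "\U0001F4A9"  # same character as in the 2-line program
units = utf16_code_units(pp)

print(len(pp))                  # number of code points
print([hex(u) for u in units])  # UTF-16 code units (a surrogate pair)
print([(i, hex(cp)) for i, cp in non_bmp_chars("a" + pp + "b")])
```

One code point becomes two 16-bit code units, so any code that treats UTF-16 as 2 bytes per character will miscount or split this string. A check like non_bmp_chars is one way to find such characters before handing text to a component limited to U+0000-U+FFFF.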