Re: [Python-Dev] Python and the Unicode Character Database
2010/12/7 Alexander Belopolsky alexander.belopol...@gmail.com:
> On Sat, Dec 4, 2010 at 5:58 PM, Martin v. Löwis mar...@v.loewis.de wrote:
>>> I actually wonder if Python's re module can claim to provide even Basic Unicode Support.
>>
>> Do you really wonder? Most definitely it does not.
>
> Were you more optimistic four years ago?
> http://bugs.python.org/issue1528154#msg54864
>
> I was hoping that regex syntax would be useful in explaining/documenting Python text processing routines (including string to number conversions) without a heavy dose of Unicode terminology.

The new regex version http://bugs.python.org/issue2636 supports many more features, including Unicode properties and the mentioned POSIX classes, but definitely not all of the requirements of that rather generous list:
http://www.unicode.org/reports/tr18/

It seems that other engines, e.g. Perl's, have some omissions too:
http://perldoc.perl.org/perlunicode.html#Unicode-Regular-Expression-Support-Level

Do you know of any re engine fully complying with tr18, even at the first level: Basic Unicode Support?

vbr
___
Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Python and the Unicode Character Database
On Tue, Dec 7, 2010 at 8:02 AM, Vlastimil Brom vlastimil.b...@gmail.com wrote:
..
> It seems, e.g. in Perl, there are some omissions too
> http://perldoc.perl.org/perlunicode.html#Unicode-Regular-Expression-Support-Level
>
> Do you know of any re engine fully complying with tr18, even at the first level: Basic Unicode Support?

I would say Perl comes very close. At least it explicitly documents the missing features and offers workarounds in its reference manual. I am actually not as concerned about missing features as I am about non-conformance in widely used features such as digit matching with '\d'.
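To make the '\d' concern concrete, here is a minimal sketch of how digit matching can be probed from Python. It assumes a current CPython 3, where the re documentation describes '\d' on str patterns as matching category Nd unless re.ASCII is given; the sample string and variable names are invented for illustration and do not come from the thread.

```python
import re
import unicodedata

# Sample text (made up): ASCII digits, Arabic-Indic digits, subscript digits.
text = "42 \u0664\u0662 \u2084\u2082"

# Default str pattern: \d follows the Unicode database.
unicode_digits = re.findall(r"\d+", text)

# With re.ASCII, \d is restricted to [0-9].
ascii_digits = re.findall(r"\d+", text, flags=re.ASCII)

print(unicode_digits)
print(ascii_digits)

# General categories of one digit from each group, for comparison with TR18's
# recommendation that \d correspond to gc = Decimal_Number (Nd).
print([unicodedata.category(c) for c in "4\u0664\u2084"])
```

Comparing the two result lists against the printed categories shows whether a given interpreter's '\d' tracks Nd or something broader.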
Re: [Python-Dev] Python and the Unicode Character Database
On 07.12.2010 04:03, Alexander Belopolsky wrote:
> On Sat, Dec 4, 2010 at 5:58 PM, Martin v. Löwis mar...@v.loewis.de wrote:
>>> I actually wonder if Python's re module can claim to provide even Basic Unicode Support.
>>
>> Do you really wonder? Most definitely it does not.
>
> Were you more optimistic four years ago?
> http://bugs.python.org/issue1528154#msg54864

Not at all. I thought back then, and think now, that Python should, but doesn't, support TR#18. I don't view that lack as a severe problem, though, and apparently none of the other contributors did so, either.

Regards,
Martin
Re: [Python-Dev] Python and the Unicode Character Database
On Tue, Dec 7, 2010 at 8:02 AM, Vlastimil Brom vlastimil.b...@gmail.com wrote:
..
> Do you know of any re engine fully complying with tr18, even at the first level: Basic Unicode Support?

ICU Regular Expressions conform to Unicode Technical Standard #18, Unicode Regular Expressions, level 1, and in addition include Default Word boundaries and Name Properties from level 2.
http://userguide.icu-project.org/strings/regexp
Re: [Python-Dev] Python and the Unicode Character Database
2010/12/7 Alexander Belopolsky alexander.belopol...@gmail.com:
> On Tue, Dec 7, 2010 at 8:02 AM, Vlastimil Brom vlastimil.b...@gmail.com wrote:
> ..
>> Do you know of any re engine fully complying with tr18, even at the first level: Basic Unicode Support?
>
> ICU Regular Expressions conform to Unicode Technical Standard #18, Unicode Regular Expressions, level 1, and in addition include Default Word boundaries and Name Properties from level 2.
> http://userguide.icu-project.org/strings/regexp

Thanks for the pointer, I wasn't aware of that project. Anyway, I am quite happy with the mentioned regex library and can live with trading this full compliance for some non-Unicode goodies (like unbounded lookbehinds etc.), but I see that this is beside the point here.

Not that my opinion matters, but I can't think of, say, union, intersection and set-difference of Unicode sets as a basic feature, or consider them part of a minimal level of useful Unicode support.

vbr
Re: [Python-Dev] Python and the Unicode Character Database
On Sat, Dec 4, 2010 at 5:58 PM, Martin v. Löwis mar...@v.loewis.de wrote:
>> I actually wonder if Python's re module can claim to provide even Basic Unicode Support.
>
> Do you really wonder? Most definitely it does not.

Were you more optimistic four years ago?
http://bugs.python.org/issue1528154#msg54864

I was hoping that regex syntax would be useful in explaining/documenting Python text processing routines (including string to number conversions) without a heavy dose of Unicode terminology.
Re: [Python-Dev] Python and the Unicode Character Database
Antoine Pitrou writes:
> On Friday, 3 December 2010 at 13:58 +0900, Stephen J. Turnbull wrote:
>> Antoine Pitrou writes:
>>> The legacy format argument looks like a red herring to me. When converting from one format to another it is the programmer's job to do his/her job right.
>>
>> Uhmm, the argument *for* this feature proposed by several people is that Python's numeric constructors do it (right) so that the programmer doesn't have to.
>
> As far as I understand, Alexander was talking about a legacy pre-unicode text format. We don't have to support this.

*I* didn't say we *should* support it. I'm saying that *others'* argument for not restricting the formats accepted by string-to-number converters to something well-defined and AFAIK universally understood by users (developers of Python programs *and* end-users) is that we *already* support this.

Alexander, Martin, and I are basically just pointing out that no, the support we have via the built-in numeric constructors is incomplete and nonconforming. We feel that is a bug to be fixed by (1) implementing the definition as currently found in the documents, and (2) moving the non-ASCII support to another module (or, as a compromise, supporting non-ASCII digits via an argument to the built-ins -- that was my proposal, I don't know if Alexander or Martin would find it acceptable).

Given that some committers (MAL, you?) don't even consider that accepting and converting a string containing digits from multiple scripts as a single number is a bug, I'd really rather that this bug/feature not be embedded in the interpreter. I suppose that as a built-in rather than syntax, technically it doesn't fall under the moratorium, but it makes me nervous
Re: [Python-Dev] Python and the Unicode Character Database
On Saturday, 4 December 2010 at 17:13 +0900, Stephen J. Turnbull wrote:
> Antoine Pitrou writes:
>> On Friday, 3 December 2010 at 13:58 +0900, Stephen J. Turnbull wrote:
>>> Antoine Pitrou writes:
>>>> The legacy format argument looks like a red herring to me. When converting from one format to another it is the programmer's job to do his/her job right.
>>>
>>> Uhmm, the argument *for* this feature proposed by several people is that Python's numeric constructors do it (right) so that the programmer doesn't have to.
>>
>> As far as I understand, Alexander was talking about a legacy pre-unicode text format. We don't have to support this.
>
> *I* didn't say we *should* support it. I'm saying that *others'* argument for not restricting the formats accepted by string-to-number converters to something well-defined and AFAIK universally understood by users (developers of Python programs *and* end-users) is that we *already* support this.

As far as I can parse your sentence, I think you are mistaken.

Regards

Antoine.
Re: [Python-Dev] Python and the Unicode Character Database
On Fri, Dec 3, 2010 at 12:10 AM, Alexander Belopolsky alexander.belopol...@gmail.com wrote:
..
I don't think decimal module should support non-European decimal digits. The only place where it can make some sense is in int() because here we have a fighting chance of producing a reasonable definition.

The motivating use case is conversion of numerical data extracted from text using simple '\d+' regex matches. It turns out, this use case does not quite work in Python either:

>>> re.compile(r'\s+(\d+)\s+').match(' \u2081\u2082\u2083 ').group(1)
'₁₂₃'
>>> int(_)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'decimal' codec can't encode character '\u2081' in position 0: invalid decimal Unicode string

This may actually be a bug in Python regex implementation because Unicode standard seems to recommend that '\d' be interpreted as gc = Decimal_Number (Nd):
http://unicode.org/reports/tr18/#Compatibility_Properties

I actually wonder if Python's re module can claim to provide even Basic Unicode Support.
http://unicode.org/reports/tr18/#Basic_Unicode_Support

Here is how I would do it:

1. String x of non-European decimal digits is only accepted in int(x), but not by int(x, 0) or int(x, 10).

2. If x contains one or more non-European digits, then

   (a) all digits must be from the same block:

       def basepoint(c):
           return ord(c) - unicodedata.digit(c)

       all(basepoint(c) == basepoint(x[0]) for c in x) -> True

   (b) and a '+' or '-' sign is not allowed.

3. A character c is a digit if it matches '\d' regex. I think this means unicodedata.category(c) -> 'Nd'.

Condition 2(b) is important because there is no clear way to define what is acceptable as '+' or '-' using Unicode character properties and not all number systems even support local form of negation. (It is also YAGNI.)
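The three rules above can be sketched as a small wrapper around the built-in. This is only an illustration under stated assumptions: `parse_same_script_int` is a hypothetical name rather than an existing or proposed API, rule 1 (the distinction between int(x) and int(x, 10)) is not modeled, and the "same block" test is the basepoint comparison from the post.

```python
import unicodedata


def basepoint(c):
    # Code point of the zero digit in the run of ten digits that contains c.
    return ord(c) - unicodedata.digit(c)


def parse_same_script_int(x):
    """Hypothetical helper: accept only unsigned, single-script decimal strings."""
    # Rule 3: every character must have general category Nd.
    # A '+' or '-' sign is not Nd, so rule 2(b) is enforced here as well.
    if not x or any(unicodedata.category(c) != "Nd" for c in x):
        raise ValueError("not a plain string of decimal digits: %r" % (x,))
    # Rule 2(a): all digits must share one basepoint.
    if any(basepoint(c) != basepoint(x[0]) for c in x):
        raise ValueError("digits from more than one script: %r" % (x,))
    return int(x)


print(parse_same_script_int("\u0661\u0662\u0663"))  # Arabic-Indic digits 1, 2, 3

for bad in ["1\u0662\u0663", "-12", "\u2081\u2082\u2083"]:
    try:
        parse_same_script_int(bad)
    except ValueError as exc:
        print("rejected:", exc)
```

The wrapper delegates the actual conversion to int(), so it only narrows what the built-in already accepts.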
Re: [Python-Dev] Python and the Unicode Character Database
> I actually wonder if Python's re module can claim to provide even Basic Unicode Support.

Do you really wonder? Most definitely it does not.

Regards,
Martin
Re: [Python-Dev] Python and the Unicode Character Database
Stephen J. Turnbull:
> Will it accept Arabic on input? (Han might be too much to ask for since Unicode considers Han digits to be impure.)

I couldn't find a direct way to input Arabic digits into OO Calc; the normal use of Alt+number didn't work in Calc although it did in WordPad, where Alt+1632 is ٠ and so on. OO Calc does have settings in the Complex Text Layout section for choosing different numerals, but I don't understand the interaction of choices here.

Neil
Re: [Python-Dev] Python and the Unicode Character Database
Alexander Belopolsky wrote:
> On Thu, Dec 2, 2010 at 5:58 PM, M.-A. Lemburg m...@egenix.com wrote:
> ..
>>> I will change my mind on this issue when you present a machine-readable file with Arabic-Indic numerals and a program capable of reading it and show that this program uses the same number parsing algorithm as Python's int() or float().
>>
>> Have you had a look at the examples I posted ? They include texts and tables with numbers written using east asian arabic numerals.
>
> Yes, but this was all about output. I am pretty sure TeX was able to typeset Qur'an in all its glory long before Unicode was invented. Yet, in machine readable form it would be something like {\quran 1} (invented directive).
>
> I have asked for a file that is intended for machine processing, not for human enjoyment in print or on a display. I claim that if such file exists, the program that reads it does not use the same rules as Python and converting non-ascii digits would be a tiny portion of what that program does.

Well, programs that take input from the keyboards I posted in this thread will have to deal with the digits. Since Python's input() accepts keyboard input, you have your use case :-)

Seriously, I find the distinction between input and output forms of numerals somewhat misguided. Any output can also serve as input. For books and other printed material, images, etc. you have scanners and OCR. For screen output you have screen readers. For spreadsheets and data, you have CSV, TSV, XML, etc. etc. etc.

Just for the fun of it, I created a CSV file with Thai and Dzongkha numerals (in addition to Arabic ones) using OpenOffice.

Here's the cut and paste version:

Numbers in various scripts

Arabic  Thai  Dzongkha
1       ๑     ༡
2       ๒     ༢
3       ๓     ༣
4       ๔     ༤
5       ๕     ༥
6       ๖     ༦
7       ๗     ༧
8       ๘     ༨
9       ๙     ༩
10      ๑๐    ༡༠
11      ๑๑    ༡༡
12      ๑๒    ༡༢
13      ๑๓    ༡༣
14      ๑๔    ༡༤
15      ๑๕    ༡༥
16      ๑๖    ༡༦
17      ๑๗    ༡༧
18      ๑๘    ༡༨
19      ๑๙    ༡༩
20      ๒๐    ༢༠

And here's the script that goes with it:

    import csv
    c = csv.reader(open('Numbers-in-various-scripts.csv'))
    headers = [c.next() for i in range(3)]
    while c:
        print [int(unicode(x, 'utf-8')) for x in c.next()]

and the output using Python 2.7:

    [1, 1, 1]
    [2, 2, 2]
    [3, 3, 3]
    [4, 4, 4]
    [5, 5, 5]
    [6, 6, 6]
    [7, 7, 7]
    [8, 8, 8]
    [9, 9, 9]
    [10, 10, 10]
    [11, 11, 11]
    [12, 12, 12]
    [13, 13, 13]
    [14, 14, 14]
    [15, 15, 15]
    [16, 16, 16]
    [17, 17, 17]
    [18, 18, 18]
    [19, 19, 19]
    [20, 20, 20]

If you need more such files, I can generate as many as you like ;-) I can send the OOo file as well, if you like to play around with it.

I'd say: case closed :-)

--
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source (#1, Dec 03 2010)
Python/Zope Consulting and Support ... http://www.egenix.com/
mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/
mxODBC, mxDateTime, mxTextTools ... http://python.egenix.com/

::: Try our new mxODBC.Connect Python Database Interface for free !

eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48 D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg Registered at Amtsgericht Duesseldorf: HRB 46611 http://www.egenix.com/company/contact/

Numbers in various scripts,,
,,
Arabic,Thai,Dzongkha
1,๑,༡
2,๒,༢
3,๓,༣
4,๔,༤
5,๕,༥
6,๖,༦
7,๗,༧
8,๘,༨
9,๙,༩
10,๑๐,༡༠
11,๑๑,༡༡
12,๑๒,༡༢
13,๑๓,༡༣
14,๑๔,༡༤
15,๑๕,༡༥
16,๑๖,༡༦
17,๑๗,༡༧
18,๑๘,༡༨
19,๑๙,༡༩
20,๒๐,༢༠
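For readers on Python 3, the same experiment can be sketched as follows. This is an illustrative port, not the script from the message: it reads a few sample rows from an in-memory string instead of the attached file so that it is self-contained, and it iterates with a for loop because a `while c:` loop over a csv reader would only stop when next() raises StopIteration.

```python
import csv
import io

# Three sample rows modeled on the table above (Arabic, Thai, Dzongkha columns).
data = (
    "Arabic,Thai,Dzongkha\n"
    "1,\u0e51,\u0f21\n"
    "10,\u0e51\u0e50,\u0f21\u0f20\n"
    "20,\u0e52\u0e50,\u0f22\u0f20\n"
)

reader = csv.reader(io.StringIO(data))
header = next(reader)  # skip the single header row of this sample

# int() accepts any run of characters with the Unicode decimal-digit property.
rows = [[int(cell) for cell in row] for row in reader]

for row in rows:
    print(row)
```

With a real file, the io.StringIO object would be replaced by open(path, newline='', encoding='utf-8').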
Re: [Python-Dev] Python and the Unicode Character Database
On Friday, 3 December 2010 at 13:58 +0900, Stephen J. Turnbull wrote:
> Antoine Pitrou writes:
>> The legacy format argument looks like a red herring to me. When converting from one format to another it is the programmer's job to do his/her job right.
>
> Uhmm, the argument *for* this feature proposed by several people is that Python's numeric constructors do it (right) so that the programmer doesn't have to.

As far as I understand, Alexander was talking about a legacy pre-unicode text format. We don't have to support this.

Regards

Antoine.
Re: [Python-Dev] Python and the Unicode Character Database
Stephen J. Turnbull:
> Here's why: '''print %d % some_integer''' doesn't now, and never will (unless Kristan gets his Python 2.8 <wink>), produce Arabic or Han numerals. Not in any language I know of, not in Microsoft Excel, and definitely not in Python 2.

While I don't have Excel to test with, OpenOffice.org Calc will display in Arabic or Han numerals using the NatNum format codes.
http://www.scintilla.org/ArabicNumbers.png

> Ditto Arabic, I would imagine; ISO 8859/6 (aka Latin/Arabic) does not contain the Arabic digits that have been presented here earlier AFAICT. Note that there's plenty of space for them in that code table (eg, 0xB0-0xB9 is empty). Apparently nobody *ever* thought it was useful to have them!

DOS code page 864 does use 0xB0-0xB9 for ٠ .. ٩.
http://www.ascii.ca/cp864.htm

Neil
Re: [Python-Dev] Python and the Unicode Character Database
On 01.12.2010 23:39, Martin v. Löwis wrote:
>> As of today, What’s New In Python 3.2 [1] does not even mention the unicodedata upgrade to 6.0.0.
>
> One reason was that I was instructed not to change What's New a few years ago.

Maybe all past, present and future whatsnew maintainers can agree on these rules, which I copied directly from whatsnew/3.2.rst?

Rules for maintenance:

* Anyone can add text to this document. Do not spend very much time on the wording of your changes, because your text will probably get rewritten to some degree.

* The maintainer will go through Misc/NEWS periodically and add changes; it's therefore more important to add your changes to Misc/NEWS than to this file.

* This is not a complete list of every single change; completeness is the purpose of Misc/NEWS. Some changes I consider too small or esoteric to include. If such a change is added to the text, I'll just remove it. (This is another reason you shouldn't spend too much time on writing your addition.)

* If you want to draw your new text to the attention of the maintainer, add 'XXX' to the beginning of the paragraph or section.

* It's OK to just add a fragmentary note about a change. For example:

      XXX Describe the transmogrify() function added to the socket module.

  The maintainer will research the change and write the necessary text.

* You can comment out your additions if you like, but it's not necessary (especially when a final release is some months away).

* Credit the author of a patch or bugfix. Just the name is sufficient; the e-mail address isn't necessary. It's helpful to add the issue number:

      XXX Describe the transmogrify() function added to the socket module.
      (Contributed by P.Y. Developer; :issue:`12345`.)

  This saves the maintainer the effort of going through the SVN log when researching a change.

Georg
Re: [Python-Dev] Python and the Unicode Character Database
2010/12/2 Stephen J. Turnbull step...@xemacs.org:
> Because that works, but print(T1234) doesn't (it prints ASCII). You can't round-trip, but users will want/expect that.

You should be able to round-trip, absolutely. I don't think you should expect print() to do that. str(56) possibly. :) That's an argument for it to be in a module, as you then would need to send in a parameter on which decimal characters you want.

> T1000 = float('一.◯◯◯')

That was already discussed here, and it's clear that Unicode does not consider these characters to be something you can use in a decimal number, and hence it's not broken.
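The distinction drawn here can be checked against the Unicode database shipped with Python. The sketch below assumes a current CPython 3; the variable names are illustrative only. It shows that the Han character for one carries a numeric value but no decimal-digit value, which is the property float() relies on.

```python
import unicodedata

han_one = "\u4e00"       # the Han character used for "one" in the message
large_circle = "\u25ef"  # the circle used as "zero" in the message

# A numeric value is recorded, but there is no decimal-digit value.
print(unicodedata.category(han_one), unicodedata.numeric(han_one))
print(unicodedata.decimal(han_one, None))
print(unicodedata.category(large_circle), unicodedata.numeric(large_circle, None))

# float() therefore refuses the string from the example.
try:
    float(han_one + "." + large_circle * 3)
    rejected = False
except ValueError:
    rejected = True
print("rejected by float():", rejected)
```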
Re: [Python-Dev] Python and the Unicode Character Database
On Wed, 1 Dec 2010 22:28:49 -0500 Alexander Belopolsky alexander.belopol...@gmail.com wrote:
>> Both my personal observations when travelling from Turkey to India and Wikipedia say yes. When representing a number in Arabic, the lowest-valued position is placed on the right, so the order of positions is the same as in left-to-right scripts.
>> https://secure.wikimedia.org/wikipedia/en/wiki/Arabic_language#Numerals
>
> This matches my limited research on this topic as well. However, I am not sure that when these codes are embedded in Arabic text, their logical order always matches their display order.

That shouldn't matter, since unicode text follows logical order. The display order is up to the graphical representation library.

Regards

Antoine.
Re: [Python-Dev] Python and the Unicode Character Database
On Thu, Dec 2, 2010 at 8:36 AM, Antoine Pitrou solip...@pitrou.net wrote:
> On Wed, 1 Dec 2010 22:28:49 -0500 Alexander Belopolsky alexander.belopol...@gmail.com wrote:
> ..
>> This matches my limited research on this topic as well. However, I am not sure that when these codes are embedded in Arabic text, their logical order always matches their display order.
>
> That shouldn't matter, since unicode text follows logical order. The display order is up to the graphical representation library.

I am not so sure. On my Mac, U+200F (RIGHT-TO-LEFT MARK) affects 0-9 and Arabic-Indic decimals differently:

>>> print('\u200F123')
123
>>> print('\u200F\u0661\u0662\u0663')
231

I replaced Arabic-Indic decimals with 0-9 in the output to demonstrate the point. Cut-n-paste does not work well in the presence of RTL directives.

And U+202E (RIGHT-TO-LEFT OVERRIDE) reverses the display order for both:

>>> print('\u202E123')
321
>>> print('\u202E\u0661\u0662\u0663')
321

(Again, the output display is simulated, not copied.)

I don't know if explicit RTL directives are ever used in Arabic texts, but it is quite possible that texts converted from older formats would use them for efficiency.

Note that my point is not to find the correct answer here, but to demonstrate that we as a group don't have the expertise to get parsing of Arabic text right. If we've got it right for Arabic, it is by chance and not by design. This still leaves us with 41 other types of digits for at least 30 different languages.

Nobody will ever assume that python builtins are suitable for use with all these variants. This feature is only good for nefarious purposes such as hiding extra digits in innocent-looking files or smuggling binary data through naive interfaces.

PS: BTW, shouldn't int('\u0661\u0662\u06DD') be valid? or is it int('\u06DD\u0661\u0662')?
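One way to see why the two kinds of digits can render differently is to look at their bidirectional classes in the Unicode database. The following is a small sketch using the standard unicodedata module; the labels and variable names are illustrative, and it says nothing about how any particular terminal actually draws the text.

```python
import unicodedata

samples = {
    "ASCII digit one": "1",
    "Arabic-Indic digit one": "\u0661",
    "RIGHT-TO-LEFT MARK": "\u200f",
    "RIGHT-TO-LEFT OVERRIDE": "\u202e",
}

# European digits and Arabic-Indic digits belong to different bidi classes,
# so the bidirectional algorithm treats them differently at display time.
for label, ch in samples.items():
    print("U+%04X %-24s bidi=%s" % (ord(ch), label, unicodedata.bidirectional(ch)))

# The stored (logical) order is unaffected by how the text is displayed:
s = "\u200f\u0661\u0662\u0663"
value = int(s.lstrip("\u200f"))  # strip the mark; it is not a digit
print(value)
```

The mark has to be stripped before calling int() because it is a format character, not whitespace or a digit.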
Re: [Python-Dev] Python and the Unicode Character Database
On Thursday, 2 December 2010 at 11:41 -0500, Alexander Belopolsky wrote:
> Note that my point is not to find the correct answer here, but to demonstrate that we as a group don't have the expertise to get parsing of Arabic text right.

I don't understand why you think Arabic or Hebrew text is any different from Western text. Surely right-to-left isn't more conceptually complicated than left-to-right, is it?

The fact that mixed rtl + ltr can render bizarrely or is awkward to cut and paste is quite off-topic for our discussion.

> If we've got it right for Arabic, it is by chance and not by design. This still leaves us with 41 other types of digits for at least 30 different languages.

So why do you trust the Unicode standard on other things and not on this one?

Regards

Antoine.
Re: [Python-Dev] Python and the Unicode Character Database
On Thu, Dec 2, 2010 at 11:56 AM, Antoine Pitrou solip...@pitrou.net wrote:
> On Thursday, 2 December 2010 at 11:41 -0500, Alexander Belopolsky wrote:
>> Note that my point is not to find the correct answer here, but to demonstrate that we as a group don't have the expertise to get parsing of Arabic text right.
>
> I don't understand why you think Arabic or Hebrew text is any different from Western text. Surely right-to-left isn't more conceptually complicated than left-to-right, is it?

No, but a mix of LTR and RTL is certainly more difficult than either of the two. I invite you to digest Unicode Standard Annex #9 before we continue this discussion. See http://unicode.org/reports/tr9/.

> The fact that mixed rtl + ltr can render bizarrely or is awkward to cut and paste is quite off-topic for our discussion.

No, it is not. One of the invented use cases in this thread was naive users' desire to enter numbers using their preferred local decimals. The same users may want to be able to cut and paste their decimals as well. More importantly, however, legacy formats may not have support for mixed-direction text and may require that "John is 41" be stored as "41 si nhoJ", and a Unicode converter would turn it into "[RTL]John is 14" that will still display as "41 si nhoJ", but int(s[-2:]) will return 14, not 41.

>> If we've got it right for Arabic, it is by chance and not by design. This still leaves us with 41 other types of digits for at least 30 different languages.
>
> So why do you trust the Unicode standard on other things and not on this one?

What other things? As far as I understand, the only str method that was designed to comply with Unicode recommendations was str.isidentifier(). And we have some really bizarre results:

>>> '\u2164'.isidentifier()
True
>>> '\u2164'.isalpha()
False

and can you describe the difference between str.isdigit() and str.isdecimal()? According to the reference manual,

str.isdecimal()
    Return true if all characters in the string are decimal characters and there is at least one character, false otherwise. Decimal characters include digit characters, and all characters that can be used to form decimal-radix numbers, e.g. U+0660, ARABIC-INDIC DIGIT ZERO.

str.isdigit()
    Return true if all characters in the string are digits and there is at least one character, false otherwise.

http://docs.python.org/dev/library/stdtypes.html#str.isdecimal

Since U+0660 is mentioned in the first definition and not in the second, I may conclude that it is not a digit, but

>>> '\u0660'.isdigit()
True

If you know the correct answer, please contribute it here: http://bugs.python.org/issue10587.
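The question about str.isdigit() versus str.isdecimal() can be explored empirically. The sketch below assumes a current CPython 3 and simply tabulates the three predicates next to the general category for a handful of characters; the choice of sample characters is illustrative (only U+0660 and U+2164 come from the message).

```python
import unicodedata

samples = [
    "7",       # DIGIT SEVEN
    "\u0660",  # ARABIC-INDIC DIGIT ZERO
    "\u00b2",  # SUPERSCRIPT TWO
    "\u2164",  # ROMAN NUMERAL FIVE
]

for ch in samples:
    print(
        "U+%04X %-24s cat=%s isdecimal=%-5s isdigit=%-5s isnumeric=%s"
        % (
            ord(ch),
            unicodedata.name(ch),
            unicodedata.category(ch),
            ch.isdecimal(),
            ch.isdigit(),
            ch.isnumeric(),
        )
    )
```

In this table the three predicates nest: every character that passes isdecimal() also passes isdigit(), and every character that passes isdigit() also passes isnumeric(), while the superscript and the Roman numeral show where the sets differ.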
Re: [Python-Dev] Python and the Unicode Character Database
On Thursday, 2 December 2010 at 13:14 -0500, Alexander Belopolsky wrote:
>> I don't understand why you think Arabic or Hebrew text is any different from Western text. Surely right-to-left isn't more conceptually complicated than left-to-right, is it?
>
> No, but a mix of LTR and RTL is certainly more difficult than either of the two. I invite you to digest Unicode Standard Annex #9 before we continue this discussion. See http://unicode.org/reports/tr9/.

“This annex describes specifications for the *positioning* of characters flowing from right to left” (emphasis mine)

Looks like something for implementors of rendering engines, which python-dev is not AFAICT.

> The same users may want to be able to cut and paste their decimals as well. More importantly, however, legacy formats may not have support for mixed-direction text and may require that "John is 41" be stored as "41 si nhoJ", and a Unicode converter would turn it into "[RTL]John is 14" that will still display as "41 si nhoJ", but int(s[-2:]) will return 14, not 41.

The legacy format argument looks like a red herring to me. When converting from one format to another it is the programmer's job to do his/her job right.

>>> If we've got it right for Arabic, it is by chance and not by design. This still leaves us with 41 other types of digits for at least 30 different languages.
>>
>> So why do you trust the Unicode standard on other things and not on this one?
>
> What other things?

Everything which the Unicode database stores and that we already rely on.

> As far as I understand, the only str method that was designed to comply with Unicode recommendations was str.isidentifier().

I don't think so. str.split() and str.splitlines() are also defined in conformance to the SPEC, AFAIK. They certainly try to. And, outside of str itself, the re module tries to follow Unicode categories as well (for example, \d should match non-ASCII digits).

Regards

Antoine.
Re: [Python-Dev] Python and the Unicode Character Database
On 02.12.2010 03:01, Ben Finney wrote:
> Stephen J. Turnbull step...@xemacs.org writes:
>> Furthermore, he provided good *objective* reason (excessive cost, to which I can also testify, in several different input methods for Japanese) why numbers simply would not be input that way. What's left is copy/paste via the mouse.
>
> For direct entry by an interactive user, yes. Why are some people in this discussion thinking only of direct entry by an interactive user?

Ultimately, somebody will have entered the data.

> Input from an existing text file, as I said earlier.

Which *specific* existing text file? Have you actually *seen* such a text file?

> Direct entry at the console is a red herring.

And we don't need powerhouses because power comes out of the socket.

Regards,
Martin
Re: [Python-Dev] Python and the Unicode Character Database
> Maybe all past, present and future whatsnew maintainers can agree on these rules, which I copied directly from whatsnew/3.2.rst?

I don't think all past maintainers can (I'm pretty certain that AMK would disagree), but if that's the current policy, I can certainly try following it (I didn't know it exists because I never look at the file).

Regards,
Martin
Re: [Python-Dev] Python and the Unicode Character Database
Martin v. Löwis wrote: Now, one may wonder what precisely a possibly signed floating point number is, but most likely, this refers to floatnumber ::= pointfloat | exponentfloat pointfloat::= [intpart] fraction | intpart . exponentfloat ::= (intpart | pointfloat) exponent intpart ::= digit+ fraction ::= . digit+ exponent ::= (e | E) [+ | -] digit+ digit ::= 0...9 I don't see why the language spec should limit the wealth of number formats supported by float(). If it doesn't, there should be some other specification of what is correct and what is not. It must not be unspecified. True. It is not uncommon for Asians and other non-Latin script users to use their own native script symbols for numbers. Just because these digits may look strange to someone doesn't mean that they are meaningless or should be discarded. Then these users should speak up and indicate their need, or somebody should speak up and confirm that there are users who actually want '١٢٣٤.٥٦' to denote 1234.56. To my knowledge, there is no writing system in which '١٢٣٤.٥٦e4' means 12345600.0. I'm not sure what you're after here. Please also remember that Python3 now allows Unicode names for identifiers for much the same reasons. No no no. Addition of Unicode identifiers has a well-designed, deliberate specification, with a PEP and all. The support for non-ASCII digits in float appears to be ad-hoc, and not founded on actual needs of actual users. Please note that we didn't have PEPs and the PEP process at the time. The Unicode proposal predates and in some respects inspired the PEP process. The decision to add this support was deliberate based on the desire to support as much of the nice features of Unicode in Python as we could. At least that was what was driving me at the time. Regarding actual needs of actual users: I don't buy that as an argument when it comes to supporting a standard that is meant to attract users with non-ASCII origins. 
Some references you may want to read up on: http://en.wikipedia.org/wiki/Numbers_in_Chinese_culture http://en.wikipedia.org/wiki/Vietnamese_numerals http://en.wikipedia.org/wiki/Korean_numerals http://en.wikipedia.org/wiki/Japanese_numerals Even MS Office supports them: http://languages.siuc.edu/Chinese/Language_Settings.html Note that the support in float() (and the other numeric constructors) to work with Unicode code points was explicitly added when Unicode support was added to Python and has been available since Python 1.6. That doesn't necessarily make it useful. Alexander's complaint is that it makes Python unstable (i.e. changing as the UCD changes). If that were true, then all Unicode database (UCD) changes would make Python unstable. However, most changes to existing code points in the UCS are bug fixes, so they actually have a stabilizing quality more than a destabilizing one. It is not a bug by any definition of bug Most certainly it is: the documentation is either underspecified, or deviates from the implementation (when taking the most plausible interpretation). This is the very definition of bug. The implementation is not a bug and neither was this a bug in the 2.x series of the Python documentation. The Python 3.x docs apparently introduced a reference to the language spec which is clearly not capturing the wealth of possible inputs. So, yes, we're talking about a documentation bug, but not an implementation bug. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Nov 29 2010) Python/Zope Consulting and Support ...http://www.egenix.com/ mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/ mxODBC, mxDateTime, mxTextTools ...http://python.egenix.com/ ::: Try our new mxODBC.Connect Python Database Interface for free ! eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48 D-40764 Langenfeld, Germany. CEO Dipl.-Math. 
Marc-Andre Lemburg Registered at Amtsgericht Duesseldorf: HRB 46611 http://www.egenix.com/company/contact/
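To make the behaviour under discussion concrete, here is a minimal sketch, assuming a current CPython 3 interpreter (the thread itself concerns the 3.2 development branch, where details may have differed). It feeds the Arabic-Indic example string from the messages above to the built-in constructors; the variable names are illustrative only.

```python
# Arabic-Indic digits (U+0660..U+0669) combined with an ASCII full stop,
# i.e. the '١٢٣٤.٥٦' string quoted in the thread.
text = "\u0661\u0662\u0663\u0664.\u0665\u0666"

# float() maps each decimal digit to its ASCII equivalent before parsing.
as_float = float(text)

# The same holds with an ASCII exponent marker appended.
with_exponent = float(text + "e4")

# int() accepts the digits as well.
as_int = int("\u0661\u0662\u0663\u0664")

print(as_float, with_exponent, as_int)
```

Only the digits are translated; the decimal point and the exponent marker must still be the ASCII characters, which is the combination Martin objects to.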
Re: [Python-Dev] Python and the Unicode Character Database
On 02.12.2010 at 20:40, Martin v. Löwis wrote: Maybe all past, present and future whatsnew maintainers can agree on these rules, which I copied directly from whatsnew/3.2.rst? I don't think all past maintainers can Yes, and the same goes for the future ones, since they may not even know yet that they will be whatsnew maintainers. Or maybe they aren't born yet (let's hope for a long life of Python 3...). (I'm pretty certain that AMK would disagree), but if that's the current policy, I can certainly try following it (I didn't know it exists because I never look at the file). The large chunk of rules appeared in 2.6, where AMK was still maintainer. But even in the whatsnew for 2.4, there is this: .. Don't write extensive text for new sections; I'll do that. .. Feel free to add commented-out reminders of things that need .. to be covered. --amk But in any case, they are certainly valid for the current whatsnew -- even if Raymond likes to grumble about too expansive commits :) Georg
Re: [Python-Dev] Python and the Unicode Character Database
Then these users should speak up and indicate their need, or somebody should speak up and confirm that there are users who actually want '١٢٣٤.٥٦' to denote 1234.56. To my knowledge, there is no writing system in which '١٢٣٤.٥٦e4' means 12345600.0. I'm not sure what you're after here. That the current float() constructor accepts tons of bogus character strings and accepts them as numbers, and that it should stop doing so. The decision to add this support was deliberate based on the desire to support as much of the nice features of Unicode in Python as we could. At least that was what was driving me at the time. At the time, this may have been the right thing to do. With the experience gained, we should now conclude to revert this particular aspect. Some references you may want to read up on: http://en.wikipedia.org/wiki/Numbers_in_Chinese_culture http://en.wikipedia.org/wiki/Vietnamese_numerals http://en.wikipedia.org/wiki/Korean_numerals http://en.wikipedia.org/wiki/Japanese_numerals I don't question that people use non-ASCII characters to denote numbers. I claim that the specific support in Python for that has no connection to reality. I further claim that the use of non-ASCII numbers is a local convention, and that if you provide a library to parse numbers, users (of that library) will somehow have to specify which notational convention(s) is reasonable for the input they have. Even MS Office supports them: http://languages.siuc.edu/Chinese/Language_Settings.html That's printing, though, not parsing. Notice that Python does *not* currently support printing numbers in other scripts - even though this may actually be more useful than parsing. Note that the support in float() (and the other numeric constructors) to work with Unicode code points was explicitly added when Unicode support was added to Python and has been available since Python 1.6. That doesn't necessarily make it useful. Alexander's complaint is that it makes Python unstable (i.e. 
changing as the UCD changes). If that were true, then all Unicode database (UCD) changes would make Python unstable. That's indeed the case - they do (see the recent bug report on white space processing). However, any change makes Python unstable (in the sense that it can potentially break existing applications), and, in many cases, the risk of breaking something is well worth it. In the case of number parsing, I think Python would be better if float() rejected non-ASCII strings, and any support for such parsing should be redone correctly in a different place (preferably along with printing of numbers). Most certainly it is: the documentation is either underspecified, or deviates from the implementation (when taking the most plausible interpretation). This is the very definition of bug. The implementation is not a bug and neither was this a bug in the 2.x series of the Python documentation. Of course the 2.x documentation is wrong, in that it is severely underspecified, and the most straight-forward interpretation of the specific wording gives an incorrect impression of the implementation. The Python 3.x docs apparently introduced a reference to the language spec which is clearly not capturing the wealth of possible inputs. Right - but only because the 2.x documentation *already* suggested that the supported syntax matches the literal syntax - as that's the most natural thing to assume. Regards, Martin ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
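Martin's proposal, that float() should reject non-ASCII strings, can be sketched as a wrapper that checks the input against the floatnumber grammar quoted earlier in the thread. This is only an illustration: `strict_float` is a hypothetical name, the optional sign is taken from the "possibly signed" wording, and special values such as "inf" and "nan" are deliberately left out.

```python
import re

# ASCII-only version of the quoted grammar:
#   [sign] (intpart ["." [digits]] | "." digits) [exponent]
# [0-9] is used on purpose: in a str pattern, \d would also match
# non-ASCII decimal digits.
_ASCII_FLOAT = re.compile(
    r"[+-]?(?:[0-9]+(?:\.[0-9]*)?|\.[0-9]+)(?:[eE][+-]?[0-9]+)?"
)


def strict_float(text):
    """Hypothetical helper: parse text only if it uses ASCII float syntax."""
    stripped = text.strip()
    if _ASCII_FLOAT.fullmatch(stripped) is None:
        raise ValueError(f"not an ASCII float literal: {text!r}")
    return float(stripped)


print(strict_float("1234.56"))
print(strict_float(" 1e3 "))
```

With such a wrapper, strings written with Arabic-Indic or fullwidth digits raise ValueError instead of being converted silently.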
Re: [Python-Dev] Python and the Unicode Character Database
Martin v. Löwis wrote: [...] For direct entry by an interactive user, yes. Why are some people in this discussion thinking only of direct entry by an interactive user? Ultimately, somebody will have entered the data. I don't think you really believe that all data processed by a computer was eventually manually entered by a someone :-) I already gave you a couple of examples of how such data can end up being input for Python number constructors. If you are still curious, please see the Wikipedia pages I linked to, or have a look at these keyboards: http://en.wikipedia.org/wiki/File:KB_Arabic_MAC.svg http://en.wikipedia.org/wiki/File:Keyboard_Layout_Sanskrit.png http://en.wikipedia.org/wiki/File:800px-KB_Thai_Kedmanee.png http://en.wikipedia.org/wiki/File:Tibetan_Keyboard.png http://en.wikipedia.org/wiki/File:KBD-DZ-noshift-2009.png (all referenced on http://en.wikipedia.org/wiki/Keyboard_layout) and then compare these to: http://www.unicode.org/Public/5.2.0/ucd/extracted/DerivedNumericType.txt Arabic numerals are being used a lot nowadays in Asian countries, but that doesn't mean that the native script versions are not being used anymore. Furthermore, data can well originate from texts that were written hundreds or even thousands of years ago, so there is plenty of material available for processing. Even if not entered directly, there are plenty of ways to convert Arabic numerals (or other numeral systems) to the above forms, e.g. in MS Office for Thai: http://office.microsoft.com/en-us/excel-help/convert-arabic-numbers-to-thai-text-format-HP003074364.aspx Anyway, as mentioned before: all this is really besides the point: If we want to support Unicode in Python, we have to also support conversion of numerals declared in Unicode into a form that can be processed by Python. Regardless of where such data originates. 
If we were not to follow this approach, we could just as well decide not to support reading Egyptian Hieroglyphs based on the argument that there's no keyboard to enter them... http://www.unicode.org/charts/PDF/U13000.pdf :-) (from http://www.unicode.org/charts/) Input from an existing text file, as I said earlier. Which *specific* existing text file? Have you actually *seen* such a text file? Have you tried Google? http://www.google.com/search?q=١٢٣ http://www.google.com/search?q=٣+site%3Agov.lb Some examples: http://www.bdl.gov.lb/circ/intpdf/int123.pdf http://www.cdr.gov.lb/study/sdatl/Arabic/Chapter3.PDF http://www.batroun.gov.lb/PDF/Waredat2006.pdf (these all use http://en.wikipedia.org/wiki/Eastern_Arabic_numerals) Direct entry at the console is a red herring. And we don't need power plants because power comes out of the socket. Martin, the argument simply doesn't fit well with the discussion about Python and Unicode. We introduced Unicode in Python not because there was a need for each and every code point in Unicode, but because we wanted to adopt a standard which doesn't prefer any one way of writing things over another. -- Marc-Andre Lemburg
Re: [Python-Dev] Python and the Unicode Character Database
Arabic numerals are being used a lot nowadays in Asian countries, but that doesn't mean that the native script versions are not being used anymore. I never claimed that people are not using their local scripts to enter numbers. However, none of your examples is about Chinese numerals using an ASCII full stop as a decimal point. The only thing I claimed about usage (actually only repeating haiyang kang's earlier claim) is that nobody would enter Chinese numerals with a keyboard and then use full stop as the decimal separator. So all your counter-examples just don't apply - I don't deny them. Regards, Martin
Re: [Python-Dev] Python and the Unicode Character Database
Martin v. Löwis wrote: Then these users should speak up and indicate their need, or somebody should speak up and confirm that there are users who actually want '١٢٣٤.٥٦' to denote 1234.56. To my knowledge, there is no writing system in which '١٢٣٤.٥٦e4' means 12345600.0. I'm not sure what you're after here. That the current float() constructor accepts tons of bogus character strings and accepts them as numbers, and that it should stop doing so. What bogus characters do the float() and int() constructors accept? As far as I can see, they only accepts numerals. [...] Notice that Python does *not* currently support printing numbers in other scripts - even though this may actually be more useful than parsing. Lack of one function, even if more useful, does not imply that an existing function should be removed. [...] In the case of number parsing, I think Python would be better if float() rejected non-ASCII strings, and any support for such parsing should be redone correctly in a different place (preferably along with printing of numbers). So your problems with the current behaviour are: (1) in some unspecified way, it's not done correctly; (2) it belongs somewhere other than float() and int(). That second is awfully close to bike-shedding. Since you accept that Python *should* have the current behaviour, and Python *already* has the current behaviour, it seems strange that you are kicking up such a fuss merely to *move* the implementation of that behaviour out of the numeric constructors into some unspecified different place. I think it would be constructive to explain: - how the current behaviour is incorrect; - your suggestions for correcting it; and - a concrete suggestion for where you would like to see the behaviour moved to, and why that would be better than where it currently is. 
-- Steven
Re: [Python-Dev] Python and the Unicode Character Database
On Thu, Dec 2, 2010 at 1:55 PM, Antoine Pitrou solip...@pitrou.net wrote: .. I don't think so. str.split() and str.splitlines() are also defined in conformance to the SPEC, AFAIK. They certainly try to. You are joking, right? Where exactly does Unicode specify something like this: ''.join('̀́̂'.split('\udf00\ud800')) '́̂' ? OK, splitting on a given separator has very little to do with Unicode or UCD, but str.splitlines() makes absolutely no attempt to conform to Unicode Standard Annex #14 (Unicode line breaking algorithm). Wait, UAX #14 is actually relevant to textwrap module which saw very little change since 2.x days. So, what exactly does str.splitlines() do? And which part of the Unicode standard defines how it is different from str.split(.., '\n')? Reference manual does not help me here either: str.splitlines([keepends]) Return a list of the lines in the string, breaking at line boundaries. Line breaks are not included in the resulting list unless keepends is given and true. http://docs.python.org/dev/library/stdtypes.html#str.splitlines ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
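The question of how str.splitlines() differs from splitting on '\n' can be checked directly. The sketch below assumes a current Python 3, where the library reference lists the accepted line boundaries (among them \r, \r\n, \v, \f, U+2028 and U+2029); that list is not given in this thread, and the sample string is made up for illustration.

```python
# Five "lines" separated by LF, CRLF, vertical tab and LINE SEPARATOR,
# with a trailing LF at the end.
text = "one\ntwo\r\nthree\x0bfour\u2028five\n"

# Splitting on '\n' only sees the LF characters and keeps a trailing
# empty string for the final newline.
by_newline = text.split("\n")

# splitlines() recognises the other boundaries too and drops the
# trailing empty element.
by_lines = text.splitlines()

print(by_newline)
print(by_lines)
```

The two results differ both in the number of pieces and in the leftover '\r', which is the kind of detail the one-sentence description in the reference manual quoted above leaves open.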
Re: [Python-Dev] Python and the Unicode Character Database
On Thursday, 2 December 2010 at 16:34 -0500, Alexander Belopolsky wrote: On Thu, Dec 2, 2010 at 1:55 PM, Antoine Pitrou solip...@pitrou.net wrote: .. I don't think so. str.split() and str.splitlines() are also defined in conformance to the SPEC, AFAIK. They certainly try to. You are joking, right? Perhaps you could look at the implementation.
Re: [Python-Dev] Python and the Unicode Character Database
Am 02.12.2010 22:30, schrieb Steven D'Aprano: Martin v. Löwis wrote: Then these users should speak up and indicate their need, or somebody should speak up and confirm that there are users who actually want '١٢٣٤.٥٦' to denote 1234.56. To my knowledge, there is no writing system in which '١٢٣٤.٥٦e4' means 12345600.0. I'm not sure what you're after here. That the current float() constructor accepts tons of bogus character strings and accepts them as numbers, and that it should stop doing so. What bogus characters do the float() and int() constructors accept? As far as I can see, they only accepts numerals. Not bogus characters, but bogus character strings. E.g. strings that mix digits from different scripts, and mix them with the Python decimal separator. Notice that Python does *not* currently support printing numbers in other scripts - even though this may actually be more useful than parsing. Lack of one function, even if more useful, does not imply that an existing function should be removed. No. But if the specific function(ality) is not useful and underspecified, it should be removed. So your problems with the current behaviour are: (1) in some unspecified way, it's not done correctly; No. My main concern is that it is not properly specified. If it was specified, I could then tell you what precisely is wrong about it. Right now, I can only give examples for input that it should not accept, and examples of input that it should, but does not accept. (2) it belongs somewhere other than float() and int(). That's only because it also needs a parameter to specify what syntax to follow, somehow. That parameter could be explicit or implicit, and it could be to float or to some other function. But it must be available, and is not. That second is awfully close to bike-shedding. Since you accept that Python *should* have the current behaviour No, I don't. I think it behaves incorrectly, accepting garbage input and guessing some meaning out of it. 
- how the current behaviour is incorrect; See above: it accepts strings that do not denote real numbers in any writing system, and, despite the claim that the feature is there to support other writing systems, actually does not truly support other writing systems. - your suggestions for correcting it; and Make the current implementation exactly match the current documentation. I think the documentation is correct; the implementation is wrong. - a concrete suggestion for where you would like to see the behaviour moved to, and why that would be better than where it currently is. The current behavior should go nowhere; it is not useful. Something very similar to the current behavior (but done correctly) should go into the locale module. Regards, Martin
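Two of Martin's points lend themselves to a short sketch: that the constructors accept digits mixed from different scripts, and that a proper parser needs a parameter naming the notation to follow. The code assumes a current CPython 3; `script_zeros` and `parse_number` are hypothetical helpers, not existing library functions, and relying on each script's ten digits being contiguous is an assumption about how Nd characters are encoded.

```python
import unicodedata

# Arabic-Indic 1, Devanagari 2, fullwidth 3: no writing system mixes these.
mixed = "\u0661\u0968\uff13"
mixed_value = int(mixed)  # accepted by the built-in constructor anyway


def script_zeros(text):
    """Return the code points of the 'zero' of every digit block used."""
    zeros = set()
    for ch in text:
        if unicodedata.category(ch) == "Nd":
            zeros.add(ord(ch) - unicodedata.decimal(ch))
    return zeros


def parse_number(text, zero="0", decimal_mark="."):
    """Hypothetical parser with an explicit notation parameter.

    Only the ten digits starting at `zero` and the given decimal mark
    are accepted; anything else raises ValueError.
    """
    allowed = {chr(ord(zero) + i): str(i) for i in range(10)}
    allowed[decimal_mark] = "."
    try:
        ascii_text = "".join(allowed[ch] for ch in text)
    except KeyError as exc:
        raise ValueError(f"unexpected character {exc.args[0]!r}") from None
    return float(ascii_text)


# Arabic-Indic digits with U+066B ARABIC DECIMAL SEPARATOR.
arabic = parse_number("\u0661\u0662\u0663\u0664\u066b\u0665\u0666",
                      zero="\u0660", decimal_mark="\u066b")
print(mixed_value, sorted(hex(z) for z in script_zeros(mixed)), arabic)
```

Whether such a parameter belongs on float() or in the locale module is exactly what the thread is arguing about; the sketch takes no side.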
Re: [Python-Dev] Python and the Unicode Character Database
On Thu, Dec 2, 2010 at 4:14 PM, M.-A. Lemburg m...@egenix.com wrote: .. Have you tried Google ? I tried Google and I could not find any plain text or HTML file that would use Arabic-Indic numerals. What was interesting, though, was that a search for quran unicode (without quotes) brought me to http://www.sacred-texts.com which says that they've been using Unicode since 2002 in their archives. Interestingly enough, their version of the Qur'an uses ordinary digits for ayah numbers. See, for example http://www.sacred-texts.com/isl/uq/050.htm. I will change my mind on this issue when you present a machine-readable file with Arabic-Indic numerals and a program capable of reading it and show that this program uses the same number parsing algorithm as Python's int() or float().
Re: [Python-Dev] Python and the Unicode Character Database
On Thu, Dec 2, 2010 at 8:23 PM, Martin v. Löwis mar...@v.loewis.de wrote: In the case of number parsing, I think Python would be better if float() rejected non-ASCII strings, and any support for such parsing should be redone correctly in a different place (preferably along with printing of numbers). +1. The set of strings currently accepted by the float constructor just seems too ad hoc to be at all useful. Apart from the decimal separator issue, and the question of exactly which decimal digits are accepted and which aren't, there are issues like this one: x = '\uff11\uff25\uff0b\uff11\uff10' x '1E+10' float(x) Traceback (most recent call last): File stdin, line 1, in module UnicodeEncodeError: 'decimal' codec can't encode character '\uff25' in position 1: invalid decimal Unicode string y = '\uff11E+\uff11\uff10' y '1E+10' float(y) 100.0 That is, fullwidth *digits* are allowed, but none of the other characters can be fullwidth variants. Unfortunately, a float string doesn't consist solely of digits, and it seems to me to make little sense to allow variation in the digits without allowing corresponding variations in the other characters that might appear ('.', 'e', 'E', '+', '-'). A couple of slightly trickier decisions: (1) the float constructor currently does accept leading and trailing whitespace; should it allow any Unicode whitespace characters here? I'd say yes. (2) For int() rather than float(), there's a bit more value in allowing the variant digits, since it provides an easy way to interpret those digits. The decimal module currently makes use of this, for example (the decimal spec requires that non-European digits be accepted). I'd be happier if this functionality were moved elsewhere, though. The int constructor is, if anything, currently worse off than float, thanks to its attempts to support non-decimal bases. There's value in having an easy-to-specify, easy-to-maintain API for these basic builtin functions. 
For one thing, it helps non-CPython implementations. [MAL] The Python 3.x docs apparently introduced a reference to the language spec which is clearly not capturing the wealth of possible inputs. That documentation update was my fault; I was motivated to make the update by issues unrelated to this one (mostly to do with Python 3's more consistent handling of inf and nan, as a result of all the new float-string conversion code). If I'd been thinking harder, I would have remembered that float accepted the non-European digits and added a note to that effect. This (unintentional) omission does underline the point that it's difficult right now to document and understand exactly what the float constructor does or doesn't accept. Mark
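Mark's fullwidth example can be rerun as a small script. This is a sketch assuming a current CPython 3, where the failure surfaces as a plain ValueError rather than the UnicodeEncodeError shown in the 2010 traceback; catching ValueError covers both, since UnicodeEncodeError is a subclass. `try_float` is an illustrative helper.

```python
# Fullwidth digit one, fullwidth E, fullwidth plus, fullwidth one and zero.
fullwidth_all = "\uff11\uff25\uff0b\uff11\uff10"

# Fullwidth digits only; 'E' and '+' are the ASCII characters.
fullwidth_digits = "\uff11E+\uff11\uff10"


def try_float(text):
    """Return float(text), or None if the string is rejected."""
    try:
        return float(text)
    except ValueError:
        return None


print(try_float(fullwidth_all))
print(try_float(fullwidth_digits))
```

The first string is rejected and the second is parsed, which illustrates the asymmetry described above: variant digits are translated, but variant forms of 'E', '+', '-' and '.' are not.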
Re: [Python-Dev] Python and the Unicode Character Database
On 12/2/2010 4:48 PM, Martin v. Löwis wrote: Am 02.12.2010 22:30, schrieb Steven D'Aprano: Martin v. Löwis wrote: Then these users should speak up and indicate their need, or somebody should speak up and confirm that there are users who actually want '١٢٣٤.٥٦' to denote 1234.56. To my knowledge, there is no writing system in which '١٢٣٤.٥٦e4' means 12345600.0. I'm not sure what you're after here. That the current float() constructor accepts tons of bogus character strings and accepts them as numbers, and that it should stop doing so. What bogus characters do the float() and int() constructors accept? As far as I can see, they only accepts numerals. Not bogus characters, but bogus character strings. E.g. strings that mix digits from different scripts, and mix them with the Python decimal separator. Notice that Python does *not* currently support printing numbers in other scripts - even though this may actually be more useful than parsing. Lack of one function, even if more useful, does not imply that an existing function should be removed. No. But if the specific function(ality) is not useful and underspecified, it should be removed. So your problems with the current behaviour are: (1) in some unspecified way, it's not done correctly; No. My main concern is that it is not properly specified. If it was specified, I could then tell you what precisely is wrong about it. Right now, I can only give examples for input that it should not accept, and examples of input that it should, but does not accept. (2) it belongs somewhere other than float() and int(). That's only because it also needs a parameter to specify what syntax to follow, somehow. That parameter could be explicit or implicit, and it could be to float or to some other function. But it must be available, and is not. That second is awfully close to bike-shedding. Since you accept that Python *should* have the current behaviour No, I don't. 
I think it behaves incorrectly, accepting garbage input and guessing some meaning out of it. - how the current behaviour is incorrect; See above: it accepts strings that do not denote real numbers in any writing system, and, despite the claim that the feature is there to support other writing systems, actually does not truly support other writing systems. - your suggestions for correcting it; and Make the current implementation exactly match the current documentation. I think the documentation is correct; the implementation is wrong. - a concrete suggestion for where you would like to see the behaviour moved to, and why that would be better than where it currently is. The current behavior should go nowhere; it is not useful. Something very similar to the current behavior (but done correctly) should go into the locale module. I agree with everything Martin says here. I think the basic premise is: you won't find strings in the wild that use non-ASCII digits but do use the ASCII dot as a decimal point. And that's what float() is looking for. (And that doesn't even begin to address what it expects for an exponent 'e'.) Eric.
Re: [Python-Dev] Python and the Unicode Character Database
Eric Smith wrote: The current behavior should go nowhere; it is not useful. Something very similar to the current behavior (but done correctly) should go into the locale module. I agree with everything Martin says here. I think the basic premise is: you won't find strings in the wild that use non-ASCII digits but do use the ASCII dot as a decimal point. And that's what float() is looking for. (And that doesn't even begin to address what it expects for an exponent 'e'.) http://en.wikipedia.org/wiki/Decimal_mark In China, comma and space are used to mark digit groups because dot is used as decimal mark. Note that float() can also parse integers, it just returns them as floats :-) -- Marc-Andre Lemburg
Re: [Python-Dev] Python and the Unicode Character Database
Alexander Belopolsky wrote: On Thu, Dec 2, 2010 at 4:14 PM, M.-A. Lemburg m...@egenix.com wrote: .. Have you tried Google ? I tried Google and I could not find any plain text or HTML file that would use Arabic-Indic numerals. What was interesting, though, was that a search for quran unicode (without quotes) brought me to http://www.sacred-texts.com which says that they've been using Unicode since 2002 in their archives. Interestingly enough, their version of the Qur'an uses ordinary digits for ayah numbers. See, for example http://www.sacred-texts.com/isl/uq/050.htm. I will change my mind on this issue when you present a machine-readable file with Arabic-Indic numerals and a program capable of reading it and show that this program uses the same number parsing algorithm as Python's int() or float(). Have you had a look at the examples I posted ? They include texts and tables with numbers written using Eastern Arabic numerals. Here's an example of a famous Chinese text using Chinese numerals: http://ctext.org/nine-chapters Unfortunately, the Chinese numerals are not listed in the category Nd, so Python won't be able to parse them. This has various reasons, it seems, one of them being that the numeral code points were not defined as a range of code points. I'm sure you can find other books on mathematics in Sanskrit or Arabic scripts as well. But this whole branch of the discussion is not going to go anywhere. The point is that we support all of Unicode in Python, not just a fragment, and therefore the numeric constructors support all of Unicode. Using them, it's very easy to support numbers in all kinds of variants, whether bound to a locale or not. Adding more locale-aware numeric parsers and formatters to the locale module, based on these APIs, is certainly a good idea, but orthogonal to the ongoing discussion, IMO.
-- Marc-Andre Lemburg
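The remark that Chinese numerals are not in category Nd can be verified with the unicodedata module. A minimal sketch, assuming a current CPython 3 and its bundled Unicode database; the choice of U+4E94 (the ideograph for five) is only an example.

```python
import unicodedata

five = "\u4e94"  # CJK ideograph for "five"

# The ideograph is a letter (Lo), not a decimal digit (Nd) ...
category = unicodedata.category(five)

# ... although the database does record a numeric value for it.
value = unicodedata.numeric(five)

# int() only translates Nd characters, so the ideograph is rejected.
try:
    parsed = int(five)
except ValueError:
    parsed = None

print(category, value, parsed)
```

So the numeric constructors cover the decimal digits of many scripts, but not numeral systems such as the Chinese one, which is not positional in the same way.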
Re: [Python-Dev] Python and the Unicode Character Database
Terry Reedy wrote:
> On 11/29/2010 10:19 AM, M.-A. Lemburg wrote:
>> Nick Coghlan wrote:
>>> On Mon, Nov 29, 2010 at 9:02 PM, M.-A. Lemburg m...@egenix.com wrote:
>>>> If we would go down that road, we would also have to disable other Unicode features based on locale, e.g. whether to apply non-ASCII case mappings, what to consider whitespace, etc. We don't do that for a good reason: Unicode is supposed to be universal and not limited to a single locale.
>>>
>>> Because parsing numbers is about more than just the characters used for the individual digits. There are additional semantics associated with digit ordering (for any number) and decimal separators and exponential notation (for floating point numbers) and those vary by locale. We deliberately chose to make the builtin numeric parsers unaware of all of those things, and assuming that we can simply parse other digits as if they were their ASCII equivalents and otherwise assume a C locale seems questionable.
>>
>> Sure, and those additional semantics are locale dependent, even between ASCII-only locales. However, that does not apply to the basic building blocks, the decimal digits themselves.
>>
>>> If the existing semantics can be adequately defined, documented and defended, then retaining them would be fine. However, the language reference needs to define the behaviour properly so that other implementations know what they need to support and what can be chalked up as being just an implementation accident of CPython. (As a point in the plus column, both decimal.Decimal and fractions.Fraction were able to handle the '١٢٣٤.٥٦' example in a manner consistent with the int and float handling)
>>
>> The support is built into the C API, so there's not really much surprise there.
>>
>> Regarding documentation, we'd just have to add that numbers may be made up of a Unicode code point in the category Nd.
>>
>> See http://www.unicode.org/versions/Unicode5.2.0/ch04.pdf, section 4.6 for details
>>
>> """
>> Decimal digits form a large subcategory of numbers consisting of those digits that can be used to form decimal-radix numbers. They include script-specific digits, but exclude characters such as Roman numerals and Greek acrophonic numerals. (Note that <1, 5> = 15 = fifteen, but <I, V> = IV = four.) Decimal digits also exclude the compatibility subscript or superscript digits to prevent simplistic parsers from misinterpreting their values in context.
>> """
>>
>> int(), float() and long() (in Python2) are such simplistic parsers.
>
> Since you are the knowledgeable advocate of the current behavior, perhaps you could open an issue and propose a doc patch, even if not .rst formatted.

Good suggestion. I tried to collect as much context as possible:

http://bugs.python.org/issue10610

I'll leave the rst-magic to someone else, but will certainly help if you have more questions about the details.

--
Marc-Andre Lemburg
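For readers who want to see the "category Nd" rule in action, here is a minimal sketch in Python 3 syntax. The sample strings and variable names are illustrative assumptions, not taken from the issue; the only library calls used are unicodedata.category() and the builtin int()/float() constructors.

```python
import unicodedata

arabic_indic = '\u0661\u0662\u0663\u0664'   # ARABIC-INDIC DIGIT ONE..FOUR
superscript_two = '\u00b2'                  # SUPERSCRIPT TWO, a compatibility digit

# Script-specific decimal digits are in general category Nd ...
nd_categories = [unicodedata.category(c) for c in arabic_indic]
# ... while superscript digits are in category No ("other number").
superscript_category = unicodedata.category(superscript_two)

# The builtin constructors accept Nd digits as if they were ASCII digits.
as_int = int(arabic_indic)
as_float = float(arabic_indic + '.' + '\u0665\u0666')

# Non-Nd "digits" are rejected by the same constructors.
try:
    int(superscript_two)
    superscript_accepted = True
except ValueError:
    superscript_accepted = False

print(nd_categories, superscript_category, as_int, as_float, superscript_accepted)
```

Note that the decimal point in the float() call is still the ASCII dot; only the digits themselves are looked up in the Unicode database.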
Re: [Python-Dev] Python and the Unicode Character Database
On Thu, Dec 2, 2010 at 5:58 PM, M.-A. Lemburg m...@egenix.com wrote: .. I will change my mind on this issue when you present a machine-readable file with Arabic-Indic numerals and a program capable of reading it and show that this program uses the same number parsing algorithm as Python's int() or float(). Have you had a look at the examples I posted ? They include texts and tables with numbers written using east asian arabic numerals. Yes, but this was all about output. I am pretty sure TeX was able to typeset Qur'an in all its glory long before Unicode was invented. Yet, in machine readable form it would be something like {\quran 1} (invented directive). I have asked for a file that is intended for machine processing, not for human enjoyment in print or on a display. I claim that if such file exists, the program that reads it does not use the same rules as Python and converting non-ascii digits would be a tiny portion of what that program does. ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Python and the Unicode Character Database
Am 02.12.2010 23:43, schrieb M.-A. Lemburg: Eric Smith wrote: The current behavior should go nowhere; it is not useful. Something very similar to the current behavior (but done correctly) should go into the locale module. I agree with everything Martin says here. I think the basic premise is: you won't find strings in the wild that use non-ASCII digits but do use the ASCII dot as a decimal point. And that's what float() is looking for. (And that doesn't even begin to address what it expects for an exponent 'e'.) http://en.wikipedia.org/wiki/Decimal_mark In China, comma and space are used to mark digit groups because dot is used as decimal mark. I may be misinterpreting that, but I think that refers to the case of writing numbers using Arabic digits. Chinese digits are, e.g., used in the Suzhou numerals http://en.wikipedia.org/wiki/Suzhou_numerals This doesn't have a decimal point at all. Instead, the second line (below or left to the actual digits) describes the power of ten and the unit of measurement (i.e. similar to scientific notation, but with ideographs for the powers of ten). In another writing system, they use 点 (U+70B9) as the decimal separator, see http://en.wikipedia.org/wiki/Chinese_numerals#Fractional_values In the same system, the integral part uses multipliers, i.e. 12345 is [1][1][2][1000][3][100][4][10][5]; the fractional part uses regular digits. Regards, Martin ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Python and the Unicode Character Database
On 12/2/2010 5:43 PM, M.-A. Lemburg wrote: Eric Smith wrote: The current behavior should go nowhere; it is not useful. Something very similar to the current behavior (but done correctly) should go into the locale module. I agree with everything Martin says here. I think the basic premise is: you won't find strings in the wild that use non-ASCII digits but do use the ASCII dot as a decimal point. And that's what float() is looking for. (And that doesn't even begin to address what it expects for an exponent 'e'.) http://en.wikipedia.org/wiki/Decimal_mark In China, comma and space are used to mark digit groups because dot is used as decimal mark. Is that an ASCII dot? That page doesn't say. Note that float() can also parse integers, it just returns them as floats :-) :) ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Python and the Unicode Character Database
The point is that we support all of Unicode in Python, not just a fragment, and therefore the numeric constructors support all of Unicode. That conclusion is as false today as it was in Python 1.6, but only now people start caring about that. a) we don't support all of Unicode in numeric constructors. There are lots of things that you can write down that readers would recognize as a real/rational/integral number that float() won't parse. b) if float() would restrict itself to the scientific notation of real numbers (as it should), Python could well continue to claim all of Unicode. Adding more locale aware numeric parsers and formatters to the locale module, based on these APIs is certainly a good idea, but orthogonal to the ongoing discussion, IMO. Not at all. The concept of Unicode numbers is flawed: Unicode does *not* prescribe any specific way to denote numbers. Unicode is about characters, and Python supports the Unicode characters for digits as well as it supports all the other Unicode characters. Instead, support for non-scientific notation of real numbers should be based on user needs, which probably can be approximated by looking at actual scripts. This, in turn, is inherently locale-dependent. Regards, Martin ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Python and the Unicode Character Database
Eric Smith wrote: On 12/2/2010 5:43 PM, M.-A. Lemburg wrote: Eric Smith wrote: The current behavior should go nowhere; it is not useful. Something very similar to the current behavior (but done correctly) should go into the locale module. I agree with everything Martin says here. I think the basic premise is: you won't find strings in the wild that use non-ASCII digits but do use the ASCII dot as a decimal point. And that's what float() is looking for. (And that doesn't even begin to address what it expects for an exponent 'e'.) http://en.wikipedia.org/wiki/Decimal_mark In China, comma and space are used to mark digit groups because dot is used as decimal mark. Is that an ASCII dot? That page doesn't say. Yes, but to be fair: I think that the page actually refers to the use of the Arabic numeral format in China, rather than with their own script symbols. Note that float() can also parse integers, it just returns them as floats :-) :) -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Dec 02 2010) Python/Zope Consulting and Support ...http://www.egenix.com/ mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/ mxODBC, mxDateTime, mxTextTools ...http://python.egenix.com/ ::: Try our new mxODBC.Connect Python Database Interface for free ! eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48 D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg Registered at Amtsgericht Duesseldorf: HRB 46611 http://www.egenix.com/company/contact/ ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Python and the Unicode Character Database
On Thu, Dec 2, 2010 at 4:14 PM, M.-A. Lemburg m...@egenix.com wrote:
> ..
> Some examples:
> http://www.bdl.gov.lb/circ/intpdf/int123.pdf

I looked at this one more closely. While I cannot understand what it says, it appears that Arabic numerals are used in dates. It looks like Python won't be able to deal with those:

    >>> datetime.strptime('١٩٩٩/١٠/٢٩', '%Y/%m/%d')
    ..
    ValueError: time data '١٩٩٩/١٠/٢٩' does not match format '%Y/%m/%d'

Interestingly,

    >>> datetime.strptime('١٩٩٩', '%Y')
    datetime.datetime(1999, 1, 1, 0, 0)

which further suggests that support of such numerals is accidental.

As I think more about it, though, I am becoming less averse to accepting these numerals for base 10 integers. Integers can be easily extracted from text using a simple regex, and '\d' accepts all category Nd characters. I would require, though, that all digits be from the same block, which is not hard because Unicode now promises to only have them in contiguous blocks of 10. This rule seems to address some of the security issues because it is unlikely that a system that can display some of the local digits would not be able to display all of them properly.

I still don't think it makes any sense to accept them in float().
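The regex-based extraction mentioned above can be sketched as follows. This is only an illustration under my own assumptions (the sample text is invented); it relies on the documented behaviour that, for str patterns in Python 3, '\d' matches any character in category Nd unless re.ASCII is given.

```python
import re

# Invented sample text mixing Arabic-Indic and ASCII digits.
text = 'invoice \u0661\u0669\u0669\u0669 dated 2010-12-02'

# Default (Unicode) matching: \d matches every Nd character.
unicode_runs = re.findall(r'\d+', text)

# With re.ASCII, \d is restricted to [0-9].
ascii_runs = re.findall(r'\d+', text, re.ASCII)

# int() accepts whatever the Unicode-aware \d+ matched.
values = [int(run) for run in unicode_runs]

print(unicode_runs, ascii_runs, values)
```

The point of the sketch is that '\d+' and int() agree on what a digit is, which is the consistency argument made for keeping the behaviour in int().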
Re: [Python-Dev] Python and the Unicode Character Database
Stephen J. Turnbull wrote:
> Steven D'Aprano writes:
>> With full respect to haiyang kang, hear-say from one person can hardly be described as strong evidence
>
> That's *disrespectful* nonsense. What Haiyang reported was not hearsay, it's direct observation of what he sees around him and personal experience, plus extrapolation. Look up hearsay, please.

Fair enough. I chose my words poorly and apologise. A better description would be anecdotal evidence.

--
Steven
Re: [Python-Dev] Python and the Unicode Character Database
On 12/2/2010 6:54 PM, Alexander Belopolsky wrote:
> On Thu, Dec 2, 2010 at 4:14 PM, M.-A. Lemburg m...@egenix.com wrote:
>> ..
>> Some examples:
>> http://www.bdl.gov.lb/circ/intpdf/int123.pdf
>
> I looked at this one more closely. While I cannot understand what it says, it appears that Arabic numerals are used in dates. It looks like Python won't be able to deal with those:

When I travelled in S. Asia around 25 years ago, arabic and indic numerals were in obvious use in stores, road signs, and banks (as with money exchange receipts). I learned the digits partly for self-protection ;-). I have no real idea of what is done *now* in computerized business, but I assume the native digits are used.

It may well be that there is no Python software yet that operates with native digits. The lack of direct output capability would hinder that. Of course, someone could run both input and output through language-specific str.translate digit translators.

> datetime.strptime('١٩٩٩/١٠/٢٩', '%Y/%m/%d')

Googling ١٩٩٩ gets about 83,000 hits.

> ..
> ValueError: time data '١٩٩٩/١٠/٢٩' does not match format '%Y/%m/%d'
>
> Interestingly,
>
> datetime.strptime('١٩٩٩', '%Y')
> datetime.datetime(1999, 1, 1, 0, 0)
>
> which further suggests that support of such numerals is accidental.
>
> As I think more about it, though, I am becoming less averse to accepting these numerals for base 10 integers.

Both input and output are needed for educational programming, though translation tables might be enough.

> Integers can be easily extracted from text using a simple regex, and '\d' accepts all category Nd characters. I would require, though, that all digits be from the same block, which is not hard because Unicode now promises to only have them in contiguous blocks of 10.

That seems sensible.

> This rule seems to address some of the security issues because it is unlikely that a system that can display some of the local digits would not be able to display all of them properly.
>
> I still don't think it makes any sense to accept them in float().

For the present, I would pretty well agree with that, at least until we know more.

You have raised an important issue. It is a bit of a chicken and egg problem though. We will not really know what is needed until Python is used more in non-english/non-euro contexts, while such usage may await better support.

--
Terry Jan Reedy
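The "language-specific str.translate digit translators" idea could look roughly like this. It is a sketch only: the table names and the parse_date/format_date helpers are hypothetical, and the tables cover just the Arabic-Indic block U+0660..U+0669.

```python
from datetime import datetime

ASCII_DIGITS = '0123456789'
# Build the ten Arabic-Indic digits from their code points (U+0660..U+0669).
ARABIC_INDIC_DIGITS = ''.join(chr(0x0660 + i) for i in range(10))

# One table per direction: native digits -> ASCII for input,
# ASCII -> native digits for output.
TO_ASCII = str.maketrans(ARABIC_INDIC_DIGITS, ASCII_DIGITS)
TO_ARABIC_INDIC = str.maketrans(ASCII_DIGITS, ARABIC_INDIC_DIGITS)

def parse_date(text):
    """Hypothetical helper: normalise digits, then parse with strptime."""
    return datetime.strptime(text.translate(TO_ASCII), '%Y/%m/%d')

def format_date(value):
    """Hypothetical helper: format with strftime, then localise digits."""
    return value.strftime('%Y/%m/%d').translate(TO_ARABIC_INDIC)

parsed = parse_date('\u0661\u0669\u0669\u0669/\u0661\u0660/\u0662\u0669')
print(parsed, format_date(parsed))
```

Translating on the way in avoids depending on which strptime directives happen to accept non-ASCII digits, and translating on the way out gives the output capability that the stdlib formatting routines lack.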
Re: [Python-Dev] Python and the Unicode Character Database
Lennart Regebro writes:
> 2010/12/2 Stephen J. Turnbull step...@xemacs.org:
>> T1000 = float('一.◯◯◯')
>
> That was already discussed here, and it's clear that unicode does not consider these characters to be something you can use in a decimal number, and hence it's not broken.

Huh? IOW, use Unicode features just because they're there, what the users want and use doesn't matter?

The only evidence I've seen so far that this feature is anything but a toy for a small faction of developers is Neil Hodgson's information that OOo will generate these kinds of digits (note that it *will* do Han! so the evidence is as good for users demanding Han numerals as for any other kind, Unicode.org definitions notwithstanding), and that DOS CP 864 contains the Indo/Arabic versions. Of course, it's quite possible that those were toys for the developers of those software packages too.
Re: [Python-Dev] Python and the Unicode Character Database
> Furthermore, data can well originate from texts that were written hundreds or even thousands of years ago, so there is plenty of material available for processing.

humm..., for this, i think we need a specially tuned language processing system to handle this, and one subsystem for one language :)... (sometimes a single word is not enough, we also need context)

Take pi for example: in modern math, it is written as 3.1415...; in old China, it is sometimes written as 三一四一五 or 三点一四一五 or 叁点壹肆壹伍.

And if these texts are extracted through a scanner (OCR or other image processing tech), in my POV, it is the job of this image processing subsystem (or some other subsystem between the image processing and the database) to do the mapping between number and raw text data. Example table in DB:

    text   | raw data   | raw image data
    -------|------------|---------------
    3.1415 | 三一四一五 | image...

br, khy
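A rough sketch of the kind of per-language mapping subsystem described above, for the simplest positional case only. The table and the helper name are my own assumptions; it handles the everyday digits 〇..九 and 点 as the decimal mark, and deliberately does not handle the financial forms (叁, 壹, ...) or multiplier notation, which need more context than a character table can provide.

```python
# Hypothetical mapping table: everyday Han digits plus the decimal mark.
HAN_TO_ASCII = str.maketrans({
    '〇': '0', '一': '1', '二': '2', '三': '3', '四': '4',
    '五': '5', '六': '6', '七': '7', '八': '8', '九': '9',
    '点': '.',
})

def han_positional_to_float(text):
    """Convert a purely positional Han numeral string to a float.

    Raises ValueError for anything the table does not cover.
    """
    return float(text.translate(HAN_TO_ASCII))

print(han_positional_to_float('三点一四一五'))

# Without the decimal mark the same digits are just an integer value,
# which is why the surrounding context matters.
print(han_positional_to_float('三一四一五'))

# The financial forms are not in the table, so float() rejects them.
try:
    han_positional_to_float('叁点壹肆壹伍')
    financial_ok = True
except ValueError:
    financial_ok = False
print(financial_ok)
```

The builtin float() rejects all of these strings unaided, since the Han ideographs are letters (category Lo), not decimal digits.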
Re: [Python-Dev] Python and the Unicode Character Database
Neil Hodgson writes: While I don't have Excel to test with, OpenOffice.org Calc will display in Arabic or Han numerals using the NatNum format codes. Display is different from input, but at least this is concrete evidence. Will it accept Arabic on input? (Han might be too much to ask for since Unicode considers Han digits to be impure.) Ditto Arabic, I would imagine; ISO 8859/6 (aka Latin/Arabic) does not contain the Arabic digits that have been presented here earlier AFAICT. DOS code page 864 does use 0xB0-0xB9 OK, Microsoft thought it would be useful. I'd still like to know whether people actually use them for input (or output, for that matter -- anybody have a corpus of Arabic Form 10-Ks to grep through?), but that's more concrete evidence than we've seen before. Thank you! ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Python and the Unicode Character Database
Antoine Pitrou writes:
> The legacy format argument looks like a red herring to me. When converting from a format to another it is the programmer's job to do his/her job right.

Uhmm, the argument *for* this feature proposed by several people is that Python's numeric constructors do it (right) so that the programmer doesn't have to.

If Python *doesn't* do it right, why should Python do it at all?
Re: [Python-Dev] Python and the Unicode Character Database
On Thu, Dec 2, 2010 at 4:57 PM, Mark Dickinson dicki...@gmail.com wrote:
> ..
> (the decimal spec requires that non-European digits be accepted).

Mark,

I think *requires* is too strong a word to describe what the spec says. The decimal module documentation refers to two authorities:

1. IBM’s General Decimal Arithmetic Specification
2. IEEE standard 854-1987

The IEEE standard predates Unicode and unsurprisingly does not have anything related to the issue. IBM's spec says the following in the Conversions section:

"""
It is recommended that implementations also provide additional number formatting routines (including some which are locale-dependent), and if available should accept non-European decimal digits in strings.
"""
http://speleotrove.com/decimal/daconvs.html

This cannot possibly be interpreted as normative text. The emphasis is clearly on formatting routines, with non-European decimal digits added as an afterthought. This recommendation can reasonably be interpreted as a requirement that conversion routines should accept what formatting routines can produce. In Python there are no formatting routines to produce non-European numerals, so there is no requirement to accept them in conversions.

I don't think the decimal module should support non-European decimal digits. The only place where it can make some sense is in int(), because here we have a fighting chance of producing a reasonable definition. The motivating use case is conversion of numerical data extracted from text using simple '\d+' regex matches.

Here is how I would do it:

1. String x of non-European decimal digits is only accepted in int(x), but not by int(x, 0) or int(x, 10).

2. If x contains one or more non-European digits, then

   (a) all digits must be from the same block:

       def basepoint(c):
           return ord(c) - unicodedata.digit(c)

       all(basepoint(c) == basepoint(x[0]) for c in x) -> True

   (b) and a '+' or '-' sign is not allowed.

3. A character c is a digit if it matches the '\d' regex. I think this means unicodedata.category(c) -> 'Nd'.

Condition 2(b) is important because there is no clear way to define what is acceptable as '+' or '-' using Unicode character properties, and not all number systems even support a local form of negation. (It is also YAGNI.)
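The three rules above can be sketched as a standalone function. This is one reading of the proposal, not an implementation from the thread: strict_int is a hypothetical name, pure-ASCII input is simply delegated to int(), and whitespace handling for non-ASCII input is left out.

```python
import unicodedata

def basepoint(c):
    """Code point of the zero digit in the block of ten that contains c."""
    return ord(c) - unicodedata.digit(c)

def strict_int(x):
    """Hypothetical int() variant following rules 1-3 of the proposal."""
    # Pure ASCII input keeps the ordinary int() behaviour (signs included).
    if all(ord(c) < 128 for c in x):
        return int(x)
    # Rule 3 and rule 2(b): every character must be a category Nd digit,
    # which also rules out '+' and '-' signs.
    if any(unicodedata.category(c) != 'Nd' for c in x):
        raise ValueError('non-digit character in %r' % x)
    # Rule 2(a): all digits must come from the same block of ten.
    if any(basepoint(c) != basepoint(x[0]) for c in x):
        raise ValueError('digits from different blocks in %r' % x)
    return int(x)

print(strict_int('\u0661\u0662\u0663'))   # three Arabic-Indic digits
print(strict_int('-12'))                  # plain ASCII is unaffected
```

Mixing an ASCII digit with an Arabic-Indic one, or prefixing a sign to non-ASCII digits, raises ValueError under this reading, which is the security-motivated restriction the proposal is after.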
Re: [Python-Dev] Python and the Unicode Character Database
Terry Reedy wrote:
> On 11/30/2010 10:05 AM, Alexander Belopolsky wrote:
>
> My general answers to the questions you have raised are as follows:
>
> 1. Each new feature release should use the latest version of the UCD as of the first beta release (or perhaps a week or so before). New chars are new features and the beta period can be used to (hopefully) iron out any bugs introduced by a new UCD version.

The UCD is versioned just like Python is, so if the Unicode Consortium decides to ship a 5.2.1 version of the UCD, we can add that to Python 2.7.x, since Python 2.7 started out with 5.2.0.

> 2. The language specification should not be UCD version specific. Martin pointed out that the definition of identifiers was intentionally written to not be, by referring to 'current version' or some such. On the other hand, the UCD version used should be programmatically discoverable, perhaps as an attribute of sys or str.

It already is and has been for a while, e.g. Python 2.5:

    >>> import unicodedata
    >>> unicodedata.unidata_version
    '4.1.0'

> 3. The UCD should not change in bugfix releases. New chars are new features. Adding them in bugfix releases will introduce gratuitous incompatibilities between releases. People who want the latest Unicode should either upgrade to the latest Python version or patch an older version (but not expect core support for any problems that creates).

See above. Patch level revisions of the UCD are fine for patch level releases of Python, since those patch level revisions of the UCD fix bugs just like we do in Python.

Note that each new UCD major.minor version is a new standard on its own, so it's perfectly ok to stick with one such standard version per Python version.

--
Marc-Andre Lemburg
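As a small illustration of the version discoverability discussed in this message, the following sketch reads the UCD version at run time. The ucd_version name is my own; unicodedata.unidata_version and unicodedata.ucd_3_2_0 are the documented module attributes, and the actual version printed depends on the Python build.

```python
import unicodedata

# Version of the Unicode Character Database compiled into this Python.
ucd_version = tuple(int(part) for part in unicodedata.unidata_version.split('.'))
print('UCD version:', unicodedata.unidata_version, ucd_version)

# The module also exposes a frozen 3.2.0 database for protocols
# that mandate that specific version.
print('Legacy database:', unicodedata.ucd_3_2_0.unidata_version)

# Code that depends on newer character properties can check the version first.
if ucd_version >= (6, 0, 0):
    print('Unicode 6.0 property changes are in effect')
```

Parsing the string into a tuple makes version comparisons behave numerically, so that for example 10.0.0 sorts after 9.0.0.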
Re: [Python-Dev] Python and the Unicode Character Database
Martin v. Löwis wrote:
> Am 30.11.2010 21:24, schrieb Ben Finney:
>> haiyang kang corn...@gmail.com writes:
>>> I think it is a little ugly to have code like this: num = float("一.一"), expected result is: num = 1.1
>>
>> That's a straw man, though. The string need not be a literal in the program; it can be input to the program.
>>
>>     num = float(input_from_the_external_world)
>>
>> Does that change your assessment of whether non-ASCII digits are used?
>
> I think the OP (haiyang kang) already indicated that he finds it quite unlikely that anybody would possibly want to enter that. You would need a number of key strokes to enter each individual ideograph, plus you have to press the keys for keyboard layout switching to enter the Latin decimal separator (which you normally wouldn't use along with the Han numerals).

That's a somewhat limited view, IMHO. Numbers are not always entered using a computer keyboard; you have tools like cash registers, special numeric keypads, scanners, OCR, etc. for external entry, and you also have other programs producing such output, e.g. MS Office if configured that way.

The argument with the decimal point doesn't work well either, since it's obvious that float() and int() do not support localized input. E.g. in Germany we write 3,141 instead of 3.141:

    >>> float('3,141')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    ValueError: invalid literal for float(): 3,141

No surprise there.

The localization of the input data, e.g. removal of thousands separators and conversion of decimal marks to the dot, has to be done by the application, just like you have to now for German floating point number literals. The locale module already has locale.atof() and locale.atoi() for just this purpose.
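The application-side normalisation described above might be sketched like this. The helper name and its parameters are hypothetical; it hard-codes the separators instead of asking the locale module, so it runs regardless of which locales are installed. locale.atof() is the standard route when the desired locale is available and has been selected with locale.setlocale().

```python
def parse_localized_float(text, decimal_mark=',', group_sep='.'):
    """Hypothetical helper: strip grouping, map the decimal mark to '.'.

    The defaults follow the German convention used in the example above.
    """
    normalized = text.replace(group_sep, '').replace(decimal_mark, '.')
    return float(normalized)

small = parse_localized_float('3,141')
large = parse_localized_float('1.234.567,89')
print(small, large)

# Because float() also accepts category Nd digits, the same helper works
# for native digits combined with a local decimal mark.
native = parse_localized_float('\u0663,\u0661\u0664\u0661')
print(native)
```

This keeps the division of labour argued for in the message: the application deals with separators, and the builtin constructor only has to understand digits.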
FYI, here's a list of decimal digits supported by Python 2.7:

http://www.unicode.org/Public/5.2.0/ucd/extracted/DerivedNumericType.txt:

0030..0039   ; Decimal # Nd [10] DIGIT ZERO..DIGIT NINE
0660..0669   ; Decimal # Nd [10] ARABIC-INDIC DIGIT ZERO..ARABIC-INDIC DIGIT NINE
06F0..06F9   ; Decimal # Nd [10] EXTENDED ARABIC-INDIC DIGIT ZERO..EXTENDED ARABIC-INDIC DIGIT NINE
07C0..07C9   ; Decimal # Nd [10] NKO DIGIT ZERO..NKO DIGIT NINE
0966..096F   ; Decimal # Nd [10] DEVANAGARI DIGIT ZERO..DEVANAGARI DIGIT NINE
09E6..09EF   ; Decimal # Nd [10] BENGALI DIGIT ZERO..BENGALI DIGIT NINE
0A66..0A6F   ; Decimal # Nd [10] GURMUKHI DIGIT ZERO..GURMUKHI DIGIT NINE
0AE6..0AEF   ; Decimal # Nd [10] GUJARATI DIGIT ZERO..GUJARATI DIGIT NINE
0B66..0B6F   ; Decimal # Nd [10] ORIYA DIGIT ZERO..ORIYA DIGIT NINE
0BE6..0BEF   ; Decimal # Nd [10] TAMIL DIGIT ZERO..TAMIL DIGIT NINE
0C66..0C6F   ; Decimal # Nd [10] TELUGU DIGIT ZERO..TELUGU DIGIT NINE
0CE6..0CEF   ; Decimal # Nd [10] KANNADA DIGIT ZERO..KANNADA DIGIT NINE
0D66..0D6F   ; Decimal # Nd [10] MALAYALAM DIGIT ZERO..MALAYALAM DIGIT NINE
0E50..0E59   ; Decimal # Nd [10] THAI DIGIT ZERO..THAI DIGIT NINE
0ED0..0ED9   ; Decimal # Nd [10] LAO DIGIT ZERO..LAO DIGIT NINE
0F20..0F29   ; Decimal # Nd [10] TIBETAN DIGIT ZERO..TIBETAN DIGIT NINE
1040..1049   ; Decimal # Nd [10] MYANMAR DIGIT ZERO..MYANMAR DIGIT NINE
1090..1099   ; Decimal # Nd [10] MYANMAR SHAN DIGIT ZERO..MYANMAR SHAN DIGIT NINE
17E0..17E9   ; Decimal # Nd [10] KHMER DIGIT ZERO..KHMER DIGIT NINE
1810..1819   ; Decimal # Nd [10] MONGOLIAN DIGIT ZERO..MONGOLIAN DIGIT NINE
1946..194F   ; Decimal # Nd [10] LIMBU DIGIT ZERO..LIMBU DIGIT NINE
19D0..19DA   ; Decimal # Nd [11] NEW TAI LUE DIGIT ZERO..NEW TAI LUE THAM DIGIT ONE
1A80..1A89   ; Decimal # Nd [10] TAI THAM HORA DIGIT ZERO..TAI THAM HORA DIGIT NINE
1A90..1A99   ; Decimal # Nd [10] TAI THAM THAM DIGIT ZERO..TAI THAM THAM DIGIT NINE
1B50..1B59   ; Decimal # Nd [10] BALINESE DIGIT ZERO..BALINESE DIGIT NINE
1BB0..1BB9   ; Decimal # Nd [10] SUNDANESE DIGIT ZERO..SUNDANESE DIGIT NINE
1C40..1C49   ; Decimal # Nd [10] LEPCHA DIGIT ZERO..LEPCHA DIGIT NINE
1C50..1C59   ; Decimal # Nd [10] OL CHIKI DIGIT ZERO..OL CHIKI DIGIT NINE
A620..A629   ; Decimal # Nd [10] VAI DIGIT ZERO..VAI DIGIT NINE
A8D0..A8D9   ; Decimal # Nd [10] SAURASHTRA DIGIT ZERO..SAURASHTRA DIGIT NINE
A900..A909   ; Decimal # Nd [10] KAYAH LI DIGIT ZERO..KAYAH LI DIGIT NINE
A9D0..A9D9   ; Decimal # Nd [10] JAVANESE DIGIT ZERO..JAVANESE DIGIT NINE
AA50..AA59   ; Decimal # Nd [10] CHAM DIGIT ZERO..CHAM DIGIT NINE
ABF0..ABF9   ; Decimal # Nd [10] MEETEI MAYEK DIGIT ZERO..MEETEI MAYEK DIGIT NINE
FF10..FF19   ; Decimal # Nd [10] FULLWIDTH DIGIT ZERO..FULLWIDTH DIGIT NINE
104A0..104A9 ; Decimal # Nd [10] OSMANYA DIGIT ZERO..OSMANYA DIGIT NINE
1D7CE..1D7FF ; Decimal # Nd [50] MATHEMATICAL BOLD DIGIT ZERO..MATHEMATICAL MONOSPACE DIGIT NINE

The Chinese and Japanese ideographs are not supported because of the way they are defined in the Unihan database. I'm currently investigating how we could support them as well.

-- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source
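A list like the one above can be regenerated from whatever UCD version a given Python ships. The sketch below is an assumption-laden illustration (the function name is mine): it scans all code points and keeps those in category Nd whose decimal value is 0, i.e. the start of each run of ten digits. The result will differ from the 5.2.0 list on newer Pythons because later Unicode versions added scripts.

```python
import sys
import unicodedata

def decimal_zero_code_points():
    """Return the code points of every category Nd character with value 0."""
    zeros = []
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        if unicodedata.category(ch) == 'Nd' and unicodedata.decimal(ch) == 0:
            zeros.append(cp)
    return zeros

zeros = decimal_zero_code_points()
print('UCD', unicodedata.unidata_version, 'has', len(zeros), 'decimal digit zeros')
for cp in zeros[:5]:
    # Show the first few blocks with the official character names.
    print('%04X %s' % (cp, unicodedata.name(chr(cp))))
```

The Han ideograph for one (U+4E00) does not show up in such a scan because its general category is Lo rather than Nd, which matches the limitation noted at the end of the message.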
Re: [Python-Dev] Python and the Unicode Character Database
Terry Reedy wrote: On 11/30/2010 3:23 AM, Stephen J. Turnbull wrote: I see no reason not to make a similar promise for numeric literals. I see no good reason to allow compatibility full-width Japanese ASCII numerals or Arabic cursive numerals in for i in range(...) for example. I do not think that anyone, at least not me, has argued for anything other than 0-9 digits (or 0-f for hex) in literals in program code. The only issue is whether non-programmer *users* should be able to use their native digits in applications in response to input prompts. Me neither. This is solely about Python being able to parse numeric input in the float(), int() and complex() constructors. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Dec 01 2010) Python/Zope Consulting and Support ...http://www.egenix.com/ mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/ mxODBC, mxDateTime, mxTextTools ...http://python.egenix.com/ ::: Try our new mxODBC.Connect Python Database Interface for free ! eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48 D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg Registered at Amtsgericht Duesseldorf: HRB 46611 http://www.egenix.com/company/contact/ ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Python and the Unicode Character Database
Martin v. Löwis wrote: Am 30.11.2010 23:43, schrieb Terry Reedy: On 11/30/2010 3:23 AM, Stephen J. Turnbull wrote: I see no reason not to make a similar promise for numeric literals. I see no good reason to allow compatibility full-width Japanese ASCII numerals or Arabic cursive numerals in for i in range(...) for example. I do not think that anyone, at least not me, has argued for anything other than 0-9 digits (or 0-f for hex) in literals in program code. The only issue is whether non-programmer *users* should be able to use their native digits in applications in response to input prompts. And here, my observation stands: if they wanted to, they currently couldn't - at least not for real numbers (and also not for integers if they want to use grouping). So the presumed application of this feature doesn't actually work, despite the presence of the feature it was supposedly meant to enable. By that argument, English speakers wanting to enter integers using Arabic numerals can't either! I'd like to use grouping for large literals, if only I could think of a half-decent syntax, and if only Python supported it. This fails on both counts: x = 123_456_789_012_345 The lack of grouping and the lack of a native decimal point doesn't mean that the feature doesn't work -- it merely means the feature requires some compromise before it can be used. In the same way, if I wanted to enter a number using non-Arabic digits, it works provided I compromise by using the Anglo-American decimal point instead of the European comma or the native decimal point I might prefer. The lack of support for non-dot decimal points is arguably a bug that should be fixed, not a reason to remove functionality. -- Steven ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Python and the Unicode Character Database
On Tue, Nov 30, 2010 at 09:23, Stephen J. Turnbull step...@xemacs.org wrote:
> Sure you can. In Python program text, all keywords will be ASCII

Yes, yes, sure, but not the contents of variables,

> I see no reason not to make a similar promise for numeric literals.

Wait what, literals? The example was

    float('١٢٣٤.٥٦')

which doesn't have any numeric literals in it at all. Does that work? Nope, it's a syntax error. Too bad, that would have been cool, but whatever.

Why would this be a problem:

    >>> T1234 = float('١٢٣٤.٥٦')
    >>> T1234
    1234.56

But this OK?

    >>> T١٢٣٤ = float('1234.56')
    >>> T١٢٣٤
    1234.56

I don't see that.

Should we bother to implement ١٢٣٤.٥٦ as a literal equivalent to 1234.56? Well, not unless somebody asks for it, or it turns out to be easy. :-) But that's another question.
Re: [Python-Dev] Python and the Unicode Character Database
On Sun, Nov 28, 2010 at 5:48 PM, M.-A. Lemburg m...@egenix.com wrote:
> ..
> With Python 3.1:
>
>     >>> exec('\u0CF1 = 1')
>     Traceback (most recent call last):
>       File "<stdin>", line 1, in <module>
>       File "<string>", line 1
>         ೱ = 1
>           ^
>     SyntaxError: invalid character in identifier
>
> but with Python 3.2a4:
>
>     >>> exec('\u0CF1 = 1')
>     >>> eval('\u0CF1')
>     1
>
> Such changes are not new, but I agree that they should probably be highlighted in the What's new in Python x.x.

As of today, What’s New In Python 3.2 [1] does not even mention the unicodedata upgrade to 6.0.0. Here are the features from the unicode.org summary [2] that I think should be reflected in Python's What's New document:

* adds 2,088 characters, including over 1,000 additional symbols, chief among them the additional emoji symbols, which are especially important for mobile phones;

* corrects character properties for existing characters including
  - a general category change to two Kannada characters (U+0CF1, U+0CF2), which has the effect of making them newly eligible for inclusion in identifiers;
  - a general category change to one New Tai Lue numeric character (U+19DA), which would have the effect of disqualifying it from inclusion in identifiers unless grandfathering measures are in place for the defining identifier syntax.

The above may be too verbose for inclusion to What’s New In Python 3.2, but I think we should add a possibly shorter summary with a link to unicode.org for details.

PS: Yes, I think everyone should know about the Python 3.2 killer feature: ('\N{CAT FACE WITH WRY SMILE}'!

[1] http://docs.python.org/dev/whatsnew/3.2.html
[2] http://www.unicode.org/versions/Unicode6.0.0/
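The effect of the two property changes can be inspected without exec(), using str.isidentifier() and unicodedata.category(). This is a sketch with variable names of my own choosing; the printed categories depend on the UCD version of the running Python, so no particular output is claimed beyond what the thread itself reports for 3.2.

```python
import unicodedata

kannada_sign = '\u0CF1'           # one of the two Kannada characters mentioned
new_tai_lue_tham_one = '\u19DA'   # the New Tai Lue numeric character mentioned

print('UCD version:', unicodedata.unidata_version)

# With Unicode 6.0 or later, the Kannada sign may start an identifier.
kannada_is_identifier = kannada_sign.isidentifier()
print(unicodedata.category(kannada_sign), kannada_is_identifier)

# A digit-like character can only continue an identifier, so test it
# after an ASCII letter; the result shows whether grandfathering applies.
print(unicodedata.category(new_tai_lue_tham_one),
      ('a' + new_tai_lue_tham_one).isidentifier())

# Named escapes are resolved against the same database.
print('\N{CAT FACE WITH WRY SMILE}')
```

Checks like these make a compact regression test for the identifier-related consequences of a UCD upgrade, which is the kind of change the proposed What's New entry is meant to flag.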
Re: [Python-Dev] Python and the Unicode Character Database
On 12/1/2010 12:55 PM, Alexander Belopolsky wrote: On Sun, Nov 28, 2010 at 5:48 PM, M.-A. Lemburg m...@egenix.com wrote: .. With Python 3.1: >>> exec('\u0CF1 = 1') Traceback (most recent call last): File "<stdin>", line 1, in <module> File "<string>", line 1 ೱ = 1 ^ SyntaxError: invalid character in identifier but with Python 3.2a4: >>> exec('\u0CF1 = 1') >>> eval('\u0CF1') 1 Such changes are not new, but I agree that they should probably be highlighted in the What's new in Python x.x. As of today, What’s New In Python 3.2 [1] does not even mention the unicodedata upgrade to 6.0.0. Here are the features from the unicode.org summary [2] that I think should be reflected in Python's What's New document: * adds 2,088 characters, including over 1,000 additional symbols, chief among them the additional emoji symbols, which are especially important for mobile phones; * corrects character properties for existing characters including - a general category change to two Kannada characters (U+0CF1, U+0CF2), which has the effect of making them newly eligible for inclusion in identifiers; - a general category change to one New Tai Lue numeric character (U+19DA), which would have the effect of disqualifying it from inclusion in identifiers unless grandfathering measures are in place for the defining identifier syntax. The above may be too verbose for inclusion in What’s New In Python 3.2, I think those 11 lines are pretty good. Put them in ('\N{CAT FACE WITH WRY SMILE}'! Plus give a link to Unicode site (Issue numbers are implicit links). -- Terry Jan Reedy
Re: [Python-Dev] Python and the Unicode Character Database
And here, my observation stands: if they wanted to, they currently couldn't - at least not for real numbers (and also not for integers if they want to use grouping). So the presumed application of this feature doesn't actually work, despite the presence of the feature it was supposedly meant to enable. By that argument, English speakers wanting to enter integers using Arabic numerals can't either! That's correct, and the key point here for the argument. It's just not *meant* to support localized number forms, but deliberately constrains them to a formal grammar which users using it must be aware of in order to use it. I'd like to use grouping for large literals, if only I could think of a half-decent syntax, and if only Python supported it. This fails on both counts: x = 123_456_789_012_345 Here you are confusing issues, though: this fragment uses the syntax of the Python programming language. Whether or not the syntax of the float() constructor arguments matches that syntax is also a subject of the debate. I take it that you speak in favor of the float syntax also being used for the float() constructor. The lack of grouping and the lack of a native decimal point doesn't mean that the feature doesn't work -- it merely means the feature requires some compromise before it can be used. No, it means that the Python programming language syntax for floating point numbers just doesn't take local notation into account *at all*. This is not a flaw - it just means that this feature is non-existent. Now, for the float() constructor, some people in this thread have claimed that it *is* aimed at people who want to enter numbers in their local spellings. I claim that this feature either doesn't work, or is absent also. In the same way, if I wanted to enter a number using non-Arabic digits, it works provided I compromise by using the Anglo-American decimal point instead of the European comma or the native decimal point I might prefer. 
Why would you want that if what you really wanted could not be done? There certainly *is* a way to convert strings into floats, and there would be a way if that restricted itself to the digits 0..9. So it can't be the mere desire to convert strings to float that makes you ask for non-ASCII digits. The lack of support for non-dot decimal points is arguably a bug that should be fixed, not a reason to remove functionality. I keep repeating my two concerns: a) if that was a feature, it is not specified at all in the documentation. In fact, the documentation was recently clarified to deny existence of that feature. b) fixing it will be much more difficult than you apparently think. Regards, Martin
Re: [Python-Dev] Python and the Unicode Character Database
I think the OP (haiyang kang) already indicated that he finds it quite unlikely that anybody would possibly want to enter that. Who's talking about *entering* it into the program at a keyboard directly, though? Input to a program can come from all kinds of crazy sources. Just because it wasn't typed by the person at the keyboard using this program doesn't stop it being input to the program. I think haiyang kang claimed exactly that - it won't ever be input to a program. I trust him on that - and so should you, unless you have sufficient experience with the Chinese language and writing system. Note that I'm not saying this is common. Nor am I saying it's a desirable situation. I'm saying it is a feasible use case, to be dismissed only if there is strong evidence that it's not used by existing Python code. And indeed, for the Chinese numerals, we have such strong evidence. Regards, Martin
Re: [Python-Dev] Python and the Unicode Character Database
As of today, What’s New In Python 3.2 [1] does not even mention the unicodedata upgrade to 6.0.0. One reason was that I was instructed not to change What's New a few years ago. Regards, Martin
Re: [Python-Dev] Python and the Unicode Character Database
Martin v. Löwis wrote: I think the OP (haiyang kang) already indicated that he finds it quite unlikely that anybody would possibly want to enter that. Who's talking about *entering* it into the program at a keyboard directly, though? Input to a program can come from all kinds of crazy sources. Just because it wasn't typed by the person at the keyboard using this program doesn't stop it being input to the program. I think haiyang kang claimed exactly that - it won't ever be input to a program. I trust him on that - and so should you, unless you have sufficient experience with the Chinese language and writing system. Note that I'm not saying this is common. Nor am I saying it's a desirable situation. I'm saying it is a feasible use case, to be dismissed only if there is strong evidence that it's not used by existing Python code. And indeed, for the Chinese numerals, we have such strong evidence. With full respect to haiyang kang, hearsay from one person can hardly be described as strong evidence -- particularly, as Alexander Belopolsky pointed out, since the use-case described isn't currently supported by Python. Given that what haiyang kang describes *can't* be done, the fact that people don't do it is hardly surprising -- nor is it a good reason for taking away functionality that does exist. -- Steven
Re: [Python-Dev] Python and the Unicode Character Database
On Wed, Dec 1, 2010 at 5:36 PM, Martin v. Löwis mar...@v.loewis.de wrote: .. Note that I'm not saying this is common. Nor am I saying it's a desirable situation. I'm saying it is a feasible use case, to be dismissed only if there is strong evidence that it's not used by existing Python code. And indeed, for the Chinese numerals, we have such strong evidence. Indeed: in the more than 10 years that Python's int() has accepted Arabic-Indic numerals, nobody has complained that it *did not* accept Chinese.
Re: [Python-Dev] Python and the Unicode Character Database
Martin v. Löwis wrote: And here, my observation stands: if they wanted to, they currently couldn't - at least not for real numbers (and also not for integers if they want to use grouping). So the presumed application of this feature doesn't actually work, despite the presence of the feature it was supposedly meant to enable. By that argument, English speakers wanting to enter integers using Arabic numerals can't either! That's correct, and the key point here for the argument. It's just not *meant* to support localized number forms, but deliberately constrains them to a formal grammar which users using it must be aware of in order to use it. You're *agreeing* that English speakers can't enter integers using Arabic numerals? What do you think I'm doing when I do this? >>> int("1234") 1234 Ah wait... did you think I meant Arabic numerals in the sense of digits used by Arabs in Arabia? I meant Arabic numerals as opposed to Roman numerals. Sorry for the confusion. Your argument was that even though Python's int() supports many non-ASCII digits, the lack of grouping means that it doesn't actually work. If that argument were correct, then it would apply equally to ASCII digits. It's clearly nonsense to say that int("1234") doesn't work just because of the lack of grouping. It's equally nonsense to say that int("١٢٣٤") doesn't work because of the lack of grouping. [...] I take it that you speak in favor of the float syntax also being used for the float() constructor. I'm sorry, I don't understand what you mean here. I've repeatedly said that the syntax for numeric literals should remain constrained to the ASCII digits, as it currently is. n = ١٢٣٤ gives a SyntaxError, and I don't want to see that change. But I've also argued that, since the float constructor currently accepts non-ASCII strings: n = int("١٢٣٤"), we should continue to support the existing behaviour. 
None of the arguments against it seem convincing to me, particularly since the opponents of the current behaviour admit that there is a use-case for it, but they just want it to move elsewhere, such as the locale module. We've even heard from one person -- I forget who, sorry -- who claimed that C++ has the same behaviour, and if you want ASCII-only digits, you have to explicitly ask for it. For what it's worth, Microsoft warns developers not to assume users will enter numeric data using ASCII digits: Number representation can also use non-ASCII native digits, so your application may encounter characters other than 0-9 as inputs. Avoid filtering on U+0030 through U+0039 to prevent frustration for users who are trying to enter data using non-ASCII digits. http://msdn.microsoft.com/en-us/magazine/cc163506.aspx There was a similar discussion going on in Perl-land recently: http://www.nntp.perl.org/group/perl.perl5.porters/2010/07/msg162400.html although, being Perl, the discussion was dominated by concerns about regexes and implicit conversions, rather than an explicit call to float() or int() as we are discussing here. [...] In the same way, if I wanted to enter a number using non-Arabic digits, it works provided I compromise by using the Anglo-American decimal point instead of the European comma or the native decimal point I might prefer. Why would you want that, if, what you really wanted, could not be done. There certainly *is* a way to convert strings into floats, and there would be a way if that restricted itself to the digits 0..9. So it can't be the mere desire to convert strings to float that make you ask for non-ASCII digits. Why do Europeans use programming languages that force them to use a dot instead of a comma for the decimal place? Why do I misspell string.centre as string.center? Because if you want to get something done, you use the tools you have and not the tools you'd like to have. 
-- Steven
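The distinction drawn in the message above, between numeric literals in program text and numeric strings handed to a constructor, can be seen directly. A minimal sketch, assuming Python 3; the name literal_accepted is invented for the illustration.

```python
# Non-ASCII digits in program text: the compiler rejects them.
try:
    compile('n = ١٢٣٤', '<example>', 'exec')
    literal_accepted = True
except SyntaxError:
    literal_accepted = False
print(literal_accepted)

# The same digits inside a string are accepted by the constructors.
print(int('١٢٣٤'))
print(float('١٢٣٤.٥٦'))
```

Note that the decimal point in the float() call is still the ASCII full stop, which is exactly the compromise the thread argues about.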
Re: [Python-Dev] Python and the Unicode Character Database
Lennart Regebro writes: On Tue, Nov 30, 2010 at 09:23, Stephen J. Turnbull step...@xemacs.org wrote: Sure you can. In Python program text, all keywords will be ASCII Yes, yes, sure, but not the contents of variables, Irrelevant, you're not converting these to a string representation. If you're generating numerals for internal use, I don't see why you would want to do arithmetic on them; conversion is a YAGNI. This is only interesting to allow naive users to input in a comfortable way. As yet there is no evidence that there are *any* such naive users, 1.3 billion possibles are shut out, and at least two cultures which use non-ASCII numerals every day, representing 1.3 billion naive users (the coincidence of numbers is no coincidence), have reported that nobody in their right mind would *input* the numbers that way, and at least for Japanese, the use cases are not really numeric anyway. I see no reason not to make a similar promise for numeric literals. Wait what, literals? Sorry, my bad. Why would this be a problem: T1234 = float('.~~') T1234 1234.56 But this OK? T = float('1234.56') T 1234.56 (Sorry, the Arabic is going to get munged, my mailer is beta and somebody screwed up.) Because the characters in the identifier are uninterpreted and have no syntactic content other than their identity. They're arbitrary. That's not true of numerics. Because that works, but print(T1234) doesn't (it prints ASCII). You can't round-trip, but users will want/expect that. Because that works but this doesn't: T1000 = float('一.◯◯◯') Violates TOOWTDI. If you're proposing to fix the numeric parsers, I still don't like it but I could go to -0 on it. However as Alexander points out and MAL admits, it's apparently not so easy to do that.
Re: [Python-Dev] Python and the Unicode Character Database
On Wed, Dec 1, 2010 at 7:17 PM, Steven D'Aprano st...@pearwood.info wrote: .. we should continue to support the existing behaviour. None of the arguments against it seem convincing to me, particularly since the opponents of the current behaviour admit that there is a use-case for it, but they just want it to move elsewhere, such as the locale module. I don't remember who made this argument, but I think you misunderstood it. The argument was that if there was a use case for parsing Eastern Arabic numerals, it would be better served by a module written by someone who speaks one of the Arabic languages and knows the details of how Eastern Arabic numerals are written. So far nobody has even claimed to know conclusively that Arabic-Indic digits are always written left-to-right. >>> unicodedata.bidirectional('٤') 'AN' is not very helpful because it means "any Arabic-Indic digit" according to unicode.org. (To me, a special category hints that it may be written in either direction and the proper interpretation may depend on context.) I have not seen a real use case reported in this thread and for theoretical use cases, the current implementation is either outright wrong or does not solve the problem completely. Given that a function that replaces all Unicode digits in a string with 0-9 can be written in 3 lines of Python code, it is very unlikely that anyone would prefer to rely on undocumented behavior of Python builtins instead of having explicit control over parsing of their data.
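The digit-replacing function mentioned at the end of the message above might look roughly like this. It is a sketch of one possible approach rather than the author's own code: the name ascii_digits is hypothetical, and only characters with a Unicode decimal-digit value are converted, so Han numerals such as 一 pass through unchanged.

```python
import unicodedata

def ascii_digits(text):
    """Replace each Unicode decimal digit with its ASCII equivalent."""
    out = []
    for ch in text:
        # unicodedata.decimal returns the default (None) for non-digits.
        value = unicodedata.decimal(ch, None)
        out.append(ch if value is None else str(value))
    return ''.join(out)

print(ascii_digits('١٢٣٤.٥٦'))
print(float(ascii_digits('١٢٣٤.٥٦')))
```

Keeping the conversion in application code like this gives the caller explicit control over which scripts and separators are accepted before int() or float() ever sees the text.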
Re: [Python-Dev] Python and the Unicode Character Database
Steven D'Aprano writes: With full respect to haiyang kang, hear-say from one person can hardly be described as strong evidence That's *disrespectful* nonsense. What Haiyang reported was not hearsay, it's direct observation of what he sees around him and personal experience, plus extrapolation. Look up "hearsay", please. Furthermore, he provided good *objective* reason (excessive cost, to which I can also testify, in several different input methods for Japanese) why numbers simply would not be input that way. What's left is copy/paste via the mouse. I assure you, every day I see dozens of Japanese copy/pasting *only* ASCII numerals, and the sales figures for Microsoft Excel (not to mention the download numbers for Open Office) strongly suggest that 30 million Japanese salarymen are similarly dedicated to ASCII. (That's not hearsay either, that's direct observation and extrapolation, which is more than the "we need float to translate Arabic" supporters can offer.) I have seen only *one* use case: it's a toy for sophisticated programmers who want to think of themselves as broadminded. We've seen several examples of that in this thread, so I can't deny that is a real use case. Please, give us just *one* more real use case that isn't "somebody might".
Re: [Python-Dev] Python and the Unicode Character Database
Stephen J. Turnbull step...@xemacs.org writes: Furthermore, he provided good *objective* reason (excessive cost, to which I can also testify, in several different input methods for Japanese) why numbers simply would not be input that way. What's left is copy/paste via the mouse. For direct entry by an interactive user, yes. Why are some people in this discussion thinking only of direct entry by an interactive user? Input to a program comes from various sources other than direct entry by the interactive user, as has been pointed out many times. Please, give us just *one* more real use case that isn't "somebody might". Input from an existing text file, as I said earlier. Or any other way of text data making its way into a Python program. Direct entry at the console is a red herring. -- “First things first, but not necessarily in that order.” (The Doctor, _Doctor Who_) Ben Finney
Re: [Python-Dev] Python and the Unicode Character Database
On 12/1/2010 7:44 PM, Alexander Belopolsky wrote: The argument was that if there was a use case for parsing Eastern Arabic numerals, it would be better served by a module written by someone who speaks one of the Arabic languages and knows the details of how Eastern Arabic numerals are written. So far nobody has even claimed to know conclusively that Arabic-Indic digits are always written left-to-right. Both my personal observations when travelling from Turkey to India and Wikipedia say yes. "When representing a number in Arabic, the lowest-valued position is placed on the right, so the order of positions is the same as in left-to-right scripts." https://secure.wikimedia.org/wikipedia/en/wiki/Arabic_language#Numerals -- Terry Jan Reedy
Re: [Python-Dev] Python and the Unicode Character Database
On Wed, Dec 1, 2010 at 10:11 PM, Terry Reedy tjre...@udel.edu wrote: On 12/1/2010 7:44 PM, Alexander Belopolsky wrote: The argument was that if there was a use case for parsing Eastern Arabic numerals, it would be better served by a module written by someone who speaks one of the Arabic languages and knows the details of how Eastern Arabic numerals are written. So far nobody has even claimed to know conclusively that Arabic-Indic digits are always written left-to-right. Both my personal observations when travelling from Turkey to India and Wikipedia say yes. "When representing a number in Arabic, the lowest-valued position is placed on the right, so the order of positions is the same as in left-to-right scripts." https://secure.wikimedia.org/wikipedia/en/wiki/Arabic_language#Numerals This matches my limited research on this topic as well. However, I am not sure that when these codes are embedded in Arabic text, their logical order always matches their display order. It seems to me that it can go either way depending on the surrounding text and/or presence of explicit formatting codes. Also, I don't understand why Eastern Arabic-Indic digits have the same Bidi-Class as European digits, but Arabic-Indic digits, Arabic decimal and thousands separators have Bidi-Class AN. http://www.unicode.org/reports/tr9/tr9-23.html#Bidirectional_Character_Types
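The Bidi-Class observation in the message above can be reproduced with unicodedata.bidirectional. A small sketch; the sample labels and the bidi_classes name are made up for the illustration, and the values reflect whatever Unicode database the running interpreter bundles.

```python
import unicodedata

samples = {
    'ASCII digit four': '4',
    'Arabic-Indic digit four': '\u0664',
    'Extended (Eastern) Arabic-Indic digit four': '\u06F4',
    'Arabic decimal separator': '\u066B',
    'Arabic thousands separator': '\u066C',
}

# Map each label to the bidirectional class of its character.
bidi_classes = {name: unicodedata.bidirectional(ch)
                for name, ch in samples.items()}

for name, bidi_class in bidi_classes.items():
    print(name, bidi_class)
```

EN stands for European Number and AN for Arabic Number in UAX #9; the class only tells the bidirectional algorithm how to lay the characters out and says nothing about numeric value.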
Re: [Python-Dev] Python and the Unicode Character Database
Ben Finney writes: Input from an existing text file, as I said earlier. Or any other way of text data making its way into a Python program. Direct entry at the console is a red herring. I don't think it is. Not at all. Here's why: '''print "%d" % some_integer''' doesn't now, and never will (unless Kristan gets his Python 2.8 <wink>), produce Arabic or Han numerals. Not in any language I know of, not in Microsoft Excel, and definitely not in Python 2. *Somebody* typed that text at some point. If it's Han, that somebody had *way* too much time on his hands, not a working accountant nor a graduate assistant in a research lab for sure. How about old archived texts, copied and recopied? At least for Japanese, old archival (text) data will *all* be in ASCII, because the earliest implementations of Japanese language text used JIS X 0201 (or its predecessor), which doesn't have Han digits (and kana digits don't exist even if you write with a brush and ink AFAIK). Ditto Arabic, I would imagine; ISO 8859/6 (aka Latin/Arabic) does not contain the Arabic digits that have been presented here earlier AFAICT. Note that there's plenty of space for them in that code table (eg, 0xB0-0xB9 is empty). Apparently nobody *ever* thought it was useful to have them! So, which culture, using which script and in which application, inputs numeric data in other than ASCII digits? Or would want to, if only somebody would tell them they can do it in Python? Hearsay will do, for starters.
Re: [Python-Dev] Python and the Unicode Character Database
On Sun, Nov 28, 2010 at 21:24, Alexander Belopolsky alexander.belopol...@gmail.com wrote: While we have little choice but to follow UCD in defining str.isidentifier(), I think Python can promise users more stability in what it treats as space or as a digit in its builtins. Why? I can see this is a problem if one character that earlier was allowed no longer is. That breaks backwards compatibility. This doesn't. float('١٢٣٤.٥٦') 1234.56 is more important than to assure users that once their program accepted some text as a number, they can assume that the text is ASCII. *I* think it is more important. In Python 3, you can never ever assume anything is ASCII any more. ASCII is practically dead and buried as far as Python goes, unless you explicitly encode to it. def deposit(self, amountstr): self.balance += float(amountstr) audit_log("Deposited: " + amountstr) Auditor: $ cat numbered-account.log Deposited: ?.?? That log reasonably should be in UTF-8 or something else, in which case this is not a problem. And that's ignoring that it makes way more sense to log the numerical amount. -- Lennart Regebro: http://regebro.wordpress.com/ Python 3 Porting: http://python3porting.com/ +33 661 58 14 64
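The audit-log scenario in the message above comes down to which encoding the log is written in and what exactly is logged. The following sketch uses stand-in names (amountstr, line, ascii_log, utf8_log) instead of the deposit() and audit_log() functions quoted above, whose real implementations are not shown in the thread.

```python
amountstr = '١٢٣٤.٥٦'
line = 'Deposited: ' + amountstr

# An ASCII-only log with replacement loses every non-ASCII digit.
ascii_log = line.encode('ascii', errors='replace')
print(ascii_log)

# A UTF-8 log keeps the original text intact.
utf8_log = line.encode('utf-8')
print(utf8_log.decode('utf-8'))

# Logging the parsed amount avoids the question altogether.
print('Deposited: %.2f' % float(amountstr))
```

The last variant corresponds to the suggestion that it makes more sense to log the numerical amount than the raw input string.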
Re: [Python-Dev] Python and the Unicode Character Database
During PEP 3003 discussion, it was suggested to handle it on a case by case basis, but I don't see discussion of the upgrade to 6.0.0 in PEP 3003. It's covered by "As the standard library is not directly tied to the language definition it is not covered by this moratorium." How is this restricted to the stdlib if it defines the set of valid identifiers? - Hagen
Re: [Python-Dev] Python and the Unicode Character Database
Lennart Regebro writes: *I* think it is more important. In Python 3, you can never ever assume anything is ASCII any more. Sure you can. In Python program text, all keywords will be ASCII (English, even, though it may be en_NL.UTF-8 <wink>) for the foreseeable future. I see no reason not to make a similar promise for numeric literals. I see no good reason to allow compatibility full-width Japanese ASCII numerals or Arabic cursive numerals in "for i in range(...)" for example. As soon as somebody gives an example of a culture, however minor, that uses computers but actively prefers to use non-ASCII numerals to express numbers in an IT context, I'll review my thinking. But at the moment it's 101% YAGNI.
Re: [Python-Dev] Python and the Unicode Character Database
hi, I agree with this. I have never seen anyone in China use Chinese number literals (there are at least two kinds: 一 and 壹, both meaning 1) in a Python program, except for UI output. They can do some mapping when they want to output these non-ASCII numbers. Example: if 1: print "一" I think it is a little ugly to have code like this: num = float("一.一"), where the expected result is: num = 1.1 br, khy On Tue, Nov 30, 2010 at 4:23 PM, Stephen J. Turnbull step...@xemacs.org wrote: Lennart Regebro writes: *I* think it is more important. In Python 3, you can never ever assume anything is ASCII any more. Sure you can. In Python program text, all keywords will be ASCII (English, even, though it may be en_NL.UTF-8 <wink>) for the foreseeable future. I see no reason not to make a similar promise for numeric literals. I see no good reason to allow compatibility full-width Japanese ASCII numerals or Arabic cursive numerals in "for i in range(...)" for example. As soon as somebody gives an example of a culture, however minor, that uses computers but actively prefers to use non-ASCII numerals to express numbers in an IT context, I'll review my thinking. But at the moment it's 101% YAGNI.
Re: [Python-Dev] Python and the Unicode Character Database
haiyang kang wrote: hi, I agree with this. I have never seen anyone in China use Chinese number literals (there are at least two kinds: 一 and 壹, both meaning 1) in a Python program, except for UI output. They can do some mapping when they want to output these non-ASCII numbers. Example: if 1: print "一" I think it is a little ugly to have code like this: num = float("一.一"), where the expected result is: num = 1.1 I don't expect that anyone would sensibly write code like that, except for testing. You wouldn't write num = float("1.1") instead of just num = 1.1 either. But you should be able to write: text = input("Enter a number using your preferred digits: ") num = float(text) without caring whether the user enters 一.一 or 1.1 or something else. -- Steven
Re: [Python-Dev] Python and the Unicode Character Database
Stephen J. Turnbull wrote: Lennart Regebro writes: *I* think it is more important. In Python 3, you can never ever assume anything is ASCII any more. Sure you can. In Python program text, all keywords will be ASCII (English, even, though it may be en_NL.UTF-8 <wink>) for the foreseeable future. I see no reason not to make a similar promise for numeric literals. I see no good reason to allow compatibility full-width Japanese ASCII numerals or Arabic cursive numerals in "for i in range(...)" for example. I agree with you that numeric *literals* should be restricted to the ASCII digits. I don't think anyone here is arguing differently -- if they are, they should speak up and try to make the case for allowing numeric literals in arbitrary scripts. Python doesn't currently allow non-ASCII numeric literals, and even if such a change were desirable, it would run up against the moratorium. So let's just forget the specter of code like: x = math.sqrt(١٢٣٤.٥٦ ** 一.一) It ain't gonna happen :) But I think there is a good case for allowing the constructors int, float and complex to continue to accept numeric *strings* with non-ASCII digits. The code already exists, there are probably people out there who rely on it, and in the absence of any convincing demonstration that the existing behaviour is causing widespread difficulty, we should leave well enough alone. Various people have suggested that there should be a function in the locale module that handles numeric string input in non-ASCII digits. This is a de facto admission that there are use-cases for taking user input like the string '٣' and turning it into the int 3. Python can already do this, and has been able to for many years: [st...@sylar ~]$ python2.4 Python 2.4.6 (#1, Mar 30 2009, 10:08:01) [GCC 4.1.2 20070925 (Red Hat 4.1.2-27)] on linux2 Type "help", "copyright", "credits" or "license" for more information. >>> int(u'٣') 3 It seems to me that there's no need to move this functionality into locale. 
-- Steven
Re: [Python-Dev] Python and the Unicode Character Database
On Wed, 01 Dec 2010 00:23:22 +1100 Steven D'Aprano st...@pearwood.info wrote: But I think there is a good case for allowing the constructors int, float and complex to continue to accept numeric *strings* with non-ASCII digits. The code already exists, there's probably people out there who rely on it, and in the absence of any convincing demonstration that the existing behaviour is causing widespread difficulty, we should leave well-enough alone. +1 It seems to me that there's no need to move this functionality into locale. Not only, but moving it into locale won't make it easier to maintain anyway. Regards Antoine.
Re: [Python-Dev] Python and the Unicode Character Database
On Tue, Nov 30, 2010 at 7:59 AM, Steven D'Aprano st...@pearwood.info wrote:
..
> But you should be able to write:
>
>     text = input("Enter a number using your preferred digits: ")
>     num = float(text)
>
> without caring whether the user enters 一.一 or 1.1 or something else.

I find it ironic that people who argue for preservation of the current behavior do it without checking what it actually is:

    >>> float('一.一')
    ..
    UnicodeEncodeError: 'decimal' codec can't encode character '\u4e00' ..

This is one of the biggest problems with this feature. It does not fit users' expectations. Even the original author of the decimal codec expected the above to work. [1]

> Python can already do this, and has been able to for many years:
>
>     >>> int(u'٣')
>     3

but you can do this without support from int() as well:

    >>> import unicodedata
    >>> unicodedata.digit('٣')
    3

and for Unihan numbers, you can do

    >>> unicodedata.numeric('一')
    1.0

and

    >>> unicodedata.numeric('ⅷ')
    8.0

and if you are so inclined,

    >>> [unicodedata.numeric(c) for c in ↂ ↁ ⅗ ⅞ ij.split()]
    [1.0, 5000.0, 0.6, 0.875, 9.0]

Do you want to see all these supported by float()?

[1] "makeunicodedata.py does not support Unihan digit data" http://bugs.python.org/issue10575
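The distinction drawn in this message, between characters that carry a digit value and characters that only carry a numeric value, can be checked directly. This is a minimal sketch, assuming a current Python 3 interpreter; the variable names are illustrative only, and the characters are written as escapes to avoid copy-and-paste damage.

```python
import unicodedata

# ARABIC-INDIC DIGIT THREE: a decimal digit (category Nd).
arabic_three = "\u0663"
digit_value = unicodedata.digit(arabic_three)

# CJK ideograph "one": numeric, but not a decimal digit.
han_one = "\u4e00"
han_numeric = unicodedata.numeric(han_one)
han_is_decimal = han_one.isdecimal()

# SMALL ROMAN NUMERAL EIGHT: also numeric only.
roman_eight = unicodedata.numeric("\u2177")

# float() only understands decimal digits, so the ideographic
# spelling is rejected (as ValueError on current interpreters).
try:
    float(han_one + "." + han_one)
    float_accepts_han = True
except ValueError:
    float_accepts_han = False

print(digit_value, han_numeric, han_is_decimal, roman_eight, float_accepts_han)
```

The exception type differs from the UnicodeEncodeError quoted above, which came from the interpreter of the time; the sketch only shows that the string is still not accepted.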
Re: [Python-Dev] Python and the Unicode Character Database
> But you should be able to write:
>
>     text = input("Enter a number using your preferred digits: ")
>     num = float(text)
>
> without caring whether the user enters 一.一 or 1.1 or something else.

Yes, from a logical point of view, this can happen. But I really doubt that there are users who would like to input numbers like that. It would mean that they first use the Google Pinyin input method to enter 一, then switch to the English input method to enter ".", then switch back to Google Pinyin for the other 一; or maybe you mean they enter the whole 一.一 with the Google Pinyin input method. To input 1, users only need one keystroke, but to input 一, they need three (yi SPACE).

Of course, users can also input something accidentally, but then we just need to give them a friendly reminder. At least the coders around me restrict their users to entering numbers in ASCII, and it seems that users are still happy with ASCII numbers :).

br, khy
Re: [Python-Dev] Python and the Unicode Character Database
On Mon, Nov 29, 2010 at 4:13 PM, Martin v. Löwis mar...@v.loewis.de wrote:
> > - Should Python documentation refer to the specific version of Unicode that it supports?
>
> You mean, mention it somewhere? Sure (although it would be nice if the documentation generator would automatically extract it from the source, just as it extracts the Python version number). Of course, such mentioning should explain that this is specific to CPython, and not an aspect of Python-the-language.
>
> > Current documentation refers to old versions. Should version be updated or removed to imply the latest?
>
> What specific reference are you referring to?

I found two places: a reference to Unicode 3.0 (!) in the Data Model section and a reference to 5.2.0 in unicodedata docs. See http://mail.python.org/pipermail/docs/2010-November/002074.html

> > - How should UCD updates be handled during the language moratorium?
>
> It's clearly not affected.

This is not what Guido said last year:

    > One question: There are currently a number of patches waiting on the tracker for additional Unicode feature support and it's also likely that we'll want to upgrade to a more recent Unicode version within the next few years. How would such indirect changes be seen under the moratorium?

    That would fall under the Case-by-Case Exemptions section. "Within the next few years" sounds like it might well wait until the moratorium is ended though. :-)

    http://mail.python.org/pipermail/python-dev/2009-November/093666.html

I don't see it as a big deal, but technically speaking, with Unicode 6.0 changing the properties of two characters so that they become valid in identifiers, the Python language definition is affected. For example, an alternative implementation based on 5.2.0 will not accept a valid CPython program that uses one of these characters.

> > During PEP 3003 discussion, it was suggested to handle it on a case by case basis, but I don't see discussion of the upgrade to 6.0.0 in PEP 3003.
>
> It's covered by "As the standard library is not directly tied to the language definition it is not covered by this moratorium."

See above. Also, it has been suggested that semantics of built-ins cannot change. (If that was so, it would put the int('١٢٣٤') debate to rest at least for the time being. :-)

> > Should this upgrade be backported to 2.7?
>
> No, it's a new feature.

Given that 2.7 will be maintained for 5 years and arguably Unicode Consortium takes backward compatibility very seriously, wouldn't it make sense to consider a backport at some point? I am sure we will soon see a bug report that the following does not work in 2.7: :-)

    >>> ord('\N{CAT FACE WITH WRY SMILE}')
    128572

> > - How specific should library reference manual be in defining methods affected by UCD such as str.upper()?
>
> It should specify what this actually does in Unicode terminology (probably in addition to a layman's rephrase of that)

I opened an issue for this: http://bugs.python.org/issue10587

..
> > For example, if '\U'.isalpha() returns true in one implementation, can it return false in another?
>
> Implementations are free to use any version of the UCD.

I was more concerned about wide and narrow unicode CPython builds. Is it a bug that '\U'.isalpha() may disagree even when the two implementations are based on the same version of UCD?

Thanks for your answers.
Re: [Python-Dev] Python and the Unicode Character Database
On Tue, Nov 30, 2010 at 9:56 AM, haiyang kang corn...@gmail.com wrote:
> > But you should be able to write:
> >
> >     text = input("Enter a number using your preferred digits: ")
> >     num = float(text)
> >
> > without caring whether the user enters 一.一 or 1.1 or something else.
>
> yes. from logical point of view, this can happen. ...

Please stop discussing a non-feature. Python's float *does not* accept '一.一'. This was reported as a bug and closed as invalid. See "makeunicodedata.py does not support Unihan digit data" http://bugs.python.org/issue10575
Re: [Python-Dev] Python and the Unicode Character Database
Alexander Belopolsky alexander.belopol...@gmail.com wrote:
> On Tue, Nov 30, 2010 at 9:56 AM, haiyang kang corn...@gmail.com wrote:
> > > But you should be able to write:
> > >
> > >     text = input("Enter a number using your preferred digits: ")
> > >     num = float(text)
> > >
> > > without caring whether the user enters 一.一 or 1.1 or something else.
> >
> > yes. from logical point of view, this can happen. ...
>
> Please stop discussing a non-feature. Python's float *does not* accept '一.一'. This was reported as a bug and closed as invalid.

That seems irrelevant to me. One of the main topics of this thread is whether actual native speakers would be happy with ascii-only input for float(). haiyang kang confirmed that this is the case.

I hope that more local speakers will contribute their views.

Stefan Krah
Re: [Python-Dev] Python and the Unicode Character Database
On Mon, Nov 29, 2010 at 2:38 PM, Alexander Belopolsky alexander.belopol...@gmail.com wrote:
..
> > Still, if it's not detrimental and it's not difficult to support, then why do you care?
>
> It is difficult to support. A fix for issue10557 would be much simpler if we did not support non-European digits. I now added a patch that handles non-ascii digits, so you can see what's involved. Note that when Unicode Consortium inevitably adds more Nd characters to the non-BMP planes, we will have to add surrogate pairs' support to this code.

It turns out that this did in fact happen:

    # Newly assigned in Unicode 3.1.0 (March, 2001)
    ..
    1D7CE..1D7FF ; 3.1 # [50] MATHEMATICAL BOLD DIGIT ZERO..MATHEMATICAL MONOSPACE DIGIT NINE

See http://unicode.org/Public/UNIDATA/DerivedAge.txt

And of course,

    >>> unicodedata.digit('\U0001D7CE')
    0

but

    >>> int('\U0001D7CE')
    ..
    UnicodeEncodeError: 'decimal' codec can't encode character '\ud835' ..

on a narrow Unicode build. (Note the character reported in the error message!)

If you think non-ASCII digits are not difficult to support, please contribute to the following tracker issues:

http://bugs.python.org/issue10581 (Review and document string format accepted in numeric data type constructors)
http://bugs.python.org/issue10557 (Malformed error message from float())
http://bugs.python.org/issue10435 (Document unicode C-API in reST - Specifically, PyUnicode_EncodeDecimal)
http://bugs.python.org/issue8646 (PyUnicode_EncodeDecimal is undocumented)
http://bugs.python.org/issue6632 (Include more fullwidth chars in the decimal codec)

and back to the issue of user confusion

http://bugs.python.org/issue652104 [closed/invalid] (int(u"\u1234") raises UnicodeEncodeError by Guido van Rossum)
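The '\ud835' in that error message is the high half of the UTF-16 surrogate pair for U+1D7CE, which is what a narrow build stored internally. A small sketch that makes the pair visible; it assumes a Python 3.3 or later interpreter, where strings are no longer stored as UTF-16 and int() is expected to accept the character directly. The names below are illustrative.

```python
import unicodedata

bold_zero = "\U0001D7CE"  # MATHEMATICAL BOLD DIGIT ZERO

category = unicodedata.category(bold_zero)
digit = unicodedata.digit(bold_zero)

# Encode to UTF-16 (big endian, no BOM) and split into 16-bit code units,
# which is how a narrow build would have seen the character.
utf16 = bold_zero.encode("utf-16-be")
code_units = [int.from_bytes(utf16[i:i + 2], "big") for i in range(0, len(utf16), 2)]

# On an interpreter without the narrow/wide split this is one code point.
length = len(bold_zero)
parsed = int(bold_zero)

print(category, digit, [hex(u) for u in code_units], length, parsed)
```

Code that walks a string one 16-bit unit at a time sees 0xD835 first, finds no digit value for a lone surrogate, and reports that unit rather than the real character.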
Re: [Python-Dev] Python and the Unicode Character Database
On 30/11/2010 16:40, Alexander Belopolsky wrote:
> [snip...]
> And of course,
>
>     >>> unicodedata.digit('\U0001D7CE')
>     0
>
> but
>
>     >>> int('\U0001D7CE')
>     ..
>     UnicodeEncodeError: 'decimal' codec can't encode character '\ud835' ..
>
> on a narrow Unicode build. (Note the character reported in the error message!)
>
> If you think non-ASCII digits are not difficult to support, please contribute to the following tracker issues:

Would moving this functionality to the locale module make the issues any easier to fix?

Michael

> http://bugs.python.org/issue10581 (Review and document string format accepted in numeric data type constructors)
> http://bugs.python.org/issue10557 (Malformed error message from float())
> http://bugs.python.org/issue10435 (Document unicode C-API in reST - Specifically, PyUnicode_EncodeDecimal)
> http://bugs.python.org/issue8646 (PyUnicode_EncodeDecimal is undocumented)
> http://bugs.python.org/issue6632 (Include more fullwidth chars in the decimal codec)
>
> and back to the issue of user confusion
>
> http://bugs.python.org/issue652104 [closed/invalid] (int(u"\u1234") raises UnicodeEncodeError by Guido van Rossum)

-- http://www.voidspace.org.uk/
Re: [Python-Dev] Python and the Unicode Character Database
On Tue, Nov 30, 2010 at 12:40 PM, Michael Foord fuzzy...@voidspace.org.uk wrote:
..
> > If you think non-ASCII digits are not difficult to support, please contribute to the following tracker issues:
>
> Would moving this functionality to the locale module make the issues any easier to fix?

Sure, if we code it in Python, supporting it will be much easier:

    def normalize_digits(s):
        digits = {m.group(1) for m in re.finditer(r'(\d)', s)}
        trtab = {ord(d): str(unicodedata.digit(d)) for d in digits}
        return s.translate(trtab)

    >>> normalize_digits('١٢٣٤.٥٦')
    '1234.56'

I am not sure this belongs to the locale module, however. It seems to me, something like 'unicodealgo' for unicode algorithms would be more appropriate.
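For readers who want to try the idea, here is a regex-free variant of the same function, written as a self-contained sketch. It is not the code from the message: it swaps in str.isdecimal() and unicodedata.decimal() for the re-based lookup, and it assumes Python 3. The sample strings are given as escapes.

```python
import unicodedata


def normalize_digits(s):
    """Replace every Unicode decimal digit in s with its ASCII equivalent.

    Characters that are not decimal digits (signs, the decimal point,
    whitespace, ideographic numerals) are passed through unchanged.
    """
    return "".join(
        str(unicodedata.decimal(ch)) if ch.isdecimal() else ch
        for ch in s
    )


# ARABIC-INDIC digits 1 2 3 4 . 5 6
arabic = "\u0661\u0662\u0663\u0664.\u0665\u0666"
ascii_text = normalize_digits(arabic)
value = float(ascii_text)

# FULLWIDTH DIGIT ONE and TWO
fullwidth = normalize_digits("\uff11\uff12")

print(ascii_text, value, fullwidth)
```

Keeping the normalisation separate from int() and float() means the caller decides whether non-ASCII digits are acceptable at all, which is the point being argued in this part of the thread.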
Re: [Python-Dev] Python and the Unicode Character Database
> Sure, if we code it in Python, supporting it will be much easier:
>
>     def normalize_digits(s):
>         digits = {m.group(1) for m in re.finditer(r'(\d)', s)}
>         trtab = {ord(d): str(unicodedata.digit(d)) for d in digits}
>         return s.translate(trtab)
>
>     >>> normalize_digits('١٢٣٤.٥٦')
>     '1234.56'
>
> I am not sure this belongs to the locale module, however. It seems to me, something like 'unicodealgo' for unicode algorithms would be more appropriate.

It could simply be in unicodedata if you split the implementation into a core C part and some Python bits.

Regards

Antoine.
Re: [Python-Dev] Python and the Unicode Character Database
On Tue, Nov 30, 2010 at 1:29 PM, Antoine Pitrou solip...@pitrou.net wrote:
..
> > I am not sure this belongs to the locale module, however. It seems to me, something like 'unicodealgo' for unicode algorithms would be more appropriate.
>
> It could simply be in unicodedata if you split the implementation into a core C part and some Python bits.

Splitting unicodedata may not be a bad idea. There are many more pieces in UCD than covered by unicodedata. [1] Hardcoding them all into unicodedata module is hard to justify, but some are quite useful.

For example, PropertyValueAliases.txt is quite useful for those like myself who cannot remember what Pd or Zl category names stand for. SpecialCasing.txt is required for proper casing, but is not currently included in Python. I would not want to change str.upper or str.title because of this, but providing the raw info to someone who wants to implement proper case mappings may not be a bad idea. Blocks.txt is certainly useful for any language-dependent processing.

On the other hand, I think we should keep Unicode data and Unicode algorithms separate. And the latter may not even belong to the Python stdlib.

[1] http://unicode.org/Public/UNIDATA/
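To illustrate the PropertyValueAliases.txt point: unicodedata.category() returns only the two-letter abbreviation, so a reader who wants the long name has to carry a table of their own. The sketch below hardcodes a handful of long names taken from that file; CATEGORY_ALIASES and describe_category are hypothetical names for this example, not part of any module.

```python
import unicodedata

# A small, hand-copied excerpt of General_Category aliases.
# The full list lives in PropertyValueAliases.txt in the UCD.
CATEGORY_ALIASES = {
    "Lu": "Uppercase_Letter",
    "Ll": "Lowercase_Letter",
    "Lo": "Other_Letter",
    "Nd": "Decimal_Number",
    "Pd": "Dash_Punctuation",
    "Zl": "Line_Separator",
}


def describe_category(ch):
    """Return (abbreviation, long name) for the general category of ch."""
    short = unicodedata.category(ch)
    return short, CATEGORY_ALIASES.get(short, "(not in this excerpt)")


hyphen = describe_category("-")
line_separator = describe_category("\u2028")
arabic_three = describe_category("\u0663")

print(hyphen, line_separator, arabic_three)
```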
Re: [Python-Dev] Python and the Unicode Character Database
On 30.11.2010 09:15, Hagen Fürstenau wrote:
> > > During PEP 3003 discussion, it was suggested to handle it on a case by case basis, but I don't see discussion of the upgrade to 6.0.0 in PEP 3003.
> >
> > It's covered by "As the standard library is not directly tied to the language definition it is not covered by this moratorium."
>
> How is this restricted to the stdlib if it defines the set of valid identifiers?

The language does not change. The language specification says

"Python 3.0 introduces additional characters from outside the ASCII range (see PEP 3131). For these characters, the classification uses the version of the Unicode Character Database as included in the unicodedata module."

That remains unchanged. It was a deliberate design decision of PEP 3131 to not codify a fixed set of characters that can be used in identifiers.

Regards, Martin
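Because the set of identifier characters follows the database rather than a fixed list, the practical way to find out what a given interpreter accepts is to ask it. A minimal sketch using str.isidentifier(), assuming Python 3; the describe helper and the sample characters are illustrative.

```python
import unicodedata


def describe(ch):
    """Return (category, can start an identifier, can continue one)."""
    return (
        unicodedata.category(ch),
        ch.isidentifier(),
        ("x" + ch).isidentifier(),
    )


latin = describe("a")
han_one = describe("\u4e00")       # CJK ideograph "one"
arabic_three = describe("\u0663")  # ARABIC-INDIC DIGIT THREE
dollar = describe("$")

print(latin, han_one, arabic_three, dollar)
```

A decimal digit from any script may continue an identifier but not start one, while the ideograph is a letter and may appear anywhere in a name.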
Re: [Python-Dev] Python and the Unicode Character Database
> Would moving this functionality to the locale module make the issues any easier to fix?

You could delegate it to the C library, so: yes.

Regards, Martin
Re: [Python-Dev] Python and the Unicode Character Database
On Tuesday, 30 November 2010 at 20:16 +0100, Martin v. Löwis wrote:
> > Would moving this functionality to the locale module make the issues any easier to fix?
>
> You could delegate it to the C library, so: yes.

I hope you don't suggest delegating it to the C locale functions. Do you?
Re: [Python-Dev] Python and the Unicode Character Database
On 30.11.2010 20:23, Antoine Pitrou wrote:
> On Tuesday, 30 November 2010 at 20:16 +0100, Martin v. Löwis wrote:
> > > Would moving this functionality to the locale module make the issues any easier to fix?
> >
> > You could delegate it to the C library, so: yes.
>
> I hope you don't suggest delegating it to the C locale functions. Do you?

Yes, I do. Why do you hope I don't?

Regards, Martin
Re: [Python-Dev] Python and the Unicode Character Database
On Tuesday, 30 November 2010 at 20:40 +0100, Martin v. Löwis wrote:
> > I hope you don't suggest delegating it to the C locale functions. Do you?
>
> Yes, I do. Why do you hope I don't?

Because we all know how locale is a pile of cr*p, both in specification and in implementations. Our unit tests for it are a clear proof of that.

Actually, I remember you saying that locale should ideally be replaced with a wrapper around the ICU library.

Regards

Antoine.
Re: [Python-Dev] Python and the Unicode Character Database
> Because we all know how locale is a pile of cr*p, both in specification and in implementations. Our unit tests for it are a clear proof of that.

I wouldn't use expletives, but rather claim that the locale module is highly platform-dependent.

> Actually, I remember you saying that locale should ideally be replaced with a wrapper around the ICU library.

By that, I stand - however, I have given up the hope that this will happen anytime soon.

With respect to local number parsing, I think that the locale module would be way better than the nonsense that Python currently does. In the locale module, somebody at least has thought about what specifically constitutes a number. The current not-ASCII-but-not-local-either approach is just useless. Maintaining a reasonable implementation is a burden, so deferring to the C library is more attractive than having to maintain an unreasonable implementation.

Regards, Martin
Re: [Python-Dev] Python and the Unicode Character Database
On Tuesday, 30 November 2010 at 20:55 +0100, Martin v. Löwis wrote:
> With respect to local number parsing, I think that the locale module would be way better than the nonsense that Python currently does. In the locale module, somebody at least has thought about what specifically constitutes a number. The current not-ASCII-but-not-local-either approach is just useless.

It depends what you need. If you parse integers it's probably good enough. And it's better to have a trustable standard (unicode) than a myriad of ad-hoc, possibly buggy or incomplete, often unavailable, cultural specifications drafted by OS vendors who have no business (and no expertise) in drafting them.

At least you can build more sophisticated routines on the simple information given to you by the unicode database. You cannot build anything solid on the C locale functions (and even then you are limited by various issues inherent in the locale semantics, such as the fact that it relies on process-wide state, which would only be ok, at best, for single-user applications).

There's a reason that e.g. Babel (*) reimplements locale-like functionality from scratch.

(*) http://pypi.python.org/pypi/Babel/

Regards

Antoine.
Re: [Python-Dev] Python and the Unicode Character Database
haiyang kang corn...@gmail.com writes:
> I think it is a little ugly to have code like this: num = float("一.一"), expected result is: num = 1.1

That's a straw man, though. The string need not be a literal in the program; it can be input to the program.

    num = float(input_from_the_external_world)

Does that change your assessment of whether non-ASCII digits are used?

-- Ben Finney
Re: [Python-Dev] Python and the Unicode Character Database
On 11/30/2010 3:23 AM, Stephen J. Turnbull wrote:
> I see no reason not to make a similar promise for numeric literals. I see no good reason to allow compatibility full-width Japanese ASCII numerals or Arabic cursive numerals in "for i in range(...)" for example.

I do not think that anyone, at least not me, has argued for anything other than 0-9 digits (or 0-f for hex) in literals in program code. The only issue is whether non-programmer *users* should be able to use their native digits in applications in response to input prompts.

-- Terry Jan Reedy
Re: [Python-Dev] Python and the Unicode Character Database
On 30.11.2010 21:24, Ben Finney wrote:
> haiyang kang corn...@gmail.com writes:
> > I think it is a little ugly to have code like this: num = float("一.一"), expected result is: num = 1.1
>
> That's a straw man, though. The string need not be a literal in the program; it can be input to the program.
>
>     num = float(input_from_the_external_world)
>
> Does that change your assessment of whether non-ASCII digits are used?

I think the OP (haiyang kang) already indicated that he finds it quite unlikely that anybody would possibly want to enter that. You would need a number of key strokes to enter each individual ideograph, plus you have to press the keys for keyboard layout switching to enter the Latin decimal separator (which you normally wouldn't use along with the Han numerals).

Regards, Martin
Re: [Python-Dev] Python and the Unicode Character Database
On 30.11.2010 23:43, Terry Reedy wrote:
> On 11/30/2010 3:23 AM, Stephen J. Turnbull wrote:
> > I see no reason not to make a similar promise for numeric literals. I see no good reason to allow compatibility full-width Japanese ASCII numerals or Arabic cursive numerals in "for i in range(...)" for example.
>
> I do not think that anyone, at least not me, has argued for anything other than 0-9 digits (or 0-f for hex) in literals in program code. The only issue is whether non-programmer *users* should be able to use their native digits in applications in response to input prompts.

And here, my observation stands: if they wanted to, they currently couldn't - at least not for real numbers (and also not for integers if they want to use grouping). So the presumed application of this feature doesn't actually work, despite the presence of the feature it was supposedly meant to enable.

Regards, Martin
Re: [Python-Dev] Python and the Unicode Character Database
On 11/30/2010 10:05 AM, Alexander Belopolsky wrote:

My general answers to the questions you have raised are as follows:

1. Each new feature release should use the latest version of the UCD as of the first beta release (or perhaps a week or so before). New chars are new features and the beta period can be used to (hopefully) iron out any bugs introduced by a new UCD version.

2. The language specification should not be UCD version specific. Martin pointed out that the definition of identifiers was intentionally written to not be, by referring to 'current version' or some such. On the other hand, the UCD version used should be programmatically discoverable, perhaps as an attribute of sys or str.

3. The UCD should not change in bugfix releases. New chars are new features. Adding them in bugfix releases will introduce gratuitous incompatibilities between releases. People who want the latest Unicode should either upgrade to the latest Python version or patch an older version (but not expect core support for any problems that creates).

> Given that 2.7 will be maintained for 5 years and arguably Unicode Consortium takes backward compatibility very seriously, wouldn't it make sense to consider a backport at some point? I am sure we will soon see a bug report that the following does not work in 2.7: :-)
>
>     >>> ord('\N{CAT FACE WITH WRY SMILE}')
>     128572

3 (cont). 2.7 is no different in that regard. It is feature frozen just like all other x.y releases. And that is the answer to any such report. If that code became valid in 2.7.2, for instance, it would still not work in 2.7 and 2.7.1. Not working is not a bug; working is a new feature introduced after 2.7 was released.

> > > - How specific should library reference manual be in defining methods affected by UCD such as str.upper()?
> >
> > It should specify what this actually does in Unicode terminology (probably in addition to a layman's rephrase of that)
>
> I opened an issue for this: http://bugs.python.org/issue10587

1,2 (cont). Good idea in general.

> I was more concerned about wide and narrow unicode CPython builds. Is it a bug that '\U'.isalpha() may disagree even when the two implementations are based on the same version of UCD?

4. While the difference between narrow/wide builds of (CPython) x.y (which should have one constant UCD) cannot be completely masked, I appreciate and generally agree with your efforts to minimize them. In some cases, there will be a conflict/tradeoff between eliminating this difference versus that.

-- Terry Jan Reedy
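On point 2, the version of the database an interpreter was built with is already exposed as unicodedata.unidata_version, so a program can check it before relying on a recently added character. A short sketch, assuming Python 3; ucd_version and name_or_none are illustrative names.

```python
import unicodedata

# unidata_version is a string such as "6.0.0"; turn it into a tuple
# so that it can be compared numerically.
ucd_version = tuple(int(part) for part in unicodedata.unidata_version.split("."))


def name_or_none(ch):
    """Return the character's name, or None if this UCD does not know it."""
    return unicodedata.name(ch, None)


# CAT FACE WITH WRY SMILE, the character from the example above.
cat_face = "\U0001F63C"
code_point = ord(cat_face)
cat_name = name_or_none(cat_face)

if ucd_version >= (6, 0, 0):
    print("UCD", unicodedata.unidata_version, "knows", cat_name)
else:
    print("UCD", unicodedata.unidata_version, "predates this character")
```

On an interpreter built against an older database, name_or_none returns None instead of raising, which lets the caller degrade gracefully.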