Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).
Antoine Pitrou wrote:
> FWIW, being French, I don't remember hearing any programmer wish (s)he
> could use non-ASCII identifiers, in any programming language. But
> arguably transliteration is very straight-forward (although a bit
> lossy at times ;-)).

My canonical example is François Pinard, who keeps requesting it,
saying that local people were surprised they couldn't use accented
characters in Python. Perhaps that's because he actually is
Québécois :-)

Regards,
Martin
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).
> FWIW, being French, I don't remember hearing any programmer wish (s)he
> could use non-ASCII identifiers, in any programming language. But
> arguably transliteration is very straight-forward (although a bit
> lossy at times ;-)).
>
> I think typeability and reproducibility should be weighed carefully.
> It's nice to have the real letter delta instead of "delta", but how do I
> type it again on my non-Greek keyboard if I want to keep consistent
> naming in the program?
>
> ASCII is ethnocentric, but it probably can be typed easily with every
> device in the world.
>
> Also, as a matter of fact, if I type an identifier with an accented
> letter inside, I would like Python to warn me, because it would be a
> typing error on my part.
>
> Maybe this should be an option at the beginning of any source file (like
> encoding currently). Or is this overkill?

I'm also French and I must say that I agree with you.

In my case, the most important thing is to be able to manage the _data_
in the right encoding. I'm currently trying to implement a little search
engine in Python (mainly to improve my skills) and the biggest problem I
have to face is how to manage encodings. Some web pages are in French,
in German, in English, etc., and it takes me a lot of time to handle
this problem correctly.

I think it's more useful to be able to manipulate the _data_ simply than
to have accents in identifiers.

--
Behind every bug there is a developer, a man who made a mistake.
(Well, OK, sometimes it takes several of them.)
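A minimal sketch of the data-side problem described above: decoding
fetched page bytes when the encoding is uncertain. The function name
`decode_page`, the `declared` parameter and the fallback order are
assumptions for illustration, not anything from the thread. Latin-1 is
tried last because it accepts every byte sequence, so it never fails but
may yield the wrong characters.

```python
def decode_page(raw, declared=None, fallbacks=("utf-8", "latin-1")):
    """Return (text, encoding_used) for the bytes of a fetched page.

    `declared` is whatever encoding the server or page claimed, if any
    (hypothetical parameter); it is tried first, then the fallbacks.
    """
    candidates = ([declared] if declared else []) + list(fallbacks)
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except (UnicodeDecodeError, LookupError):
            # Wrong guess or unknown codec name: try the next candidate.
            continue
    # Only reached if every candidate failed (e.g. custom fallbacks).
    return raw.decode("utf-8", errors="replace"), "utf-8"
```

Trying the strict encoding first matters: valid UTF-8 decoded as Latin-1
succeeds silently and produces mojibake.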
Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).
> Thanks for these data. This mostly reflects my experience with German
> and French users: some people would like to use non-ASCII identifiers
> if they could, others argue they never would as a matter of principle.
> Of course, transliteration is more straight-forward.

FWIW, being French, I don't remember hearing any programmer wish (s)he
could use non-ASCII identifiers, in any programming language. But
arguably transliteration is very straight-forward (although a bit
lossy at times ;-)).

I think typeability and reproducibility should be weighed carefully.
It's nice to have the real letter delta instead of "delta", but how do I
type it again on my non-Greek keyboard if I want to keep consistent
naming in the program?

ASCII is ethnocentric, but it probably can be typed easily with every
device in the world.

Also, as a matter of fact, if I type an identifier with an accented
letter inside, I would like Python to warn me, because it would be a
typing error on my part.

Maybe this should be an option at the beginning of any source file (like
encoding currently). Or is this overkill?
Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).
On Sat, 2005-10-29 at 10:56 +0200, "Martin v. Löwis" wrote:
> Atsuo Ishimoto wrote:
> > I'm +0.1 for non-ASCII identifiers, although module names should remain
> > ASCII. ASCII identifiers might be encouraged, but as Martin said, it is
> > very useful for some groups of users.
>
> Thanks for these data. This mostly reflects my experience with German
> and French users: some people would like to use non-ASCII identifiers
> if they could, others argue they never would as a matter of principle.
> Of course, transliteration is more straight-forward.

Not sure if anyone has made this point already, but unicode
identifiers are also useful for math programs. The ability to directly
type the math letters, like alpha, omega, etc., would actually make the
code more readable, while still understandable by programmers of all
nationalities.

For instance, you could write:

    Δv = x1 - x0
    if Δv < ε:
        return

Instead of:

    delta_v = x1 - x0
    if delta_v < epsilon:
        return

Anyone who is supposed to understand the code will be able to read
the delta and epsilon symbols.

Regards.

--
Gustavo J. A. M. Carneiro
<[EMAIL PROTECTED]> <[EMAIL PROTECTED]>
The universe is always one step beyond logic
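For reference, the Greek-letter version above is accepted by Python 3,
which later allowed non-ASCII identifiers (PEP 3131). A small sketch
built around the same two names; the function name `converged`, the
default tolerance and the use of `abs()` are assumptions added to make
it self-contained:

```python
def converged(x0, x1, ε=1e-9):
    """Return True when the step from x0 to x1 is smaller than ε."""
    # Δ is GREEK CAPITAL LETTER DELTA (U+0394), a letter, so it is a
    # valid identifier start; the mathematical INCREMENT sign (U+2206)
    # looks the same but is a symbol and would be a syntax error.
    Δv = x1 - x0
    return abs(Δv) < ε
```

Callers who cannot type ε can still pass it positionally, for example
`converged(1.0, 1.5, 0.1)`.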
Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).
Atsuo Ishimoto wrote:
> I'm +0.1 for non-ASCII identifiers, although module names should remain
> ASCII. ASCII identifiers might be encouraged, but as Martin said, it is
> very useful for some groups of users.

Thanks for these data. This mostly reflects my experience with German
and French users: some people would like to use non-ASCII identifiers
if they could, others argue they never would as a matter of principle.
Of course, transliteration is more straight-forward.

Regards,
Martin
Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).
Hello from Japan,

I googled discussions about non-ASCII identifiers in Japanese, but I
found no consensus. Major languages such as Java or VB support non-ASCII
identifiers, so projects that use non-ASCII identifiers in their
programs do exist.

Not all Japanese programmers think this is a good idea. Some people
enthusiastically prefer Japanese identifiers, but some feel it reduces
readability and is hard to type, some worry about tool breakage or
encoding problems, etc.

It seems that smart people don't like to express a preference for
Japanese identifiers, maybe because they think such a style is not cool,
or they are afraid such a confession may reveal a lack of English
ability. ;)

I'm +0.1 for non-ASCII identifiers, although module names should remain
ASCII. ASCII identifiers might be encouraged, but as Martin said, it is
very useful for some groups of users.

On Sat, 29 Oct 2005 00:21:03 +0200
"Martin v. Löwis" <[EMAIL PROTECTED]> wrote:

> Neil Hodgson wrote:
> >This is anecdotal but it appears to me that transliterations are
> > not commonly used apart from learning languages and some minimal help
> > for foreigners such as including transliterated names on railway
> > station name boards.
>
> That would be my guess also. Transliteration is clearly common for
> Latin-based languages (French, German, Spanish, say), but I doubt
> non-Latin scripts are that often transliterated (even if conventions
> exist).
>

Yes, transliterations are rarely used in daily life in Japan. For
programming, I know a lot of projects use a transliterated Japanese
style, but I guess they are rather a minority.

--
Atsuo Ishimoto
[EMAIL PROTECTED]
Homepage: http://www.gembook.jp
Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).
Neil Hodgson wrote:
>This is anecdotal but it appears to me that transliterations are
> not commonly used apart from learning languages and some minimal help
> for foreigners such as including transliterated names on railway
> station name boards.

That would be my guess also. Transliteration is clearly common for
Latin-based languages (French, German, Spanish, say), but I doubt
non-Latin scripts are that often transliterated (even if conventions
exist).

Regards,
Martin
Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).
On 10/28/05, Neil Hodgson <[EMAIL PROTECTED]> wrote:
>I used to work on software written by Japanese and English speakers
> at Fujitsu with most developers being Japanese. The rules were that
> comments could be in Japanese but identifiers were only allowed to
> contain ASCII characters. Most variable names were poorly chosen with
> s, p, q, fla (boolean=flag) and flafla being popular. When I asked
> some Japanese coders why they didn't use Japanese words expressed in
> ASCII (Romaji), their response was that it was a really weird idea.
>
>This is anecdotal but it appears to me that transliterations are
> not commonly used apart from learning languages and some minimal help
> for foreigners such as including transliterated names on railway
> station name boards.

Israeli programmers generally use English identifiers but
transliterations are common for local business terminology: types of
financial instruments, tax and insurance terminology, employee benefit
plans etc. Yes, it looks weird, but it would be rather pointless to
try to translate them. Even native English speakers would find it
difficult to recognize the translations because they are used to using
them as loan words. Only transliteration (or possibly the use of
non-ASCII identifiers) would make sense in this situation and I do not
think it is unique to Israel.

BTW, I heard about a Cobol shop that had an explicit policy of using
only transliterated identifiers. This resulted in a much smaller
chance of hitting one of Cobol's numerous reserved words. Thankfully,
this is not an issue in Python...

Oren
Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).
> "Neil" == Neil Hodgson <[EMAIL PROTECTED]> writes:

    Neil> Most variable names were poorly chosen with s, p, q, fla
    Neil> (boolean=flag) and flafla being popular. When I asked some
    Neil> Japanese coders why they didn't use Japanese words expressed
    Neil> in ASCII (Romaji), their response was that it was a really
    Neil> weird idea.

That may be due to the fact that two-ideograph words will often have a
dozen homonyms, and sometimes several dozen. I sometimes use kanji in
not-for-general-distribution Emacs LISP code when 2 kanji will give as
expressive an identifier as 10 or 15 ASCII characters.

    Neil> This is anecdotal but it appears to me that transliterations
    Neil> are not commonly used apart from learning languages

In everyday usage, they're used a lot for identifier-like purposes
like corporate logos.

The only large corpuses of Japanese-oriented Japanese-authored code
I'm familiar with are the input methods Wnn, Canna, and SKK, and these
invariably use transliterated Japanese grammatical terms for parser
components[1], although there are perfectly good equivalents in
English, at least (I think they may actually be standardized by the
Ministry of Education).

There's also an Emacs library, edict.el, that uses _mixed_
ASCII-hiragana-kanji identifiers. (ISTR that was done just to prove a
point---the person who wrote it was an American, I believe---definitely
not Japanese.)

Footnotes:
[1] Japanese does not require word delimiters, so input methods must
have grammatical knowledge to choose among large numbers of homonyms.

--
School of Systems and Information Engineering http://turnbull.sk.tsukuba.ac.jp
University of Tsukuba                    Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
   Ask not how you can "do" free software business;
   ask what your business can "do for" free software.
Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).
Josiah Carlson:
> According to wikipedia (http://en.wikipedia.org/wiki/Latin_alphabet),
> various languages have adopted a transliteration of their language
> and/or former alphabets into latin. They don't purport to know all of
> the reasons why, and I'm not going to speculate.

I used to work on software written by Japanese and English speakers
at Fujitsu with most developers being Japanese. The rules were that
comments could be in Japanese but identifiers were only allowed to
contain ASCII characters. Most variable names were poorly chosen with
s, p, q, fla (boolean=flag) and flafla being popular. When I asked
some Japanese coders why they didn't use Japanese words expressed in
ASCII (Romaji), their response was that it was a really weird idea.

This is anecdotal but it appears to me that transliterations are
not commonly used apart from learning languages and some minimal help
for foreigners such as including transliterated names on railway
station name boards.

Neil
Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).
Greg Ewing wrote:
> I still think this is a much worse potential problem
> than that of "l" vs "1", etc. It's reasonable to
> adopt the practice of never using "l" as a single
> letter identifier, for example. But it would be
> unreasonable to ban the use of "E" as an identifier
> on the grounds that someone somewhere might confuse
> it with a capital epsilon.

As a style guide, people should use single-letter identifiers only
for local variables. If they follow the guideline, it should be
easy to tell whether such an identifier is Latin or Greek (if everything
else in the function is Latin, the E likely is as well).

> An alternative would be to identify such confusable
> letters in the various alphabets and define them
> to be equivalent.

pylint could check for such things (although I very much
doubt it would have any hits in the next 10 years).

> And beyond the issue of alphabets there's also the
> question of whether accented characters should be
> considered distinct. I can see quite a few holy
> flame wars erupting over that...

For that, there is the Unicode TR that precisely defines how
this should be done. People should then have their wars
with the Unicode consortium.

Regards,
Martin
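The technical report is not named above; the sketch below assumes that
Unicode normalization, as exposed by Python's standard
`unicodedata.normalize`, is the relevant mechanism for deciding whether
two spellings of an accented identifier are "the same". Python 3 later
adopted NFKC normalization for identifiers (PEP 3131). The helper name
`same_identifier` is hypothetical.

```python
import unicodedata

# Two ways of writing the same visible name:
composed = "z\u00e4hler"       # 'ä' as one precomposed code point
decomposed = "za\u0308hler"    # 'a' followed by COMBINING DIAERESIS


def same_identifier(a, b):
    """Compare two names after NFKC normalization."""
    normalize = unicodedata.normalize
    return normalize("NFKC", a) == normalize("NFKC", b)
```

Note that normalization unifies different encodings of the same accented
letter; it does not make "zahler" and "zähler" equal, so whether accents
are distinct remains a policy question rather than a technical one.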
Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).
Greg Ewing wrote:
> M.-A. Lemburg wrote:
>
>>If you are told to debug a program
>>written by say a Japanese programmer using Japanese identifiers
>>you are going to have a really hard time.
>
> Or you could look upon it as an opportunity to
> broaden your mental horizons by learning some
> Japanese. :-)

I just took Japanese as an example of a language and script
that I don't know anything about. I would actually love
to learn some Japanese, but simply don't have the time
for learning it.

Anyway, I could just as well have chosen Tibetan,
Thai or Limbu scripts (which all look very nice, BTW):

    http://www.unicode.org/charts/

Perhaps this is not as bad after all - I just don't think
that it will help code readability in the long run.

--
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source (#1, Oct 27 2005)
>>> Python/Zope Consulting and Support ...        http://www.egenix.com/
>>> mxODBC.Zope.Database.Adapter ...             http://zope.egenix.com/
>>> mxODBC, mxDateTime, mxTextTools ...        http://python.egenix.com/

::: Try mxODBC.Zope.DA for Windows,Linux,Solaris,FreeBSD for free ! 
Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).
Martin v. Löwis wrote:
> M.-A. Lemburg wrote:
>
>>You even argued against having non-ASCII identifiers:
>>
>>http://mail.python.org/pipermail/python-list/2002-May/102936.html
>
> I see :-) It seems I have changed my mind since then (which
> apparently predates PEP 263).
>
> One issue I apparently was worried about was the plan to use
> native-encoding byte strings for the identifiers; this I didn't
> like at all.
>
>>* Unicode identifiers are going to introduce massive
>>code breakage - just think of all the tools people use
>>to manipulate Python code today; I'm quite sure that
>>most of it will fail in one way or another if you present
>>it Unicode literals such as in "zähler += 1".
>
> True. Today, I think I would be willing to accept the
> code breakage: these tools had quite some time to update
> themselves to PEP 263 (even though not all of them have
> done so yet); also, usage of the feature would only spread
> gradually. A failure to support the feature in the Python
> proper would be treated as a bug by us; how tool providers
> deal with the feature would be their choice.

I was thinking of introspection and debugging tools. These
would then see Unicode objects in the namespace dictionaries
and this will likely break a lot of code - much for the
same reason you see code breakage now if you let Unicode
objects enter the Python standard lib without warning :-)

>>* People don't seem very interested in using Unicode
>>identifiers, e.g.
>>
>> http://mail.python.org/pipermail/i18n-sig/2001-February/000828.html
>
> True. However, I also suspect that lack of tool support
> contributes to that. For the specific case of Java,
> there is no notion of source encoding, which makes Unicode
> identifiers really tedious to use.
>
> If it were really easy to use, I assume people would actually
> use it - at least in some of the contexts, like teaching,
> where Python is also widely used.

Well, that has two sides:

Of course, you'll always find some people that will like
a certain feature. The question is what effects it has
on the rest of us.

Python has always put some constraints on programmers to
raise code readability, e.g. white space awareness. Giving them
Unicode identifiers sounds like a step backwards in this
context.

Note that I'm not talking about comments, string literal
contents, etc. - only the programming logic, i.e. keywords
and identifiers.

>>Do you really think that it will help with code readability
>>if programmers are allowed to use native scripts for their
>>identifiers ?
>
> Yes, I do - for some groups of users. Of course, code sharing
> would be more difficult, and there certainly should be a policy
> to use only ASCII in the standard library. But within local
> groups, users would find understanding code easier if they
> knew what the identifiers actually meant.

Hmm, but why do you think they wouldn't understand the
meaning of ASCII versions of the identifiers ?

Note that using ASCII doesn't necessarily mean that you
have to use English as basis for the naming schemes of
identifiers.

>>If you are told to debug a program
>>written by say a Japanese programmer using Japanese identifiers
>>you are going to have a really hard time. Integrating such
>>code into other applications will be even harder, since you'd
>>be forced to use his Japanese class names in your application.
>
> Certainly, yes. There is a trade-off: you can make it easier
> for some people to read and write code if they can use their
> native script; at the same time, it would be harder for others
> to read and modify it.
>
> It's a policy decision whether you use English identifiers or
> not - it shouldn't be a technical decision (as it currently
> is).

See above: ASCII != English.

Most scripts have a transliteration into ASCII - simply because
that's the global standard for scripts.

>>I think source code encodings provide an ideal way to
>>have comments written in native scripts - and people
>>use that a lot. However, keeping the program code itself
>>in plain ASCII makes it far more readable and reusable
>>across locales. Something that's important in this
>>globalized world.
>
> Certainly. However, some programs don't need to live in
> a globalized world - e.g. if they are homework in a school.
> Within a locale, using native scripts would make the program
> more readable.

True.

--
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source (#1, Oct 27 2005)
>>> Python/Zope Consulting and Support ...        http://www.egenix.com/
>>> mxODBC.Zope.Database.Adapter ...             http://zope.egenix.com/
>>> mxODBC, mxDateTime, mxTextTools ...        http://python.egenix.com/

::: Try mxODBC.Zope.DA for Windows,Linux,Solaris,FreeBSD for free ! 
Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).
"Martin v. Löwis" <[EMAIL PROTECTED]> wrote:
> Josiah Carlson wrote:
> > According to wikipedia (http://en.wikipedia.org/wiki/Latin_alphabet),
> > various languages have adopted a transliteration of their language
> > and/or former alphabets into latin. They don't purport to know all of
> > the reasons why, and I'm not going to speculate.
> >
> > Whether or not more languages start using the latin alphabet is a good
> > question. Basing judgement on history and likely globalization, it is
> > only a matter of time before basically all languages have a
> > transcription into the latin alphabet that is taught to all (unless
> > China takes over the world).
>
> That is a very U.S. centric view. I don't share it, but I think it is
> pointless to argue against it.

I should have included a ;).

Whether or not in the future all languages use the latin alphabet should
have little to do with whether Python chooses to support non-latin
identifiers in the forthcoming 2.5 or later releases.

 - Josiah
Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).
Martin v. Löwis wrote:

> Not in the literal sense: you certainly want to allow
> "latin" digits in, say, a cyrillic identifier.

Yes, by "alphabet" I really only meant the letters,
although you might want to apply the same idea to
clusters of digits within an identifier, depending
on how potentially confusable the various sets of
digits are -- I'm not familiar enough with alternative
digit sets to know whether that would be a problem.

> Just because
> you *can* come up with look-alike identifiers doesn't
> mean that people will use them, or that they will mistake
> the scripts (except for deliberately doing so, of
> course).

I still think this is a much worse potential problem
than that of "l" vs "1", etc. It's reasonable to
adopt the practice of never using "l" as a single
letter identifier, for example. But it would be
unreasonable to ban the use of "E" as an identifier
on the grounds that someone somewhere might confuse
it with a capital epsilon.

An alternative would be to identify such confusable
letters in the various alphabets and define them
to be equivalent.

And beyond the issue of alphabets there's also the
question of whether accented characters should be
considered distinct. I can see quite a few holy
flame wars erupting over that...

--
Greg Ewing, Computer Science Dept, +--------------------------------------+
University of Canterbury,          | A citizen of NewZealandCorp, a       |
Christchurch, New Zealand          | wholly-owned subsidiary of USA Inc.  |
[EMAIL PROTECTED]                  +--------------------------------------+
Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).
M.-A. Lemburg wrote:

> If you are told to debug a program
> written by say a Japanese programmer using Japanese identifiers
> you are going to have a really hard time.

Or you could look upon it as an opportunity to
broaden your mental horizons by learning some
Japanese. :-)

--
Greg Ewing, Computer Science Dept, +--------------------------------------+
University of Canterbury,          | A citizen of NewZealandCorp, a       |
Christchurch, New Zealand          | wholly-owned subsidiary of USA Inc.  |
[EMAIL PROTECTED]                  +--------------------------------------+
Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).
Josiah Carlson wrote:
> According to wikipedia (http://en.wikipedia.org/wiki/Latin_alphabet),
> various languages have adopted a transliteration of their language
> and/or former alphabets into latin. They don't purport to know all of
> the reasons why, and I'm not going to speculate.
>
> Whether or not more languages start using the latin alphabet is a good
> question. Basing judgement on history and likely globalization, it is
> only a matter of time before basically all languages have a
> transcription into the latin alphabet that is taught to all (unless
> China takes over the world).

That is a very U.S. centric view. I don't share it, but I think it is
pointless to argue against it.

Regards,
Martin
Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).
"Martin v. Löwis" <[EMAIL PROTECTED]> wrote:
>
> Josiah Carlson wrote:
> > In this case it's not just a misreading, the characters look identical!
> > When is an 'E' not an 'E'? When it is an Epsilon or Ie. Saying what
> > characters will or will not be used as identifiers, when those
> > characters are keys on a keyboard of a specific type, is pretty
> > presumptuous.
>
> Why is that rude and disrespectful? I'm certainly respecting developers
> who want to use their scripts for identifiers, or else I would not have
> suggested that they could do so.

I never said rude, I said presumptuous.

"Going beyond what is right or proper; excessively forward."
(according to dictionary.com, the OED has a similar definition).

I was trying to say that in stating that users wouldn't be using keys on
their keyboard in their natural language when also using English
characters, you were assuming a bit about their usage patterns that
you perhaps shouldn't. I certainly could also be presumptuous in
stating that users may very well mix certain languages, but it seems to
be more likely given keywords and the standard library using the latin
alphabet.

> > Indeed, they are similar, but _different_ in my font as well. The trick
> > is that the glyphs are not different in the case of certain greek or
> > cyrillic letters. They don't just /look/ similar they /are identical/.
>
> This string: "EΕ" is the LATIN CAPITAL LETTER E, followed by the GREEK
> CAPITAL LETTER EPSILON. In the font my email composer uses, the E is
> slightly larger than the Epsilon - so there /is/ a visual difference.

My email client doesn't handle unicode, but a quick check by swapping
fonts in a word processor shows that at least on my platform, all
three are the same glyph (same size, shape, ...) for all fixed-width
fonts. If a platform distinguishes all three, then one should consider
one's platform lucky. Not all platforms and/or preferred fonts of users
are.

> But even if there isn't: if this was a frequent problem, the name
> error could include an alternative representation (say, with Unicode
> ordinals for non-ASCII characters) which would give an easy visual
> clue.

It would offer a great cue, but I'm not sure if it is possible. I think
that it sounds like an ugly discussion of stdout/err encodings and
exception handling machinery that I don't want to be a part of.

> I still doubt that this is a frequent problem, and I don't see any
> better grounds for claiming that it is than for claiming that it
> is not.

Whether or not it is frequent will depend on the prevalence of desire to
use those characters. While I don't think that such uses will be as
common as using 'klass' when passing a class, I do think that it will
result in more than a few sf bug reports.

I also share Marc-Andre Lemburg's concerns about the understandability
of code written in Kanji, Hebrew, Arabic, etc., at least for those who
have not memorized the entirety of those alphabets.

 - Josiah
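A rough sketch of the "alternative representation" idea quoted above:
spelling out every non-ASCII character of a name with its code point and
Unicode character name. The function name `spell_out` and the output
format are assumptions; nothing here is how Python's own NameError
message is built.

```python
import unicodedata


def spell_out(identifier):
    """Return the identifier with non-ASCII characters made explicit."""
    parts = []
    for ch in identifier:
        if ord(ch) < 128:
            parts.append(ch)
        else:
            # Show the ordinal and the official character name.
            name = unicodedata.name(ch, "UNKNOWN")
            parts.append("<U+%04X %s>" % (ord(ch), name))
    return "".join(parts)
```

Because the result is plain ASCII, it sidesteps the stdout/stderr
encoding concern: the three look-alike capitals (Latin E, Greek Epsilon,
Cyrillic Ie) come out as three visibly different strings.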
Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).
"Martin v. Löwis" <[EMAIL PROTECTED]> wrote:
>
> M.-A. Lemburg wrote:
> > You even argued against having non-ASCII identifiers:
> >
> > http://mail.python.org/pipermail/python-list/2002-May/102936.html
> >
> > Do you really think that it will help with code readability
> > if programmers are allowed to use native scripts for their
> > identifiers ?
>
> Yes, I do - for some groups of users. Of course, code sharing
> would be more difficult, and there certainly should be a policy
> to use only ASCII in the standard library. But within local
> groups, users would find understanding code easier if they
> knew what the identifiers actually meant.

According to wikipedia (http://en.wikipedia.org/wiki/Latin_alphabet),
various languages have adopted a transliteration of their language
and/or former alphabets into latin. They don't purport to know all of
the reasons why, and I'm not going to speculate.

Whether or not more languages start using the latin alphabet is a good
question. Basing judgement on history and likely globalization, it is
only a matter of time before basically all languages have a
transcription into the latin alphabet that is taught to all (unless
China takes over the world).

 - Josiah
Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).
M.-A. Lemburg wrote:
> You even argued against having non-ASCII identifiers:
>
> http://mail.python.org/pipermail/python-list/2002-May/102936.html

I see :-) It seems I have changed my mind since then (which
apparently predates PEP 263).

One issue I apparently was worried about was the plan to use
native-encoding byte strings for the identifiers; this I didn't
like at all.

> * Unicode identifiers are going to introduce massive
> code breakage - just think of all the tools people use
> to manipulate Python code today; I'm quite sure that
> most of it will fail in one way or another if you present
> it Unicode literals such as in "zähler += 1".

True. Today, I think I would be willing to accept the
code breakage: these tools had quite some time to update
themselves to PEP 263 (even though not all of them have
done so yet); also, usage of the feature would only spread
gradually. A failure to support the feature in the Python
proper would be treated as a bug by us; how tool providers
deal with the feature would be their choice.

> * People don't seem very interested in using Unicode
> identifiers, e.g.
>
> http://mail.python.org/pipermail/i18n-sig/2001-February/000828.html

True. However, I also suspect that lack of tool support
contributes to that. For the specific case of Java,
there is no notion of source encoding, which makes Unicode
identifiers really tedious to use.

If it were really easy to use, I assume people would actually
use it - at least in some of the contexts, like teaching,
where Python is also widely used.

> Do you really think that it will help with code readability
> if programmers are allowed to use native scripts for their
> identifiers ?

Yes, I do - for some groups of users. Of course, code sharing
would be more difficult, and there certainly should be a policy
to use only ASCII in the standard library. But within local
groups, users would find understanding code easier if they
knew what the identifiers actually meant.

> If you are told to debug a program
> written by say a Japanese programmer using Japanese identifiers
> you are going to have a really hard time. Integrating such
> code into other applications will be even harder, since you'd
> be forced to use his Japanese class names in your application.

Certainly, yes. There is a trade-off: you can make it easier
for some people to read and write code if they can use their
native script; at the same time, it would be harder for others
to read and modify it.

It's a policy decision whether you use English identifiers or
not - it shouldn't be a technical decision (as it currently
is).

> I think source code encodings provide an ideal way to
> have comments written in native scripts - and people
> use that a lot. However, keeping the program code itself
> in plain ASCII makes it far more readable and reusable
> across locales. Something that's important in this
> globalized world.

Certainly. However, some programs don't need to live in
a globalized world - e.g. if they are homework in a school.
Within a locale, using native scripts would make the program
more readable.

Regards,
Martin
[Python-Dev] Divorcing str and unicode (no more implicit conversions).

Greg Ewing asked:
> Would it help if an identifier were required to be
> made up of letters from the same alphabet, e.g. all
> Latin or all Greek or all Cyrillic, but not a mixture.

Probably, yes, though there could still be problems mixing within a program.

FWIW, the Opera web browser is already using a similar solution. Domain names are limited to Latin-1 *unless* the top-level registrar has a policy to prevent spoofing.

-jJ
Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).

Martin v. Löwis wrote:
> M.-A. Lemburg wrote:
>
>>A few years ago we had a discussion about this on python-dev
>>and agreed to stick with ASCII identifiers for Python. I still
>>think that's the right way to go.
>
> I don't think there ever was such an agreement.

You even argued against having non-ASCII identifiers:

http://mail.python.org/pipermail/python-list/2002-May/102936.html

and I agree with you on most of the points you make in that posting:

* Unicode identifiers are going to introduce massive
  code breakage - just think of all the tools people use
  to manipulate Python code today; I'm quite sure that
  most of it will fail in one way or another if you present
  it Unicode literals such as in "zähler += 1".

* People don't seem very interested in using Unicode
  identifiers, e.g.

  http://mail.python.org/pipermail/i18n-sig/2001-February/000828.html

  most of the few who did comment, said they'd rather
  have ASCII identifiers, e.g.

  http://mail.python.org/pipermail/python-list/2002-May/104050.html

Do you really think that it will help with code readability if programmers are allowed to use native scripts for their identifiers ?

I think this goes beyond just visual aspects of being able to distinguish graphemes:

If you are told to debug a program written by say a Japanese programmer using Japanese identifiers you are going to have a really hard time. Integrating such code into other applications will be even harder, since you'd be forced to use his Japanese class names in your application. This doesn't only introduce problems with being able to enter the Japanese identifiers, it will also cause your application to suddenly contain identifiers in Japanese even though that's not your native script.

I think source code encodings provide an ideal way to have comments written in native scripts - and people use that a lot. However, keeping the program code itself in plain ASCII makes it far more readable and reusable across locales. Something that's important in this globalized world.

--
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source (#1, Oct 26 2005)
>>> Python/Zope Consulting and Support ...        http://www.egenix.com/
>>> mxODBC.Zope.Database.Adapter ...             http://zope.egenix.com/
>>> mxODBC, mxDateTime, mxTextTools ...        http://python.egenix.com/

::: Try mxODBC.Zope.DA for Windows,Linux,Solaris,FreeBSD for free !
Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).

On 25.10.2005 at 23:40, Josiah Carlson wrote:
> [...]
> Identically drawn glyphs are a problem, and pretending that they
> aren't a problem, doesn't make it so. Right now, all possible name
> glyphs are visually distinct, which would not be the case if any
> unicode character could be used as a name (except for numerals).
> Speaking of which, would we then be offering support for
> arabic/indic numeric literals, and/or support it in int()/float()?

It's already supported in int() and float():

    >>> int(u"\u136c\u2082")
    42
    >>> float(u"\u0664\u09e8")
    42.0

But not as literals:

    # -*- coding: unicode-escape -*-
    print \u136c\u2082

This gives (on the Mac):

      File "encoding.py", line 3
        print ፬₂
              ^
    SyntaxError: invalid syntax

> [...]

Bye,
   Walter Dörwald
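For anyone retrying this on a current interpreter, here is a minimal sketch, assuming Python 3 rather than the 2005-era Python 2 used in the session above. It deliberately uses only characters that are Unicode decimal digits (Arabic-Indic and Bengali), since the handling of other numeric characters may differ between versions. The helper name literal_is_accepted is made up for this illustration.

```python
# Sketch for Python 3; str is already Unicode, so no u"" prefix is needed.
arabic_indic_42 = "\u0664\u0662"  # ARABIC-INDIC DIGIT FOUR, ARABIC-INDIC DIGIT TWO
mixed_42 = "\u0664\u09e8"         # ARABIC-INDIC DIGIT FOUR, BENGALI DIGIT TWO

# int() and float() accept Unicode decimal digits in strings.
as_int = int(arabic_indic_42)
as_float = float(mixed_42)


def literal_is_accepted(source):
    """Return True if `source` compiles as a Python expression."""
    try:
        compile(source, "<example>", "eval")
    except SyntaxError:
        return False
    return True


# The same digits written directly in source code are rejected by the parser.
digits_ok_as_literal = literal_is_accepted(arabic_indic_42)
ascii_ok_as_literal = literal_is_accepted("42")
```

So the split described above still holds in this sketch: conversion functions take such digits, the grammar for numeric literals does not.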
Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).

> "Josiah" == Josiah Carlson <[EMAIL PROTECTED]> writes:

    Josiah> Indeed, they are similar, but _different_ in my font as
    Josiah> well. The trick is that the glyphs are not different in
    Josiah> the case of certain greek or cyrillic letters. They don't
    Josiah> just /look/ similar they /are identical/.

But these problems are going to arise in _any_ multilingual context; it's not at all specific to identifiers. It's just that computers lexing identifiers are kinda picky about those things compared to humans.

I think you can reasonably classify it as a new breed of typo, and develop UIs to deal with it in that way. To handle cases where glyphs are (nearly) identical, UIs that visually flag "foreign" characters, at least in contexts where cross-block punning is unacceptable, will be developed, and users will learn to pay attention to those flags.

--
School of Systems and Information Engineering http://turnbull.sk.tsukuba.ac.jp
University of Tsukuba                    Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
Ask not how you can "do" free software business;
ask what your business can "do for" free software.
Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).

Greg Ewing wrote:
> Would it help if an identifier were required to be
> made up of letters from the same alphabet, e.g. all
> Latin or all Greek or all Cyrillic, but not a mixture.
> Then you'd get an immediate error if you accidentally
> slipped in a letter from the wrong alphabet.

Not in the literal sense: you certainly want to allow "latin" digits in, say, a cyrillic identifier. See

http://www.unicode.org/reports/tr31/

for what the Unicode consortium recommends to do. In addition to the strict specification, they envision usage guidelines. This seems Pythonic: just because you could potentially shoot yourself in the foot doesn't mean it should be banned from the language.

IOW, whether it would help largely depends on whether the problem is real in the first place. Just because you *can* come up with look-alike identifiers doesn't mean that people will use them, or that they will mistake the scripts (except for deliberately doing so, of course).

Regards,
Martin
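A rough sketch of the kind of check Greg describes, with Martin's exception for digits, might look like the following. Assumptions, stated loudly: the Python standard library exposes no Unicode script property, so the first word of each character's name (from unicodedata.name) stands in for its script; the function names are made up for illustration; and this is not the algorithm TR31 specifies. Languages that legitimately combine several scripts in one word would be rejected by such a naive rule.

```python
import unicodedata


def script_guess(ch):
    """Guess a script from the first word of the character's Unicode name.

    This is a heuristic stand-in for a real script property.
    """
    return unicodedata.name(ch, "UNKNOWN").split()[0]


def scripts_in_identifier(identifier):
    """Collect the guessed scripts of all letters, ignoring digits and '_'."""
    return {script_guess(ch) for ch in identifier if ch.isalpha()}


def is_single_script(identifier):
    """True if every letter in the identifier comes from one guessed script."""
    return len(scripts_in_identifier(identifier)) <= 1


latin_name = "PEN"
mixed_name = "\u0420EN"                     # CYRILLIC CAPITAL LETTER ER + "EN"
cyrillic_with_digit = "\u0441\u0447\u0451\u04421"  # Cyrillic letters + ASCII "1"
```

With these definitions, is_single_script(mixed_name) is False while the Cyrillic identifier ending in an ASCII digit still passes, which is the behaviour Martin asks for.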
Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).

Josiah Carlson wrote:
> In this case it's not just a misreading, the characters look identical!
> When is an 'E' not an 'E'? When it is an Epsilon or Ie. Saying what
> characters will or will not be used as identifiers, when those
> characters are keys on a keyboard of a specific type, is pretty
> presumptuous.

Why is that rude and disrespectful? I'm certainly respecting developers who want to use their scripts for identifiers, or else I would not have suggested that they could do so.

However, from the experience with my own language, and the three or so foreign languages I know, I can tell you that people normally don't mix identifiers of different scripts.

> Sure, that example was made up, but there are words which have been
> stolen from various languages by english, and you are discounting the
> case of single-letter temporary variables. Saying what will and won't
> happen over the course of using unicode identifiers is quite the
> prediction.

Sure, people can make mistakes. They get an error, and then will need to find the cause of the problem. Sometimes, this will be easy, and sometimes, it will not.

> Indeed, they are similar, but _different_ in my font as well. The trick
> is that the glyphs are not different in the case of certain greek or
> cyrillic letters. They don't just /look/ similar they /are identical/.

This string: "EΕ" is the LATIN CAPITAL LETTER E, followed by the GREEK CAPITAL LETTER EPSILON. In the font my email composer uses, the E is slightly larger than the Epsilon - so there /is/ a visual difference.

But even if there isn't: if this was a frequent problem, the name error could include an alternative representation (say, with Unicode ordinals for non-ASCII characters) which would give an easy visual clue. I still doubt that this is a frequent problem, and I don't see any better grounds for claiming that it is than for claiming that it is not.

Regards,
Martin
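The "alternative representation" idea can be sketched in a few lines. This is only an illustration under assumptions: the function name with_ordinals and the <U+XXXX> notation are invented here, and nothing in the thread says an interpreter would format name errors this way.

```python
import unicodedata


def with_ordinals(identifier):
    """Spell out every non-ASCII character of an identifier as <U+XXXX>."""
    return "".join(
        ch if ord(ch) < 128 else "<U+%04X>" % ord(ch)
        for ch in identifier
    )


# The two-character string from the message: Latin E, then Greek Epsilon.
pair = "E\u0395"
rendered = with_ordinals(pair)
names = [unicodedata.name(ch) for ch in pair]
```

Here rendered makes the second character visibly different from the first even in a font that draws both glyphs identically, and names gives the two character names quoted in the message.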
Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).

Martin v. Löwis wrote:
> For window.draw, people will readily understand that
> they are supposed to use Latin letters. More generally, they will know
> what script to use just from looking at the identifier.

Would it help if an identifier were required to be made up of letters from the same alphabet, e.g. all Latin or all Greek or all Cyrillic, but not a mixture. Then you'd get an immediate error if you accidentally slipped in a letter from the wrong alphabet.

Greg
Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).

Martin v. Löwis:
> This aspect of rendering is often not implemented, though. Web browsers
> do it correctly, see
> ...
> GUI frameworks sometimes do it correctly, sometimes don't; most
> notably, Tk has no good support for RTL text.

Scintilla does a rough job with this. RTL text is displayed correctly as the underlying platform libraries (Windows or GTK+/Pango) handle this aspect when called to draw text. However, editing is not performed correctly: the caret is not placed correctly within RTL text, and there are other visual glitches. There is interest in the area and even a funding proposal this week.

Neil
Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).

Guido van Rossum <[EMAIL PROTECTED]> wrote:
>
> On 10/25/05, Josiah Carlson <[EMAIL PROTECTED]> wrote:
> > Indeed, they are similar, but _different_ in my font as well. The trick
> > is that the glyphs are not different in the case of certain greek or
> > cyrillic letters. They don't just /look/ similar they /are identical/.
>
> Well, in the font I'm using to read this email, I and l are /identical/.

In all fonts I've seen, E/Epsilon/Ie are /always identical/.

 - Josiah
Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).

On 10/25/05, Josiah Carlson <[EMAIL PROTECTED]> wrote:
> Indeed, they are similar, but _different_ in my font as well. The trick
> is that the glyphs are not different in the case of certain greek or
> cyrillic letters. They don't just /look/ similar they /are identical/.

Well, in the font I'm using to read this email, I and l are /identical/.

--
--Guido van Rossum (home page: http://www.python.org/~guido/)
Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).

"Martin v. Löwis" <[EMAIL PROTECTED]> wrote:
>
> Josiah Carlson wrote:
> > And how users could say, "name error? But I typed in window.draw(PEN) as
> > I was told to, and it didn't work!"
>
> Ah, so the "serious issues" you are talking about are not security
> issues, but usability issues.

Indeed, it was a misunderstanding, as the email stated:

    I did not mean to imply that I was concerned about the security
    implications of inserting arbitrary identifiers in Python (I was
    mentioning the web browser case for an example of how such
    characters have been confusing previously), I am concerned about
    confusion involved with using: [glyphs which are identical]

> I don't think extending the range of acceptable characters will
> cause any additional confusion. Users are already getting "surprising"
> NameErrors/AttributeErrors in the following cases:
> - they just misspell the identifier, and then, when the error message
>   is printed, fail to recognize the difference, as they read over the
>   typo just like they read over it when mistyping it in the first place.

In this case it's not just a misreading, the characters look identical! When is an 'E' not an 'E'? When it is an Epsilon or Ie. Saying what characters will or will not be used as identifiers, when those characters are keys on a keyboard of a specific type, is pretty presumptuous.

> - they run into confusions with different things having the same names
>   in different contexts. For example, they wonder why they get TypeError
>   for passing the wrong number of arguments to a function, when the
>   call matches exactly what the source code in front of them tells
>   them - only that they were calling a different function which just
>   happened to have the same name.

Right, and users should be reading the documentation for the functions and methods they are calling.

> In the light of these common mistakes, your example with an identifier
> named PEN, where the "P" might be a cyrillic letter or the E a greek one
> is just made up: For window.draw, people will readily understand that
> they are supposed to use Latin letters. More generally, they will know
> what script to use just from looking at the identifier.

Sure, that example was made up, but there are words which have been stolen from various languages by english, and you are discounting the case of single-letter temporary variables. Saying what will and won't happen over the course of using unicode identifiers is quite the prediction.

> > Identically drawn glyphs are a problem, and pretending that they aren't
> > a problem, doesn't make it so. Right now, all possible name glyphs are
> > visually distinct
>
> Not at all: Just compare Fool and Foo1 (and perhaps FooI)
>
> In the font in which I'm typing this, these are slightly different - but
> there are fonts in which the difference is really difficult to
> recognize.

Indeed, they are similar, but _different_ in my font as well. The trick is that the glyphs are not different in the case of certain greek or cyrillic letters. They don't just /look/ similar they /are identical/.

 - Josiah
Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).

Guido van Rossum wrote:
> This actually seems a killer even for allowing Unicode in comments,
> which I'd otherwise favor. What do Unicode-aware apps generally do
> with right-to-left characters?

The Unicode standard has an elaborate definition of what should happen. There are many rules to it, but essentially, there is the notion of a "primary" direction, which then is toggled based on the directionality of each character (unicodedata.bidirectional). There are also formatting characters which toggle the direction.

This aspect of rendering is often not implemented, though. Web browsers do it correctly, see

http://he.wikipedia.org/wiki/Python

where all text should come out right-adjusted, yet the Latin fragments are still left to right (such as "Guido van Rossum").

Integrating it into this text looks like this: פייתון (Python).

GUI frameworks sometimes do it correctly, sometimes don't; most notably, Tk has no good support for RTL text.

Regards,
Martin
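To make the per-character directionality concrete, here is a small sketch using unicodedata.bidirectional, the function named above. It only reports the bidirectional class of each character; it does not implement the reordering rules of the Unicode bidirectional algorithm, and the helper name bidi_classes and the sample string are made up for this illustration.

```python
import unicodedata


def bidi_classes(text):
    """Pair each character with its Unicode bidirectional class."""
    return [(ch, unicodedata.bidirectional(ch)) for ch in text]


# Two Hebrew letters followed by a Latin fragment in parentheses.
sample = "\u05e4\u05d9 (Py)"
classes = [cls for _, cls in bidi_classes(sample)]

hebrew_class = unicodedata.bidirectional("\u05e4")  # HEBREW LETTER PE
latin_class = unicodedata.bidirectional("P")
digit_class = unicodedata.bidirectional("1")
```

A renderer that follows the standard would use these classes, plus the paragraph's primary direction, to decide the visual order of each run.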
Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).

M.-A. Lemburg wrote:
> A few years ago we had a discussion about this on python-dev
> and agreed to stick with ASCII identifiers for Python. I still
> think that's the right way to go.

I don't think there ever was such an agreement.

Regards,
Martin
Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).

Josiah Carlson wrote:
> And how users could say, "name error? But I typed in window.draw(PEN) as
> I was told to, and it didn't work!"

Ah, so the "serious issues" you are talking about are not security issues, but usability issues.

I don't think extending the range of acceptable characters will cause any additional confusion. Users are already getting "surprising" NameErrors/AttributeErrors in the following cases:
- they just misspell the identifier, and then, when the error message
  is printed, fail to recognize the difference, as they read over the
  typo just like they read over it when mistyping it in the first place.
- they run into confusions with different things having the same names
  in different contexts. For example, they wonder why they get TypeError
  for passing the wrong number of arguments to a function, when the
  call matches exactly what the source code in front of them tells
  them - only that they were calling a different function which just
  happened to have the same name.

In the light of these common mistakes, your example with an identifier named PEN, where the "P" might be a cyrillic letter or the E a greek one is just made up: For window.draw, people will readily understand that they are supposed to use Latin letters. More generally, they will know what script to use just from looking at the identifier.

> Identically drawn glyphs are a problem, and pretending that they aren't
> a problem, doesn't make it so. Right now, all possible name glyphs are
> visually distinct

Not at all: Just compare Fool and Foo1 (and perhaps FooI)

In the font in which I'm typing this, these are slightly different - but there are fonts in which the difference is really difficult to recognize.

> Speaking of which, would
> we then be offering support for arabic/indic numeric literals, and/or
> support it in int()/float()?

No. None of the Arabic users have ever requested such a feature, so it would be stupid to provide it. We provide extended identifiers not for the fun of it, but because users are requesting them.

Regards,
Martin
Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).

On 10/25/05, Josiah Carlson <[EMAIL PROTECTED]> wrote:
> Identically drawn glyphs are a problem, and pretending that they aren't
> a problem, doesn't make it so. Right now, all possible name glyphs are
> visually distinct, which would not be the case if any unicode character
> could be used as a name (except for numerals). Speaking of which, would
> we then be offering support for arabic/indic numeric literals, and/or
> support it in int()/float()? Ideally I would like to say yes, but I
> could see the confusion if such were allowed.

This problem isn't new. There are plenty of fonts where 1 and l are hard to distinguish, or l and I for that matter, or O and 0.

Yes, we need better tools to diagnose this. No, we shouldn't let this stop us from adding such a feature if it is otherwise a good feature.

I'm not so sure about this for other reasons -- it hampers code sharing, and as soon as you add right-to-left character sets to the mix (or top-to-bottom, for that matter), displaying source code is going to be near impossible for most tools (since the keywords and standard library module names will still be in the Latin alphabet). This actually seems a killer even for allowing Unicode in comments, which I'd otherwise favor. What do Unicode-aware apps generally do with right-to-left characters?

--
--Guido van Rossum (home page: http://www.python.org/~guido/)
Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).

Josiah Carlson wrote:
> "Martin v. Löwis" <[EMAIL PROTECTED]> wrote:
>
>>Fredrik Lundh wrote:
>>
>>>however, for Python 3000, it would be nice if the source-code encoding
>>>applied to the *entire* file (XML-style), rather than just unicode
>>>string literals and (hopefully) comments and docstrings.
>>
>>As MAL explains, the encoding currently does apply to the entire file.
>>However, because of the Python syntax, you are restricted to ASCII
>>in many places, such as keywords, number literals, and (unfortunately)
>>identifiers. Lifting the restriction on identifiers is on my agenda.
>
>
> It seems that removing this restriction may cause serious issues, at
> least in the case when using cyrillic characters in names. See recent
> security issues in regards to web addresses in web browsers for the
> confusion (and/or name errors) that could result in their use.
>
> While I agree in principle that people should be able to use the
> entirety of one's own natural language in writing software in
> programming languages, I think that it is an ugly can of worms that
> perhaps shouldn't be opened.

I agree with Josiah.

A few years ago we had a discussion about this on python-dev and agreed to stick with ASCII identifiers for Python. I still think that's the right way to go.

--
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source (#1, Oct 25 2005)
>>> Python/Zope Consulting and Support ...        http://www.egenix.com/
>>> mxODBC.Zope.Database.Adapter ...             http://zope.egenix.com/
>>> mxODBC, mxDateTime, mxTextTools ...        http://python.egenix.com/

::: Try mxODBC.Zope.DA for Windows,Linux,Solaris,FreeBSD for free !
Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).

"Martin v. Löwis" <[EMAIL PROTECTED]> wrote:
>
> Josiah Carlson wrote:
> > It seems that removing this restriction may cause serious issues, at
> > least in the case when using cyrillic characters in names. See recent
> > security issues in regards to web addresses in web browsers for the
> > confusion (and/or name errors) that could result in their use.
>
> That impression is deceiving. We are talking about source code here;
> people type in identifiers explicitly rather than receiving them
> through linking, and they scope identifiers (by module or object).
>
> If somebody manages to get look-alike identifiers into your Python
> libraries, you have bigger problems than these look-alikes: anybody
> capable of doing so could just as well replace the real thing in
> the first place.
>
> As always in computer security: define your threat model before
> reasoning about the risks.

I should have been more explicit. I did not mean to imply that I was concerned about the security implications of inserting arbitrary identifiers in Python (I was mentioning the web browser case for an example of how such characters have been confusing previously), I am concerned about confusion involved with using:

Greek Capital: Alpha, Beta, Epsilon, Zeta, Eta, Iota, Kappa, Mu, Nu, Omicron, Rho, and Tau.
Cyrillic Capital: Dze, Je, A, Ve, Ie, Em, En, O, Er, Es, Te, Ha, ...

And how users could say, "name error? But I typed in window.draw(PEN) as I was told to, and it didn't work!"

Identically drawn glyphs are a problem, and pretending that they aren't a problem, doesn't make it so. Right now, all possible name glyphs are visually distinct, which would not be the case if any unicode character could be used as a name (except for numerals). Speaking of which, would we then be offering support for arabic/indic numeric literals, and/or support it in int()/float()? Ideally I would like to say yes, but I could see the confusion if such were allowed.

 - Josiah
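The window.draw(PEN) scenario can be reproduced with plain strings, as a sketch. The dictionary below is only a stand-in for a module namespace, and the variable names are invented for this illustration; the point is just that look-alike spellings are different keys, and that Unicode compatibility normalization (NFKC) does not merge them.

```python
import unicodedata

latin_pen = "PEN"
# CYRILLIC CAPITAL LETTER ER, GREEK CAPITAL LETTER EPSILON, LATIN CAPITAL N
lookalike_pen = "\u0420\u0395N"

# A stand-in for a namespace that only defines the Latin spelling.
namespace = {latin_pen: "a pen object"}

same_length = len(latin_pen) == len(lookalike_pen)
same_name = latin_pen == lookalike_pen
found = lookalike_pen in namespace

# Normalization leaves the look-alike unchanged, so it cannot be used
# on its own to detect or repair this kind of confusion.
normalized = unicodedata.normalize("NFKC", lookalike_pen)
```

A lookup with the look-alike spelling fails exactly as an ordinary misspelling would, even though the two names may render identically.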
Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).

Josiah Carlson wrote:
> It seems that removing this restriction may cause serious issues, at
> least in the case when using cyrillic characters in names. See recent
> security issues in regards to web addresses in web browsers for the
> confusion (and/or name errors) that could result in their use.

That impression is deceiving. We are talking about source code here; people type in identifiers explicitly rather than receiving them through linking, and they scope identifiers (by module or object).

If somebody manages to get look-alike identifiers into your Python libraries, you have bigger problems than these look-alikes: anybody capable of doing so could just as well replace the real thing in the first place.

As always in computer security: define your threat model before reasoning about the risks.

Regards,
Martin
Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).

"Martin v. Löwis" <[EMAIL PROTECTED]> wrote:
>
> Fredrik Lundh wrote:
> > however, for Python 3000, it would be nice if the source-code encoding
> > applied to the *entire* file (XML-style), rather than just unicode
> > string literals and (hopefully) comments and docstrings.
>
> As MAL explains, the encoding currently does apply to the entire file.
> However, because of the Python syntax, you are restricted to ASCII
> in many places, such as keywords, number literals, and (unfortunately)
> identifiers. Lifting the restriction on identifiers is on my agenda.

It seems that removing this restriction may cause serious issues, at least in the case when using cyrillic characters in names. See recent security issues in regards to web addresses in web browsers for the confusion (and/or name errors) that could result in their use.

While I agree in principle that people should be able to use the entirety of one's own natural language in writing software in programming languages, I think that it is an ugly can of worms that perhaps shouldn't be opened.

 - Josiah
Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).

Fredrik Lundh wrote:
> however, for Python 3000, it would be nice if the source-code encoding
> applied to the *entire* file (XML-style), rather than just unicode
> string literals and (hopefully) comments and docstrings.

As MAL explains, the encoding currently does apply to the entire file. However, because of the Python syntax, you are restricted to ASCII in many places, such as keywords, number literals, and (unfortunately) identifiers. Lifting the restriction on identifiers is on my agenda.

Regards,
Martin
Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).

Fredrik Lundh wrote:
> M.-A. Lemburg wrote:
>
>
>>I don't follow you here. The source code encoding
>>is only applied to Unicode literals (you are using string
>>literals in your example). String literals are passed
>>through as-is.
>
>
> however, for Python 3000, it would be nice if the source-code encoding
> applied to the *entire* file (XML-style), rather than just unicode
> string literals and (hopefully) comments and docstrings.

Actually, the encoding is applied to the complete source file: the file is transcoded into UTF-8 and then parsed by the Python parser. Unicode literals are then decoded from the UTF-8 into Unicode. String literals are transcoded back into the source code encoding, thus making the (rather long due to technical constraints) round-trip

    source code encoding -> Unicode -> UTF-8 -> Unicode -> source code encoding

Python 3k should have a fully Unicode based parser to reduce this additional transcoding overhead.

Since Py3k will only have Unicode literals, the problems with string literals will go away all by themselves :-)

--
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source (#1, Oct 25 2005)
>>> Python/Zope Consulting and Support ...        http://www.egenix.com/
>>> mxODBC.Zope.Database.Adapter ...             http://zope.egenix.com/
>>> mxODBC, mxDateTime, mxTextTools ...        http://python.egenix.com/

::: Try mxODBC.Zope.DA for Windows,Linux,Solaris,FreeBSD for free !
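The round trip described above can be imitated with ordinary codec calls. This is only a sketch of the sequence of conversions, assuming ISO-8859-1 as the declared source encoding and written for Python 3; it is not the parser's actual code, and the variable names are made up.

```python
# Assumed source encoding for this illustration.
source_encoding = "iso-8859-1"

# Bytes as they would sit on disk: a string literal containing "zähler".
file_bytes = "s = 'z\u00e4hler'".encode(source_encoding)

# source code encoding -> Unicode
as_unicode = file_bytes.decode(source_encoding)

# Unicode -> UTF-8 (the form handed to the parser in the description above)
as_utf8 = as_unicode.encode("utf-8")

# UTF-8 -> Unicode (what happens for Unicode literals)
literal_text = as_utf8.decode("utf-8")

# Unicode -> source code encoding (what happens for byte string literals)
round_tripped = literal_text.encode(source_encoding)
```

The bytes come back unchanged, but only after four conversions, which is the overhead a fully Unicode based parser would avoid.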
Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).

M.-A. Lemburg wrote:
> I don't follow you here. The source code encoding
> is only applied to Unicode literals (you are using string
> literals in your example). String literals are passed
> through as-is.

however, for Python 3000, it would be nice if the source-code encoding applied to the *entire* file (XML-style), rather than just unicode string literals and (hopefully) comments and docstrings.