Re: django unicode-conversion, beginning
Hi gabor,

I've put up some patches to help with the unicode conversion of django. We have a site which is shortly going to production where we actually have to handle multiple unicode scripts including some which have characters that do not fall into iso-8859-1.

Since I'm pretty lazy and I'm not really interested in maintaining my own set of unicode patches against django forever - I'm *very* interested in helping with any effort to get Django to support unicode.

Adrian - can we get that branch opened up soon?

vic

On 8/21/06, gabor <[EMAIL PROTECTED]> wrote:
> Adrian Holovaty wrote:
> > On 8/8/06, gabor <[EMAIL PROTECTED]> wrote:
> >> i think unicodizing django can be done in 4 easily separated steps/parts:
> >>
> >> 1. request/response
> >> 2. templating-system
> >> 3. database-system
> >> 4. "overall unicode-conversion". this is mostly about replacing
> >> bytestrings with u"bla" in the code, and switching __str__ to __unicode__
> >>
> >> my biggest problem currently is, that i do not know how to continue...
> >> should i just write more and more patches to increase the
> >> "unicode-coverage" to more parts of django? or maybe a more coordinated
> >> approach would be better?
> >
> > Hey gabor,
> >
> > Sorry for the slow response on this -- I'm just now wading through a
> > couple of weeks' worth of django-users and django-developers messages.
> > This patch is a great step forward!
> >
> > Are you interested in a Subversion branch devoted to Unicoding Django?
> > Let me know...
>
> (to make sure my original response is not caught up in a spam-filter or
> such, sending this to the list too)
>
> hi,
>
> yes, i'm interested :)
>
> cannot really promise how long it will take to convert the whole django
> to unicode, but will try. it's not hard. as i wrote, the changes are
> simple, it's just that many changes have to be done.
>
> thanks,
> gabor

--
"Never attribute to malice that which can be adequately explained by stupidity."
- Hanlon's Razor --~--~-~--~~~---~--~~ You received this message because you are subscribed to the Google Groups "Django developers" group. To post to this group, send email to django-developers@googlegroups.com To unsubscribe from this group, send email to [EMAIL PROTECTED] For more options, visit this group at http://groups.google.com/group/django-developers -~--~~~~--~~--~--~---
Re: django unicode-conversion, beginning
Adrian Holovaty wrote:
> On 8/8/06, gabor <[EMAIL PROTECTED]> wrote:
>> i think unicodizing django can be done in 4 easily separated steps/parts:
>>
>> 1. request/response
>> 2. templating-system
>> 3. database-system
>> 4. "overall unicode-conversion". this is mostly about replacing
>> bytestrings with u"bla" in the code, and switching __str__ to __unicode__
>>
>> my biggest problem currently is, that i do not know how to continue...
>> should i just write more and more patches to increase the
>> "unicode-coverage" to more parts of django? or maybe a more coordinated
>> approach would be better?
>
> Hey gabor,
>
> Sorry for the slow response on this -- I'm just now wading through a
> couple of weeks' worth of django-users and django-developers messages.
> This patch is a great step forward!
>
> Are you interested in a Subversion branch devoted to Unicoding Django?
> Let me know...

(to make sure my original response is not caught up in a spam-filter or such, sending this to the list too)

hi,

yes, i'm interested :)

cannot really promise how long it will take to convert the whole django to unicode, but will try. it's not hard. as i wrote, the changes are simple, it's just that many changes have to be done.

thanks,
gabor
Re: django unicode-conversion, beginning
On 20-aug-2006, at 8:55, Malcolm Tredinnick wrote:
>> 5. Internally, work with unicode strings exclusively (after
>> transcoding the request and the template). Response should be python
>> unicode as well up until the moment it gets sent out.
>
> That's the idea.

Not so fast. You want to be liberal and send out BIG5 and JIS output, but at the same time use Unicode strings on the inside. How are you going to represent the characters which you want to preserve and handle specially with these Asian encodings if all you have in the machinery is Unicode? If you can't handle these characters then what is the point of having switchable output and input?

Are there browsers that don't handle UTF-8? I mean, modern ones. Even Lynx does it properly. How are you going to use encodeURIComponent in JS with other charsets? Encode URIs?

> Metaphorically cutting off both our arms so that we appear
> more aerodynamic is probably not a gain worth making.

I don't agree, but I rest my case. I just thought UTF-8 is the optimum compromise and enough non-conformity already. Thought that Django can be one of those frameworks that cut the knot instead of spending weeks unwinding it.

--
Julian 'Julik' Tarkhanov
please send all personal mail to me at julik.nl
Re: Re: django unicode-conversion, beginning
On 8/20/06, Malcolm Tredinnick <[EMAIL PROTECTED]> wrote:
> Metaphorically cutting off both our arms so that we appear
> more aerodynamic is probably not a gain worth making.

That's going in my quotes file.

--
"May the forces of evil become confused on the way to your house."
-- George Carlin
Re: django unicode-conversion, beginning
Malcolm Tredinnick wrote:
> Metaphorically cutting off both our arms so that we appear
> more aerodynamic is probably not a gain worth making.

This is the explanation! :-)

>> 5. Internally, work with unicode strings exclusively (after
>> transcoding the request and the template). Response should be python
>> unicode as well up until the moment it gets sent out.
>
> That's the idea.

It really works like this already by accepting unicode and also StringIO buffers with unicode.
Re: django unicode-conversion, beginning
On Sun, 2006-08-20 at 07:15 +0200, Julian 'Julik' Tarkhanov wrote:
> On 17-aug-2006, at 1:08, Bill de hÓra wrote:
>
> > like wanting to serve utf8 rss feeds, but have latin1 come
> > in and out of mysql.
>
> Might seem very extreme, but I would love to chime in. Maybe it would
> be wise to go even further, whereby:
>
> 1. Hardcode Django to output and input UTF-8 as the most useful for
> interop

Huge -1. This stuff (output encoding) has to be configurable, it's the way the Internet works. Sure, there are a bunch of cases where the specs will be inconclusive or ignored, and then we will need to make inspired choices, just like every other data-consuming, network-based application. But the whole planet has not standardised on UTF-8 and with valid reasons.

It's also not that hard to get right, albeit fairly fiddly. You identify the interfaces between external data and Django and do the conversion to unicode as soon as you can. That's the process Gabor is going through at the moment. Metaphorically cutting off both our arms so that we appear more aerodynamic is probably not a gain worth making.

> 1a. Any case where the developer might expect different input (for
> instance almost all OPML files are still exported as ISO due to the
> idiosyncratic way Radio worked back in the day) has to be known to
> him and handled explicitly
> 1b. Honor the charset headers sent in the request for transcoding
> 1c. Allow everyone who wants to output other charsets to cry and perish.
> 2. Stick the utf-8 output charset anywhere where it's possible
> (headers, page head...).

Since non-UTF-8 encodings are the norm in a lot of East-Asian locales (both for cultural and technical reasons), this isn't going to work.

> 5. Internally, work with unicode strings exclusively (after
> transcoding the request and the template). Response should be python
> unicode as well up until the moment it gets sent out.

That's the idea.

[...]

> I know, it seems so nice to be liberal and allow people to choose
> their encoding but just too many situations prove that to be the
> Wrong Choice.

The combined citizenry of China, Japan and South Korea thank you for your input, but respectfully point out that you are mistaken.

Regards,
Malcolm
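A minimal sketch of the approach Malcolm describes: decode to unicode as soon as external data comes in, and encode with a configurable charset only when the response goes out. It is written for Python 3, where `bytes` plays the role of the bytestrings in this thread and `str` the role of `unicode`. The names `to_text` and `to_wire` are hypothetical and not part of Django.

```python
def to_text(raw, charset):
    """Decode external bytes into text at the input boundary."""
    return raw.decode(charset)


def to_wire(text, charset):
    """Encode text at the output boundary and label it with the same charset."""
    body = text.encode(charset)
    content_type = "text/html; charset=%s" % charset
    return content_type, body


# Input arrives as Shift_JIS, is handled internally as text,
# and leaves as EUC-JP because that is what this site is configured to serve.
incoming = "こんにちは".encode("shift_jis")
text = to_text(incoming, "shift_jis")
content_type, body = to_wire(text, "euc_jp")
```

Nothing between the two calls needs to know which encodings are in use, which is what keeps the output charset a configuration choice rather than a hardcoded one.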
Re: django unicode-conversion, beginning
On 17-aug-2006, at 1:08, Bill de hÓra wrote:
> like wanting to serve utf8 rss feeds, but have latin1 come
> in and out of mysql.

Might seem very extreme, but I would love to chime in. Maybe it would be wise to go even further, whereby:

1. Hardcode Django to output and input UTF-8 as the most useful for interop
1a. Any case where the developer might expect different input (for instance almost all OPML files are still exported as ISO due to the idiosyncratic way Radio worked back in the day) has to be known to him and handled explicitly
1b. Honor the charset headers sent in the request for transcoding
1c. Allow everyone who wants to output other charsets to cry and perish.
2. Stick the utf-8 output charset anywhere where it's possible (headers, page head...).
2. Allow the DB to be in another encoding for databases that support it. For instance, MySQL and Postgres will transcode the strings for the client on the fly, so you can do interop with them in UTF-8 even when they are in a different encoding.
3. Assume all templates are in UTF-8 as well because text editors have much more success dealing with them that way. Transcode templates on read into unicode strings.
4. As a consequence of 1, let DEFAULT_CHARSET go. Too many choices really hurt here.
5. As a consequence of 1, deprecate the DATABASE_CHARSET I sent in as a patch and make it the default, so that all drivers switch their database clients to the most suitable Unicode form. SQLite has to be compiled with Unicode support, this has to be mentioned in the docs.
5. Internally, work with unicode strings exclusively (after transcoding the request and the template). Response should be python unicode as well up until the moment it gets sent out. Important to note is that every database driver has to be scrutinized for whether it returns proper unicode strings.

I know, it seems so nice to be liberal and allow people to choose their encoding but just too many situations prove that to be the Wrong Choice.

--
Julian 'Julik' Tarkhanov
please send all personal mail to me at julik.nl
Re: django unicode-conversion, beginning
In China GB18030 is required to be used by law, and most sites just assume the browser uses that as the default, so they don't even specify a character encoding.

Your likely setup for international web sites is to have Unicode in the database (since databases have special support for it and it is a good base encoding), but to serve up different encodings wherever UTF-8 proves problematic (for technical or legal reasons).

Hopefully, over time, there'll be less and less resistance to using UTF-8.

Rgds,
Bjorn
Re: django unicode-conversion, beginning
gabor wrote:
> currently my plan is to have the following behaviour:
>
> 1. i assume that every GET/POST param comes in encoded as
> settings.DEFAULT_CHARSET, and will decode it accordingly. if it fails,
> then it fails.

Assuming "you got served" with settings.DEFAULT_CHARSET, then sure.

> 3. will assume the database is in DEFAULT_CHARSET
> - maybe can we somehow ask the db for it's charset?

It would be a start.

> so, what do you think?
> or should we make it possible to have a system with mixed charsets?

I could imagine serving web content with one encoding, but lumping things in and out of the db with another. I guess people will need mixed encodings - like wanting to serve utf8 rss feeds, but have latin1 come in and out of mysql. But so long as we sweep out bytestrings inside django for unicode objects, mixed i/o should be possible to add on later.

Would being able to spec the db char encoding via settings.py be a needed option, or is that even possible across databases?

cheers
Bill
Re: django unicode-conversion, beginning
On 8/16/06, gabor <[EMAIL PROTECTED]> wrote:
> 3. will assume the database is in DEFAULT_CHARSET
> - maybe can we somehow ask the db for it's charset?

I think you really have to allow for different charset in the DB-- legacy integration, remember.
Re: django unicode-conversion, beginning
Jeremy Dunck wrote:
> I hereby decree that all strings in computing should have a charset
> associated with them.
>
> ...
>
> Damn, it didn't work.

ROTFL!

On a more positive note, kudos to Gábor for looking at this. Gábor, if you get a dev branch, I'll be happy to work against it.

cheers
Bill
Re: django unicode-conversion, beginning
gabor wrote:
> 3. will assume the database is in DEFAULT_CHARSET
> - maybe can we somehow ask the db for it's charset?
>
> so, what do you think?
> or should we make it possible to have a system with mixed charsets?
> (well, maybe having a different DB_CHARSET and a DEFAULT_CHARSET could
> work. maybe)

Yes, this is very desirable for systems that use legacy DB but want to output good modern utf-8 for users.

There is a ticket implementing this setting (http://code.djangoproject.com/ticket/952) but now I see that Jacob has marked it WONTFIX. And even after reading an explanation I don't get the reason... This still looks like a requirement for unicodification rather than something that will be superseded. Jacob, could you elaborate on this?
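A small sketch of the mixed-charset setup discussed here, assuming a legacy latin1 database behind a site that serves UTF-8. `DEFAULT_CHARSET` is the setting named in the thread; `DB_CHARSET` is only the name gabor floats above, and the `SETTINGS` dict and helper functions are hypothetical stand-ins rather than Django code. It is written for Python 3, where `bytes` corresponds to the thread's bytestrings.

```python
# Stand-in for settings.py; DB_CHARSET is the hypothetical setting from the thread.
SETTINGS = {
    "DEFAULT_CHARSET": "utf-8",    # what is served to users
    "DB_CHARSET": "iso-8859-1",    # what the legacy database stores
}


def from_db(raw):
    """Decode bytes read from the database into text."""
    return raw.decode(SETTINGS["DB_CHARSET"])


def to_db(text):
    """Encode text for storage. Raises UnicodeEncodeError when the legacy
    charset cannot represent the text."""
    return text.encode(SETTINGS["DB_CHARSET"])


def to_response(text):
    """Encode text for output in the serving charset."""
    return text.encode(SETTINGS["DEFAULT_CHARSET"])


row = b"caf\xe9"                   # "café" as stored in latin1
page = to_response(from_db(row))   # the same word as UTF-8 bytes
```

The catch with this arrangement is on the way in: users can submit text in UTF-8 that the legacy database charset cannot store, so `to_db` can fail even though output never does.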
Re: django unicode-conversion, beginning
Jeremy Dunck wrote:
> On 8/16/06, Bill de hÓra <[EMAIL PROTECTED]> wrote:
>> Now. Most (all?) browser UAs sniff the content to second guess the media
>> type. They don't much pay attention to Content-Type (I think maybe IE
>> ignores it altogether). The problem for this example is they might be
>> doing something similar for character encodings declared on the form
>> page's GET request. Browsers do this because so much served content is
>> mislabelled (eg feeds served as text/html and video as text/plain).
>
> IE doesn't totally ignore it. It just does some horrible, wrong things
> while considering it.
> http://blogs.msdn.com/ie/archive/2005/02/01/364581.aspx
> http://msdn.microsoft.com/workshop/networking/moniker/overview/appendix_a.asp
>
> Ian Hickson says contenttype is dead:
> http://ln.hixie.ch/?start=1144794177&count=1
> http://ln.hixie.ch/?start=1154950069&count=1

hmmm.. sad to hear that.. but it hopefully does not affect the django-unicode issue too much...

currently my plan is to have the following behaviour:

1. i assume that every GET/POST param comes in encoded as settings.DEFAULT_CHARSET, and will decode it accordingly. if it fails, then it fails.
 - might make an exception and in case of post-data check the content-type header of the request, whether it contains any charset stuff
 - if you really-really-really need to do some crazy is-sent-as-foo-but-has-to-be-treated-as-bar, you can always use the raw-postdata and raw-getdata.
2. will render the template in DEFAULT_CHARSET
3. will assume the database is in DEFAULT_CHARSET
 - maybe can we somehow ask the db for it's charset?

so, what do you think?
or should we make it possible to have a system with mixed charsets? (well, maybe having a different DB_CHARSET and a DEFAULT_CHARSET could work. maybe)

gabor
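Step 1 of the plan above ("if it fails, then it fails") could look roughly like this in Python 3, using the standard library's `urllib.parse.parse_qs`. The `decode_params` helper and the module-level `DEFAULT_CHARSET` constant are assumptions made for illustration, not the patch under discussion.

```python
from urllib.parse import parse_qs

DEFAULT_CHARSET = "utf-8"  # assumed value of settings.DEFAULT_CHARSET


def decode_params(query_string, charset=DEFAULT_CHARSET):
    """Parse a URL-encoded query string, decoding percent-escapes with the
    given charset. errors="strict" makes undecodable input raise
    UnicodeDecodeError instead of being silently replaced."""
    return parse_qs(query_string, encoding=charset, errors="strict")


# Percent-escaped UTF-8 decodes cleanly under the default charset.
params = decode_params("name=g%C3%A1bor")
```

A request whose bytes do not match the configured charset, for example `name=g%E1bor` (latin1) against a UTF-8 default, raises `UnicodeDecodeError` here. The raw query string is still available to any caller that has to handle such a case by hand.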
Re: django unicode-conversion, beginning
On 8/9/06, gabor <[EMAIL PROTECTED]> wrote:
> hmmm.. are you sure that the situation with unicode-aware editors is so bad?
>
> could you name some non-unicode-aware editors?
> for me it seems that from notepad through vim to eclipse everything does
> unicode fine...

On Windows, I used UltraEdit, which is a very popular editor. $25ish with very nice features. It claims to support unicode, but I've tested with it and it horribly mangles anything but UTF-8. Worse, you can open a UTF-8 file as though it were ASCII, then save as unicode, causing double-encoding.

I hereby decree that all strings in computing should have a charset associated with them.

...

Damn, it didn't work.
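The double-encoding failure described above can be reproduced in a few lines of Python 3. ISO-8859-1 stands in here for whatever 8-bit interpretation the editor applies when it misreads the file, which is an assumption; the thread does not say exactly what UltraEdit does internally.

```python
original = "é"
utf8_bytes = original.encode("utf-8")          # b'\xc3\xa9'

# The editor misreads the two UTF-8 bytes as two single-byte characters...
misread = utf8_bytes.decode("iso-8859-1")      # 'Ã©'

# ...and then saves that text as UTF-8 again, encoding each of them separately.
double_encoded = misread.encode("utf-8")       # b'\xc3\x83\xc2\xa9'

# Undoing the damage means reversing the two wrong steps in order.
repaired = double_encoded.decode("utf-8").encode("iso-8859-1").decode("utf-8")
```

The repair only works because the wrong charset is known. Without a charset attached to the data, the two mistakes cannot be told apart from legitimate text.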
Re: django unicode-conversion, beginning
On 8/16/06, Bill de hÓra <[EMAIL PROTECTED]> wrote:
> Now. Most (all?) browser UAs sniff the content to second guess the media
> type. They don't much pay attention to Content-Type (I think maybe IE
> ignores it altogether). The problem for this example is they might be
> doing something similar for character encodings declared on the form
> page's GET request. Browsers do this because so much served content is
> mislabelled (eg feeds served as text/html and video as text/plain).

IE doesn't totally ignore it. It just does some horrible, wrong things while considering it.
http://blogs.msdn.com/ie/archive/2005/02/01/364581.aspx
http://msdn.microsoft.com/workshop/networking/moniker/overview/appendix_a.asp

Ian Hickson says contenttype is dead:
http://ln.hixie.ch/?start=1144794177&count=1
http://ln.hixie.ch/?start=1154950069&count=1

Happily, Mark Pilgrim did a lot of the hard work by converting Mozilla's charset detection routines to Python in support of his feed parser.
http://chardet.feedparser.org/docs/how-it-works.html
Re: django unicode-conversion, beginning
Gábor Farkas wrote:
> for example, using this html file:
>
> http://localhost:7000";>
>
> (+ additional xhtml-headers, http-equiv-content-type=utf-8 etc)
>
> firefox submits this:
>
> POST / HTTP/1.1
> Host: localhost:7000
> User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1b1)
> Gecko/20060601 BonEcho/2.0b1 (Ubuntu-edgy)
> Accept:
> text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5
> Accept-Language: en-us,en;q=0.5
> Accept-Encoding: gzip,deflate
> Accept-Charset: UTF-8,*
> Keep-Alive: 300
> Connection: keep-alive
> Cookie: sessionid=9f5f5a5c387a07dd6b7e4d34a04e38b9
> Content-Type: application/x-www-form-urlencoded
> Content-Length: 14
>
> gabor1=farkas1
> =
>
> so, in what charset is the POSTDATA?

I don't have good news for you. If we are talking about HTML forms in this case - undefined. There's no charset attribute defined on the form. In that case the value is assumed to be "unknown" and clients can (not must) map this value as the character encoding that was used to send the html form. You can't assume ISO-8859-1 for a form, as that only applies as a default for text/* types.

> so, i agree with you, that if they do send it, we should honor it. but
> they are not sending it (i assume they should send it in the
> Content-Type header).

To spec? Then client UAs /must/ treat the Content-Type header as the authoritative declaration of the character encoding if there is a charset - it overrides *everything*. HTTP 1.1 and recent W3C findings are explicit on this.

Now. Most (all?) browser UAs sniff the content to second guess the media type. They don't much pay attention to Content-Type (I think maybe IE ignores it altogether). The problem for this example is they might be doing something similar for character encodings declared on the form page's GET request. Browsers do this because so much served content is mislabelled (eg feeds served as text/html and video as text/plain).

So the heuristic "browsers send content back in the encoding they receive it" can be assumed, but you have to allow for cases where they are sniffing content and ignoring server directives.

But, as a server implementor, my advice is to *always* send the Content-Type header and charset, and assume the data will be returned in that encoding. In order to be as stateless as possible, that means serving all forms in the same encoding, and typically your best bet in that case is to serve as UTF-8. Serving latin1 might work also for cases where people are using keyboard shortcuts for things like my surname (I'd need to test this to be sure; all I can say after 10 years of shopping online is that it's been pot luck). For cut and pasted content from word, we'd need to transcode down from cp1252 to latin1.

cheers
Bill
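A sketch of the two halves of this advice: honor a charset when the request's Content-Type declares one, and otherwise fall back to the charset the form page was served in. The helper names are hypothetical. The header parsing borrows `email.message.Message` from Python 3's standard library, which is a convenience chosen for this example rather than anything proposed in the thread.

```python
from email.message import Message


def charset_from_content_type(header_value, default):
    """Return the charset parameter of a Content-Type header value,
    or `default` when the header declares none."""
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset(default)


def content_type_header(charset, media_type="text/html"):
    """Build the Content-Type value a server would send with every response."""
    return "%s; charset=%s" % (media_type, charset)


served_as = "utf-8"  # assumed charset in which the form page was served

# Firefox's POST above declares no charset, so the served charset is assumed.
assumed = charset_from_content_type("application/x-www-form-urlencoded", served_as)

# A client that does declare one is taken at its word.
declared = charset_from_content_type(
    "application/x-www-form-urlencoded; charset=ISO-8859-1", served_as
)
```

Serving every form in one encoding keeps `served_as` a constant, so the server does not have to remember per-page state to interpret what comes back.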
Re: django unicode-conversion, beginning
Bill de hÓra wrote:
> gabor wrote:
>
>> so what do you think about the following approach:
>>
>> try ascii-decoding
>> if fails, try utf8-decoding
>> if fails do iso-8859-1-decoding (this cannot fail).
>>
>> ?
>
> Dumb question maybe. How do you know this encoding ladder will work?

it depends on how you define 'will work' :-)

it will not fail (every string can be decoded as iso-8859-1).

>> but imho this should happen only in "special" cases like
>> environ-variables.. for example in get/post params i would prefer to
>> raise an exception when the data cannot be en/de-coded using the
>> configured charset.
>
> You'd need to honor charset parameters sent out of Django apps and sent
> back by the client. A sensible default encoding to emit is UTF-8.

i would honor them if they would be sent :-)

for example, using this html file:

http://localhost:7000";>

(+ additional xhtml-headers, http-equiv-content-type=utf-8 etc)

firefox submits this:

POST / HTTP/1.1
Host: localhost:7000
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1b1) Gecko/20060601 BonEcho/2.0b1 (Ubuntu-edgy)
Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5
Accept-Language: en-us,en;q=0.5
Accept-Encoding: gzip,deflate
Accept-Charset: UTF-8,*
Keep-Alive: 300
Connection: keep-alive
Cookie: sessionid=9f5f5a5c387a07dd6b7e4d34a04e38b9
Content-Type: application/x-www-form-urlencoded
Content-Length: 14

gabor1=farkas1
=

so, in what charset is the POSTDATA?

so, i agree with you, that if they do send it, we should honor it. but they are not sending it (i assume they should send it in the Content-Type header).

the only usable assumption i have found up to now is that the browsers sends the data back encoded in the submitting-html-page's charset.

or is there a better way?

gabor
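The decoding ladder under discussion, sketched in Python 3 (where `bytes.decode` replaces the Python 2 calls of the time). The function name and the returned pair are illustrative choices, not taken from any patch in the thread.

```python
def decode_ladder(raw):
    """Try progressively more permissive encodings and report which matched.
    ASCII is a subset of UTF-8, so the ASCII rung never changes the decoded
    text; it only records that the input was plain ASCII."""
    for encoding in ("ascii", "utf-8"):
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # ISO-8859-1 maps every byte 0x00-0xFF to a code point, so this cannot fail.
    return raw.decode("iso-8859-1"), "iso-8859-1"
```

This shows both sides of the exchange above. The ladder always returns something, but the last rung guarantees only that: KOI8-R input, for example, is not valid UTF-8, so it falls through and comes back as Latin letters rather than Cyrillic.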
Re: django unicode-conversion, beginning
Malcolm Tredinnick wrote:
> On Wed, 2006-08-09 at 21:51 +0200, gabor wrote:
> [...]
>> phew... the immortal
>> how-tolerant-we-should-be-when-doing-unicode-conversion problems :-)
>
> Agreed. This is much easier on my side of the fence (lobbing problems),
> than your side (solving them).
> [...]
> All that being said, you could start off implementing your list and go
> from there (although surely utf-8 decoding will also handle ASCII
> strings, so you could skip the first step).

These would be good rules to follow:

- use unicode objects internally, weed out encoded bytestrings.
- decode all loaded files and configuration into unicode; templates will be challenging.
- initially at least, add assertions enforcing the use of unicode parameters (crash when you see a bytestring being passed into unicode aware code or across applications)
- default encode to utf8 at server boundaries, modulo what Malcolm said about honoring charsets served out.
- default de/encode in and out of utf8 for storage inside databases; it might be not possible and it might require a declaration in settings.
- have the admin app strip out cp1252 to deal with cut and paste from windows; effbot has a dictionary that can be used for this.

cheers
Bill
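Two of the rules above, the bytestring assertion and the cp1252 cleanup, sketched in Python 3. Both helper names are hypothetical. The cleanup derives its mapping from Python's built-in `cp1252` codec instead of the effbot dictionary Bill mentions, so it is a substitute technique and not that code.

```python
def require_text(value):
    """Crash early when a bytestring reaches unicode-aware code."""
    if isinstance(value, (bytes, bytearray)):
        raise TypeError("expected text, got %s" % type(value).__name__)
    return value


def fix_cp1252_controls(text):
    """Replace C1 control characters (U+0080-U+009F), which usually mean that
    cp1252 bytes were decoded as ISO-8859-1, with the characters cp1252
    assigns to those byte values."""
    fixed = []
    for ch in text:
        if "\x80" <= ch <= "\x9f":
            try:
                ch = ch.encode("iso-8859-1").decode("cp1252")
            except UnicodeDecodeError:
                pass  # a few byte values are undefined in cp1252; keep as is
        fixed.append(ch)
    return "".join(fixed)


# Curly quotes pasted from Word arrive as 0x93/0x94 and become control characters
# when the form data is decoded as latin1.
pasted = b"\x93quoted\x94".decode("iso-8859-1")
cleaned = fix_cp1252_controls(require_text(pasted))
```

Whether the cleaned text can then be stored depends on the target charset: the curly quotes recovered here exist in Unicode but not in latin1.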
Re: django unicode-conversion, beginning
gabor wrote:
> so what do you think about the following approach:
>
> try ascii-decoding
> if fails, try utf8-decoding
> if fails do iso-8859-1-decoding (this cannot fail).
>
> ?

Dumb question maybe. How do you know this encoding ladder will work?

> but imho this should happen only in "special" cases like
> environ-variables.. for example in get/post params i would prefer to
> raise an exception when the data cannot be en/de-coded using the
> configured charset.

You'd need to honor charset parameters sent out of Django apps and sent back by the client. A sensible default encoding to emit is UTF-8.

cheers
Bill
Re: django unicode-conversion, beginning
On 8/8/06, gabor <[EMAIL PROTECTED]> wrote:
> i think unicodizing django can be done in 4 easily separated steps/parts:
>
> 1. request/response
> 2. templating-system
> 3. database-system
> 4. "overall unicode-conversion". this is mostly about replacing
> bytestrings with u"bla" in the code, and switching __str__ to __unicode__
>
> my biggest problem currently is, that i do not know how to continue...
> should i just write more and more patches to increase the
> "unicode-coverage" to more parts of django? or maybe a more coordinated
> approach would be better?

Hey gabor,

Sorry for the slow response on this -- I'm just now wading through a couple of weeks' worth of django-users and django-developers messages. This patch is a great step forward!

Are you interested in a Subversion branch devoted to Unicoding Django? Let me know...

Adrian

--
Adrian Holovaty
holovaty.com | djangoproject.com
Re: django unicode-conversion, beginning
On 8/10/06, Ivan Sagalaev <[EMAIL PROTECTED]> wrote:
> Malcolm Tredinnick wrote:
> > I completely agree this is painful and normally I would punt. But my
> > crystal ball tells me that you will then get bug reports from Mr
> > Sagalaev, who is generally both very diligent in his debugging and likes
> > to use some language with a funny alphabet. If whatever you come up with
> > works naturally in places like Ivan's setup and maybe somebody who lives
> > in Hong Kong or Japan or some other East Asian locale, you could
> > consider this "solved" to some extent.
>
> I'm afraid I'm not a very good tester for this exact problem. Python on
> my Ubuntu happily says 'UTF-8' when asked
> 'locale.getpreferredencoding()'. But indeed I can always try these
> things with my compatriots using Windows or configuring their linuxes
> with old single-byte 'KOI8-R'.
>
> In fact I was under the impression that a string returned from this function
> can be safely used for decoding. For example on Russian Windows it
> returns 'cp1251' which works perfectly well while not being a standard
> ISO name which is 'windows-1251' and works well also.
>
> So maybe we can just rely on Python's smart little brain and do
> something like this:

In Python's Lib/encodings/aliases.py, you will find the encoding name mapping table.

--
I like python!
My Blog: http://www.donews.net/limodou
My Django Site: http://www.djangocn.org
NewEdit Maillist: http://groups.google.com/group/NewEdit
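A quick way to see the alias table at work, sketched for Python 3 with the standard `codecs` and `encodings.aliases` modules. The variable names are only for illustration.

```python
import codecs
from encodings.aliases import aliases

# The alias table maps normalized spellings to Python's codec module names.
alias_target = aliases["windows_1251"]

# codecs.lookup() normalizes the requested name and consults the same table,
# so both spellings from Ivan's message resolve to one codec.
same_codec = codecs.lookup("windows-1251").name == codecs.lookup("cp1251").name
```

This is why a name returned by `locale.getpreferredencoding()` can generally be passed straight to a decode call, as long as Python ships a codec for it.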
Re: django unicode-conversion, beginning
gabor wrote:
> hmmm.. are you sure that the situation with unicode-aware editors is so bad?
>
> could you name some non-unicode-aware editors?
> for me it seems that from notepad through vim to eclipse everything does
> unicode fine...

Ok, I should rephrase it. Even if most editors do support utf-8, they aren't configured to do so by default. Unfortunately there is some notion that unicode is something "new" and "scary" and "who knows what problems it will cause". So on systems where utf-8 is not the default environment setting (meaning all Windows and many Linuxes), if a programmer starts his favorite text editor, odds are that it will not save a new file in utf-8.

But to be sure I'd better run a poll on my forum about it...
Re: django unicode-conversion, beginning
Malcolm Tredinnick wrote:
> I completely agree this is painful and normally I would punt. But my
> crystal ball tells me that you will then get bug reports from Mr
> Sagalaev, who is generally both very diligent in his debugging and likes
> to use some language with a funny alphabet. If whatever you come up with
> works naturally in places like Ivan's setup and maybe somebody who lives
> in Hong Kong or Japan or some other East Asian locale, you could
> consider this "solved" to some extent.

I'm afraid I'm not a very good tester for this exact problem. Python on my Ubuntu happily says 'UTF-8' when asked 'locale.getpreferredencoding()'. But indeed I can always try these things with my compatriots using Windows or configuring their linuxes with old single-byte 'KOI8-R'.

In fact I was under the impression that a string returned from this function can be safely used for decoding. For example, on Russian Windows it returns 'cp1251', which works perfectly well even though it is not the standard name ('windows-1251', which works well too).

So maybe we can just rely on Python's smart little brain and do something like this:

- try decoding from locale.getpreferredencoding()
- failing that try something safe like iso-8859-1
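
The two steps above can be sketched roughly as follows. This is Python 3 (so the raw value is a bytes object, standing in for the bytestrings of the time), and decode_with_locale is a hypothetical helper name, not something from the patch.

```python
import locale


def decode_with_locale(raw, fallback="iso-8859-1"):
    """Decode bytes with the locale's preferred encoding, then a safe fallback."""
    preferred = locale.getpreferredencoding(False)
    try:
        return raw.decode(preferred)
    except (UnicodeDecodeError, LookupError):
        # iso-8859-1 assigns a character to every byte value, so this step
        # cannot raise; it may still produce the wrong characters.
        return raw.decode(fallback)


if __name__ == "__main__":
    print(repr(decode_with_locale(b"plain ascii")))
    print(repr(decode_with_locale(b"\xff\xfe\xe9")))
```

The fallback guarantees a result, not a correct one, which is why the rest of the thread wants this kind of guessing limited to places like the environment.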
Re: django unicode-conversion, beginning
On Wed, 2006-08-09 at 21:51 +0200, gabor wrote:
[...]
> phew... the immortal
> how-tolerant-we-should-be-when-doing-unicode-conversion problems :-)

Agreed. This is much easier on my side of the fence (lobbing problems), than your side (solving them).

> i generally prefer to do as little guesswork as possible, but in the
> case of the environ-variables it seems we cannot avoid it.. after all,
> it cannot crash when parsing the environ variables, because there's no
> way from the programmer's side to affect them.
>
> so what do you think about the following approach:
>
> try ascii-decoding
> if fails, try utf8-decoding
> if fails do iso-8859-1-decoding (this cannot fail).

I was thinking you could use the locale module to help you somewhat: locale.getdefaultlocale() and locale.getpreferredencoding() might both be useful, although experimentation is needed. For example, on my (Linux) system, getdefaultlocale() returns ('en_AU', 'utf') and I'm pretty sure 'utf' isn't an encoding (utf-8 is, utf-16 also, but not plain old utf.. :-( ).

I completely agree this is painful and normally I would punt. But my crystal ball tells me that you will then get bug reports from Mr Sagalaev, who is generally both very diligent in his debugging and likes to use some language with a funny alphabet. If whatever you come up with works naturally in places like Ivan's setup and maybe somebody who lives in Hong Kong or Japan or some other East Asian locale, you could consider this "solved" to some extent.

All that being said, you could start off implementing your list and go from there (although surely utf-8 decoding will also handle ASCII strings, so you could skip the first step).

> but imho this should happen only in "special" cases like
> environ-variables.. for example in get/post params i would prefer to
> raise an exception when the data cannot be en/de-coded using the
> configured charset.

*Providing* what we send in the headers is that restrictive. A server can send what character set encodings it will accept in the header. The client can pick any one of those to send back. So keep that on your list of things to check (this is HTTP-level stuff).

Regards,
Malcolm
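
As a rough illustration of gabor's list with Malcolm's simplification applied (ASCII is a subset of UTF-8, so the first step can go), here is a Python 3 sketch. The function name decode_environ_value is an assumption for this example only.

```python
def decode_environ_value(raw):
    """Best-effort decoding of a raw environment value (bytes)."""
    try:
        # Pure ASCII input is valid UTF-8, so no separate ASCII attempt is needed.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Last resort: every byte maps to some character in iso-8859-1.
        return raw.decode("iso-8859-1")


if __name__ == "__main__":
    for sample in (b"PATH", "\u20ac50,00".encode("utf-8"), b"caf\xe9"):
        print(repr(sample), "->", repr(decode_environ_value(sample)))
```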
Re: django unicode-conversion, beginning
Malcolm Tredinnick wrote:
> A couple of comments on the patch itself. I realise it's only a proof of
> concept at the moment, so take as more things to think about when you
> want to tidy it up:
>
> (1) A docstring like """needed to workaround the cgi.parse_sql
> unicode-problem""" is not very future-proof. *What* parse_sql unicode
> problem? How will we know if/when it goes away? Either a quick
> description of the problem or a URL if it's tricky and explained
> elsewhere will help people who need to read this code in six months
> time.

ok

>
> (2) You can't necessarily assume the environment is always in ASCII (or
> maybe you can; see below). For example, my current locale is set to
> en_AU.UTF-8 and I can do
>
> export foo="€50,00"
>
> If I'm not careful when parsing os.environ['foo'] this comes out as
> rubbish (I need to do unicode(os.environ['foo'], 'utf-8') or similar).
>
> Probably some playing around with the locale module to work out the
> right behaviour and getting a few people to test things (e.g. Windows
> vs. Linux vs. Macs, etc) will be necessary. It's also important not to
> go too overboard here, but since arbitrary environment variables can be
> set through Apache, we need to be able to work with that to be
> "correct". Hmm ... what are the restrictions on what webservers can put
> in their config files? Maybe ASCII-only is reasonable. *shrug*
>

phew... the immortal how-tolerant-we-should-be-when-doing-unicode-conversion problems :-)

i generally prefer to do as little guesswork as possible, but in the case of the environ-variables it seems we cannot avoid it.. after all, it cannot crash when parsing the environ variables, because there's no way from the programmer's side to affect them.

so what do you think about the following approach:

try ascii-decoding
if fails, try utf8-decoding
if fails do iso-8859-1-decoding (this cannot fail).

?

but imho this should happen only in "special" cases like environ-variables.. for example in get/post params i would prefer to raise an exception when the data cannot be en/de-coded using the configured charset.

> Maybe more investigation needed here.
>
> (3) I know there are some software projects apparently using unicodize
> as a word, but ... *shudder*. Using "code" as an analogy, "unicodify"
> would be nicer (nobody uses "codize", I would hope).
>

ok

> (4) As you go through this process, keep a list somewhere of what people
> need to do to port existing applications across to using this
> functionality. Ideally, the answer would be "not much" and we can cast
> from the default encoding to unicode internally where necessary. But I'm
> sure there will be some changes required, so keeping a list of things to
> watch out for as you go will help people test this for you.
>

will try.

gabor
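
The stricter behaviour preferred for GET/POST data might look something like this in Python 3, where urllib.parse.parse_qsl accepts encoding and errors arguments. CONFIGURED_CHARSET is only a stand-in for settings.DEFAULT_CHARSET and parse_query_strict is a made-up name.

```python
from urllib.parse import parse_qsl

CONFIGURED_CHARSET = "utf-8"  # stand-in for settings.DEFAULT_CHARSET


def parse_query_strict(query_string, charset=CONFIGURED_CHARSET):
    """Parse a query string, raising UnicodeDecodeError on undecodable data."""
    return parse_qsl(
        query_string,
        keep_blank_values=True,
        encoding=charset,
        errors="strict",  # no silent replacement characters, no guessing
    )


if __name__ == "__main__":
    print(parse_query_strict("price=%E2%82%AC50&empty="))
    try:
        parse_query_strict("price=%FF")
    except UnicodeDecodeError as exc:
        print("rejected:", exc.reason)
```

Unlike the environ case, the failure here is reported to the caller instead of being papered over with a fallback encoding.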
Re: django unicode-conversion, beginning
Ivan Sagalaev wrote:
> First of all, Gabor, thank you very much for doing this!
>

thanks :)

> gabor wrote:
>> today i experimented a little with the django source code,
>> and here are the results.
>>
>> if you apply a very small patch (65lines, attached), you can write a view
>> completely in unicode.
>> means:
>> - GET/POST contains unicode data
>> - request.META contains unicode data
>> - you can put unicode text into the HttpResponse (this was already possible
>> without the patch)
>
> Here's a problem that I didn't know how to solve last time this topic
> was discussed.
>
> You can put unicode in HttpResponse. Does it imply that template
> processing should be done in unicode too? I mean, should context data
> be in unicode?

yes

> This would be convenient later because we will get all
> the data from DB in unicode also. But this poses a problem of encoding
> of actual template files.
>
> We need to know the encoding of a template file. This can be done by
> just mandating that they should be in settings.DEFAULT_CHARSET or we
> should create a new setting (TEMPLATE_CHARSET). The reason of having
> two different settings is that enforcing default UTF-8 in templates
> means enforcing people to use unicode-aware text editors that are not
> that common.

hmmm.. are you sure that the situation with unicode-aware editors is so bad?

could you name some non-unicode-aware editors?
for me it seems that from notepad through vim to eclipse everything does unicode fine...

gabor
Re: django unicode-conversion, beginning
First of all, Gabor, thank you very much for doing this!

gabor wrote:
> today i experimented a little with the django source code,
> and here are the results.
>
> if you apply a very small patch (65lines, attached), you can write a view
> completely in unicode.
> means:
> - GET/POST contains unicode data
> - request.META contains unicode data
> - you can put unicode text into the HttpResponse (this was already possible
> without the patch)

Here's a problem that I didn't know how to solve last time this topic was discussed.

You can put unicode in HttpResponse. Does it imply that template processing should be done in unicode too? I mean, should context data be in unicode? This would be convenient later because we will get all the data from the DB in unicode also. But this raises the problem of the encoding of the actual template files.

We need to know the encoding of a template file. This can be done by just mandating that they should be in settings.DEFAULT_CHARSET, or we should create a new setting (TEMPLATE_CHARSET). The reason for having two different settings is that enforcing a default of UTF-8 in templates means forcing people to use unicode-aware text editors, which are not that common.
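
To make the TEMPLATE_CHARSET idea concrete, here is a rough Python 3 sketch. TEMPLATE_CHARSET is the setting proposed above and does not exist in Django as discussed here; load_template_source is likewise a hypothetical helper, and the KOI8-R value is just an example of a non-UTF-8 editor default.

```python
import os
import tempfile

TEMPLATE_CHARSET = "koi8-r"  # hypothetical setting, as proposed in this message


def load_template_source(path, charset=TEMPLATE_CHARSET):
    """Read a template file from disk and return its contents as text."""
    with open(path, "r", encoding=charset) as handle:
        return handle.read()


# Simulate a template saved by an editor that writes KOI8-R.
GREETING = "<p>\u041f\u0440\u0438\u0432\u0435\u0442</p>"
with tempfile.TemporaryDirectory() as tmp:
    template_path = os.path.join(tmp, "hello.html")
    with open(template_path, "wb") as handle:
        handle.write(GREETING.encode("koi8-r"))
    source = load_template_source(template_path)

if __name__ == "__main__":
    print(source == GREETING)
```

Once the source is text, the rest of the template pipeline no longer needs to care which editor produced the file.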
Re: django unicode-conversion, beginning
Shouldn't the UTF-8 encoding also be declared in all files, as described here: http://www.python.org/dev/peps/pep-0263/ ?

That is, using

#!/usr/bin/python
# -*- coding: UTF-8 -*-

at the beginning of python code files.

This works pretty well, at least when you need to create new instances of models containing multilingual characters via a python script file.

Regards,
Aidas Bendoraitis [aka Archatas]

On 8/9/06, Malcolm Tredinnick <[EMAIL PROTECTED]> wrote:
>
> Hey Gabor,
>
> On Wed, 2006-08-09 at 01:03 +0200, gabor wrote:
> > today i experimented a little with the django source code,
> > and here are the results.
> >
> > if you apply a very small patch (65lines, attached), you can write a view
> > completely in unicode.
> > means:
> > - GET/POST contains unicode data
> > - request.META contains unicode data
> > - you can put unicode text into the HttpResponse (this was already possible
> > without the patch)
> >
> > of course, this patch is a demonstration only. the charset is hardcoded
> > to UTF-8 (should be settings.DEFAULT_CHARSET), and it only handles the
> > WSGI way (the mod_python one is not handled). also templating and ORM
> > are not touched. (not to mention the ugliness of the code)
> >
> > but still, i was quite surprised that with such small changes so much
> > can be done.
>
> The low-hanging fruit are definitely the place to start for this sort of
> thing.
>
> >
> > i think unicodizing django can be done in 4 easily separated steps/parts:
> >
> > 1. request/response
> > 2. templating-system
> > 3. database-system
> > 4. "overall unicode-conversion". this is mostly about replacing
> > bytestrings with u"bla" in the code, and switching __str__ to __unicode__
> >
> > my biggest problem currently is, that i do not know how to continue...
> > should i just write more and more patches to increase the
> > "unicode-coverage" to more parts of django? or maybe a more coordinated
> > approach would be better?
>
> Ultimately, getting you a svn branch to work in will probably be
> easiest. Maintaining a bunch of separate patches against a rapidly
> changing tree can be fairly time consuming. I'm not sure what the
> procedure is for that. Adrian?
>
> Keeping the changes as reasonably independent as possible is a great
> idea as far as you can take it. It will make review and testing a lot
> easier, as well as keeping you saner because you will only have to be
> looking at one layer at a time.
>
> A couple of comments on the patch itself. I realise it's only a proof of
> concept at the moment, so take as more things to think about when you
> want to tidy it up:
>
> (1) A docstring like """needed to workaround the cgi.parse_sql
> unicode-problem""" is not very future-proof. *What* parse_sql unicode
> problem? How will we know if/when it goes away? Either a quick
> description of the problem or a URL if it's tricky and explained
> elsewhere will help people who need to read this code in six months
> time.
>
> (2) You can't necessarily assume the environment is always in ASCII (or
> maybe you can; see below). For example, my current locale is set to
> en_AU.UTF-8 and I can do
>
> export foo="€50,00"
>
> If I'm not careful when parsing os.environ['foo'] this comes out as
> rubbish (I need to do unicode(os.environ['foo'], 'utf-8') or similar).
>
> Probably some playing around with the locale module to work out the
> right behaviour and getting a few people to test things (e.g. Windows
> vs. Linux vs. Macs, etc) will be necessary. It's also important not to
> go too overboard here, but since arbitrary environment variables can be
> set through Apache, we need to be able to work with that to be
> "correct". Hmm ... what are the restrictions on what webservers can put
> in their config files? Maybe ASCII-only is reasonable. *shrug*
>
> Maybe more investigation needed here.
>
> (3) I know there are some software projects apparently using unicodize
> as a word, but ... *shudder*. Using "code" as an analogy, "unicodify"
> would be nicer (nobody uses "codize", I would hope).
>
> (4) As you go through this process, keep a list somewhere of what people
> need to do to port existing applications across to using this
> functionality. Ideally, the answer would be "not much" and we can cast
> from the default encoding to unicode internally where necessary. But I'm
> sure there will be some changes required, so keeping a list of things to
> watch out for as you go will help people test this for you.
>
> Good to see somebody working on this. :-)
>
> Regards,
> Malcolm
>
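
To see what the PEP 263 declaration at the top of this message actually changes, here is a small Python 3 experiment. It compiles source code supplied as raw bytes, once with a coding line and once without; run_source and the two constants are names invented for this sketch. Note that Python 3 assumes UTF-8 when no declaration is present, whereas the Python 2 of this thread assumed ASCII.

```python
# The byte 0xE9 is "e with acute accent" in iso-8859-1 but is not valid UTF-8 on its own.
DECLARED_SOURCE = b"# -*- coding: iso-8859-1 -*-\ntext = '\xe9'\n"
UNDECLARED_SOURCE = b"text = '\xe9'\n"


def run_source(source_bytes):
    """Compile and execute byte source; return its namespace, or None on failure."""
    namespace = {}
    try:
        exec(compile(source_bytes, "<demo>", "exec"), namespace)
    except (SyntaxError, ValueError):
        # Without a matching declaration the bytes cannot be decoded.
        return None
    return namespace


if __name__ == "__main__":
    declared = run_source(DECLARED_SOURCE)
    print("with declaration:", repr(declared["text"]) if declared else None)
    print("without declaration:", run_source(UNDECLARED_SOURCE))
```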
Re: django unicode-conversion, beginning
Hey Gabor,

On Wed, 2006-08-09 at 01:03 +0200, gabor wrote:
> today i experimented a little with the django source code,
> and here are the results.
>
> if you apply a very small patch (65lines, attached), you can write a view
> completely in unicode.
> means:
> - GET/POST contains unicode data
> - request.META contains unicode data
> - you can put unicode text into the HttpResponse (this was already possible
> without the patch)
>
> of course, this patch is a demonstration only. the charset is hardcoded
> to UTF-8 (should be settings.DEFAULT_CHARSET), and it only handles the
> WSGI way (the mod_python one is not handled). also templating and ORM
> are not touched. (not to mention the ugliness of the code)
>
> but still, i was quite surprised that with such small changes so much
> can be done.

The low-hanging fruit are definitely the place to start for this sort of thing.

>
> i think unicodizing django can be done in 4 easily separated steps/parts:
>
> 1. request/response
> 2. templating-system
> 3. database-system
> 4. "overall unicode-conversion". this is mostly about replacing
> bytestrings with u"bla" in the code, and switching __str__ to __unicode__
>
> my biggest problem currently is, that i do not know how to continue...
> should i just write more and more patches to increase the
> "unicode-coverage" to more parts of django? or maybe a more coordinated
> approach would be better?

Ultimately, getting you a svn branch to work in will probably be easiest. Maintaining a bunch of separate patches against a rapidly changing tree can be fairly time consuming. I'm not sure what the procedure is for that. Adrian?

Keeping the changes as reasonably independent as possible is a great idea as far as you can take it. It will make review and testing a lot easier, as well as keeping you saner because you will only have to be looking at one layer at a time.

A couple of comments on the patch itself. I realise it's only a proof of concept at the moment, so take as more things to think about when you want to tidy it up:

(1) A docstring like """needed to workaround the cgi.parse_sql unicode-problem""" is not very future-proof. *What* parse_sql unicode problem? How will we know if/when it goes away? Either a quick description of the problem or a URL if it's tricky and explained elsewhere will help people who need to read this code in six months time.

(2) You can't necessarily assume the environment is always in ASCII (or maybe you can; see below). For example, my current locale is set to en_AU.UTF-8 and I can do

    export foo="€50,00"

If I'm not careful when parsing os.environ['foo'] this comes out as rubbish (I need to do unicode(os.environ['foo'], 'utf-8') or similar).

Probably some playing around with the locale module to work out the right behaviour and getting a few people to test things (e.g. Windows vs. Linux vs. Macs, etc) will be necessary. It's also important not to go too overboard here, but since arbitrary environment variables can be set through Apache, we need to be able to work with that to be "correct". Hmm ... what are the restrictions on what webservers can put in their config files? Maybe ASCII-only is reasonable. *shrug*

Maybe more investigation needed here.

(3) I know there are some software projects apparently using unicodize as a word, but ... *shudder*. Using "code" as an analogy, "unicodify" would be nicer (nobody uses "codize", I would hope).

(4) As you go through this process, keep a list somewhere of what people need to do to port existing applications across to using this functionality. Ideally, the answer would be "not much" and we can cast from the default encoding to unicode internally where necessary. But I'm sure there will be some changes required, so keeping a list of things to watch out for as you go will help people test this for you.

Good to see somebody working on this. :-)

Regards,
Malcolm
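
Point (2) can be reproduced without touching the real environment. The sketch below is Python 3 and builds the bytes that a UTF-8 shell would store for foo, then decodes them three ways; try_decode and RAW_VALUE are illustrative names, not part of the patch.

```python
# Bytes a UTF-8 terminal would place in the environment for: export foo="€50,00"
RAW_VALUE = "\u20ac50,00".encode("utf-8")


def try_decode(raw, encoding):
    """Return the decoded text, or None when the bytes do not fit the encoding."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return None


if __name__ == "__main__":
    for encoding in ("ascii", "utf-8", "iso-8859-1"):
        print(encoding, "->", repr(try_decode(RAW_VALUE, encoding)))
```

The ascii attempt fails outright, utf-8 recovers the euro sign, and iso-8859-1 "succeeds" with three unrelated characters, which is the rubbish Malcolm describes.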
django unicode-conversion, beginning
today i experimented a little with the django source code,
and here are the results.

if you apply a very small patch (65lines, attached), you can write a view completely in unicode.
means:
- GET/POST contains unicode data
- request.META contains unicode data
- you can put unicode text into the HttpResponse (this was already possible without the patch)

of course, this patch is a demonstration only. the charset is hardcoded to UTF-8 (should be settings.DEFAULT_CHARSET), and it only handles the WSGI way (the mod_python one is not handled). also templating and ORM are not touched. (not to mention the ugliness of the code)

but still, i was quite surprised that with such small changes so much can be done.

i think unicodizing django can be done in 4 easily separated steps/parts:

1. request/response
2. templating-system
3. database-system
4. "overall unicode-conversion". this is mostly about replacing bytestrings with u"bla" in the code, and switching __str__ to __unicode__

my biggest problem currently is, that i do not know how to continue... should i just write more and more patches to increase the "unicode-coverage" to more parts of django? or maybe a more coordinated approach would be better?

because the actual conversion is not that hard. it's just that it touches a lot of parts... so it's not too deep, but very wide :-)

gabor

Index: django/http/__init__.py
===================================================================
--- django/http/__init__.py (revision 3538)
+++ django/http/__init__.py (working copy)
@@ -73,13 +73,23 @@
                 POST.appendlist(name_dict['name'], submessage.get_payload())
     return POST, FILES
 
+
+def hacked_parse_qsl(query_string,flag):
+    """needed to workaround the cgi.parse_sql unicode-problem"""
+    query_string = query_string.encode('ascii')
+    #FIXME: use settings.DEFAULT_CHARSET here
+    q = parse_qsl(query_string,flag)
+
+    return [ [k.decode('utf8'),v.decode('utf8')] for (k,v) in q]
+
+
 class QueryDict(MultiValueDict):
     """A specialized MultiValueDict that takes a query string when initialized.
     This is immutable unless you create a copy of it."""
     def __init__(self, query_string, mutable=False):
         MultiValueDict.__init__(self)
         self._mutable = True
-        for key, value in parse_qsl((query_string or ''), True): # keep_blank_values=True
+        for key, value in hacked_parse_qsl((query_string or ''), True): # keep_blank_values=True
             self.appendlist(key, value)
         self._mutable = mutable
 
@@ -147,6 +157,7 @@
     if cookie == '':
         return {}
     c = SimpleCookie()
+    cookie = cookie.encode('ascii') #fix needed for Cookie.SimpleCookie
     c.load(cookie)
     cookiedict = {}
     for key in c.keys():
Index: django/core/handlers/wsgi.py
===================================================================
--- django/core/handlers/wsgi.py (revision 3538)
+++ django/core/handlers/wsgi.py (working copy)
@@ -50,8 +50,19 @@
     505: 'HTTP VERSION NOT SUPPORTED',
 }
 
+def unicodize_environ(environ):
+
+    def unicodize_item(key,value):
+        key = key.decode('ascii')
+        if not key.startswith('wsgi.'):
+            value = value.decode('ascii')
+        return (key,value)
+
+    return dict([unicodize_item(*i) for i in environ.items()])
+
 class WSGIRequest(http.HttpRequest):
     def __init__(self, environ):
+        environ = unicodize_environ(environ)
         self.environ = environ
         self.path = environ['PATH_INFO']
         self.META = environ
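
For readers on Python 3, the unicodize_environ idea from the patch might be restated roughly as below. This is only an illustration: it assumes an environ-like dict with bytes keys and values (mimicking the Python 2 bytestrings the patch handles), whereas a real Python 3 WSGI environ already uses str. The name decode_environ and the charset parameter are not from the patch.

```python
def decode_environ(environ, charset="ascii"):
    """Return a copy of a bytes-based environ with text keys and values."""
    decoded = {}
    for key, value in environ.items():
        key = key.decode("ascii")
        # wsgi.* entries hold streams, tuples and flags rather than text,
        # so they are passed through untouched, as in the patch.
        if not key.startswith("wsgi.") and isinstance(value, bytes):
            value = value.decode(charset)
        decoded[key] = value
    return decoded


if __name__ == "__main__":
    sample = {b"PATH_INFO": b"/polls/", b"wsgi.version": (1, 0)}
    print(decode_environ(sample))
```

Making the charset a parameter leaves room for the locale-based or fallback decoding discussed in the replies, instead of the hardcoded 'ascii' of the demonstration patch.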