On Oct 7, 7:13 pm, Xah Lee <xah...@gmail.com> wrote:
> here's my experiences dealing with unicode in various langs.
>
> Unicode Support in Ruby, Perl, Python, Emacs Lisp
>
> Xah Lee, 2010-10-07
>
> I looked at Ruby 2 years ago. One problem i found is that it does not
> support Unicode well. I just checked today, it still doesn't. Just do
> a web search on blog and forums on “ruby unicode”. e.g.: Source,
> Source, Source, Source.
>
> Perl's exceedingly lousy unicode support hack is well known. In fact
> it is the primary reason i “switched” to python for my scripting needs
> in 2005. (See: Unicode in Perl and Python)
>
> Python 2.x's unicode support is also not ideal. You have to declare
> your source code with header like 「#-*- coding: utf-8 -*-」, and you
> have to declare your string as unicode with “u”, e.g. 「u"林花謝了春紅"」.
> In regex, you have to use unicode flag such as
> 「re.search(r'\.html$',child,re.U)」. And when processing files, you
> have to read in with 「unicode(inF.read(),'utf-8')」, and printing out
> unicode you have to do 「outF.write(outtext.encode('utf-8'))」. If you
> are processing lots of files, and if one of the file contains a bad
> char or doesn't use encoding you expected, your python script chokes
> dead in the middle, you don't even know which file it is or which
> line unless your code print file names.
>
> Also, if the output shell doesn't support unicode or doesn't match
> with the encoding specified in your python print, you get gibberish.
> It is often a headache to figure out the locale settings, what
> encoding the terminal support or is configured to handle, the encoding
> of your file, the which encoding the “print” is using. It gets more
> complex if you are going thru a network, such as ssh. (most shells,
> terminals, as of 2010-10, in practice, still have problems dealing
> with unicode. (e.g. Windows Console, PuTTY. Exception being Mac's
> Apple Terminal.))
>
> Python 3 supposedly fixed the unicode problem, but i haven't used it.
> Last time i looked into whether i should adopt python 3, but
> apparently it isn't used much. (See: Python 3 Adoption) (and i'm quite
> pissed that Python is going more and more into OOP mumbo jumbo with
> lots ad hoc syntax (e.g. “views”, “iterators”, “list comprehension”.))
>
> I'll have to say, as far as text processing goes, the most beautiful
> lang with respect to unicode is emacs lisp. In elisp code (e.g.
> Generate a Web Links Report with Emacs Lisp ), i don't have to declare
> none of the unicode or encoding stuff. I simply write code to process
> string or buffer text, without even having to know what encoding it
> is. Emacs the environment takes care of all that.
>
> It seems that javascript and PHP also support unicode well, but i
> don't have extensive experience with them. I suppose that elisp, php,
> javascript, all support unicode well because these langs have to deal
> with unicode in practical day-to-day situations.
>
> --------------------------------------------------
> for links,
> see http://xahlee.blogspot.com/2010/10/unicode-support-in-ruby-perl-pytho...
>
> Xah ∑ xahlee.org ☄
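
On the point above about a script dying on one bad file without saying which one: here is a minimal sketch of one way to handle it. It assumes Python 3 and that the files are meant to be UTF-8. The function name `read_utf8_files` is made up for illustration and is not from the quoted post.

```python
def read_utf8_files(paths):
    """Decode each file as UTF-8, collecting failures instead of stopping."""
    texts = {}     # path -> decoded text
    failures = []  # (path, line number, reason) for files that did not decode
    for path in paths:
        # Read raw bytes so that decoding is an explicit, catchable step.
        with open(path, "rb") as f:
            raw = f.read()
        try:
            texts[path] = raw.decode("utf-8")
        except UnicodeDecodeError as err:
            # err.start is the byte offset of the first undecodable byte;
            # counting the newlines before it gives a 1-based line number.
            line_no = raw.count(b"\n", 0, err.start) + 1
            failures.append((path, line_no, err.reason))
    return texts, failures
```

Run over a list of paths, this returns the decoded texts plus a list of which files failed and on which line, so one bad file does not stop the batch. Whether to skip, log, or abort on a failure is a choice left to the caller.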

Maybe you checked the wrong version. There are two versions of Ruby in use: one supports Unicode and the other doesn't. The latest one, the 1.9.x branch, has made some progress in that regard. Please check the following links to see if they solve your problem.

http://nuclearsquid.com/writings/ruby-1-9-encodings.html
http://loopkid.net/articles/2008/07/07/ruby-1-9-utf-8-mostly-works
http://stackoverflow.com/questions/1627767/rubys-stringgsub-unicode-and-non-word-characters

I think the latest recommended version of Ruby is ruby 1.9.2p0; please try it and see if it works for you.

Of course it is not as good as Lisp, and in Rails code you see people writing the same sequences of characters over and over again, but some people like it because it is better than the other languages they used before. If it's a stepping stone towards Lisp, then it is a good thing imho.

--
http://mail.python.org/mailman/listinfo/python-list