Actually, if the goal is to get as many characters in as possible, Punycode might be the best solution. That is the encoding used for internationalized domain names. In that form it restricts itself to a small set of byte values (ASCII letters, digits, and the hyphen), but the underlying algorithm is parameterized, and a different parameterization would allow use of all byte values.
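As a rough illustration of the domain-name form, Python's standard library ships a `punycode` codec. The sketch below only exercises that standard ASCII-restricted profile; the sample string is an arbitrary choice and not from this thread, and a parameterization using all byte values would need a custom implementation, which is not shown here.

```python
# Punycode as used for internationalized domain labels (RFC 3492).
# The sample string is an arbitrary illustration, not taken from the thread.
text = "bücher"

encoded = text.encode("punycode")      # output uses ASCII bytes only
decoded = encoded.decode("punycode")   # round trip back to the original

print(encoded)                         # b'bcher-kva'
print(decoded == text)                 # True

# Size comparison against fixed-width 16-bit code units for the same text.
print(len(encoded), len(text.encode("utf-16-be")))   # 9 12
```

For this sample the encoded form takes 9 bytes against 12 for 16-bit code units; how well that carries over to longer messages or other scripts is not something this snippet demonstrates.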
------------------------------
Mark <https://plus.google.com/114199149796022210033>

*Il meglio è l’inimico del bene*

On Fri, Apr 27, 2012 at 11:21, Doug Ewell <d...@ewellic.org> wrote:

> Cristian Secară <orice at secarica dot ro> wrote:
>
> > It turned out that they (ETSI and its groups) created a way around
> > the 70-character limitation, namely the “National Language Single
> > Shift” and “National Language Locking Shift” mechanisms. These are
> > described in the 3GPP TS 23.038 standard and were introduced in
> > Release 8. In short, they amount to a character substitution table,
> > defined per language and applied per character or per message.
> >
> > Personally I find this to be a stone-age approach, which in my
> > opinion does not work at all if I enter the message from my PC
> > keyboard via the phone's PC application (because the language cannot
> > always be predicted, especially if I am using dead keys). It is true
> > that the actual SMS stream limit is not very generous, but I wonder
> > if SCSU would have been a better approach in terms of i18n. I also
> > don't know whether SCSU requires a language to be declared
> > beforehand, or whether it simply works out by itself the required
> > window for each character.
>
> I agree that treating character repertoire as simply a matter of
> language selection, and creating language-specific code pages, is a
> backward-looking solution. Not only is language tagging not always an
> option, as Cristian points out, but people don't want to be tied to the
> absolute minimum character repertoire that someone decided was necessary
> to write a given language, even in a text message. Just look at the rise
> of emoji in text messages.
>
> And, of course, I agree that SCSU would have been a much better
> solution. Most of the current arguments against SCSU wouldn't apply to
> SMS: the cross-site scripting argument wouldn't apply if SCSU were the
> only "extended" encoding, or if the protocol tagged it, and the
> complex-encoder argument wouldn't apply to any phone from the last 5
> years that can take pictures and shoot videos and scan bar codes and run
> numerous apps simultaneously. (SCSU doesn't require a complex encoder
> anyway, although it can benefit incrementally from one.)
>
> Interestingly, one of the first mentions I can find on the Unicode list
> of SCSU-like compression (actually a description of RCSU, the
> predecessor to SCSU) was in the context of SMS message compression:
>
> http://www.unicode.org/mail-arch/unicode-ml/Archives-Old/UML001/0242.html
>
> Neither RCSU nor SCSU quite fits the original bill, which was to
> represent Unicode in 7 bits per character (with some overhead) and thus
> achieve 160 characters per message. Both schemes use 8-bit code units.
> Still, 140 characters is much better than 70.
>
> > SCSU seems to be OK for my language, or Hungarian, or Bulgarian,
> > etc., but is it also OK for non-Latin and non-Cyrillic scripts?
> > Compare this with the language shift mechanism, which is still
> > 7-bit. Release 10 of that standard includes language locking shift
> > tables for Turkish, Portuguese, Bengali, Gujarati, Hindi, Kannada,
> > Malayalam, Oriya, Punjabi, Tamil, Telugu and Urdu.
>
> SCSU works equally well, or almost so, with any text sample where the
> non-ASCII characters fit into a single block of 128 code points. For
> anything other than Latin-1 you need one byte of overhead, to switch to
> another window, and for many scripts you need two, to define a window
> and switch to it. But again, two bytes is not what's holding anyone up.
>
> --
> Doug Ewell | Thornton, Colorado, USA
> http://www.ewellic.org | @DougEwell
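The capacity figures mentioned in the thread (160, 70, and roughly 140 characters) can be checked with a little arithmetic. This is only a sketch: it assumes a 140-octet message payload with no extra header, and text whose characters are all ASCII or fall inside one 128-code-point SCSU window. The function names are made up for illustration.

```python
# Assumed payload size of a single SMS, in octets (implied by the
# 160 x 7-bit and 70 x 16-bit figures quoted in the thread).
SMS_PAYLOAD_OCTETS = 140


def capacity_7bit() -> int:
    """Characters per message when each character takes 7 bits."""
    return SMS_PAYLOAD_OCTETS * 8 // 7


def capacity_16bit() -> int:
    """Characters per message when each character takes two octets."""
    return SMS_PAYLOAD_OCTETS // 2


def capacity_scsu_single_window(overhead_bytes: int) -> int:
    """Characters per message for SCSU text that fits one window.

    overhead_bytes is 0 for Latin-1, 1 to switch to another window,
    or 2 to define a window and switch to it, as described above.
    """
    return SMS_PAYLOAD_OCTETS - overhead_bytes


print(capacity_7bit())                    # 160
print(capacity_16bit())                   # 70
for overhead in (0, 1, 2):
    print(overhead, capacity_scsu_single_window(overhead))   # 140, 139, 138
```

Text that mixes several scripts would need more window switches, so these numbers are a best case for single-script messages rather than a general bound.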