Re: [OT] Effect of UTF-8 on 2G connections

Joakim via Digitalmars-d Wed, 01 Jun 2016 09:51:13 -0700

On Wednesday, 1 June 2016 at 14:58:47 UTC, Marco Leise wrote:

Am Wed, 01 Jun 2016 13:57:27 +0000
schrieb Joakim <dl...@joakim.fea.st>:
No, I explicitly said not the web in a subsequent post. The ignorance here of what 2G speeds are like is mind-boggling.
I've used 56k and had a phone conversation with my sister while she was downloading an 800 MiB file over 2G. You just learn to be patient (or you already are when the next major city is hundreds of kilometers away) and load only what you need. Your point about the costs convinced me more.

I see that max 2G speeds are 100-200 kbits/s. At that rate, it would have taken her more than 10 hours to download such a large file, that's nuts. The worst part is when the download gets interrupted and you have to start over again because most download managers don't know how to resume, including the stock one on Android.
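As a rough check of that estimate, here is a minimal sketch. It assumes an ideal link with no protocol overhead, stalls, or retries, and it takes the 100-200 kbit/s range quoted above at face value; the function name `download_hours` is made up for this illustration.

```python
MIB = 1024 * 1024  # bytes in one mebibyte


def download_hours(size_bytes, rate_kbit_per_s):
    """Ideal transfer time in hours, ignoring overhead and interruptions."""
    bits = size_bytes * 8
    seconds = bits / (rate_kbit_per_s * 1000)
    return seconds / 3600


size = 800 * MIB  # the 800 MiB file mentioned above
for rate in (100, 200):  # assumed sustained rates in kbit/s
    print(f"{rate} kbit/s: {download_hours(size, rate):.1f} h")
```

Under those assumptions the transfer takes roughly 18.6 hours at 100 kbit/s and about 9.3 hours at 200 kbit/s, so "more than 10 hours" holds for anything below the very top of that range.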

Also, people in these countries buy packs of around 100-200 MB for 30-60 US cents, so they would never download such a large file. They use messaging apps like WhatsApp or WeChat, which nobody in the US uses, to avoid onerous SMS charges.

Here is one article spiced up with numbers and figures: http://www.thequint.com/technology/2016/05/30/almost-every-indian-may-be-online-if-data-cost-cut-to-one-third

Yes, only the middle class, which are at most 10-30% of the population in these developing countries, can even afford 2G. The way to get costs down even further is to make the tech as efficient as possible. Of course, much of the rest of the population are illiterate, so there are bigger problems there.

But even if you could prove with a study that UTF-8 caused a
notable bandwidth cost in real life, it would - I think - be a
matter of regional ISPs to provide special servers and apps
that reduce data volume.


Yes, by ditching UTF-8.

There is also the overhead of
key exchange when establishing a secure connection:
http://stackoverflow.com/a/20306907/4038614
Something every app should do, but will increase bandwidth use.

That's not going to happen, even HTTP/2 ditched that requirement. Also, many of those countries' govts will not allow it: google how Blackberry had to give up their keys for "secure" BBM in many countries. It's not just Canada and the US spying on their citizens.

Then there is the overhead of using XML in applications
like WhatsApp, which I presume is quite popular around the
world. I'm just trying to broaden the view a bit here.

I didn't know they used XML. Googling it now, I see mention that they switched to an "internally developed protocol" at some point, so I doubt they're using XML now.

This note from the XMPP that WhatsApp and Jabber use will make
you cringe: https://tools.ietf.org/html/rfc6120#section-11.6

Haha, no wonder Jabber is dead. :) I jumped on Jabber for my own messages a decade ago, as it seemed like an open way out of that proprietary messaging mess, then I read that they're using XML and gave up on it.


On Wednesday, 1 June 2016 at 15:02:33 UTC, Wyatt wrote:

On Wednesday, 1 June 2016 at 13:57:27 UTC, Joakim wrote:
No, I explicitly said not the web in a subsequent post. The ignorance here of what 2G speeds are like is mind-boggling.
It's not hard. I think a lot of us remember when a 14.4 modem was cutting-edge.

Well, then apparently you're unaware of how bloated web pages are nowadays. It used to take me minutes to download popular web pages _back then_ at _top speed_, and those pages were a _lot_ smaller.

Codepages and incompatible encodings were terrible then, too.

Never again.

This only shows you probably don't know the difference between an encoding and a code page, which are orthogonal concepts in Unicode. It's not surprising, as Walter and many others responding show the same ignorance. I explained this repeatedly in the previous thread, but it depends on understanding the tech, and I can't spoon-feed that to everyone.

Well, when you _like_ a ludicrous encoding like UTF-8, not sure your opinion matters.
It _is_ kind of ludicrous, isn't it? But it really is the least-bad option for the most text. Sorry, bub.


I think we can do a lot better.

No. The common string-handling use case is code that is unaware which script (not language, btw) your text is in.
Lol, this may be the dumbest argument put forth yet.
This just makes it feel like you're trolling. You're not just trolling, right?


Are you trolling? Because I was just calling it like it is.

The vast majority of software is written for _one_ language, the local one. You may think otherwise because the software that sells the most and makes the most money is internationalized software like Windows or iOS, because it can be resold into many markets. But as a percentage of lines of code written, such international code is almost nothing.

I don't think anyone here even understands what a good encoding is and what it's for, which is why there's no point in debating this.
And I don't think you realise how backwards you sound to people who had to live through the character encoding hell of the past. This has been an ongoing headache for the better part of a century (it still comes up in old files, sites, and systems) and you're literally the only person I've ever seen seriously suggest we turn back now that the madness has been somewhat tamed.

No, I have never once suggested "turning back." I have suggested a new scheme that retains one technical aspect of the prior schemes, ie constant-width encoding for each language, with a single byte sufficing for most. _You and several others_, including Walter, see that and automatically translate that to, "He wants EBCDIC to come back!," as though that were the only possible single-byte encoding and largely ignoring the possibilities of the header scheme I suggested.

I could call that "trolling" by all of you, :) but I'll instead call it what it likely is, reactionary thinking, and move on.

If you have to deal with delivering the fastest possible i18n at GSM data rates, well, that's a tough problem and it sounds like you might need to do something pretty special. Turning the entire ecosystem into your special case is not the answer.

I don't think you understand: _you_ are the special case. The 5 billion people outside the US and EU are _not the special case_. Yes, they have not mattered so far, because they were too poor to buy computers. But the "computers" with the most sales these days are smartphones, and Motorola just launched their new Moto G4 in India and Samsung their new C5 and C7 in China. They didn't bother announcing release dates for these mid-range phones (well, they're high-end in those countries) in the US. That's because "computer" sales in all these non-ASCII countries now greatly outweigh those in the US.

Now, a large majority of people in those countries don't have smartphones or text each other, so a significant chunk of the minority who do buy mostly ~$100 smartphones over there can likely afford a fatter text encoding and I don't know what encodings these developing markets are commonly using now. The problem is all the rest, and those just below who cannot afford it at all, in part because the tech is not as efficient as it could be yet. Ditching UTF-8 will be one way to make it more efficient.


On Wednesday, 1 June 2016 at 16:15:15 UTC, Patrick Schluter wrote:

Indeed, Joakim's proposal is so insane it beggars belief (why not go back to Baudot encoding, it's only 5 bit, hurray, it's so much faster when used with flag semaphores).


I suspect you don't understand my proposal.

As a programmer in the European Commission translation unit, working on the probably biggest translation memory in the world for 14 years, I can attest that Unicode is a blessing. When I remember the shit we had in our documents because of the code pages before most programs could handle utf-8 or utf-16 (and before 2004 we only had 2 alphabets to take care of, Western and Greek). What Joakim does not understand, is that there are huge, huge quantities of documents that are multi-lingual.

Oh, I'm well aware of this. I just think a variable-length encoding like UTF-8 or UTF-16 is a bad design. And what you have to realize is that most strings in most software will only have one language. Anyway, the scheme I sketched out handles multiple languages: it just doesn't optimize for completely random jumbles of characters from every possible language, which is what UTF-8 is optimized for and is a ridiculous decision.

Translators of course deal nearly exclusively with at least bi-lingual documents. Any document encountered by a translator must at least be able to present the source and the target language. But even outside of that specific population, multilingual documents are very, very common.

You are likely biased by the fact that all your documents are bilingual: they're _not_ common for the vast majority of users. Even if they were, UTF-8 is as suboptimal, compared to the constant-width encoding scheme I've sketched, for bilingual or even trilingual documents as it is for a single language, so even if I were wrong about their frequency, it wouldn't matter.
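To put rough numbers on the size difference under discussion, here is a minimal sketch comparing UTF-8 with a one-byte-per-character encoding. The header scheme itself is not specified in this message, so the legacy code pages `cp1251` and `iso8859_7` are used purely as stand-ins for "constant-width, single byte"; the sample strings and the helper name `sizes` are likewise assumptions made for illustration.

```python
def sizes(text, single_byte_codec):
    """Return (UTF-8 byte count, single-byte-codec byte count) for text."""
    return len(text.encode("utf-8")), len(text.encode(single_byte_codec))


# Sample strings and stand-in codecs (assumptions, not the proposed scheme).
samples = {
    "English": ("Hello, world", "latin_1"),
    "Russian": ("Привет, мир", "cp1251"),
    "Greek": ("ΚΑΛΗΜΕΡΑ ΚΟΣΜΕ", "iso8859_7"),
}

for name, (text, codec) in samples.items():
    utf8_len, single_len = sizes(text, codec)
    print(f"{name}: UTF-8 {utf8_len} bytes, {codec} {single_len} bytes")
```

For these short Cyrillic and Greek samples UTF-8 comes out at a little under twice the size, while the ASCII sample is identical in both. Any real scheme would also have to spend some bytes identifying the language or script of each run, which this sketch does not model.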
