Re: [nyphp-talk] Any alternatives to mbstring for PHP+UTF-8?

Paul Houle Thu, 10 May 2007 18:15:05 -0700

Jakob Buchgraber wrote:

Hey!
I was wondering whether there are alternatives to mbstring for handling UTF-8 encoded data with PHP? I am asking, because I'd like to play around with as many "technologies" as possible before I actually start developing. I somehow also looked at the way Joomla! did it, but I don't really like their solution.

Sometimes you can process UTF-8 without doing anything special. For instance, if you want to pull some text out of a MySQL database and display it on a web page, you can pass the UTF-8 text through without using mbstring in PHP: the one thing you need to do is set the character encoding of the HTML document to UTF-8.
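
A rough sketch of the pass-through case, in case it helps. render_page() is a made-up helper name, and the text is assumed to be valid UTF-8 already (for example, straight from the database); the accented characters are written as byte escapes so the example does not depend on the encoding of the source file.

```php
<?php
// Hypothetical helper: wraps text that is ALREADY UTF-8 in a minimal page.
function render_page(string $utf8Text): string
{
    // htmlspecialchars() only rewrites ASCII characters (<, >, &, quotes).
    // Naming the charset lets it leave the multi-byte sequences untouched.
    $safe = htmlspecialchars($utf8Text, ENT_QUOTES, 'UTF-8');

    return "<!DOCTYPE html>\n"
         . "<html><head><meta charset=\"UTF-8\"><title>Demo</title></head>"
         . "<body><p>{$safe}</p></body></html>\n";
}

// Declare the encoding in the HTTP header as well as in the markup.
if (!headers_sent()) {
    header('Content-Type: text/html; charset=UTF-8');
}

// "caf\xC3\xA9" is "cafe" with an acute accent, spelled out as UTF-8 bytes.
echo render_page("caf\xC3\xA9 & cr\xC3\xA8me");
```

No mbstring call appears anywhere: the bytes go in and come out unchanged, and only the declared encoding tells the browser how to read them.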

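A big strength of UTF-8 is that it is compatible with US-ASCII: all US-ASCII characters are encoded the same way in UTF-8, and every byte of a multi-byte character has its high bit set, so it can never be mistaken for an ASCII character. This means that you can explode on ",", "\t", "\n" or a space just like you always do.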
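
Here is a small illustration of splitting on an ASCII delimiter. The sample record is invented for the example, and the non-ASCII text is written as byte escapes so the bytes are visible.

```php
<?php
// A tab-separated record whose fields contain multi-byte UTF-8 characters:
// "Zurich" with u-umlaut, "Tokyo" in kanji, "Sao Paulo" with a-tilde.
$line = "Z\xC3\xBCrich\t\xE6\x9D\xB1\xE4\xBA\xAC\tS\xC3\xA3o Paulo";

// The plain byte-oriented explode() is safe here: the tab byte (0x09)
// cannot occur inside a multi-byte sequence, whose bytes are all >= 0x80.
$fields = explode("\t", $line);

foreach ($fields as $i => $field) {
    // strlen() counts bytes, not characters, which is fine for this purpose.
    echo $i, ': ', $field, ' (', strlen($field), " bytes)\n";
}
```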

Any regex on Unicode 'characters' can be translated to a regex that works on UTF-8 bytes. This may be awkward sometimes, but it can be an efficient way to do many operations, including those that "get under the hood" of your language.
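
To make the byte-level idea concrete, here is one possible sketch: the regex "." (any one character) translated into a pattern over bytes, following the table of well-formed UTF-8 byte sequences in the Unicode standard. The names UTF8_CHAR, utf8_chars() and is_valid_utf8() are mine, not part of PHP, and the pattern is deliberately run without the /u modifier so PCRE sees raw bytes.

```php
<?php
// One well-formed UTF-8 character, expressed as byte ranges.
// Used with the /x modifier, so the whitespace in the pattern is ignored.
const UTF8_CHAR = '
      [\x00-\x7F]                         # 1 byte: US-ASCII
    | [\xC2-\xDF][\x80-\xBF]              # 2 bytes (C0/C1 would be overlong)
    | \xE0[\xA0-\xBF][\x80-\xBF]          # 3 bytes, excluding overlongs
    | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}   # 3 bytes, general case
    | \xED[\x80-\x9F][\x80-\xBF]          # 3 bytes, excluding surrogates
    | \xF0[\x90-\xBF][\x80-\xBF]{2}       # 4 bytes, excluding overlongs
    | [\xF1-\xF3][\x80-\xBF]{3}           # 4 bytes, general case
    | \xF4[\x80-\x8F][\x80-\xBF]{2}       # 4 bytes, up to U+10FFFF
';

// Split a string into characters; bytes that fit no alternative are skipped.
function utf8_chars(string $s): array
{
    preg_match_all('/' . UTF8_CHAR . '/x', $s, $m);
    return $m[0];
}

// True when the whole string is a run of well-formed characters.
function is_valid_utf8(string $s): bool
{
    return preg_match('/\A(?:' . UTF8_CHAR . ')*\z/x', $s) === 1;
}

// "a", e-acute (2 bytes) and the euro sign (3 bytes).
echo count(utf8_chars("a\xC3\xA9\xE2\x82\xAC")), "\n";
```

On very long inputs the repeated group in is_valid_utf8() may run into PCRE's backtracking limits, so treat it as an illustration rather than a production validator.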

Avoid unnecessary character conversions. If you can take UTF-8 in, process it as UTF-8, and output UTF-8, that's really the best. People who work with languages like Java, that do character conversions for you, often find they're not in control of their character conversions. Years ago I discovered that the contents of a PostgreSQL database were double-encoded: the bytes that made up the first UTF-8 encoding had been treated as ISO Latin-1 characters and re-encoded as UTF-8 a second time. If you're working with Unicode, you'll probably need to deal with problems like this from time to time.
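
For anyone who has not seen double encoding up close, this sketch reproduces it and then undoes it. The two functions are hand-rolled for illustration (the names are mine) and assume the intermediate encoding really was ISO-8859-1; in practice the culprit is sometimes Windows-1252 instead, which this code does not handle.

```php
<?php
// Treat every byte as an ISO-8859-1 character and encode it as UTF-8.
// Applying this to text that is already UTF-8 produces double encoding.
function latin1_to_utf8(string $bytes): string
{
    $out = '';
    for ($i = 0, $n = strlen($bytes); $i < $n; $i++) {
        $b = ord($bytes[$i]);
        $out .= $b < 0x80
            ? chr($b)
            : chr(0xC0 | ($b >> 6)) . chr(0x80 | ($b & 0x3F));
    }
    return $out;
}

// Inverse step: returns null if the input holds anything outside
// U+0000..U+00FF, i.e. it cannot have come from latin1_to_utf8().
function utf8_to_latin1(string $utf8): ?string
{
    $out = '';
    $n = strlen($utf8);
    for ($i = 0; $i < $n; $i++) {
        $b = ord($utf8[$i]);
        if ($b < 0x80) {
            $out .= chr($b);
            continue;
        }
        // U+0080..U+00FF always use the lead byte C2 or C3.
        if (($b !== 0xC2 && $b !== 0xC3) || $i + 1 >= $n) {
            return null;
        }
        $c = ord($utf8[++$i]);
        if (($c & 0xC0) !== 0x80) {
            return null;
        }
        $out .= chr((($b & 0x03) << 6) | ($c & 0x3F));
    }
    return $out;
}

$once  = "caf\xC3\xA9";           // correct UTF-8, 5 bytes
$twice = latin1_to_utf8($once);   // double-encoded, 7 bytes
echo bin2hex($twice), "\n";
echo bin2hex((string) utf8_to_latin1($twice)), "\n";
```

Undoing one layer is only safe when you know the data was double-encoded; run it on correct UTF-8 that happens to contain only Latin-1 characters and it will silently corrupt the text.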

The main weakness of UTF-8 is that it's a variable-length encoding. That means it's hard to pick out the N'th character of a string. mbstring has functions such as mb_substr() that let you do this, but be careful how you use them. Getting the N'th character of a UTF-8 string is an O(N) operation, so a loop that fetches every character by its index is O(N^2) over the whole string... Ouch. Efficient algorithms for UTF-8 tend to work sequentially -- and quite a few of them can be translated to string algorithms over the bytes.
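
A minimal sketch of the sequential approach, using a generator (so it needs a reasonably recent PHP). utf8_each() is a name I made up; it assumes mostly well-formed input and does no validation beyond reading the lead byte.

```php
<?php
// Yields one UTF-8 character per step in a single left-to-right pass,
// so walking the whole string is O(N) in the number of bytes.
function utf8_each(string $s): Generator
{
    $n = strlen($s);
    for ($i = 0; $i < $n; $i += $len) {
        $b = ord($s[$i]);
        // The lead byte alone tells us how long the sequence is.
        if ($b < 0x80) {
            $len = 1;           // US-ASCII
        } elseif ($b >= 0xF0) {
            $len = 4;
        } elseif ($b >= 0xE0) {
            $len = 3;
        } elseif ($b >= 0xC0) {
            $len = 2;
        } else {
            $len = 1;           // stray continuation byte: pass it through
        }
        yield substr($s, $i, $len);
    }
}

// "a", e-acute and the euro sign: 6 bytes, 3 characters.
foreach (utf8_each("a\xC3\xA9\xE2\x82\xAC") as $index => $char) {
    echo $index, ': ', bin2hex($char), "\n";
}
```

The loop never asks "where does character N start?", which is the question that costs O(N) each time with an index-based API.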

There's no substitute for understanding how Unicode and UTF-8 and related representations work -- if you work with it enough, you'll see all kinds of malformed text and you'll need to be able to deal with it.

