Jakob Buchgraber wrote:
Hey!

I was wondering whether there are alternatives to mbstring for handling UTF-8 encoded data with PHP? I am asking, because I'd like to play around with as many "technologies" as possible before I actually start developing. I somehow also looked at the way Joomla! did it, but I don't really like their solution.

Sometimes you can process UTF-8 without doing anything special. For instance, if you want to pull some text out of a MySQL database and display it on a web page, you can pass the UTF-8 text through without using mbstring in PHP: the one thing you need to do is set the character encoding of the HTML document to UTF-8.
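As a rough sketch of that pass-through case: the text below stands in for whatever came out of the database, and render_page() is a made-up helper name, not part of any library. The only UTF-8-specific steps are declaring the charset in the HTTP header and in the document itself.

```php
<?php
// Hypothetical helper: wraps text that is already UTF-8 in a minimal HTML page.
function render_page(string $utf8Text): string
{
    // htmlspecialchars() escapes only a handful of ASCII characters;
    // with the charset given as UTF-8, multibyte sequences pass through unchanged.
    $safe = htmlspecialchars($utf8Text, ENT_QUOTES, 'UTF-8');

    return "<!DOCTYPE html>\n"
         . "<html>\n"
         . "<head><meta charset=\"utf-8\"></head>\n"
         . "<body><p>" . $safe . "</p></body>\n"
         . "</html>\n";
}

// Tell the browser how to decode the bytes that follow.
if (!headers_sent()) {
    header('Content-Type: text/html; charset=utf-8');
}

// Stand-in for a value read from the database: "Grüße aus Köln" as UTF-8 bytes.
echo render_page("Gr\xC3\xBC\xC3\x9Fe aus K\xC3\xB6ln");
```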

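A big strength of UTF-8 is its compatibility with US-ASCII: every US-ASCII character is encoded as the same single byte in UTF-8, and every byte of a multibyte sequence has its high bit set, so it can never be mistaken for an ASCII character. This means that you can explode on ",", "\t", "\n" or a space just like you always do.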
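For example, splitting a tab-separated record with plain explode() leaves the multibyte fields intact. The sample data here is made up for illustration.

```php
<?php
// A tab-separated record written as raw UTF-8 bytes: Köln, 東京, São Paulo.
$line = "K\xC3\xB6ln\t\xE6\x9D\xB1\xE4\xBA\xAC\tS\xC3\xA3o Paulo";

// explode() compares bytes. The tab byte (0x09) never occurs inside a
// multibyte UTF-8 sequence, so no character is ever cut in half.
$fields = explode("\t", $line);

foreach ($fields as $field) {
    echo $field, "\n";
}
```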

Any regex on Unicode 'characters' can be translated to a regex that works on UTF-8 bytes. This may be awkward sometimes, but it can be an efficient way to do many operations, including those that "get under the hood" of your language.
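Here is one way such a translation might look, using only the preg functions without the /u modifier. The names UTF8_CHAR, LATIN1_SUPPLEMENT, utf8_length() and count_latin1_supplement() are invented for this sketch. The first pattern spells out "one well-formed UTF-8 character" in terms of bytes; the second is the byte-level equivalent of the character range U+00C0 to U+00FF.

```php
<?php
// One well-formed UTF-8 character, written as alternatives over raw bytes.
const UTF8_CHAR = '(?:
      [\x00-\x7F]                        # 1 byte: US-ASCII
    | [\xC2-\xDF][\x80-\xBF]             # 2 bytes
    | \xE0[\xA0-\xBF][\x80-\xBF]         # 3 bytes, excluding overlong forms
    | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # 3 bytes
    | \xED[\x80-\x9F][\x80-\xBF]         # 3 bytes, excluding surrogates
    | \xF0[\x90-\xBF][\x80-\xBF]{2}      # 4 bytes, excluding overlong forms
    | [\xF1-\xF3][\x80-\xBF]{3}          # 4 bytes
    | \xF4[\x80-\x8F][\x80-\xBF]{2}      # 4 bytes, up to U+10FFFF
)';

// U+00C0..U+00FF all encode as the lead byte 0xC3 plus one continuation byte.
const LATIN1_SUPPLEMENT = '\xC3[\x80-\xBF]';

// Count characters by matching the byte pattern repeatedly.
function utf8_length(string $s): int
{
    return (int) preg_match_all('/' . UTF8_CHAR . '/x', $s);
}

// Count characters in the range U+00C0..U+00FF.
function count_latin1_supplement(string $s): int
{
    return (int) preg_match_all('/' . LATIN1_SUPPLEMENT . '/', $s);
}

$word = "h\xC3\xA9llo";                     // "héllo": 6 bytes, 5 characters
echo strlen($word), " bytes, ", utf8_length($word), " characters\n";
```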

Avoid unnecessary character conversions. If you can take UTF-8 in, process it as UTF-8, and output UTF-8, that's really the best. People who work with languages like Java, which do character conversions for you, often find they're not in control of their character conversions. Years ago I discovered that the contents of a PostgreSQL database were double-encoded: the bytes that made up the first UTF-8 encoding had been treated as ISO-8859-1 (Latin-1) characters and encoded as UTF-8 a second time. If you're working with Unicode, you'll probably need to deal with problems like this from time to time.
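To make the double-encoding problem concrete, here is a small sketch that reproduces it and then reverses it with nothing but byte operations. Both function names are made up for this example, and the repair assumes the damage really was a Latin-1 to UTF-8 conversion applied to every character; running it on text that was encoded correctly would corrupt any characters in the U+0080 to U+00FF range.

```php
<?php
// Treat every byte as a Latin-1 character and encode it as UTF-8.
// Applying this to text that is already UTF-8 produces double-encoded text.
function latin1_to_utf8(string $bytes): string
{
    $out = '';
    for ($i = 0, $n = strlen($bytes); $i < $n; $i++) {
        $b = ord($bytes[$i]);
        if ($b < 0x80) {
            $out .= chr($b);                                      // ASCII is unchanged
        } else {
            $out .= chr(0xC0 | ($b >> 6)) . chr(0x80 | ($b & 0x3F)); // two-byte form
        }
    }
    return $out;
}

// Reverse one round of the conversion above: collapse each two-byte
// sequence for U+0080..U+00FF back into the single byte it came from.
function undo_double_encoding(string $s): string
{
    $fixed = preg_replace_callback(
        '/([\xC2\xC3])([\x80-\xBF])/',
        function (array $m): string {
            return chr(((ord($m[1]) & 0x03) << 6) | (ord($m[2]) & 0x3F));
        },
        $s
    );
    return $fixed ?? $s;
}

$good   = "caf\xC3\xA9";             // "café" as UTF-8
$double = latin1_to_utf8($good);     // the mistake: shows up as "cafÃ©"
$fixed  = undo_double_encoding($double);

echo bin2hex($good), "\n", bin2hex($double), "\n", bin2hex($fixed), "\n";
```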

The main weakness of UTF-8 is that it's a variable-length encoding. That means it's hard to pick out the N'th character of a string. mbstring has a function that lets you do this (mb_substr()), but be careful how you use it. Getting the N'th character of a UTF-8 string is an O(N) operation, so walking the whole string by asking for character 0, then 1, then 2 and so on is O(N^2)... Ouch. Efficient algorithms for UTF-8 tend to work sequentially -- and quite a few of them can be translated to string algorithms over the bytes.
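A sequential walk might look something like the sketch below. utf8_chars() is a made-up name, and the code assumes the input is well-formed UTF-8: it reads the length of each character from its lead byte and moves on, so the whole string is covered in a single O(N) pass.

```php
<?php
// Yield the characters of a UTF-8 string one at a time, left to right.
// Assumes well-formed input; a stray continuation byte is emitted on its own.
function utf8_chars(string $s): Generator
{
    $n = strlen($s);
    for ($i = 0; $i < $n; $i += $len) {
        $b = ord($s[$i]);
        if ($b < 0x80) {
            $len = 1;           // US-ASCII
        } elseif ($b >= 0xF0) {
            $len = 4;           // lead byte of a 4-byte sequence
        } elseif ($b >= 0xE0) {
            $len = 3;           // lead byte of a 3-byte sequence
        } elseif ($b >= 0xC0) {
            $len = 2;           // lead byte of a 2-byte sequence
        } else {
            $len = 1;           // continuation byte out of place
        }
        yield substr($s, $i, $len);
    }
}

// "a", "é", "東" in one pass over the bytes.
foreach (utf8_chars("a\xC3\xA9\xE6\x9D\xB1") as $index => $char) {
    echo $index, ": ", bin2hex($char), "\n";
}
```

Compare this with calling a "give me character N" function inside a loop: each call has to start counting from the beginning of the string again.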

There's no substitute for understanding how Unicode and UTF-8 and related representations work -- if you work with it enough, you'll see all kinds of malformed text and you'll need to be able to deal with it.
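One small tool for spotting malformed text without mbstring: the PCRE functions refuse to run a pattern with the /u modifier against a subject that isn't valid UTF-8, so an empty pattern can serve as a validity check. is_valid_utf8() is just a name picked for this sketch, and what you do with text that fails the check is up to you.

```php
<?php
// preg_match() returns 1 when the empty pattern matches and false when
// the subject is rejected as invalid UTF-8.
function is_valid_utf8(string $s): bool
{
    return preg_match('//u', $s) === 1;
}

var_dump(is_valid_utf8("h\xC3\xA9llo"));   // well-formed "héllo"
var_dump(is_valid_utf8("caf\xE9"));        // Latin-1 "café": 0xE9 with no continuation bytes
var_dump(is_valid_utf8("\xC3\x28"));       // lead byte followed by a non-continuation byte
```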

_______________________________________________
New York PHP Community Talk Mailing List
http://lists.nyphp.org/mailman/listinfo/talk
