Re: Detab should be multi-byte aware?

Michel Fortin Mon, 09 Oct 2006 18:33:53 -0700

On Oct 9, 2006, at 20:34, John Gruber wrote:

Michel Fortin <[EMAIL PROTECTED]> wrote on 10/9/06 at 8:26 PM:

If anyone is interested in a fix for PHP Markdown, just change
the call to the `strlen` function within detab to a call to
`mb_strlen($line, 'utf-8')`. I'll fix this for the next
version.


Will that still work if people pass in Windows Latin 1 or Mac
Roman-encoded text? Yes, I'm too lazy to try it...

I haven't tried it inside PHP Markdown yet, but I've tested
`mb_strlen` and it seems to treat each byte of an invalid UTF-8
sequence as an individual character. So the net result is that text
in ISO Latin, Windows Latin, or Mac Roman will work fine unless it
contains sequences which are valid UTF-8. For instance, "é" in UTF-8
is seen as "√©" in Mac Roman, so if you have "√©" in a Mac
Roman-encoded text it'll be treated as only one character. I'm not
sure how high that risk is for all character combinations, but it is
obviously less problematic than the current behaviour is for UTF-8
text.
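
To make the collision concrete, here is a minimal sketch (it assumes the mbstring extension is loaded, and the variable names are only illustrative). It covers the valid UTF-8 case only; how `mb_strlen` counts invalid sequences is not asserted here.

```php
<?php
// "é" encoded as UTF-8 is the two bytes 0xC3 0xA9. Read as Mac Roman,
// the same two bytes are the two characters "√©".
$bytes = "\xC3\xA9";

$byte_count = strlen($bytes);              // counts bytes: 2
$char_count = mb_strlen($bytes, 'UTF-8');  // counts UTF-8 characters: 1

// In a detab-style calculation with 4-column tab stops, the two counts
// give different padding before the next tab stop.
$tab_width = 4;
$pad_by_bytes = $tab_width - $byte_count % $tab_width;  // 2
$pad_by_chars = $tab_width - $char_count % $tab_width;  // 3

echo "bytes: $byte_count, chars: $char_count\n";
echo "padding by bytes: $pad_by_bytes, by chars: $pad_by_chars\n";
```

So a Mac Roman text that really contains "√©" before a tab would come out one column off, which is the risk described above.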

Another solution is to omit the 'utf-8' parameter and rely on the PHP
internal encoding being the same as the input. (The internal encoding
can be set by the user using `mb_internal_encoding('utf-8')`.) Doing
that, however, implies that PHP Markdown will work with something
other than UTF-8 by default, and I'm not sure that's a good idea.
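
A small sketch of what relying on the internal encoding means in practice (again assuming mbstring is available; the variable names are illustrative). The same bytes are counted differently depending on what the user has set:

```php
<?php
$bytes = "\xC3\xA9"; // "é" encoded as UTF-8

// With the internal encoding set to UTF-8, the two bytes are one character.
mb_internal_encoding('UTF-8');
$as_utf8 = mb_strlen($bytes);

// With a single-byte internal encoding, each byte is its own character.
mb_internal_encoding('ISO-8859-1');
$as_latin1 = mb_strlen($bytes);

echo "UTF-8: $as_utf8, ISO-8859-1: $as_latin1\n";
```

The result then depends on a global setting that other code may also change, which is part of why this default looks doubtful.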

Yet another solution is a distinct configuration variable set to
UTF-8 by default.
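
As a rough sketch of that last option: the function `detab_line` and the variable `$detab_encoding` below are hypothetical names for illustration, not PHP Markdown's actual code or settings, and mbstring is assumed to be loaded.

```php
<?php
// Hypothetical configuration variable, defaulting to UTF-8.
$detab_encoding = 'UTF-8';

// Expand tabs in a single line to spaces, measuring the column in
// characters of the given encoding rather than in bytes.
function detab_line($line, $tab_width, $encoding) {
    $blocks = explode("\t", $line);
    $out = array_shift($blocks);
    foreach ($blocks as $block) {
        // Number of spaces needed to reach the next tab stop.
        $amount = $tab_width - mb_strlen($out, $encoding) % $tab_width;
        $out .= str_repeat(' ', $amount) . $block;
    }
    return $out;
}

// "é" (two bytes in UTF-8) followed by a tab and "x".
$utf8_result   = detab_line("\xC3\xA9\tx", 4, $detab_encoding);
$latin1_result = detab_line("\xC3\xA9\tx", 4, 'ISO-8859-1');
```

With the UTF-8 default the "é" counts as one column and three spaces are inserted; someone feeding in single-byte text could set the variable to their own encoding instead.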



Michel Fortin
[EMAIL PROTECTED]
http://www.michelf.com/


_______________________________________________
Markdown-Discuss mailing list
Markdown-Discuss@six.pairlist.net
http://six.pairlist.net/mailman/listinfo/markdown-discuss
