Re: Detab should be multi-byte aware?

2006-10-10 Thread Michel Fortin
On 10 Oct 2006, at 3:17, A. Pagaltzis wrote: * John Gruber <[EMAIL PROTECTED]> [2006-10-10 05:55]: I think it's simpler and better to just say "use UTF-8". +1 UTF-8 is in fact deliberately constructed such that the chance of arbitrary text accidentally being valid UTF-8 approaches zero wit

Re: Detab should be multi-byte aware?

2006-10-10 Thread A. Pagaltzis
* John Gruber <[EMAIL PROTECTED]> [2006-10-10 05:55]: > I think it's simpler and better to just say "use UTF-8". +1 UTF-8 is in fact deliberately constructed such that the chance of arbitrary text accidentally being valid UTF-8 approaches zero with increasing length of the text. Regards, -- Ari
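
The point about arbitrary text rarely being valid UTF-8 can be illustrated with a minimal Python sketch. The helper name `is_valid_utf8` is hypothetical and not part of Markdown.pl or PHP Markdown; it only shows that the same word passes the check when encoded as UTF-8 and fails it when encoded as ISO Latin-1.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# "résumé" with precomposed é (U+00E9), written with escapes to avoid
# any ambiguity about how this file itself is encoded.
word = "r\u00e9sum\u00e9"

utf8_bytes = word.encode("utf-8")      # é becomes the two bytes C3 A9
latin1_bytes = word.encode("latin-1")  # é becomes the single byte E9

print(is_valid_utf8(utf8_bytes))
print(is_valid_utf8(latin1_bytes))
```

In the Latin-1 case the byte E9 looks like the start of a three-byte UTF-8 sequence, but the bytes that follow are not continuation bytes, so decoding fails.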

Re: Detab should be multi-byte aware?

2006-10-09 Thread John Gruber
Michel Fortin <[EMAIL PROTECTED]> wrote on 10/9/06 at 9:33 PM: I haven't tried it inside PHP Markdown yet, but I've tested `mb_strlen` and it seems to treat any invalid UTF-8 byte sequence as individual characters. So the neat result is that text in ISO Latin, Windows Latin, or Mac Roman will w

Re: Detab should be multi-byte aware?

2006-10-09 Thread Allan Odgaard
On 10. Oct 2006, at 03:33, Michel Fortin wrote: [...] I'm not sure how high that risk is for all character combinations, but it obviously is less problematic than the current behaviour is for UTF-8. This report http://www.ifi.unizh.ch/mml/mduerst/papers/PDF/IUC11-UTF-8.pdf talks about prob

Re: Detab should be multi-byte aware?

2006-10-09 Thread Michel Fortin
On 9 Oct 2006, at 20:34, John Gruber wrote: Michel Fortin <[EMAIL PROTECTED]> wrote on 10/9/06 at 8:26 PM: If anyone is interested in a fix for PHP Markdown, just change the call to the `strlen` function within detab to a call to `mb_strlen($line, 'utf-8')`. I'll fix this for the next versio

Re: Detab should be multi-byte aware?

2006-10-09 Thread John Gruber
Michel Fortin <[EMAIL PROTECTED]> wrote on 10/9/06 at 8:26 PM: If anyone is interested in a fix for PHP Markdown, just change the call to the `strlen` function within detab to a call to `mb_strlen($line, 'utf-8')`. I'll fix this for the next version. Will that still work if people pass in Win

Re: Detab should be multi-byte aware?

2006-10-09 Thread Michel Fortin
On 9 Oct 2006, at 19:43, Allan Odgaard wrote: As you can see, expand is able to correctly convert tabs to spaces, where Markdown.pl counts the é as occupying two columns. Ah! Now I see what you mean. It makes perfect sense and is super-easy to reproduce. Thank you for that clear example.

Re: Detab should be multi-byte aware?

2006-10-09 Thread Allan Odgaard
On 10. Oct 2006, at 00:19, John Gruber wrote: [...] If Markdown.pl ever gains explicit support for text encodings, the rules will be simple: UTF-8 in, UTF-8 out, no exceptions. Or you could check the user's locale (LC_CTYPE). Though hardcoding it to UTF-8 works for me. You can also verify

Re: Detab should be multi-byte aware?

2006-10-09 Thread Allan Odgaard
On 10. Oct 2006, at 00:52, Michel Fortin wrote: [...] From your description of the problem, I believe you're not using UTF-8. No, here is an example showing the problem: % Markdown.pl <<< $'Test:\nresume\tbar\nrésumé\tbar\n' Test: resume bar résumébar
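
The misalignment in this example can be reproduced with a small Python sketch. It is not the Markdown.pl or PHP Markdown code: the function name `detab`, the `count_bytes` switch, and the tab width of 4 are assumptions made for illustration. The only difference between the two modes is whether the text before each tab is measured in UTF-8 bytes or in characters.

```python
def detab(line: str, tab_width: int = 4, count_bytes: bool = False) -> str:
    """Replace each tab with spaces up to the next multiple of tab_width.

    With count_bytes=True the text before a tab is measured in UTF-8
    bytes, which mimics a detab routine that is not multi-byte aware.
    """
    parts = line.split("\t")
    out = []
    for part in parts[:-1]:
        # Width of the text preceding this tab, in bytes or characters.
        width = len(part.encode("utf-8")) if count_bytes else len(part)
        out.append(part + " " * (tab_width - width % tab_width))
    out.append(parts[-1])  # text after the last tab is left untouched
    return "".join(out)


# Precomposed é (U+00E9) is one character but two bytes in UTF-8.
for word in ("resume", "r\u00e9sum\u00e9"):
    print(repr(detab(word + "\tbar", count_bytes=True)))
    print(repr(detab(word + "\tbar", count_bytes=False)))
```

Counting characters pads both words to the same column; counting bytes treats the accented word as eight columns wide and pushes `bar` two columns further right.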

Re: Detab should be multi-byte aware?

2006-10-09 Thread Michel Fortin
On 9 Oct 2006, at 17:02, Allan Odgaard wrote: As for #2, Markdown doesn’t know the encoding of the source document, so that would mean it can’t really be aware of things such as UTF-8 mb sequences, OTOH if it changes my pre-formatted text, I would like to have it do the right thing. Cur

Re: Detab should be multi-byte aware?

2006-10-09 Thread John Gruber
Allan Odgaard <[EMAIL PROTECTED]> wrote on 10/9/06 at 11:02 PM: This raises two questions: 1. Should Markdown convert tabs to spaces in pre-formatted text? 2. If yes, should Markdown be aware of multi-byte characters? I’d say yes to #1 -- Markdown converts to (X)HTML which does not define

Detab should be multi-byte aware?

2006-10-09 Thread Allan Odgaard
A user has table-formatted data which contains accents and finds it problematic that his tables misalign after going through Markdown. This is because he aligned them using tab characters; Markdown converts these to spaces even in pre-formatted text, and Markdown is not multi-byte