On Wed, Dec 30, 2009 at 5:05 PM, Jerome Leclanche <adys...@gmail.com> wrote:
> When truncating characters, we are obviously talking about truncating just
> that: characters. Truncating bytes is a behaviour implemented by |slice.

You misunderstand: I'm not talking about bytes, I'm talking about
composed and decomposed characters.

For example, 'ü' can be represented as either:

1. 00fc (LATIN SMALL LETTER U WITH DIAERESIS), or

2. 0075 (LATIN SMALL LETTER U) *followed by* 0308 (COMBINING DIAERESIS)

Option 1 is composed. Option 2 is decomposed and is actually *two
Unicode characters*, not "two bytes", so character-based slicing
can chop off the combining diaeresis. The only way to avoid this is to
have the filter do Unicode normalization to composed characters (e.g.,
normalization form NFC or NFKC) before it slices.
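
To make that concrete, here is a rough Python sketch using the standard
unicodedata module. The truncate_chars helper is a hypothetical name used
only for illustration, not an existing Django filter, and it assumes NFC
is the normalization form we would pick.

```python
import unicodedata

composed = "\u00fc"      # LATIN SMALL LETTER U WITH DIAERESIS
decomposed = "u\u0308"   # LATIN SMALL LETTER U + COMBINING DIAERESIS

# Both render the same, but they differ in length and compare unequal.
assert len(composed) == 1
assert len(decomposed) == 2
assert composed != decomposed


def truncate_chars(text, length):
    """Hypothetical helper: normalize to NFC, then slice by code point."""
    return unicodedata.normalize("NFC", text)[:length]


# Plain slicing of the decomposed form drops the combining mark.
naive = decomposed[:1]

# Normalizing first keeps the letter and its diaeresis together.
safe = truncate_chars(decomposed, 1)
```

Here naive ends up as a bare 'u', while safe is the single composed
character. Note that this only helps for sequences that have a precomposed
form; a base letter plus a combining mark with no composed equivalent
would still be two characters after NFC.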


-- 
"Bureaucrat Conrad, you are technically correct -- the best kind of correct."
