Re: UTF-8 question from Dive into Python 3

2011-01-20 Thread jmfauth
On Jan 19, 11:33 pm, Terry Reedy tjre...@udel.edu wrote:
 On 1/19/2011 1:02 PM, Tim Harig wrote:

  Right, but I only have to do that once.  After that, I can directly address
  any piece of the stream that I choose.  If I leave the information as a
  simple UTF-8 stream, I would have to walk the stream again each time,
  walking through the first byte of each character from the beginning to
  make sure that I was only counting multibyte characters once, until I found
  the character that I actually wanted.  Converting to a fixed byte
  representation (UTF-32/UCS-4) or separating all of the bytes for each
  UTF-8 character into 6-byte containers both make it possible to simply
  index the letters by a constant size.  You will note that Python does the former.

 The idea of using a custom fixed-width padded version of UTF-8 streams
 was initially shocking to me, but I can imagine that there are
 specialized applications, which slice-and-dice uninterpreted segments,
 for which that is appropriate. However, it is not germane to the folly
 of prefixing standard UTF-8 streams with a 3-byte magic number,
 mislabelled a 'byte-order mark', thus making them non-standard.



Unicode Book, 5.2.0, Chapter 2, Section 14, Page 51 - Paragraph
*Unicode Signature*.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: UTF-8 question from Dive into Python 3

2011-01-19 Thread Tim Harig
On 2011-01-19, Tim Roberts t...@probo.com wrote:
 Tim Harig user...@ilthio.net wrote:
On 2011-01-17, carlo syseng...@gmail.com wrote:

 2- If that were true, can you point me to some documentation about the
 math that, as Mark says, demonstrates this?

It is true because UTF-8 is essentially an 8-bit encoding that moves on
to the next byte once it exhausts the addressable space of the current
byte.  Since the bytes are accessed and assessed sequentially, they must
be in big-endian order.

 You were doing excellently up to that last phrase.  Endianness only applies
 when you treat a series of bytes as a larger entity.  That doesn't apply to
 UTF-8.  None of the bytes is more significant than any other, so by
 definition it is neither big-endian nor little-endian.

It depends how you process it and it doesn't generally make much
difference in Python.  Accessing UTF-8 data from C can be much trickier
if you use a multibyte type to store the data.  In that case, if you happen
to be on a little-endian architecture, it may be necessary to remember
that the data is not in the order that your processor expects it to be
for numeric operations and comparisons.  That is why the FAQ I linked to
says yes to the fact that you can consider UTF-8 to always be in big-endian
order.  Essentially all byte based data is big-endian.


Re: UTF-8 question from Dive into Python 3

2011-01-19 Thread Antoine Pitrou
On Wed, 19 Jan 2011 11:34:53 + (UTC)
Tim Harig user...@ilthio.net wrote:
 That is why the FAQ I linked to
 says yes to the fact that you can consider UTF-8 to always be in big-endian
 order.

It certainly doesn't. Read better.

 Essentially all byte based data is big-endian.

This is pure nonsense.




Re: UTF-8 question from Dive into Python 3

2011-01-19 Thread Tim Harig
Considering your post contained no information or evidence for your
negations, I shouldn't even bother responding.  I will bite once.
Hopefully next time your arguments will contain some pith.

On 2011-01-19, Antoine Pitrou solip...@pitrou.net wrote:
 On Wed, 19 Jan 2011 11:34:53 + (UTC)
 Tim Harig user...@ilthio.net wrote:
 That is why the FAQ I linked to
 says yes to the fact that you can consider UTF-8 to always be in big-endian
 order.

 It certainly doesn't. Read better.

- Q: Can a UTF-8 data stream contain the BOM character (in UTF-8 form)? If
- yes, then can I still assume the remaining UTF-8 bytes are in big-endian
^^
- order?
  ^^
- 
- A: Yes, UTF-8 can contain a BOM. However, it makes no difference as
 ^^^ 
- to the endianness of the byte stream. UTF-8 always has the same byte
    ^^
- order. An initial BOM is only used as a signature -- an indication that
  ^^
- an otherwise unmarked text file is in UTF-8. Note that some recipients of
- UTF-8 encoded data do not expect a BOM. Where UTF-8 is used transparently
- in 8-bit environments, the use of a BOM will interfere with any protocol
- or file format that expects specific ASCII characters at the beginning,
- such as the use of #! at the beginning of Unix shell scripts.

The question that was not addressed was whether you can consider UTF-8
to be little endian.  I pointed out why you cannot always make that
assumption in my previous post.

UTF-8 has no apparent endianness if you only store it as a byte stream.
It does however have a byte order.  If you store it using multibytes
(six bytes for all UTF-8 possibilities), which is useful if you want
to have one storage container for each letter as opposed to one for
each byte(1), the bytes will still have the same order but you have
interrupted its sole existence as a byte stream and have returned it
to the underlying multibyte-oriented representation.  If you attempt
any numeric or binary operations on what is now a multibyte sequence,
the processor will interpret the data using its own endian rules.

If your processor is big-endian, then you don't have any problems.
The processor will interpret the data in the order that it is stored.
If your processor is little endian, then it will effectively change the
order of the bytes for its own evaluation.

So, you can always assume big-endian and things will work out correctly,
while you cannot always make the little-endian assumption
without potential issues.  The same holds true for any byte stream data.
That is why I say that byte streams are essentially big endian.  It is
all a matter of how you look at it.

I prefer to look at all data as endian even if it doesn't create
endian issues, because it forces me to consider any endian issues that
might arise.  If none do, I haven't really lost anything.  If you simply
assume that any byte sequence cannot have endian issues, you ignore the
possibility that such issues might arise.  When an issue like the
above does, you end up with a potential bug.

(1) For Unicode it is probably better to convert the characters to
UTF-32/UCS-4 for internal processing; but, creating a container large
enough to hold any length of UTF-8 character will work.


Re: UTF-8 question from Dive into Python 3

2011-01-19 Thread Antoine Pitrou
On Wed, 19 Jan 2011 14:00:13 + (UTC)
Tim Harig user...@ilthio.net wrote:
 
 - Q: Can a UTF-8 data stream contain the BOM character (in UTF-8 form)? If
 - yes, then can I still assume the remaining UTF-8 bytes are in big-endian
 ^^
 - order?
   ^^
 - 
 - A: Yes, UTF-8 can contain a BOM. However, it makes no difference as
  ^^^ 
 - to the endianness of the byte stream. UTF-8 always has the same byte
 ^^
 - order.
   ^^

Which certainly doesn't mean that byte order can be called big
endian for any recognized definition of the latter. Similarly, ASCII
text has its own order which certainly can't be characterized as either
little endian or big endian.

 UTF-8 has no apparent endianness if you only store it as a byte stream.
 It does however have a byte order.  If you store it using multibytes
 (six bytes for all UTF-8 possibilities), which is useful if you want
 to have one storage container for each letter as opposed to one for
 each byte(1)

That's a ridiculous proposition. Why would you waste so much space?
UTF-8 exists *precisely* so that you can save space with most scripts.
If you are ready to use 4+ bytes per character, just use UTF-32 which
has much nicer properties.

Bottom line: you are not describing UTF-8, only your own foolish
interpretation of it. UTF-8 does not have any endianness since it is a
byte stream and does not care about machine words.

Antoine.




Re: UTF-8 question from Dive into Python 3

2011-01-19 Thread Adam Skutt
On Jan 19, 9:00 am, Tim Harig user...@ilthio.net wrote:

 So, you can always assume big-endian and things will work out correctly,
 while you cannot always make the little-endian assumption
 without potential issues.  The same holds true for any byte stream data.

You need to spend some serious time programming a serial port or other
byte/bit-stream oriented interface, and then you'll realize the folly
of your statement.

 That is why I say that byte streams are essentially big endian.  It is
 all a matter of how you look at it.

It is nothing of the sort.  Some byte streams are, in fact, little
endian: when the bytes are combined into larger objects, the least-
significant byte in the object comes first.  A lot of industrial/
embedded stuff has byte streams with the LSB leading in the sequence; CAN
comes to mind as an example.
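The LSB-first point can be sketched in a few lines (a hypothetical payload of my own invention, in the spirit of CAN-style wire formats):

```python
import struct

# Hypothetical example: a 16-bit field transmitted least-significant
# byte first, as is common in CAN and other embedded wire formats.
payload = b'\x34\x12'

# '<' selects little-endian interpretation, 'H' an unsigned 16-bit int.
(value,) = struct.unpack('<H', payload)
print(hex(value))  # 0x1234
```

Interpreting the same bytes big-endian (`'>H'`) would yield 0x3412 instead, which is exactly why the stream's own specification must tell you the order.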

The only way to know is for the standard describing the stream to tell
you what to do.


 I prefer to look at all data as endian even if it doesn't create
 endian issues because it forces me to consider any endian issues that
 might arise.  If none do, I haven't really lost anything.  
 If you simply assume that any byte sequence cannot have endian issues
 you ignore the possibility that such issues might arise.

No, you must assume nothing unless you're told how to combine the
bytes within a sequence into a larger element.  Plus, not all byte
streams support such operations!  Some byte streams really are just a
sequence of bytes and the bytes within the stream cannot be
meaningfully combined into larger data types. If I give you a series
of 8-bit (so 1 byte) samples from an analog-to-digital converter, tell
me how to combine them into a 16, 32, or 64-bit integer.  You cannot
do it without altering the meaning of the samples; it is a completely
nonsensical operation.

Adam


Re: UTF-8 question from Dive into Python 3

2011-01-19 Thread Tim Harig
On 2011-01-19, Adam Skutt ask...@gmail.com wrote:
 On Jan 19, 9:00 am, Tim Harig user...@ilthio.net wrote:
 That is why I say that byte streams are essentially big endian.  It is
 all a matter of how you look at it.

 It is nothing of the sort.  Some byte streams are, in fact, little
 endian: when the bytes are combined into larger objects, the least-
 significant byte in the object comes first.  A lot of industrial/
 embedded stuff has byte streams with the LSB leading in the sequence; CAN
 comes to mind as an example.

You are correct.  Point well made.


Re: UTF-8 question from Dive into Python 3

2011-01-19 Thread Tim Harig
On 2011-01-19, Antoine Pitrou solip...@pitrou.net wrote:
 On Wed, 19 Jan 2011 14:00:13 + (UTC)
 Tim Harig user...@ilthio.net wrote:
 UTF-8 has no apparent endianess if you only store it as a byte stream.
 It does however have a byte order.  If you store it using multibytes
 (six bytes for all UTF-8 possibilites) , which is useful if you want
 to have one storage container for each letter as opposed to one for
 each byte(1)

 That's a ridiculous proposition. Why would you waste so much space?

Space is only one tradeoff.  There are many others to consider.  I have
created data structures with much higher overhead than that because
they happen to make the problem easier and significantly faster for the
operations that I am performing on the data.

For many operations, it is just much faster and simpler to use a single
character-based container as opposed to having to process an entire byte
stream to determine individual letters from the bytes or to having
adaptive-size containers to store the data.

 UTF-8 exists *precisely* so that you can save space with most scripts.

UTF-8 has many reasons for existing.  One of the biggest is that it
is compatible with tools that were designed to process ASCII and other
8-bit encodings.
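This compatibility is easy to demonstrate (a small illustration of my own, not from the thread):

```python
# ASCII is a strict subset of UTF-8: pure ASCII text encodes to the
# identical byte values, so byte-oriented ASCII-era tools pass UTF-8
# through unchanged.
text = "hello, world"
assert text.encode('ascii') == text.encode('utf-8')

# Non-ASCII characters encode to bytes that are all >= 0x80, so they
# can never collide with ASCII delimiters such as '\n', '/', or NUL.
assert all(b >= 0x80 for b in "é".encode('utf-8'))
print("ASCII bytes unchanged; multibyte sequences stay out of ASCII range")
```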

 If you are ready to use 4+ bytes per character, just use UTF-32 which
 has much nicer properties.

I already mentioned UTF-32/UCS-4 as a probable alternative; but, I might
not want to have to worry about converting the encodings back and forth
before and after processing them.  That said, and more importantly, many
variable-length byte streams may not have alternate representations as
Unicode does.


Re: UTF-8 question from Dive into Python 3

2011-01-19 Thread Antoine Pitrou
On Wed, 19 Jan 2011 16:03:11 + (UTC)
Tim Harig user...@ilthio.net wrote:
 
 For many operations, it is just much faster and simpler to use a single
 character-based container as opposed to having to process an entire byte
 stream to determine individual letters from the bytes or to having
 adaptive-size containers to store the data.

You *have* to process the entire byte stream in order to determine
boundaries of individual letters from the bytes if you want to use a
character based container, regardless of the exact representation.
Once you do that it shouldn't be very costly to compute the actual code
points. So, "much faster" sounds a bit dubious to me; especially if you
factor in the cost of memory allocation, and the fact that a larger
container will fit less easily in a data cache.

 That said, and more importantly, many
 variable-length byte streams may not have alternate representations as
 Unicode does.

This whole thread is about UTF-8 (see title) so I'm not sure what kind
of relevance this is supposed to have.




Re: UTF-8 question from Dive into Python 3

2011-01-19 Thread Tim Harig
On 2011-01-19, Antoine Pitrou solip...@pitrou.net wrote:
 On Wed, 19 Jan 2011 16:03:11 + (UTC)
 Tim Harig user...@ilthio.net wrote:
 
 For many operations, it is just much faster and simpler to use a single
 character-based container as opposed to having to process an entire byte
 stream to determine individual letters from the bytes or to having
 adaptive-size containers to store the data.

 You *have* to process the entire byte stream in order to determine
 boundaries of individual letters from the bytes if you want to use a
 character based container, regardless of the exact representation.

Right, but I only have to do that once.  After that, I can directly address
any piece of the stream that I choose.  If I leave the information as a
simple UTF-8 stream, I would have to walk the stream again each time,
walking through the first byte of each character from the beginning to
make sure that I was only counting multibyte characters once, until I found
the character that I actually wanted.  Converting to a fixed byte
representation (UTF-32/UCS-4) or separating all of the bytes for each
UTF-8 character into 6-byte containers both make it possible to simply
index the letters by a constant size.  You will note that Python does the former.
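The trade-off described above can be sketched as follows (my own illustration, not from the thread): finding the n-th character in raw UTF-8 means walking the lead bytes, while decoding once to Python's fixed-width str gives constant-time indexing afterwards.

```python
# 10 characters, 12 bytes ('ï' and 'é' each take two bytes in UTF-8).
raw = "naïve café".encode('utf-8')

def nth_char_offset(buf, n):
    """Byte offset of the n-th character: an O(n) walk over lead bytes."""
    count = -1
    for i, b in enumerate(buf):
        if b & 0xC0 != 0x80:  # not a 10xxxxxx continuation byte
            count += 1
            if count == n:
                return i
    raise IndexError(n)

print(nth_char_offset(raw, 9))  # byte offset of the final 'é': 10

# Decoding pays the walk exactly once; afterwards indexing is O(1).
text = raw.decode('utf-8')
print(text[9])  # é
```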

UTF-32/UCS-4 conversion is definitely superior if you are actually
doing any major processing; but it adds the complexity and overhead of
requiring the bit twiddling to make the conversions (once in, once again
out).  Some programs don't really care enough about what the data actually
contains to make it worthwhile.  They just want to be able to use the
characters as black boxes.

 Once you do that it shouldn't be very costly to compute the actual code
 points. So, "much faster" sounds a bit dubious to me; especially if you

You could, I suppose, keep a separate list of pointers to each letter so
that you could use the pointer list for indexing, or keep a list of the
character sizes so that you can add them and calculate the variable-width
index; but that adds overhead as well.
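That offset-table idea can be sketched like this (my own illustration, assuming the buffer is valid UTF-8):

```python
# One pass records the byte offset of every lead byte; afterwards any
# character is reachable in O(1) without decoding the whole stream.
raw = "πython".encode('utf-8')

offsets = [i for i, b in enumerate(raw) if b & 0xC0 != 0x80]
offsets.append(len(raw))  # sentinel so the last character can be sliced

def char_at(buf, table, n):
    """Return the n-th character by slicing between recorded offsets."""
    return buf[table[n]:table[n + 1]].decode('utf-8')

print(char_at(raw, offsets, 0))  # π
print(char_at(raw, offsets, 1))  # y
```

The overhead is, as the post says, the extra memory for the table: one entry per character on top of the raw bytes.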


Re: UTF-8 question from Dive into Python 3

2011-01-19 Thread Antoine Pitrou
On Wed, 19 Jan 2011 18:02:22 + (UTC)
Tim Harig user...@ilthio.net wrote:
 On 2011-01-19, Antoine Pitrou solip...@pitrou.net wrote:
  On Wed, 19 Jan 2011 16:03:11 + (UTC)
  Tim Harig user...@ilthio.net wrote:
  
  For many operations, it is just much faster and simpler to use a single
  character-based container as opposed to having to process an entire byte
  stream to determine individual letters from the bytes or to having
  adaptive-size containers to store the data.
 
  You *have* to process the entire byte stream in order to determine
  boundaries of individual letters from the bytes if you want to use a
  character based container, regardless of the exact representation.
 
 Right, but I only have to do that once.

You only have to decode once as well.

 If I leave the information as a
 simple UTF-8 stream,

That's not what we are talking about. We are talking about the supposed
benefits of your 6-byte representation scheme versus proper decoding
into fixed width code points.

 UTF-32/UCS-4 conversion is definitely superior if you are actually
 doing any major processing; but it adds the complexity and overhead of
 requiring the bit twiddling to make the conversions (once in, once again out).

Bit twiddling is not something processors are particularly bad at.
Actually, modern processors are much better at arithmetic and logic
than at recovering from mispredicted branches, which seems to suggest
that discovering boundaries probably eats most of the CPU cycles.

 Converting to a fixed byte
 representation (UTF-32/UCS-4) or separating all of the bytes for each
 UTF-8 character into 6-byte containers both make it possible to simply
 index the letters by a constant size.  You will note that Python does the
 former.

Indeed, Python chose the wise option. Actually, I'd be curious of any
real-world software which successfully chose your proposed approach.




Re: UTF-8 question from Dive into Python 3

2011-01-19 Thread Tim Harig
On 2011-01-19, Antoine Pitrou solip...@pitrou.net wrote:
 On Wed, 19 Jan 2011 18:02:22 + (UTC)
 Tim Harig user...@ilthio.net wrote:
 Converting to a fixed byte
 representation (UTF-32/UCS-4) or separating all of the bytes for each
 UTF-8 character into 6-byte containers both make it possible to simply
 index the letters by a constant size.  You will note that Python does the
 former.

 Indeed, Python chose the wise option. Actually, I'd be curious of any
 real-world software which successfully chose your proposed approach.

The point is basically the same.  I created an example because it
was simpler to follow for demonstration purposes than an actual UTF-8
conversion to any official multibyte format.  You obviously have no
other purpose than to be contrary, so we ended up following tangents.

As soon as you start to convert to a multibyte format, the endian issues
occur.  For UTF-8 on big-endian hardware, this is anticlimactic because
all of the bits are already stored in proper order.  Little-endian systems
will probably convert to a native endian format.  If you choose
to ignore that, that is your prerogative.  Have a nice day.


Re: UTF-8 question from Dive into Python 3

2011-01-19 Thread Antoine Pitrou
On Wed, 19 Jan 2011 19:18:49 + (UTC)
Tim Harig user...@ilthio.net wrote:
 On 2011-01-19, Antoine Pitrou solip...@pitrou.net wrote:
  On Wed, 19 Jan 2011 18:02:22 + (UTC)
  Tim Harig user...@ilthio.net wrote:
  Converting to a fixed byte
  representation (UTF-32/UCS-4) or separating all of the bytes for each
  UTF-8 character into 6-byte containers both make it possible to simply
  index the letters by a constant size.  You will note that Python does the
  former.
 
  Indeed, Python chose the wise option. Actually, I'd be curious of any
  real-world software which successfully chose your proposed approach.
 
 The point is basically the same.  I created an example because it
 was simpler to follow for demonstration purposes than an actual UTF-8
 conversion to any official multibyte format.  You obviously have no
 other purpose than to be contrary [...]

Right. You were the one who jumped in and tried to lecture everyone on
how UTF-8 was big-endian, and now you are abandoning the one esoteric
argument you found in support of that.

 As soon as you start to convert to a multibyte format the endian issues
 occur.

Ok. Good luck with your endian issues which don't exist.




Re: UTF-8 question from Dive into Python 3

2011-01-19 Thread Terry Reedy

On 1/19/2011 1:02 PM, Tim Harig wrote:


Right, but I only have to do that once.  After that, I can directly address
any piece of the stream that I choose.  If I leave the information as a
simple UTF-8 stream, I would have to walk the stream again each time,
walking through the first byte of each character from the beginning to
make sure that I was only counting multibyte characters once, until I found
the character that I actually wanted.  Converting to a fixed byte
representation (UTF-32/UCS-4) or separating all of the bytes for each
UTF-8 character into 6-byte containers both make it possible to simply
index the letters by a constant size.  You will note that Python does the former.


The idea of using a custom fixed-width padded version of UTF-8 streams
was initially shocking to me, but I can imagine that there are
specialized applications, which slice-and-dice uninterpreted segments,
for which that is appropriate. However, it is not germane to the folly
of prefixing standard UTF-8 streams with a 3-byte magic number,
mislabelled a 'byte-order mark', thus making them non-standard.


--
Terry Jan Reedy



Re: UTF-8 question from Dive into Python 3

2011-01-18 Thread Raymond Hettinger
On Jan 17, 2:19 pm, carlo syseng...@gmail.com wrote:
 Hi,
 recently I had to study *seriously* Unicode and encodings for one
 project in Python, but I was left with a couple of doubts that arose
 after reading the Unicode chapter of the Dive into Python 3 book by
 Mark Pilgrim.

 1- Mark says:
 Also (and you’ll have to trust me on this, because I’m not going to
 show you the math), due to the exact nature of the bit twiddling,
 there are no byte-ordering issues. A document encoded in UTF-8 uses
 the exact same stream of bytes on any computer.
  . . .
 2- If that were true, can you point me to some documentation about the
 math that, as Mark says, demonstrates this?

I believe Mark was referring to the bit-twiddling described in
the Design section at http://en.wikipedia.org/wiki/UTF-8.
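That bit twiddling can be sketched in a few lines (a simplified illustration of my own; a real encoder also rejects surrogates, overlong forms, and code points above U+10FFFF):

```python
def utf8_encode(cp):
    """Sketch of the UTF-8 bit twiddling for code points up to U+FFFF."""
    if cp < 0x80:                        # 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:                       # 110xxxxx 10xxxxxx
        return bytes([0xC0 | cp >> 6,
                      0x80 | cp & 0x3F])
    # 1110xxxx 10xxxxxx 10xxxxxx
    return bytes([0xE0 | cp >> 12,
                  0x80 | (cp >> 6) & 0x3F,
                  0x80 | cp & 0x3F])

# The lead byte's high bits fix the sequence length and continuation
# bytes always start with 10, so byte order is fixed by construction.
assert utf8_encode(ord('é')) == 'é'.encode('utf-8')
assert utf8_encode(ord('€')) == '€'.encode('utf-8')
```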

Raymond


Re: UTF-8 question from Dive into Python 3

2011-01-18 Thread Tim Roberts
Tim Harig user...@ilthio.net wrote:
On 2011-01-17, carlo syseng...@gmail.com wrote:

 2- If that were true, can you point me to some documentation about the
 math that, as Mark says, demonstrates this?

It is true because UTF-8 is essentially an 8-bit encoding that moves on
to the next byte once it exhausts the addressable space of the current
byte.  Since the bytes are accessed and assessed sequentially, they must
be in big-endian order.

You were doing excellently up to that last phrase.  Endianness only applies
when you treat a series of bytes as a larger entity.  That doesn't apply to
UTF-8.  None of the bytes is more significant than any other, so by
definition it is neither big-endian nor little-endian.
-- 
Tim Roberts, t...@probo.com
Providenza  Boekelheide, Inc.


Re: UTF-8 question from Dive into Python 3

2011-01-17 Thread Alexander Kapps

On 17.01.2011 23:19, carlo wrote:


Is it true that UTF-8 does not have any big-endian/little-endian issue
because of its encoding method? And if it is true, why does Mark (and
everyone else) write about UTF-8 with and without a BOM some chapters
later? What would be the BOM's purpose then?


Can't answer your other questions, but the UTF-8 BOM is simply a
marker saying "This is a UTF-8 text file, not an ASCII text file".


If I'm not wrong, this was a Microsoft invention and surely one of 
their brightest ideas. I really wish that this had been done for 
ANSI some decades ago. Determining the encoding of a text file is 
hard to impossible because such a mark was never introduced.



Re: UTF-8 question from Dive into Python 3

2011-01-17 Thread Tim Harig
On 2011-01-17, carlo syseng...@gmail.com wrote:
 Is it true that UTF-8 does not have any big-endian/little-endian issue
 because of its encoding method? And if it is true, why does Mark (and
 everyone else) write about UTF-8 with and without a BOM some chapters
 later? What would be the BOM's purpose then?

Yes, it is true.  The BOM simply identifies the encoding as UTF-8:

http://unicode.org/faq/utf_bom.html#bom5
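In Python this can be checked directly (a small illustration using the stdlib codecs module):

```python
import codecs

# The UTF-8 "BOM" is just the fixed three-byte signature EF BB BF;
# it never changes with endianness.
assert codecs.BOM_UTF8 == b'\xef\xbb\xbf'

data = codecs.BOM_UTF8 + "hello".encode('utf-8')

# Plain 'utf-8' keeps the signature as U+FEFF; 'utf-8-sig' strips it.
assert data.decode('utf-8') == '\ufeffhello'
assert data.decode('utf-8-sig') == 'hello'
```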

 2- If that were true, can you point me to some documentation about the
 math that, as Mark says, demonstrates this?

It is true because UTF-8 is essentially an 8-bit encoding that moves on
to the next byte once it exhausts the addressable space of the current
byte.  Since the bytes are accessed and assessed sequentially, they must
be in big-endian order.


Re: UTF-8 question from Dive into Python 3

2011-01-17 Thread Antoine Pitrou
On Mon, 17 Jan 2011 14:19:13 -0800 (PST)
carlo syseng...@gmail.com wrote:
 Is it true that UTF-8 does not have any big-endian/little-endian issue
 because of its encoding method?

Yes.

 And if it is true, why does Mark (and
 everyone else) write about UTF-8 with and without a BOM some chapters
 later? What would be the BOM's purpose then?

BOM in this case is a misnomer. For UTF-8, it is only used as a
marker (a magic number, if you like) to signal that a given text file
is UTF-8. The UTF-8 BOM does not say anything about byte order; and,
actually, it does not change with endianness.

(note that it is not required to put a UTF-8 BOM at the beginning of
text files; it is just a hint that some tools use when
generating/reading UTF-8)

 2- If that were true, can you point me to some documentation about the
 math that, as Mark says, demonstrates this?

Math? UTF-8 is simply a byte-oriented (rather than word-oriented)
encoding. There is no math involved, it just works by construction.

Regards

Antoine.




Re: UTF-8 question from Dive into Python 3

2011-01-17 Thread carlo
On 17 Jan, 23:34, Antoine Pitrou solip...@pitrou.net wrote:
 On Mon, 17 Jan 2011 14:19:13 -0800 (PST)

 carlo syseng...@gmail.com wrote:
  Is it true that UTF-8 does not have any big-endian/little-endian issue
  because of its encoding method?

 Yes.

  And if it is true, why does Mark (and
  everyone else) write about UTF-8 with and without a BOM some chapters
  later? What would be the BOM's purpose then?

 BOM in this case is a misnomer. For UTF-8, it is only used as a
 marker (a magic number, if you like) to signal that a given text file
 is UTF-8. The UTF-8 BOM does not say anything about byte order; and,
 actually, it does not change with endianness.

 (note that it is not required to put a UTF-8 BOM at the beginning of
 text files; it is just a hint that some tools use when
 generating/reading UTF-8)

  2- If that were true, can you point me to some documentation about the
  math that, as Mark says, demonstrates this?

 Math? UTF-8 is simply a byte-oriented (rather than word-oriented)
 encoding. There is no math involved, it just works by construction.

 Regards

 Antoine.

thank you all, I eventually found
http://www.unicode.org/versions/Unicode5.2.0/ch03.pdf#G7404
which clears things up.
No math in fact, as Tim and Antoine pointed out.