On Jun 17, 6:48 pm, Tzury <[EMAIL PROTECTED]> wrote:
> On Jun 17, 10:48 am, "Martin v. Löwis" <[EMAIL PROTECTED]> wrote:
>
> > > I recently rewrote a .net application in python.
> > > The application is basically gets streams via TCP socket and handle
> > > operations against an existing database.
> > > The Database is SQLite3 (Encoded as UTF-8).
> > > The Networks streams are encoded as UCS-2.
> >
> > > Since in UCS-2, 'A' = '0041' and when I check with the built-in
> > > functions I get for unicode("A", "utf-8") = u'A' = u'\u0041'. I
> > > wonder what is the difference, and how can I safely encode/decode
> > > UCS-2 streams and match them with the UTF-8 representation
> >
> > In unicode("A", "utf-8"), the "utf-8" parameter does *not* mean
> > that the output is in UTF-8, but the *input*.
> > So "A" = '41' != '0041'. In UCS-2, the A consumes two bytes; in
> > UTF-8, it consumes only one byte.
> >
> > For different letters, that's different: For example, for u'\xf6',
> > the UCS-2 representation (big-endian) is '00F6', for UTF-8, it is
> > 'C3B6'. For u'\u20AC', the UCS-2 is '20AC', the UTF-8 is 'E282AC'
> > (i.e. three bytes).
> >
> > You should use Unicode objects in your program always, and encode
> > to or from UCS-2 or UTF-8 only when interfacing with the
> > network/database.
> >
> > HTH,
> > Martin
>
> Thanks Martin for this guideline. But in fact say I get a USC-2 string
> and need to compare it with UTF-8 value in the database. How can I do
> it given the Python built-in libraries?
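
Use the str.decode method with the appropriate encoding on each byte
string, then compare the resulting unicode objects. Borrowing Martin's
last example:

>>> '\xE2\x82\xAC'.decode('utf8')
u'\u20ac'
>>> '\x20\xAC'.decode('utf_16_be')
u'\u20ac'

Both give the same unicode object, so they compare equal. Note that
'utf_16_be' assumes big-endian byte order; use 'utf_16_le' if your
stream is little-endian.

BTW TLA 'USC' AAF SBE 'UCS'

HTH
SJM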
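
For the comparison itself, here is a minimal sketch of the same idea
written for Python 3, where byte strings are bytes objects and decode()
returns str. The function name same_text and its byteorder parameter are
made up for illustration, and it treats UCS-2 as UTF-16 without a BOM,
which holds only for characters in the Basic Multilingual Plane.

```python
# Sketch only: same_text and byteorder are hypothetical names,
# not part of any library.
def same_text(ucs2_bytes, utf8_bytes, byteorder="big"):
    """Decode both byte strings to text and compare the text."""
    # Assumption: the UCS-2 stream carries no BOM, so the byte
    # order has to be stated explicitly by the caller.
    codec = "utf_16_be" if byteorder == "big" else "utf_16_le"
    return ucs2_bytes.decode(codec) == utf8_bytes.decode("utf_8")


# Martin's euro sign example: '20AC' in UCS-2, 'E282AC' in UTF-8.
network_value = b"\x20\xac"
database_value = b"\xe2\x82\xac"
print(same_text(network_value, database_value))
```

Comparing the decoded text rather than the raw bytes follows Martin's
advice: keep Unicode objects inside the program and encode or decode
only at the network and database boundaries.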