On Mon, 2008-11-17 at 14:52 +0000, Darren Mansell wrote: > Hi. > > I'm relatively new to python so please be gentle :) > > I'm trying to write a £ symbol to an MS SQL server using pymsssql . This > works but when selecting the data back (e.g. using SQL management > studio) the £ symbol is replaced with £ (latin capital letter A with > circumflex). > > I can reproduce it like so: > > >>> con = > >>> pymssql.connect(host='testdb',user='testuser',password='password',database='test') > >>> sql = "insert into table_1 values ('£')" > >>> cur.execute(sql) > >>> con.commit() > >>> sql = "insert into table_1 values ('1')" > >>> cur.execute(sql) > >>> con.commit() > >>> sql = "select * from table_1" > >>> cur.execute(sql) > >>> cur.fetchall() > [('\xc2\xa3',), ('1',)] > > If I insert a £ symbol in using SQL management studio then select it back > from Python I get this: > > [('\xc2\xa3',), ('1',), ('\xa3',)] > > If I look in SQL management studio it says this: > > £ > > for the inserts I do through Python/pymssql. > > Does anyone have any ideas whats happening and how to fix it?
What you are seeing is the character codes for £ in UTF-8. The information is being sent from your database as a series of bytes. Unicode is an abstraction that consists of a series of characters (which in UTF-8, can be made up of 1, 2, 3, or 4 bytes. In order to get a unicode object from a bytestring, you need to use the decode method. If you decode from Latin-1 ('\xc2\xa3'.decode('latin-1')), you get £, if you decode from UTF-8 ('\xc2\xa3'.decode('utf-8')), you get £. Note that it is a pure coincidence that £ is present in the latin-1 decoding. In binary, '\xa3' looks like this: 10100011 or, 10 100011 To encode it to UTF-8, you prepend 110 to the first byte, (the two 1s indicate that the character will be two bytes long. A three byte character would start with 1110.) The second byte (and third, if it existed) start with 10, and the bits are packed as far to the right as possible in that framework. 110xxxxx 10xxxxxx 10 100011 11000010 10100011 1100 0010 1010 0011 c 2 a 3 Reverse the process to get a unicode character from a UTF-8 bytestring. See the wikipedia page on UTF-8 for a more complete explanation on how UTF-8 encoding works. > Thanks > -- http://mail.python.org/mailman/listinfo/python-list