Re: String character encoding when converting data from one type/format to another

Ned Batchelder Wed, 07 Jan 2015 04:10:17 -0800

On 1/7/15 6:04 AM, Jacob Kruger wrote:

I'm busy using something like pyodbc to pull data out of MS access .mdb
files, and then generate .sql script files to execute against MySQL
databases using MySQLdb module, but, issue is forms of characters in
string values that don't fit inside the 0-127 range - current one seems
to be something like \xa3, and if I pass it through ord() function, it
comes out as character number 163.
Now issue is, yes, could just run through the hundreds of thousands of
characters in these resulting strings, and strip out any that are not
within the basic 0-127 range, but, that could result in corrupting data
- think so anyway.


That will definitely corrupt your data, since you will be discarding data.

Anyway, issue is, for example, if I try something like
str('\xa3').encode('utf-8') or str('\xa3').encode('ascii'), or
str('\xa3').encode('latin7') - that last one is actually our preferred
encoding for the MySQL database - they all just tell me they can't work
with a character out of range.

Are you using Python 2 or Python 3? This is one area where the two arevery different. I suspect you are on Python 2, in which case these allfail the same way because you are calling encode on a bytestring. Youcan't encode a bytestring, you can only encode a Unicode string, soencode is helpfully trying to decode your bytestring first, using thedefault encoding (ascii), and '\xa3' is not an ascii character.

If that was confusing, this talk covers these fundamentals:http://bit.ly/unipain .

Any thoughts on a sort of generic method/means to handle any/all
characters that might be out of range when having pulled them out of
something like these MS access databases?

The best thing is to know what encoding was used to produce these bytevalues. Then you can manipulate them as Unicode if you need to. Thesecond best thing is to simply pass them through as bytes.

Another side note is for binary values that might store binary values, I
use something like the following to generate hex-based strings that work
alright when then inserting said same binary values into longblob
fields, but, don't think this would really help for what are really just
most likely badly chosen copy/pasted strings from documents, with
strange encoding, or something:
#sample code line for binary encoding into string output
s_values += "0x" + str(l_data[J][I]).encode("hex").replace("\\", "\\\\")
+ ", "
TIA

Jacob Kruger
Blind Biker
Skype: BlindZA
"Roger Wilco wants to welcome you...to the space janitor's closet..."



--
Ned Batchelder, http://nedbatchelder.com

--
https://mail.python.org/mailman/listinfo/python-list

Re: String character encoding when converting data from one type/format to another

Reply via email to