Re: String character encoding when converting data from one type/format to another

2015-01-07 Thread Dave Angel

On 01/07/2015 08:38 AM, Jacob Kruger wrote:

Thanks.

Makes more sense now, and yes, using 2.7 here.

Unfortunately, while I could pass the binary values into blob fields well
enough using forms of parameterised statements, the actual generation
of sql script text files is a step they want to work with at times, if
someone is handling this on site, so I had to work first with generating
string values, and then handle executing those statements against a
MySQL server later on using MySQLdb.



There must be an encoding method used for describing blobs in an sql 
statement.  Use exactly that method;  don't try to get creative.


For example, in sqlite, use sqlite.encode().
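
For MySQL the analogue is its hex-literal syntax.  A minimal, untested 
sketch of generating one from a raw byte string (the names are made up 
for illustration):

import binascii

def blob_literal(raw_bytes):
    #MySQL reads 0x... (or X'...') as a binary literal
    return "0x" + binascii.hexlify(raw_bytes)

#e.g. "INSERT INTO t (b) VALUES (%s);" % blob_literal("\x00\xff")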

--
DaveA
--
https://mail.python.org/mailman/listinfo/python-list


Re: String character encoding when converting data from one type/format to another

2015-01-07 Thread Dave Angel

On 01/07/2015 08:32 AM, Jacob Kruger wrote:

Thanks.

Please don't top-post.  Put your responses after each quoted part you're 
responding to.  And if there are parts you're not responding to, please 
delete them.


The issue with knowing the encoding could just be that I'm pretty sure at
least some of the data capture is done via copy/paste from one MS app to
another, which could possibly result in a whole bunch of different
character sets, etc. being copied across; so it comes down to the fact
that while we can't control the sources of data, we need to
manipulate/work with it to make it useful on our side now.



Copy/paste to/from properly written Windows programs is done in Unicode, 
so the problem should only be one of how the data was saved.  There, 
Windows is much more sloppy.


Chances are that a given machine will use a consistent encoding, so a 
given file should be consistent, unless it was used over a network.  And 
if all the machines that generate this data are in the same company, 
they might all use the same one as well.
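
If you do have to guess, the third-party chardet package can at least 
rank the likely candidates.  A rough, untested sketch (chardet is a 
separate install, and the filename is made up):

import chardet

raw = open("export.sql", "rb").read()
guess = chardet.detect(raw)
print guess["encoding"], guess["confidence"]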



--
DaveA
--
https://mail.python.org/mailman/listinfo/python-list


Re: String character encoding when converting data from one type/format to another

2015-01-07 Thread Jacob Kruger

Thanks.

Yes, sorry I didn't mention 2.7, and, unfortunately in this sense, all of
this will be running on Windows machines.


Stay well

Jacob Kruger
Blind Biker
Skype: BlindZA
"Roger Wilco wants to welcome you...to the space janitor's closet..."

- Original Message - 
From: "Dave Angel" 

To: 
Sent: Wednesday, January 07, 2015 2:22 PM
Subject: Re: String character encoding when converting data from one 
type/format to another




On 01/07/2015 06:04 AM, Jacob Kruger wrote:

I'm busy using something like pyodbc to pull data out of MS access .mdb
files, and then generate .sql script files to execute against MySQL
databases using MySQLdb module, but, issue is forms of characters in
string values that don't fit inside the 0-127 range - current one seems
to be something like \xa3, and if I pass it through ord() function, it
comes out as character number 163.

First question, of course, is what version of Python.  Clearly, you're not
using Python 3.x, so I'll assume 2.7.  But you really should specify it in 
your query.


Next question is what OS you're using.  You're reading .mdb files, which 
are most likely created in Windows, but that doesn't guarantee you're 
actually using Windows to do this conversion.





Now issue is, yes, could just run through the hundreds of thousands of 
characters in these resulting strings, and strip out any that are not 
within the basic 0-127 range, but, that could result in corrupting data - 
think so anyway.


Anyway, issue is, for example, if I try something like 
str('\xa3').encode('utf-8') or str('\xa3').encode('ascii'), or 
str('\xa3').encode('latin7') - that last one is actually our preferred 
encoding for the MySQL database - they all just tell me they can't work 
with a character out of range.




That's because your input data isn't ASCII.  So before you encode it, you 
have to decode it.  Any idea what encoding it's already in?  Maybe it's in 
latin1, which permits all 256 values.  Or utf-8, which permits a few 
hundred thousand values, but uses multiple bytes for any of those over 
127.  Or any of hundreds of other encodings.


Does an .mdb file have a field specifying what encoding was used?  Or do 
you have some other external knowledge?


If you don't know what encoding it's currently in, you'll have to guess, 
and the guess you're using so far is ASCII, which you know is false.


As for the encoding you should actually use in the database, that almost 
certainly ought to be utf-8, which supports far more international 
characters than latin1.  And make sure the database has a way to tell the 
future user what encoding you picked.


Any thoughts on a sort of generic method/means to handle any/all 
characters that might be out of range when having pulled them out of 
something like these MS access databases?


The only invalid characters are those which aren't valid in the encoding 
used.  Those can probably be safely converted to "?" or something similar.




Another side note is for fields that might store binary values: I
use something like the following to generate hex-based strings that work
alright when then inserting said same binary values into longblob fields,
but, don't think this would really help for what are really just most
likely badly chosen copy/pasted strings from documents, with strange
encoding, or something:

#sample code line for binary encoding into string output
s_values += "0x" + str(l_data[J][I]).encode("hex").replace("\\", "") + ", "




Best to not pretend they're text at all.  But if your db doesn't support 
binary blobs, then use an encoding which supports all 256 values 
unambiguously, while producing printable characters.  Like uuencode, 
using module uu.


You might also look into mime, where you store the encoding of the data 
with the data.  See for example mimetypes.




--
DaveA
--
https://mail.python.org/mailman/listinfo/python-list



--
https://mail.python.org/mailman/listinfo/python-list


Re: String character encoding when converting data from one type/format to another

2015-01-07 Thread Jacob Kruger

Thanks.

Makes more sense now, and yes, using 2.7 here.

Unfortunately, while I could pass the binary values into blob fields well
enough using forms of parameterised statements, the actual generation of
sql script text files is a step they want to work with at times, if someone
is handling this on site, so I had to work first with generating string
values, and then handle executing those statements against a MySQL server
later on using MySQLdb.


Stay well

Jacob Kruger
Blind Biker
Skype: BlindZA
"Roger Wilco wants to welcome you...to the space janitor's closet..."

- Original Message - 
From: "Peter Otten" <__pete...@web.de>

To: 
Sent: Wednesday, January 07, 2015 2:11 PM
Subject: Re: String character encoding when converting data from one 
type/format to another




Jacob Kruger wrote:


I'm busy using something like pyodbc to pull data out of MS access .mdb
files, and then generate .sql script files to execute against MySQL
databases using MySQLdb module, but, issue is forms of characters in
string values that don't fit inside the 0-127 range - current one seems
to be something like \xa3, and if I pass it through ord() function, it
comes out as character number 163.

Now issue is, yes, could just run through the hundreds of thousands of
characters in these resulting strings, and strip out any that are not
within the basic 0-127 range, but, that could result in corrupting data -
think so anyway.

Anyway, issue is, for example, if I try something like
str('\xa3').encode('utf-8') or str('\xa3').encode('ascii'), or


"\xa3" already is a str; str("\xa3") is as redundant as
str(str(str("\xa3"))) ;)


str('\xa3').encode('latin7') - that last one is actually our preferred
encoding for the MySQL database - they all just tell me they can't work
with a character out of range.


encode() goes from unicode to bytes; you want to convert bytes to unicode
and thus need decode().

In this context it is important that you tell us the Python version. In
Python 2 str.encode(encoding) is basically

str.decode("ascii").encode(encoding)

which is why you probably got a UnicodeDecodeError in the traceback:


"\xa3".encode("latin7")

Traceback (most recent call last):
 File "", line 1, in 
 File "/usr/lib/python2.7/encodings/iso8859_13.py", line 12, in encode
   return codecs.charmap_encode(input,errors,encoding_table)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa3 in position 0:
ordinal not in range(128)


"\xa3".decode("latin7")

u'\xa3'

print "\xa3".decode("latin7")

£

Aside: always include the traceback in your posts -- and always read it
carefully. The fact that "latin7" is not mentioned might have given you a
hint that the problem was not what you thought it was.


Any thoughts on a sort of generic method/means to handle any/all
characters that might be out of range when having pulled them out of
something like these MS access databases?


Assuming the data in Access is not broken and that you know the encoding,
decode() will work.


Another side note is for fields that might store binary values: I
use something like the following to generate hex-based strings that work
alright when then inserting said same binary values into longblob fields,
but, don't think this would really help for what are really just most
likely badly chosen copy/pasted strings from documents, with strange
encoding, or something:
#sample code line for binary encoding into string output
s_values += "0x" + str(l_data[J][I]).encode("hex").replace("\\", "") + ", "


I would expect that you can feed bytestrings directly into blobs, without
any preparatory step. Try it, and if you get failures show us the failing
code and the corresponding traceback.

--
https://mail.python.org/mailman/listinfo/python-list



--
https://mail.python.org/mailman/listinfo/python-list


Re: String character encoding when converting data from one type/format to another

2015-01-07 Thread Jacob Kruger

Thanks.

Yes, using Python 2.7, and all you said makes sense, so I will check out
the talk, and the byte'ing, etc. (yes, bad joke, I know)


The issue with knowing the encoding could just be that I'm pretty sure at
least some of the data capture is done via copy/paste from one MS app to
another, which could possibly result in a whole bunch of different
character sets, etc. being copied across; so it comes down to the fact
that while we can't control the sources of data, we need to
manipulate/work with it to make it useful on our side now.


Thanks again

Jacob Kruger
Blind Biker
Skype: BlindZA
"Roger Wilco wants to welcome you...to the space janitor's closet..."

- Original Message - 
From: "Ned Batchelder" 

To: 
Sent: Wednesday, January 07, 2015 2:02 PM
Subject: Re: String character encoding when converting data from one 
type/format to another




On 1/7/15 6:04 AM, Jacob Kruger wrote:

I'm busy using something like pyodbc to pull data out of MS access .mdb
files, and then generate .sql script files to execute against MySQL
databases using MySQLdb module, but, issue is forms of characters in
string values that don't fit inside the 0-127 range - current one seems
to be something like \xa3, and if I pass it through ord() function, it
comes out as character number 163.
Now issue is, yes, could just run through the hundreds of thousands of
characters in these resulting strings, and strip out any that are not
within the basic 0-127 range, but, that could result in corrupting data
- think so anyway.


That will definitely corrupt your data, since you will be discarding data.


Anyway, issue is, for example, if I try something like
str('\xa3').encode('utf-8') or str('\xa3').encode('ascii'), or
str('\xa3').encode('latin7') - that last one is actually our preferred
encoding for the MySQL database - they all just tell me they can't work
with a character out of range.


Are you using Python 2 or Python 3? This is one area where the two are 
very different.  I suspect you are on Python 2, in which case these all 
fail the same way because you are calling encode on a bytestring.  You 
can't encode a bytestring, you can only encode a Unicode string, so encode 
is helpfully trying to decode your bytestring first, using the default 
encoding (ascii), and '\xa3' is not an ascii character.


If that was confusing, this talk covers these fundamentals: 
http://bit.ly/unipain .



Any thoughts on a sort of generic method/means to handle any/all
characters that might be out of range when having pulled them out of
something like these MS access databases?


The best thing is to know what encoding was used to produce these byte 
values.  Then you can manipulate them as Unicode if you need to.  The 
second best thing is to simply pass them through as bytes.



Another side note is for fields that might store binary values: I
use something like the following to generate hex-based strings that work
alright when then inserting said same binary values into longblob
fields, but, don't think this would really help for what are really just
most likely badly chosen copy/pasted strings from documents, with
strange encoding, or something:
#sample code line for binary encoding into string output
s_values += "0x" + str(l_data[J][I]).encode("hex").replace("\\", "") + ", "
TIA

Jacob Kruger
Blind Biker
Skype: BlindZA
"Roger Wilco wants to welcome you...to the space janitor's closet..."





--
Ned Batchelder, http://nedbatchelder.com

--
https://mail.python.org/mailman/listinfo/python-list



--
https://mail.python.org/mailman/listinfo/python-list


Re: String character encoding when converting data from one type/format to another

2015-01-07 Thread Chris Angelico
On Wed, Jan 7, 2015 at 11:02 PM, Ned Batchelder  wrote:
>> Any thoughts on a sort of generic method/means to handle any/all
>> characters that might be out of range when having pulled them out of
>> something like these MS access databases?
>
>
> The best thing is to know what encoding was used to produce these byte
> values.  Then you can manipulate them as Unicode if you need to.  The second
> best thing is to simply pass them through as bytes.

If you can't know for sure, you could hazard a guess. There's a good
chance that an eight-bit encoding from a Microsoft product is CP-1252.
In fact, when I interoperate with Unicode-unaware Windows programs, I
usually attempt a UTF-8 decode, and if that fails, I simply assume
CP-1252; this generally gives correct results for data coming from
US-English Windows users.

Jacob, have a look at your data. Contextually, would the '\xa3' be
likely to be a pound sign, £? Would '\x85' make sense as an ellipsis?
Would \x91, \x92, \x93, and \x94 seem to be used for quote marks? If
so, CP-1252 would be the encoding to use.
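
In code that heuristic is only a few lines. Untested sketch:

def guess_decode(raw):
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # not valid UTF-8, so assume a Unicode-unaware Windows source
        return raw.decode("cp1252")

print repr(guess_decode("\xa3"))  # u'\xa3', the pound sign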

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: String character encoding when converting data from one type/format to another

2015-01-07 Thread Dave Angel

On 01/07/2015 06:04 AM, Jacob Kruger wrote:

I'm busy using something like pyodbc to pull data out of MS access .mdb
files, and then generate .sql script files to execute against MySQL
databases using MySQLdb module, but, issue is forms of characters in
string values that don't fit inside the 0-127 range - current one seems
to be something like \xa3, and if I pass it through ord() function, it
comes out as character number 163.

First question, of course, is what version of Python.  Clearly, you're
not using Python 3.x, so I'll assume 2.7.  But you really should specify 
it in your query.


Next question is what OS you're using.  You're reading .mdb files, which 
are most likely created in Windows, but that doesn't guarantee you're 
actually using Windows to do this conversion.





Now issue is, yes, could just run through the hundreds of thousands of 
characters in these resulting strings, and strip out any that are not within 
the basic 0-127 range, but, that could result in corrupting data - think so 
anyway.

Anyway, issue is, for example, if I try something like 
str('\xa3').encode('utf-8') or str('\xa3').encode('ascii'), or 
str('\xa3').encode('latin7') - that last one is actually our preferred encoding 
for the MySQL database - they all just tell me they can't work with a character 
out of range.



That's because your input data isn't ASCII.  So before you encode it, 
you have to decode it.  Any idea what encoding it's already in?  Maybe 
it's in latin1, which permits all 256 values.  Or utf-8, which permits a 
few hundred thousand values, but uses multiple bytes for any of those 
over 127.  Or any of hundreds of other encodings.


Does an .mdb file have a field specifying what encoding was used?  Or do 
you have some other external knowledge?


If you don't know what encoding it's currently in, you'll have to guess, 
and the guess you're using so far is ASCII, which you know is false.


As for the encoding you should actually use in the database, that almost 
certainly ought to be utf-8, which supports far more international 
characters than latin1.  And make sure the database has a way to tell 
the future user what encoding you picked.
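
The whole conversion is then one decode and one encode.  A minimal 
sketch, assuming (and it is only an assumption) that the source turns 
out to be cp1252:

SOURCE_ENCODING = "cp1252"  #a guess -- verify against your actual data

def to_utf8(raw):
    #bytes -> unicode -> bytes, ready for the generated .sql file
    return raw.decode(SOURCE_ENCODING).encode("utf-8")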



Any thoughts on a sort of generic method/means to handle any/all characters 
that might be out of range when having pulled them out of something like these 
MS access databases?


The only invalid characters are those which aren't valid in the encoding 
used.  Those can probably be safely converted to "?" or something similar.
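
The codec machinery will do that substitution for you via its errors 
argument.  Untested sketch, with made-up cp1252 sample bytes:

raw = "\x85 \xa3"                     #cp1252: ellipsis, pound sign
u = raw.decode("cp1252", "replace")   #undecodable bytes become U+FFFD
safe = u.encode("latin7", "replace")  #unencodable chars become "?": "? \xa3"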




Another side note is for fields that might store binary values: I use 
something like the following to generate hex-based strings that work alright 
when then inserting said same binary values into longblob fields, but, don't 
think this would really help for what are really just most likely badly chosen 
copy/pasted strings from documents, with strange encoding, or something:
#sample code line for binary encoding into string output
s_values += "0x" + str(l_data[J][I]).encode("hex").replace("\\", "") + ", "



Best to not pretend they're text at all.  But if your db doesn't support 
binary blobs, then use an encoding which supports all 256 values 
unambiguously, while producing printable characters.  Like uuencode, 
using module uu.
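
The uu module works on file objects, so wrap the bytes in StringIO.  A 
minimal, untested sketch:

import uu
from cStringIO import StringIO

raw = "\x00\xff\xa3 arbitrary bytes"
out = StringIO()
uu.encode(StringIO(raw), out, name="blob")
print out.getvalue()  #printable lines, safe to embed in a text file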


You might also look into mime, where you store the encoding of the data 
with the data.  See for example mimetypes.




--
DaveA
--
https://mail.python.org/mailman/listinfo/python-list


Re: String character encoding when converting data from one type/format to another

2015-01-07 Thread Peter Otten
Jacob Kruger wrote:

> I'm busy using something like pyodbc to pull data out of MS access .mdb
> files, and then generate .sql script files to execute against MySQL
> databases using MySQLdb module, but, issue is forms of characters in
> string values that don't fit inside the 0-127 range - current one seems to
> be something like \xa3, and if I pass it through ord() function, it comes
> out as character number 163.
> 
> Now issue is, yes, could just run through the hundreds of thousands of
> characters in these resulting strings, and strip out any that are not
> within the basic 0-127 range, but, that could result in corrupting data -
> think so anyway.
> 
> Anyway, issue is, for example, if I try something like
> str('\xa3').encode('utf-8') or str('\xa3').encode('ascii'), or

"\xa3" already is a str; str("\xa3") is as redundant as 
str(str(str("\xa3"))) ;)

> str('\xa3').encode('latin7') - that last one is actually our preferred
> encoding for the MySQL database - they all just tell me they can't work
> with a character out of range.

encode() goes from unicode to bytes; you want to convert bytes to unicode and 
thus need decode().

In this context it is important that you tell us the Python version. In 
Python 2 str.encode(encoding) is basically 

str.decode("ascii").encode(encoding)

which is why you probably got a UnicodeDecodeError in the traceback:

>>> "\xa3".encode("latin7")
Traceback (most recent call last):
  File "", line 1, in 
  File "/usr/lib/python2.7/encodings/iso8859_13.py", line 12, in encode
return codecs.charmap_encode(input,errors,encoding_table)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa3 in position 0: 
ordinal not in range(128)

>>> "\xa3".decode("latin7")
u'\xa3'
>>> print "\xa3".decode("latin7")
£

Aside: always include the traceback in your posts -- and always read it 
carefully. The fact that "latin7" is not mentioned might have given you a 
hint that the problem was not what you thought it was.

> Any thoughts on a sort of generic method/means to handle any/all
> characters that might be out of range when having pulled them out of
> something like these MS access databases?

Assuming the data in Access is not broken and that you know the encoding,
decode() will work.

> Another side note is for fields that might store binary values: I
> use something like the following to generate hex-based strings that work
> alright when then inserting said same binary values into longblob fields,
> but, don't think this would really help for what are really just most
> likely badly chosen copy/pasted strings from documents, with strange
> encoding, or something:
> #sample code line for binary encoding into string output
> s_values += "0x" + str(l_data[J][I]).encode("hex").replace("\\", "") + ", "

I would expect that you can feed bytestrings directly into blobs, without 
any preparatory step. Try it, and if you get failures show us the failing 
code and the corresponding traceback.
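
Something along these lines (untested, connection details made up); with 
a parameterised query MySQLdb escapes the bytes itself:

import MySQLdb

raw_bytes = open("attachment.bin", "rb").read()  # hypothetical source
conn = MySQLdb.connect(db="target")
cur = conn.cursor()
cur.execute("INSERT INTO files (body) VALUES (%s)", (raw_bytes,))
conn.commit()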

-- 
https://mail.python.org/mailman/listinfo/python-list


Re: String character encoding when converting data from one type/format to another

2015-01-07 Thread Ned Batchelder

On 1/7/15 6:04 AM, Jacob Kruger wrote:

I'm busy using something like pyodbc to pull data out of MS access .mdb
files, and then generate .sql script files to execute against MySQL
databases using MySQLdb module, but, issue is forms of characters in
string values that don't fit inside the 0-127 range - current one seems
to be something like \xa3, and if I pass it through ord() function, it
comes out as character number 163.
Now issue is, yes, could just run through the hundreds of thousands of
characters in these resulting strings, and strip out any that are not
within the basic 0-127 range, but, that could result in corrupting data
- think so anyway.


That will definitely corrupt your data, since you will be discarding data.


Anyway, issue is, for example, if I try something like
str('\xa3').encode('utf-8') or str('\xa3').encode('ascii'), or
str('\xa3').encode('latin7') - that last one is actually our preferred
encoding for the MySQL database - they all just tell me they can't work
with a character out of range.


Are you using Python 2 or Python 3? This is one area where the two are 
very different.  I suspect you are on Python 2, in which case these all 
fail the same way because you are calling encode on a bytestring.  You 
can't encode a bytestring, you can only encode a Unicode string, so 
encode is helpfully trying to decode your bytestring first, using the 
default encoding (ascii), and '\xa3' is not an ascii character.
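
Here is a short, untested transcript of exactly that failure and the 
fix (cp1252 is just an example codec here):

>>> s = "\xa3"              # a byte string, not text
>>> s.encode("latin7")      # really s.decode("ascii").encode("latin7")
Traceback (most recent call last):
  ...
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa3 in position 0:
ordinal not in range(128)
>>> s.decode("cp1252")      # decode first, with a codec that fits the data
u'\xa3'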


If that was confusing, this talk covers these fundamentals: 
http://bit.ly/unipain .



Any thoughts on a sort of generic method/means to handle any/all
characters that might be out of range when having pulled them out of
something like these MS access databases?


The best thing is to know what encoding was used to produce these byte 
values.  Then you can manipulate them as Unicode if you need to.  The 
second best thing is to simply pass them through as bytes.



Another side note is for fields that might store binary values: I
use something like the following to generate hex-based strings that work
alright when then inserting said same binary values into longblob
fields, but, don't think this would really help for what are really just
most likely badly chosen copy/pasted strings from documents, with
strange encoding, or something:
#sample code line for binary encoding into string output
s_values += "0x" + str(l_data[J][I]).encode("hex").replace("\\", "") + ", "
TIA

Jacob Kruger
Blind Biker
Skype: BlindZA
"Roger Wilco wants to welcome you...to the space janitor's closet..."





--
Ned Batchelder, http://nedbatchelder.com

--
https://mail.python.org/mailman/listinfo/python-list


String character encoding when converting data from one type/format to another

2015-01-07 Thread Jacob Kruger
I'm busy using something like pyodbc to pull data out of MS access .mdb files, 
and then generate .sql script files to execute against MySQL databases using 
MySQLdb module, but, issue is forms of characters in string values that don't 
fit inside the 0-127 range - current one seems to be something like \xa3, and 
if I pass it through ord() function, it comes out as character number 163.

Now issue is, yes, could just run through the hundreds of thousands of 
characters in these resulting strings, and strip out any that are not within 
the basic 0-127 range, but, that could result in corrupting data - think so 
anyway.

Anyway, issue is, for example, if I try something like 
str('\xa3').encode('utf-8') or str('\xa3').encode('ascii'), or 
str('\xa3').encode('latin7') - that last one is actually our preferred encoding 
for the MySQL database - they all just tell me they can't work with a character 
out of range.

Any thoughts on a sort of generic method/means to handle any/all characters 
that might be out of range when having pulled them out of something like these 
MS access databases?

Another side note is for fields that might store binary values: I use 
something like the following to generate hex-based strings that work alright 
when then inserting said same binary values into longblob fields, but, don't 
think this would really help for what are really just most likely badly chosen 
copy/pasted strings from documents, with strange encoding, or something:
#sample code line for binary encoding into string output
s_values += "0x" + str(l_data[J][I]).encode("hex").replace("\\", "") + ", "

TIA

Jacob Kruger
Blind Biker
Skype: BlindZA
"Roger Wilco wants to welcome you...to the space janitor's closet..."
-- 
https://mail.python.org/mailman/listinfo/python-list