On 2017-10-17 20:25, Israel Brewster wrote:
On Oct 17, 2017, at 10:35 AM, MRAB <pyt...@mrabarnett.plus.com> wrote:
On 2017-10-17 18:26, Israel Brewster wrote:
I have written and maintain a PEP 249 compliant (hopefully) DB API
for the 4D database, and I've run into a situation where corrupted
string data from the database can cause the module to error out.
Specifically, when decoding the string, I get a "UnicodeDecodeError:
'utf-16-le' codec can't decode bytes in position 86-87: illegal
UTF-16 surrogate" error. This makes sense, given that the string
data got corrupted somehow, but the question is "what is the proper
way to deal with this in the module?" Should I just throw an error
on bad data? Or would it be better to set the errors parameter to
something like "replace"? The former feels a bit more "proper" to me
(there's an error here, so we throw an error), but leaves the end
user dead in the water, with no way to retrieve *any* of the data
(from that row at least, and perhaps any rows after it as well). The
latter option sort of feels like sweeping the problem under the rug,
but does at least leave an error character in the string to
let them know there was an error, and will allow retrieval of any
good data.
Of course, if this was in my own code I could decide on a
case-by-case basis what the proper action is, but since this is a
module that has to work in any situation, it's a bit more complicated.
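
The two options weighed above can be compared with a minimal sketch. The sample bytes are made up for illustration (a lone low surrogate wedged between two valid strings); they are not taken from the actual database.

```python
# Sample data is made up for illustration: "abc", a lone low surrogate
# (U+DC00, invalid on its own in UTF-16), then "def".
raw = "abc".encode("utf-16-le") + b"\x00\xdc" + "def".encode("utf-16-le")

# Option 1: strict decoding (the default) raises, and the caller gets no data.
try:
    raw.decode("utf-16-le")
    strict_failed = False
except UnicodeDecodeError as exc:
    strict_failed = True
    print("strict:", exc)

# Option 2: errors="replace" substitutes U+FFFD for the undecodable bytes
# and keeps everything else.
replaced = raw.decode("utf-16-le", errors="replace")
print("replace:", ascii(replaced))
```

With "replace" the good text on both sides of the damage survives, at the cost of the caller no longer being told that anything went wrong.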
If a particular text field is corrupted, then raising
UnicodeDecodeError when trying to get the contents of that field as a
Unicode string seems reasonable to me.
Is there a way to get the contents as a bytestring, or to get the
contents with a different errors parameter, so that the user has the
means to fix it (if it's fixable)?
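
One way such an option might look is sketched below. The function name `convert_text` and its `errors` and `as_bytes` parameters are hypothetical, invented for this example; they are not part of PEP 249 or of any existing 4D module.

```python
ENCODING = "utf-16-le"  # the encoding the thread says 4D uses for text


def convert_text(raw, errors="strict", as_bytes=False):
    """Hypothetical converter for a raw text column value.

    errors   -- passed straight to bytes.decode(); "strict" raises on bad data
    as_bytes -- skip decoding and return the undecoded bytes
    """
    if as_bytes:
        return bytes(raw)
    return raw.decode(ENCODING, errors=errors)


# Made-up corrupted value: "hi" followed by a lone low surrogate (U+DC00).
raw = "hi".encode(ENCODING) + b"\x00\xdc"

lenient = convert_text(raw, errors="replace")   # decoded, damage marked
undecoded = convert_text(raw, as_bytes=True)    # untouched bytes
```

Keeping "strict" as the default means corrupted data still raises unless the caller explicitly asks for something else.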
That's certainly a possibility, if that behavior conforms to the DB
API "standards". My concern on this front is that in my experience
working with other PEP 249 modules (specifically psycopg2), I'm pretty
sure that columns designated as type VARCHAR or TEXT are returned as
strings (unicode in python 2, although that may have been a setting I
used), not bytes. The other complication here is that the 4D database
doesn't use the UTF-8 encoding typically found, but rather UTF-16LE,
and I don't know how well this is documented. So not only is the bytes
representation completely unintelligible for human consumption, I'm
not sure the average end-user would know what decoding to use.
In the end though, the main thing in my mind is to maintain
"standards" compatibility - I don't want to be returning bytes if all
other DB API modules return strings, or vice versa for that matter.
There may be some flexibility there, but as much as possible I want to
conform to the majority/standard/whatever.
The average end-user might not know which encoding is being used, but
providing a way to read the underlying bytes will give a more
experienced user the means to investigate and possibly fix it: get the
bytes, figure out what the string should be, update the field with the
correctly decoded string using normal DB instructions.
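
The investigation step could go roughly as follows, assuming the user has already obtained the raw bytes somehow. The helper `first_bad_span` and the sample data are made up for illustration, and dropping the bad bytes is only one possible repair.

```python
def first_bad_span(raw, encoding="utf-16-le"):
    """Return (start, end) byte offsets of the first undecodable span, or None."""
    try:
        raw.decode(encoding)
    except UnicodeDecodeError as exc:
        return exc.start, exc.end
    return None


# Made-up corrupted value: "abc", a lone low surrogate (U+DC00), then "def".
raw = "abc".encode("utf-16-le") + b"\x00\xdc" + "def".encode("utf-16-le")

span = first_bad_span(raw)
if span is not None:
    start, end = span
    print("bad bytes:", raw[start:end].hex(), "at", span)
    # One possible repair: drop the bad bytes. Whether that is the right
    # fix depends on what the string was supposed to be.
    repaired = (raw[:start] + raw[end:]).decode("utf-16-le")
```

The repaired string would then be written back with an ordinary parameterized UPDATE through the cursor's execute() method.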