> On Oct 17, 2017, at 10:35 AM, MRAB <pyt...@mrabarnett.plus.com> wrote:
> 
> On 2017-10-17 18:26, Israel Brewster wrote:
>> I have written and maintain a PEP 249 compliant (hopefully) DB API for the 
>> 4D database, and I've run into a situation where corrupted string data from 
>> the database can cause the module to error out. Specifically, when decoding 
>> the string, I get a "UnicodeDecodeError: 'utf-16-le' codec can't decode 
>> bytes in position 86-87: illegal UTF-16 surrogate" error. This makes sense, 
>> given that the string data got corrupted somehow, but the question is "what 
>> is the proper way to deal with this in the module?" Should I just throw an 
>> error on bad data? Or would it be better to set the errors parameter to 
>> something like "replace"? The former feels a bit more "proper" to me 
>> (there's an error here, so we throw an error), but leaves the end user dead 
>> in the water, with no way to retrieve *any* of the data (from that row at 
>> least, and perhaps any rows after it as well). The latter option sort of 
>> feels like sweeping the problem under the rug, but does at least leave an 
>> error character in the string to let them know there was an error, and will 
>> allow retrieval of any good data.
>> Of course, if this was in my own code I could decide on a case-by-case basis 
>> what the proper action is, but since this is a module that has to work in any 
>> situation, it's a bit more complicated.
> If a particular text field is corrupted, then raising UnicodeDecodeError when 
> trying to get the contents of that field as a Unicode string seems reasonable 
> to me.
> 
> Is there a way to get the contents as a bytestring, or to get the contents 
> with a different errors parameter, so that the user has the means to fix it 
> (if it's fixable)?

That's certainly a possibility, if that behavior conforms to the DB API 
"standards". My concern on this front is that in my experience working with 
other PEP 249 modules (specifically psycopg2), I'm pretty sure that columns 
designated as type VARCHAR or TEXT are returned as strings (unicode in Python 
2, although that may have been a setting I used), not bytes. The other 
complication here is that the 4D database doesn't use the UTF-8 encoding 
typically found, but rather UTF-16LE, and I don't know how well this is 
documented. So not only is the bytes representation completely unintelligible 
for human consumption, I'm not sure the average end-user would know what 
decoding to use.
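
To make the trade-off concrete, here is a minimal sketch of the two behaviours 
using only the standard bytes.decode() call. The sample bytes are made up for 
illustration (a lone low surrogate spliced into otherwise valid UTF-16LE); 
they are not real data from 4D, and the variable names are mine.

```python
# Made-up sample: valid UTF-16LE text with a lone low surrogate (0xDC00)
# spliced in after the first four characters to simulate corruption.
good = "good data".encode("utf-16-le")
corrupted = good[:8] + b"\x00\xdc" + good[8:]

# Option 1: strict decoding (the default) raises, so the caller gets
# nothing from this field.
try:
    corrupted.decode("utf-16-le")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# Option 2: errors="replace" substitutes U+FFFD for the undecodable
# bytes and keeps the rest of the text.
replaced = corrupted.decode("utf-16-le", errors="replace")

# The raw bytes are technically readable too, but only by someone who
# already knows the field is UTF-16LE.
as_bytes = bytes(corrupted)
```

With "replace" the caller still gets a str, and the U+FFFD marker shows where 
the damage was. With strict decoding nothing comes back at all.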

In the end though, the main thing in my mind is to maintain "standards" 
compatibility - I don't want to be returning bytes if all other DB API modules 
return strings, or vice versa for that matter. There may be some flexibility 
there, but as much as possible I want to conform to the 
majority/standard/whatever.
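
One way to keep returning strings while still giving the user a choice might 
be to let them pick the error handler. The sketch below is only an assumption 
about how that could look: decode_text() and decode_row() are hypothetical 
helpers, not part of PEP 249 or of my module as it stands, and the errors 
argument is simply passed through to bytes.decode().

```python
def decode_text(raw, errors="strict"):
    """Hypothetical helper: decode one UTF-16LE field, always returning str."""
    return raw.decode("utf-16-le", errors=errors)


def decode_row(raw_fields, errors="strict"):
    """Hypothetical helper: decode every text field of a row the same way."""
    return tuple(decode_text(field, errors) for field in raw_fields)


# Made-up row: one clean field and one with a lone low surrogate in front.
row = ("ok".encode("utf-16-le"), b"\x00\xdc" + "x".encode("utf-16-le"))

# Default stays strict, so bad data is still reported as an error...
try:
    decode_row(row)
    strict_raised = False
except UnicodeDecodeError:
    strict_raised = True

# ...but a user who opts in can still retrieve the good data as str.
lenient = decode_row(row, errors="replace")
```

The default stays strict, so nothing is swept under the rug unless the user 
asks for it, and the return type is a string either way.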

-----------------------------------------------
Israel Brewster
Systems Analyst II
Ravn Alaska
5245 Airport Industrial Rd
Fairbanks, AK 99709
(907) 450-7293
-----------------------------------------------
> -- 
> https://mail.python.org/mailman/listinfo/python-list
