On Oct 17, 2017, at 12:02 PM, MRAB <pyt...@mrabarnett.plus.com> wrote:
> 
> On 2017-10-17 20:25, Israel Brewster wrote:
>> 
>>> On Oct 17, 2017, at 10:35 AM, MRAB <pyt...@mrabarnett.plus.com> wrote:
>>> 
>>> On 2017-10-17 18:26, Israel Brewster wrote:
>>>> I have written and maintain a PEP 249 compliant (hopefully) DB API for the 
>>>> 4D database, and I've run into a situation where corrupted string data 
>>>> from the database can cause the module to error out. Specifically, when 
>>>> decoding the string, I get a "UnicodeDecodeError: 'utf-16-le' codec can't 
>>>> decode bytes in position 86-87: illegal UTF-16 surrogate" error. This 
>>>> makes sense, given that the string data got corrupted somehow, but the 
>>>> question is "what is the proper way to deal with this in the module?" 
>>>> Should I just throw an error on bad data? Or would it be better to set the 
>>>> errors parameter to something like "replace"? The former feels a bit more 
>>>> "proper" to me (there's an error here, so we throw an error), but leaves 
>>>> the end user dead in the water, with no way to retrieve *any* of the data 
>>>> (from that row at least, and perhaps any rows after it as well). The 
>>>> latter option sort of feels like sweeping the problem under the rug, but 
>>>> does at least leave an error character in the string to let them know 
>>>> there was an error, and will allow retrieval of any good data.
>>>> Of course, if this was in my own code I could decide on a case-by-case 
>>>> basis what the proper action is, but since this is a module that has to work 
>>>> in any situation, it's a bit more complicated.
>>> If a particular text field is corrupted, then raising UnicodeDecodeError 
>>> when trying to get the contents of that field as a Unicode string seems 
>>> reasonable to me.
>>> 
>>> Is there a way to get the contents as a bytestring, or to get the contents 
>>> with a different errors parameter, so that the user has the means to fix it 
>>> (if it's fixable)?
>> 
>> That's certainly a possibility, if that behavior conforms to the DB API 
>> "standards". My concern in this front is that in my experience working with 
>> other PEP 249 modules (specifically psycopg2), I'm pretty sure that columns 
>> designated as type VARCHAR or TEXT are returned as strings (unicode in 
>> python 2, although that may have been a setting I used), not bytes. The 
>> other complication here is that the 4D database doesn't use the UTF-8 
>> encoding typically found, but rather UTF-16LE, and I don't know how well 
>> this is documented. So not only is the bytes representation completely 
>> unintelligible for human consumption, I'm not sure the average end-user 
>> would know what decoding to use.
>> 
>> In the end though, the main thing in my mind is to maintain "standards" 
>> compatibility - I don't want to be returning bytes if all other DB API 
>> modules return strings, or vice versa for that matter. There may be some 
>> flexibility there, but as much as possible I want to conform to the 
>> majority/standard/whatever.
>> 
> The average end-user might not know which encoding is being used, but 
> providing a way to read the underlying bytes will give a more experienced 
> user the means to investigate and possibly fix it: get the bytes, figure out 
> what the string should be, update the field with the correctly decoded string 
> using normal DB instructions.

I agree, and if I was just writing some random module I'd probably go with it, 
or perhaps with the suggestion offered by Karsten Hilbert. However, neither 
answer addresses my actual question, which is "how does the STANDARD (PEP 249 
in this case) say to handle this, or, barring that (since the standard probably 
doesn't explicitly say), how do the MAJORITY of PEP 249 compliant modules 
handle this?" Not what is the *best* way to handle it, but rather what is the 
normal, expected behavior for a Python DB API module when presented with bad 
data? That is, how does psycopg2 behave? pyodbc? pymssql (I think)? Etc. Or is 
that portion of the behavior completely arbitrary and different for every 
module?

It may well be that one of the suggestions *IS* the normal, expected, behavior, 
but it sounds more like you are suggesting how you think would be best to 
handle it, which is appreciated but not actually what I'm asking :-) Sorry if I 
am being difficult.
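
For reference, here is a minimal sketch of the two options being weighed in 
this thread, using only the standard bytes.decode() call. The sample data and 
the decode_field helper are made up for illustration and are not from my 
module or any other DB API module. It only shows what each choice does to a 
corrupted field, not what PEP 249 modules actually do.

```python
# Hypothetical illustration only; not taken from any real DB API module.

# "abc", then a lone high surrogate (0xD83D, little-endian), then "def".
# A high surrogate not followed by a low surrogate is invalid UTF-16.
raw = "abc".encode("utf-16-le") + b"\x3d\xd8" + "def".encode("utf-16-le")


def decode_field(data, errors="strict"):
    """Decode a UTF-16LE field; `errors` goes straight to bytes.decode()."""
    return data.decode("utf-16-le", errors)


# Option 1: strict decoding raises, and the caller gets nothing from the field.
try:
    decode_field(raw)
    strict_failed = False
except UnicodeDecodeError as exc:
    strict_failed = True
    print("strict:", exc.reason)

# Option 2: "replace" substitutes U+FFFD for the bad bytes and keeps the rest.
replaced = decode_field(raw, "replace")
print("replace:", ascii(replaced))

# The raw bytes remain available for anyone who wants to inspect the damage.
print("bytes:", raw.hex())
```

If a module kept strict decoding as its default, an opt-in errors setting or 
raw-bytes accessor along these lines would still give a user a way to reach 
the undamaged part of the data.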

> -- 
> https://mail.python.org/mailman/listinfo/python-list
