Thanks for the help on this, everyone! I found two ways to deal with this and figured I should share in case it comes up in the future.
The first approach, the one I went with (because in my case, fidelity was not as important) was to alter the 'text_factory' the sqlite3 uses. One trick here was that to access this parameter for a sqlalchemy connection object 'conn' as returned from an engine, I had to do it like so: conn.connection.connection.text_factory The extra indirection on .connection* was because the connection's internal dbapi connection was in fact a proy '_ConnectionFairy' instance. A small issue but one that's difficult to notice at first as it raises no error to set a new property on an object that will never use that property. The second approach I found was more involved but would be more appropriate for large projects that didn't want to have this behavior on every column. In this approach you still change the text_factory of sqlite3's connection, but instead change it to `bytes` (or perhaps an identity? ie lambda x: x - not sure what is best). This will cause sqlite3 to return encoded utf-8 bytes instead of unicode strings. Then, you have to tell SQLAlchemy to convert these strings to unicode. I did not persue this approach far enough to find the right set of arguments but I imagine this would be very simple - set 'force_unicode' to True, I suspect, would be all you would need. Finally, for the column with the invalid utf-8 sequences, just also set the `unicode_error` to your preferred resolution strategy - usually 'ignore' or 'replace'. I suppose it is possible that this could incur a performance penalty - the sqlite3 de/encoding process is done in a compiled C module and as such could possibly be faster than using native python for the task. I suspect though that the module is just calling to the usual Python library functions for encoding/decoding (but did not check). So that may be of a concern if you follow this approach. Thanks again! On Thu, Feb 6, 2014 at 4:19 PM, Michael Bayer <mike...@zzzcomputing.com>wrote: > > On Feb 6, 2014, at 6:59 PM, Erich Blume <blume.er...@gmail.com> wrote: > > > Hmm, this one has me stumped. As best I can tell after poking at it > using the column_reflect event, a custom dialect, etc. - the issue here is > that in pysqlite.py we (in my Python 3.3 install) are selecting > `sqlite3.dbapi2` as the dbapi interface, but we aren't telling sqlite3 > anything about how to treat unicode errors. From what I am reading (but it > seems inconsistent, maybe?) sqlite3 automatically decodes all database > retrieved values from their bytes for text fields, returning unicode > strings. Except... that doesn't always seem to be true. I hex-edited a db > file to change the utf-8 string "hello" to "hell" + 0x92 and sqlite3 > switched from returning "hello" to b"hell\x92", or something like that - > I've been poking at this for so long I've lost track of that transcript. > > > > One can override sqlite3's text factory, apparently, with (for instance) > `sqlite3.text_factory = lambda x: x.decode('utf-8', errors='ignore')`. > Maybe the key is to try and find a way to trigger that from sqlalchemy? I > tried and failed, maybe someone else can point me back to the path? > > > > Just to re-summarize the problem: In python 3, I'm getting errors trying > to read a row from a sqlite database that has a TEXT column with an invalid > utf-8 sequence (specifically, the singleton bye '0x92'). I'd love to just > have sqlalchemy move along and ignore the byte, but I'm not clear how to do > that. > > Pysqlite (e.g. sqlite3 module) returns TEXT as Python unicode out of the > gate. That exception message is being raised by sqlite3 itself, SQLAlchemy > is just a pass-through, as the string type knows on sqlite that the value > is already unicode. > > you might need to CAST() the value as BINARY perhaps, not sure. You’d > first want to get a plain sqlite3 script to do what you want. Setting a > “text_factory” at the module level of sqlite3 is certainly easy enough but > that seems way too broad. Ideally you’d want to be able to get the value > on a per-column basis. > > > -- You received this message because you are subscribed to the Google Groups "sqlalchemy" group. To unsubscribe from this group and stop receiving emails from it, send an email to sqlalchemy+unsubscr...@googlegroups.com. To post to this group, send email to sqlalchemy@googlegroups.com. Visit this group at http://groups.google.com/group/sqlalchemy. For more options, visit https://groups.google.com/groups/opt_out.