Thanks for the help on this, everyone! I found two ways to deal with this
and figured I should share in case it comes up in the future.

The first approach, the one I went with (because in my case, fidelity was
not as important) was to alter the 'text_factory' the sqlite3 uses. One
trick here was that to access this parameter for a sqlalchemy connection
object 'conn' as returned from an engine, I had to do it like so:

    conn.connection.connection.text_factory

The extra indirection on .connection* was because the connection's internal
dbapi connection was in fact a proy '_ConnectionFairy' instance. A small
issue but one that's difficult to notice at first as it raises no error to
set a new property on an object that will never use that property.

The second approach I found was more involved but would be more appropriate
for large projects that didn't want to have this behavior on every column.
In this approach you still change the text_factory of sqlite3's connection,
but instead change it to `bytes` (or perhaps an identity? ie lambda x: x -
not sure what is best). This will cause sqlite3 to return encoded utf-8
bytes instead of unicode strings.

Then, you have to tell SQLAlchemy to convert these strings to unicode. I
did not persue this approach far enough to find the right set of arguments
but I imagine this would be very simple - set 'force_unicode' to True, I
suspect, would be all you would need.

Finally, for the column with the invalid utf-8 sequences, just also set the
`unicode_error` to your preferred resolution strategy - usually 'ignore' or
'replace'.

I suppose it is possible that this could incur a performance penalty - the
sqlite3 de/encoding process is done in a compiled C module and as such
could possibly be faster than using native python for the task. I suspect
though that the module is just calling to the usual Python library
functions for encoding/decoding (but did not check). So that may be of a
concern if you follow this approach.

Thanks again!


On Thu, Feb 6, 2014 at 4:19 PM, Michael Bayer <mike...@zzzcomputing.com>wrote:

>
> On Feb 6, 2014, at 6:59 PM, Erich Blume <blume.er...@gmail.com> wrote:
>
> > Hmm, this one has me stumped. As best I can tell after poking at it
> using the column_reflect event, a custom dialect, etc. - the issue here is
> that in pysqlite.py we (in my Python 3.3 install) are selecting
> `sqlite3.dbapi2` as the dbapi interface, but we aren't telling sqlite3
> anything about how to treat unicode errors. From what I am reading (but it
> seems inconsistent, maybe?) sqlite3 automatically decodes all database
> retrieved values from their bytes for text fields, returning unicode
> strings. Except... that doesn't always seem to be true. I hex-edited a db
> file to change the utf-8 string "hello" to "hell" + 0x92 and sqlite3
> switched from returning "hello" to b"hell\x92", or something like that -
> I've been poking at this for so long I've lost track of that transcript.
> >
> > One can override sqlite3's text factory, apparently, with (for instance)
> `sqlite3.text_factory = lambda x: x.decode('utf-8', errors='ignore')`.
> Maybe the key is to try and find a way to trigger that from sqlalchemy? I
> tried and failed, maybe someone else can point me back to the path?
> >
> > Just to re-summarize the problem: In python 3, I'm getting errors trying
> to read a row from a sqlite database that has a TEXT column with an invalid
> utf-8 sequence (specifically, the singleton bye '0x92'). I'd love to just
> have sqlalchemy move along and ignore the byte, but I'm not clear how to do
> that.
>
> Pysqlite (e.g. sqlite3 module) returns TEXT as Python unicode out of the
> gate.  That exception message is being raised by sqlite3 itself, SQLAlchemy
> is just a pass-through, as the string type knows on sqlite that the value
> is already unicode.
>
> you might need to CAST() the value as BINARY perhaps, not sure.   You’d
> first want to get a plain sqlite3 script to do what you want.   Setting a
> “text_factory” at the module level of sqlite3 is certainly easy enough but
> that seems way too broad.   Ideally you’d want to be able to get the value
> on a per-column basis.
>
>
>

-- 
You received this message because you are subscribed to the Google Groups 
"sqlalchemy" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to sqlalchemy+unsubscr...@googlegroups.com.
To post to this group, send email to sqlalchemy@googlegroups.com.
Visit this group at http://groups.google.com/group/sqlalchemy.
For more options, visit https://groups.google.com/groups/opt_out.

Reply via email to