Re: [sqlalchemy] SQLite: OperationalError when decoding 0x92 in TEXT column

2014-02-09 Thread Erich Blume
Thanks for the help on this, everyone! I found two ways to deal with this
and figured I should share in case it comes up in the future.

The first approach, the one I went with (because in my case, fidelity was
not as important) was to alter the 'text_factory' the sqlite3 uses. One
trick here was that to access this parameter for a sqlalchemy connection
object 'conn' as returned from an engine, I had to do it like so:

conn.connection.connection.text_factory

The extra indirection on .connection* was because the connection's internal
dbapi connection was in fact a proy '_ConnectionFairy' instance. A small
issue but one that's difficult to notice at first as it raises no error to
set a new property on an object that will never use that property.

The second approach I found was more involved but would be more appropriate
for large projects that didn't want to have this behavior on every column.
In this approach you still change the text_factory of sqlite3's connection,
but instead change it to `bytes` (or perhaps an identity? ie lambda x: x -
not sure what is best). This will cause sqlite3 to return encoded utf-8
bytes instead of unicode strings.

Then, you have to tell SQLAlchemy to convert these strings to unicode. I
did not persue this approach far enough to find the right set of arguments
but I imagine this would be very simple - set 'force_unicode' to True, I
suspect, would be all you would need.

Finally, for the column with the invalid utf-8 sequences, just also set the
`unicode_error` to your preferred resolution strategy - usually 'ignore' or
'replace'.

I suppose it is possible that this could incur a performance penalty - the
sqlite3 de/encoding process is done in a compiled C module and as such
could possibly be faster than using native python for the task. I suspect
though that the module is just calling to the usual Python library
functions for encoding/decoding (but did not check). So that may be of a
concern if you follow this approach.

Thanks again!


On Thu, Feb 6, 2014 at 4:19 PM, Michael Bayer mike...@zzzcomputing.comwrote:


 On Feb 6, 2014, at 6:59 PM, Erich Blume blume.er...@gmail.com wrote:

  Hmm, this one has me stumped. As best I can tell after poking at it
 using the column_reflect event, a custom dialect, etc. - the issue here is
 that in pysqlite.py we (in my Python 3.3 install) are selecting
 `sqlite3.dbapi2` as the dbapi interface, but we aren't telling sqlite3
 anything about how to treat unicode errors. From what I am reading (but it
 seems inconsistent, maybe?) sqlite3 automatically decodes all database
 retrieved values from their bytes for text fields, returning unicode
 strings. Except... that doesn't always seem to be true. I hex-edited a db
 file to change the utf-8 string hello to hell + 0x92 and sqlite3
 switched from returning hello to bhell\x92, or something like that -
 I've been poking at this for so long I've lost track of that transcript.
 
  One can override sqlite3's text factory, apparently, with (for instance)
 `sqlite3.text_factory = lambda x: x.decode('utf-8', errors='ignore')`.
 Maybe the key is to try and find a way to trigger that from sqlalchemy? I
 tried and failed, maybe someone else can point me back to the path?
 
  Just to re-summarize the problem: In python 3, I'm getting errors trying
 to read a row from a sqlite database that has a TEXT column with an invalid
 utf-8 sequence (specifically, the singleton bye '0x92'). I'd love to just
 have sqlalchemy move along and ignore the byte, but I'm not clear how to do
 that.

 Pysqlite (e.g. sqlite3 module) returns TEXT as Python unicode out of the
 gate.  That exception message is being raised by sqlite3 itself, SQLAlchemy
 is just a pass-through, as the string type knows on sqlite that the value
 is already unicode.

 you might need to CAST() the value as BINARY perhaps, not sure.   You’d
 first want to get a plain sqlite3 script to do what you want.   Setting a
 “text_factory” at the module level of sqlite3 is certainly easy enough but
 that seems way too broad.   Ideally you’d want to be able to get the value
 on a per-column basis.




-- 
You received this message because you are subscribed to the Google Groups 
sqlalchemy group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to sqlalchemy+unsubscr...@googlegroups.com.
To post to this group, send email to sqlalchemy@googlegroups.com.
Visit this group at http://groups.google.com/group/sqlalchemy.
For more options, visit https://groups.google.com/groups/opt_out.


Re: [sqlalchemy] SQLite: OperationalError when decoding 0x92 in TEXT column

2014-02-09 Thread Michael Bayer

On Feb 9, 2014, at 5:34 AM, Erich Blume blume.er...@gmail.com wrote:

 
 Then, you have to tell SQLAlchemy to convert these strings to unicode. I did 
 not persue this approach far enough to find the right set of arguments but I 
 imagine this would be very simple - set 'force_unicode' to True, I suspect, 
 would be all you would need.

if the SQLite connection is set up to return bytes ahead of when SQLAlchemy 
does anything with the connection, this will be automatic.  DBAPIs vary so much 
in this regard that we test the connection when the dialect first connects.   
but if you’re using connection events to achieve this, the event needs to be 
set in a certain way at the moment to make sure you get the connection before 
sqlalchemy does anything with it, there’s a ticket to document that.

 
 Finally, for the column with the invalid utf-8 sequences, just also set the 
 `unicode_error` to your preferred resolution strategy - usually 'ignore' or 
 'replace'.
 
 I suppose it is possible that this could incur a performance penalty - the 
 sqlite3 de/encoding process is done in a compiled C module and as such could 
 possibly be faster than using native python for the task.

SQLAlchemy’s C extensions do the encoding/decoding and this process has been 
enhanced in 0.9 to also take the job of an expensive and sometimes-necessary 
“check if it’s already unicode” step.I’ve already observed that SQLA’s C 
exts seem to be faster than MySQLdb’s “use_unicode”, Postgresqls unicode 
extension (which is unfortunate, we use that anyway) and using a unicode type 
with a cx_oracle outputtypehandler (which we’ve also stopped using as users 
complained about performance).


signature.asc
Description: Message signed with OpenPGP using GPGMail


Re: [sqlalchemy] SQLite: OperationalError when decoding 0x92 in TEXT column

2014-02-06 Thread Erich Blume
Hmm, this one has me stumped. As best I can tell after poking at it using 
the column_reflect event, a custom dialect, etc. - the issue here is that 
in pysqlite.py we (in my Python 3.3 install) are selecting `sqlite3.dbapi2` 
as the dbapi interface, but we aren't telling sqlite3 anything about how to 
treat unicode errors. From what I am reading (but it seems inconsistent, 
maybe?) sqlite3 automatically decodes all database retrieved values from 
their bytes for text fields, returning unicode strings. Except... that 
doesn't always seem to be true. I hex-edited a db file to change the utf-8 
string hello to hell + 0x92 and sqlite3 switched from returning hello 
to bhell\x92, or something like that - I've been poking at this for so 
long I've lost track of that transcript.

One can override sqlite3's text factory, apparently, with (for instance) 
`sqlite3.text_factory = lambda x: x.decode('utf-8', errors='ignore')`. 
Maybe the key is to try and find a way to trigger that from sqlalchemy? I 
tried and failed, maybe someone else can point me back to the path?

Just to re-summarize the problem: In python 3, I'm getting errors trying to 
read a row from a sqlite database that has a TEXT column with an invalid 
utf-8 sequence (specifically, the singleton bye '0x92'). I'd love to just 
have sqlalchemy move along and ignore the byte, but I'm not clear how to do 
that.

On Tuesday, February 4, 2014 4:33:46 AM UTC-8, Simon King wrote:

 I've not done much with reflection, but perhaps you could use the 
 column_reflect event: 

snip 

-- 
You received this message because you are subscribed to the Google Groups 
sqlalchemy group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to sqlalchemy+unsubscr...@googlegroups.com.
To post to this group, send email to sqlalchemy@googlegroups.com.
Visit this group at http://groups.google.com/group/sqlalchemy.
For more options, visit https://groups.google.com/groups/opt_out.


Re: [sqlalchemy] SQLite: OperationalError when decoding 0x92 in TEXT column

2014-02-06 Thread Michael Bayer

On Feb 6, 2014, at 6:59 PM, Erich Blume blume.er...@gmail.com wrote:

 Hmm, this one has me stumped. As best I can tell after poking at it using the 
 column_reflect event, a custom dialect, etc. - the issue here is that in 
 pysqlite.py we (in my Python 3.3 install) are selecting `sqlite3.dbapi2` as 
 the dbapi interface, but we aren't telling sqlite3 anything about how to 
 treat unicode errors. From what I am reading (but it seems inconsistent, 
 maybe?) sqlite3 automatically decodes all database retrieved values from 
 their bytes for text fields, returning unicode strings. Except... that 
 doesn't always seem to be true. I hex-edited a db file to change the utf-8 
 string hello to hell + 0x92 and sqlite3 switched from returning hello 
 to bhell\x92, or something like that - I've been poking at this for so long 
 I've lost track of that transcript.
 
 One can override sqlite3's text factory, apparently, with (for instance) 
 `sqlite3.text_factory = lambda x: x.decode('utf-8', errors='ignore')`. Maybe 
 the key is to try and find a way to trigger that from sqlalchemy? I tried and 
 failed, maybe someone else can point me back to the path?
 
 Just to re-summarize the problem: In python 3, I'm getting errors trying to 
 read a row from a sqlite database that has a TEXT column with an invalid 
 utf-8 sequence (specifically, the singleton bye '0x92'). I'd love to just 
 have sqlalchemy move along and ignore the byte, but I'm not clear how to do 
 that.

Pysqlite (e.g. sqlite3 module) returns TEXT as Python unicode out of the gate.  
That exception message is being raised by sqlite3 itself, SQLAlchemy is just a 
pass-through, as the string type knows on sqlite that the value is already 
unicode.

you might need to CAST() the value as BINARY perhaps, not sure.   You’d first 
want to get a plain sqlite3 script to do what you want.   Setting a 
“text_factory” at the module level of sqlite3 is certainly easy enough but that 
seems way too broad.   Ideally you’d want to be able to get the value on a 
per-column basis.




signature.asc
Description: Message signed with OpenPGP using GPGMail


[sqlalchemy] SQLite: OperationalError when decoding 0x92 in TEXT column

2014-02-04 Thread Erich Blume
I am working on a binding to a SQLite database that I do not control the 
creation of, with the aid of reflection. I'm running in to what I believe 
are very basic UTF-8 decoding errors. For instance, a TEXT cell has the 
byte '0x92' in it and is causing an OperationalError. Presumably, this is 
because 0x92 (by itself) is not a valid encoding for any Unicode code 
point. I would prefer that the decoding from UTF-8 to be forced, perhaps by 
dropping the bad byte. How can I do this?

The database has a table with a column called 'description', which is of 
type TEXT. The PRAGMA encoding is left at 'UTF-8', thank goodness. One of 
the rows, however, contains within its otherwise ascii byte contents the 
singleton byte '0x92'. Based on the context of the sentence, it seems that 
this was intended to be encoded as a single quotation mark, some googling 
suggests 'RIGHT SINGLE QUOTATION MARK' in unicode, which is '0xE2 0x80 
0x99'. I gather that MSSQL (which was the original source of the data in 
this database) uses Microsofts' infernal web encodings sometimes and that 
is probably the source of this byte.

The issue is this: I really need to read this data! It would be *ideal* to 
have the aid of something like python's 'replace' decoding handler but 
failing that just eliding the byte would do fine in a pinch.

When fetching this row in Python 3.3 with SQLAlchemy 0.9.1 my session looks 
vaguely like this (with the text and stack trace truncated out for brevity).

  File 
/usr/local/Cellar/python3/3.3.3/Frameworks/Python.framework/Versions/3.3/lib/python3.3/site-packages/sqlalchemy/engine/result.py,
 
line 760, in listcomp
return [process_row(metadata, row, processors, keymap)
  sqlalchemy.exc.OperationalError: (OperationalError) Could not decode 
to UTF-8 column 'description' with text ...

Is there some way to accomplish this?

Thanks!

-- 
You received this message because you are subscribed to the Google Groups 
sqlalchemy group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to sqlalchemy+unsubscr...@googlegroups.com.
To post to this group, send email to sqlalchemy@googlegroups.com.
Visit this group at http://groups.google.com/group/sqlalchemy.
For more options, visit https://groups.google.com/groups/opt_out.


Re: [sqlalchemy] SQLite: OperationalError when decoding 0x92 in TEXT column

2014-02-04 Thread Erich Blume
Thanks Simon,

Do you know how I might use that with reflection? There's several hundred
of these columns, I'd hate to have to override each one individually - that
sort of defeats the purpose of reflection.

One thought I just had was perhaps I could subclass the Text type and then
override the ischema_names for SQLite for TEXT type. That'd do the trick, I
suspect!


On Tue, Feb 4, 2014 at 3:26 AM, Simon King si...@simonking.org.uk wrote:

 On Tue, Feb 4, 2014 at 10:15 AM, Erich Blume blume.er...@gmail.com
 wrote:
  I am working on a binding to a SQLite database that I do not control the
  creation of, with the aid of reflection. I'm running in to what I believe
  are very basic UTF-8 decoding errors. For instance, a TEXT cell has the
 byte
  '0x92' in it and is causing an OperationalError. Presumably, this is
 because
  0x92 (by itself) is not a valid encoding for any Unicode code point. I
 would
  prefer that the decoding from UTF-8 to be forced, perhaps by dropping the
  bad byte. How can I do this?
 
  The database has a table with a column called 'description', which is of
  type TEXT. The PRAGMA encoding is left at 'UTF-8', thank goodness. One
 of
  the rows, however, contains within its otherwise ascii byte contents the
  singleton byte '0x92'. Based on the context of the sentence, it seems
 that
  this was intended to be encoded as a single quotation mark, some googling
  suggests 'RIGHT SINGLE QUOTATION MARK' in unicode, which is '0xE2 0x80
  0x99'. I gather that MSSQL (which was the original source of the data in
  this database) uses Microsofts' infernal web encodings sometimes and
 that is
  probably the source of this byte.
 
  The issue is this: I really need to read this data! It would be *ideal*
 to
  have the aid of something like python's 'replace' decoding handler but
  failing that just eliding the byte would do fine in a pinch.
 
  When fetching this row in Python 3.3 with SQLAlchemy 0.9.1 my session
 looks
  vaguely like this (with the text and stack trace truncated out for
 brevity).
 
File
 
 /usr/local/Cellar/python3/3.3.3/Frameworks/Python.framework/Versions/3.3/lib/python3.3/site-packages/sqlalchemy/engine/result.py,
  line 760, in listcomp
  return [process_row(metadata, row, processors, keymap)
sqlalchemy.exc.OperationalError: (OperationalError) Could not
 decode
  to UTF-8 column 'description' with text ...
 
  Is there some way to accomplish this?
 

 The String-related column types have a unicode_error parameter which
 sounds like it might be what you want:


 http://docs.sqlalchemy.org/en/rel_0_9/core/types.html#sqlalchemy.types.String.params.unicode_error

 Note the various warnings around it though...

 Hope that helps,

 Simon

 --
 You received this message because you are subscribed to a topic in the
 Google Groups sqlalchemy group.
 To unsubscribe from this topic, visit
 https://groups.google.com/d/topic/sqlalchemy/T--Ftk5EVZg/unsubscribe.
 To unsubscribe from this group and all its topics, send an email to
 sqlalchemy+unsubscr...@googlegroups.com.
 To post to this group, send email to sqlalchemy@googlegroups.com.
 Visit this group at http://groups.google.com/group/sqlalchemy.
 For more options, visit https://groups.google.com/groups/opt_out.


-- 
You received this message because you are subscribed to the Google Groups 
sqlalchemy group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to sqlalchemy+unsubscr...@googlegroups.com.
To post to this group, send email to sqlalchemy@googlegroups.com.
Visit this group at http://groups.google.com/group/sqlalchemy.
For more options, visit https://groups.google.com/groups/opt_out.


Re: [sqlalchemy] SQLite: OperationalError when decoding 0x92 in TEXT column

2014-02-04 Thread Simon King
I've not done much with reflection, but perhaps you could use the
column_reflect event:

  
http://docs.sqlalchemy.org/en/rel_0_8/core/events.html#sqlalchemy.events.DDLEvents.column_reflect

Simon

On Tue, Feb 4, 2014 at 11:28 AM, Erich Blume blume.er...@gmail.com wrote:
 Thanks Simon,

 Do you know how I might use that with reflection? There's several hundred of
 these columns, I'd hate to have to override each one individually - that
 sort of defeats the purpose of reflection.

 One thought I just had was perhaps I could subclass the Text type and then
 override the ischema_names for SQLite for TEXT type. That'd do the trick, I
 suspect!


 On Tue, Feb 4, 2014 at 3:26 AM, Simon King si...@simonking.org.uk wrote:

 On Tue, Feb 4, 2014 at 10:15 AM, Erich Blume blume.er...@gmail.com
 wrote:
  I am working on a binding to a SQLite database that I do not control the
  creation of, with the aid of reflection. I'm running in to what I
  believe
  are very basic UTF-8 decoding errors. For instance, a TEXT cell has the
  byte
  '0x92' in it and is causing an OperationalError. Presumably, this is
  because
  0x92 (by itself) is not a valid encoding for any Unicode code point. I
  would
  prefer that the decoding from UTF-8 to be forced, perhaps by dropping
  the
  bad byte. How can I do this?
 
  The database has a table with a column called 'description', which is of
  type TEXT. The PRAGMA encoding is left at 'UTF-8', thank goodness. One
  of
  the rows, however, contains within its otherwise ascii byte contents the
  singleton byte '0x92'. Based on the context of the sentence, it seems
  that
  this was intended to be encoded as a single quotation mark, some
  googling
  suggests 'RIGHT SINGLE QUOTATION MARK' in unicode, which is '0xE2 0x80
  0x99'. I gather that MSSQL (which was the original source of the data in
  this database) uses Microsofts' infernal web encodings sometimes and
  that is
  probably the source of this byte.
 
  The issue is this: I really need to read this data! It would be *ideal*
  to
  have the aid of something like python's 'replace' decoding handler but
  failing that just eliding the byte would do fine in a pinch.
 
  When fetching this row in Python 3.3 with SQLAlchemy 0.9.1 my session
  looks
  vaguely like this (with the text and stack trace truncated out for
  brevity).
 
File
 
  /usr/local/Cellar/python3/3.3.3/Frameworks/Python.framework/Versions/3.3/lib/python3.3/site-packages/sqlalchemy/engine/result.py,
  line 760, in listcomp
  return [process_row(metadata, row, processors, keymap)
sqlalchemy.exc.OperationalError: (OperationalError) Could not
  decode
  to UTF-8 column 'description' with text ...
 
  Is there some way to accomplish this?
 

 The String-related column types have a unicode_error parameter which
 sounds like it might be what you want:


 http://docs.sqlalchemy.org/en/rel_0_9/core/types.html#sqlalchemy.types.String.params.unicode_error

 Note the various warnings around it though...

 Hope that helps,

 Simon

 --
 You received this message because you are subscribed to a topic in the
 Google Groups sqlalchemy group.
 To unsubscribe from this topic, visit
 https://groups.google.com/d/topic/sqlalchemy/T--Ftk5EVZg/unsubscribe.
 To unsubscribe from this group and all its topics, send an email to
 sqlalchemy+unsubscr...@googlegroups.com.

 To post to this group, send email to sqlalchemy@googlegroups.com.
 Visit this group at http://groups.google.com/group/sqlalchemy.
 For more options, visit https://groups.google.com/groups/opt_out.


 --
 You received this message because you are subscribed to the Google Groups
 sqlalchemy group.
 To unsubscribe from this group and stop receiving emails from it, send an
 email to sqlalchemy+unsubscr...@googlegroups.com.
 To post to this group, send email to sqlalchemy@googlegroups.com.
 Visit this group at http://groups.google.com/group/sqlalchemy.
 For more options, visit https://groups.google.com/groups/opt_out.

-- 
You received this message because you are subscribed to the Google Groups 
sqlalchemy group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to sqlalchemy+unsubscr...@googlegroups.com.
To post to this group, send email to sqlalchemy@googlegroups.com.
Visit this group at http://groups.google.com/group/sqlalchemy.
For more options, visit https://groups.google.com/groups/opt_out.