On Nov 17, 2010, at 10:26 AM, Guilherme Menezes wrote:

> Hi,
> 
> I am having an encoding issue using Python 2.6.2, SqlAlchemy 0.6.5 and
> cxOracle 5.0.4 to access an Oracle 11g database.
> 
> I am using the NLS_LANG=.AL32UTF8 environment variable. My table in
> Oracle is correctly configured to accept Unicode.
> 
> First, I compiled cxOracle without the WITH_UNICODE flag (as
> recommended by SqlAlchemy).  I was passing data to SqlAlchemy as
> unicode() objects. Everything worked without errors or warnings.
> 
> However, sometimes Oracle would complain that the string I was trying
> to insert into a VARCHAR2 field was too big (> 4000), even when the
> string size ( len(the_string.encode('utf-8')) ) was about 2300 bytes
> (before SqlAlchemy). I used a sniffer to verify that the Oracle client
> was sending two bytes for each character (even the ASCII ones),
> instead of sending two bytes only for special characters.
> 
> I repeated this insertion experiment using cx_Oracle directly and
> passing unicode() objects to it. Same thing: no errors or warnings,
> and again the sniffer showed that the Oracle client was sending two
> bytes per character. It seemed to me that cx_Oracle does not really
> support unicode() objects; it accepts them but does not encode()
> them to the encoding set in the NLS_LANG variable.
> 
> However, when reading data from Oracle everything worked as expected
> (i.e. someone was converting the UTF-16(?) string to UTF-8, probably
> Oracle itself).
> 
> I tried two approaches to make the Oracle client send data in UTF-8
> (not UTF-16):
> 
> 1. Passing str() objects to SqlAlchemy. SqlAlchemy would then complain
> in a warning:
> * UnicodeWarning: Unicode equal comparison failed to convert both
> arguments to Unicode - interpreting them as being unequal *
> It worked, but it did not seem right, since I was receiving unicode()
> objects and sending str() objects to SqlAlchemy. Moreover, I did not
> know how things work inside SqlAlchemy, so I tried something else.
> 
> 2. Compiling cx_Oracle with the WITH_UNICODE flag solved the problem
> (utf-8 was sent to Oracle), but another warning was issued:
> * SAWarning: cx_Oracle is compiled under Python 2.xx using the
> WITH_UNICODE flag.  Consider recompiling cx_Oracle without this flag,
> which is in no way necessary for full support of Unicode. Otherwise,
> all string-holding bind parameters must be explicitly typed using
> SQLAlchemy's String type or one of its subtypes,or otherwise be passed
> as Python unicode.  Plain Python strings passed as bind parameters
> will be silently corrupted by cx_Oracle. *
> 
> Right now I am using cx_Oracle compiled with WITH_UNICODE, everything
> works, but this warning is worrying me. I tried to pass a str() object
> in UTF-8 to SqlAlchemy, and the execution halted with an encoding
> error (UnicodeDecodeError). In other words, cx_Oracle did not silently
> corrupt the data (at least when special characters were sent).
> 
> I would like to know if anyone has a better (safer) solution to my
> problem. Or am I safe enough by using the WITH_UNICODE flag and
> passing only unicode() objects?


Regarding the passing of utf-8 vs. utf-16 and the differences in WITH_UNICODE 
mode, you should really email the cx_oracle list about that, as WITH_UNICODE to 
my knowledge only affects the Python-level interpretation of arguments, not 
the interaction with OCI.   If you email their list, make sure all code 
examples use cx_oracle alone, to eliminate any doubt that each behavior 
comes from cx_oracle directly.  WITH_UNICODE was not intended for 
general use in Python 2.xx.   If your whole application really works with it, 
then there's no reason not to use it; it's just that you'll always have to 
ensure that no plain (non-unicode) Python strings ever get sent to cx_oracle, 
which can be fairly tedious.   It sounds like a bug that cx_oracle would be 
expanding into a two-byte-per-character stream like that.
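The byte arithmetic behind the "too big (> 4000)" error can be checked directly in Python (Python 3 syntax for brevity; the string content here is made up for illustration): a mostly-ASCII string of ~2300 characters stays well under 4000 bytes in UTF-8, but blows past the limit if the client expands every character to two bytes, matching the sniffer observation above.

```python
# -*- coding: utf-8 -*-
# 2298 ASCII characters plus 2 accented characters: 2300 chars total.
text = u"abc" * 766 + u"\u00e9\u00e9"

# In UTF-8, ASCII stays 1 byte per char; the accented chars take 2 each.
utf8_len = len(text.encode("utf-8"))

# In a 2-bytes-per-character stream (UTF-16-style, no BOM), every
# character costs 2 bytes -- even the ASCII ones.
utf16_len = len(text.encode("utf-16-be"))

print(utf8_len)   # 2302 -- fits in VARCHAR2(4000)
print(utf16_len)  # 4600 -- over the 4000-byte limit
```

So a string that measures ~2300 bytes in UTF-8 before the driver can still be rejected if the wire format doubles it.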

It's possible that calling setinputsizes() with cx_oracle.UNICODE is the key 
to cx_oracle's behavior here.   We call setinputsizes() for every Oracle 
cursor, but we currently exclude the string types from that list, as it had 
some unwanted side effects.
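A minimal sketch of that experiment against cx_oracle directly, assuming a live Oracle connection (the DSN, table name, and column name are placeholders, not anything from the original report):

```python
import cx_Oracle

conn = cx_Oracle.connect("user/password@dsn")  # placeholder credentials
cursor = conn.cursor()

# Declare the bind parameter as UNICODE before executing.  This is the
# kind of per-parameter typing SQLAlchemy's setinputsizes() call does
# for most types, but currently skips for the string types.
cursor.setinputsizes(name=cx_Oracle.UNICODE)
cursor.execute(
    "INSERT INTO some_table (name) VALUES (:name)",
    name=u"caf\u00e9",
)
conn.commit()
```

Running this with and without the setinputsizes() call, with the sniffer attached, would show whether the parameter typing changes what goes over the wire.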

Also note that VARCHAR2(4000) measures its length in bytes, not characters.  
For true Unicode support Oracle provides the NVARCHAR2 type, where the "length" 
specifies the number of characters instead of bytes.   Recent versions of 
Oracle also support the form VARCHAR2(4000 CHAR), which similarly measures 
the column in characters, based on the database's encoding, instead of 
bytes.    It's worth investigating whether using NVARCHAR2 causes cx_oracle, 
or OCI, to change its behavior.
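The practical effect of byte semantics is easy to quantify; a small sketch (the character widths are per UTF-8's encoding rules, nothing Oracle-specific):

```python
# How many characters actually fit in a byte-sized column depends on
# how wide each character is in the database encoding (AL32UTF8 here:
# 1 to 4 bytes per character).
LIMIT_BYTES = 4000

def max_chars(bytes_per_char, limit=LIMIT_BYTES):
    """Worst-case character capacity of a byte-measured column."""
    return limit // bytes_per_char

print(max_chars(1))  # 4000 pure-ASCII characters
print(max_chars(2))  # 2000 Latin accented characters (2 bytes in UTF-8)
print(max_chars(3))  # 1333 typical CJK characters (3 bytes in UTF-8)
```

With VARCHAR2(4000 CHAR) or NVARCHAR2, the declared length is the character count itself, so the capacity no longer shrinks with wider characters.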


> 
> Thank you in advance.
> 
> Guilherme.
> 
> -- 
> You received this message because you are subscribed to the Google Groups 
> "sqlalchemy" group.
> To post to this group, send email to sqlalch...@googlegroups.com.
> To unsubscribe from this group, send email to 
> sqlalchemy+unsubscr...@googlegroups.com.
> For more options, visit this group at 
> http://groups.google.com/group/sqlalchemy?hl=en.
> 
