STRING data corruption in internationalized data -- based on LANG env variable
------------------------------------------------------------------------------

                 Key: HIVE-2859
                 URL: https://issues.apache.org/jira/browse/HIVE-2859
             Project: Hive
          Issue Type: Bug
          Components: Configuration, Import/Export, Serializers/Deserializers, Types
    Affects Versions: 0.7.1
         Environment: Windows / RHEL5 with LANG = en_US.CP1252
            Reporter: John Gordon
             Fix For: 0.9.0, 0.7.1


This is a bug in Hive that is exacerbated by replatforming it to Windows 
without CYGWIN.  Basically, Hive assumes that the default file.encoding is 
UTF-8.  There are something like 6-7 getBytes() and write() calls that don't 
specify an encoding, so they fall back to the JVM default; the rest specify 
UTF-8 explicitly, which also blocks auto-detection of UTF-16 data in files 
with a BOM present.  The mix of explicit encodings and default-encoding 
assumptions means that Hive must be run in a JVM whose default encoding is 
UTF-8 and only UTF-8 (a minimal illustration follows).
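For illustration only -- this is not the actual Hive code, and the class and 
variable names are made up -- here is a minimal sketch of the two styles of 
call mixed together: one conversion that silently uses the JVM default 
charset, and one that is pinned to UTF-8.

    import java.io.UnsupportedEncodingException;

    public class MixedEncodingExample {
        public static void main(String[] args) throws UnsupportedEncodingException {
            String value = "caf\u00e9";  // non-ASCII sample data ("café")

            // Style A: no charset argument -- encodes with the JVM default
            // (file.encoding). On a CP1252 JVM this does NOT produce UTF-8 bytes.
            byte[] defaultBytes = value.getBytes();

            // Style B: charset pinned to UTF-8, independent of the JVM default.
            byte[] utf8Bytes = value.getBytes("UTF-8");

            // The two encodings don't even agree on length for non-ASCII data.
            System.out.println(utf8Bytes.length + " vs " + defaultBytes.length);  // 5 vs 4 on CP1252

            // When style-A bytes are later decoded by style-B code (or vice
            // versa), the round trip is no longer lossless.
            String roundTrip = new String(defaultBytes, "UTF-8");
            System.out.println(value.equals(roundTrip));  // false on a CP1252-default JVM
        }
    }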
 
When the JVM starts up, it derives its default encoding from the C runtime 
setlocale() call.  On Linux/Unix this follows the LANG env variable (which is 
almost always <locale>.UTF8 on machines handling internationalized data, but 
that is not guaranteed).  On Windows it is derived from the user's language 
settings and currently cannot come out as UTF-8.  So there is no environment 
setting on Windows that will reliably cause the JVM to pick UTF-8 as its 
default encoding at startup without additional options.
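A quick diagnostic (just a sketch, not part of Hive) to see which charset a 
given JVM actually picked, and to confirm that -Dfile.encoding=UTF-8 
overrides the locale-derived choice:

    import java.nio.charset.Charset;

    public class DefaultCharsetCheck {
        public static void main(String[] args) {
            // What the JVM derived from the OS locale at startup
            // (or from an explicit -Dfile.encoding=... override).
            System.out.println("file.encoding    = " + System.getProperty("file.encoding"));
            System.out.println("defaultCharset() = " + Charset.defaultCharset().name());
        }
    }

    // Example runs (Linux):
    //   LANG=en_US.CP1252 java DefaultCharsetCheck       -> a non-UTF-8 charset (windows-1252)
    //   LANG=en_US.UTF-8  java DefaultCharsetCheck       -> UTF-8
    //   java -Dfile.encoding=UTF-8 DefaultCharsetCheck   -> UTF-8, regardless of LANG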

However, there are two feasible options: 
1.) The JVM has a startup option, -Dfile.encoding=UTF-8, which explicitly 
overrides the default encoding detection in the JVM and makes it UTF-8 
regardless of the environment configuration.  This would make all deployments 
on all OS/environment configs behave consistently.  I don't know where Hive 
sets the JVM options it uses when it starts the service.
2.) We could add "UTF-8" explicitly to all the remaining getBytes() calls that 
need it, and make all string I/O explicitly UTF-8 encoded (see the sketch 
after this list).  This is probably being changed right now as part of 
HIVE-1505, so we would duplicate effort and maybe make that change harder.  
It seems easier to trick the JVM into behaving like it does on a 
well-configured machine with respect to the default encoding than to set 
explicit encodings everywhere.
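As a reference point, here is a hypothetical sketch of what option 2 looks 
like at a single call site (illustrative names only; the real call sites are 
the ones HIVE-1505 is touching):

    import java.io.IOException;
    import java.io.OutputStream;
    import java.io.OutputStreamWriter;
    import java.io.Writer;

    public class ExplicitUtf8CallSite {

        // Before: text.getBytes() and new OutputStreamWriter(stream) both
        //         silently depend on the JVM default charset.
        // After:  the charset is pinned at every conversion and every writer.
        public static byte[] encode(String text) throws IOException {
            return text.getBytes("UTF-8");                 // explicit charset
        }

        public static Writer openWriter(OutputStream sink) throws IOException {
            return new OutputStreamWriter(sink, "UTF-8");  // explicit charset
        }
    }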
 
So:
- Pretty much any internationalized strings other than Western European are 
going to be corrupted in the current Hive service on Windows while this bug 
is present, because there is no way to have the JVM read the environment and 
decide by default that UTF-8 should be the default encoding.
- Anyone can repro this on Linux fairly easily: add "export 
LANG=en_US.CP1252" to /etc/profile to force the global default encoding to 
CP1252, then restart the service and run a query over internationalized 
UTF-8 data (the sketch after this list shows the corruption mechanism in 
isolation).
- We shouldn't rely on the JVM's default codepage selection if we want to 
support UTF-8 consistently and reliably as the default encoding.
- Estimates can range widely, but adding an explicit default encoding at 
startup should, in theory, only take a little while if you know where to do 
it.
- I don't know where to update the JVM start arguments when the service is 
started; I'm just getting into the code for the first time with this bug 
investigation.
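The repro above reduces to a single decode step.  A minimal sketch, assuming 
a UTF-8 encoded input and a JVM started under LANG=en_US.CP1252 (so its 
default charset is windows-1252):

    import java.io.UnsupportedEncodingException;

    public class CorruptionRepro {
        public static void main(String[] args) throws UnsupportedEncodingException {
            // UTF-8 bytes as they would sit in the data file
            // (the word "café" written by a UTF-8 producer).
            byte[] utf8OnDisk = "caf\u00e9".getBytes("UTF-8");

            // A read path that uses the JVM default charset instead of UTF-8.
            // Under CP1252 the two-byte UTF-8 sequence for 'é' decodes as two
            // separate characters -- classic mojibake.
            String asRead = new String(utf8OnDisk);

            System.out.println(asRead);  // "cafÃ©" instead of "café" on a CP1252-default JVM
        }
    }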
