Edmon Begoli created DRILL-3712:
-----------------------------------

             Summary: Drill does not recognize UTF-16-LE encoding
                 Key: DRILL-3712
                 URL: https://issues.apache.org/jira/browse/DRILL-3712
             Project: Apache Drill
          Issue Type: Bug
          Components: Storage - Text & CSV
    Affects Versions: 1.1.0
         Environment: OSX, likely Linux. 
            Reporter: Edmon Begoli
            Assignee: Steven Phillips


We are unable to process files that OSX identifies as character set UTF16LE.  
After unzipping one and converting it to UTF-8, we are able to process it fine.  
There are CONVERT_TO and CONVERT_FROM functions that appear to address the issue, 
but we were unable to make them work on a gzipped or unzipped version of the 
UTF-16 file.  We could call CONVERT_FROM on its own, but when we tried to wrap 
its result in a cast to a date, or anything else, it failed.  Trying to work 
with the data natively exposed its double-byte nature (a substring 1,4 returns 
only the first two characters).
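
To illustrate the substring symptom outside Drill, here is a minimal Python 3 
sketch. It assumes the field is read byte by byte, which is only my guess at 
what happens internally, and the sample value "2015-08-26" is hypothetical, 
not taken from our data. It shows that four bytes of UTF-16-LE text hold only 
two characters, and that decoding first restores the expected result:

```python
# Illustration only: why a byte-oriented substring(1,4) yields two characters.
text = "2015-08-26"  # hypothetical field value, not from the real data set

# UTF-16-LE uses 2 bytes per character for ASCII-range text like this.
raw = text.encode("utf-16-le")

# What a 4-unit substring sees if every byte is treated as one character.
first_four_bytes = raw[:4]

# Those four bytes decode to only two real characters.
two_chars = first_four_bytes.decode("utf-16-le")

# Decoding the whole field first restores character semantics;
# the text can then be re-encoded as UTF-8.
utf8_bytes = raw.decode("utf-16-le").encode("utf-8")

print(repr(first_four_bytes))
print(two_chars)
print(utf8_bytes[:4].decode("utf-8"))
```

The same four-position substring returns the full year once the data is UTF-8, 
which matches what we saw after converting the file by hand.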

I cannot post the data because it is proprietary, but I am posting this code, 
which might be useful in re-creating the issue:


#!/usr/bin/env python
""" Generates a test psv file with some text fields encoded as UTF-16-LE. """
def write_utf16le_encoded_psv():
        total_lines = 10
        # Only the middle field is UTF-16-LE; the rest of each line is ASCII.
        encoded = u"Encoded B".encode("utf-16-le")
        with open("test.psv","wb") as csv_file:
                # Byte literals keep this runnable on both Python 2 and 3.
                csv_file.write(b"header 1|header 2|header 3\n")
                for i in range(total_lines):
                        n = str(i).encode("ascii")
                        csv_file.write(b"value A"+n+b"|"+encoded+b"|value C"+n+b"\n")

if __name__ == "__main__":
        write_utf16le_encoded_psv()


then:

tar zcvf test.psv.tar.gz test.psv






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
