Thanks Antoine, that makes good sense.
We are writing string data using the utf8 data type. This question came up
when trying to read this fastparquet project test file into arrow memory:
fastparquet/test-data/nation.dict.parquet
The name and comment columns result in a binary data type. Digging into the
file, it seems these columns lack the UTF8 annotation, so that makes sense.
I wonder if this is somehow a "vintage" parquet file and whether more recent
parquet writers will include the UTF8 type.
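For what it's worth, here is a minimal sketch (plain Python, no pyarrow dependency; the byte values below are made up for illustration) of how a consumer might probe binary values that are suspected to hold text, falling back to None when the bytes are not valid in the assumed encoding:

```python
def try_decode(values, encoding="utf-8"):
    """Attempt to decode raw binary values as text; None on failure."""
    decoded = []
    for v in values:
        try:
            decoded.append(v.decode(encoding))
        except UnicodeDecodeError:
            # Bytes are not valid in the assumed encoding; leave a hole
            # rather than guessing at a replacement.
            decoded.append(None)
    return decoded

# Hypothetical payloads: two ASCII strings and one invalid UTF-8 sequence.
raw = [b"ALGERIA", b"ARGENTINA", b"\xff\xfe"]
print(try_decode(raw))  # -> ['ALGERIA', 'ARGENTINA', None]
```

This only makes the application's assumption explicit; it doesn't recover the true encoding, which is exactly the metadata the binary type omits.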
On 2/6/19, 11:54 AM, "Antoine Pitrou" <[email protected]> wrote:
Hi Hatem,
It is intended that the convention is application-dependent. From
Arrow's point of view, the binary string is an opaque blob of data.
Depending on your application, it might be a UTF-16-encoded piece of
text, a JPEG image, anything.
By the way, if you store ASCII text data, I would recommend using the
utf8 type, since the UTF-8 encoding is a superset of ASCII.
Regards
Antoine.
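Antoine's point that UTF-8 is a superset of ASCII can be checked directly: any pure-ASCII byte sequence decodes to the same text under both codecs, so storing ASCII data with the utf8 type loses nothing. A quick stdlib check (the sample string is arbitrary):

```python
# Any ASCII byte sequence decodes identically as ASCII and as UTF-8.
ascii_bytes = b"nation comment"
assert ascii_bytes.decode("ascii") == ascii_bytes.decode("utf-8")

# Every ASCII code point (0-127) maps to the same single byte in UTF-8.
assert all(chr(i).encode("utf-8") == bytes([i]) for i in range(128))
print("ASCII round-trips unchanged under UTF-8")
```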
On 06/02/2019 at 11:34, Hatem Helal wrote:
> Hi all,
>
> I wanted to make sure I understood the distinction and use cases for choosing
> between the utf8 and binary logical types.
>
> Based on this doc
> <https://arrow.apache.org/docs/format/Metadata.html#utf8-and-binary>
>
> * Utf8 data is Unicode values with UTF-8 encoding
> * Binary is any other variable length bytes
>
> I wonder what the correct way is to consume a binary array. It seems
> like a binary array is likely representing some string data, but without the
> encoding it isn't clear how to safely interpret it. Is there a convention
> (e.g. assume a binary type is ASCII-encoded) that we can follow?
>
> Many thanks,
>
> Hatem
>