Re: UTF-8 and Binary logical types

Hatem Helal Wed, 06 Feb 2019 08:44:42 -0800

Hi Wes,

Yes, the UTF8 ConvertedType is what I was after.  Thanks for the helpful 
references.


I don't have a good feel for how common this is, but the following test file
is what caused my confusion between the UTF8 and Binary types in Arrow:

https://github.com/dask/fastparquet/blob/master/test-data/nation.dict.parquet

I debugged this and found that the ConvertedType isn't set for the columns
containing string data, which results in using LogicalType::NONE, here:

https://github.com/apache/arrow/blob/master/cpp/src/parquet/schema.cc#L310

I haven't found other files that exhibit this behavior, and now that I
understand it a bit more I think the behavior is ok.
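
For anyone who hits the same thing, here is a minimal sketch of how a consumer
might treat values from a binary column whose ConvertedType is unset. It is
plain Python, not an Arrow or Parquet API; the helper name interpret_binary and
the sample values are made up for illustration. It tries a strict UTF-8 decode
(which also covers ASCII) and otherwise leaves the bytes alone:

```python
from typing import Union


def interpret_binary(value: bytes) -> Union[str, bytes]:
    """Hypothetical helper: return text if the bytes are valid UTF-8,
    otherwise return the raw bytes unchanged."""
    try:
        # Strict decoding raises on any byte sequence that is not valid UTF-8.
        return value.decode("utf-8")
    except UnicodeDecodeError:
        # Not UTF-8: treat the value as an opaque blob.
        return value


# Illustrative values only, not read from the test file.
samples = [b"plain ascii", "caf\u00e9".encode("utf-8"), b"\xff\xd8\xff\xe0"]
for raw in samples:
    print(type(interpret_binary(raw)).__name__)
```

Note that a successful decode only shows the bytes are valid UTF-8, not that
the writer intended them as text, so this is a heuristic the application has to
opt into rather than something a reader should assume.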

Hatem

On 2/6/19, 3:37 PM, "Wes McKinney" <wesmck...@gmail.com> wrote:

    hi Hatem,
    
    Are you talking about the UTF8 ConvertedType in Parquet?
    
    
    https://github.com/apache/arrow/blob/master/cpp/src/parquet/parquet.thrift#L52
    
    AFAIK we do respect that if it is set; otherwise we do not guess.
    
    
    https://github.com/apache/arrow/blob/master/cpp/src/parquet/arrow/schema.cc#L65
    
    - Wes
    
    On Wed, Feb 6, 2019 at 7:07 AM Hatem Helal <hatem.he...@mathworks.co.uk> wrote:
    >
    > Thanks Antoine, that makes good sense.
    >
    > We are writing string data using the utf8 data type.  This question came
    > up when trying to read this fastparquet project test file into arrow memory:
    >
    >         fastparquet/test-data/nation.dict.parquet
    >
    > The name and comment columns result in a binary data type.  Digging into
    > the file, it seems these columns lack the UTF8 type, so that makes
    > sense.  I wonder if this is somehow a "vintage" parquet file and more recent
    > parquet writers will include the UTF8 type.
    >
    > On 2/6/19, 11:54 AM, "Antoine Pitrou" <anto...@python.org> wrote:
    >
    >
    >     Hi Hatem,
    >
    >     It is intended that the convention is application-dependent.  From
    >     Arrow's point of view, the binary string is an opaque blob of data.
    >     Depending on your application, it might be a UTF16-encoded piece of
    >     text, a JPEG image, anything.
    >
    >     By the way, if you store ASCII text data, I would recommend using the
    >     utf8 type, since the UTF-8 encoding is a superset of ASCII.
    >
    >     Regards
    >
    >     Antoine.
    >
    >
    >     On 06/02/2019 at 11:34, Hatem Helal wrote:
    >     > Hi all,
    >     >
    >     > I wanted to make sure I understood the distinction/use cases for
    >     > choosing between the utf8 and binary logical types.
    >     >
    >     > Based on this doc
    >     > <https://arrow.apache.org/docs/format/Metadata.html#utf8-and-binary>
    >     >
    >     > * Utf8 data is Unicode values with UTF-8 encoding
    >     > * Binary is any other variable length bytes
    >     >
    >     > I wonder what is the correct way to consume a binary array.  It
    >     > seems like a binary array is likely representing some string data, but
    >     > without the encoding it isn't clear how to safely interpret it.  Is there
    >     > a convention (e.g. assume a binary type is ASCII encoded) that we can follow?
    >     >
    >     > Many thanks,
    >     >
    >     > Hatem
    >     >
    >
    >
