Hi Hatem,

Are you talking about the UTF8 ConvertedType in Parquet?

https://github.com/apache/arrow/blob/master/cpp/src/parquet/parquet.thrift#L52

AFAIK we respect that annotation if it is set; otherwise we do not guess
and the column is read as binary:

https://github.com/apache/arrow/blob/master/cpp/src/parquet/arrow/schema.cc#L65
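
If you still need text from such a column on the application side, here is a
minimal sketch of one possible approach, in plain Python and not using any
Arrow API. The helper name decode_if_utf8 and the sample values are made up
for illustration. It decodes strictly and gives up on the first invalid
value instead of guessing an encoding:

```python
from typing import List, Optional


def decode_if_utf8(values: List[Optional[bytes]]) -> Optional[List[Optional[str]]]:
    """Return the values decoded as UTF-8, or None if any value is not valid UTF-8.

    Hypothetical helper for illustration; this is an application-level
    convention, not something Arrow or Parquet does for you.
    """
    decoded: List[Optional[str]] = []
    for value in values:
        if value is None:
            # Preserve nulls as-is.
            decoded.append(None)
            continue
        try:
            # bytes.decode is strict by default and raises on invalid input.
            decoded.append(value.decode("utf-8"))
        except UnicodeDecodeError:
            # Do not guess: the caller should keep treating the data as binary.
            return None
    return decoded


# Made-up sample values standing in for a binary column.
names = [b"ALGERIA", b"ARGENTINA", None]
# First bytes of a JPEG file; 0xFF can never appear in valid UTF-8.
blob = [b"\xff\xd8\xff\xe0"]

print(decode_if_utf8(names))
print(decode_if_utf8(blob))
```

Note that passing this check does not prove the bytes were meant as text
(some non-text payloads happen to be valid UTF-8), so it is only safe when
you already have reason to believe the writer stored strings.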

- Wes

On Wed, Feb 6, 2019 at 7:07 AM Hatem Helal <hatem.he...@mathworks.co.uk> wrote:
>
> Thanks Antoine, that makes good sense.
>
> We are writing string data using the utf8 data type.  This question came up 
> when trying to read this fastparquet project test file into arrow memory:
>
>         fastparquet/test-data/nation.dict.parquet
>
> The name and comment columns result in a binary data type.  Digging into the 
> file it seems like these columns are lacking the utf8 type so that makes 
> sense.  I wonder if this is somehow a "vintage" parquet file and more recent 
> parquet writers will include the UTF8 type.
>
> On 2/6/19, 11:54 AM, "Antoine Pitrou" <anto...@python.org> wrote:
>
>
>     Hi Hatem,
>
>     It is intended that the convention is application-dependent.  From
>     Arrow's point of view, the binary string is an opaque blob of data.
>     Depending on your application, it might be a UTF-16-encoded piece of
>     text, a JPEG image, anything.
>
>     By the way, if you store ASCII text data, I would recommend using the
>     utf8 type, since the UTF-8 encoding is a superset of ASCII.
>
>     Regards
>
>     Antoine.
>
>
>     Le 06/02/2019 à 11:34, Hatem Helal a écrit :
>     > Hi all,
>     >
>     > I wanted to make sure I understood the distinction/use cases for
>     > choosing between the utf8 and binary logical types.
>     >
>     > Based on this doc
>     > <https://arrow.apache.org/docs/format/Metadata.html#utf8-and-binary>
>     >
>     > * Utf8 data is Unicode values with UTF-8 encoding
>     > * Binary is any other variable length bytes
>     >
>     > I wonder what the correct way to consume a binary array is.  It seems
>     > like a binary array is likely representing some string data, but without
>     > the encoding it isn't clear how to safely interpret it.  Is there a
>     > convention (e.g. assume a binary type is ASCII encoded) that we can follow?
>     >
>     > Many thanks,
>     >
>     > Hatem
>     >
>
>
