Thanks Antoine, that makes good sense.
We are writing string data using the utf8 data type. This question came up
when trying to read this fastparquet project test file into arrow memory:
fastparquet/test-data/nation.dict.parquet
The name and comment columns result in a binary data type. Digging into the
file, it seems these columns lack the UTF8 annotation, so that makes sense.
I wonder if this is somehow a "vintage" parquet file and whether more recent
parquet writers will include the UTF8 type.
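For what it's worth, here is a minimal sketch (plain Python, no pyarrow dependency; the byte values below are made up for illustration) of how a consumer might probe binary values that are suspected to hold text, falling back to None when the bytes are not valid in the assumed encoding:

```python
def try_decode(values, encoding="utf-8"):
    """Attempt to decode raw binary values as text; None on failure."""
    decoded = []
    for v in values:
        try:
            decoded.append(v.decode(encoding))
        except UnicodeDecodeError:
            # Bytes are not valid in the assumed encoding; leave a hole
            # rather than guessing at a replacement.
            decoded.append(None)
    return decoded

# Hypothetical payloads: two ASCII strings and one invalid UTF-8 sequence.
raw = [b"ALGERIA", b"ARGENTINA", b"\xff\xfe"]
print(try_decode(raw))  # -> ['ALGERIA', 'ARGENTINA', None]
```

This only makes the application's assumption explicit; it doesn't recover the true encoding, which is exactly the metadata the binary type omits.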
On 2/6/19, 11:54 AM, "Antoine Pitrou" <[email protected]> wrote:
Hi Hatem,
It is intended that the convention is application-dependent. From
Arrow's point of view, the binary string is an opaque blob of data.
Depending on your application, it might be a UTF-16-encoded piece of
text, a JPEG image, anything.
By the way, if you store ASCII text data, I would recommend using the
utf8 type, since the UTF-8 encoding is a superset of ASCII.
Regards
Antoine.
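Antoine's point that UTF-8 is a superset of ASCII can be checked directly: any pure-ASCII byte sequence decodes to the same text under both codecs, so storing ASCII data with the utf8 type loses nothing. A quick stdlib check (the sample string is arbitrary):

```python
# Any ASCII byte sequence decodes identically as ASCII and as UTF-8.
ascii_bytes = b"nation comment"
assert ascii_bytes.decode("ascii") == ascii_bytes.decode("utf-8")

# Every ASCII code point (0-127) maps to the same single byte in UTF-8.
assert all(chr(i).encode("utf-8") == bytes([i]) for i in range(128))
print("ASCII round-trips unchanged under UTF-8")
```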
On 06/02/2019 at 11:34, Hatem Helal wrote:
> Hi all,
>
> I wanted to make sure I understood the distinction and use cases for choosing
> between the utf8 and binary logical types.
>
> Based on this doc
> <https://arrow.apache.org/docs/format/Metadata.html#utf8-and-binary>
>
> * Utf8 data is Unicode values with UTF-8 encoding
> * Binary is any other variable length bytes
>
> I wonder what the correct way is to consume a binary array. It seems
> like a binary array is likely representing some string data, but without the
> encoding it isn't clear how to safely interpret it. Is there a convention
> (e.g. assume a binary type is ASCII-encoded) that we can follow?
>
> Many thanks,
>
> Hatem
>