Thanks Antoine, that makes good sense. We are writing string data using the utf8 data type.

This question came up when trying to read this fastparquet project test file into Arrow memory:

    fastparquet/test-data/nation.dict.parquet

The name and comment columns come back with the binary data type. Digging into the file, it seems these columns lack the UTF8 type, so that makes sense. I wonder if this is somehow a "vintage" parquet file and whether more recent parquet writers would include the UTF8 type.
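Here is roughly what I am doing, in case it helps. This is a minimal sketch assuming a recent pyarrow and a local checkout of the fastparquet repository:

    import pyarrow as pa
    import pyarrow.parquet as pq

    # Path is relative to a checkout of the fastparquet repository.
    table = pq.read_table("fastparquet/test-data/nation.dict.parquet")
    print(table.schema)  # name and comment show up as binary, not string

    # The values are known to be ASCII, which is a subset of UTF-8,
    # so one way to reinterpret the bytes is an explicit cast to utf8:
    name = table.column("name").cast(pa.string())
    print(name.type)  # string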
On 2/6/19, 11:54 AM, "Antoine Pitrou" <anto...@python.org> wrote:

Hi Hatem,

It is intended that the convention is application-dependent. From Arrow's point of view, the binary string is an opaque blob of data. Depending on your application, it might be a UTF-16-encoded piece of text, a JPEG image, anything.

By the way, if you store ASCII text data, I would recommend using the utf8 type, since the UTF-8 encoding is a superset of ASCII.

Regards

Antoine.

On 06/02/2019 at 11:34, Hatem Helal wrote:
> Hi all,
>
> I wanted to make sure I understood the distinction and use cases for choosing between the utf8 and binary logical types.
>
> Based on this doc <https://arrow.apache.org/docs/format/Metadata.html#utf8-and-binary>:
>
> * Utf8 data is Unicode values with UTF-8 encoding
> * Binary is any other variable-length bytes
>
> I wonder what the correct way is to consume a binary array. It seems like a binary array is likely representing some string data, but without the encoding it isn't clear how to safely interpret it. Is there a convention (e.g. assume a binary type is ASCII encoded) that we can follow?
>
> Many thanks,
>
> Hatem
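To make the distinction concrete, here is a small sketch (again assuming a recent pyarrow) of the two types discussed above: utf8 arrays are declared to hold UTF-8 encoded text, while binary arrays are opaque bytes whose interpretation is left to the application:

    import pyarrow as pa

    # utf8: the values are declared to be UTF-8 encoded text
    text = pa.array(["ALGERIA", "ARGENTINA"], type=pa.string())

    # binary: opaque bytes; these could be UTF-16 text, a JPEG image,
    # anything. The schema carries no encoding information.
    blob = pa.array([b"ALGERIA", b"ARGENTINA"], type=pa.binary())

    print(text.type)  # string
    print(blob.type)  # binary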