Hi Hatem,

Are you talking about the UTF8 ConvertedType in Parquet?

https://github.com/apache/arrow/blob/master/cpp/src/parquet/parquet.thrift#L52

AFAIK we do respect that if it is set; otherwise we do not guess:

https://github.com/apache/arrow/blob/master/cpp/src/parquet/arrow/schema.cc#L65

- Wes

On Wed, Feb 6, 2019 at 7:07 AM Hatem Helal <hatem.he...@mathworks.co.uk> wrote:
>
> Thanks Antoine, that makes good sense.
>
> We are writing string data using the utf8 data type. This question came up
> when trying to read this fastparquet project test file into arrow memory:
>
> fastparquet/test-data/nation.dict.parquet
>
> The name and comment columns result in a binary data type. Digging into the
> file, it seems like these columns are lacking the utf8 type, so that makes
> sense. I wonder if this is somehow a "vintage" parquet file and more recent
> parquet writers will include the UTF8 type.
>
> On 2/6/19, 11:54 AM, "Antoine Pitrou" <anto...@python.org> wrote:
>
> > Hi Hatem,
> >
> > It is intended that the convention is application-dependent. From
> > Arrow's point of view, the binary string is an opaque blob of data.
> > Depending on your application, it might be a UTF16-encoded piece of
> > text, a JPEG image, anything.
> >
> > By the way, if you store ASCII text data, I would recommend using the
> > utf8 type, since the UTF-8 encoding is a superset of ASCII.
> >
> > Regards
> >
> > Antoine.
> >
> >
> > Le 06/02/2019 à 11:34, Hatem Helal a écrit :
> > > Hi all,
> > >
> > > I wanted to make sure I understood the distinction/use cases for
> > > choosing between the utf8 and binary logical types.
> > >
> > > Based on this doc
> > > <https://arrow.apache.org/docs/format/Metadata.html#utf8-and-binary>
> > >
> > > * Utf8 data is Unicode values with UTF-8 encoding
> > > * Binary is any other variable length bytes
> > >
> > > I wonder what is the correct way to consume a binary array. It seems
> > > like a binary array is likely representing some string data, but without
> > > the encoding it isn't clear how to safely interpret it. Is there a
> > > convention (e.g. assume a binary type is ASCII encoded) that we can
> > > follow?
> > >
> > > Many thanks,
> > >
> > > Hatem
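
To illustrate the point made in the thread (binary values are opaque, ASCII text is already valid UTF-8, and a reader should not guess the encoding), here is a minimal sketch in plain Python. It does not use the Arrow or Parquet APIs; the helper name `try_decode_utf8` and the sample values are hypothetical and only stand in for the values of a binary column.

```python
from typing import List, Optional, Union


def try_decode_utf8(values: List[Optional[bytes]]) -> List[Union[str, bytes, None]]:
    """Hypothetical helper: decode each value as UTF-8 when it is valid UTF-8,
    and keep the raw bytes (or None for nulls) otherwise."""
    decoded: List[Union[str, bytes, None]] = []
    for value in values:
        if value is None:
            decoded.append(None)  # preserve nulls
            continue
        try:
            decoded.append(value.decode("utf-8"))
        except UnicodeDecodeError:
            decoded.append(value)  # not valid UTF-8: leave it as an opaque blob
    return decoded


# ASCII-encoded text is always valid UTF-8, since UTF-8 is a superset of ASCII.
ascii_text = "ALGERIA".encode("ascii")
assert ascii_text.decode("utf-8") == "ALGERIA"

# The first bytes of a JPEG file are not valid UTF-8, so they stay as bytes.
jpeg_header = bytes([0xFF, 0xD8, 0xFF, 0xE0])
result = try_decode_utf8([ascii_text, None, jpeg_header])

# A successful decode does not prove the encoding: UTF-16-LE text without a
# byte order mark decodes as UTF-8 without error, but yields the wrong string.
misread = "ALGERIA".encode("utf-16-le").decode("utf-8")
```

The last case is why guessing is risky: the only reliable signal is metadata written by the producer, such as the UTF8 annotation discussed above.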