Hi Wes, Yes, the UTF8 ConvertedType is what I was after. Thanks for the helpful references.
I don't have a good feel for how common this is, but the following test file caused my confusion between the UTF8 and Binary types in Arrow:

https://github.com/dask/fastparquet/blob/master/test-data/nation.dict.parquet

I debugged this and found that the ConvertedType isn't set for the columns containing string data, which results in using LogicalType::NONE, here:

https://github.com/apache/arrow/blob/master/cpp/src/parquet/schema.cc#L310

I haven't found other files that exhibit this behavior, which I think is OK now that I understand it a bit more.

Hatem

On 2/6/19, 3:37 PM, "Wes McKinney" <wesmck...@gmail.com> wrote:

hi Hatem,

Are you talking about the UTF8 ConvertedType in Parquet?

https://github.com/apache/arrow/blob/master/cpp/src/parquet/parquet.thrift#L52

AFAIK we do respect that if it is set, otherwise we do not guess

https://github.com/apache/arrow/blob/master/cpp/src/parquet/arrow/schema.cc#L65

- Wes

On Wed, Feb 6, 2019 at 7:07 AM Hatem Helal <hatem.he...@mathworks.co.uk> wrote:
>
> Thanks Antoine, that makes good sense.
>
> We are writing string data using the utf8 data type. This question came up when trying to read this fastparquet project test file into arrow memory:
>
> fastparquet/test-data/nation.dict.parquet
>
> The name and comment columns result in a binary data type. Digging into the file, it seems like these columns are lacking the utf8 type, so that makes sense. I wonder if this is somehow a "vintage" parquet file and more recent parquet writers will include the UTF8 type.
>
> On 2/6/19, 11:54 AM, "Antoine Pitrou" <anto...@python.org> wrote:
>
>
> Hi Hatem,
>
> It is intended that the convention is application-dependent. From
> Arrow's point of view, the binary string is an opaque blob of data.
> Depending on your application, it might be a UTF16-encoded piece of
> text, a JPEG image, anything.
>
> By the way, if you store ASCII text data, I would recommend using the
> utf8 type, since the UTF-8 encoding is a superset of ASCII.
>
> Regards
>
> Antoine.
>
>
> On 06/02/2019 at 11:34, Hatem Helal wrote:
> > Hi all,
> >
> > I wanted to make sure I understood the distinction/use cases for choosing between the utf8 and binary logical types.
> >
> > Based on this doc <https://arrow.apache.org/docs/format/Metadata.html#utf8-and-binary>
> >
> > * Utf8 data is Unicode values with UTF-8 encoding
> > * Binary is any other variable length bytes
> >
> > I wonder what is the correct way to consume a binary array. It seems like a binary array is likely representing some string data, but without the encoding it isn't clear how to safely interpret it. Is there a convention (e.g. assume a binary type is ASCII encoded) that we can follow?
> >
> > Many thanks,
> >
> > Hatem
> >
> >
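To make the point of the thread concrete, here is a minimal sketch in plain Python (standard library only, deliberately not using any Arrow or Parquet API) of how an application might consume values from a column that arrives as binary. The helper name decode_binary_values and the sample values are hypothetical and only for illustration. It assumes the application has its own reason to believe the bytes are UTF-8 text, since, as noted above, Arrow itself treats binary as an opaque blob and does not guess.

```python
from typing import List, Optional, Union


def decode_binary_values(
    values: List[Optional[bytes]], encoding: str = "utf-8"
) -> List[Optional[Union[str, bytes]]]:
    """Decode each bytes value with `encoding`; keep the raw bytes if decoding fails.

    Hypothetical helper for illustration; the choice of encoding is an
    application-level decision, not something the data itself declares.
    """
    decoded: List[Optional[Union[str, bytes]]] = []
    for value in values:
        if value is None:
            # Preserve nulls as-is.
            decoded.append(None)
            continue
        try:
            decoded.append(value.decode(encoding, errors="strict"))
        except UnicodeDecodeError:
            # Not valid text in this encoding: leave it as an opaque blob.
            decoded.append(value)
    return decoded


ascii_name = b"ALGERIA"                # pure ASCII, so also valid UTF-8
utf8_name = "Zürich".encode("utf-8")   # non-ASCII text encoded as UTF-8
not_text = b"\xff\xd8\xff\xe0"         # JPEG-style header bytes; not valid UTF-8

result = decode_binary_values([ascii_name, utf8_name, None, not_text])
```

Because UTF-8 is a superset of ASCII, the ASCII value decodes to the same string under either encoding, which is why storing ASCII text with the utf8 type is safe. The last value fails strict decoding and stays as bytes, which is the case where only the application can say what the data means.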