Re: UTF-8 and Binary logical types

2019-02-06 Thread Hatem Helal
Hi Wes, Yes, the UTF8 ConvertedType is what I was after. Thanks for the helpful references. I don't have a good feel for how common this is but the following test file caused my confusion between UTF8 and Binary types in Arrow:

Re: UTF-8 and Binary logical types

2019-02-06 Thread Wes McKinney
Hi Hatem, Are you talking about the UTF8 ConvertedType in Parquet? https://github.com/apache/arrow/blob/master/cpp/src/parquet/parquet.thrift#L52 AFAIK we do respect that if it is set; otherwise we do not guess: https://github.com/apache/arrow/blob/master/cpp/src/parquet/arrow/schema.cc#L65

Re: UTF-8 and Binary logical types

2019-02-06 Thread Hatem Helal
Thanks Antoine, that makes good sense. We are writing string data using the utf8 data type. This question came up when trying to read this fastparquet project test file into Arrow memory: fastparquet/test-data/nation.dict.parquet The name and comment columns result in a binary

Re: UTF-8 and Binary logical types

2019-02-06 Thread Antoine Pitrou
Hi Hatem, It is intended that the convention is application-dependent. From Arrow's point of view, the binary string is an opaque blob of data. Depending on your application, it might be a UTF-16-encoded piece of text, a JPEG image, anything. By the way, if you store ASCII text data, I would
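
To illustrate the point about binary values being opaque, here is a minimal sketch in plain Python (standard library only, no Arrow or Parquet calls). The payload and the helper name `try_decode` are made up for illustration; the idea is only that the same bytes can be read several ways, and nothing in the bytes themselves says which reading is intended.

```python
# Hypothetical payload: the raw bytes a binary column value might hold.
# Here they happen to be the UTF-8 encoding of "h\u00e9llo".
raw = "h\u00e9llo".encode("utf-8")


def try_decode(data: bytes, encoding: str):
    """Return the decoded text, or None if the bytes are invalid in that encoding.

    Illustrative helper, not part of any library.
    """
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


# The application has to know (or be told) which interpretation applies.
as_utf8 = try_decode(raw, "utf-8")       # the intended text
as_utf16 = try_decode(raw, "utf-16-le")  # also decodes, but to unrelated characters
as_ascii = try_decode(raw, "ascii")      # fails: the payload has non-ASCII bytes

print(repr(as_utf8), repr(as_utf16), repr(as_ascii))
```

A type that is declared as UTF-8 removes this ambiguity for the reader; a plain binary type leaves the choice to the application, which matches the convention described above.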

UTF-8 and Binary logical types

2019-02-06 Thread Hatem Helal
Hi all, I wanted to make sure I understood the distinction/use cases for choosing between the utf8 and binary logical types. Based on this doc
* Utf8 data is Unicode values with UTF-8 encoding
* Binary is any other variable