Hello,

On 02/09/2019 at 10:39, Kenta Murata wrote:
>
> There are two options to manage a character encoding in a BinaryArray.
> The first way is introducing an optional character_encoding field in
> BinaryType. The second way is using custom_metadata field to supply
> the character encoding name.

I am against parameterizing Binary types with the character encoding.
Binary data in a Binary array is opaque, and the type should reflect
that. Its application-dependent meaning should be (optionally) encoded
in the metadata. For example, at some point people might also want to
add a "mime-type" metadata key.

So I think the second solution (defining a well-known metadata key) is
fine.

> If we use custom_metadata, we should decide the key for this
> information. I guess “charset” is good candidates for the key because
> it is widely used for specifying what a character encoding is used.

Or perhaps "ARROW:charset"?

(Personally, I think I prefer "encoding" to "charset"...)

> The value must be the name of a character encoding, such as “UTF-8”
> and “Windows-31J”. It is better if we can decide canonical encoding
> names, but I guess it is hard work because many systems use the same
> name for the different encodings.

I don't think encoding name canonicalization is Arrow's concern. Each
system has its own rules and aliases. And I doubt we're willing to
implement string processing algorithms for encodings other than UTF-8.

Regards

Antoine.
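
To make the metadata approach concrete, here is a minimal sketch in plain Python of how a consumer might interpret opaque binary values through a well-known metadata key. Everything in it is an assumption for illustration: the key name "ARROW:charset" is only the suggestion floated above, not an agreed convention; a plain dict stands in for the custom_metadata field; and decode_values is a hypothetical helper, not an Arrow API. Name resolution is left to the consuming system, here Python's own codec registry.

```python
import codecs

# Hypothetical key name, taken from the suggestion in this thread.
# It is not an agreed-upon Arrow convention.
CHARSET_KEY = "ARROW:charset"


def decode_values(values, metadata):
    """Decode opaque byte strings using the encoding named in the metadata.

    `values` is a list of bytes objects (or None for nulls) and `metadata`
    is a plain dict standing in for a field's custom metadata.
    """
    name = metadata.get(CHARSET_KEY)
    if name is None:
        # No encoding declared: the data stays opaque binary.
        return list(values)
    # The consumer resolves the name with its own alias rules (here,
    # Python's codec registry). Unknown names raise LookupError.
    codec = codecs.lookup(name)
    return [None if v is None else codec.decode(v)[0] for v in values]


if __name__ == "__main__":
    meta = {CHARSET_KEY: "UTF-8"}
    print(decode_values([b"caf\xc3\xa9", None], meta))
```

Note that the binary values themselves are untouched: the type stays plain binary, and only a consumer that understands the key gives the bytes a textual meaning.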