Hello,

On 02/09/2019 at 10:39, Kenta Murata wrote:
> 
> There are two options to manage a character encoding in a BinaryArray.
> The first way is introducing an optional character_encoding field in
> BinaryType.  The second way is using custom_metadata field to supply
> the character encoding name.

I am against parameterizing Binary types with the character encoding.
Binary data in a Binary array is opaque, and the type should reflect
that.  Its application-dependent meaning should be (optionally) encoded
in the metadata.  For example, at some point people might also want to
add a "mime-type" metadata key.

So I think the second solution (defining a well-known metadata key) is fine.

> If we use custom_metadata, we should decide the key for this
> information.  I guess “charset” is a good candidate for the key because
> it is widely used to specify which character encoding is in use.

Or perhaps "ARROW:charset"?
(I think I prefer "encoding" rather than "charset" personally...)
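
To make the metadata approach concrete, here is a rough sketch in plain
Python. It does not use any Arrow API: BinaryField, decode_values and
CHARSET_KEY are hypothetical names made up for illustration, and the key
name itself is still undecided. The point is only that the binary type
stays opaque, and a consumer that understands the key may choose to decode.

```python
from dataclasses import dataclass, field

# Hypothetical well-known key; the actual name is still under discussion.
CHARSET_KEY = "ARROW:charset"


@dataclass
class BinaryField:
    """Stand-in for a field of binary type (not an Arrow class)."""
    name: str
    metadata: dict = field(default_factory=dict)


def decode_values(fld, values):
    """Decode byte values only if the field metadata names a charset."""
    charset = fld.metadata.get(CHARSET_KEY)
    if charset is None:
        # No hint in the metadata: the bytes stay opaque.
        return list(values)
    return [v.decode(charset) for v in values]


raw = ["日本語".encode("shift_jis")]
plain = BinaryField("payload")
tagged = BinaryField("payload", {CHARSET_KEY: "shift_jis"})
```

A consumer that ignores the key still sees valid binary data, which is
why the hint can remain optional.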

> The value must be the name of a character encoding, such as “UTF-8”
> and “Windows-31J”.  It is better if we can decide canonical encoding
> names, but I guess it is hard work because many systems use the same
> name for the different encodings.

I don't think encoding name canonicalization is Arrow's concern. Each
system has its rules and aliases.  And I doubt we're willing to
implement string processing algorithms for encodings other than UTF-8.
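
As an illustration of "each system has its rules and aliases", Python's
own codec registry already resolves several spellings to one codec. The
sketch below uses only the standard library; same_codec is a hypothetical
helper, and other systems may well resolve the same names differently.

```python
import codecs


def same_codec(name_a, name_b):
    """Return True if Python resolves both names to the same codec."""
    return codecs.lookup(name_a).name == codecs.lookup(name_b).name


# Python-specific alias resolution; not a cross-system canonical form.
utf8_aliases_match = same_codec("UTF-8", "utf8")
latin1_aliases_match = same_codec("latin1", "ISO-8859-1")
different_codecs_match = same_codec("utf-8", "latin1")
```

A consumer would apply its own lookup of this kind to whatever name it
finds in the metadata, rather than Arrow mandating a canonical spelling.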

Regards

Antoine.
