Hi Kenta,

Using ExtensionType seems like a simple way to handle this for the
immediate purpose of implementing user-facing Array types. If we later
want to change the metadata representation to something more
"built-in", we can keep discussing this. Having a distinct DataType
subclass and Array subclass for unicode-but-not-UTF-8 data seems more
useful than adding an encoding attribute to BinaryType. Interested to
know what you think about this solution.
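
For illustration, here is a rough sketch of what such a user-facing
extension type could look like with pyarrow's ExtensionType (the class
name, extension name, and encodings below are just placeholders, not
anything agreed upon):

    import pyarrow as pa

    # Sketch only: an extension type wrapping binary storage that
    # carries the character encoding name in its serialized metadata.
    class EncodedStringType(pa.ExtensionType):

        def __init__(self, encoding):
            self._encoding = encoding
            super().__init__(pa.binary(), "example.encoded_string")

        @property
        def encoding(self):
            return self._encoding

        def __arrow_ext_serialize__(self):
            # The encoding name travels with the type, e.g. across IPC.
            return self._encoding.encode("utf-8")

        @classmethod
        def __arrow_ext_deserialize__(cls, storage_type, serialized):
            return cls(serialized.decode("utf-8"))

    # Register once so IPC round-trips reconstruct the extension type.
    pa.register_extension_type(EncodedStringType("Shift_JIS"))

    storage = pa.array(["こんにちは".encode("shift_jis")],
                       type=pa.binary())
    arr = pa.ExtensionArray.from_storage(EncodedStringType("Shift_JIS"),
                                         storage)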

- Wes

On Mon, Sep 2, 2019 at 3:40 AM Kenta Murata <m...@mrkn.jp> wrote:
>
> [Abstract]
> When we have string data encoded in a character encoding other than
> UTF-8, we must use a BinaryArray for the data.  But Apache Arrow
> doesn’t provide a way to specify what character encoding is used in a
> BinaryArray.  In this mail, I’d like to discuss how Apache Arrow could
> provide a way to manage character encodings in BinaryArrays.
>
> I’d appreciate any comments or suggestions.
>
> [Long description]
> Apache Arrow has a specialized type for UTF-8-encoded strings but
> doesn’t have types for other character encodings, such as ISO-8859-x
> and Shift_JIS.  We have to manage which character encoding is used in
> a binary string array outside of the array itself, for example in
> metadata.
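>
> For example, in pyarrow the encoding has to be remembered somewhere
> outside of the array itself (a small illustration only):
>
>     import pyarrow as pa
>
>     # "こんにちは" encoded in Shift_JIS cannot go into a StringArray
>     # (which holds UTF-8 only), so it ends up as plain binary and the
>     # encoding information is lost.
>     data = ["こんにちは".encode("shift_jis")]
>     arr = pa.array(data, type=pa.binary())
>     print(arr.type)  # binary -- nothing says these bytes are Shift_JIS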
>
> In the Datasets project, one of the goals is to support database
> protocols.  Some databases support many character encodings, each in
> its own manner.  For example, PostgreSQL lets us specify the character
> encoding per database, while MySQL allows us to specify character
> encodings separately at each level: database, table, and column.
>
> I have a concern about how Apache Arrow should provide a way to
> specify character encodings for values in arrays.
>
> If people can constrain all their string data to UTF-8, StringArray
> is enough for them.  But if they cannot unify the character encoding
> of their string data to UTF-8, should Apache Arrow provide a standard
> way of managing character encodings?
>
> An example use of Apache Arrow in such a case is handling the
> internal data of an OR mapper library, such as ActiveRecord in Ruby
> on Rails.
>
> My opinion is that Apache Arrow must have a standard way in both its
> format and its API.  The reasons are below:
>
> (1) Currently, when we use MySQL or PostgreSQL as the data source of
> record batch streams, we lose the information about the character
> encodings of the original data
>
> (2) Without a standard way of managing character encodings, we have
> to struggle to support character encoding handling for every
> combination of systems, which does not fit Apache Arrow’s philosophy
>
> (3) We cannot support character encoding handling at the
> language-binding level if Apache Arrow doesn’t provide standard APIs
> for character encoding management
>
> There are two options for managing a character encoding in a
> BinaryArray.  The first is introducing an optional character_encoding
> field in BinaryType.  The second is using the custom_metadata field
> to supply the character encoding name.
>
> If we use custom_metadata, we should decide on the key for this
> information.  I guess “charset” is a good candidate for the key
> because it is widely used for specifying which character encoding is
> used.  The value must be the name of a character encoding, such as
> “UTF-8” or “Windows-31J”.  It would be better if we could decide on
> canonical encoding names, but I guess that is hard because many
> systems use the same name for different encodings.  For example,
> “Shift_JIS” means either IANA’s Shift_JIS or Windows-31J; they use
> the same coding rule, but the corresponding character sets are
> slightly different.  See the spreadsheet [1] for the correspondence
> of character encoding names between MySQL, PostgreSQL, Ruby, Python,
> IANA [3], and the Encoding standard of WHATWG [4].
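>
> As a rough sketch, attaching the proposed “charset” key as Field-level
> custom_metadata could look like this in pyarrow (the key itself is
> just the candidate above, not something Arrow defines today):
>
>     import pyarrow as pa
>
>     # Attach the proposed "charset" key as Field-level custom metadata.
>     field = pa.field("name", pa.binary(),
>                      metadata={"charset": "Windows-31J"})
>     schema = pa.schema([field])
>
>     # The metadata is carried with the schema, e.g. through IPC.
>     print(schema.field("name").metadata)  # {b'charset': b'Windows-31J'}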
>
> If we introduce a new optional field in BinaryType for the character
> encoding information, I recommend making this new field a string that
> holds the name of the character encoding.  It would also be possible
> to make the field an integer holding an enum value, but I don’t know
> of a good standard for enum values of character encodings.  IANA
> manages MIBenum [2], though I think the registered character
> encodings [3] are not enough for our requirements.
>
> I prefer the second way, because the first way can supply the
> character encoding information only to a Field, not to a BinaryArray.
>
> [1] 
> https://docs.google.com/spreadsheets/d/1D0xlI5r2wJUV45aTY1q2TwqD__v7acmd8FOfr8xSOVQ/edit?usp=sharing
> [2] https://tools.ietf.org/html/rfc3808
> [3] https://www.iana.org/assignments/character-sets/character-sets.xhtml
> [4] https://encoding.spec.whatwg.org/
>
> --
> Kenta Murata
