[Abstract] When we have string data encoded in a character encoding other than UTF-8, we must use a BinaryArray for the data. But Apache Arrow doesn’t provide a way to specify which character encoding is used in a BinaryArray. In this mail, I’d like to discuss how Apache Arrow could provide a way to manage character encodings in a BinaryArray.
I’d appreciate any comments or suggestions.

[Long description] Apache Arrow has a specialized type for UTF-8-encoded strings but has no types for other character encodings, such as ISO-8859-x and Shift_JIS. We therefore need to record which character encoding a binary string array uses outside the arrays themselves, such as in metadata.

In the Datasets project, one of the goals is to support database protocols. Databases handle character encodings in different ways. For example, PostgreSQL lets us specify a character encoding per database, while MySQL lets us specify character encodings separately at each level: database, table, and column.

My concern is how Apache Arrow should provide a way to specify the character encoding of the values in an array. If people can constrain all their string data to UTF-8, StringArray is enough for them. But if they cannot unify their string data into UTF-8, should Apache Arrow provide a standard way to manage character encodings? An example use case is applying Apache Arrow to the internal data of an OR mapper library, such as ActiveRecord of Ruby on Rails.

My opinion is that Apache Arrow must have a standard way in both its format and its API, for the following reasons:

(1) Currently, when we use MySQL or PostgreSQL as the data source of record batch streams, we lose the character encoding information that the original data has.

(2) Without a standard way to manage character encodings, we would have to support character encoding handling separately for each combination of systems, which does not fit Apache Arrow’s philosophy.

(3) If Apache Arrow doesn’t provide standard APIs for character encoding management, we cannot support character encoding handling at the language-binding level.

There are two options to manage a character encoding in a BinaryArray.
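Before comparing the options, here is a small illustration of why losing the encoding information is harmful. This is a minimal Python sketch using only the standard library; the sample bytes are my own and do not come from any particular database:

```python
# The word "文字" ("characters") encoded in Shift_JIS, as it might arrive
# from a database column whose charset information was lost on the way.
data = "文字".encode("shift_jis")

# With the correct charset name, the original text is recovered.
assert data.decode("shift_jis") == "文字"

# A consumer that assumes UTF-8 fails outright on these bytes...
try:
    data.decode("utf-8")
except UnicodeDecodeError:
    print("not valid UTF-8")

# ...and one that assumes ISO-8859-1 raises no error but silently
# produces mojibake.
print(data.decode("iso-8859-1"))
```

Once such bytes are stored in a BinaryArray with no charset attached, a consumer can only guess between these outcomes.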
The first option is to introduce an optional character_encoding field in BinaryType. The second is to use the custom_metadata field to supply the character encoding name.

If we use custom_metadata, we should decide on a key for this information. I think “charset” is a good candidate for the key because it is widely used for specifying character encodings. The value must be the name of a character encoding, such as “UTF-8” or “Windows-31J”. It would be better if we could decide on canonical encoding names, but I guess that is hard work because many systems use the same name for different encodings. For example, “Shift_JIS” can mean either IANA’s Shift_JIS or Windows-31J; the two use the same coding rule, but the corresponding character sets are slightly different. See the spreadsheet [1] for the correspondence of character encoding names between MySQL, PostgreSQL, Ruby, Python, IANA [3], and the Encoding standard of WHATWG [4].

If we instead introduce a new optional field for the character encoding in BinaryType, I recommend making this new field a string that holds the name of the character encoding. It would also be possible to make the field an integer holding an enum value, but I don’t know of a good standard for enum values of character encodings. IANA manages MIBenum [2], though I think the registered character sets [3] are not enough for our requirements.

I prefer the second option, because the first can supply the character encoding information only to a Field, not to a BinaryArray.

[1] https://docs.google.com/spreadsheets/d/1D0xlI5r2wJUV45aTY1q2TwqD__v7acmd8FOfr8xSOVQ/edit?usp=sharing
[2] https://tools.ietf.org/html/rfc3808
[3] https://www.iana.org/assignments/character-sets/character-sets.xhtml
[4] https://encoding.spec.whatwg.org/

--
Kenta Murata