[Abstract] When we have string data encoded in a character encoding other than UTF-8, we must use a BinaryArray for the data. But Apache Arrow doesn’t provide a way to specify which character encoding is used in a BinaryArray. In this mail, I’d like to discuss how Apache Arrow could provide a way to manage character encodings in a BinaryArray.
I’d appreciate any comments or suggestions.

[Long description] Apache Arrow has a specialized type for UTF-8-encoded strings but has no types for other character encodings, such as ISO-8859-x and Shift_JIS. We therefore need to record which character encoding a binary string array uses outside the arrays themselves, such as in metadata.

In the Datasets project, one of the goals is to support database protocols. Databases handle character encodings in different ways. For example, PostgreSQL lets us specify a character encoding per database, while MySQL lets us specify character encodings separately at each level: database, table, and column.

My concern is how Apache Arrow should provide a way to specify the character encoding of the values in an array. If people can constrain all their string data to UTF-8, StringArray is enough for them. But if they cannot unify their string data into UTF-8, should Apache Arrow provide a standard way to manage character encodings? An example use case is applying Apache Arrow to the internal data of an OR mapper library, such as ActiveRecord of Ruby on Rails.

My opinion is that Apache Arrow must have a standard way in both its format and its API, for the following reasons:

(1) Currently, when we use MySQL or PostgreSQL as the data source of record batch streams, we lose the character encoding information that the original data has.

(2) Without a standard way to manage character encodings, we would have to support character encoding handling separately for each combination of systems, which does not fit Apache Arrow’s philosophy.

(3) If Apache Arrow doesn’t provide standard APIs for character encoding management, we cannot support character encoding handling at the language-binding level.

There are two options to manage a character encoding in a BinaryArray.
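Before comparing the options, here is a small illustration of why losing the encoding information is harmful. This is a minimal Python sketch using only the standard library; the sample bytes are my own and do not come from any particular database:

```python
# The word "文字" ("characters") encoded in Shift_JIS, as it might arrive
# from a database column whose charset information was lost on the way.
data = "文字".encode("shift_jis")

# With the correct charset name, the original text is recovered.
assert data.decode("shift_jis") == "文字"

# A consumer that assumes UTF-8 fails outright on these bytes...
try:
    data.decode("utf-8")
except UnicodeDecodeError:
    print("not valid UTF-8")

# ...and one that assumes ISO-8859-1 raises no error but silently
# produces mojibake.
print(data.decode("iso-8859-1"))
```

Once such bytes are stored in a BinaryArray with no charset attached, a consumer can only guess between these outcomes.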
The first option is to introduce an optional character_encoding field in BinaryType. The second is to use the custom_metadata field to supply the character encoding name.

If we use custom_metadata, we should decide on a key for this information. I think “charset” is a good candidate for the key because it is widely used for specifying character encodings. The value must be the name of a character encoding, such as “UTF-8” or “Windows-31J”. It would be better if we could decide on canonical encoding names, but I guess that is hard work because many systems use the same name for different encodings. For example, “Shift_JIS” can mean either IANA’s Shift_JIS or Windows-31J; the two use the same coding rule, but the corresponding character sets are slightly different. See the spreadsheet [1] for the correspondence of character encoding names between MySQL, PostgreSQL, Ruby, Python, IANA [3], and the Encoding standard of WHATWG [4].

If we instead introduce a new optional field for the character encoding in BinaryType, I recommend making this new field a string that holds the name of the character encoding. It would also be possible to make the field an integer holding an enum value, but I don’t know of a good standard for enum values of character encodings. IANA manages MIBenum [2], though I think the registered character sets [3] are not enough for our requirements.

I prefer the second option, because the first can supply the character encoding information only to a Field, not to a BinaryArray.

[1] https://docs.google.com/spreadsheets/d/1D0xlI5r2wJUV45aTY1q2TwqD__v7acmd8FOfr8xSOVQ/edit?usp=sharing
[2] https://tools.ietf.org/html/rfc3808
[3] https://www.iana.org/assignments/character-sets/character-sets.xhtml
[4] https://encoding.spec.whatwg.org/

--
Kenta Murata