Thanks for responding. I understand that ExtensionType is suitable for handling character encodings. I'll try to write and propose a draft specification and implementation of an extension type.
Regards,
Kenta Murata

On Thu, Sep 5, 2019 at 7:56, Wes McKinney <wesmck...@gmail.com> wrote:
>
> I opened https://issues.apache.org/jira/browse/ARROW-6455. It might
> make sense to define a common ExtensionType metadata in case multiple
> implementations decide they need this.
>
> On Tue, Sep 3, 2019 at 10:35 PM Micah Kornfield <emkornfi...@gmail.com> wrote:
> >
> > This might be bike-shedding, but I agree we should attempt to use
> > extension types for this use case. I would expect something like:
> >   ARROW:extension:name = "NonUtf8String"
> >   ARROW:extension:metadata = "{\"iso-charset\": \"ISO-8859-10\"}"
> >
> > The latter's value is a JSON-encoded string, which captures the
> > character set.
> >
> > Thanks,
> > Micah
> >
> > On Tue, Sep 3, 2019 at 6:59 PM Sutou Kouhei <k...@clear-code.com> wrote:
> > >
> > > Hi,
> > >
> > > > If people can constrain themselves to UTF-8 for all their string
> > > > data, StringArray is enough for them. But if they cannot unify the
> > > > character encoding of their string data as UTF-8, should Apache
> > > > Arrow provide a standard way to manage character encodings?
> > >
> > > I think that Apache Arrow users should convert their string
> > > data to UTF-8 in their applications. If Apache Arrow only
> > > supports UTF-8 strings, Apache Arrow users can process string
> > > data without converting encodings between multiple systems. I
> > > think no-conversion (zero-copy) use is the Apache Arrow way.
> > >
> > > > My opinion is that Apache Arrow must have a standard way in both
> > > > its format and its API. The reasons are below:
> > > >
> > > > (1) Currently, when we use MySQL or PostgreSQL as the data source
> > > > of record batch streams, we lose the information about which
> > > > character encodings the original data use.
> > >
> > > Both MySQL and PostgreSQL provide an encoding conversion
> > > feature, so we can convert the original data to UTF-8.
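[Editor's note] The convert-at-the-boundary approach Kou describes above amounts to a decode/re-encode pair before the data ever reaches an Arrow array. A minimal sketch using only Python's standard codecs (the helper name and sample text are illustrative, not from the thread):

```python
def to_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Decode bytes from a known source encoding, then re-encode as UTF-8.

    Sketch of the thread's recommendation: normalize to UTF-8 at the
    system boundary so Arrow string arrays never need an encoding tag.
    """
    return raw.decode(source_encoding).encode("utf-8")


# Example: data arriving from a Latin-1 (ISO-8859-1) database column.
latin1_bytes = "café".encode("iso-8859-1")
utf8_bytes = to_utf8(latin1_bytes, "iso-8859-1")
```

After this normalization step the bytes are valid UTF-8 and can be stored in a plain StringArray with no extra metadata.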
> > >
> > > MySQL:
> > >   CONVERT function
> > >   https://dev.mysql.com/doc/refman/8.0/en/cast-functions.html#function_convert
> > >
> > > PostgreSQL:
> > >   convert_to function
> > >   https://www.postgresql.org/docs/11/functions-string.html#id-1.5.8.9.7.2.2.8.1.1
> > >
> > > If we need to support non-UTF-8 encodings, I like the approach of a
> > > NonUTF8String (or similar) extension type with metadata. I also
> > > prefer "ARROW:encoding" over "ARROW:charset" as the metadata key.
> > >
> > > Thanks,
> > > --
> > > kou
> > >
> > > In <cahnzt+2r_gekw8cduhunripuyvuqf7xd_p5kwmn50qfzzlc...@mail.gmail.com>
> > >   "[DISCUSS][FORMAT] Concerning about character encoding of binary string data"
> > >   on Mon, 2 Sep 2019 17:39:22 +0900, Kenta Murata <m...@mrkn.jp> wrote:
> > >
> > > > [Abstract]
> > > > When we have string data encoded in a character encoding other than
> > > > UTF-8, we must use a BinaryArray for the data. But Apache Arrow
> > > > doesn't provide a way to specify which character encoding is used
> > > > in a BinaryArray. In this mail, I'd like to discuss how Apache
> > > > Arrow could provide a way to manage the character encoding of a
> > > > BinaryArray.
> > > >
> > > > I'd appreciate any comments or suggestions.
> > > >
> > > > [Long description]
> > > > Apache Arrow has a specialized type for UTF-8 encoded strings but
> > > > doesn't have types for other character encodings, such as
> > > > ISO-8859-x and Shift_JIS. We need to record which character
> > > > encoding is used in a binary string array somewhere outside the
> > > > array itself, such as in metadata.
> > > >
> > > > In the Datasets project, one of the goals is to support database
> > > > protocols. Some databases support many character encodings, each
> > > > in its own manner.
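[Editor's note] The extension-type-plus-metadata approach favored upthread would carry the encoding as ordinary field-level key/value metadata. A minimal sketch of what those key/value pairs might look like, using only the standard library; the helper name is hypothetical, and "encoding" is just one of the JSON key spellings floated in the thread (Micah's example uses "iso-charset"):

```python
import json

def non_utf8_string_metadata(encoding: str) -> dict:
    """Hypothetical helper: build the field-level metadata a
    NonUtf8String extension type might carry, per the thread's proposal.
    """
    return {
        "ARROW:extension:name": "NonUtf8String",
        # A JSON-encoded string capturing the character set.
        "ARROW:extension:metadata": json.dumps({"encoding": encoding}),
    }


meta = non_utf8_string_metadata("ISO-8859-10")
```

In a real implementation these pairs would be attached to a Field whose storage type is Binary, and an ExtensionType registered under that name would reconstruct the encoding from the metadata on read.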
> > > > For example, PostgreSQL lets us specify which character encoding
> > > > is used for each database, and MySQL allows us to specify
> > > > character encodings separately at each level: database, table,
> > > > and column.
> > > >
> > > > I have a concern about how Apache Arrow should provide a way to
> > > > specify character encodings for values in arrays.
> > > >
> > > > If people can constrain themselves to UTF-8 for all their string
> > > > data, StringArray is enough for them. But if they cannot unify the
> > > > character encoding of their string data as UTF-8, should Apache
> > > > Arrow provide a standard way to manage character encodings?
> > > >
> > > > An example use of Apache Arrow in such a case is in the internal
> > > > data of an OR mapper library, such as ActiveRecord of Ruby on
> > > > Rails.
> > > >
> > > > My opinion is that Apache Arrow must have a standard way in both
> > > > its format and its API. The reasons are below:
> > > >
> > > > (1) Currently, when we use MySQL or PostgreSQL as the data source
> > > > of record batch streams, we lose the information about which
> > > > character encodings the original data use.
> > > >
> > > > (2) Without a standard way of managing character encodings, we
> > > > have to handle character encodings separately for each combination
> > > > of systems, which does not fit Apache Arrow's philosophy.
> > > >
> > > > (3) We cannot support character encoding handling at the
> > > > language-binding level if Apache Arrow doesn't provide standard
> > > > APIs for character encoding management.
> > > >
> > > > There are two options for managing a character encoding in a
> > > > BinaryArray. The first is introducing an optional
> > > > character_encoding field in BinaryType. The second is using the
> > > > custom_metadata field to supply the character encoding name.
> > > >
> > > > If we use custom_metadata, we should decide on the key for this
> > > > information. I guess "charset" is a good candidate for the key
> > > > because it is widely used for specifying which character encoding
> > > > is used. The value must be the name of a character encoding, such
> > > > as "UTF-8" or "Windows-31J". It would be better if we could decide
> > > > on canonical encoding names, but I guess that is hard work because
> > > > many systems use the same name for different encodings. For
> > > > example, "Shift_JIS" means either IANA's Shift_JIS or Windows-31J;
> > > > they use the same coding rule, but the corresponding character
> > > > sets are slightly different. See the spreadsheet [1] for the
> > > > correspondence of character encoding names between MySQL,
> > > > PostgreSQL, Ruby, Python, IANA [3], and the Encoding Standard of
> > > > WHATWG [4].
> > > >
> > > > If we introduce a new optional field for the character encoding in
> > > > BinaryType, I recommend making this new field a string that holds
> > > > the name of the character encoding. It is also possible to make
> > > > the field an integer holding an enum value, but I don't know of a
> > > > good standard for enum values of character encodings. IANA manages
> > > > MIBenum [2], though the registered character encodings [3] are not
> > > > enough for our requirements, I think.
> > > >
> > > > I prefer the second way because the first way can supply the
> > > > character encoding information only to a Field, not to a
> > > > BinaryArray.
> > > >
> > > > [1] https://docs.google.com/spreadsheets/d/1D0xlI5r2wJUV45aTY1q2TwqD__v7acmd8FOfr8xSOVQ/edit?usp=sharing
> > > > [2] https://tools.ietf.org/html/rfc3808
> > > > [3] https://www.iana.org/assignments/character-sets/character-sets.xhtml
> > > > [4] https://encoding.spec.whatwg.org/
> > > >
> > > > --
> > > > Kenta Murata
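[Editor's note] The Shift_JIS / Windows-31J naming ambiguity Kenta describes is easy to reproduce: Python registers the two under different codec names (shift_jis vs. cp932), and characters from the Windows/NEC extensions round-trip only under the latter. A small illustration (the character choice is mine, not from the thread):

```python
# "①" (circled digit one) belongs to the NEC extensions included in
# Windows-31J (Python codec name: cp932) but not in strict Shift_JIS.
win31j_bytes = "①".encode("cp932")

try:
    "①".encode("shift_jis")
    strict_shift_jis_accepts = True
except UnicodeEncodeError:
    # Strict Shift_JIS has no code point for the NEC special characters.
    strict_shift_jis_accepts = False
```

This is exactly why a bare "Shift_JIS" label in metadata would be ambiguous: two systems can agree on the label yet disagree on which bytes are valid.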