Re: [DISCUSS][FORMAT] Concerning about character encoding of binary string data

Micah Kornfield Tue, 03 Sep 2019 20:35:36 -0700

This might be bike-shedding, but I agree we should attempt to use extension
types for these use cases.  I would expect something like:
ARROW:extension:name=NonUtf8String
ARROW:extension:metadata = "{\"iso-charset\": \"ISO-8859-10\"}"


The latter's value is a JSON-encoded string that captures the
character set.
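
As a rough illustration only (not a proposed API), the two metadata entries
above could be built and read back with plain Python. The helper names
make_extension_metadata and decode_values are hypothetical, and the sketch
uses only the standard json module and Python's built-in codecs rather than
any Arrow library:

```python
import json

EXTENSION_NAME = "NonUtf8String"  # name taken from the example above


def make_extension_metadata(charset):
    """Build the two key/value entries as plain strings (hypothetical helper)."""
    return {
        "ARROW:extension:name": EXTENSION_NAME,
        # The value is itself a JSON-encoded string carrying the charset.
        "ARROW:extension:metadata": json.dumps({"iso-charset": charset}),
    }


def decode_values(raw_values, metadata):
    """Decode raw byte strings using the charset recorded in the metadata."""
    if metadata.get("ARROW:extension:name") != EXTENSION_NAME:
        raise ValueError("not a NonUtf8String field")
    charset = json.loads(metadata["ARROW:extension:metadata"])["iso-charset"]
    # None stands in for a null slot and is passed through unchanged.
    return [None if raw is None else raw.decode(charset) for raw in raw_values]


metadata = make_extension_metadata("ISO-8859-10")
# "\u00e6" is the letter ae, which ISO-8859-10 can represent.
raw_values = ["\u00e6ble".encode("iso-8859-10"), None]
decoded = decode_values(raw_values, metadata)
```

Using json.dumps avoids hand-escaping the inner quotes of the metadata value.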

Thanks,
Micah


On Tue, Sep 3, 2019 at 6:59 PM Sutou Kouhei <k...@clear-code.com> wrote:

> Hi,
>
> > If people can constrain themselves to UTF-8 for all string data,
> > StringArray is enough for them. But if they cannot unify the character
> > encoding of string data to UTF-8, should Apache Arrow provide a
> > standard way to manage character encodings?
>
> I think that Apache Arrow users should convert their string
> data to UTF-8 in their application. If Apache Arrow only
> supports UTF-8 strings, Apache Arrow users can process string
> data without converting encodings between multiple systems. I
> think conversion-free (zero-copy) use is the Apache Arrow way.
>
> > My opinion is that Apache Arrow must have a standard way in both its
> > format and its API.  The reasons are below:
> >
> > (1) Currently, when we use MySQL or PostgreSQL as the data source of
> > record batch streams, we lose the information about the character
> > encoding of the original data
>
> Both MySQL and PostgreSQL provide encoding conversion
> features, so we can convert the original data to UTF-8.
>
> MySQL:
>
>   CONVERT function
>
> https://dev.mysql.com/doc/refman/8.0/en/cast-functions.html#function_convert
>
> PostgreSQL:
>
>   convert_to function
>
> https://www.postgresql.org/docs/11/functions-string.html#id-1.5.8.9.7.2.2.8.1.1
>
>
>
> If we need to support non-UTF-8 encodings, I like the
> approach of an extension type such as NonUTF8String plus
> metadata. I also prefer "ARROW:encoding" over
> "ARROW:charset" for the metadata key.
>
>
> Thanks,
> --
> kou
>
> In <cahnzt+2r_gekw8cduhunripuyvuqf7xd_p5kwmn50qfzzlc...@mail.gmail.com>
>   "[DISCUSS][FORMAT] Concerning about character encoding of binary string
> data" on Mon, 2 Sep 2019 17:39:22 +0900,
>   Kenta Murata <m...@mrkn.jp> wrote:
>
> > [Abstract]
> > When we have string data encoded in a character encoding other than
> > UTF-8, we must use a BinaryArray for the data.  But Apache Arrow
> > doesn’t provide a way to specify which character encoding is used in a
> > BinaryArray.  In this mail, I’d like to discuss how Apache Arrow
> > could provide a way to manage the character encoding of a BinaryArray.
> >
> > I’d appreciate any comments or suggestions.
> >
> > [Long description]
> > Apache Arrow has a specialized type for UTF-8 encoded strings but
> > doesn’t have types for other character encodings, such as ISO-8859-x
> > and Shift_JIS. We need to track which character encoding is used in
> > a binary string array outside of the array itself, such as in metadata.
> >
> > In the Datasets project, one of the goals is to support database
> > protocols.  Some databases support many character encodings, each in
> > its own manner.  For example, PostgreSQL lets us specify which
> > character encoding is used for each database, and MySQL allows us to
> > specify character encodings separately at each level: database,
> > table, and column.
> >
> > I have a concern about how Apache Arrow should provide a way to
> > specify character encodings for values in arrays.
> >
> > If people can constrain themselves to UTF-8 for all string data,
> > StringArray is enough for them. But if they cannot unify the character
> > encoding of string data to UTF-8, should Apache Arrow provide a
> > standard way to manage character encodings?
> >
> > An example use of Apache Arrow in such a case is applying it to the
> > internal data of an OR mapper library, such as ActiveRecord in Ruby on
> > Rails.
> >
> > My opinion is that Apache Arrow must have a standard way in both its
> > format and its API.  The reasons are below:
> >
> > (1) Currently, when we use MySQL or PostgreSQL as the data source of
> > record batch streams, we lose the information about the character
> > encoding of the original data
> >
> > (2) We need to struggle to support character encoding handling for
> > each combination of systems if we don’t have a standard way of
> > managing character encodings, and this does not fit Apache
> > Arrow’s philosophy
> >
> > (3) We cannot support character encoding handling at the level of
> > language bindings if Apache Arrow doesn’t provide standard APIs for
> > character encoding management
> >
> > There are two options for managing the character encoding of a
> > BinaryArray.  The first is to introduce an optional
> > character_encoding field in BinaryType.  The second is to use the
> > custom_metadata field to supply the character encoding name.
> >
> > If we use custom_metadata, we should decide the key for this
> > information.  I guess “charset” is a good candidate for the key because
> > it is widely used for specifying which character encoding is used.
> > The value must be the name of a character encoding, such as “UTF-8”
> > or “Windows-31J”.  It would be better if we could decide on canonical
> > encoding names, but I guess that is hard work because many systems use
> > the same name for different encodings.  For example, “Shift_JIS” means
> > either IANA’s Shift_JIS or Windows-31J; they use the same coding rule,
> > but the corresponding character sets are slightly different.  See the
> > spreadsheet [1] for the correspondence of character encoding names
> > between MySQL, PostgreSQL, Ruby, Python, IANA [3], and the WHATWG
> > Encoding Standard [4].
> >
> > If we introduce a new optional field for character encoding
> > information in BinaryType, I recommend making this new field a
> > string that holds the name of a character encoding.  But it is also
> > possible to make the field an integer that holds an enum value.  I
> > don’t know whether there is a good standard for enum values of
> > character encodings.  IANA manages MIBenum [2], though I think the
> > registered character encodings [3] are not enough for our requirements.
> >
> > I prefer the second way because the first way can supply
> > character encoding information only to a Field, not to a BinaryArray.
> >
> > [1] https://docs.google.com/spreadsheets/d/1D0xlI5r2wJUV45aTY1q2TwqD__v7acmd8FOfr8xSOVQ/edit?usp=sharing
> > [2] https://tools.ietf.org/html/rfc3808
> > [3] https://www.iana.org/assignments/character-sets/character-sets.xhtml
> > [4] https://encoding.spec.whatwg.org/
> >
> > --
> > Kenta Murata
>
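
The Shift_JIS ambiguity described in the quoted message can be observed with
Python's built-in codecs, where "shift_jis" and "cp932" (Python's name for
the Windows-31J variant) are separate codecs. This is only a small sketch;
the helper name decode_with_each is hypothetical and not part of any Arrow
API:

```python
def decode_with_each(raw, encodings):
    """Decode the same bytes under several encoding names (hypothetical helper).

    Returns a dict mapping each encoding name to the decoded text, or to
    None when the bytes are not valid in that encoding.
    """
    results = {}
    for name in encodings:
        try:
            results[name] = raw.decode(name)
        except UnicodeDecodeError:
            results[name] = None
    return results


# 0x81 0x60 is a valid two-byte sequence under both codecs.
results = decode_with_each(b"\x81\x60", ["shift_jis", "cp932"])
# Show code points rather than the characters themselves.
for name, text in results.items():
    print(name, None if text is None else [hex(ord(ch)) for ch in text])
```

The same two bytes decode to different code points under the two names,
which is why a bare "Shift_JIS" label in metadata may not be enough to
round-trip the data faithfully.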
