Re: Calcite character set and collation support

Julian Hyde Tue, 10 Oct 2017 14:06:52 -0700

There are no plans to improve character and collation support, but we
welcome efforts to do so.

I don't think there is a "typical" database. SQL Server inherits its
i18n support from Windows, so tends to do things very differently
than, say, Oracle.

I would tend to be guided by two principles:

First, be compatible with the SQL standard. (It covers certain things
well, and other things it leaves to the discretion of the
implementation.)

Second, leverage the Java platform as much as possible. I18n support
is very hard, and while Microsoft has Windows to build on, we have
Java.

(When I implemented Mondrian, an open source Java implementation of
MDX, a language designed by Microsoft within the Windows ecosystem, I
spent a huge amount of effort re-implementing Visual Basic format
strings, locales, code pages etc. I don't want to do that again.)

The path of least resistance is probably to translate Microsoft and
MySQL constructs down to their equivalents in Java. Maybe you can
persuade me that that is not possible, but expect me to be skeptical
at first.
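
As a rough sketch of what "translating down to Java" might look like
(this is not Calcite code; the class and method names are hypothetical,
and mapping MySQL's latin1 to ISO-8859-1 is only an approximation), the
case/accent flags in vendor collation names can be mapped onto
java.text.Collator strengths:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.text.Collator;
import java.util.Locale;

/** Hypothetical helper, for illustration only; not part of Calcite. */
class VendorI18nSketch {

  /** Maps a few MySQL character set names to Java charsets. */
  static Charset charsetForMySql(String name) {
    switch (name.toLowerCase(Locale.ROOT)) {
      case "latin1":
        // Assumption: treated as ISO-8859-1, which is only an approximation.
        return StandardCharsets.ISO_8859_1;
      case "utf8":
      case "utf8mb4":
        return StandardCharsets.UTF_8;
      default:
        throw new IllegalArgumentException("unmapped character set: " + name);
    }
  }

  /**
   * Derives a Collator strength from the CI/CS and AI/AS tokens of a
   * vendor collation name such as "utf8_unicode_ci" or
   * "Latin1_General_CS_AS_KS_WS". Other tokens (KS, WS, bin, ...) are ignored.
   */
  static int strengthFor(String collationName) {
    boolean caseSensitive = false;
    boolean accentSensitive = false;
    boolean accentTokenSeen = false;
    for (String token : collationName.toUpperCase(Locale.ROOT).split("_")) {
      switch (token) {
        case "CS": caseSensitive = true; break;
        case "CI": caseSensitive = false; break;
        case "AS": accentSensitive = true; accentTokenSeen = true; break;
        case "AI": accentSensitive = false; accentTokenSeen = true; break;
        default: break;
      }
    }
    if (!accentTokenSeen) {
      // MySQL convention: without _ai/_as, accent sensitivity follows case.
      accentSensitive = caseSensitive;
    }
    if (caseSensitive && !accentSensitive) {
      // The three basic Collator strengths cannot express this combination.
      throw new UnsupportedOperationException(collationName);
    }
    if (caseSensitive) {
      return Collator.TERTIARY;   // case and accents both significant
    }
    return accentSensitive ? Collator.SECONDARY : Collator.PRIMARY;
  }

  /** Builds a Collator for the given locale with the derived strength. */
  static Collator collatorFor(String collationName, Locale locale) {
    Collator collator = Collator.getInstance(locale);
    collator.setStrength(strengthFor(collationName));
    return collator;
  }

  public static void main(String[] args) {
    Collator ci = collatorFor("utf8_unicode_ci", Locale.ENGLISH);
    System.out.println(ci.compare("abc", "ABC") == 0);
    System.out.println(charsetForMySql("utf8"));
  }
}
```

Case-sensitive but accent-insensitive collations have no direct
equivalent among the basic Collator strengths, which is the kind of gap
the unit tests would need to expose.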

A good place to start this process would be to log a JIRA case and
write some unit tests that illustrate what our current
charset/collation support does, then extend those tests to show what
support for Microsoft and MySQL would look like.

Julian

On Mon, Oct 9, 2017 at 12:00 AM, Kaiwang Chen <[email protected]> wrote:
> Hi there,
>
> Looks like both charset and collation in Calcite do not map to a typical
> database implementation easily. Is there any plan to improve character and
> collation support?
>
> Currently Calcite supports the charsets UTF16 and ISO-8859-1 (LATIN1);
> this is enforced in SqlUtil.translateCharacterSetName, which in turn is
> used by SqlDataTypeSpec and NlsString.
> Calcite uses SqlCollation to support collation, with a serialized form
> like "ISO-8859-1$en_US".
>
> Character set: MySQL uses character set names like latin1, utf8, etc.
> Calcite uses Java Charset to hold the charset. There is a
> SerializedCharset in SqlCollation. RelDataType refers to Java Charset.
>
> Collation: A typical database like MS SQL Server supports collation names
> like Latin1_General_CS_AS_KS_WS (Windows collation names) as well as SQL
> collation names.
> (See
> https://docs.microsoft.com/en-us/sql/relational-databases/collations/collation-and-unicode-support)
> Basically they are serialized forms of several properties like case
> sensitivity, accent sensitivity, etc. MySQL uses collation names like
> utf8_unicode_ci, which encode similar properties.
>
> Calcite collation uses Java Locale, which is hard to map to vendor
> properties.
>
> A possible solution is to introduce SqlCharset to replace Charset and
> make SqlCollation refer to SqlCharset instead. SqlCollation itself can be
> improved with typical properties. The parser can be improved to support
> different serialized forms. SqlTypeUtil.isCharTypeComparable still applies
> with a few comparison adjustments. Additionally, it is reasonable to have
> RelDataTypeFactory supply the default collation alongside the default
> charset, rather than taking it directly from SaffronProperties.
>
> The problem is that this would be an incompatible change to RelDataType
> and RelDataTypeFactory.
>
> Thanks,
> kaiwang
