Hi there,

It looks like both charset and collation in Calcite do not map easily to a
typical database implementation. Is there any plan to improve character set
and collation support?

Currently Calcite supports the charsets UTF16 and ISO-8859-1 (LATIN1); this
restriction is enforced in SqlUtil.translateCharacterSetName, which in turn
is used by SqlDataTypeSpec and NlsString.
Calcite uses SqlCollation to support collation, with a serialized form like
"ISO-8859-1$en_US".

Character set: MySQL uses character set names like latin1, utf8, etc.
Calcite uses Java Charset to hold the charset. There is a SerializedCharset
in SqlCollation, and RelDataType refers to Java Charset.
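
To make the current shape concrete, here is a minimal sketch of how a
serialized form like "ISO-8859-1$en_US" breaks down into a Java Charset and
a Java Locale. CollationNameSketch and its parse method are hypothetical
names used only for illustration; this is not Calcite's actual code, and it
assumes the form is simply "charset$locale".

```java
import java.nio.charset.Charset;
import java.util.Locale;

// Hypothetical helper, not part of Calcite: splits "charset$locale".
final class CollationNameSketch {
    final Charset charset;
    final Locale locale;

    private CollationNameSketch(Charset charset, Locale locale) {
        this.charset = charset;
        this.locale = locale;
    }

    static CollationNameSketch parse(String serialized) {
        String[] parts = serialized.split("\\$");
        if (parts.length < 2) {
            throw new IllegalArgumentException(
                "expected charset$locale, got: " + serialized);
        }
        // Charset.forName throws if the JVM does not know the name.
        Charset charset = Charset.forName(parts[0]);
        // "en_US" -> "en-US" so it can be read as a BCP 47 language tag.
        Locale locale = Locale.forLanguageTag(parts[1].replace('_', '-'));
        return new CollationNameSketch(charset, locale);
    }

    public static void main(String[] args) {
        CollationNameSketch c = parse("ISO-8859-1$en_US");
        System.out.println(c.charset.name() + " / "
            + c.locale.getLanguage() + " / " + c.locale.getCountry());
    }
}
```

Note that nothing in this pair carries case or accent sensitivity, which is
where the mismatch with vendor collation names comes from.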

Collation: A typical database like MS SQL Server supports collation names
such as Latin1_General_CS_AS_KS_WS (a Windows collation name) as well as SQL
collation names.
(See
https://docs.microsoft.com/en-us/sql/relational-databases/collations/collation-and-unicode-support)
Basically, these names are serialized forms of several properties such as
case sensitivity, accent sensitivity, etc. MySQL uses collation names like
utf8_unicode_ci, which encode similar properties.
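
As a rough illustration of "serialized properties", the sketch below pulls
case and accent sensitivity out of the two example names above by looking
at their underscore-separated tokens. VendorCollationFlags is a hypothetical
class, and the rule that a missing CS/AS token means "insensitive" (and that
a BIN token means fully sensitive) is my simplifying assumption, not a
complete description of either vendor's naming scheme.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper: reads sensitivity flags from a vendor collation name.
final class VendorCollationFlags {
    final boolean caseSensitive;
    final boolean accentSensitive;

    private VendorCollationFlags(boolean caseSensitive, boolean accentSensitive) {
        this.caseSensitive = caseSensitive;
        this.accentSensitive = accentSensitive;
    }

    static VendorCollationFlags parse(String collationName) {
        // Upper-case so "ci" (MySQL style) and "CI" (SQL Server style) match.
        Set<String> tokens = new HashSet<>(Arrays.asList(
            collationName.toUpperCase(Locale.ROOT).split("_")));
        // Assumption: a binary collation distinguishes case and accents.
        boolean binary = tokens.contains("BIN");
        boolean caseSensitive = binary || tokens.contains("CS");
        boolean accentSensitive = binary || tokens.contains("AS");
        return new VendorCollationFlags(caseSensitive, accentSensitive);
    }

    public static void main(String[] args) {
        for (String name : new String[] {
                "Latin1_General_CS_AS_KS_WS", "utf8_unicode_ci"}) {
            VendorCollationFlags f = parse(name);
            System.out.println(name + ": caseSensitive=" + f.caseSensitive
                + ", accentSensitive=" + f.accentSensitive);
        }
    }
}
```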

Calcite collation uses Java Locale, which is hard to map to these vendor
properties.

A possible solution is to introduce SqlCharset as a substitute for Charset
and make SqlCollation refer to SqlCharset instead. SqlCollation itself can be
extended with the typical properties. The parser can be extended to support
the different serialized forms. SqlTypeUtil.isCharTypeComparable still
applies with a few comparison adjustments. Additionally, it is reasonable to
have RelDataTypeFactory supply the default collation together with the
default charset, rather than taking it directly from SaffronProperties.
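
A very rough sketch of what I have in mind is below. Everything in it is
hypothetical: the class names carry a "Sketch" suffix to make clear they are
not existing Calcite classes, the two sensitivity fields stand in for the
"typical properties", and isComparableWith is only my guess at the kind of
adjusted comparison that would be needed, not what
SqlTypeUtil.isCharTypeComparable does today.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical shape of the proposal; none of this exists in Calcite.
final class CharsetCollationSketch {

    // Vendor-facing charset name plus the Java Charset used internally.
    static final class SqlCharsetSketch {
        final String sqlName;
        final Charset javaCharset;

        SqlCharsetSketch(String sqlName, Charset javaCharset) {
            this.sqlName = sqlName;
            this.javaCharset = javaCharset;
        }
    }

    // Collation described by explicit properties instead of a Locale.
    static final class SqlCollationSketch {
        final SqlCharsetSketch charset;
        final boolean caseSensitive;
        final boolean accentSensitive;

        SqlCollationSketch(SqlCharsetSketch charset,
                boolean caseSensitive, boolean accentSensitive) {
            this.charset = charset;
            this.caseSensitive = caseSensitive;
            this.accentSensitive = accentSensitive;
        }

        // Assumption: comparable only if charset and all properties agree.
        boolean isComparableWith(SqlCollationSketch other) {
            return charset.javaCharset.equals(other.charset.javaCharset)
                && caseSensitive == other.caseSensitive
                && accentSensitive == other.accentSensitive;
        }
    }

    public static void main(String[] args) {
        // Pairing the name "latin1" with ISO-8859-1 is illustrative only.
        SqlCharsetSketch latin1 =
            new SqlCharsetSketch("latin1", StandardCharsets.ISO_8859_1);
        SqlCollationSketch ci = new SqlCollationSketch(latin1, false, false);
        SqlCollationSketch cs = new SqlCollationSketch(latin1, true, true);
        System.out.println("ci vs ci: " + ci.isComparableWith(ci));
        System.out.println("ci vs cs: " + ci.isComparableWith(cs));
    }
}
```

In a design like this the default pair (charset plus collation) would be
handed out from one place, which is why RelDataTypeFactory seems the natural
home for it.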

The problem is that this would be an incompatible change to RelDataType and
RelDataTypeFactory.

Thanks,
kaiwang
