Hi there,

It looks like neither charset nor collation in Calcite maps easily onto a typical database implementation. Is there any plan to improve character set and collation support?

Currently Calcite supports the charsets UTF16 and ISO-8859-1 (LATIN1). This is enforced in SqlUtil.translateCharacterSetName, which in turn is used by SqlDataTypeSpec and NlsString. Calcite uses SqlCollation to support collation, with a serialized form like "ISO-8859-1$en_US".

Character set: MySQL uses character set names like latin1, utf8, etc. Calcite uses Java Charset to hold the charset. There is a SerializedCharset in SqlCollation, and RelDataType refers to Java Charset.

Collation: A typical database like MS SQL Server supports collation names like Latin1_General_CS_AS_KS_WS (a Windows collation name) as well as SQL collation names. (See https://docs.microsoft.com/en-us/sql/relational-databases/collations/collation-and-unicode-support) Basically, these names are serialized forms of several properties such as case sensitivity, accent sensitivity, etc. MySQL uses collation names like utf8_unicode_ci, which encode similar properties. Calcite collation uses Java Locale, which is hard to map to vendor properties.

A possible solution is to introduce SqlCharset to replace Charset and make SqlCollation refer to SqlCharset instead. SqlCollation itself can be extended with the typical properties, and the parser can be improved to support the different serialized forms. SqlTypeUtil.isCharTypeComparable still applies, with a few adjustments to the comparison. Additionally, it seems reasonable to have RelDataTypeFactory supply the default collation alongside the default charset, rather than taking it directly from SaffronProperties.

The problem is that this is an incompatible change to RelDataType and RelDataTypeFactory.

Thanks,
kaiwang
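
P.S. To make the idea of "collation as a set of properties" more concrete, here is a rough sketch, not a proposed patch, of how vendor collation names could be reduced to a few common properties. Everything in it is hypothetical: CollationSketch, Props, parseMySql and parseSqlServer do not exist in Calcite, and only case and accent sensitivity are modelled. It assumes MySQL's convention that a name without _ai/_as takes its accent sensitivity from _ci/_cs, and it treats binary (_bin) collations as sensitive to both.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

/** Hypothetical sketch; none of these names are part of Calcite. */
final class CollationSketch {

  /** Vendor-neutral collation properties (illustrative subset only). */
  static final class Props {
    final String charset;          // vendor charset name, or null if the name carries none
    final boolean caseSensitive;
    final boolean accentSensitive;

    Props(String charset, boolean caseSensitive, boolean accentSensitive) {
      this.charset = charset;
      this.caseSensitive = caseSensitive;
      this.accentSensitive = accentSensitive;
    }

    @Override public String toString() {
      return "Props(charset=" + charset + ", caseSensitive=" + caseSensitive
          + ", accentSensitive=" + accentSensitive + ")";
    }
  }

  /** Parses a MySQL-style name such as "utf8_unicode_ci"; the first token is the charset. */
  static Props parseMySql(String name) {
    String[] parts = name.toLowerCase(Locale.ROOT).split("_");
    return fromTokens(parts[0], Arrays.asList(parts).subList(1, parts.length));
  }

  /** Parses a SQL Server Windows collation name such as "Latin1_General_CS_AS_KS_WS". */
  static Props parseSqlServer(String name) {
    String[] parts = name.toLowerCase(Locale.ROOT).split("_");
    // The name itself carries no charset token, so charset is left null here.
    return fromTokens(null, Arrays.asList(parts));
  }

  /** Derives the properties from lower-cased name tokens. */
  private static Props fromTokens(String charset, List<String> tokens) {
    boolean binary = tokens.contains("bin") || tokens.contains("bin2");
    // Assumption: a name with neither "cs" nor a binary marker is case-insensitive.
    boolean caseSensitive = binary || tokens.contains("cs");
    boolean accentSensitive;
    if (binary || tokens.contains("as")) {
      accentSensitive = true;
    } else if (tokens.contains("ai")) {
      accentSensitive = false;
    } else {
      // No explicit accent token: follow case sensitivity (MySQL convention).
      accentSensitive = caseSensitive;
    }
    return new Props(charset, caseSensitive, accentSensitive);
  }

  public static void main(String[] args) {
    System.out.println(parseMySql("utf8_unicode_ci"));
    System.out.println(parseSqlServer("Latin1_General_CS_AS_KS_WS"));
  }
}
```

A real SqlCollation would also need kana and width sensitivity, plus a mapping from the vendor charset name to something like the SqlCharset suggested above. The point is only that comparisons could then work on properties instead of on a Java Locale.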
