There are no plans to improve character and collation support, but we welcome efforts to do so.
I don't think there is a "typical" database. SQL Server inherits its i18n support from Windows, so it tends to do things very differently than, say, Oracle.

I would tend to be guided by two principles: First, be compatible with the SQL standard. (It covers certain things well, and other things it leaves to the discretion of the implementation.) Second, leverage the Java platform as much as possible. I18n support is very hard, and while Microsoft has Windows to build on, we have Java. (When I implemented Mondrian, an open source implementation in Java of MDX, a language designed by Microsoft within the Windows ecosystem, I spent a huge amount of effort re-implementing Visual Basic format strings, locales, code pages etc. I don't want to do that again.)

The path of least resistance is probably to translate Microsoft and MySQL constructs down to their equivalents in Java. Maybe you can persuade me that that is not possible, but expect me to be skeptical at first.

A good place to start this process would be to log a JIRA case and write some unit tests that illustrate what our current charset/collation support does, then extend those tests to show what support for Microsoft and MySQL would look like.

Julian

On Mon, Oct 9, 2017 at 12:00 AM, Kaiwang Chen <[email protected]> wrote:
> Hi there,
>
> Looks like both charset and collation in Calcite do not map to a typical
> database implementation easily. Is there any plan to improve character and
> collation support?
>
> Currently Calcite supports charsets UTF16 and ISO-8859-1 (LATIN1), which is
> enforced in SqlUtil.translateCharacterSetName, which in turn is used by
> SqlDataTypeSpec and NlsString.
> Calcite uses SqlCollation to support collation, with a serialized form like
> "ISO-8859-1$en_US".
>
> Character set: MySQL uses character set names like latin1, utf8, etc.
> Calcite uses Java Charset to hold the charset. There is a SerializedCharset in
> SqlCollation. RelDataType refers to Java Charset.
>
> Collation: A typical database like MS SQL Server supports collation names
> like Latin1_General_CS_AS_KS_WS (Windows collation name) as well as SQL
> collation names.
> (See
> https://docs.microsoft.com/en-us/sql/relational-databases/collations/collation-and-unicode-support)
> Basically they are serialized forms of several properties like case
> sensitivity, accent sensitivity, etc. MySQL uses collation names like
> utf8_unicode_ci, which encode similar properties.
>
> Calcite collation uses Java Locale, which is hard to map to vendor
> properties.
>
> A possible solution is to introduce SqlCharset to substitute for Charset and
> make SqlCollation refer to SqlCharset instead. SqlCollation itself can be
> improved with typical properties. The parser can be improved to support
> different serialized forms. SqlTypeUtil.isCharTypeComparable still applies
> with a few comparison adjustments. Additionally, it is reasonable to have the
> default collation supplied by RelDataTypeFactory alongside the default
> charset rather than directly from SaffronProperties.
>
> The problem is the incompatible change to RelDataType and
> RelDataTypeFactory.
>
> Thanks,
> kaiwang
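
As a rough illustration of what "translating vendor constructs down to their equivalents in Java" might look like, the sketch below maps the case and accent flags in a Windows-style collation name such as Latin1_General_CI_AS onto java.text.Collator strengths, and a few MySQL character set names onto java.nio.charset.Charset. The class and method names (VendorCollationSketch, charsetFor, collatorFor), the fixed Locale.US, and the flag-to-strength mapping are all assumptions made for illustration. None of this is existing Calcite API, and real vendor collations carry more properties (kana and width sensitivity, for example) than this handles.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.text.Collator;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

/** Hypothetical sketch; not part of Calcite. */
class VendorCollationSketch {

  /**
   * Maps a MySQL-style character set name to a Java Charset.
   * The mapping is an assumption that follows the thread's
   * ISO-8859-1 (LATIN1) pairing; it ignores vendor-specific quirks.
   */
  static Charset charsetFor(String mysqlName) {
    switch (mysqlName.toLowerCase(Locale.ROOT)) {
      case "latin1":
        return StandardCharsets.ISO_8859_1;
      case "utf8":
        return StandardCharsets.UTF_8;
      case "utf16":
        return StandardCharsets.UTF_16;
      default:
        throw new IllegalArgumentException("Unmapped charset: " + mysqlName);
    }
  }

  /**
   * Builds a Collator from the CI/CS and AI/AS flags of a Windows-style
   * collation name. Assumed mapping: CS+AS -> TERTIARY, CI+AS -> SECONDARY,
   * CI+AI -> PRIMARY. CS+AI has no direct Collator strength, so it is rejected.
   */
  static Collator collatorFor(String collationName) {
    List<String> flags =
        Arrays.asList(collationName.toUpperCase(Locale.ROOT).split("_"));
    boolean caseSensitive = flags.contains("CS");
    boolean accentSensitive = flags.contains("AS");

    // Locale is fixed here for simplicity; a real mapping would derive it.
    Collator collator = Collator.getInstance(Locale.US);
    if (caseSensitive && accentSensitive) {
      collator.setStrength(Collator.TERTIARY);
    } else if (accentSensitive) {
      collator.setStrength(Collator.SECONDARY);
    } else if (!caseSensitive) {
      collator.setStrength(Collator.PRIMARY);
    } else {
      throw new UnsupportedOperationException(
          "No single Collator strength for case-sensitive, accent-insensitive");
    }
    return collator;
  }

  public static void main(String[] args) {
    Collator ciAs = collatorFor("Latin1_General_CI_AS");
    Collator ciAi = collatorFor("Latin1_General_CI_AI");
    // Compare strings that differ only in case, then only in accents.
    System.out.println("CI_AS, abc vs ABC: " + ciAs.compare("abc", "ABC"));
    System.out.println("CI_AS, resume vs r\u00e9sum\u00e9: "
        + ciAs.compare("resume", "r\u00e9sum\u00e9"));
    System.out.println("CI_AI, resume vs r\u00e9sum\u00e9: "
        + ciAi.compare("resume", "r\u00e9sum\u00e9"));
    System.out.println("latin1 -> " + charsetFor("latin1"));
  }
}
```

Checks along these lines could also serve as the kind of unit tests suggested above: first pin down what the current charset/collation support does, then add cases showing what Microsoft and MySQL names should resolve to.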
