Tanner Clary created CALCITE-6001:
-------------------------------------

             Summary: Add default charset per dialect
                 Key: CALCITE-6001
                 URL: https://issues.apache.org/jira/browse/CALCITE-6001
             Project: Calcite
          Issue Type: New Feature
            Reporter: Tanner Clary
            Assignee: Tanner Clary


Many dialects supported by Calcite encode their strings using a default charset 
(most commonly UTF-8 or ISO-8859-1). For example, BigQuery uses 
[UTF-8|https://cloud.google.com/bigquery/docs/reference/standard-sql/data-types#string_type].
 I propose adding a dialect property that is consulted when converting 
string literals, so that the current dialect's default charset is used unless 
one is explicitly specified.
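
As a rough illustration of the proposed lookup order (explicit charset first, then the dialect default, then the existing global default), here is a minimal sketch. The class {{CharsetResolution}}, the method {{resolve}} and the constant {{SYSTEM_DEFAULT}} are hypothetical names made up for this example; they are not existing Calcite APIs, and the real change may look quite different.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch of the proposed fallback order; not Calcite code. */
public class CharsetResolution {
  /** Stand-in for the current global default described in this issue. */
  public static final Charset SYSTEM_DEFAULT = StandardCharsets.ISO_8859_1;

  /**
   * Picks the charset for a string literal.
   *
   * @param explicit       charset given on the literal, or null if none
   * @param dialectDefault default charset of the current dialect, or null
   */
  public static Charset resolve(Charset explicit, Charset dialectDefault) {
    if (explicit != null) {
      return explicit;        // an explicit charset always wins
    }
    if (dialectDefault != null) {
      return dialectDefault;  // proposed: fall back to the dialect's default
    }
    return SYSTEM_DEFAULT;    // today's behaviour: the global default
  }

  public static void main(String[] args) {
    // A dialect that declares UTF-8 and a literal with no explicit charset.
    System.out.println(resolve(null, StandardCharsets.UTF_8));
    // No dialect default: behaviour is unchanged from today.
    System.out.println(resolve(null, null));
  }
}
```

With this ordering, dialects that do not declare a default would keep the current behaviour.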

Presently, if no charset is specified when converting to RexLiterals 
[here|https://github.com/apache/calcite/blob/main/core/src/main/java/org/apache/calcite/rex/RexBuilder.java#L1618],
 the CalciteSystemProperty {{DEFAULT_CHARSET}} is used 
([docs|https://github.com/apache/calcite/blob/main/core/src/main/java/org/apache/calcite/config/CalciteSystemProperty.java#L300])
 which defaults to ISO-8859-1.

This means that when converting a query like:
{{select 'ק' as result;}}
 you will get the following error: {{Failed to encode 'ק' in character set 
'ISO-8859-1'}}.
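
The underlying limitation can be checked with the JDK's charset API alone. The sketch below does not call into Calcite; the class name {{CharsetCheck}} and its helper method are made up for this example. It only shows that ISO-8859-1 has no encoding for 'ק' while UTF-8 does.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Illustrative check using only the JDK; not Calcite code. */
public class CharsetCheck {
  /** Returns whether every character of s can be encoded in cs. */
  public static boolean canEncode(String s, Charset cs) {
    // A fresh encoder is created per call because encoders are stateful.
    return cs.newEncoder().canEncode(s);
  }

  public static void main(String[] args) {
    String literal = "\u05E7"; // Hebrew letter qof, the character in the query
    System.out.println("ISO-8859-1: "
        + canEncode(literal, StandardCharsets.ISO_8859_1));
    System.out.println("UTF-8: "
        + canEncode(literal, StandardCharsets.UTF_8));
  }
}
```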

This failure is unexpected if you are using BigQuery conformance (or any 
dialect whose default is UTF-8).

Of course, an alternative would be to simply change the Calcite default to 
UTF-8, which can encode any Unicode character, whereas ISO-8859-1 can only 
encode the first 256 code points. I imagine there are reasons against this, 
though.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
