[jira] [Commented] (CALCITE-6001) Add useUtf8AsDefaultCharset flag to SqlConformanceEnum to allow encoding of non-ISO-8859-1 characters

Julian Hyde (Jira) Wed, 18 Oct 2023 08:51:05 -0700


    [ 
https://issues.apache.org/jira/browse/CALCITE-6001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17776753#comment-17776753
 ]


Julian Hyde commented on CALCITE-6001:
--------------------------------------

Can we be sure to test on characters that are (1) 7-bit ASCII, (2) 8-bit ASCII, 
(3) UTF-8, (4) non-UTF-8? (Maybe category 4 is empty... are there any Unicode 
characters that cannot be expressed in UTF-8?)

And each test should call out which category it is testing. This will be 
valuable because databases will inevitably have different levels of support.
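
A rough sketch of how test inputs could be labeled by category, using only the JDK's CharsetEncoder.canEncode. The class name, method names, labels, and sample characters are assumptions made for illustration and are not part of Calcite; "8-bit ASCII" is taken here to mean ISO-8859-1. For category 4 the sketch uses an unpaired surrogate, which a Java string can hold but which is not a valid Unicode scalar value and so has no UTF-8 encoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetCategories {
  /** Returns true if the charset can encode every character of the text. */
  public static boolean canEncode(Charset charset, String text) {
    return charset.newEncoder().canEncode(text);
  }

  /** Labels the text with the narrowest category that can encode it (labels are illustrative). */
  public static String classify(String text) {
    if (canEncode(StandardCharsets.US_ASCII, text)) {
      return "7-bit ASCII";
    }
    if (canEncode(StandardCharsets.ISO_8859_1, text)) {
      return "8-bit (ISO-8859-1)";
    }
    if (canEncode(StandardCharsets.UTF_8, text)) {
      return "UTF-8 only";
    }
    return "not encodable in UTF-8";
  }

  public static void main(String[] args) {
    // Samples: 'a', e-acute, Hebrew qof, an emoji (surrogate pair), a lone high surrogate.
    String[] samples = {"a", "\u00e9", "\u05e7", "\ud83d\ude00", "\ud800"};
    for (String sample : samples) {
      // Print the UTF-16 code units in hex rather than the raw characters.
      StringBuilder units = new StringBuilder();
      for (char c : sample.toCharArray()) {
        units.append(String.format("\\u%04x", (int) c));
      }
      System.out.println(units + " -> " + classify(sample));
    }
  }
}
```

Tagging each test input this way would make it easy to see which category a given database fails on.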



> Add useUtf8AsDefaultCharset flag to SqlConformanceEnum to allow encoding of 
> non-ISO-8859-1 characters
> -----------------------------------------------------------------------------------------------------
>
>                 Key: CALCITE-6001
>                 URL: https://issues.apache.org/jira/browse/CALCITE-6001
>             Project: Calcite
>          Issue Type: New Feature
>            Reporter: Tanner Clary
>            Assignee: Tanner Clary
>            Priority: Major
>              Labels: pull-request-available
>
> Many dialects supported by Calcite encode their strings using a default 
> charset (most commonly UTF-8 or ISO-8859-1). For example, BigQuery uses 
> [UTF-8|https://cloud.google.com/bigquery/docs/reference/standard-sql/data-types#string_type].
>  I am proposing to add a dialect property to be referenced when converting 
> string literals so that the current dialect's default is used unless 
> otherwise specified.
> Presently, if no charset is specified when converting to RexLiterals 
> [here|https://github.com/apache/calcite/blob/main/core/src/main/java/org/apache/calcite/rex/RexBuilder.java#L1618],
>  the CalciteSystemProperty {{DEFAULT_CHARSET}} is used 
> ([docs|https://github.com/apache/calcite/blob/main/core/src/main/java/org/apache/calcite/config/CalciteSystemProperty.java#L300])
>  which is set as ISO-8859-1.
> This means that when converting a query like:
> {{select 'ק' as result;}}
>  you will get the following error: {{Failed to encode 'ק' in character 
> set 'ISO-8859-1'}}.
> This failure is unexpected if you are using BigQuery conformance (or any 
> dialect whose default is UTF-8).
> Of course, an alternative solution would be to change the Calcite default 
> to UTF-8, which can encode any Unicode character, whereas ISO-8859-1 can 
> only encode the first 256 code points, but I imagine there are reasons against this.
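
The fallback described in the issue above can be illustrated with a minimal JDK-only sketch. It does not use or reproduce Calcite's RexBuilder or SqlConformance code: the class, the resolveCharset helper, and the boolean parameter are hypothetical stand-ins for the proposed flag. It only shows that a strict ISO-8859-1 encoder rejects 'ק' while UTF-8 accepts it.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class DefaultCharsetSketch {
  /**
   * Hypothetical helper: an explicitly specified charset wins; otherwise the
   * flag chooses between UTF-8 and the current ISO-8859-1 default.
   */
  public static Charset resolveCharset(Charset explicit, boolean useUtf8AsDefault) {
    if (explicit != null) {
      return explicit;
    }
    return useUtf8AsDefault ? StandardCharsets.UTF_8 : StandardCharsets.ISO_8859_1;
  }

  /** Strictly encodes the text; returns the byte count, or -1 if it cannot be encoded. */
  public static int encodedLength(String text, Charset charset) {
    try {
      // A fresh encoder reports unmappable characters by throwing.
      ByteBuffer bytes = charset.newEncoder().encode(CharBuffer.wrap(text));
      return bytes.remaining();
    } catch (CharacterCodingException e) {
      return -1;
    }
  }

  public static void main(String[] args) {
    String literal = "\u05e7"; // the Hebrew letter from the example query
    System.out.println(encodedLength(literal, resolveCharset(null, false)));
    System.out.println(encodedLength(literal, resolveCharset(null, true)));
  }
}
```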



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
