[ https://issues.apache.org/jira/browse/CALCITE-6051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Shivangi updated CALCITE-6051:
------------------------------
    Description: 
Hi,

The Unicode strings returned by Calcite are in a broken format. For example, the string `Conveniência` is converted into `u&'Conveni\00eancia'`. Here `u&` comes from the `quoteStringLiteralUnicode` method in calcite-core-1.2.0-incubating-sources.jar!/org/apache/calcite/sql/SqlDialect.java:

{code:java}
/**
 * Converts a string into a unicode string literal. For example,
 * <code>can't{tab}run\</code> becomes <code>u'can''t\0009run\\'</code>.
 */
public void quoteStringLiteralUnicode(StringBuilder buf, String val) {
  buf.append("u&'");
  for (int i = 0; i < val.length(); i++) {
    char c = val.charAt(i);
    if (c < 32 || c >= 128) {
      buf.append('\\');
      buf.append(HEXITS[(c >> 12) & 0xf]);
      buf.append(HEXITS[(c >> 8) & 0xf]);
      buf.append(HEXITS[(c >> 4) & 0xf]);
      buf.append(HEXITS[c & 0xf]);
    } else if (c == '\'' || c == '\\') {
      buf.append(c);
      buf.append(c);
    } else {
      buf.append(c);
    }
  }
  buf.append("'");
}
{code}

Why is `buf.append("u&'")` added in this method? I couldn't find any related Unicode conversion that uses `u&`; as a result, the literal breaks when it is read by the client. I would like to understand why `u&` is used and what could break if we remove the `&`.

Thanks!

> Incorrect translation for unicode strings in SqlDialect's quoteStringLiteralUnicode method for HiveSqlDialect and SparkSqlDialect
> ---------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: CALCITE-6051
>                 URL: https://issues.apache.org/jira/browse/CALCITE-6051
>             Project: Calcite
>          Issue Type: Bug
>            Reporter: Shivangi
>            Priority: Major
>         Attachments: image-2023-10-16-18-54-53-483.png

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
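
To make the reported behaviour easy to reproduce, the escaping logic quoted in the description can be run in isolation. The sketch below is not Calcite code: the class name `UnicodeLiteralSketch`, the static `quote` helper, and the local `HEXITS` table (lowercase hex digits, chosen to match the `\00ea` output shown in the report) are assumptions made for illustration. For context, the `U&'...'` prefix with `\XXXX` escapes is the Unicode string literal form defined by standard SQL and accepted by, for example, PostgreSQL; whether Hive or Spark accept it is a separate question that this sketch does not answer.

```java
// Standalone sketch of the escaping loop quoted in the issue description.
// Hypothetical names: UnicodeLiteralSketch, quote, and this local HEXITS table.
final class UnicodeLiteralSketch {
  // Assumed to be lowercase hex digits, consistent with the reported output.
  private static final char[] HEXITS = "0123456789abcdef".toCharArray();

  /** Returns val as a u&'...' literal, escaping control and non-ASCII chars. */
  static String quote(String val) {
    StringBuilder buf = new StringBuilder();
    buf.append("u&'");
    for (int i = 0; i < val.length(); i++) {
      char c = val.charAt(i);
      if (c < 32 || c >= 128) {
        // Emit a backslash followed by the four hex digits of the UTF-16 unit.
        buf.append('\\');
        buf.append(HEXITS[(c >> 12) & 0xf]);
        buf.append(HEXITS[(c >> 8) & 0xf]);
        buf.append(HEXITS[(c >> 4) & 0xf]);
        buf.append(HEXITS[c & 0xf]);
      } else if (c == '\'' || c == '\\') {
        // Single quotes and backslashes are doubled.
        buf.append(c);
        buf.append(c);
      } else {
        buf.append(c);
      }
    }
    buf.append("'");
    return buf.toString();
  }

  public static void main(String[] args) {
    // The string from the report, with the accented letter written as an escape.
    System.out.println(quote("Conveni\u00eancia")); // prints u&'Conveni\00eancia'
  }
}
```

Running `main` reproduces the literal from the report, which suggests the output is what the method is written to produce; the open question is whether the target dialect can parse that literal form.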