[jira] [Comment Edited] (CALCITE-6051) Incorrect format for unicode strings
[ https://issues.apache.org/jira/browse/CALCITE-6051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17775696#comment-17775696 ]

LakeShen edited comment on CALCITE-6051 at 10/16/23 11:13 AM:
--------------------------------------------------------------

I'm sure that PG accepts 'u&', for example:

!image-2023-10-16-18-54-53-483.png|width=436,height=182!

So the problem is that different engines and databases have different levels of support for 'u&'; Hive and Spark do not support it.

I think the JIRA title could state this problem more clearly. How about `Incorrect translation for unicode strings in SqlDialect's quoteStringLiteralUnicode method for HiveSqlDialect and SparkSqlDialect`? Please also make the JIRA description clearer about your problem.

Maybe we could implement different behavior in the `quoteStringLiteralUnicode` method, depending on the type given by SqlDialect#databaseProduct.

> Incorrect format for unicode strings
> ------------------------------------
>
>                 Key: CALCITE-6051
>                 URL: https://issues.apache.org/jira/browse/CALCITE-6051
>             Project: Calcite
>          Issue Type: Bug
>            Reporter: Shivangi
>            Priority: Major
>         Attachments: image-2023-10-16-18-54-53-483.png
>
>
> Hi,
> The Unicode string literals returned by Calcite have a broken format. For
> example, the string `Conveniência` is converted into `u&'Conveni\00eancia'`.
> Here `u&` comes from the `quoteStringLiteralUnicode` method in
> calcite-core-1.2.0-incubating-sources.jar!/org/apache/calcite/sql/SqlDialect.java:
> {code:java}
>   /**
>    * Converts a string into a unicode string literal. For example,
>    * can't{tab}run\ becomes u'can''t\0009run\\'.
>    */
>   public void quoteStringLiteralUnicode(StringBuilder buf, String val) {
>     buf.append("u&'");
>     for (int i = 0; i < val.length(); i++) {
>       char c = val.charAt(i);
>       if (c < 32 || c >= 128) {
>         buf.append('\\');
>         buf.append(HEXITS[(c >> 12) & 0xf]);
>         buf.append(HEXITS[(c >> 8) & 0xf]);
>         buf.append(HEXITS[(c >> 4) & 0xf]);
>         buf.append(HEXITS[c & 0xf]);
>       } else if (c == '\'' || c == '\\') {
>         buf.append(c);
>         buf.append(c);
>       } else {
>         buf.append(c);
>       }
>     }
>     buf.append("'");
>   }
> {code}
> Why is `buf.append("u&'")` used in this method? I couldn't find any related
> Unicode notation that contains `u&`; as a result, the literal breaks when
> read by the client. I would like to understand why `u&` is used and what
> could break if we remove the `&`.
> Thanks!

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
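For readers who want to reproduce the reported output without a Calcite dependency, here is a minimal standalone sketch of the quoted loop. The class name `UnicodeLiteralDemo` is hypothetical, and the contents of `HEXITS` are an assumption (lowercase hex digits, inferred from the `\00ea` in the report), since the quoted snippet does not show that field. The `u&'...'` form with backslash plus four hex digits is the SQL-standard Unicode string literal, which is why PostgreSQL accepts it.

```java
public class UnicodeLiteralDemo {
  // Assumed lookup table: lowercase hex digits, inferred from the reported
  // output; the real HEXITS field is not shown in the quoted snippet.
  private static final char[] HEXITS = "0123456789abcdef".toCharArray();

  /** Same loop as the quoted method, returning a String for convenience. */
  public static String quoteStringLiteralUnicode(String val) {
    StringBuilder buf = new StringBuilder();
    buf.append("u&'");
    for (int i = 0; i < val.length(); i++) {
      char c = val.charAt(i);
      if (c < 32 || c >= 128) {
        // Control and non-ASCII characters become a backslash plus 4 hex digits.
        buf.append('\\');
        buf.append(HEXITS[(c >> 12) & 0xf]);
        buf.append(HEXITS[(c >> 8) & 0xf]);
        buf.append(HEXITS[(c >> 4) & 0xf]);
        buf.append(HEXITS[c & 0xf]);
      } else if (c == '\'' || c == '\\') {
        // Single quotes and backslashes are doubled.
        buf.append(c);
        buf.append(c);
      } else {
        buf.append(c);
      }
    }
    buf.append("'");
    return buf.toString();
  }

  public static void main(String[] args) {
    // The e-circumflex (U+00EA) is written as an escape to avoid encoding issues.
    System.out.println(quoteStringLiteralUnicode("Conveni\u00eancia"));
  }
}
```

Running `main` prints `u&'Conveni\00eancia'`, the literal from the report. The loop itself is therefore behaving as written; the failure comes from sending that literal form to an engine that does not parse it.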
[jira] [Comment Edited] (CALCITE-6051) Incorrect format for unicode strings
[ https://issues.apache.org/jira/browse/CALCITE-6051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17775690#comment-17775690 ]

Shivangi edited comment on CALCITE-6051 at 10/16/23 11:00 AM:
--------------------------------------------------------------

I also tested the query you shared on Hive and Spark:

Hive:
{code:java}
select u&'hello world';
Error: Error while compiling statement: FAILED: SemanticException [Error 10004]: Line 1:7 Invalid table alias or column reference 'u': (possible column names are: ) (state=42000,code=10004)
{code}

Spark:
{code:java}
select u&'hello world';
User class threw exception: org.apache.spark.sql.AnalysisException: cannot resolve 'u' given input columns: []; line 1 pos 7;
{code}
[jira] [Comment Edited] (CALCITE-6051) Incorrect format for unicode strings
[ https://issues.apache.org/jira/browse/CALCITE-6051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17775678#comment-17775678 ]

Shivangi edited comment on CALCITE-6051 at 10/16/23 10:40 AM:
--------------------------------------------------------------

Thanks for the quick response [~shenlang]!

We are using SqlDialect for Hive and Spark. In both cases, queries fail when they contain this encoding. For example, in Hive:
{code:java}
select * from somedb.some_table where city_id = u&'Conveni\00eancia';
{code}
Response:
{code:java}
FAILED: SemanticException [Error 10004]: Line 1:43 Invalid table alias or column reference 'u': (
{code}
This is HiveSqlDialect: https://github.com/apache/calcite/blob/main/core/src/main/java/org/apache/calcite/sql/dialect/HiveSqlDialect.java

HiveSqlDialect does not override the `quoteStringLiteralUnicode` method of SqlDialect.

So, is the output returned by SqlDialect containing `u&'` valid with respect to Postgres? Am I missing something here?

was (Author: shivincible):
Thanks for the quick response [~shenlang]!

We are using SqlDialect for Hive and Spark. In both cases, queries fail when they contain this encoding. For example, in Hive:
{code:java}
select * from somedb.some_table where city_id = u&'Conveni\00eancia';
{code}
Response:
{code:java}
FAILED: SemanticException [Error 10004]: Line 1:43 Invalid table alias or column reference 'u': (
{code}
This is HiveSqlDialect: https://github.com/apache/calcite/blob/main/core/src/main/java/org/apache/calcite/sql/dialect/HiveSqlDialect.java

So, is the output returned by SqlDialect containing `u&'` valid with respect to Presto? Am I missing something here?
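The missing override mentioned above suggests one possible direction, sketched here without any Calcite dependency. Everything in this block is hypothetical: `BaseDialect` and `HiveLikeDialect` are stand-ins rather than Calcite's `SqlDialect` and `HiveSqlDialect`, and the override assumes that the target engine accepts a backslash-u escape with four hex digits, plus backslash-escaped quotes, inside an ordinary single-quoted literal. That assumption should be verified for each engine, and this is not necessarily the fix Calcite would adopt.

```java
public class DialectOverrideSketch {
  // Assumed lowercase hex digits, as in the reported output.
  private static final char[] HEXITS = "0123456789abcdef".toCharArray();

  /** Appends the four hex digits of a UTF-16 code unit. */
  static void appendHex4(StringBuilder buf, char c) {
    buf.append(HEXITS[(c >> 12) & 0xf]);
    buf.append(HEXITS[(c >> 8) & 0xf]);
    buf.append(HEXITS[(c >> 4) & 0xf]);
    buf.append(HEXITS[c & 0xf]);
  }

  /** Hypothetical stand-in for the base dialect; mirrors the quoted method. */
  public static class BaseDialect {
    public void quoteStringLiteralUnicode(StringBuilder buf, String val) {
      buf.append("u&'");
      for (int i = 0; i < val.length(); i++) {
        char c = val.charAt(i);
        if (c < 32 || c >= 128) {
          buf.append('\\');
          appendHex4(buf, c);
        } else if (c == '\'' || c == '\\') {
          buf.append(c).append(c);
        } else {
          buf.append(c);
        }
      }
      buf.append('\'');
    }
  }

  /** Hypothetical dialect for engines that reject the u& prefix. */
  public static class HiveLikeDialect extends BaseDialect {
    @Override
    public void quoteStringLiteralUnicode(StringBuilder buf, String val) {
      // Assumption: the engine reads a backslash, the letter u and 4 hex
      // digits as one character inside a plain quoted literal.
      buf.append('\'');
      for (int i = 0; i < val.length(); i++) {
        char c = val.charAt(i);
        if (c < 32 || c >= 128) {
          buf.append("\\u");
          appendHex4(buf, c);
        } else if (c == '\'' || c == '\\') {
          // Assumption: quotes and backslashes are escaped with a backslash.
          buf.append('\\').append(c);
        } else {
          buf.append(c);
        }
      }
      buf.append('\'');
    }
  }

  /** Convenience wrapper that returns the quoted literal as a String. */
  public static String quote(BaseDialect dialect, String val) {
    StringBuilder buf = new StringBuilder();
    dialect.quoteStringLiteralUnicode(buf, val);
    return buf.toString();
  }

  public static void main(String[] args) {
    String city = "Conveni\u00eancia";
    System.out.println(quote(new BaseDialect(), city));
    System.out.println(quote(new HiveLikeDialect(), city));
  }
}
```

Overriding per dialect keeps the base behavior unchanged for databases that accept `u&'...'`, while a switch on the database product inside the base method, as suggested elsewhere in this thread, would centralize the same decision in one place.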