[ https://issues.apache.org/jira/browse/CALCITE-6051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shivangi updated CALCITE-6051:
------------------------------
    Description: 
Hi,
Unicode string literals generated by Calcite are malformed. For example, the 
string `Conveniência` is converted into `u&'Conveni\00eancia'`. The `u&` prefix 
comes from the `quoteStringLiteralUnicode` method in 
calcite-core-1.2.0-incubating-sources.jar!/org/apache/calcite/sql/SqlDialect.java:

{code:java}
  /**
   * Converts a string into a unicode string literal. For example,
   * <code>can't{tab}run\</code> becomes <code>u'can''t\0009run\\'</code>.
   */
  public void quoteStringLiteralUnicode(StringBuilder buf, String val) {
    buf.append("u&'");
    for (int i = 0; i < val.length(); i++) {
      char c = val.charAt(i);
      if (c < 32 || c >= 128) {
        buf.append('\\');
        buf.append(HEXITS[(c >> 12) & 0xf]);
        buf.append(HEXITS[(c >> 8) & 0xf]);
        buf.append(HEXITS[(c >> 4) & 0xf]);
        buf.append(HEXITS[c & 0xf]);
      } else if (c == '\'' || c == '\\') {
        buf.append(c);
        buf.append(c);
      } else {
        buf.append(c);
      }
    }
    buf.append("'");
  }
{code}



Why does this method emit the `u&'` prefix? I couldn't find any Unicode escape 
format that uses `u&`, and as a result the literal breaks when it is read by 
the client. I would like to understand why `u&` is used and what could break 
if we remove the `&`.
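
To make the behavior easy to reproduce outside Calcite, here is a minimal 
standalone sketch of the same loop. The class name `UnicodeLiteralDemo`, the 
method name `quote`, and the lowercase `HEXITS` table are assumptions for 
illustration (the table is inferred from the `\00ea` in the output above); 
this is not Calcite's actual class.

```java
// Standalone sketch; not Calcite's actual class. All names here are hypothetical.
class UnicodeLiteralDemo {
  // Assumed lowercase hex table, inferred from the observed output "\00ea".
  private static final char[] HEXITS = "0123456789abcdef".toCharArray();

  /** Mirrors the loop of quoteStringLiteralUnicode quoted above. */
  static String quote(String val) {
    StringBuilder buf = new StringBuilder("u&'");
    for (int i = 0; i < val.length(); i++) {
      char c = val.charAt(i);
      if (c < 32 || c >= 128) {
        // Control and non-ASCII characters become a backslash plus 4 hex digits.
        buf.append('\\');
        buf.append(HEXITS[(c >> 12) & 0xf]);
        buf.append(HEXITS[(c >> 8) & 0xf]);
        buf.append(HEXITS[(c >> 4) & 0xf]);
        buf.append(HEXITS[c & 0xf]);
      } else if (c == '\'' || c == '\\') {
        // Single quotes and backslashes are doubled.
        buf.append(c).append(c);
      } else {
        buf.append(c);
      }
    }
    return buf.append('\'').toString();
  }

  public static void main(String[] args) {
    // The non-ASCII character is written as a Java escape to keep this file ASCII.
    System.out.println(quote("Conveni\u00eancia"));
    System.out.println(quote("can't\trun\\"));
  }
}
```

Running it prints `u&'Conveni\00eancia'` for the first input, which matches 
the literal we receive from Calcite.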

Thanks! 


> Incorrect translation for unicode strings in SqlDialect's 
> quoteStringLiteralUnicode method for HiveSqlDialect and SparkSqlDialect
> ---------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: CALCITE-6051
>                 URL: https://issues.apache.org/jira/browse/CALCITE-6051
>             Project: Calcite
>          Issue Type: Bug
>            Reporter: Shivangi
>            Priority: Major
>         Attachments: image-2023-10-16-18-54-53-483.png
>
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
