[jira] [Commented] (CALCITE-6352) The map_contains_key function may return true when the key and mapkeytype types are different.

2024-05-03 Thread Caican Cai (Jira)


[ 
https://issues.apache.org/jira/browse/CALCITE-6352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17843235#comment-17843235
 ] 

Caican Cai commented on CALCITE-6352:
-

{code:java}
scala> val df = spark.sql("select map_contains_key(map(1, 'a', 2.0, 'b'), 2)")
df: org.apache.spark.sql.DataFrame = [map_contains_key(map(1, a, 2.0, b), 2): boolean]

scala> df.show()
+--------------------------------------+
|map_contains_key(map(1, a, 2.0, b), 2)|
+--------------------------------------+
|                                  true|
+--------------------------------------+

scala> val df = spark.sql("select map_contains_key(map(1, 'a', 2.0, 'b'), 2f)")
df: org.apache.spark.sql.DataFrame = [map_contains_key(map(1, a, 2.0, b), 2.0): boolean]

scala> df.show()
+----------------------------------------+
|map_contains_key(map(1, a, 2.0, b), 2.0)|
+----------------------------------------+
|                                    true|
+----------------------------------------+
{code}
[~mbudiu] Hello, it seems that as long as the numeric values are equal, Spark returns true, regardless of whether the key type is DOUBLE, INT, or FLOAT.
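
My guess is that Spark effectively widens both the lookup key and the map keys to a common numeric type before comparing them. A minimal Java sketch of that behavior (containsKeyWidened is a hypothetical helper, not a Spark or Calcite API):

{code:java}
import java.util.LinkedHashMap;
import java.util.Map;

public class MapContainsKeySketch {
    // Hypothetical helper: widen every numeric key to double before comparing,
    // which reproduces the observed "2 matches 2.0" behavior.
    static boolean containsKeyWidened(Map<? extends Number, ?> map, Number key) {
        return map.keySet().stream()
                .anyMatch(k -> k.doubleValue() == key.doubleValue());
    }

    public static void main(String[] args) {
        Map<Number, String> map = new LinkedHashMap<>();
        map.put(1, "a");
        map.put(2.0, "b");

        System.out.println(map.containsKey(2));          // false: Integer 2 is not equal to Double 2.0
        System.out.println(containsKeyWidened(map, 2));  // true: the values compare equal after widening
    }
}
{code}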

> The map_contains_key function may return true when the key and mapkeytype 
> types are different.
> --
>
> Key: CALCITE-6352
> URL: https://issues.apache.org/jira/browse/CALCITE-6352
> Project: Calcite
>  Issue Type: Bug
>  Components: core
>Affects Versions: 1.36.0
>Reporter: Caican Cai
>Assignee: Caican Cai
>Priority: Critical
> Fix For: 1.37.0
>
>
>  
> {code:java}
> scala> val df = spark.sql("select map_contains_key(map(1, 'a', 2, 'b'), 2.0)")
> val df: org.apache.spark.sql.DataFrame = [map_contains_key(map(1, a, 2, b), 2.0): boolean]
>
> scala> df.show()
> +--------------------------------------+
> |map_contains_key(map(1, a, 2, b), 2.0)|
> +--------------------------------------+
> |                                  true|
> +--------------------------------------+
> {code}
> Calcite returns false.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (CALCITE-6352) The map_contains_key function may return true when the key and mapkeytype types are different.

2024-04-15 Thread Caican Cai (Jira)


[ 
https://issues.apache.org/jira/browse/CALCITE-6352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17837300#comment-17837300
 ] 

Caican Cai commented on CALCITE-6352:
-

I think there are two steps to solving the problem.

The first step is to determine the type rules that Spark applies to map_contains_key.

The second step is to handle these special cases accordingly in Calcite, as sketched below.
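
For the second step, one possible shape of the coercion rule (all names here are hypothetical, not Calcite's actual API): pick the wider numeric type of the map key type and the argument type, and compare both sides as that type; if no common type exists, reject the call or return false.

{code:java}
public class MapContainsKeyTyping {
    // Ordinal order encodes numeric widening: INTEGER < FLOAT < DOUBLE.
    enum SqlTypeNameSketch { INTEGER, FLOAT, DOUBLE, VARCHAR }

    // Return the common type to compare map keys and the lookup key as,
    // or null if the two types cannot be reconciled.
    static SqlTypeNameSketch commonKeyType(SqlTypeNameSketch mapKeyType,
                                           SqlTypeNameSketch argType) {
        if (mapKeyType == argType) {
            return mapKeyType;
        }
        boolean bothNumeric = mapKeyType != SqlTypeNameSketch.VARCHAR
                && argType != SqlTypeNameSketch.VARCHAR;
        if (!bothNumeric) {
            return null;
        }
        return mapKeyType.ordinal() >= argType.ordinal() ? mapKeyType : argType;
    }

    public static void main(String[] args) {
        // INTEGER map keys looked up with a DOUBLE argument: compare both as DOUBLE.
        System.out.println(commonKeyType(SqlTypeNameSketch.INTEGER, SqlTypeNameSketch.DOUBLE)); // DOUBLE
        // VARCHAR map keys looked up with a DOUBLE argument: no common type.
        System.out.println(commonKeyType(SqlTypeNameSketch.VARCHAR, SqlTypeNameSketch.DOUBLE)); // null
    }
}
{code}

With such a rule, map_contains_key(map(1, 'a', 2, 'b'), 2.0) would compare both sides as DOUBLE and return true, matching Spark.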



[jira] [Commented] (CALCITE-6352) The map_contains_key function may return true when the key and mapkeytype types are different.

2024-04-13 Thread Caican Cai (Jira)


[ 
https://issues.apache.org/jira/browse/CALCITE-6352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17836820#comment-17836820
 ] 

Caican Cai commented on CALCITE-6352:
-

[~mbudiu] Thank you. I will fix it.



[jira] [Commented] (CALCITE-6352) The map_contains_key function may return true when the key and mapkeytype types are different.

2024-04-04 Thread Mihai Budiu (Jira)


[ 
https://issues.apache.org/jira/browse/CALCITE-6352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17834028#comment-17834028
 ] 

Mihai Budiu commented on CALCITE-6352:
--

To decide whether this program is correct, you first have to define the type system of the language being used.
I haven't seen a type specification for Spark SQL.

The type system describes the types being used, the type assigned to each 
expression, and the implicit type conversion rules. (Note that Calcite has an 
interface called RelDataTypeSystem, but it only covers a subset of the 
functionality of what other people call a "type system"; for example, it does 
not contain the type inference rules.)

In a language like Java, which inherits behavior from the original, pre-generics Java, all maps are raw Map (effectively Map<Object, Object>) under the hood. So the above program would be legal (accepted by the type system), but the result returned would be "false", because we have:

{code:java}
!((Integer) 2).equals((Double) 2.0)
{code}

The same would happen in a dynamically-typed language like Python.
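
A minimal, runnable illustration of that point, using a raw Java map (illustrative only):

{code:java}
import java.util.HashMap;
import java.util.Map;

public class RawMapLookup {
    public static void main(String[] args) {
        // A raw map accepts any key type, so the mismatched lookup compiles...
        Map map = new HashMap();
        map.put(1, "a");
        map.put(2, "b");

        // ...but the boxed Integer 2 and Double 2.0 are not equal, so the lookup misses.
        System.out.println(map.containsKey(2.0)); // false
        System.out.println(map.containsKey(2));   // true
    }
}
{code}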

In Java you can also define your own strongly-typed map:

{code:java}
class M<K> {
    Map<K, String> m = new HashMap<>();

    String get(K key) {
        return this.m.get(key);
    }
}
{code}

The Java type checker will then reject an expression like
{code:java}
new M<Integer>().get(2.0)
{code}

as having an illegal type.

In a strongly-typed language there are rules for typechecking expressions. The 
typechecking rules describe how expression types are computed based on the 
expression syntactic structure, and which expressions have legal types. They 
also describe when an expression requires inserting implicit casts to be 
interpreted as a legal expression. 

I imagine that the Spark typechecker converts the above expression to something 
like 

{code:java}
map_contains_key(map(1, 'a', 2, 'b'), (Integer)2.0)
{code}

which would typecheck, and return 'true'. (Alternatively, it could convert this 
to 

{code:java}
map_contains_key(map((Double)1, 'a', (Double)2, 'b'), 2.0)
{code}

which would also return 'true'.)
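
Either rewrite yields the same observable result; a quick Java analogue of the two coercion strategies (illustrative only, not Spark's actual implementation):

{code:java}
import java.util.HashMap;
import java.util.Map;

public class CoercionStrategies {
    public static void main(String[] args) {
        // Strategy 1: narrow the lookup key to the map's key type (INTEGER).
        Map<Integer, String> intKeys = new HashMap<>();
        intKeys.put(1, "a");
        intKeys.put(2, "b");
        System.out.println(intKeys.containsKey((int) 2.0)); // true

        // Strategy 2: widen the map's key type to the lookup key's type (DOUBLE).
        Map<Double, String> doubleKeys = new HashMap<>();
        doubleKeys.put(1.0, "a");
        doubleKeys.put(2.0, "b");
        System.out.println(doubleKeys.containsKey(2.0));    // true
    }
}
{code}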

As you know, Calcite is unique in that there is no overall type system that dictates these rules; each function and operator has its own rules for typechecking and inserting implicit casts. Moreover, and very unfortunately, although Calcite infers implicit casts, it does *not* insert them as *explicit* casts in the program representation, which I think is a bug, but one that is too late to fix.

There are some simple type systems that could have been used. The simplest works bottom-up, inferring the type of an expression from the types of its subexpressions. That is probably adequate for classic SQL, but it runs into problems when you add overloaded functions and generic data structures, such as arrays and maps.
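
As a sketch, bottom-up inference looks roughly like this (a toy example, not Calcite code; the types and rules are invented for illustration):

{code:java}
// A toy bottom-up type checker: the type of every expression is computed
// purely from the types of its subexpressions.
public class BottomUpTypes {
    enum Type { INTEGER, DOUBLE, VARCHAR, BOOLEAN }

    interface Expr { Type type(); }

    record Literal(Type type) implements Expr { }

    // PLUS of two numeric operands: the result is DOUBLE if either side is DOUBLE.
    record Plus(Expr left, Expr right) implements Expr {
        public Type type() {
            Type l = left.type(), r = right.type();
            if (l == Type.VARCHAR || r == Type.VARCHAR
                    || l == Type.BOOLEAN || r == Type.BOOLEAN) {
                throw new IllegalStateException("PLUS requires numeric operands");
            }
            return (l == Type.DOUBLE || r == Type.DOUBLE) ? Type.DOUBLE : Type.INTEGER;
        }
    }

    public static void main(String[] args) {
        Expr e = new Plus(new Literal(Type.INTEGER), new Literal(Type.DOUBLE));
        System.out.println(e.type()); // DOUBLE, inferred only from the operand types
    }
}
{code}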

There are generic typechecking algorithms that can be used in the presence of such features; a classic one is the Hindley-Milner type unification algorithm (https://en.wikipedia.org/wiki/Hindley%E2%80%93Milner_type_system). In that algorithm some expressions have an "unknown" type, represented by a type variable. The algorithm finds the "most general type" of any expression. The most general type can also be used to dictate where implicit casts need to be inserted. But this is mostly of academic interest, since Calcite does not use HM. If it did, most of the code being written today for type inference might not have been needed. On the other hand, if you want compatibility with other SQL dialects, which do not use HM, you have to reproduce their behaviors (which may be undocumented!).
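
For illustration, the core of HM-style unification is small; a toy sketch (not Calcite code, and omitting the occurs check):

{code:java}
import java.util.HashMap;
import java.util.Map;

// Type variables stand for unknown types; unification finds a substitution
// that makes two types equal, or fails if none exists.
public class Unify {
    interface Type { }
    record TVar(String name) implements Type { } // unknown type, e.g. the key type of MAP(?, VARCHAR)
    record TCon(String name) implements Type { } // concrete type, e.g. INTEGER

    // Follow substitutions until we reach a concrete type or an unbound variable.
    static Type resolve(Type t, Map<String, Type> subst) {
        while (t instanceof TVar v && subst.containsKey(v.name())) {
            t = subst.get(v.name());
        }
        return t;
    }

    // Unify two types, extending the substitution; returns false on mismatch.
    static boolean unify(Type a, Type b, Map<String, Type> subst) {
        a = resolve(a, subst);
        b = resolve(b, subst);
        if (a.equals(b)) {
            return true;
        }
        if (a instanceof TVar v) {
            subst.put(v.name(), b);
            return true;
        }
        if (b instanceof TVar v) {
            subst.put(v.name(), a);
            return true;
        }
        return false; // two different concrete types: no unifier
    }

    public static void main(String[] args) {
        Map<String, Type> subst = new HashMap<>();
        // The map key type is unknown ('k'); the first key literal is INTEGER.
        System.out.println(unify(new TVar("k"), new TCon("INTEGER"), subst)); // true, k := INTEGER
        // The lookup argument is DOUBLE: strict unification fails, signalling
        // that an implicit cast (or a type error) is needed.
        System.out.println(unify(new TVar("k"), new TCon("DOUBLE"), subst));  // false
    }
}
{code}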

You can have even more general typechecking algorithms that essentially generate systems of equations over type variables and use powerful generic or hand-written domain-specific constraint solvers to find solutions. Type systems for real languages such as C++, C#, Java, and Rust, which have subtyping, generics, overloading, and traits, need these more powerful algorithms. You can find specifications for these type systems, and they are not simple. E.g., for Java, https://ozanagma.medium.com/type-checking-in-java-b84d6f38ba8f is a short description, and https://docs.oracle.com/javase/specs/jls/se8/html/jls-18.html is a precise specification.

People literally still write Ph.D. theses about this stuff. It is complicated, 
and easy to get wrong. 

You can argue that the Spark type system design is wrong, but we won't know until we see it written down. I haven't looked very hard for it.

Since SQL is a relatively simple language, the problem of typechecking is not 
as difficult as for the languages I described above. 

Two conclusions:
- Until you write down the rules, you cannot argue about them.
- I think that it will be very difficult (impossible?) for Calcite to match the type systems of all the SQL dialects out there precisely, especially if di

[jira] [Commented] (CALCITE-6352) The map_contains_key function may return true when the key and mapkeytype types are different.

2024-04-04 Thread Caican Cai (Jira)


[ 
https://issues.apache.org/jira/browse/CALCITE-6352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17833966#comment-17833966
 ] 

Caican Cai commented on CALCITE-6352:
-

[~mbudiu] I think this also goes against the usual principles of compilers and type checking, but Spark does return true.
