[ https://issues.apache.org/jira/browse/CALCITE-6352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17834028#comment-17834028 ]

Mihai Budiu edited comment on CALCITE-6352 at 4/4/24 5:05 PM:
--------------------------------------------------------------

To decide whether this program is correct, you first have to define the type 
system of the language that is being used. 
I haven't seen a type specification for Spark SQL.

The type system describes the types being used, the type assigned to each 
expression, and the implicit type conversion rules. (Note that Calcite has an 
interface called RelDataTypeSystem, but it only covers a subset of the 
functionality of what other people call a "type system"; for example, it does 
not contain the type inference rules.)

In a language like Java that inherits raw types from the original Java without 
generics, all Maps are just Map<Object, Object> under the hood. So the above 
program would be legal (accepted by the type system), but the result returned 
would be "false", because we have:

{code:java}
// an Integer and a Double are never equal, even for the same numeric value
!Integer.valueOf(2).equals(Double.valueOf(2.0))
{code}

A dynamically-typed language like Python would likewise accept the program 
(though there the lookup would actually succeed, since Python's numeric 
equality treats 2 and 2.0 as equal).

In Java you can also define your own strongly-typed map:

{code:java}
import java.util.HashMap;
import java.util.Map;

// a map whose keys are statically constrained to the type parameter K
class M<K> {
    Map<K, String> m = new HashMap<>();

    String get(K key) {
        return this.m.get(key);
    }
}
{code}

The Java type checker will then reject an expression like 
{code:java}
new M<Integer>().get(2.0)
{code}

as having an illegal type.

In a strongly-typed language there are rules for typechecking expressions. The 
typechecking rules describe how expression types are computed from an 
expression's syntactic structure, and which expressions have legal types. They 
also describe when implicit casts must be inserted for an expression to be 
interpreted as a legal one.

I imagine that the Spark typechecker converts the above expression to something 
like 

{code:java}
map_contains_key(map<Integer, String>(1, 'a', 2, 'b'), (Integer)2.0)
{code}

which would typecheck, and return 'true'. (Alternatively, it could convert this 
to 

{code:java}
map_contains_key(map<Double, String>((Double)1, 'a', (Double)2, 'b'), 2.0)
{code}

which would also return 'true'.)
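Purely as an illustration of the kind of rule involved (this is not Spark's or 
Calcite's actual code; the type names and methods below are invented), the 
coercion decision for such a map lookup might be sketched in Java like this:

{code:java}
// Hypothetical sketch only: decide which type the map keys and the probe key
// should be coerced to before the lookup is performed.
enum SqlType { INTEGER, DOUBLE, VARCHAR }

class MapLookupRule {
    static SqlType commonKeyType(SqlType mapKeyType, SqlType probeType) {
        if (mapKeyType == probeType) {
            return mapKeyType;                 // no cast needed
        }
        if (isNumeric(mapKeyType) && isNumeric(probeType)) {
            return SqlType.DOUBLE;             // widen; the narrower side gets an implicit cast
        }
        throw new IllegalArgumentException(
            "map_contains_key: incompatible key types " + mapKeyType + " and " + probeType);
    }

    private static boolean isNumeric(SqlType t) {
        return t == SqlType.INTEGER || t == SqlType.DOUBLE;
    }
}
{code}

A widening rule like this makes INTEGER and DOUBLE keys meet at DOUBLE, which 
corresponds to the second alternative above; a rule that instead casts the 
probe to the map's key type corresponds to the first one.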

As you know, Calcite is unique in that there is no overall type system that 
dictates these rules; each function and operator has its own rules for 
typechecking and inserting implicit casts. Moreover, and very unfortunately, 
although Calcite infers implicit casts, it does *not* insert them as *explicit* 
casts in the program representation, which I think is a bug, but one that is 
too late to fix.

There are some simple type systems that could have been used. The simplest 
works bottom-up, inferring the type of an expression from the types of its 
subexpressions. That is probably adequate for classic SQL, but it runs into 
problems when you add overloaded functions and generic data structures, such 
as arrays and maps.
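As a concrete (and purely hypothetical) sketch of what such a bottom-up pass 
looks like, and of where it starts to struggle:

{code:java}
import java.util.List;

// Illustration only: a tiny expression tree typed purely bottom-up.
abstract class Expr {}
class IntLit extends Expr {}
class DoubleLit extends Expr {}
class Plus extends Expr {
    final Expr left, right;
    Plus(Expr left, Expr right) { this.left = left; this.right = right; }
}
class Call extends Expr {
    final String function;
    final List<Expr> args;
    Call(String function, List<Expr> args) { this.function = function; this.args = args; }
}

class BottomUpTyper {
    static String typeOf(Expr e) {
        if (e instanceof IntLit) return "INTEGER";
        if (e instanceof DoubleLit) return "DOUBLE";
        if (e instanceof Plus) {
            // The type of '+' is computed only from the types of its operands.
            Plus p = (Plus) e;
            boolean anyDouble =
                typeOf(p.left).equals("DOUBLE") || typeOf(p.right).equals("DOUBLE");
            return anyDouble ? "DOUBLE" : "INTEGER";
        }
        // Overloaded functions and generic structures (map_contains_key, map and
        // array constructors, ...) need relationships *between* argument types,
        // which a purely bottom-up pass cannot express on its own.
        throw new UnsupportedOperationException(
            "overloads and generics need more machinery than bottom-up inference");
    }
}
{code}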

There are generic typechecking algorithms that can be used in the presence of 
such features; a classic one is the Hindley-Milner type unification algorithm: 
https://en.wikipedia.org/wiki/Hindley%E2%80%93Milner_type_system. In 
that algorithm some expressions have an "unknown" type, represented by a type 
variable. The algorithm finds the "most general type" of any expression. The 
most general type can also be used to dictate where implicit casts need to be 
inserted. But this is mostly of academic interest, since Calcite does not use 
HM. If it did, most of the code that is being written today for type inference 
might not have been needed. On the other hand, if you want compatibility with 
other SQL dialects, which do not use HM, you have to reproduce their behaviors 
(which may be undocumented!).
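For flavor, here is a toy sketch of HM-style unification with type variables 
(again hypothetical, and nothing like the code Calcite actually uses), 
assuming for illustration a generic signature map_contains_key(MAP<K, V>, K):

{code:java}
import java.util.HashMap;
import java.util.Map;

// Toy unifier: a type is either a concrete SQL type or a type variable.
abstract class Type {}
class Concrete extends Type {
    final String name;                    // e.g. "INTEGER", "DOUBLE", "VARCHAR"
    Concrete(String name) { this.name = name; }
    @Override public String toString() { return name; }
}
class TypeVar extends Type {
    final String id;                      // e.g. "K" in map_contains_key(MAP<K, V>, K)
    TypeVar(String id) { this.id = id; }
    @Override public String toString() { return id; }
}

class Unifier {
    private final Map<String, Type> bindings = new HashMap<>();

    // Follow variable bindings until we reach a concrete type or an unbound variable.
    Type resolve(Type t) {
        while (t instanceof TypeVar && bindings.containsKey(((TypeVar) t).id)) {
            t = bindings.get(((TypeVar) t).id);
        }
        return t;
    }

    // Make two types equal, binding type variables as needed.
    void unify(Type a, Type b) {
        a = resolve(a);
        b = resolve(b);
        if (a instanceof TypeVar) { bindings.put(((TypeVar) a).id, b); return; }
        if (b instanceof TypeVar) { bindings.put(((TypeVar) b).id, a); return; }
        if (!((Concrete) a).name.equals(((Concrete) b).name)) {
            // A SQL dialect would look for an implicit cast here instead of failing.
            throw new IllegalStateException("cannot unify " + a + " with " + b);
        }
    }
}
{code}

With such a signature, the unifier binds K to INTEGER from the map argument, 
and then either fails or asks for an implicit cast when unifying K with DOUBLE 
for the key 2.0.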

You can have even more general typechecking algorithms that essentially 
generate systems of equations over type variables and use powerful generic or 
hand-written domain-specific constraint solvers to find solutions. Type systems 
for real languages such as C++, C#, Java, and Rust, which have subtyping, 
generics, overloading, and traits, need these more powerful algorithms. You can 
find specifications for these type systems, and they are not simple. E.g., for 
Java, https://ozanagma.medium.com/type-checking-in-java-b84d6f38ba8f is a short 
description, and https://docs.oracle.com/javase/specs/jls/se8/html/jls-18.html 
is a precise specification.

People literally still write Ph.D. theses about this stuff. It is complicated, 
and easy to get wrong. 

You can argue that the Spark type system design is wrong, but we won't know 
until we see it written down. I haven't looked very hard for it.

Since SQL is a relatively simple language, the problem of typechecking is not 
as difficult as for the languages I described above. 

Two conclusions:
- Until you write down the rules, you cannot argue about them.
- I think that it will be very difficult (impossible?) for Calcite to match the 
type systems of all the SQL dialects out there precisely, especially if 
different SQL dialects use different rules given the same expression. See for 
example the discussion we had about the semantics of casting varbinary to 
string: https://issues.apache.org/jira/browse/CALCITE-6210 (although that 
discussion is not about type-checking, but about the runtime interpretation of 
expressions).



> The map_contains_key function may return true when the key and mapkeytype 
> types are different.
> ----------------------------------------------------------------------------------------------
>
>                 Key: CALCITE-6352
>                 URL: https://issues.apache.org/jira/browse/CALCITE-6352
>             Project: Calcite
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 1.36.0
>            Reporter: Caican Cai
>            Priority: Critical
>             Fix For: 1.37.0
>
>
>  
> {code:java}
> scala>  val df = spark.sql("select map_contains_key(map(1, 'a', 2, 'b'), 
> 2.0)")
> val df: org.apache.spark.sql.DataFrame = [map_contains_key(map(1, a, 2, b), 
> 2.0): boolean]
> scala> df.show()
> +--------------------------------------+
> |map_contains_key(map(1, a, 2, b), 2.0)|
> +--------------------------------------+
> |                                  true|
> +--------------------------------------+
>  {code}
> calcite return false
>  


