[jira] [Commented] (CALCITE-6352) The map_contains_key function may return true when the key and mapkeytype types are different.
[ https://issues.apache.org/jira/browse/CALCITE-6352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17843235#comment-17843235 ] Caican Cai commented on CALCITE-6352: - {code:java} scala> val df = spark.sql("select map_contains_key(map(1, 'a', 2.0, 'b'), 2)") df: org.apache.spark.sql.DataFrame = [map_contains_key(map(1, a, 2.0, b), 2): boolean]scala> df.show() +--+ |map_contains_key(map(1, a, 2.0, b), 2)| +--+ | true| +--+ scala> val df = spark.sql("select map_contains_key(map(1, 'a', 2.0, 'b'), 2f)") df: org.apache.spark.sql.DataFrame = [map_contains_key(map(1, a, 2.0, b), 2.0): boolean]scala> df.show() ++ |map_contains_key(map(1, a, 2.0, b), 2.0)| ++ | true| ++ {code} [~mbudiu] Hello, It seems that as long as the numbers are the same, true can be returned, regardless of whether the number type is double, int, or float. > The map_contains_key function may return true when the key and mapkeytype > types are different. > -- > > Key: CALCITE-6352 > URL: https://issues.apache.org/jira/browse/CALCITE-6352 > Project: Calcite > Issue Type: Bug > Components: core >Affects Versions: 1.36.0 >Reporter: Caican Cai >Assignee: Caican Cai >Priority: Critical > Fix For: 1.37.0 > > > > {code:java} > scala> val df = spark.sql("select map_contains_key(map(1, 'a', 2, 'b'), > 2.0)") > val df: org.apache.spark.sql.DataFrame = [map_contains_key(map(1, a, 2, b), > 2.0): boolean] > scala> df.show() > +--+ > |map_contains_key(map(1, a, 2, b), 2.0)| > +--+ > | true| > +--+ > {code} > calcite return false > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (CALCITE-6352) The map_contains_key function may return true when the key and mapkeytype types are different.
[ https://issues.apache.org/jira/browse/CALCITE-6352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17837300#comment-17837300 ] Caican Cai commented on CALCITE-6352: - I think there are two steps to solving the problem . The first step is to determine the type judgment rules for map_contain_key in spark. The second step is to handle these special situations specially. > The map_contains_key function may return true when the key and mapkeytype > types are different. > -- > > Key: CALCITE-6352 > URL: https://issues.apache.org/jira/browse/CALCITE-6352 > Project: Calcite > Issue Type: Bug > Components: core >Affects Versions: 1.36.0 >Reporter: Caican Cai >Assignee: Caican Cai >Priority: Critical > Fix For: 1.37.0 > > > > {code:java} > scala> val df = spark.sql("select map_contains_key(map(1, 'a', 2, 'b'), > 2.0)") > val df: org.apache.spark.sql.DataFrame = [map_contains_key(map(1, a, 2, b), > 2.0): boolean] > scala> df.show() > +--+ > |map_contains_key(map(1, a, 2, b), 2.0)| > +--+ > | true| > +--+ > {code} > calcite return false > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (CALCITE-6352) The map_contains_key function may return true when the key and mapkeytype types are different.
[ https://issues.apache.org/jira/browse/CALCITE-6352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17836820#comment-17836820 ] Caican Cai commented on CALCITE-6352: - [~mbudiu] Thank you. I will fix it > The map_contains_key function may return true when the key and mapkeytype > types are different. > -- > > Key: CALCITE-6352 > URL: https://issues.apache.org/jira/browse/CALCITE-6352 > Project: Calcite > Issue Type: Bug > Components: core >Affects Versions: 1.36.0 >Reporter: Caican Cai >Assignee: Caican Cai >Priority: Critical > Fix For: 1.37.0 > > > > {code:java} > scala> val df = spark.sql("select map_contains_key(map(1, 'a', 2, 'b'), > 2.0)") > val df: org.apache.spark.sql.DataFrame = [map_contains_key(map(1, a, 2, b), > 2.0): boolean] > scala> df.show() > +--+ > |map_contains_key(map(1, a, 2, b), 2.0)| > +--+ > | true| > +--+ > {code} > calcite return false > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (CALCITE-6352) The map_contains_key function may return true when the key and mapkeytype types are different.
[ https://issues.apache.org/jira/browse/CALCITE-6352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17834028#comment-17834028 ] Mihai Budiu commented on CALCITE-6352: -- To decide whether this program is correct you have first to define the type system of the language that is being used. I haven't seen a type specification of Spark SQL. The type system describes the types being used, the type assigned to each expression, and the implicit type conversion rules. (Note that Calcite has an interface called RelDataTypeSystem, but it only covers a subset of the functionality of what other people call a "type system"; for example, it does not contain the type inference rules.) In a language like Java, inheriting from the original Java without generics, all Maps are just Map under the hood. So the above program would be legal (accepted by the type system), but the result returned would be "false", because we have: {code:java} !((Integer)2).equals((Double)2). {code{ The same would happen in a dynamically-typed language like Python. In Java you can also define your own strongly-typed map: {code:java} class M { Map m = new HashMap<>(); String get(K key) { return this.m.get(key); } } {code} The Java type checker then will reject an expression like {code:java} new M().get(2.0) {code} as having an illegal type. In a strongly-typed language there are rules for typechecking expressions. The typechecking rules describe how expression types are computed based on the expression syntactic structure, and which expressions have legal types. They also describe when an expression requires inserting implicit casts to be interpreted as a legal expression. I imagine that the Spark typechecker converts the above expression to something like {code:java} map_contains_key(map(1, 'a', 2, 'b'), (Integer)2.0) {code} which would typecheck, and return 'true'. (Alternatively, it could convert this to {code:java} map_contains_key(map((Double)1, 'a', (Double)2, 'b'), 2.0) {code} which would also return 'true'.) As you know, Calcite is unique in that there is no overall type system that dictates these rules, each function and operator has its own rules for typechecking and inserting implicit casts. Moreover, and very unfortunately, although Calcite infers implicit casts, it does *not* insert them as *explicit* casts in the program representation, which I think is a bug, but which it's too late to fix. There are some simple typesystems that could have been used. The simplest typesystem simply works bottom-up, inferring the type of an expression from the types of its subexpressions. That is probably adequate for classic SQL, but runs into problems when you add overloaded functions and generic data structures, such as arrays and maps. There are some generic typechecking algorithms that can be used in the presence of such features; a classic algorithm is the Hindley-Milner type unification algorithm https://en.wikipedia.org/wiki/Hindley%E2%80%93Milner_type_system. In that algorithm some expressions have an "unknown" type, represented by a type variable. The algorithm finds the "most general type" of any expression. The most general type can also be used to dictate where implicit casts need to be inserted. But this is mostly of academic interest, since Calcite does not use HM. If it did, most of the code that is being written today for type inference might not have been needed. On the other hand, if you want compatibility with other SQL dialects, which do not use HM, you have to reproduce their behaviors (which may be undocumented!) You can have even more general typechecking algorithms, that essentially generate systems of equations using type variables and use powerful generic or hand-written domain-specific constraint solvers to find solutions. Type systems for real languages such as C++, C#, Java, Rust, that have subtyping, generics, overloading, and traits, need more powerful algorithms. You can find specifications for these type systems, and they are not simple. E.g., for Java https://ozanagma.medium.com/type-checking-in-java-b84d6f38ba8f is a short description, and https://docs.oracle.com/javase/specs/jls/se8/html/jls-18.html for a precise specification. People literally still write Ph.D. theses about this stuff. It is complicated, and easy to get wrong. You can argue that the Spark typesystem design is wrong, but we won't know until we see it written down. I haven't looked too much for it. Since SQL is a relatively simple language, the problem of typechecking is not as difficult as for the languages I described above. Two conclusions: - Until you write down the rules, you cannot argue about them. - I think that it will be very difficult (impossible?) for Calcite to match the typesystems of all the SQL dialects out there precisely, especially if di
[jira] [Commented] (CALCITE-6352) The map_contains_key function may return true when the key and mapkeytype types are different.
[ https://issues.apache.org/jira/browse/CALCITE-6352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17833966#comment-17833966 ] Caican Cai commented on CALCITE-6352: - [~mbudiu] I think this is also inconsistent with the common sense of compilation principles, but spark does return true > The map_contains_key function may return true when the key and mapkeytype > types are different. > -- > > Key: CALCITE-6352 > URL: https://issues.apache.org/jira/browse/CALCITE-6352 > Project: Calcite > Issue Type: Bug > Components: core >Affects Versions: 1.36.0 >Reporter: Caican Cai >Priority: Critical > Fix For: 1.37.0 > > > > {code:java} > scala> val df = spark.sql("select map_contains_key(map(1, 'a', 2, 'b'), > 2.0)") > val df: org.apache.spark.sql.DataFrame = [map_contains_key(map(1, a, 2, b), > 2.0): boolean] > scala> df.show() > +--+ > |map_contains_key(map(1, a, 2, b), 2.0)| > +--+ > | true| > +--+ > {code} > calcite return false > -- This message was sent by Atlassian Jira (v8.20.10#820010)