[ 
https://issues.apache.org/jira/browse/SPARK-27278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16811808#comment-16811808
 ] 

Marco Gaido commented on SPARK-27278:
-------------------------------------

[~huonw] I think the point is: in the existing case which is optimized, the 
{{CreateMap}} operation is removed and replaced with the {{CASE ... WHEN}} 
syntax: this is fine because the code generated by the {{CreateMap}} operation 
is linear in size with the number of elements, exactly as for the {{CASE ... 
WHEN}} approach. So the overall code generated size is similar. While when the 
{{CreateMap}} operation is replaced with a {{Literal}}, this code is not there 
and we replace the for loop present with the list of if statements. So the code 
size is significantly bigger. Despite trivial tests show that the second case 
is faster, having a bigger code generated can lead to huge perf issues in more 
complex scenario (the worts cases are probably when it causes the method size 
to grow bigger than 4k so no JIT is performed anymore, or even it may cause the 
failure to use WholeStageCodegen): so the effect of introducing it in 
performance would be highly query-dependent and hard to predict.

> Optimize GetMapValue when the map is a foldable and the key is not
> ------------------------------------------------------------------
>
>                 Key: SPARK-27278
>                 URL: https://issues.apache.org/jira/browse/SPARK-27278
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.4.0
>         Environment: Spark 2.4.0
>            Reporter: Huon Wilson
>            Priority: Minor
>
> With a map that isn't constant-foldable, spark will optimise an access to a 
> series of {{CASE WHEN ... THEN ... WHEN ... THEN ... END}}, for instance
> {code:none}
> scala> spark.range(1000).select(map(lit(1), lit(1), lit(2), 'id)('id) as 
> "x").explain
> == Physical Plan ==
> *(1) Project [CASE WHEN (cast(id#180L as int) = 1) THEN 1 WHEN (cast(id#180L 
> as int) = 2) THEN id#180L END AS x#182L]
> +- *(1) Range (0, 1000, step=1, splits=12)
> {code}
> This results in an efficient series of ifs and elses, in the code generation:
> {code:java}
> /* 037 */       boolean project_isNull_3 = false;
> /* 038 */       int project_value_3 = -1;
> /* 039 */       if (!false) {
> /* 040 */         project_value_3 = (int) project_expr_0_0;
> /* 041 */       }
> /* 042 */
> /* 043 */       boolean project_value_2 = false;
> /* 044 */       project_value_2 = project_value_3 == 1;
> /* 045 */       if (!false && project_value_2) {
> /* 046 */         project_caseWhenResultState_0 = (byte)(false ? 1 : 0);
> /* 047 */         project_project_value_1_0 = 1L;
> /* 048 */         continue;
> /* 049 */       }
> /* 050 */
> /* 051 */       boolean project_isNull_8 = false;
> /* 052 */       int project_value_8 = -1;
> /* 053 */       if (!false) {
> /* 054 */         project_value_8 = (int) project_expr_0_0;
> /* 055 */       }
> /* 056 */
> /* 057 */       boolean project_value_7 = false;
> /* 058 */       project_value_7 = project_value_8 == 2;
> /* 059 */       if (!false && project_value_7) {
> /* 060 */         project_caseWhenResultState_0 = (byte)(false ? 1 : 0);
> /* 061 */         project_project_value_1_0 = project_expr_0_0;
> /* 062 */         continue;
> /* 063 */       }
> {code}
> If the map can be constant folded, the constant folding happens first, and 
> the {{SimplifyExtractValueOps}} optimisation doesn't trigger, resulting doing 
> a map traversal and more dynamic checks:
> {code:none}
> scala> spark.range(1000).select(map(lit(1), lit(1), lit(2), lit(2))('id) as 
> "x").explain
> == Physical Plan ==
> *(1) Project [keys: [1,2], values: [1,2][cast(id#195L as int)] AS x#197]
> +- *(1) Range (0, 1000, step=1, splits=12)
> {code}
> The {{keys: ..., values: ...}} is from the {{ArrayBasedMapData}} type, which 
> is what is stored in the {{Literal}} form of the {{map(...)}} expression in 
> that select. The code generated is less efficient, since it has to do a 
> manual dynamic traversal of the map's array of keys, with type casts etc.:
> {code:java}
> /* 099 */           int project_index_0 = 0;
> /* 100 */           boolean project_found_0 = false;
> /* 101 */           while (project_index_0 < project_length_0 && 
> !project_found_0) {
> /* 102 */             final int project_key_0 = 
> project_keys_0.getInt(project_index_0);
> /* 103 */             if (project_key_0 == project_value_2) {
> /* 104 */               project_found_0 = true;
> /* 105 */             } else {
> /* 106 */               project_index_0++;
> /* 107 */             }
> /* 108 */           }
> /* 109 */
> /* 110 */           if (!project_found_0) {
> /* 111 */             project_isNull_0 = true;
> /* 112 */           } else {
> /* 113 */             project_value_0 = 
> project_values_0.getInt(project_index_0);
> /* 114 */           }
> {code}
> It looks like the problem is in {{SimplifyExtractValueOps}}, which doesn't 
> handle {{GetMapValue(Literal(...), key)}}, only the {{CreateMap}} form:
> {code:scala}
>       case GetMapValue(CreateMap(elems), key) => CaseKeyWhen(key, elems)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to