[ https://issues.apache.org/jira/browse/SPARK-13753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15267689#comment-15267689 ]
Davies Liu commented on SPARK-13753:
------------------------------------

After looking at the query, the bug is caused by our assumption that the key of a MapType is always non-nullable, while map() does not check the nullability of its keys when it creates a map. So the solution could be either:

1) enforce the nullability check in map(), which will break this use case, or
2) allow `null` as a key in MapType, which may require more API changes.

cc [~rxin] [~marmbrus] [~yhuai]

> Column nullable is derived incorrectly
> --------------------------------------
>
>                 Key: SPARK-13753
>                 URL: https://issues.apache.org/jira/browse/SPARK-13753
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.5.2
>            Reporter: Jingwei Lu
>            Priority: Critical
>
> Spark SQL derives column nullability incorrectly and then uses the wrong
> result during optimization. In the following query:
> {code}
> select concat("perf.realtime.web", b.tags[1]) as metric, b.value, b.tags[0]
> from (
>   select explode(map(
>     a.frontend[0],
>       ARRAY(concat("metric:frontend", ",controller:", COALESCE(controller, "null"),
>                    ",action:", COALESCE(action, "null")), ".p50"),
>     a.frontend[1],
>       ARRAY(concat("metric:frontend", ",controller:", COALESCE(controller, "null"),
>                    ",action:", COALESCE(action, "null")), ".p90"),
>     a.backend[0],
>       ARRAY(concat("metric:backend", ",controller:", COALESCE(controller, "null"),
>                    ",action:", COALESCE(action, "null")), ".p50"),
>     a.backend[1],
>       ARRAY(concat("metric:backend", ",controller:", COALESCE(controller, "null"),
>                    ",action:", COALESCE(action, "null")), ".p90"),
>     a.render[0],
>       ARRAY(concat("metric:render", ",controller:", COALESCE(controller, "null"),
>                    ",action:", COALESCE(action, "null")), ".p50"),
>     a.render[1],
>       ARRAY(concat("metric:render", ",controller:", COALESCE(controller, "null"),
>                    ",action:", COALESCE(action, "null")), ".p90"),
>     a.page_load_time[0],
>       ARRAY(concat("metric:page_load_time", ",controller:", COALESCE(controller, "null"),
>                    ",action:", COALESCE(action, "null")), ".p50"),
>     a.page_load_time[1],
>       ARRAY(concat("metric:page_load_time", ",controller:", COALESCE(controller, "null"),
>                    ",action:", COALESCE(action, "null")), ".p90"),
>     a.total_load_time[0],
>       ARRAY(concat("metric:total_load_time", ",controller:", COALESCE(controller, "null"),
>                    ",action:", COALESCE(action, "null")), ".p50"),
>     a.total_load_time[1],
>       ARRAY(concat("metric:total_load_time", ",controller:", COALESCE(controller, "null"),
>                    ",action:", COALESCE(action, "null")), ".p90"))) as (value, tags)
>   from (
>     select data.controller as controller, data.action as action,
>            percentile(data.frontend, array(0.5, 0.9)) as frontend,
>            percentile(data.backend, array(0.5, 0.9)) as backend,
>            percentile(data.render, array(0.5, 0.9)) as render,
>            percentile(data.page_load_time, array(0.5, 0.9)) as page_load_time,
>            percentile(data.total_load_time, array(0.5, 0.9)) as total_load_time
>     from air_events_rt
>     where type='air_events' and data.event_name='pageload'
>     group by data.controller, data.action
>   ) a
> ) b
> where b.value is not null
> {code}
> b.value is incorrectly derived as not nullable. The "b.value is not null"
> predicate is then ignored by the optimizer, which causes the query to return
> incorrect results.
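For illustration only, the failure mode described in the comment can be sketched with a toy model in plain Python. This is not Spark code: `Column`, `exploded_key_column`, `run_filter_not_null`, and the `trust_map_keys` flag are hypothetical names invented for this sketch, and the "optimizer" is reduced to a single rule that drops an IS NOT NULL predicate when the schema claims the column is non-nullable.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Row = Tuple[Optional[float], str]


@dataclass(frozen=True)
class Column:
    name: str
    nullable: bool


def exploded_key_column(key_exprs_nullable: List[bool], trust_map_keys: bool) -> Column:
    """Derive the schema of the key column produced by explode(map(...)) (toy model)."""
    if trust_map_keys:
        # Assumption from the comment: map keys are never null, so the
        # nullability of the key expressions is not consulted at all.
        return Column("value", nullable=False)
    # Alternative: propagate nullability from the key expressions.
    return Column("value", nullable=any(key_exprs_nullable))


def run_filter_not_null(rows: List[Row], column: Column) -> List[Row]:
    """Apply `column IS NOT NULL`, skipping the filter when the schema says non-nullable."""
    if not column.nullable:
        # Toy optimizer rule: the predicate is considered always true and removed.
        return list(rows)
    return [row for row in rows if row[0] is not None]


# Keys such as a.frontend[0] can be null, so both key expressions are nullable.
rows: List[Row] = [(12.5, ".p50"), (None, ".p90")]

buggy = run_filter_not_null(rows, exploded_key_column([True, True], trust_map_keys=True))
fixed = run_filter_not_null(rows, exploded_key_column([True, True], trust_map_keys=False))

print(buggy)  # the row with a null key survives the filter
print(fixed)  # the row with a null key is filtered out
```

In this sketch, `trust_map_keys=False` roughly corresponds to option 2 (letting the key be nullable and propagating that into the schema). Option 1 would instead reject nullable key expressions when the map is built, which is why it would break the query in the report.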