Jianshi Huang created SPARK-5446: ------------------------------------ Summary: Parquet column pruning should work for Map and Struct Key: SPARK-5446 URL: https://issues.apache.org/jira/browse/SPARK-5446 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.2.0, 1.3.0 Reporter: Jianshi Huang
Consider the following query: {code:sql} select stddev_pop(variables.var1) stddev from model group by model_name {code} Where variables is a Struct containing many fields, similarly it can be a Map with many key-value pairs. During execution, SparkSQL will shuffle the whole map or struct column instead of extracting the value first. The performance is very poor. The optimized version could use a subquery: {code:sql} select stddev_pop(var) stddev from (select variables.var1 as var, model_name from model) m group by model_name {code} Where we extract the field/key-value only in the mapper side, so data being shuffled is small. Parquet already supports reading a single field/key-value in the storage level, but SparkSQL currently doesn’t have optimization for it. This will be very useful optimization for tables having Map or Struct with many columns. Jianshi -- This message was sent by Atlassian JIRA (v6.3.4#6332) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org