[ https://issues.apache.org/jira/browse/SPARK-29031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Dongjoon Hyun updated SPARK-29031: ---------------------------------- Affects Version/s: (was: 3.0.0) 3.1.0 > Materialized column to accelerate queries > ----------------------------------------- > > Key: SPARK-29031 > URL: https://issues.apache.org/jira/browse/SPARK-29031 > Project: Spark > Issue Type: New Feature > Components: SQL > Affects Versions: 3.1.0 > Reporter: Jason Guo > Priority: Major > Labels: SPIP > > Goals > * Add a new SQL grammar of Materialized column > * Implicitly rewrite SQL queries on the complex type of columns if there is > a materialized columns for it > * If the data type of the materialized columns is atomic type, even though > the origin column type is in complex type, enable vectorized read and filter > pushdown to improve performance > Example > Create a normal table > {quote}CREATE TABLE x ( > name STRING, > age INT, > params STRING, > event MAP<STRING, STRING> > ) USING parquet; > {quote} > > Add materialized columns to an existing table > {quote}ALTER TABLE x ADD COLUMNS ( > new_age INT MATERIALIZED age + 1, > city STRING MATERIALIZED get_json_object(params, '$.city'), > label STRING MATERIALIZED event['label'] > ); > {quote} > > When issue a query as below > {quote}SELECT name, age+1, get_json_object(params, '$.city'), event['label'] > FROM x > WHER event['label'] = 'newuser'; > {quote} > It's equivalent to > {quote}SELECT name, new_age, city, label > FROM x > WHERE label = 'newuser'; > {quote} > > The query performance improved dramatically because > # The new query (after rewritten) will read the new column city (in string > type) instead of read the whole map of params(in map string). Much lesser > data are need to read > # Vectorized read can be utilized in the new query and can not be used in > the old one. Because vectorized read can only be enabled when all required > columns are in atomic type > # Filter can be pushdown. Only filters on atomic column can be pushdown. The > original filter event['label'] = 'newuser' is on complex column, so it can > not be pushdown. > # The new query do not need to parse JSON any more. JSON parse is a CPU > intensive operation which will impact performance dramatically > > > > > -- This message was sent by Atlassian Jira (v8.3.4#803005) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org