[ https://issues.apache.org/jira/browse/IMPALA-2522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tim Armstrong updated IMPALA-2522: ---------------------------------- Priority: Major (was: Critical) > Improve the reliability and effectiveness of ETL > ------------------------------------------------ > > Key: IMPALA-2522 > URL: https://issues.apache.org/jira/browse/IMPALA-2522 > Project: IMPALA > Issue Type: Epic > Components: Backend > Affects Versions: Impala 2.2, Impala 2.3.0, Impala 2.5.0, Impala 2.4.0, > Impala 2.6.0, Impala 2.7.0 > Reporter: Mostafa Mokhtar > Assignee: Lars Volker > Priority: Major > Labels: ETL, performance > > h4. Reduce the memory requirements of INSERTs into partitioned tables. > Impala inserts into partitioned Parquet tables suffer from high memory > requirements because each Impala Daemon will keep ~256MB of buffer space per > open partition in the table sink. This often leads to large insert jobs > hitting "Memory limit exceeded" errors. The behavior can be improved by > pre-clustering the data such that only one partition needs to be buffered at > a time in the table sink. > Add a new "clustered" plan hint for insert statements. Example: > {code} > CREATE TABLE dst (...) PARTITIONED BY (year INT, month INT); > INSERT INTO dst PARTITION(year,month) /*+ clustered */ SELECT * FROM src; > {code} > The hint specifies that the data fed into the table sink should be clustered > based on the partition columns. For now, we'll use a sort to achieve > clustering, and the plan should look like this: > SCAN -> SORT (year,month) -> TABLE SINK > h4. Give users additional control over the insertion order. > In order to improve compression and/or the effectiveness of min/max pruning, > it is desirable to control the order in which rows are inserted into table > (mostly for Parquet). > Introduce a "sortby" plan hint for insert statements: Example > {code} > CREATE TABLE dst (...) PARTITIONED BY (year INT, month INT); > INSERT INTO dst PARTITION(year,month) /*+ clustered sortby(day,hour) */ > SELECT * FROM src > {code} > This would produce the following plan: > SCAN -> SORT(year,month,day,hour) -> TABLE SINK > h4. Improve the sort efficiency > The additional sorting step introduced by both solutions above should be as > efficient as possible. > Codegen TupleRowComparator and Tuple::MaterializeExprs. > h4. Summary > With more predictable and resource-efficient ETL users will extract more > value out of Impala and will need to rely less on slow legacy ETL tools like > Hive. -- This message was sent by Atlassian JIRA (v7.6.3#76005) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org