[ https://issues.apache.org/jira/browse/IMPALA-8214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tim Armstrong resolved IMPALA-8214.
-----------------------------------
       Resolution: Fixed
    Fix Version/s: Impala 3.2.0

> Bad plan in load_nested.py
> --------------------------
>
>                 Key: IMPALA-8214
>                 URL: https://issues.apache.org/jira/browse/IMPALA-8214
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Infrastructure
>    Affects Versions: Impala 3.1.0
>            Reporter: Tim Armstrong
>            Assignee: Tim Armstrong
>            Priority: Major
>             Fix For: Impala 3.2.0
>
>
> The plan for the below SQL, which is executed without stats, has the larger
> input on the build side of the join and does a broadcast join, which is very
> suboptimal. This causes high memory consumption when loading larger scale
> factors, and generally makes the loading process slower than necessary. We
> should flip the join and make it a shuffle join.
> https://github.com/apache/impala/blob/d481cd4/testdata/bin/load_nested.py#L123
> {code}
> tmp_customer_sql = r"""
>     SELECT
>       c_custkey, c_name, c_address, c_nationkey, c_phone, c_acctbal, c_mktsegment,
>       c_comment,
>       GROUP_CONCAT(
>         CONCAT(
>           CAST(o_orderkey AS STRING), '\003',
>           CAST(o_orderstatus AS STRING), '\003',
>           CAST(o_totalprice AS STRING), '\003',
>           CAST(o_orderdate AS STRING), '\003',
>           CAST(o_orderpriority AS STRING), '\003',
>           CAST(o_clerk AS STRING), '\003',
>           CAST(o_shippriority AS STRING), '\003',
>           CAST(o_comment AS STRING), '\003',
>           CAST(lineitems_string AS STRING)
>         ), '\002'
>       ) orders_string
>     FROM {source_db}.customer
>     LEFT JOIN tmp_orders_string ON c_custkey = o_custkey
>     WHERE c_custkey % {chunks} = {chunk_idx}
>     GROUP BY 1, 2, 3, 4, 5, 6, 7, 8""".format(**sql_params)
> {code}
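
One way the join flip proposed in the description could be expressed is with Impala's planner hints. The sketch below is an illustration only and is not taken from the committed fix. It assumes that, under STRAIGHT_JOIN, Impala keeps the join order as written and builds the hash table from the right-hand input, so moving customer to the right side of a RIGHT JOIN puts the smaller input on the build side while still preserving every customer row. The function name build_tmp_customer_sql and the database name "tpch" are hypothetical.

```python
# Hypothetical rewrite of the query from the issue; not the committed fix.
def build_tmp_customer_sql(source_db, chunks, chunk_idx):
    """Return the customer query with an explicit join order and a shuffle hint."""
    sql_params = {
        "source_db": source_db,
        "chunks": chunks,
        "chunk_idx": chunk_idx,
    }
    # STRAIGHT_JOIN asks the planner to keep the join order as written.
    # tmp_orders_string RIGHT JOIN customer keeps all customers, which matches
    # the original customer LEFT JOIN tmp_orders_string.
    # The /* +SHUFFLE */ hint requests a partitioned join instead of a broadcast.
    return r"""
        SELECT STRAIGHT_JOIN
          c_custkey, c_name, c_address, c_nationkey, c_phone, c_acctbal, c_mktsegment,
          c_comment,
          GROUP_CONCAT(
            CONCAT(
              CAST(o_orderkey AS STRING), '\003',
              CAST(o_orderstatus AS STRING), '\003',
              CAST(o_totalprice AS STRING), '\003',
              CAST(o_orderdate AS STRING), '\003',
              CAST(o_orderpriority AS STRING), '\003',
              CAST(o_clerk AS STRING), '\003',
              CAST(o_shippriority AS STRING), '\003',
              CAST(o_comment AS STRING), '\003',
              CAST(lineitems_string AS STRING)
            ), '\002'
          ) orders_string
        FROM tmp_orders_string
        RIGHT JOIN /* +SHUFFLE */ {source_db}.customer ON c_custkey = o_custkey
        WHERE c_custkey % {chunks} = {chunk_idx}
        GROUP BY 1, 2, 3, 4, 5, 6, 7, 8""".format(**sql_params)


if __name__ == "__main__":
    # "tpch" is a placeholder database name used only for this example.
    print(build_tmp_customer_sql("tpch", 4, 0))
```

Because the query runs without stats, the planner cannot tell which input is larger, which is why the sketch pins both the join order and the distribution mode by hand. Running EXPLAIN on the generated statement would be the way to confirm which side ends up as the build input.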