[
https://issues.apache.org/jira/browse/HIVE-7751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14223741#comment-14223741
]
Hari Sankar Sivarama Subramaniyan commented on HIVE-7751:
---------------------------------------------------------
Revisiting this issue.
> Mapjoin set in a non-conditional task can fail in MR mode because of memory
> overhead issues
> ---------------------------------------------------------------------------------------------
>
> Key: HIVE-7751
> URL: https://issues.apache.org/jira/browse/HIVE-7751
> Project: Hive
> Issue Type: Bug
> Reporter: Hari Sankar Sivarama Subramaniyan
> Assignee: Hari Sankar Sivarama Subramaniyan
>
> select sum(ss_quantity) from store_sales join store on store.s_store_sk =
> store_sales.ss_store_sk join customer_demographics on
> customer_demographics.cd_demo_sk = store_sales.ss_cdemo_sk join
> customer_address on store_sales.ss_addr_sk = customer_address.ca_address_sk
> join date_dim on store_sales.ss_sold_date_sk = date_dim.d_date_sk where
> d_year = 2000 and ((cd_marital_status = 'M' and cd_education_status =
> 'Advanced Degree' and ss_sales_price between 100.00 and 150.00) or
> (cd_marital_status = 'M' and cd_education_status = 'Advanced Degree' and
> ss_sales_price between 50.00 and 100.00) or (cd_marital_status = 'M' and
> cd_education_status = 'Advanced Degree' and ss_sales_price between 150.00 and
> 200.00)) and ((ca_country = 'United States' and ca_state in ('TX', 'OH',
> 'TX') and ss_net_profit between 0 and 2000) or (ca_country = 'United States'
> and ca_state in ('OR', 'MN', 'KY') and ss_net_profit between 150 and 3000) or
> (ca_country = 'United States' and ca_state in ('VA', 'TX', 'MS') and
> ss_net_profit between 50 and 25000));
> When the data is stored in ORC format, the above query can fail because we
> convert the join to a non-conditional task on the assumption that the mapjoin
> will succeed at runtime. At runtime, however, the query can fail because of
> memory overhead. To prevent such failures, CommonJoinTaskDispatcher should
> use table statistics instead of calling
> ql.exec.Utilities.getTotalInputFileSize(). This would lead to better
> decisions in MR mode. Tez, on the other hand, handles such scenarios better
> because it actually relies on table stats to get the data size.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)