[ https://issues.apache.org/jira/browse/HIVE-27142?focusedWorklogId=851755&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-851755 ]
ASF GitHub Bot logged work on HIVE-27142: ----------------------------------------- Author: ASF GitHub Bot Created on: 20/Mar/23 10:41 Start Date: 20/Mar/23 10:41 Worklog Time Spent: 10m Work Description: kasakrisz commented on code in PR #4120: URL: https://github.com/apache/hive/pull/4120#discussion_r1141926589 ########## ql/src/java/org/apache/hadoop/hive/ql/optimizer/stats/annotation/StatsRulesProcFactory.java: ########## @@ -167,21 +167,23 @@ public Object process(Node nd, Stack<Node> stack, NodeProcessorCtx procCtx, PrunedPartitionList partList = aspCtx.getParseContext().getPrunedPartitions(tsop); ColumnStatsList colStatsCached = aspCtx.getParseContext().getColStatsCached(partList); Table table = tsop.getConf().getTableMetadata(); + boolean skipStatsCollection = table.isNonNative() && !HiveConf.getBoolVar(aspCtx.getConf(), Review Comment: I don't think skipping fetching stats is a good idea at this point. There are non-native storage formats like Iceberg which supports stats and there is an ongoing effort to use them. https://github.com/apache/hive/pull/4000 Issue Time Tracking ------------------- Worklog Id: (was: 851755) Time Spent: 50m (was: 40m) > Map Join not working as expected when joining non-native tables with native > tables > ----------------------------------------------------------------------------------- > > Key: HIVE-27142 > URL: https://issues.apache.org/jira/browse/HIVE-27142 > Project: Hive > Issue Type: Bug > Components: Statistics > Affects Versions: All Versions > Reporter: Syed Shameerur Rahman > Assignee: Syed Shameerur Rahman > Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > Time Spent: 50m > Remaining Estimate: 0h > > *1. Issue :* > When *_hive.auto.convert.join=true_* and if the underlying query is trying to > join a large non-native hive table with a small native hive table, The map > join is happening in the wrong side i.e on the map task which process the > small native hive table and it can lead to OOM when the non-native table is > really large and only few map tasks are spawned to scan the small native hive > tables. > > *2. Why is this happening ?* > This happens due to improper stats collection/computation of non native hive > tables. Since the non-native hive tables are actually stored in a different > location which Hive does not know of and only a temporary path which is > visible to Hive while creating a non native table does not store the actual > data, The stats collection logic tend to under estimate the data/rows and > hence causes the map join to happen in the wrong side. > > *3. Potential Solutions* > 3.1 Turn off *_hive.auto.convert.join=false._* This can have a negative > impact of the query if the same query is trying to do multiple joins i.e > one join with non-native tables and other join where both the tables are > native. > 3.2 Compute stats for non-native table by firing the ANALYZE TABLE <> > command before joining native and non-native commands. The user may or may > not choose to do it. > 3.3 Do not collect/estimate stats for non-native hive tables by default > (Preferred solution) -- This message was sent by Atlassian Jira (v8.20.10#820010)