[ https://issues.apache.org/jira/browse/HIVE-18362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16309791#comment-16309791 ]
wan kun commented on HIVE-18362:
--------------------------------

Hi [~gopalv],

What we do is similar, but there are some differences in the implementation. My implementation takes the ROW_COUNT of the table or partition directly from the Hive metastore, so it needs no additional calculation.

I also have a few questions:
1. Why do you use the NDV instead of ROW_COUNT directly? The NDV will be smaller than the actual number of rows, but the memory actually used is linearly related to the number of rows.
2. Sorry, I have not had a Hive 2.* test environment for a while. Hive branch-2.* relies on the statistics in ColStatistics. Can you tell me where ColStatistics comes from? Does this require an extra pass to compute column statistics before our job runs?
3. The checkNumberOfEntriesForHashTable function only checks the number of entries of one RS at a time. Can it happen that several small tables are loaded into memory together, resulting in an OOM?

Two follow-up questions:
1. Is the ConvertJoinMapJoin optimization only used in TezCompiler? Spark uses SparkMapJoinOptimizer. Is there no such optimizer for MapReduce?
2. Hive branch-1.2 does not contain this part of the code (the parameter is listed in hive-default.xml.template, but it should have no effect there).

> Introduce a parameter to control the max row number for map join conversion
> ----------------------------------------------------------------------------
>
>                 Key: HIVE-18362
>                 URL: https://issues.apache.org/jira/browse/HIVE-18362
>             Project: Hive
>          Issue Type: Bug
>          Components: Query Processor
>            Reporter: wan kun
>            Assignee: Gopal V
>            Priority: Minor
>         Attachments: HIVE-18362-branch-1.2.patch
>
>
> The compression ratio of an ORC file can be very high in some cases.
> The test table has three int columns and twelve million records, but the
> compressed file size is only 4 MB. Hive automatically converts the join to a
> map join, which then causes a memory overflow. So I think it is better to
> have a parameter that limits the total number of records of the table in the
> map join conversion: if the total number of records is larger than the limit,
> the join cannot be converted to a map join.
> *hive.auto.convert.join.max.number = 2500000L*
> The default value of this parameter is 2500000, because that many records
> occupy about 700 MB of memory in the client JVM, and 2500000 records are
> already a large table for a map join.
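A minimal, self-contained sketch of the row-count guard described above, in plain Java: it assumes the small table's row count is available from the basic statistics the metastore keeps in the table/partition parameters (the numRows key) and compares it against a configurable limit in the spirit of the proposed hive.auto.convert.join.max.number. The class MapJoinRowCountGuard, its method names, and the fallback behaviour when the statistic is missing are invented for illustration and are not actual Hive code.

{code:java}
import java.util.HashMap;
import java.util.Map;

/**
 * Illustrative sketch (not actual Hive classes): veto the map-join conversion
 * when the candidate small table's row count, read from the basic statistics
 * stored in the table/partition parameters, exceeds a configurable limit.
 */
public class MapJoinRowCountGuard {

  /** Default mirroring the value proposed in the issue description. */
  private static final long DEFAULT_MAX_ROWS_FOR_MAPJOIN = 2_500_000L;

  /** Key under which Hive's basic stats record the row count. */
  private static final String ROW_COUNT_KEY = "numRows";

  private final long maxRowsForMapJoin;

  public MapJoinRowCountGuard(long maxRowsForMapJoin) {
    this.maxRowsForMapJoin = maxRowsForMapJoin;
  }

  /** Reads the row count from table/partition parameters; -1 if unavailable. */
  static long rowCountFromParameters(Map<String, String> parameters) {
    String value = parameters.get(ROW_COUNT_KEY);
    if (value == null) {
      return -1L;
    }
    try {
      return Long.parseLong(value);
    } catch (NumberFormatException e) {
      return -1L;
    }
  }

  /**
   * True if the small side of the join may still be converted to a map join.
   * A negative row count means the statistic is missing; in that case this
   * guard does not veto the conversion and the existing size-based check
   * decides on its own.
   */
  public boolean allowMapJoinConversion(long smallTableRowCount) {
    return smallTableRowCount < 0 || smallTableRowCount <= maxRowsForMapJoin;
  }

  public static void main(String[] args) {
    MapJoinRowCountGuard guard = new MapJoinRowCountGuard(DEFAULT_MAX_ROWS_FOR_MAPJOIN);

    // The scenario from the description: ~12,000,000 rows compressed into a
    // ~4 MB ORC file, so a pure file-size check would still pick a map join.
    Map<String, String> smallButDenseTable = new HashMap<>();
    smallButDenseTable.put(ROW_COUNT_KEY, "12000000");

    long rows = rowCountFromParameters(smallButDenseTable);
    System.out.println("rowCount=" + rows
        + " mapJoinAllowed=" + guard.allowMapJoinConversion(rows)); // false
  }
}
{code}

With the default limit of 2500000, the twelve-million-row table from the example above would be rejected even though its ORC file is only about 4 MB, which is exactly the situation the proposed parameter is meant to catch.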