[ 
https://issues.apache.org/jira/browse/HIVE-18362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16309791#comment-16309791
 ] 

wan kun commented on HIVE-18362:
--------------------------------

Hi [~gopalv],
What we do is similar, but the implementations differ a little. My 
implementation takes the table's or partition's ROW_COUNT directly from the 
Hive metastore, so it needs no additional calculation.
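
For reference, here is a minimal sketch of what I mean by reading the row 
count straight from the metastore: it only looks up the numRows parameter the 
metastore already stores. Treat the exact usage as an illustration, not a 
patch; the partition case is the same except the parameters come from the 
Partition object instead of the Table.

{code:java}
import java.util.Map;

import org.apache.hadoop.hive.common.StatsSetupConst;
import org.apache.hadoop.hive.conf.HiveConf;
import org.apache.hadoop.hive.metastore.HiveMetaStoreClient;
import org.apache.hadoop.hive.metastore.api.Table;

public class RowCountFromMetastore {

  // Reads the numRows statistic already stored in the table parameters,
  // so no extra scan or column-statistics job is needed.
  public static long tableRowCount(HiveConf conf, String db, String tbl) throws Exception {
    HiveMetaStoreClient client = new HiveMetaStoreClient(conf);
    try {
      Table table = client.getTable(db, tbl);
      Map<String, String> params = table.getParameters();
      String numRows = params.get(StatsSetupConst.ROW_COUNT); // "numRows"
      return numRows == null ? -1L : Long.parseLong(numRows);
    } finally {
      client.close();
    }
  }
}
{code}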

I also have a few questions to ask:

1. Why do you use NDV instead of using ROW_COUNT directly? I think NDV will be 
smaller than the actual number of rows, while the actual memory usage grows 
linearly with the number of rows (a rough sketch of what I mean follows this 
list).
2. I'm sorry, I haven't had a Hive 2.* test environment for a while. Hive 
branch-2.* relies on ColStatistics. Can you tell me where ColStatistics comes 
from? Is it necessary to run an extra calculation for the additional column 
statistics before our job?
3. The checkNumberOfEntriesForHashTable function only checks the number of 
entries of one RS at a time. Can it happen that multiple map-join tables are 
loaded into memory together, resulting in an OOM?
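
To make question 1 concrete, here is a rough, purely illustrative sketch. It 
is a hypothetical helper, not the actual ConvertJoinMapJoin / 
checkNumberOfEntriesForHashTable code, and the per-entry byte cost is an 
assumption:

{code:java}
// Hypothetical sketch only: it illustrates why NDV can under-estimate the
// hash table footprint while the row count tracks it linearly.
public final class HashTableSizeSketch {

  // Assumed per-entry overhead; the real number depends on key/value widths.
  private static final long BYTES_PER_ENTRY = 64L;

  /** Estimate based on distinct join keys (what NDV gives you). */
  public static long ndvBasedBytes(long ndvOfJoinKey) {
    return ndvOfJoinKey * BYTES_PER_ENTRY;
  }

  /** Estimate based on total rows (every row is kept; NDV <= numRows). */
  public static long rowBasedBytes(long numRows) {
    return numRows * BYTES_PER_ENTRY;
  }

  /** Gate the conversion on the row-based estimate, not the NDV-based one. */
  public static boolean fitsInMemory(long numRows, long memoryBudgetBytes) {
    return rowBasedBytes(numRows) <= memoryBudgetBytes;
  }

  private HashTableSizeSketch() {
  }
}
{code}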

There are also two follow-up questions:
1. Is the ConvertJoinMapJoin optimization only used in TezCompiler? Spark uses 
SparkMapJoinOptimizer. Is there no corresponding optimizer for MapReduce?
2. Hive branch-1.2 does not have this part of the code (the parameter does 
appear in hive-default.xml.template, but it should have no effect there).

> Introduce a parameter to control the max row number for map join conversion
> ---------------------------------------------------------------------------
>
>                 Key: HIVE-18362
>                 URL: https://issues.apache.org/jira/browse/HIVE-18362
>             Project: Hive
>          Issue Type: Bug
>          Components: Query Processor
>            Reporter: wan kun
>            Assignee: Gopal V
>            Priority: Minor
>         Attachments: HIVE-18362-branch-1.2.patch
>
>
> The compression ratio of an ORC compressed file can be very high in some 
> cases.
> The test table has three Int columns and twelve million records, but the 
> compressed file size is only 4 MB. Hive automatically converts the join to a 
> map join, but this causes a memory overflow. So I think it is better to have 
> a parameter that limits the total number of table records in the map join 
> conversion: if the total number of records is larger than the limit, the 
> join cannot be converted to a map join.
> *hive.auto.convert.join.max.number = 2500000L*
> The default value for this parameter is 2500000, because that many records 
> occupy about 700 MB of memory in the client JVM, and 2500000 records are 
> already a large table for a map join.
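
A rough sketch of the proposed gate, for illustration only; the property name 
comes from the proposal above and is not an existing HiveConf variable:

{code:java}
import org.apache.hadoop.hive.conf.HiveConf;

public final class MapJoinRowLimit {

  // Proposed property from this issue; not an existing HiveConf variable.
  public static final String MAX_ROWS_PROP = "hive.auto.convert.join.max.number";
  public static final long DEFAULT_MAX_ROWS = 2500000L;

  /**
   * Returns false when the small table's row count exceeds the limit,
   * so the join stays a common (shuffle) join instead of a map join.
   */
  public static boolean allowConversion(HiveConf conf, long smallTableNumRows) {
    long maxRows = conf.getLong(MAX_ROWS_PROP, DEFAULT_MAX_ROWS);
    return smallTableNumRows <= maxRows;
  }

  private MapJoinRowLimit() {
  }
}
{code}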



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
