[ https://issues.apache.org/jira/browse/HIVE-4440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14001270#comment-14001270 ]
Lefty Leverenz commented on HIVE-4440:
--------------------------------------

Hive 0.13.0 did not remove *hive.mapjoin.bucket.cache.size*.

Also, the comment that says it should be removed has a typo in the name of the new parameter -- it should be *hive.smbjoin.cache.rows*, not hive.smbjoin.cache.row:

{quote}
+ // hive.mapjoin.bucket.cache.size has been replaced by hive.smbjoin.cache.row,
+ // need to remove by hive .13. Also, do not change default (see SMB operator)
{quote}

Instead of creating a new jira for this, I'll add a comment on HIVE-6586 (for HIVE-6037).

> SMB Operator spills to disk like it's 1999
> ------------------------------------------
>
>                 Key: HIVE-4440
>                 URL: https://issues.apache.org/jira/browse/HIVE-4440
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Gunther Hagleitner
>            Assignee: Gunther Hagleitner
>             Fix For: 0.12.0
>
>         Attachments: HIVE-4440.1.patch, HIVE-4440.2.patch
>
>
> I was recently looking into some performance issue with a query that used SMB
> join and was running really slow. Turns out that the SMB join by default
> caches only 100 values per key before spilling to disk. That seems overly
> conservative to me. Changing the parameter resulted in a ~5x speedup - quite
> significant.
> The parameter is: hive.mapjoin.bucket.cache.size
> Which right now is only used by the SMB Operator as far as I can tell.
> The parameter was introduced originally (3 yrs ago) for the map join operator
> (looks like pre-SMB) and set to 100 to avoid OOM. That seems to have been in
> a different context though where you had to avoid running out of memory with
> the cached hash table in the same process, I think.
> Two things I'd like to propose:
> a) Rename it to what it does: hive.smbjoin.cache.rows
> b) Set it to something less restrictive: 10000
> If you string together a 5 table smb join with a map join and a map-side
> group by aggregation you might still run out of memory, but the renamed
> parameter should be easier to find and reduce. For most queries, I would
> think that 10000 is still a reasonable number to cache (On the reduce side we
> use 25000 for shuffle joins).

--
This message was sent by Atlassian JIRA
(v6.2#6252)