[jira] [Commented] (HIVE-4440) SMB Operator spills to disk like it's 1999

Lefty Leverenz (JIRA) Sun, 18 May 2014 17:02:24 -0700

    [ 
https://issues.apache.org/jira/browse/HIVE-4440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14001270#comment-14001270
 ]


Lefty Leverenz commented on HIVE-4440:
--------------------------------------

Hive 0.13.0 did not remove *hive.mapjoin.bucket.cache.size*.  Also, the comment 
that says it should be removed has a typo in the name of the new parameter -- 
it should be *hive.smbjoin.cache.rows*, not hive.smbjoin.cache.row:

{quote}
+    // hive.mapjoin.bucket.cache.size has been replaced by hive.smbjoin.cache.row,
+    // need to remove by hive .13. Also, do not change default (see SMB operator)
{quote}

Instead of creating a new jira for this, I'll add a comment on HIVE-6586 (for 
HIVE-6037).

> SMB Operator spills to disk like it's 1999
> ------------------------------------------
>
>                 Key: HIVE-4440
>                 URL: https://issues.apache.org/jira/browse/HIVE-4440
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Gunther Hagleitner
>            Assignee: Gunther Hagleitner
>             Fix For: 0.12.0
>
>         Attachments: HIVE-4440.1.patch, HIVE-4440.2.patch
>
>
> I was recently looking into some performance issue with a query that used SMB 
> join and was running really slow. Turns out that the SMB join by default 
> caches only 100 values per key before spilling to disk. That seems overly 
> conservative to me. Changing the parameter resulted in a ~5x speedup - quite 
> significant.
> The parameter is: hive.mapjoin.bucket.cache.size
> Which right now is only used by the SMB Operator as far as I can tell.
> The parameter was introduced originally (3 yrs ago) for the map join operator 
> (looks like pre-SMB) and set to 100 to avoid OOM. That seems to have been in 
> a different context though where you had to avoid running out of memory with 
> the cached hash table in the same process, I think.
> Two things I'd like to propose:
> a) Rename it to what it does: hive.smbjoin.cache.rows
> b) Set it to something less restrictive: 10000
> If you string together a 5 table smb join with a map join and a map-side 
> group by aggregation you might still run out of memory, but the renamed 
> parameter should be easier to find and reduce. For most queries, I would 
> think that 10000 is still a reasonable number to cache (On the reduce side we 
> use 25000 for shuffle joins).
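
To illustrate the rename proposed above, here is a minimal sketch of how a lookup could prefer the new parameter name and fall back to the old one. It is not Hive's actual code: the helper name resolve_cache_rows, the plain-dict configuration, the precedence order, and the use of 10000 as the fallback default are all assumptions made for illustration. Only the two parameter names and the proposed value come from the issue text.

```python
# Hypothetical sketch, not Hive's implementation.
NEW_KEY = "hive.smbjoin.cache.rows"         # name proposed in the issue
OLD_KEY = "hive.mapjoin.bucket.cache.size"  # original name
DEFAULT_CACHE_ROWS = 10000                  # value proposed in the issue (assumed default here)


def resolve_cache_rows(conf):
    """Return the number of rows to cache per key before spilling.

    Assumed precedence: the new name wins, then the old name,
    then the default.
    """
    for key in (NEW_KEY, OLD_KEY):
        raw = conf.get(key)
        if raw is not None:
            rows = int(raw)
            if rows <= 0:
                raise ValueError(f"{key} must be positive, got {rows}")
            return rows
    return DEFAULT_CACHE_ROWS


if __name__ == "__main__":
    # Only the old name is set, so its value is used.
    print(resolve_cache_rows({OLD_KEY: "100"}))
    # Both are set, so the new name takes precedence.
    print(resolve_cache_rows({OLD_KEY: "100", NEW_KEY: "25000"}))
    # Neither is set, so the assumed default applies.
    print(resolve_cache_rows({}))
```

Under these assumptions, a job that still sets only the old name keeps its tuned value, while anyone searching for an SMB-related setting finds the new name.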



--
This message was sent by Atlassian JIRA
(v6.2#6252)
