> I have a map join in which the smaller tables together are 200 MB and
>trying to  have one block of main table be processed by one tez task.
...
> What am I missing and is this even the right way of approaching the
>problem ?

You need to be more specific about the Hive version. Hive-13 needs ~6x the
amount of map-join memory for Tez compared to Hive-14.

The Hive-1.0 branch is a bit better at estimating map-join sizes as well,
since it accounts for per-object memory overheads via JavaDataModel.

Hive-1.1 regressed a little on this, which will get fixed when we get to hive-1.2.

But for the 1.x line, the approx size of data that fits within a map-join
is (container Xmx - io.sort.mb)/3.
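As a rough sketch of that rule of thumb (the numbers below are illustrative, not Hive defaults):

```python
def mapjoin_budget_mb(container_xmx_mb, io_sort_mb):
    """Approximate data (in MB) that fits in a 1.x map-join:
    (container Xmx - io.sort.mb) / 3."""
    return (container_xmx_mb - io_sort_mb) / 3.0

# e.g. a 4 GB container with a 512 MB sort buffer:
budget = mapjoin_budget_mb(4096, 512)
print(round(budget))  # ~1195 MB, comfortably above 200 MB of small tables
```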

This plays into the NewRatio setting in JDK7 as well: make sure the young
generation is only 1/8th of the heap instead of the 1/3rd default
(otherwise ~30% of your memory cannot be used by the sort buffer or the
map-join, since they are tenured data).
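For example (a sketch only; the -Xmx value and whether you set this via hive.tez.java.opts or your container launch opts depend on your setup):

```
# In JDK7, -XX:NewRatio=N gives the young gen 1/(N+1) of the heap.
# Default NewRatio=2 -> young gen is 1/3; NewRatio=7 -> young gen is ~1/8.
set hive.tez.java.opts=-Xmx3276m -XX:NewRatio=7;
```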

Also, running "ANALYZE TABLE <tbl> COMPUTE STATISTICS;" on the small tables
will fill in the uncompressed size fields so that we don't estimate
map-joins based on zlib sizes (which coincidentally are ~3x off).
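To check that the statistics landed, something like the following should show a non-zero rawDataSize alongside totalSize in the table parameters (small_dim is a placeholder table name; the exact output layout varies by Hive version):

```
ANALYZE TABLE small_dim COMPUTE STATISTICS;
DESCRIBE FORMATTED small_dim;
-- compare rawDataSize (uncompressed) vs totalSize (on-disk, zlib-compressed)
```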

And if you still keep getting heap errors, I can take a look at it if you
have a .hprof.bz2 file to share & fix any corner cases we might've missed.

Cheers,
Gopal
PS: The current trunk implements a Grace HashJoin which is another
approach to the memory limit problem - a more traditional solution than
fixing mem sizes.
