Hi all,
I have a query that involves 2 tables, t1 with 45M rows (not partitioned) and 
t2 with 61M rows (partitioned).

If I disable the parameter hive.auto.convert.join.noconditionaltask, the query 
runs fine. If I enable it with the default value of 
hive.auto.convert.join.noconditionaltask.size (1145324612), the query fails.

The EXPLAIN output of the query tells me that the optimizer does a broadcast 
join because, based on table statistics, t1 should be eligible for broadcasting 
(Data size 564604302 bytes). In fact, if I set 
hive.auto.convert.join.noconditionaltask.size to a value less than 564604302, 
the optimizer doesn't convert the join and the query works well.

The problem is that, with auto convert join enabled and default values, the 
query fails because of JVM errors in the Tez containers. If I look at the Map 
vertex's metrics, I see the following:

OUTPUT_RECORDS               42733890
OUTPUT_BYTES                 1978132995
OUTPUT_BYTES_PHYSICAL        404057386
OUTPUT_BYTES_WITH_OVERHEAD   2063600823

So, my question is: is it possible that the statistics on the table are 
incorrect, so the optimizer thinks it can broadcast the resulting table, but at 
run time the table has more bytes than estimated and exhausts the task's JVM 
memory?
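
To put the numbers above side by side, here is a rough check in Python. It uses only the values quoted in this message; the variable names are mine, and treating OUTPUT_BYTES as an approximation of the real size of the broadcast side is an assumption on my part, not something I have confirmed.

```python
# Values copied from the EXPLAIN output and the Map vertex counters above.
estimated_data_size = 564_604_302    # "Data size" from table statistics (bytes)
broadcast_threshold = 1_145_324_612  # hive.auto.convert.join.noconditionaltask.size
output_bytes = 1_978_132_995         # OUTPUT_BYTES counter of the Map vertex

# Planner's view: the estimate is below the threshold, so the join is converted.
planner_converts = estimated_data_size < broadcast_threshold

# Runtime view (assumption: OUTPUT_BYTES approximates the real broadcast size).
runtime_fits = output_bytes < broadcast_threshold

# How far the runtime volume is from the estimate.
ratio = output_bytes / estimated_data_size

print(planner_converts, runtime_fits, round(ratio, 2))
```

If that assumption holds, the data actually produced is roughly 3.5 times the estimate and no longer fits under the threshold.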

I already tried to execute an ANALYZE TABLE t1 COMPUTE STATISTICS FOR COLUMNS 
(c1, c2,...) but nothing changed.
Also, in the EXPLAIN output, I see that the optimizer doesn't use column 
statistics:

Statistics: Num rows: 176037 Data size: 564604302 Basic stats: COMPLETE Column stats: NONE
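
As a further sanity check on the statistics, the row estimate can be compared with the counters in the same way. This is only a sketch: the names are mine, and it assumes that OUTPUT_RECORDS of the Map vertex is comparable to the row count of t1, which may not hold exactly if the vertex applies filters or projections.

```python
# Values copied from this message.
stats_num_rows = 176_037        # "Num rows" reported in the EXPLAIN statistics
table_rows_approx = 45_000_000  # t1 has about 45M rows
output_records = 42_733_890     # OUTPUT_RECORDS counter of the Map vertex

# Ratio between the records actually emitted and the rows the optimizer expects.
records_vs_estimate = output_records / stats_num_rows

# Ratio between the known table size and the rows the optimizer expects.
table_vs_estimate = table_rows_approx / stats_num_rows

print(round(records_vs_estimate), round(table_vs_estimate))
```

Both ratios are in the hundreds, so the row count in the statistics looks far too low for a 45M-row table.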

tez.task.size is 4096 MB, and increasing it is not an option. Also, if 
possible, I don't want to decrease 
hive.auto.convert.join.noconditionaltask.size, because it is set to 26% of 
tez.task.size, as recommended by best practices.
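
For reference, the relationship between the two settings can be verified with a short calculation. It is a sketch that assumes 4096 MB means 4096 * 1024 * 1024 bytes; the variable names are only illustrative.

```python
# Task size as configured (assumption: 1 MB = 1024 * 1024 bytes).
task_size_bytes = 4096 * 1024 * 1024

# hive.auto.convert.join.noconditionaltask.size as quoted in this message.
noconditionaltask_size = 1_145_324_612

# Fraction of the task memory that a broadcast table is allowed to occupy.
fraction = noconditionaltask_size / task_size_bytes

print(f"{fraction:.4f}")
```

The result is a little under 27%, consistent with the 26% figure mentioned above.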

I'm using Hive 1.2

Thank you




