Hey,

by using an UDTF Hive should also be prepared that it will return more rows 
from that single row - which is not true in this case: because 1 log entry will 
be 1 result row.
Note that the CBO does not seem to support UDTFs (I've tried apache/master with 
a lateral view right now - and it failed).
I think most likely the biggest problem you are facing is that the cbo doesn't 
run on your query at all.

create table tx2 as select named_struct('i',1,'str','bla') as s;
explain select * from tx2 lateral view inline(array(s)) s1;
...
| Plan not optimized by CBO.                         |
...

Have you tried returning with a struct?
explain select s.i,s.str from tx2 t;
| Plan not optimized by CBO.                         |

for your case you should repeat the udf call multiple times
explain select 
named_struct('i',1,'str','bla').i,named_struct('i',1,'str','bla').str ;
or
explain select s.i,s.str from (select named_struct('i',1,'str','bla') s) t;


cheers,
Zoltan


On 12/13/21 12:49 PM, sam wrote:
Hi team,

I am trying to learn the CBO of hive because I need to make some performance 
tuning for my ETL job.

I find a confluence doc below, but I am not sure if it is the newest version, 
can anyone help to confirm that?
https://cwiki.apache.org/confluence/display/Hive/Cost-based+optimization+in+Hive 
<https://cwiki.apache.org/confluence/display/Hive/Cost-based+optimization+in+Hive>

Another question is that we develop some UDTF help us to parse log like:
select
my-udtf(log) as (id
,name
,time)
from tb_log
So do you have any other better idea for this scenario?

BTW, the version of Hive we used is above 3.0. My data increase by PB every day.


Thanks in advance,

Samuel







Reply via email to