>Thank you so much for your quick response. Yes, the option is used only
>for hive-on-tez. I want to know its source and its principle.

in.am=true is the better option as it computes the splits after a job has
been submitted.

Imagine you have 3 tables in your query - with in.am=false, all the splits
have to be generated before the 1st task is spun up.

With in.am=true, the 1st task can spin up as soon as at least one of the
tables has generated its splits. GetSplits() does not block across all
tables, only within 1 table.

In some cases, you can even wait for the 1st task to finish executing
before starting the split-gen for the 2nd task, producing ~1000x speedups.

For example,

insert into bigtable partition(dt)
select ... from small left outer join bigtable on
date(small.ts) = bigtable.dt and small.txnid = bigtable.txnid
where bigtable.txnid is null
;

With in.am = true + tez DPP (dynamic partition pruning), the split-gen is
dynamic and will not generate splits for 100% of bigtable (assuming the
small table only holds today's data).
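
A rough sketch of the session settings this relies on. I'm assuming here
that "in.am" is short for hive.compute.splits.in.am and that "tez DPP" maps
to hive.tez.dynamic.partition.pruning; both may already be on by default in
your Hive version, so treat this as something to check rather than a
required recipe.

```sql
-- Assumption: "in.am" above refers to this property
-- (compute the input splits in the Tez AM instead of the client).
set hive.compute.splits.in.am=true;
-- Assumption: "tez DPP" above refers to dynamic partition pruning on Tez.
set hive.tez.dynamic.partition.pruning=true;
-- Neither setting matters unless the query actually runs on Tez.
set hive.execution.engine=tez;
```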

>Maybe this resource
>“http://www.slideshare.net/Hadoop_Summit/w-235phall1pandey/29” is very
>useful,

It has diagrams, but here's the original .pptx:

http://people.apache.org/~gopalv/W-235p-Pandey.pptx

MD5 (W-235p-Pandey.pptx) = fd3d5c7eb6360f9654bdfbfb20031ba4


Cheers,
Gopal

