> Thank you so much for your quick response. Yes, the option is used only
> for hive-on-tez. I want to know its source and its principle.

in.am=true is the better option, as it computes the splits after the job has been submitted.

Imagine you have 3 tables in your query. With in.am=false, all the splits have to be generated before the 1st task is spun up. With in.am=true, the 1st task can spin up as soon as at least one of the tables has generated its splits.

GetSplits() is not blocking across all tables, only within 1 table.

In some cases, you can even wait for the 1st task to finish executing before starting the split-gen for the 2nd task, producing ~1000x speedups.

For example,

    insert into bigtable partition(dt)
    select ... from small
    left outer join bigtable
      on date(small.ts) = bigtable.dt and small.txnid = bigtable.txnid
    where bigtable.txnid is null;

With in.am=true + Tez DPP, the split-gen is dynamic and will not generate splits for 100% of the big table (assuming the small table is just today).

> Maybe this resource
> “http://www.slideshare.net/Hadoop_Summit/w-235phall1pandey/29” is very
> useful,

It has diagrams, but here's an original .pptx

http://people.apache.org/~gopalv/W-235p-Pandey.pptx

MD5 (W-235p-Pandey.pptx) = fd3d5c7eb6360f9654bdfbfb20031ba4

Cheers,
Gopal
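
To make the scheduling difference above concrete, here is a rough toy model in Python. It is only a sketch and not Hive or Tez code: the function names, the table names and the timings are all made up for illustration, and it assumes that with in.am=false the per-table split-gen costs simply add up before anything runs, while with in.am=true the first task only waits for the fastest table. The second function mimics the effect of DPP by counting splits only for the partitions that the small side actually produced.

```python
def first_task_start(split_gen_secs, in_am):
    """Seconds until the first task can start, in this toy model.

    split_gen_secs maps a (hypothetical) table name to the time its
    split generation takes.
    """
    if in_am:
        # Split-gen blocks only within one table, so the first task can
        # start as soon as the fastest table has its splits.
        return min(split_gen_secs.values())
    # Assumption: every table's splits are generated before any task starts.
    return sum(split_gen_secs.values())


def splits_after_pruning(splits_per_partition, partitions_seen):
    """Count splits generated when only the partitions seen on the
    small side of the join are kept (a stand-in for DPP)."""
    return sum(
        count
        for partition, count in splits_per_partition.items()
        if partition in partitions_seen
    )


if __name__ == "__main__":
    # Made-up timings, in seconds.
    tables = {"small": 2.0, "bigtable": 600.0}
    print("in.am=false:", first_task_start(tables, in_am=False))
    print("in.am=true: ", first_task_start(tables, in_am=True))

    # Made-up partition layout: the small table only has today's rows.
    bigtable_splits = {"2016-01-01": 100, "2016-01-02": 100, "2016-01-03": 100}
    print("splits generated:", splits_after_pruning(bigtable_splits, {"2016-01-03"}))
```

The real behaviour depends on how the AM schedules the split generators and on how many partitions survive pruning, so treat the numbers as illustrative only.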