How does tez calculate the number of Mappers/Reducers?

2016-06-24 Thread Long, Andrew
Hello everyone, How does Tez calculate the number of mappers and reducers? We have a custom StorageHandler that, when used with Tez, miscalculates the number of mappers when doing a join. I’ve included an EXPLAIN EXTENDED of a sample query below. One thing I have noticed is that under propert

Re: How does tez calculate the number of Mappers/Reducers?

2016-06-24 Thread Gopal Vijayaraghavan
> While our StorageHandler does utilize a SERDE that correctly returns SerDeStats, it seems like the optimizer is ignoring these values.
AFAIK, the stats impl is assumed to be approximate & aggregate and is never used for setting up execution.
> Would anyone know how to correctly set these valu

Re: How does tez calculate the number of Mappers/Reducers?

2016-06-24 Thread Long, Andrew
Ah, that makes sense. Thanks again for all the help. Do you know how the number of splits is calculated? I also noticed a couple of unusual things in our splits (as seen below). Primarily, getLength() always returns 0L, which I’m guessing is possibly causing other problems as well. Also our getSpli
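To make the getLength() concern concrete, here is a minimal sketch of a size-based grouper. It is not Tez's grouping code; the class `NaiveSizeGrouper`, its greedy strategy, and the byte counts are all assumptions made for illustration. It only shows why a grouper that relies on reported lengths has nothing to work with when every split reports zero.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only: a naive greedy grouper, NOT Tez's algorithm.
final class NaiveSizeGrouper {

    /** Packs split lengths into groups, closing a group once it holds at least targetBytes. */
    static List<List<Long>> group(long[] lengths, long targetBytes) {
        List<List<Long>> groups = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        long currentBytes = 0;
        for (long len : lengths) {
            current.add(len);
            currentBytes += len;
            if (currentBytes >= targetBytes) {
                groups.add(current);
                current = new ArrayList<>();
                currentBytes = 0;
            }
        }
        // Whatever is left over forms a final, possibly undersized, group.
        if (!current.isEmpty()) {
            groups.add(current);
        }
        return groups;
    }

    public static void main(String[] args) {
        long target = 128; // made-up target size per group
        int honest = group(new long[] {64, 64, 64, 64}, target).size();
        int zeroed = group(new long[] {0, 0, 0, 0}, target).size();
        System.out.println("groups with real lengths: " + honest);
        System.out.println("groups with zero lengths: " + zeroed);
    }
}
```

In this sketch, splits that all report a length of zero never reach the target, so they collapse into a single group regardless of how much data they actually cover. Whether the real grouper behaves the same way depends on its implementation and configuration.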

Re: How does tez calculate the number of Mappers/Reducers?

2016-06-24 Thread Gopal Vijayaraghavan
> Do you know how the number of splits is calculated?
To do that properly needs a whiteboard and a couple of hours - with the primary complex variable being the YARN headroom calculation. The simplest way to put it would be that it computes splits, tries to find out the available capacity and trie

Re: How does tez calculate the number of Mappers/Reducers?

2016-06-25 Thread Long, Andrew
Thanks once again!
“But everything starts off by calling InputFormat::getSplits()”
Correct me if I’m wrong, but at this point isn’t the number of splits calculated?
@Override
public InputSplit[] getSplits(JobConf conf, int numSplits) throws IOException { ….
After this the splits are then g

Re: How does tez calculate the number of Mappers/Reducers?

2016-06-27 Thread Gopal Vijayaraghavan
> Correct me if I’m wrong but at this point isn’t the number of splits calculated?
Yes, you are correct, but the grouping kicks in after that. The real reason for grouping is that Shuffle operations are internally MxN and explode out of control if grouping hasn't been done. Running through 50
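The MxN point can be checked with a few lines of arithmetic. The sketch below is only an illustration: the class `ShuffleFanout`, the split and reducer counts, and the fixed number of splits per group are hypothetical and not taken from this thread.

```java
// Illustrative arithmetic only; every count below is a made-up example value.
final class ShuffleFanout {

    /** Upper bound on shuffle fetches when each of m producers feeds each of n consumers. */
    static long connections(long m, long n) {
        return Math.multiplyExact(m, n);
    }

    /** Number of tasks after packing splits into groups of at most splitsPerGroup (ceiling division). */
    static long groupedTasks(long splits, long splitsPerGroup) {
        if (splitsPerGroup <= 0) {
            throw new IllegalArgumentException("splitsPerGroup must be positive");
        }
        return (splits + splitsPerGroup - 1) / splitsPerGroup;
    }

    public static void main(String[] args) {
        long splits = 5_000;      // hypothetical raw split count from getSplits()
        long reducers = 200;      // hypothetical reducer count
        long splitsPerGroup = 50; // hypothetical grouping factor

        long ungrouped = connections(splits, reducers);
        long grouped = connections(groupedTasks(splits, splitsPerGroup), reducers);

        System.out.println("ungrouped fetches: " + ungrouped);
        System.out.println("grouped fetches:   " + grouped);
    }
}
```

With these example numbers, one mapper per split gives 5,000 x 200 = 1,000,000 potential fetches, while grouping 50 splits per task gives 100 x 200 = 20,000.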