Hello everyone,
How does Tez calculate the number of mappers and reducers? We have a custom
StorageHandler, that when used with tez miscalculates the number of mappers
when doing a join. I’ve included an EXPLAIN EXTENDED of a sample query below.
One thing I have noticed is that under propert
> While our StorageHandler does utilize a SERDE that correctly returns
>SerDeStats, it seems like the optimizer
is ignoring these values.
AFAIK, the stats impl is assumed to be approximate & aggregate and is
never used for setting up execution.
> Would anyone know how to correctly set these valu
Ah that makes sense. Thanks again for all the help.
Do you know how the number of splits is calculated?
I also noticed a couple unusual things in our Splits(as seen below). Primarily
getLength() always return 0l, which I’m guessing is possibly causing other
problems as well. Also our getSpli
>Do you know how the number of splits is calculated?
To do that properly needs a whiteboard and a couple of hours - with the
primary complex variable being the YARN headroom calculation.
The simplest way to put it would be that it compute splits, tries to find
out the available capacity and trie
Thanks once again!
“But everything starts off by calling InputFormat::getSplits()”
Correct me if I’m wrong but at this point isn’t the number of splits calculated?
@Override
public InputSplit[] getSplits(JobConf conf, int numSplits) throws
IOException {
….
After this the splits are then g
>Correct me if I¹m wrong but at this point isn¹t the number of splits
>calculated?
Yes you are correct, but the grouping kicks in after that.
The real reason for grouping is because Shuffle operations are internally
MxN and explode out of control if grouping hasn't been done.
Running through 50