(sorry for repetition, the list rejects my previous replies due to quoted message size)
"Auto" just reclusters the input according to the _configured cluster capacity_. (I think there is a safeguard so that it doesn't blow up the number of splits when the initial number of splits is ridiculously small, e.g. so that a 2-split problem is not reclustered into a 300-split problem.) For some algorithms this is appropriate. For others, such as mmul-bound (A'B) problems, there is the "sweet spot" I mentioned, because I/O bandwidth is a function of the parallelism, which technically has nothing to do with the available cluster capacity. So it is possible that with A.par(auto=true).t %*% B.par(auto=true) you get worse performance on a 500-task cluster than on a 60-task cluster (depending on the sizes of the operands and the product).

> On Thu, Apr 28, 2016 at 11:55 AM, Pat Ferrel <[email protected]> wrote:
>
>> Actually on your advice Dmitriy I think these changes went in about 11.
>> Before 11 par was not called. Any clue here?
>>
>> This was in relation to that issue when reading a huge number of part
>> files created by Spark Streaming, which probably trickled down to cause too
>> much parallelization. The auto=true fixed this issue for me but did it have
>> other effects?
