(sorry for the repetition; the list rejected my previous replies due to
quoted message size)

"Auto" just reclusters the input per given _configured cluster capacity_
(there's some safe guard there though i think that doesn't blow up # of
splits if the initial number of splits is ridiculously small though, e.g.
not to recluster 2-split problem into a 300-split problem).
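For reference, this is roughly what the repartitioning hints look like in the Samsara DSL (a sketch from memory; the exact parameter names and defaults of `par` may differ by version, so treat this as illustrative rather than authoritative):

```scala
// drmA is an already-loaded distributed matrix (DrmLike).
// `par` accepts one of three mutually exclusive hints:
val a1 = drmA.par(auto = true) // recluster per configured cluster capacity
val a2 = drmA.par(min = 60)    // ensure at least 60 splits
val a3 = drmA.par(exact = 60)  // force exactly 60 splits
```

The `min`/`exact` forms are what you'd reach for when you know the algorithm's sweet spot and don't want capacity-based reclustering.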

For some algorithms, this is appropriate.

For others, such as mmul-bound (A'B) problems, there's the "sweet spot" I
mentioned, because I/O bandwidth is a function of the parallelism -- which
technically has nothing to do with available cluster capacity. It is
possible that if you do A.par(auto=true).t %*% B.par(auto=true) you get
worse performance on a 500-task cluster than on a 60-task cluster
(depending on the sizes of the operands and the product).
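To make the "sweet spot" idea concrete, here is a toy cost model (entirely hypothetical constants, not measurements): per-task compute time shrinks as 1/p while shuffle/I-O overhead grows roughly linearly in the parallelism p, so total time is minimized at some intermediate p that is independent of how many task slots the cluster actually has.

```scala
object SweetSpot {
  // Toy model: time(p) = compute / p + ioPerTask * p.
  // `compute` and `ioPerTask` are made-up illustrative constants.
  def time(p: Int, compute: Double = 36000.0, ioPerTask: Double = 10.0): Double =
    compute / p + ioPerTask * p

  // Pick the cheapest parallelism among candidate task counts.
  def bestParallelism(candidates: Seq[Int]): Int =
    candidates.minBy(p => time(p))
}
```

With these constants the minimum sits near p = sqrt(36000 / 10) = 60, so a 60-task plan beats a 500-task one -- mirroring the example above, where more capacity does not mean a faster product.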


> On Thu, Apr 28, 2016 at 11:55 AM, Pat Ferrel <[email protected]>
> wrote:
>
>> Actually on your advice Dmitriy I think these changes went in about 11.
>> Before 11 par was not called. Any clue here?
>>
>> This was in relation to that issue when reading a huge number of part
>> files created by Spark Streaming, which probably trickled down to cause too
>> much parallelization. The auto=true fixed this issue for me but did it have
>> other effects?
>>
