Can you please look at the Spark UI and write down how many splits the job generates in the first stage of the pipeline, or anywhere else there is significant variation in the number of splits, in both cases?
Row similarity is a very short pipeline (compared with what a typical pipeline would be), so only the initial re-splitting of the input is critical. The splitting of the products is adjusted automatically by the optimizer to match the average amount of data per split observed in the input(s). E.g. if you compute val C = A %*% B and A has 500 elements per split and B has 5000 elements per split, then C would have approximately 5000 elements per split (the larger average, in binary operator cases). That's approximately how it works.

However, the par() that has been added is messing with the initial parallelism, which naturally affects the rest of the pipeline per the above. I now doubt it was a good thing. When I suggested that Pat try this, I did not mean putting it _inside_ the algorithm itself, but rather into the input preparation code for his particular case. However, I don't think it will work in every case. The sweet spot for multiplication parallelism unfortunately depends on tons of factors, such as network bandwidth and hardware configuration, so it is difficult to come up with a good guess that works universally. More likely, for CLI-based prepackaged algorithms (I don't use the CLI myself, but rather assemble pipelines in Scala via scripting and Scala application code), options for adjusting the initial parallelization should be exposed through the CLI.
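
To make the rule of thumb above concrete, here is a rough sketch in plain Scala. It is not the optimizer's actual code: `Operand`, `SplitEstimate`, `targetPerSplit` and `productSplits` are hypothetical names, and deriving the split count as total elements divided by the target density is an assumption made purely for illustration.

```scala
// Hypothetical model of the heuristic described above; NOT Mahout's optimizer code.
final case class Operand(elements: Long, splits: Int) {
  require(splits > 0, "splits must be positive")

  // Average number of elements held by one split of this operand.
  def avgPerSplit: Double = elements.toDouble / splits
}

object SplitEstimate {
  // Binary operator case: aim for the larger of the two operand averages.
  def targetPerSplit(a: Operand, b: Operand): Double =
    math.max(a.avgPerSplit, b.avgPerSplit)

  // Assumption: the product is cut into ceil(total / target) splits, at least one.
  def productSplits(a: Operand, b: Operand, productElements: Long): Int =
    math.max(1, math.ceil(productElements / targetPerSplit(a, b)).toInt)
}
```

With A at 500 elements per split and B at 5000 elements per split, `targetPerSplit` comes out at 5000, so forcing a different parallelism on the input changes the averages that every later product is sized against.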