Re: spark-itemsimilarity runs orders of times slower from Mahout 0.11 onwards

Dmitriy Lyubimov Fri, 29 Apr 2016 09:24:17 -0700

yes -- i would do it as an optional parameter, just like par does: do
nothing, try auto, or try an exact number of splits
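
A minimal sketch of that three-way option, in plain Scala with no Mahout
dependency. Every name here (ParHint, Auto, Exact, Input, prepare,
defaultParallelism) is hypothetical and not part of Mahout's API; split counts
are only tracked as integers rather than applied to a real distributed matrix.

```scala
// Hypothetical names throughout; this is not Mahout's actual API.
sealed trait ParHint
case object Auto extends ParHint                     // let the engine pick a parallelism
final case class Exact(splits: Int) extends ParHint  // force an exact number of splits

object ParOptionSketch {
  // Stand-in for a distributed input: we only track its number of splits.
  final case class Input(splits: Int)

  // `defaultParallelism` is an assumed stand-in for whatever "auto" would choose.
  def prepare(in: Input, hint: Option[ParHint], defaultParallelism: Int): Input =
    hint match {
      case None           => in  // do nothing: keep the input splitting as it is
      case Some(Auto)     => in.copy(splits = defaultParallelism)
      case Some(Exact(n)) =>
        require(n > 0, "number of splits must be positive")
        in.copy(splits = n)
    }

  def main(args: Array[String]): Unit = {
    val in = Input(splits = 4)
    println(prepare(in, None, 64))              // splitting left untouched
    println(prepare(in, Some(Auto), 64))        // engine-chosen parallelism
    println(prepare(in, Some(Exact(200)), 64))  // caller knows better
  }
}
```

Passing None leaves the input alone, which is the "override to None" case
mentioned below; a driver or CLI flag could map straight onto the hint.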


On Fri, Apr 29, 2016 at 9:15 AM, Pat Ferrel <p...@occamsmachete.com> wrote:

> It’s certainly easy to put this in the driver, taking it out of the algo.
>
> Dmitriy, is it a candidate for an Option param to the algo? That would
> catch cases where people rely on it now (like my old DStream example) but
> easily allow it to be overridden to None to imitate pre 0.11, or passed in
> when the app knows better.
>
> Nikaash, are you in a position to comment out the .par(auto=true) and see
> if it makes a difference?
>
>
> On Apr 29, 2016, at 8:53 AM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:
>
> can you please look into the spark UI and write down how many splits the job
> generates in the first stage of the pipeline, or anywhere else there's
> significant variation in # of splits between the two cases?
>
> the row similarity is a very short pipeline (in comparison with what would
> normally be on average). so only the first input re-splitting is critical.
>
> The splitting along the products is adjusted by optimizer automatically to
> match the amount of data segments observed on average in the input(s). e.g.
> if you compute val C = A %*% B and A has 500 elements per split and B has
> 5000 elements per split then C would approximately have 5000 elements per
> split (the larger average in binary operator cases).  That's approximately
> how it works.
>
> However, the par() that has been added is messing with the initial parallelism,
> which would naturally affect the rest of the pipeline per above. I now doubt it
> was a good thing -- when i suggested that Pat try this, i did not mean to put
> it _inside_ the algorithm itself, but rather into the accurate input
> preparation code in his particular case. However, I don't think it will
> work in every case. The sweet spot parallelism for multiplication
> unfortunately depends on tons of factors, such as network bandwidth and
> hardware configuration, so it is difficult to give a good guess universally.
> More likely, for cli-based prepackaged algorithms (I don't use the CLI but
> rather assemble pipelines in scala via scripting and scala application code)
> the initial parallelization adjustment options should probably be exposed in
> the CLI.
>
>
