Re: spark-itemsimilarity runs orders of times slower from Mahout 0.11 onwards

Pat Ferrel Fri, 29 Apr 2016 12:07:35 -0700

Right, will do. But Nikaash, if you could just comment out those lines and see 
if it has an effect it would be informative and even perhaps solve your problem 
sooner than my changes. No great rush. Playing around with different values, as 
Dmitriy says, might yield better results and for that you can mess with the 
code or wait for my changes.


Yeah, it’s fast enough in most cases. The main work is the optimized A’A, A’B 
stuff in the BLAS optimizer Dmitriy put in. It is something like 10x faster 
than a similar algo in Hadoop MR. This particular calc and generalization is 
not in any other Spark or now Flink lib that I know of.


On Apr 29, 2016, at 11:24 AM, Dmitriy Lyubimov <[email protected]> wrote:

Nikaash,

yes unfortunately you may need to play with parallelism for your particular
load/cluster manually to get the best out of it. I guess Pat will be adding
the option.

On Fri, Apr 29, 2016 at 11:14 AM, Nikaash Puri <[email protected]>
wrote:

> Hi,
> 
> Sure, I’ll do some more detailed analysis of the jobs on the UI and share
> screenshots if possible.
> 
> Pat, yup, I’ll only be able to get to this on Monday, though. I’ll comment
> out the line and see the difference in performance.
> 
> Thanks so much for helping guys, I really appreciate it.
> 
> Also, the algorithm implementation for LLR is extremely performant, at
> least as of Mahout 0.10. I ran some tests for around 61 days of data (which
> in our case is a fair amount) and the model was built in about 20 minutes,
> which is pretty amazing. This was using a pretty decent sized cluster,
> though.
> 
> Thank you,
> Nikaash Puri
> 
> On 29-Apr-2016, at 10:18 PM, Pat Ferrel <[email protected]> wrote:
> 
> There are some other changes I want to make for the next rev so I’ll do
> that.
> 
> Nikaash, it would still be nice to verify this fixes your problem, also if
> you want to create a Jira it will guarantee I don’t forget.
> 
> 
> On Apr 29, 2016, at 9:23 AM, Dmitriy Lyubimov <[email protected]> wrote:
> 
> yes -- i would do it as an optional option -- just like par does -- do
> nothing; try auto, or try exact number of splits
> 
> On Fri, Apr 29, 2016 at 9:15 AM, Pat Ferrel <[email protected]> wrote:
> 
>> It’s certainly easy to put this in the driver, taking it out of the algo.
>> 
>> Dmitriy, is it a candidate for an Option param to the algo? That would
>> catch cases where people rely on it now (like my old DStream example) but
>> easily allow it to be overridden to None to imitate pre 0.11, or passed in
>> when the app knows better.
>> 
>> Nikaash, are you in a position to comment out the .par(auto=true) and see
>> if it makes a difference?
>> 
>> 
>> On Apr 29, 2016, at 8:53 AM, Dmitriy Lyubimov <[email protected]> wrote:
>> 
>> can you please look into the Spark UI and write down how many splits the job
>> generates in the first stage of the pipeline, or anywhere else there's
>> significant variation in # of splits in both cases?
>> 
>> the row similarity is a very short pipeline (in comparison with what would
>> normally be on average). so only the first input re-splitting is critical.
>> 
>> The splitting along the products is adjusted by optimizer automatically to
>> match the amount of data segments observed on average in the input(s).
>> e.g.
>> if you compute val C = A %*% B and A has 500 elements per split and B has
>> 5000 elements per split then C would approximately have 5000 elements per
>> split (the larger average in binary operator cases).  That's approximately
>> how it works.
>> 
>> However, the par() that has been added is messing with initial parallelism,
>> which would naturally affect the rest of the pipeline per above. I now doubt
>> it was a good thing -- when I suggested that Pat try this, I did not mean to
>> put it _inside_ the algorithm itself, but rather into the accurate input
>> preparation code in his particular case. However, I don't think it will
>> work in any given case. Actually, sweet-spot parallelism for multiplication
>> unfortunately depends on tons of factors -- network bandwidth and hardware
>> configuration -- so it is difficult to give it a good guess universally.
>> More likely, for CLI-based prepackaged algorithms (I don't use the CLI but
>> rather assemble pipelines in Scala via scripting and Scala application code)
>> the initial parallelization adjustment options should probably be provided
>> to the CLI.
>> 
>> 
> 
> 
>
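
To make the two ideas discussed above concrete, here is a minimal, self-contained Scala sketch. It does not use Mahout and is not the code that was or will be committed. The names ParHint, AutoPar, ExactPar, initialSplits and productSplits are hypothetical, the "auto" rule (use at least one split per core) is an assumption, and the product estimate is only a rough reading of Dmitriy's "larger average" description.

```scala
// Hypothetical model of the parallelism choices discussed in the thread.
// None of these names are Mahout API; they only illustrate the idea.
sealed trait ParHint
case object AutoPar extends ParHint                     // let a heuristic pick the split count
final case class ExactPar(splits: Int) extends ParHint  // the caller knows the right split count

object ParallelismSketch {

  /** Number of splits to use for the initial input.
    * `hint = None` leaves the input untouched, which is how the thread
    * describes imitating pre-0.11 behaviour.
    */
  def initialSplits(currentSplits: Int, clusterCores: Int, hint: Option[ParHint]): Int =
    hint match {
      case None              => currentSplits
      // Assumed heuristic: at least one split per available core.
      case Some(AutoPar)     => math.max(currentSplits, clusterCores)
      case Some(ExactPar(n)) => require(n > 0, "splits must be positive"); n
    }

  /** Rough split count for C = A %*% B, assuming C ends up with about as many
    * elements per split as the larger of the two input averages.
    */
  def productSplits(elementsInC: Long, avgPerSplitA: Long, avgPerSplitB: Long): Int = {
    val targetPerSplit = math.max(avgPerSplitA, avgPerSplitB)
    math.max(1, math.ceil(elementsInC.toDouble / targetPerSplit).toInt)
  }
}
```

With the per-split averages from Dmitriy's example (500 for A, 5000 for B) and an illustrative product of 1,000,000 elements, productSplits estimates 200 splits. Passing None to initialSplits keeps whatever splitting the input already has, so an application that knows its cluster can supply ExactPar instead of relying on the automatic guess.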
