Hi,

Sure, I’ll do some more detailed analysis of the jobs in the Spark UI and share screenshots if possible.

Pat, yup. I’ll only be able to get to this on Monday, though. I’ll comment out the line and see the difference in performance.

Thanks so much for helping, guys; I really appreciate it.

Also, the algorithm implementation for LLR is extremely performant, at least as of Mahout 0.10. I ran some tests on around 61 days of data (which in our case is a fair amount) and the model was built in about 20 minutes, which is pretty amazing. This was on a pretty decent-sized cluster, though.

Thank you,
Nikaash Puri

> On 29-Apr-2016, at 10:18 PM, Pat Ferrel <p...@occamsmachete.com> wrote:
>
> There are some other changes I want to make for the next rev so I’ll do that.
>
> Nikaash, it would still be nice to verify this fixes your problem, also if
> you want to create a Jira it will guarantee I don’t forget.
>
>
> On Apr 29, 2016, at 9:23 AM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:
>
> yes -- i would do it as an optional option -- just like par does -- do
> nothing; try auto, or try exact number of splits
>
> On Fri, Apr 29, 2016 at 9:15 AM, Pat Ferrel <p...@occamsmachete.com> wrote:
> It’s certainly easy to put this in the driver, taking it out of the algo.
>
> Dmitriy, is it a candidate for an Option param to the algo? That would catch
> cases where people rely on it now (like my old DStream example) but easily
> allow it to be overridden to None to imitate pre 0.11, or passed in when the
> app knows better.
>
> Nikaash, are you in a position to comment out the .par(auto=true) and see if
> it makes a difference?
>
>
> On Apr 29, 2016, at 8:53 AM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:
>
> can you please look into spark UI and write down how many splits the job
> generates in the first stage of the pipeline, or anywhere else there's
> significant variation in # of splits in both cases?
>
> the row similarity is a very short pipeline (in comparison with what would
> normally be on average). so only the first input re-splitting is critical.
>
> The splitting along the products is adjusted by optimizer automatically to
> match the amount of data segments observed on average in the input(s). e.g.
> if you compute val C = A %*% B and A has 500 elements per split and B has
> 5000 elements per split then C would approximately have 5000 elements per
> split (the larger average in binary operator cases). That's approximately
> how it works.
>
> However, the par() that has been added is messing with initial parallelism,
> which would naturally affect the rest of the pipeline per above. I now doubt
> it was a good thing -- when i suggested Pat try this, i did not mean to put
> it _inside_ the algorithm itself, rather, into the accurate input
> preparation code in his particular case. However, I don't think it will
> work in any given case. Actually sweet spot parallelism for multiplication
> unfortunately depends on tons of factors -- network bandwidth and hardware
> configuration, so it is difficult to give it a good guess universally. More
> likely, for cli-based prepackaged algorithms (I don't use CLI but rather
> assemble pipelines in scala via scripting and scala application code) the
> initial parallelization adjustment options should probably be provided to
> CLI.
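
For anyone following the thread, here is a rough Scala sketch of the two ideas quoted above: the "larger per-split average wins" rule of thumb for a product like A %*% B, and an optional parallelism hint with the three behaviours mentioned (do nothing, auto, exact number of splits). Every name in it (Parallelism, Auto, Exact, SplitSketch and its methods) is hypothetical and made up for illustration; this is not Mahout's API or its optimizer code, only a toy model of the behaviour as described in the thread.

```scala
// Hypothetical model for illustration only; none of these names are Mahout API.
sealed trait Parallelism
case object Auto extends Parallelism                      // let some heuristic pick the split count
final case class Exact(splits: Int) extends Parallelism  // caller supplies the split count

object SplitSketch {

  // Rule of thumb from the thread: for a binary operator such as A %*% B,
  // the result has roughly the larger of the two per-split averages.
  def productElementsPerSplit(aPerSplit: Long, bPerSplit: Long): Long =
    math.max(aPerSplit, bPerSplit)

  // Splits needed to hold totalElements at a given per-split density (ceiling division).
  def splitsFor(totalElements: Long, perSplit: Long): Long = {
    require(perSplit > 0, "perSplit must be positive")
    (totalElements + perSplit - 1) / perSplit
  }

  // Resolve an optional hint: None leaves the input splitting untouched,
  // Some(Auto) takes the (assumed, externally computed) automatic guess,
  // Some(Exact(n)) forces n splits.
  def initialSplits(current: Int, hint: Option[Parallelism], autoGuess: Int): Int =
    hint match {
      case None => current
      case Some(Auto) => autoGuess
      case Some(Exact(n)) =>
        require(n > 0, "exact split count must be positive")
        n
    }
}
```

With the numbers from the example, SplitSketch.productElementsPerSplit(500L, 5000L) gives 5000. Passing None as the hint keeps whatever splitting the input already has, which is the "do nothing" case Pat describes as imitating pre-0.11 behaviour.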