OK, I'll read about it. Thanks for your help!
Sincerely,
Igor Kasianov

2016-11-22 17:28 GMT+02:00 Pat Ferrel <[email protected]>:

> No tuning is “obviously good”. Tuning is per dataset and for your cluster. I only said what works for me in other use cases.
>
> Some operations occur in one task per machine and some in one task per cluster. This is the nature of the task itself. See the descriptions of them in the Spark docs.
>
> If you want to change partitioning for the IndexedDataset (or another derivative class), cast it as an IndexedDatasetSpark, then get the internal RDD and do a .repartition. If you use defaultParallelism, then you have a way to experiment from the command line without changing code.
>
> The Mahout parOpts are usable, but I don’t know how they work, so do the research. I put them in for people who might want to use them. I fundamentally don’t like the virtualization of the compute engines in Mahout because it is not necessarily a one-to-one match with Spark tuning; it is also not very well documented, so I avoid it. I once asked about the .par function for Mahout DRMs and got a page-long description that I took nothing useful from.
>
>
> On Nov 22, 2016, at 1:13 AM, Igor Kasianov <[email protected]> wrote:
>
> Thanks for your reply!
>
> First, regarding the previous mail about defaultParallelism:
> When I set the parallelism to 12 (I have 12 cores), training takes about 6.5 hours.
> When I set it to 12 x 4 = 48, training takes much longer (I stopped it after 9 hours).
> When I set the parallelism level to 12, most stages have 12 tasks, but the stage with cooccurrencesIDSs (reduceByKey -> filter in package.scala) has only 3 and takes 2.5 hours (the faster of the two).
> When I set the parallelism level to 48, most stages have 48 tasks, but the stage with cooccurrencesIDSs has 11, and the faster of the two takes 4.5 hours.
>
> So:
> 1) It seems that increasing the parallelism level to 4 x the number of cores is not obviously a good idea.
>
> 2) I'd like to test a parallelism level equal to the number of cores, but also set the same level for cooccurrencesIDSs. I have played with ParOpts, but unfortunately it had no effect. I am 'inspired' by your optimistic assessment regarding the restrictions on using ParOpts, but how can I learn more about it? Only from the code?
>
> Once more, thanks for your help.
>
> Sincerely,
> Igor Kasianov
>
> 2016-11-21 18:59 GMT+02:00 Pat Ferrel <[email protected]>:
>
>> Do not use ParOpts unless you understand Mahout’s use of them better than I do, and I’m a committer.
>>
>> Mahout tries to define its own meta-engine optimizations, and they do not directly map to Spark. Mahout runs on several backend engines, like Spark and Flink. ParOpts needs to be understood from the Mahout side, so I only use .repartition; when the input is repartitioned, this carries through to all operations performed on it.
>>
>> There is a .distinct.collect for ids only that creates a BiMap of ids. This requires a phase that goes through one machine, but it leads to huge performance benefits in several other stages. Scaling your Spark cluster is the best way to increase speed for this phase. There are several optimizations already made in dealing with ids; for instance, the BiMap is created only once for all users and broadcast to the executors. The math only works out if the user space is identical for all input event types, so we only calculate them once, for the conversion event. Item ids must be created for every event since the events may have different item types.
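(A minimal illustration of the id phase described above: the distinct ids are collected through the driver once, turned into an id-to-index dictionary, and broadcast to the executors. This is not Mahout's actual BiMap code; the names `broadcastIdDictionary` and `idRdd` are placeholders for the sketch.)

    import org.apache.spark.SparkContext
    import org.apache.spark.broadcast.Broadcast
    import org.apache.spark.rdd.RDD

    // Sketch only: the distinct().collect() is the single-machine phase that
    // shows up as a stage with very few tasks, but the broadcast dictionary it
    // produces lets later stages translate ids without another shuffle.
    def broadcastIdDictionary(idRdd: RDD[String], sc: SparkContext): Broadcast[Map[String, Int]] = {
      val dictionary = idRdd.distinct().collect().zipWithIndex.toMap
      sc.broadcast(dictionary)
    }

(This is why, for that particular stage, scaling the cluster rather than repartitioning is the lever.)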
>>
>> On Nov 20, 2016, at 3:02 PM, Igor Kasianov <[email protected]> wrote:
>>
>> Yes, thanks.
>> Now I see that you use repartition in DataSource.scala.
>>
>> But I still have trouble with Mahout's cooccurrencesIDSs. For a test I built Mahout 0.13.0-SNAPSHOT as suggested on actionml.com and added ParOpts to cooccurrencesIDSs (ParOpts(12, 12, false)), link:
>> <https://github.com/erebus1/template-scala-parallel-universal-recommendation/blob/custom/src/main/scala/URAlgorithm.scala#L149>
>> i.e. min = 12, exact = 12, auto = false.
>>
>> But as a result it makes 19 tasks on my dev machine, but only 3 on the Spark cluster. I can't find any adequate documentation on Mahout's DRM .par, and I can't understand this strange behaviour.
>>
>> It seems cooccurrencesIDSs does not take Spark parallelism or ParOpts into account.
>>
>> Do you have any idea how I can control parallelism in cooccurrencesIDSs? Right now it uses only 3 cores out of 12.
>>
>> Sincerely,
>> Igor Kasianov
>>
>> 2016-11-19 23:04 GMT+02:00 Pat Ferrel <[email protected]>:
>>
>>> The current head of the template repo repartitions the input based on Spark's default parallelism, which I set on the `pio train` CLI to 4 x the number of cores. This speeds up the math drastically. There are still some things that look like bottlenecks, but taking them out makes things slower. The labels you see in the Spark GUI should be considered approximations.
>>>
>>> The parOpts are a Mahout-specific way to control partitioning, and I avoid them by using the Spark method.
>>>
>>>
>>> On Nov 16, 2016, at 5:56 AM, Igor Kasianov <[email protected]> wrote:
>>>
>>> Hi,
>>>
>>> I'm using the UR template and have some trouble with scalability.
>>>
>>> Training takes 18 hours (each day), and for the last 12 hours it uses only one core.
>>> As far as I can see, URAlgorithm.scala (line 144) calls SimilarityAnalysis.cooccurrencesIDSs with data.actions (12 partitions).
>>>
>>> Until the reduceByKey in AtB.scala it executes in parallel, but after that it executes in a single thread.
>>>
>>> It is strange that when SimilarityAnalysis.scala (line 145) calls
>>> indexedDatasets(0).create(drm, indexedDatasets(0).columnIDs, indexedDatasets(i).columnIDs)
>>> it returns an IndexedDataset with only one partition.
>>>
>>> As far as I can see, SimilarityAnalysis.scala (line 63) does
>>> drmARaw.par(auto = true)
>>> Maybe this causes the decrease in the number of partitions.
>>> The master branch of Mahout has ParOpts:
>>> https://github.com/apache/mahout/blob/master/math-scala/src/main/scala/org/apache/mahout/math/cf/SimilarityAnalysis.scala#L142
>>> Maybe this can fix the problem.
>>>
>>> So, am I right about the root of the problem, and how can I fix it?
>>>
>>> <Screenshot from 2016-11-16 15:42:36.png>
>>> I have a Spark cluster with 12 cores and 128 GB, but with the increasing number of events I can't scale the UR because of this bottleneck.
>>>
>>> P.S. Please do not suggest using an event window (I already use one, but the daily number of events keeps increasing).
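(To make the repartitioning advice in this thread concrete, here is a minimal sketch, assuming the input is a plain RDD of events, of tying the partition count to Spark's default parallelism so it can be tuned from the `pio train` command line without code changes. `repartitionInput` and `eventsRDD` are placeholder names, not the template's actual code.)

    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD

    // Sketch: repartition the input by Spark's default parallelism. The value can
    // then be changed per run, e.g. by passing --conf spark.default.parallelism=48
    // through to spark-submit on the `pio train` command line.
    def repartitionInput[T](eventsRDD: RDD[T], sc: SparkContext): RDD[T] =
      eventsRDD.repartition(sc.defaultParallelism)

(As noted above, once the input is repartitioned this carries through to the operations performed on it, so in most cases no Mahout-side tuning is needed.)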
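(For completeness, a sketch of the Mahout-side knob discussed in the thread. The `.par(min, exact, auto)` signature is an assumption taken from reading Mahout 0.13.0-SNAPSHOT-era sources and the ParOpts(12, 12, false) example above, i.e. min = 12, exact = 12, auto = false; verify it against the Mahout build you actually run, since the thread itself notes this API is poorly documented.)

    import scala.reflect.ClassTag
    import org.apache.mahout.math.drm._  // brings in the implicit DRM operators, including .par

    // Sketch only: request a minimum or exact number of partitions for a DRM,
    // mirroring the drmARaw.par(auto = true) call referenced in the thread.
    def tuneDrmParallelism[K: ClassTag](drm: DrmLike[K], parts: Int): DrmLike[K] =
      drm.par(exact = parts)  // alternatives: drm.par(min = parts) or drm.par(auto = true)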
