Hi Aniruddh,

Increasing the number of partitions doesn't always help in ALS because of the
communication/computation trade-off. What rank did you set? If the rank is not
large, I'd recommend a small number of partitions. There are some other numbers
to watch as well. Do you have super popular items/users in your data?
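For reference, here is a minimal sketch of where the rank and the number of
blocks enter the RDD-based MLlib ALS API. The rank, iteration count, lambda,
and block count below are placeholders for illustration, not recommended
values:

    import org.apache.spark.mllib.recommendation.{ALS, Rating}
    import org.apache.spark.rdd.RDD

    def trainAls(ratings: RDD[Rating]) = {
      val rank = 10        // placeholder latent-factor dimension
      val iterations = 10  // placeholder
      val lambda = 0.01    // placeholder regularization
      val numBlocks = 20   // keep this modest when the rank is small

      // "blocks" controls how users/products are grouped inside ALS,
      // largely independently of how the input ratings RDD is partitioned
      val model = ALS.train(ratings, rank, iterations, lambda, numBlocks)
      println("user feature partitions: " + model.userFeatures.partitions.length)
      model
    }

With blocks set explicitly like this, ALS re-blocks the ratings internally, so
this value, together with the rank, is what drives its memory and
communication cost.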
Best,
Xiangrui

On Wed, Jul 8, 2015 at 5:32 AM, Evo Eftimov <evo.efti...@isecc.com> wrote:

> Also try to increase the number of partitions gradually – not in one big
> jump from 20 to 100 but by adding e.g. 10 at a time – and see whether there
> is a correlation with adding more RAM to the executors.
>
> From: Evo Eftimov [mailto:evo.efti...@isecc.com]
> Sent: Wednesday, July 8, 2015 1:26 PM
> To: 'Aniruddh Sharma'; 'user@spark.apache.org'
> Subject: RE: Out of Memory Errors on less number of cores in proportion to
> Partitions in Data
>
> Are you sure you have actually increased the RAM (how exactly did you do
> that, and does it show in the Spark UI)?
>
> Also use the Spark UI and the driver console to check the RAM allocated to
> each RDD and each RDD partition in each of the scenarios.
>
> Re b): the general rule is num of partitions = 2 x num of CPU cores.
>
> All partitions are operated on in parallel (by independently running JVM
> threads). However, if you have substantially more partitions (JVM threads)
> than cores, you get what happens in any JVM or OS: switching between
> threads, with some of them suspended while waiting for a free core (thread
> contexts also occupy additional RAM).
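As a rough sketch of the sizing rule and the gradual increase described above
(the executor and core counts here are only illustrative), something like the
following could be pasted into spark-shell:

    import org.apache.spark.rdd.RDD

    // Illustrative cluster shape: 10 executors with 4 cores each
    val totalCores = 10 * 4
    val targetPartitions = 2 * totalCores  // rule of thumb: 2 x num of CPU cores

    // Grow the partition count in small steps (e.g. +10 at a time) instead of
    // jumping from 20 straight to 100, checking executor memory in the UI at
    // each step
    def growPartitions[T](rdd: RDD[T], start: Int, end: Int, step: Int): Unit = {
      (start to end by step).foreach { n =>
        val repartitioned = rdd.repartition(n)
        // force evaluation so the effect of n partitions is observable
        println(n + " partitions -> " + repartitioned.count() + " records")
      }
    }

    // e.g. growPartitions(ratings, start = 20, end = 100, step = 10)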
>
> From: Aniruddh Sharma [mailto:asharma...@gmail.com]
> Sent: Wednesday, July 8, 2015 12:52 PM
> To: Evo Eftimov
> Subject: Re: Out of Memory Errors on less number of cores in proportion to
> Partitions in Data
>
> Thanks for your reply.
>
> I increased executor memory from 4 GB to 35 GB and the out-of-memory error
> still happens, so it does not seem to be entirely due to more buffers for
> more partitions.
>
> Query a) Is there a way to debug, at a more granular level and from the
> user-code perspective, where things could go wrong?
>
> Query b) In general, suppose it is not ALS (or some other iterative
> algorithm) but some sample RDD with 10000 partitions, where each executor
> holds 50 partitions and each machine has 4 physical cores. Do the 4
> physical cores try to process all 50 partitions in parallel (multitasking),
> or do the 4 cores first process the first 4 partitions, then the next 4,
> and so on?
>
> Thanks and Regards
>
> Aniruddh
>
> On Wed, Jul 8, 2015 at 5:09 PM, Evo Eftimov <evo.efti...@isecc.com> wrote:
>
> This is most likely due to the internal implementation of ALS in MLlib.
> Probably, for each parallel unit of execution (partition in Spark terms),
> the implementation allocates and uses a RAM buffer in which it keeps
> interim results during the ALS iterations.
>
> If we assume that the size of that internal RAM buffer is fixed per unit of
> execution, then Total RAM (20 partitions x fixed RAM buffer) < Total RAM
> (100 partitions x fixed RAM buffer).
>
> From: Aniruddh Sharma [mailto:asharma...@gmail.com]
> Sent: Wednesday, July 8, 2015 12:22 PM
> To: user@spark.apache.org
> Subject: Out of Memory Errors on less number of cores in proportion to
> Partitions in Data
>
> Hi,
>
> I am new to Spark. I have run the following test and I am confused by the
> results; I have 2 queries. The details of the test follow.
>
> Test 1) Used an 11-node cluster where each machine has 64 GB RAM and 4
> physical cores. I ran the ALS algorithm from MLlib on a 1.6 GB data set,
> with 10 executors, and my Rating data set has 20 partitions. It works. To
> increase parallelism, I used 100 partitions instead of 20, and now the
> program does not work: it throws an out-of-memory error.
>
> Query a): I have 4 cores on each machine, but 10 partitions in each
> executor, so my cores are not sufficient for the partitions. Is it supposed
> to give memory errors with this kind of misconfiguration? If there are not
> enough cores and the processing cannot all happen in parallel, couldn't the
> different partitions be processed sequentially, so that the operation
> simply becomes slower rather than throwing a memory error?
>
> Query b): If it does give an error, the error message is not meaningful.
> Here my DAG was very simple and I could trace that lowering the number of
> partitions works, but if it throws errors on a misconfiguration of cores,
> how do I debug that in complex DAGs, given that the error does not say
> explicitly that the problem could be due to a low number of cores? If my
> understanding is incorrect, kindly explain the reasons for the error in
> this case.
>
> Thanks and Regards
>
> Aniruddh

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org