Re: Out of Memory Errors on less number of cores in proportion to Partitions in Data

2015-07-28 Thread Xiangrui Meng
Hi Aniruddh,

Increasing the number of partitions doesn't always help in ALS, due to
the communication/computation trade-off. What rank did you set? If the
rank is not large, I'd recommend a small number of partitions. There
are some other numbers to watch. Do you have super-popular items/users
in your data?
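
For reference, here is a minimal sketch of how rank and the number of
blocks are passed to MLlib's ALS (the paths and parameter values are
hypothetical, for illustration only):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.mllib.recommendation.{ALS, Rating}

    object AlsSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("als-sketch"))

        // Parse "user,product,rating" lines; the input path is hypothetical.
        val ratings = sc.textFile("hdfs:///data/ratings.csv").map { line =>
          val Array(u, p, r) = line.split(',')
          Rating(u.toInt, p.toInt, r.toDouble)
        }

        // With a small rank, keep the number of blocks (partitions of the
        // factor matrices) small as well; blocks = -1 lets MLlib choose.
        val rank       = 10
        val iterations = 10
        val lambda     = 0.01
        val blocks     = 20

        val model = ALS.train(ratings, rank, iterations, lambda, blocks)
        model.save(sc, "hdfs:///models/als") // hypothetical output path
        sc.stop()
      }
    }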

Best,
Xiangrui

On Wed, Jul 8, 2015 at 5:32 AM, Evo Eftimov evo.efti...@isecc.com wrote:
 Also try to increase the number of partitions gradually – not in one big
 jump from 20 to 100, but adding e.g. 10 at a time – and see whether there
 is a correlation with adding more RAM to the executors



 From: Evo Eftimov [mailto:evo.efti...@isecc.com]
 Sent: Wednesday, July 8, 2015 1:26 PM
 To: 'Aniruddh Sharma'; 'user@spark.apache.org'
 Subject: RE: Out of Memory Errors on less number of cores in proportion to
 Partitions in Data



 Are you sure you have actually increased the RAM? (How exactly did you do
 that, and does it show in the Spark UI?)



 Also use the Spark UI and the driver console to check the RAM allocated for
 each RDD and RDD partition in each of the scenarios



 Re b): the general rule of thumb is number of partitions = 2 x number of
 CPU cores
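
 As an illustration, a minimal sketch of sizing partitions by that rule (the
 cluster numbers and input path are hypothetical):

     import org.apache.spark.{SparkConf, SparkContext}

     val sc = new SparkContext(new SparkConf().setAppName("partition-sizing"))

     // Hypothetical cluster: 10 executors x 4 cores = 40 cores in total.
     val totalCores = 10 * 4
     val target     = 2 * totalCores // rule of thumb: 2 x cores = 80 partitions

     val data = sc.textFile("hdfs:///data/ratings.csv", minPartitions = target)
     // or, for an RDD that already exists:
     val resized = data.repartition(target)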



 All partitions are operated on in parallel (by independently running JVM
 threads). However, if you have a substantially higher number of partitions
 (JVM threads) than the number of cores, then you get what happens in any
 JVM or OS – there will be switching between the threads, and some of them
 will be in a suspended mode waiting for a free core (thread contexts also
 occupy additional RAM)
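
 A rough worked illustration of that mapping (the numbers are hypothetical):
 with default settings (one CPU per task), Spark launches at most as many
 concurrent tasks per executor as that executor has cores, so a stage with
 50 partitions on a 4-core executor is processed in waves rather than all at
 once:

     // 50 partitions / 4 concurrently running task threads per executor
     val waves = math.ceil(50.0 / 4).toInt // = 13 scheduling waves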



 From: Aniruddh Sharma [mailto:asharma...@gmail.com]
 Sent: Wednesday, July 8, 2015 12:52 PM
 To: Evo Eftimov
 Subject: Re: Out of Memory Errors on less number of cores in proportion to
 Partitions in Data



 Thanks for your reply...

 I increased executor memory from 4 GB to 35 GB and the out-of-memory error
 still happens. So it seems it may not be entirely due to more buffers from
 more partitions.
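
 For reference, a minimal sketch of how executor memory is typically raised
 (the config key is a standard Spark setting; the values simply mirror this
 thread, and the app name is made up):

     import org.apache.spark.{SparkConf, SparkContext}

     // Must be set before the SparkContext is created; the equivalent
     // spark-submit flag is --executor-memory 35g.
     val conf = new SparkConf()
       .setAppName("als-oom-test")
       .set("spark.executor.memory", "35g") // was 4g before the change

     val sc = new SparkContext(conf)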

 Query a) Is there a way to debug, at a more granular level from the
 user-code perspective, where things could go wrong?



 Query b)

 In general, my query is: suppose it is not ALS (or some iterative
 algorithm). Let's say it is some sample RDD with many partitions, where
 each executor has 50 partitions and each machine has 4 physical cores. Do
 the 4 physical cores try to process these 50 partitions in parallel (doing
 multitasking), or will it work in a way that the 4 cores first process the
 first 4 partitions, then the next 4 partitions, and so on?

 Thanks and Regards

 Aniruddh



 On Wed, Jul 8, 2015 at 5:09 PM, Evo Eftimov evo.efti...@isecc.com wrote:

 This is most likely due to the internal implementation of ALS in MLlib.
 Probably, for each parallel unit of execution (partition, in Spark terms),
 the implementation allocates and uses a RAM buffer where it keeps interim
 results during the ALS iterations



 If we assume that the size of that internal RAM buffer is fixed per unit of
 execution, then Total RAM (20 partitions x fixed RAM buffer) < Total RAM
 (100 partitions x fixed RAM buffer)
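
 As a back-of-the-envelope illustration (the per-partition buffer size is
 purely hypothetical): with a fixed 200 MB buffer per partition spread over
 10 executors, 20 partitions cost 2 x 200 MB = 400 MB of buffer per
 executor, while 100 partitions cost 10 x 200 MB = 2 GB per executor, a 5x
 increase for the same amount of data.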



 From: Aniruddh Sharma [mailto:asharma...@gmail.com]
 Sent: Wednesday, July 8, 2015 12:22 PM
 To: user@spark.apache.org
 Subject: Out of Memory Errors on less number of cores in proportion to
 Partitions in Data



 Hi,

 I am new to Spark. I have done the following test and I am confused by the
 conclusions. I have 2 queries.

 Following are the details of the test:

 Test 1) Used an 11-node cluster where each machine has 64 GB RAM and 4
 physical cores. I ran an ALS algorithm using MLlib on a 1.6 GB data set. I
 ran 10 executors and my Rating data set has 20 partitions. It works. In
 order to increase parallelism, I used 100 partitions instead of 20, and now
 the program does not work and throws an out-of-memory error.
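
 For context, a sketch of the change that triggers the failure (the
 parameter values are hypothetical; loading the ratings is omitted):

     import org.apache.spark.mllib.recommendation.{ALS, Rating}
     import org.apache.spark.rdd.RDD

     def run(rawRatings: RDD[Rating]): Unit = {
       val (rank, iterations, lambda) = (10, 10, 0.01)

       // Works: Rating data in 20 partitions.
       val ok = ALS.train(rawRatings.repartition(20), rank, iterations, lambda)

       // Fails with OutOfMemoryError in this setup: 100 partitions.
       val oom = ALS.train(rawRatings.repartition(100), rank, iterations, lambda)
     }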



 Query a): I had 4 cores on each machine, but the number of partitions is 10
 per executor, so my cores are not sufficient for the partitions. Is it
 supposed to give memory errors for this kind of misconfiguration? If there
 are not sufficient cores and processing cannot be done in parallel, can the
 different partitions not be processed sequentially, so that the operation
 becomes slow rather than throwing a memory error?

 Query b) If it gives an error, then the error message is not meaningful.
 Here my DAG was very simple and I could trace that lowering the number of
 partitions works, but if it throws an error on a misconfiguration of cores,
 then how does one debug it in complex DAGs, as the error does not say
 explicitly that the problem could be due to a low number of cores? If my
 understanding is incorrect, then kindly explain the reasons for the error
 in this case.



 Thanks and Regards

 Aniruddh


