RE: Help with processing multiple RDDs

2014-11-11 Thread Kapil Malik
Hi,

How is the 78g distributed among the driver, daemon, and executor?

Can you please paste the logs regarding "I don't have enough memory to 
hold the data in memory"?
Are you collecting any data in the driver?

Lastly, did you try a repartition to create smaller and more evenly 
distributed partitions?

Regards,

Kapil 

-Original Message-
From: akhandeshi [mailto:ami.khande...@gmail.com] 
Sent: 12 November 2014 03:44
To: u...@spark.incubator.apache.org
Subject: Help with processing multiple RDDs

I have been struggling to process a set of RDDs.  Conceptually, it is not a 
large data set. It seems that no matter how much memory I give the JVM or how 
I partition, I can't process this data.  I am caching the RDD.  I have tried 
persist(disk and memory), persist(memory) and persist(off_heap) with no success.  
Currently I am giving 78g to my driver, daemon and executor memory.

Currently, it seems to have trouble with one of the largest partitions,
rdd_22_29, which is 25.9 GB.

The metrics page shows Summary Metrics for 29 Completed Tasks.  However, a few 
partitions are missing from the list below, and I do have warnings in the log 
file indicating that I don't have enough memory to hold the data in memory.  
I don't understand what I am doing wrong or how I can troubleshoot. Any 
pointers will be appreciated...

14/11/11 21:28:45 WARN CacheManager: Not enough space to cache partition rdd_22_20 in memory! Free memory is 17190150496 bytes.
14/11/11 21:29:27 WARN CacheManager: Not enough space to cache partition rdd_22_13 in memory! Free memory is 17190150496 bytes.


Block Name  Storage Level                      Size in Memory  Size on Disk  Executors
rdd_22_0    Memory Deserialized 1x Replicated  2.1 MB          0.0 B         mddworker.c.fi-mdd-poc.internal:54974
rdd_22_10   Memory Deserialized 1x Replicated  7.0 GB          0.0 B         mddworker.c.fi-mdd-poc.internal:54974
rdd_22_11   Memory Deserialized 1x Replicated  1290.2 MB       0.0 B         mddworker.c.fi-mdd-poc.internal:54974
rdd_22_12   Memory Deserialized 1x Replicated  1167.7 KB       0.0 B         mddworker.c.fi-mdd-poc.internal:54974
rdd_22_14   Memory Deserialized 1x Replicated  3.8 GB          0.0 B         mddworker.c.fi-mdd-poc.internal:54974
rdd_22_15   Memory Deserialized 1x Replicated  4.0 MB          0.0 B         mddworker.c.fi-mdd-poc.internal:54974
rdd_22_16   Memory Deserialized 1x Replicated  2.4 GB          0.0 B         mddworker.c.fi-mdd-poc.internal:54974
rdd_22_17   Memory Deserialized 1x Replicated  37.6 MB         0.0 B         mddworker.c.fi-mdd-poc.internal:54974
rdd_22_18   Memory Deserialized 1x Replicated  120.9 MB        0.0 B         mddworker.c.fi-mdd-poc.internal:54974
rdd_22_19   Memory Deserialized 1x Replicated  755.9 KB        0.0 B         mddworker.c.fi-mdd-poc.internal:54974
rdd_22_2    Memory Deserialized 1x Replicated  289.5 KB        0.0 B         mddworker.c.fi-mdd-poc.internal:54974
rdd_22_21   Memory Deserialized 1x Replicated  11.9 KB         0.0 B         mddworker.c.fi-mdd-poc.internal:54974
rdd_22_22   Memory Deserialized 1x Replicated  24.0 B          0.0 B         mddworker.c.fi-mdd-poc.internal:54974
rdd_22_23   Memory Deserialized 1x Replicated  24.0 B          0.0 B         mddworker.c.fi-mdd-poc.internal:54974
rdd_22_24   Memory Deserialized 1x Replicated  3.0 MB          0.0 B         mddworker.c.fi-mdd-poc.internal:54974
rdd_22_25   Memory Deserialized 1x Replicated  24.0 B          0.0 B         mddworker.c.fi-mdd-poc.internal:54974
rdd_22_26   Memory Deserialized 1x Replicated  4.0 GB          0.0 B         mddworker.c.fi-mdd-poc.internal:54974
rdd_22_27   Memory Deserialized 1x Replicated  24.0 B          0.0 B         mddworker.c.fi-mdd-poc.internal:54974
rdd_22_28   Memory Deserialized 1x Replicated  1846.1 KB       0.0 B         mddworker.c.fi-mdd-poc.internal:54974
rdd_22_29   Memory Deserialized 1x Replicated  25.9 GB         0.0 B         mddworker.c.fi-mdd-poc.internal:54974
rdd_22_3    Memory Deserialized 1x Replicated  267.1 KB        0.0 B         mddworker.c.fi-mdd-poc.internal:54974
rdd_22_4    Memory Deserialized 1x Replicated  24.0 B          0.0 B         mddworker.c.fi-mdd-poc.internal:54974
rdd_22_5    Memory Deserialized 1x Replicated  24.0 B          0.0 B         mddworker.c.fi-mdd-poc.internal:54974
rdd_22_6    Memory Deserialized 1x Replicated  24.0 B          0.0 B         mddworker.c.fi-mdd-poc.internal:54974
rdd_22_7    Memory Deserialized 1x Replicated  14.8 KB         0.0 B         mddworker.c.fi-mdd-poc.internal:54974
rdd_22_8    Memory Deserialized 1x Replicated  24.0 B          0.0 B         mddworker.c.fi-mdd-poc.internal:54974
rdd_22_9    Memory Deserialized 1x Replicated  24.0 B          0.0 B         mddworker.c.fi-mdd-poc.internal:54974
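
The skew in the listing above can be quantified with a few lines of arithmetic. The following is only a rough sketch in plain Python (no Spark needed): the sizes are copied from the listing, while the binary unit factors and the assumption that block indices run from 0 to 29 are guesses, not something the thread confirms. It reports how much of the cached data sits in the largest block and which indices are absent from the listing.

```python
# Sizes copied from the storage listing in this message.
# Binary units (1 KB = 1024 B) are an assumption about how the UI rounds.
UNITS = {"B": 1, "KB": 1024, "MB": 1024**2, "GB": 1024**3}

CACHED = {
    0: "2.1 MB", 10: "7.0 GB", 11: "1290.2 MB", 12: "1167.7 KB",
    14: "3.8 GB", 15: "4.0 MB", 16: "2.4 GB", 17: "37.6 MB",
    18: "120.9 MB", 19: "755.9 KB", 2: "289.5 KB", 21: "11.9 KB",
    22: "24.0 B", 23: "24.0 B", 24: "3.0 MB", 25: "24.0 B",
    26: "4.0 GB", 27: "24.0 B", 28: "1846.1 KB", 29: "25.9 GB",
    3: "267.1 KB", 4: "24.0 B", 5: "24.0 B", 6: "24.0 B",
    7: "14.8 KB", 8: "24.0 B", 9: "24.0 B",
}


def to_bytes(text):
    """Convert a string such as '25.9 GB' to a byte count."""
    value, unit = text.split()
    return float(value) * UNITS[unit]


sizes = {index: to_bytes(text) for index, text in CACHED.items()}
total = sum(sizes.values())
largest = max(sizes, key=sizes.get)
share = sizes[largest] / total

# Assumption: block indices run from 0 to 29 (30 partitions in total).
missing = sorted(set(range(30)) - set(sizes))

print(f"cached blocks: {len(sizes)}")
print(f"largest block: rdd_22_{largest}, {share:.0%} of cached bytes")
print("absent from listing:", [f"rdd_22_{i}" for i in missing])
```

Under those assumptions, the absent indices include rdd_22_13 and rdd_22_20, the two blocks named in the CacheManager warnings, and a single block accounts for well over half of the cached bytes.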




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Help-with-processing-multiple-RDDs-tp18628.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

Re: Help with processing multiple RDDs

2014-11-11 Thread buring
I think you can try setting a lower spark.storage.memoryFraction, for example 0.4:
conf.set("spark.storage.memoryFraction", "0.4")  // default 0.6
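
For a sense of scale, here is a minimal back-of-the-envelope sketch in plain Python. It assumes the Spark 1.x memory model, where the cache pool is roughly heap x spark.storage.memoryFraction x spark.storage.safetyFraction (default 0.9), and it assumes the full 78g from the original post is usable JVM heap, which in practice it will not quite be. The function name is made up for illustration.

```python
GIB = 1024**3


def storage_pool_bytes(heap_bytes, memory_fraction=0.6, safety_fraction=0.9):
    """Approximate cache pool size under the Spark 1.x memory model."""
    return heap_bytes * memory_fraction * safety_fraction


# Assumption: all 78g mentioned in the thread is available as heap.
heap = 78 * GIB

pool_default = storage_pool_bytes(heap, memory_fraction=0.6)
pool_lowered = storage_pool_bytes(heap, memory_fraction=0.4)

for label, pool in (("0.6", pool_default), ("0.4", pool_lowered)):
    print(f"memoryFraction={label}: about {pool / GIB:.1f} GiB for cached blocks, "
          f"{(heap - pool) / GIB:.1f} GiB left for everything else")
```

Lowering the fraction shrinks the cache pool and leaves more heap for task execution, so it helps only if the tasks themselves are short of memory; it does not make a 25.9 GB block any easier to cache.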



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Help-with-processing-multiple-RDDs-tp18628p18659.html


RE: Help with processing multiple RDDs

2014-11-11 Thread Khandeshi, Ami
I am running locally, in client mode.  I have allocated as much as 85g to the 
driver, executor, and daemon.  When I look at the Java processes, I see two:
20974 SparkSubmitDriverBootstrapper
21650 Jps
21075 SparkSubmit
I have tried repartition before, but my understanding is that it comes with 
overhead, and in my previous attempt I didn't achieve much success.  I am not 
clear on how best to get even partitions. Any thoughts?

I am caching the RDD and performing a count on the keys.
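
One commonly suggested way to even out partitions when a few keys dominate a per-key count is key salting: append a small suffix to each key so a hot key is split across several partitions, count per salted key, then strip the suffix and merge. The sketch below simulates this in plain Python rather than Spark. The data, the bucket count, and the crc32-based partitioner are all hypothetical stand-ins; nothing in the thread confirms that the skew here comes from hot keys.

```python
import zlib
from collections import Counter

NUM_PARTITIONS = 30  # assumption: matches block indices 0-29 in the thread
SALT_BUCKETS = 8     # hypothetical; would be tuned to the observed skew


def partition_of(key):
    """Deterministic stand-in for a hash partitioner."""
    return zlib.crc32(key.encode("utf-8")) % NUM_PARTITIONS


def largest_partition(keys):
    """Number of records in the fullest partition."""
    return max(Counter(partition_of(k) for k in keys).values())


# Hypothetical skewed input: one hot key dominates.
records = ["hot"] * 9000 + [f"key{i}" for i in range(1000)]

# Step 1: add a round-robin salt so the hot key becomes several distinct keys.
salted = [f"{key}#{i % SALT_BUCKETS}" for i, key in enumerate(records)]

# Step 2: count per salted key (the skew-sensitive step, now spread out).
partial = Counter(salted)

# Step 3: strip the salt and merge the partial counts.
final = Counter()
for salted_key, n in partial.items():
    key, _, _salt = salted_key.rpartition("#")
    final[key] += n

print("largest partition, plain keys: ", largest_partition(records))
print("largest partition, salted keys:", largest_partition(salted))
print("count for 'hot':", final["hot"])
```

The merged counts are identical to a direct count; only the intermediate distribution changes. The cost is a second, much smaller aggregation over the partial counts.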

I am running it again, with repartitioning on the dataset. Let us see if that 
helps!  I will send you the logs as soon as this completes!

Thank you,  I sincerely appreciate your help!

Regards,

Ami
