Sorry, I wasn't very clear (it looks like Pavan's response was dropped from the list for some reason as well).
I am assuming that:

1) the first map is CPU bound
2) the second map is heavily memory bound

To be specific, let's say you are using 4 m3.2xlarge instances, which have 8 CPUs and 30GB of RAM each, for a total of 32 cores and 120GB of RAM. Since the NLP model can't be distributed, every worker/core must use 4GB of RAM. If the cluster is fully utilized, that means you are consuming 32 * 4GB = 128GB of RAM just for the NLP model. At that point the cluster is out of memory for the NLP model alone, without even considering the data set itself. My suggestion would be to see if r3.8xlarge instances will work (or even X1s, if you have access), since the CPU/memory ratio is better.

Here is the "hack" I proposed in more detail (basically n partitions < total cores):

1) have the first map use a regular number of partitions, say 32 * 4 = 128, which is a reasonable starting place
2) repartition to 16 partitions immediately after that map

At this point, Spark is not guaranteed to distribute your work evenly across the 4 nodes, but it probably will. The net result is that half the CPU cores are idle, but the NLP model is at worst using 16 * 4GB = 64GB of RAM. To be sure, this is a hack, since evenly distributed work across the nodes is not guaranteed.

If you wanted to do this as not a hack, you could perform the map, checkpoint your work, end the job, then submit a new job with a more favorable CPU/memory ratio that reads from the prior job's output. I am guessing this heavily depends on how expensive reloading the data set from disk/network is.

Hopefully one of these helps,

—
Pedro Rodriguez
PhD Student in Large-Scale Machine Learning | CU Boulder
Systems Oriented Data Scientist
UC Berkeley AMPLab Alumni
pedrorodriguez.io | 909-353-4423
github.com/EntilZha | LinkedIn

On July 17, 2016 at 6:16:41 AM, Jacek Laskowski (ja...@japila.pl) wrote:

Hi,

How would that help?! Why would you do that?
Jacek

On 17 Jul 2016 7:19 a.m., "Pedro Rodriguez" <ski.rodrig...@gmail.com> wrote:

You could call map on an RDD which has "many" partitions, then call repartition/coalesce to drastically reduce the number of partitions, so that your second map job has fewer tasks running at once.

—
Pedro Rodriguez
PhD Student in Large-Scale Machine Learning | CU Boulder
Systems Oriented Data Scientist
UC Berkeley AMPLab Alumni
pedrorodriguez.io | 909-353-4423
github.com/EntilZha | LinkedIn

On July 16, 2016 at 4:46:04 PM, Jacek Laskowski (ja...@japila.pl) wrote:

Hi,

My understanding is that these two map functions will end up as a job with one stage (as if you wrote the two maps as a single map), so you really need as many vcores and as much memory as possible for map1 and map2. I initially thought about dynamic allocation of executors, which may or may not help with this case, but since there's just one stage I don't think you can do much.

Pozdrawiam,
Jacek Laskowski
----
https://medium.com/@jaceklaskowski/
Mastering Apache Spark http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski

On Fri, Jul 15, 2016 at 9:54 PM, Pavan Achanta <pacha...@sysomos.com> wrote:
> Hi All,
>
> Here is my use case:
>
> I have a pipeline job consisting of 2 map functions:
>
> A CPU-intensive map operation that does not require a lot of memory.
> A memory-intensive map operation that requires up to 4GB of memory. This
> 4GB of memory cannot be distributed, since it is an NLP model.
>
> Ideally what I'd like to do is use 20 nodes with 4 cores each and minimal
> memory for the first map operation, and then use only 3 nodes with minimal
> CPU but each having 4GB of memory for the 2nd operation.
>
> While it is possible to control this parallelism for each map operation in
> Spark, I am not sure how to control the resources for each operation.
> Obviously I don't want to start off the job with 20 nodes with 4 cores and
> 4GB of memory each, since I cannot afford that much memory.
>
> We use Yarn with Spark.
> Any suggestions?
>
> Thanks and regards,
> Pavan
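For readers following along, the memory arithmetic in Pedro's replies can be checked with a quick back-of-the-envelope script. This is a minimal sketch, not part of the original thread: the helper name and the assumption that each concurrently running task loads its own copy of the model are mine.

```python
# Back-of-the-envelope check of the memory arithmetic discussed above.
# Assumption (mine): each concurrently running Spark task loads its own
# copy of the 4GB NLP model, so model copies = concurrent tasks.

def model_ram_gb(concurrent_tasks, model_gb=4):
    """RAM consumed by per-task copies of the NLP model, in GB."""
    return concurrent_tasks * model_gb

# 4 x m3.2xlarge: 8 cores and 30GB of RAM each
total_cores = 4 * 8   # 32 cores
total_ram = 4 * 30    # 120GB

# Fully utilized cluster: one task per core, so 32 model copies.
print(model_ram_gb(total_cores))  # 128 (GB) > 120GB total: out of memory

# After repartitioning to 16 partitions, at most 16 tasks run at once.
print(model_ram_gb(16))           # 64 (GB), leaving ~56GB for everything else
```

In Spark terms the hack itself is just `rdd.map(cpuBoundFn).repartition(16).map(nlpFn)`. Note that `coalesce` avoids a shuffle where `repartition` does not, but for exactly that reason `coalesce` gives even weaker guarantees about spreading the surviving partitions evenly across nodes, which is the caveat Pedro flags above.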