Sorry, I wasn’t very clear (it also looks like Pavan’s response was dropped from 
the list for some reason).

I am assuming that:
1) the first map is CPU bound
2) the second map is heavily memory bound

To be specific, let’s say you are using 4 m3.2xlarge instances, which have 8 CPUs 
and 30GB of RAM each, for a total of 32 cores and 120GB of RAM. Since the NLP 
model can’t be distributed, every worker/core must hold its own 4GB copy. If 
the cluster is fully utilized, that means the NLP model alone consumes 
32 * 4GB = 128GB of RAM; the cluster is out of memory before you even account 
for the data set itself. My suggestion would be to see if r3.8xlarge instances 
will work (or even X1s if you have access), since the CPU-to-memory ratio is 
better. Here is the “hack” I proposed in more detail (basically n partitions < 
total cores):

1) have the first map use a regular number of partitions; 32 * 4 = 128 is a 
reasonable starting place
2) repartition immediately after that map down to 16 partitions. At this point, 
Spark is not guaranteed to distribute your work evenly across the 4 nodes, but 
it probably will. The net result is that half the CPU cores are idle, but the 
NLP model is at worst using 16 * 4GB = 64GB of RAM. To be sure, this is a hack, 
since even distribution of work across the nodes is not guaranteed.
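
In code, the hack looks roughly like this (cpuHeavyParse and applyNlpModel are 
placeholders for your two map functions, and the paths are made up):

// Plenty of partitions for the CPU-bound map so all 32 cores stay busy.
val parsed = sc.textFile("hdfs:///input", minPartitions = 128)
  .map(cpuHeavyParse)

// Shrink to 16 partitions before the memory-bound map so that at most 16
// copies of the 4GB NLP model are loaded at once (~64GB cluster-wide).
val scored = parsed
  .repartition(16) // or coalesce(16) to skip the shuffle, at the risk of uneven placement
  .map(applyNlpModel)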

If you wanted to do this without the hack, you could perform the first map, 
checkpoint/save your work, end the job, then submit a new job with a more 
favorable CPU/memory ratio that reads the prior job’s output. Whether that is 
worth it probably depends heavily on how expensive reloading the data set from 
disk/network is.
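
A rough sketch of the two-job version (the record type, paths, and function 
names below are placeholders, not a tested recipe):

// Job 1, submitted on the CPU-heavy cluster: run the cheap map and persist the result.
sc.textFile("hdfs:///input", minPartitions = 128)
  .map(cpuHeavyParse)
  .saveAsObjectFile("hdfs:///tmp/parsed")

// Job 2, submitted separately with fewer, larger executors
// (e.g. --executor-cores 1 --executor-memory 8g): reload and run the NLP map.
sc.objectFile[ParsedRecord]("hdfs:///tmp/parsed")
  .map(applyNlpModel)
  .saveAsObjectFile("hdfs:///output")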

Hopefully one of these helps,
—
Pedro Rodriguez
PhD Student in Large-Scale Machine Learning | CU Boulder
Systems Oriented Data Scientist
UC Berkeley AMPLab Alumni

pedrorodriguez.io | 909-353-4423
github.com/EntilZha | LinkedIn

On July 17, 2016 at 6:16:41 AM, Jacek Laskowski (ja...@japila.pl) wrote:

Hi,

How would that help?! Why would you do that?

Jacek


On 17 Jul 2016 7:19 a.m., "Pedro Rodriguez" <ski.rodrig...@gmail.com> wrote:
You could call map on an RDD which has “many” partitions, then call 
repartition/coalesce to drastically reduce the number of partitions so that 
your second map has fewer tasks running at once.

—
Pedro Rodriguez
PhD Student in Large-Scale Machine Learning | CU Boulder
Systems Oriented Data Scientist
UC Berkeley AMPLab Alumni

pedrorodriguez.io | 909-353-4423
github.com/EntilZha | LinkedIn

On July 16, 2016 at 4:46:04 PM, Jacek Laskowski (ja...@japila.pl) wrote:

Hi,

My understanding is that these two map functions will end up as a job
with one stage (as if you wrote the two maps as a single map), so you
really need as many vcores and as much memory as possible for map1 and
map2. I initially thought dynamic allocation of executors might help
with this case, but since there's just one stage I don't think you can
do much.
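
To illustrate (map1/map2 just stand in for the two functions): without a
shuffle in between, the two maps are pipelined into a single stage, and only
a repartition/coalesce introduces a stage boundary between them:

rdd.map(map1).map(map2)                  // one stage: both maps run in the same tasks
rdd.map(map1).repartition(16).map(map2)  // two stages: map2 can run with a different task count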

Regards,
Jacek Laskowski
----
https://medium.com/@jaceklaskowski/
Mastering Apache Spark http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski


On Fri, Jul 15, 2016 at 9:54 PM, Pavan Achanta <pacha...@sysomos.com> wrote:
> Hi All,
>
> Here is my use case:
>
> I have a pipeline job consisting of 2 map functions:
>
> 1. A CPU-intensive map operation that does not require a lot of memory.
> 2. A memory-intensive map operation that requires up to 4GB of memory. This
> 4GB cannot be distributed since it is an NLP model.
>
> Ideally, what I would like to do is use 20 nodes with 4 cores each and minimal
> memory for the first map operation, and then use only 3 nodes with minimal CPU
> but each having 4GB of memory for the second operation.
>
> While it is possible to control the parallelism for each map operation in
> Spark, I am not sure how to control the resources for each operation.
> Obviously I don’t want to start off the job with 20 nodes with 4 cores and
> 4GB memory since I cannot afford that much memory.
>
> We use YARN with Spark. Any suggestions?
>
> Thanks and regards,
> Pavan
>
>
