Hello Shreesh,
That would be quite a challenge to understand.
A few things that I think should help estimate those numbers:
1) Understanding the cost of the individual transformations in the
application.
E.g., a flatMap can be more expensive in memory than a map (a rough sketch
follows after this list).
2) The communication
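To illustrate point 1), here is a rough sketch; the local context and the
tiny data set are made up purely for illustration:

    import org.apache.spark.{SparkConf, SparkContext}

    // Throwaway local context, only for this example.
    val sc = new SparkContext(
      new SparkConf().setAppName("map-vs-flatMap").setMaster("local[2]"))

    val lines = sc.parallelize(Seq("a b c", "d e"))   // 2 elements
    // map is one-to-one: the output has exactly as many elements as the input.
    val upper = lines.map(_.toUpperCase)              // still 2 elements
    // flatMap is one-to-many: each element can expand into several, so the
    // resulting partitions can be considerably larger in memory.
    val words = lines.flatMap(_.split(" "))           // 5 elements
    println(upper.count())                            // 2
    println(words.count())                            // 5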
I realize that there are a lot of ways to configure my application in Spark.
The part that is not clear is how I decide, for example, how many partitions
I should divide my data into, how much RAM I should have, or how many
workers one should initialize?
Subject: Re: How does one decide no of executors/cores/memory allocation?
Hi Shreesh,
You can definitely decide how many partitions your data should break into by
passing a 'minPartitions' argument to the method
sc.textFile(input/path, minPartitions) and a 'numSlices' argument to the
method sc.parallelize(localCollection, numSlices). In fact, there is almost
always an option to specify the number of partitions when creating an RDD.
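A quick sketch of both; the path, the master URL and the numbers are just
placeholders for illustration:

    import org.apache.spark.{SparkConf, SparkContext}

    // Local master and app name chosen purely for illustration.
    val conf = new SparkConf().setAppName("partition-demo").setMaster("local[4]")
    val sc = new SparkContext(conf)

    // Ask for at least 8 partitions when reading a text file.
    val fileRdd = sc.textFile("hdfs:///input/path", minPartitions = 8)

    // Split a local collection into 4 slices (partitions).
    val localRdd = sc.parallelize(1 to 1000, numSlices = 4)

    println(fileRdd.partitions.length)   // >= 8, depending on the input splits
    println(localRdd.partitions.length)  // 4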
When you submit a job, spark breaks down it into stages, as per DAG. the
stages run transformations or actions on the rdd's. Each rdd constitutes of
N partitions. The tasks creates by spark to execute the stage are equal to
the number of partitions. Every task is executed on the cored utilized
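Continuing the sketch above (reusing the same sc; the numbers are made up),
you can see the partition-to-task relationship directly:

    // An RDD with 16 partitions: any stage computed over it runs as 16 tasks.
    val data = sc.parallelize(1 to 1000000, numSlices = 16)
    println(data.partitions.length)   // 16
    // count() triggers a single stage of 16 tasks, one per partition;
    // each task occupies one executor core while it runs.
    println(data.count())             // 1000000

How many of those tasks run at the same time is capped by the total number of
executor cores you request (roughly number of executors times cores per
executor), so with, say, 8 cores in total the 16 tasks complete in two waves.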