You can set the number of partitions dynamically -- it's just a parameter to
a method, so you can compute it however you want; it doesn't need to be
some static constant:
val dataSizeEstimate = yourMagicFunctionToEstimateDataSize()
val numberOfPartitions = (dataSizeEstimate / targetBytesPerPartition).toInt  // targetBytesPerPartition = whatever per-partition size you're aiming for
Hm, what do you mean? You can control, to some extent, the number of
partitions when you read the data, and can repartition if needed.
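For example, a sketch of both knobs (path and partition counts here are placeholders, not from your job):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("partitions-example"))

// Ask for a minimum number of partitions at read time...
val lines = sc.textFile("hdfs://path/to/input", minPartitions = 64)

// ...and repartition afterwards if a different parallelism
// suits the downstream stages better (this causes a shuffle).
val reshaped = lines.repartition(256)
```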
You can set the default parallelism too so that it takes effect for most
ops that create an RDD. One number of partitions is usually about right for all
work (2x the number of execution slots or so).
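Concretely, that default is the `spark.default.parallelism` setting; a sketch with made-up cluster numbers (16 executors x 4 cores, times 2):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Illustrative figures: 16 executors x 4 cores = 64 slots, times ~2.
val conf = new SparkConf()
  .setAppName("default-parallelism-example")
  .set("spark.default.parallelism", (16 * 4 * 2).toString)
val sc = new SparkContext(conf)

// Ops like reduceByKey or groupByKey that take an optional numPartitions
// now fall back to 128 partitions when none is given explicitly.
```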
Hi Sean,
If you know a stage needs unusually high parallelism, for example, you can
repartition further for that stage.
The problem is we may not know whether high parallelism is needed. E.g.
for the join operator, high parallelism may only be necessary for some
datasets where lots of data can
Parallelism doesn't really affect the throughput as long as it's:
- not less than the number of available execution slots,
- ... and probably some low multiple of them to even out task size effects
- not so high that the bookkeeping overhead dominates
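The bullets above can be turned into a rough rule of thumb (all numbers here are illustrative, not a recommendation):

```scala
// A sketch of the guidance above: at least as many partitions as slots,
// times a small multiple, but nowhere near the point where per-task
// bookkeeping dominates.
val executionSlots = 16 * 4      // e.g. 16 executors x 4 cores
val lowMultiple    = 3           // evens out skewed task sizes
val numPartitions  = executionSlots * lowMultiple  // 192 for this example
```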
Although you may need to select different
Thanks Sean.
But if the number of partitions of an RDD is determined beforehand, it would not be
flexible to run the same program on different datasets. Although for the
first stage the partitions can be determined by the input data set, for the
intermediate stages it is not possible. Users have to create
An RDD has a certain fixed number of partitions, yes. You can't change
an RDD. You can repartition() or coalesce() an RDD to make a new one
with a different number of partitions, possibly requiring a shuffle.
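A small sketch of that distinction (assumes an existing SparkContext `sc`; the counts are arbitrary):

```scala
// The original RDD is immutable; each call below returns a *new* RDD.
val rdd = sc.parallelize(1 to 1000, numSlices = 8)

val wider    = rdd.repartition(32) // full shuffle up to 32 partitions
val narrower = rdd.coalesce(2)     // merges partitions; no shuffle by default

// rdd.partitions.length is still 8 -- wider and narrower are separate RDDs.
```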
On Tue, Mar 3, 2015 at 10:21 AM, Jeff Zhang zjf...@gmail.com wrote:
I mean is it possible to