Re: Is the RDD's Partitions determined before hand ?

2015-03-04 Thread Imran Rashid
You can set the number of partitions dynamically -- it's just a parameter to a method, so you can compute it however you want; it doesn't need to be some static constant:

    val dataSizeEstimate = yourMagicFunctionToEstimateDataSize()
    val numberOfPartitions =
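
One way the idea in this reply could be filled in is sketched below. The message is cut off in the archive, so this is not the author's code: `PartitionSizing`, `partitionsFor`, and the notion of a target number of bytes per partition are all assumptions made for illustration, and nothing here is part of Spark's API.

```scala
// Hypothetical helper; the names and the sizing rule are assumptions,
// not something taken from Spark or from the original message.
object PartitionSizing {

  /** Partitions needed so that each holds roughly `targetBytesPerPartition`
    * bytes, but never fewer than `minPartitions`. */
  def partitionsFor(dataSizeBytes: Long,
                    targetBytesPerPartition: Long,
                    minPartitions: Int): Int = {
    require(dataSizeBytes >= 0, "data size must be non-negative")
    require(targetBytesPerPartition > 0, "target partition size must be positive")
    require(minPartitions > 0, "minPartitions must be positive")

    // Ceiling division, so a partial last chunk still gets its own partition.
    val remainder = if (dataSizeBytes % targetBytesPerPartition == 0) 0L else 1L
    val needed = dataSizeBytes / targetBytesPerPartition + remainder

    // Clamp into [minPartitions, Int.MaxValue] before narrowing to Int.
    math.max(minPartitions.toLong, math.min(needed, Int.MaxValue.toLong)).toInt
  }
}
```

The result is an ordinary `Int`, so it can be passed wherever an RDD method accepts a partition count, for example `rdd.repartition(n)` or `pairs.reduceByKey(_ + _, n)`.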

Re: Is the RDD's Partitions determined before hand ?

2015-03-04 Thread Sean Owen
Hm, what do you mean? You can control, to some extent, the number of partitions when you read the data, and can repartition if needed. You can set the default parallelism too so that it takes effect for most ops that create an RDD. One number of partitions is usually about right for all work (2x or so

Re: Is the RDD's Partitions determined before hand ?

2015-03-04 Thread Jeff Zhang
Hi Sean, If you know a stage needs unusually high parallelism, for example, you can repartition further for that stage. The problem is that we may not know whether high parallelism is needed, e.g. for the join operator, high parallelism may only be necessary for some dataset that lots of data can

Re: Is the RDD's Partitions determined before hand ?

2015-03-04 Thread Sean Owen
Parallelism doesn't really affect the throughput as long as it's:
- not less than the number of available execution slots,
- ... and probably some low multiple of them to even out task size effects
- not so high that the bookkeeping overhead dominates
Although you may need to select different
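
The three conditions in this reply can be read as a lower and an upper bound on the partition count. The sketch below clamps a requested count into that range. It is only an illustration: `ParallelismBounds`, `choosePartitions`, the default multiple of 2, and the cap of 10000 are assumptions picked for the example, not values recommended in the thread or defined by Spark.

```scala
// Hypothetical helper; the default multiple and the cap are illustrative
// assumptions, not recommendations from the thread or from Spark.
object ParallelismBounds {

  /** Clamp `requested` so that it is at least `executionSlots * slotMultiple`
    * and, where that floor allows, at most `maxPartitions`. */
  def choosePartitions(requested: Int,
                       executionSlots: Int,
                       slotMultiple: Int = 2,
                       maxPartitions: Int = 10000): Int = {
    require(executionSlots > 0, "executionSlots must be positive")
    require(slotMultiple > 0, "slotMultiple must be positive")
    require(maxPartitions > 0, "maxPartitions must be positive")

    // Floor: a low multiple of the slots, to even out task size effects.
    val lower = math.min(executionSlots.toLong * slotMultiple, Int.MaxValue.toLong)
    // Ceiling: keeps bookkeeping overhead bounded, but never below the floor.
    val upper = math.max(lower, maxPartitions.toLong)

    math.min(math.max(requested.toLong, lower), upper).toInt
  }
}
```

With 16 slots and the defaults, a request for 4 partitions is raised to 32, a request for 100 is left alone, and a request for 50000 is cut to 10000.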

Re: Is the RDD's Partitions determined before hand ?

2015-03-03 Thread Jeff Zhang
Thanks Sean. But if the partitions of an RDD are determined beforehand, it would not be flexible to run the same program on different datasets. Although for the first stage the partitions can be determined by the input dataset, for the intermediate stages it is not possible. Users have to create

Re: Is the RDD's Partitions determined before hand ?

2015-03-03 Thread Sean Owen
An RDD has a certain fixed number of partitions, yes. You can't change an RDD. You can repartition() or coalesce() an RDD to make a new one with a different number of partitions, possibly requiring a shuffle.

On Tue, Mar 3, 2015 at 10:21 AM, Jeff Zhang zjf...@gmail.com wrote: I mean is it possible to
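
The point that `repartition()` and `coalesce()` return a new RDD rather than changing the existing one can be checked with a small local program. This is a minimal sketch, assuming Spark is on the classpath and a local master is acceptable; `RepartitionSketch` and `partitionCounts` are names made up for the example, and the partition counts are arbitrary.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical demo object; only the Spark calls themselves are real API.
object RepartitionSketch {

  /** Returns the partition counts of (original, repartitioned, coalesced). */
  def partitionCounts(): (Int, Int, Int) = {
    val conf = new SparkConf()
      .setAppName("repartition-sketch")
      .setMaster("local[4]") // assumption: run locally with 4 threads
    val sc = new SparkContext(conf)
    try {
      // Partition count is fixed when the RDD is created.
      val original = sc.parallelize(1 to 1000, numSlices = 8)

      // repartition() always shuffles and yields a new RDD.
      val wider = original.repartition(32)

      // coalesce() to fewer partitions avoids a shuffle by default.
      val narrower = original.coalesce(2)

      // `original` is untouched by either call.
      (original.partitions.length, wider.partitions.length, narrower.partitions.length)
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit =
    println(partitionCounts())
}
```

Because each call produces a separate RDD, a program can pick the count at runtime for an intermediate stage without touching the RDDs that came before it.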