If your data is already sufficiently partitioned before sortByKey(), say by
virtue of a prior groupByKey() or reduceByKey() call, then the sortByKey()
that follows should base its number of tasks on that number of partitions.

Of course, setting spark.default.parallelism will also work.
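
E.g., a sketch using the system-property style of configuration (the value 16
and the app name are arbitrary; set the property before the context is
created):

    // Reduce-style operations that are not given an explicit partition
    // count fall back to this value.
    System.setProperty("spark.default.parallelism", "16");
    JavaSparkContext sc = new JavaSparkContext("local", "MyApp");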


On Mon, Dec 9, 2013 at 10:46 PM, Matt Cheah <mch...@palantir.com> wrote:

>  Pardon me - I should be looking at JavaPairRDD, but my point still
> stands that there's no integer parameter for sortByKey(), unlike its Scala
> counterpart:
> http://spark.incubator.apache.org/docs/latest/api/core/index.html#org.apache.spark.api.java.JavaPairRDD
>  ------------------------------
> *From:* Ashish Rangole [arang...@gmail.com]
> *Sent:* Monday, December 09, 2013 7:41 PM
> *To:* user@spark.incubator.apache.org
> *Subject:* Re: JavaRDD, Specify number of tasks
>
>   AFAIK yes.  IIRC, there is a 2nd parameter numPartitions that one can
> provide to these operations.
> On Dec 9, 2013 8:19 PM, "Matt Cheah" <mch...@palantir.com> wrote:
>
>>  Hi,
>>
>>  When I use a JavaPairRDD's groupByKey(), reduceByKey(), or sortByKey(),
>> is there a way for me to specify the number of reduce tasks, as there is in
>> a Scala RDD? Or do I have to set them all to use spark.default.parallelism?
>>
>>  Thanks,
>>
>>  -Matt Cheah
>>
>>  (feels like I've been asking a lot of questions as of late…)
>>
>