Shahab,

Regardless, WRT Cassandra and Spark: when using the Spark Cassandra Connector,
'spark.cassandra.input.split.size' passed into the SparkConf configures the
approximate number of Cassandra partitions per Spark partition (default 100000).
No repartitioning should be necessary with what you have below, but I don't
know whether you are running on one node or a cluster.
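As a minimal sketch of what I mean (the app name and the value 10000 are only illustrative assumptions, not tuning recommendations):

```scala
import org.apache.spark.SparkConf

// spark.cassandra.input.split.size: the approximate number of Cassandra
// partitions the connector groups into a single Spark partition.
// Lowering it from the default (100000) produces more, smaller Spark
// partitions, i.e. more read parallelism, without calling repartition().
val conf = new SparkConf()
  .setAppName("cassandra-read") // hypothetical app name
  .set("spark.cassandra.input.split.size", "10000")
```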

These are good initial references: 
https://github.com/datastax/spark-cassandra-connector/blob/master/doc/2_loading.md#configuration-options-for-adjusting-reads
https://github.com/datastax/spark-cassandra-connector/blob/master/spark-cassandra-connector/src/main/scala/com/datastax/spark/connector/rdd/CassandraRDD.scala#L26-L37

Cheers,
Helena
@helenaedelson

On Oct 30, 2014, at 1:12 PM, Helena Edelson <helena.edel...@datastax.com> wrote:

> Hi Shahab, 
> -How many spark/cassandra nodes are in your cluster?
> -What is your deploy topology for spark and cassandra clusters? Are they 
> co-located?
> 
> - Helena
> @helenaedelson
> 
> On Oct 30, 2014, at 12:16 PM, shahab <shahab.mok...@gmail.com> wrote:
> 
>> Hi.
>> 
>> I am running an application in the Spark which first loads data from 
>> Cassandra and then performs some map/reduce jobs.
>> val srdd = sqlContext.sql("select * from mydb.mytable")
>> 
>> I noticed that the "srdd" has only one partition, no matter how big the 
>> data loaded from Cassandra is.
>> 
>> So I perform "repartition" on the RDD , and then I did the map/reduce 
>> functions.
>> 
>> But the main problem is that "repartition" takes so much time (almost 2 
>> min), which is not acceptable in my use case. Is there any better way to 
>> do repartitioning?
>> 
>> best,
>> /Shahab
> 
