Hi, I think pushing the filter up would be best. Essentially, I would suggest reading with smallish partitions and filtering the data, then repartitioning the ~10k surviving records with numPartitions=10 before writing to Cassandra. A rough sketch of what I mean is below.
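Something like this (an untested sketch against the connector's 1.2-style Java API; MY_OUTPUT_TABLE is just a placeholder for whatever table you write to, everything else is taken from your snippet):

import static com.datastax.spark.connector.japi.CassandraJavaUtil.*;
import org.apache.spark.api.java.JavaRDD;

// Read only the rows matching the CQL filter (same call as in your snippet).
JavaRDD<MyClass> filtered = javaFunctions(sparkContext)
    .cassandraTable(KEY_SPACE, MY_TABLE, mapRowTo(MyClass.class))
    .where(cqlDataFilter, cqlFilterParams);

// Shuffle the ~10k surviving rows into exactly 10 partitions so the
// stages after the shuffle run as 10 tasks.
JavaRDD<MyClass> small = filtered.repartition(10);

// Write the 10 partitions back to Cassandra.
// MY_OUTPUT_TABLE is hypothetical; substitute your target table.
javaFunctions(small)
    .writerBuilder(KEY_SPACE, MY_OUTPUT_TABLE, mapToRow(MyClass.class))
    .saveToCassandra();

Note the initial scan will still run with however many tasks the connector's split size produces; repartition(10) only collapses the filtered data for everything downstream of the shuffle.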
Best
Ayan

On Mon, May 11, 2015 at 5:03 PM, Akhil Das <ak...@sigmoidanalytics.com> wrote:

> Did you try repartitioning? You might end up with a lot of time spent on
> GC though.
>
> Thanks
> Best Regards
>
> On Fri, May 8, 2015 at 11:59 PM, Vijay Pawnarkar <vijaypawnar...@gmail.com> wrote:
>
>> I am using the Spark Cassandra connector to work with a table with 3
>> million records, using the .where() API to work with only certain rows in
>> this table. The where clause filters the data down to 10,000 rows.
>>
>> CassandraJavaUtil.javaFunctions(sparkContext)
>>     .cassandraTable(KEY_SPACE, MY_TABLE, CassandraJavaUtil.mapRowTo(MyClass.class))
>>     .where(cqlDataFilter, cqlFilterParams)
>>
>> I am also using the parameter spark.cassandra.input.split.size=1000.
>>
>> As this job is processed by the Spark cluster, it created 3000 partitions
>> instead of 10, and 3000 tasks are being executed on the cluster. As the
>> data in our table grows to 30 million rows, this will create 30,000 tasks
>> instead of 10.
>>
>> Is there a better way to process these 10,000 records with 10 tasks?
>>
>> Thanks!

--
Best Regards,
Ayan Guha