Did you try repartitioning? You might end up with a lot of time spent on GC, though.
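For example, something along these lines (a minimal sketch reusing the names from your snippet; coalesce(10) merges the ~3000 input partitions into 10 without a full shuffle, while repartition(10) would shuffle the data to balance it evenly):

    JavaRDD<MyClass> rows = CassandraJavaUtil.javaFunctions(sparkContext)
            .cassandraTable(KEY_SPACE, MY_TABLE,
                    CassandraJavaUtil.mapRowTo(MyClass.class))
            .where(cqlDataFilter, cqlFilterParams)
            .coalesce(10); // collapse the input partitions into 10 tasks

Raising spark.cassandra.input.split.size would also cut the number of input splits, but since the split count is estimated from the full table size rather than the filtered row count, coalescing after the .where() gives you more direct control.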
Thanks
Best Regards

On Fri, May 8, 2015 at 11:59 PM, Vijay Pawnarkar <vijaypawnar...@gmail.com> wrote:

> I am using the Spark Cassandra connector to work with a table with 3
> million records, using the .where() API to work with only certain rows in
> this table. The where clause filters the data down to 10,000 rows.
>
> CassandraJavaUtil.javaFunctions(sparkContext).cassandraTable(KEY_SPACE,
> MY_TABLE, CassandraJavaUtil.mapRowTo(MyClass.class)).where(cqlDataFilter,
> cqlFilterParams)
>
> I am also using the parameter spark.cassandra.input.split.size=1000.
>
> When this job is processed by the Spark cluster, it creates 3000
> partitions instead of 10, and 3000 tasks are executed on the cluster. As
> the data in our table grows to 30 million rows, this will create 30,000
> tasks instead of 10.
>
> Is there a better way to process these 10,000 records with 10 tasks?
>
> Thanks!