Did you try repartitioning the RDD? You might end up spending a lot of
time on GC, though.
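
Something like the following might work as a rough sketch (untested; it
reuses the names from your snippet -- sparkContext, KEY_SPACE, MY_TABLE,
MyClass, cqlDataFilter, cqlFilterParams -- and assumes the DataStax Java
API). coalesce(10) merges the many small read partitions down to 10
without a full shuffle, while repartition(10) would shuffle but balance
the data more evenly:

import com.datastax.spark.connector.japi.CassandraJavaUtil;
import org.apache.spark.api.java.JavaRDD;

// Read only the rows matching the filter, then collapse the read
// partitions into 10 before further processing.
JavaRDD<MyClass> rows = CassandraJavaUtil.javaFunctions(sparkContext)
        .cassandraTable(KEY_SPACE, MY_TABLE,
                CassandraJavaUtil.mapRowTo(MyClass.class))
        .where(cqlDataFilter, cqlFilterParams);

JavaRDD<MyClass> compacted = rows.coalesce(10);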

Thanks
Best Regards

On Fri, May 8, 2015 at 11:59 PM, Vijay Pawnarkar <vijaypawnar...@gmail.com>
wrote:

> I am using the Spark Cassandra connector to work with a table with 3
> million records, using the .where() API to work with only certain rows
> in this table. The where clause filters the data down to 10,000 rows.
>
> CassandraJavaUtil.javaFunctions(sparkContext)
>     .cassandraTable(KEY_SPACE, MY_TABLE,
>             CassandraJavaUtil.mapRowTo(MyClass.class))
>     .where(cqlDataFilter, cqlFilterParams)
>
>
> I am also using the parameter spark.cassandra.input.split.size=1000.
>
> When the Spark cluster processes this job, it creates 3,000 partitions
> (3 million rows / split size of 1,000) instead of 10, so 3,000 tasks
> are executed. As the data in our table grows to 30 million rows, this
> will become 30,000 tasks instead of 10.
>
> Is there a better way to process these 10,000 records with just 10
> tasks?
>
> Thanks!
>
>
