Hi

I think pushing the filter up would be best. Essentially, I would suggest
keeping smallish partitions and filtering the data first, then repartitioning
the ~10,000 resulting records with numPartitions=10 before writing to
Cassandra. Roughly like the sketch below.
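
A minimal sketch of what I mean, using the connector's Java API. OUTPUT_TABLE
and the variable names are placeholders, and writerBuilder/saveToCassandra
assume a 1.2+ connector, so treat this as an outline rather than a drop-in:

import static com.datastax.spark.connector.japi.CassandraJavaUtil.javaFunctions;
import static com.datastax.spark.connector.japi.CassandraJavaUtil.mapRowTo;
import static com.datastax.spark.connector.japi.CassandraJavaUtil.mapToRow;
import org.apache.spark.api.java.JavaRDD;

// Read with the filter pushed to Cassandra; small input splits are fine here.
JavaRDD<MyClass> filtered = javaFunctions(sparkContext)
        .cassandraTable(KEY_SPACE, MY_TABLE, mapRowTo(MyClass.class))
        .where(cqlDataFilter, cqlFilterParams);

// Collapse the ~10,000 surviving rows into 10 partitions before writing.
// coalesce(10) avoids a full shuffle; use repartition(10) if you want the
// rows rebalanced evenly across the 10 tasks.
JavaRDD<MyClass> small = filtered.coalesce(10);

// Write out with 10 tasks (OUTPUT_TABLE is a placeholder table name).
javaFunctions(small)
        .writerBuilder(KEY_SPACE, OUTPUT_TABLE, mapToRow(MyClass.class))
        .saveToCassandra();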

Best
Ayan

On Mon, May 11, 2015 at 5:03 PM, Akhil Das <ak...@sigmoidanalytics.com>
wrote:

> Did you try repartitioning? You might end up spending a lot of time on GC,
> though.
>
> Thanks
> Best Regards
>
> On Fri, May 8, 2015 at 11:59 PM, Vijay Pawnarkar <vijaypawnar...@gmail.com>
> wrote:
>
>> I am using the Spark Cassandra connector to work with a table with 3
>> million records, using the .where() API to work with only certain rows in
>> this table. The where clause filters the data down to 10,000 rows.
>>
>> CassandraJavaUtil.javaFunctions(sparkContext)
>>     .cassandraTable(KEY_SPACE, MY_TABLE, CassandraJavaUtil.mapRowTo(MyClass.class))
>>     .where(cqlDataFilter, cqlFilterParams)
>>
>>
>> I am also setting the parameter spark.cassandra.input.split.size=1000.
>>
>> When the Spark cluster processes this job, it creates 3,000 partitions
>> instead of 10, so 3,000 tasks are executed. As the data in our table grows
>> to 30 million rows, this will create 30,000 tasks instead of 10.
>>
>> Is there a better way to process these 10,000 records with 10 tasks?
>>
>> Thanks!
>>
>>
>


-- 
Best Regards,
Ayan Guha
