Thanks! We can roughly estimate the number of rows returned by .where(), and
therefore the number of partitions we need, so the repartition approach will
work.
Had the .where() clause returned a widely varying number of rows, we would
not have been able to estimate the partition count, and that would have
caused inefficiencies.
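
For reference, a minimal sketch of what I have in mind, built on the snippet
from my original mail (OUTPUT_TABLE and the repartition count of 10 are
assumptions for illustration, not a tested implementation):

    import static com.datastax.spark.connector.japi.CassandraJavaUtil.javaFunctions;
    import static com.datastax.spark.connector.japi.CassandraJavaUtil.mapRowTo;
    import static com.datastax.spark.connector.japi.CassandraJavaUtil.mapToRow;
    import org.apache.spark.api.java.JavaRDD;

    // Read and filter: roughly 10,000 rows survive the .where() clause.
    JavaRDD<MyClass> filtered = javaFunctions(sparkContext)
        .cassandraTable(KEY_SPACE, MY_TABLE, mapRowTo(MyClass.class))
        .where(cqlDataFilter, cqlFilterParams);

    // Since we can estimate the post-filter row count, collapse the many
    // small read partitions into ~10 before doing any further work.
    JavaRDD<MyClass> compacted = filtered.repartition(10);

    // Write the compacted RDD back out (OUTPUT_TABLE is a placeholder).
    javaFunctions(compacted)
        .writerBuilder(KEY_SPACE, OUTPUT_TABLE, mapToRow(MyClass.class))
        .saveToCassandra();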

On Mon, May 11, 2015 at 4:50 AM, ayan guha <guha.a...@gmail.com> wrote:

> Hi
>
> I think pushing the filter up would be best. Essentially, I would suggest
> having smallish partitions and filtering the data, then repartitioning the
> 10k records using numPartitions=10 and writing to Cassandra.
>
> Best
> Ayan
>
> On Mon, May 11, 2015 at 5:03 PM, Akhil Das <ak...@sigmoidanalytics.com>
> wrote:
>
>> Did you try repartitioning? You might end up spending a lot of time on
>> GC, though.
>>
>> Thanks
>> Best Regards
>>
>> On Fri, May 8, 2015 at 11:59 PM, Vijay Pawnarkar <
>> vijaypawnar...@gmail.com> wrote:
>>
>>> I am using the Spark Cassandra connector to work with a table containing 3
>>> million records, using the .where() API to work with only certain rows in
>>> this table. The where clause filters the data down to 10,000 rows.
>>>
>>> CassandraJavaUtil.javaFunctions(sparkContext)
>>>     .cassandraTable(KEY_SPACE, MY_TABLE, CassandraJavaUtil.mapRowTo(MyClass.class))
>>>     .where(cqlDataFilter, cqlFilterParams)
>>>
>>>
>>> Also using parameter spark.cassandra.input.split.size=1000
>>>
>>> When this job is processed by the Spark cluster, it creates 3,000 partitions
>>> instead of 10, so 3,000 tasks are executed on the cluster. As the data in
>>> our table grows to 30 million rows, this will create 30,000 tasks instead
>>> of 10.
>>>
>>> Is there a better way to process these 10,000 records with 10 tasks?
>>>
>>> Thanks!
>>>
>>>
>>
>
>
> --
> Best Regards,
> Ayan Guha
>



-- 
-Vijay
