Hi Ryan,

As I said, saveToCassandra doesn't support "DELETE". That is why I modified
the spark-cassandra-connector code to allow DELETEs. What I changed is how
an RDD row is bound into a batch of CQL PreparedStatements.
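
To illustrate the idea, here is a simplified sketch of that binding logic.
It is not the actual connector patch: the table name, the column names, and
the "deleted" marker on each row are hypothetical, and plain tuples stand in
for bound statements and batches.

```python
from itertools import islice

# Hypothetical table and columns, used only for illustration.
INSERT_CQL = "INSERT INTO ks.items (id, value) VALUES (?, ?)"
DELETE_CQL = "DELETE FROM ks.items WHERE id = ?"


def bind_row(row):
    """Pick the statement for one row and return (cql, bound_values).

    Assumption: a row is a dict, and a truthy "deleted" key marks a DELETE.
    """
    if row.get("deleted"):
        return (DELETE_CQL, (row["id"],))
    return (INSERT_CQL, (row["id"], row["value"]))


def to_batches(rows, batch_size):
    """Yield lists of at most batch_size bound statements, in row order."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    it = iter(rows)
    while True:
        chunk = [bind_row(r) for r in islice(it, batch_size)]
        if not chunk:
            return
        yield chunk


if __name__ == "__main__":
    sample = [
        {"id": 1, "value": "a"},
        {"id": 2, "deleted": True},
        {"id": 3, "value": "c"},
    ]
    for batch in to_batches(sample, 2):
        print(batch)
```

In the real writer, each tuple would correspond to a PreparedStatement bound
with those values, and each chunk to one BatchStatement sent to Cassandra.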



On Fri, Sep 25, 2015 at 7:22 AM, Ryan Svihla <r...@foundev.pro> wrote:

> Why aren’t you using saveToCassandra (
> https://github.com/datastax/spark-cassandra-connector/blob/master/doc/5_saving.md)?
> They have a number of locality-aware optimizations that will probably
> outperform your hand-written bulk loading (especially if you’re not doing
> it inside something like foreachPartition).
>
> Also, you can easily tune the size of those tasks, and therefore the
> batches, up or down to minimize the impact on the prod system.
>
> On Sep 24, 2015, at 5:37 PM, Benyi Wang <bewang.t...@gmail.com> wrote:
>
> I use Spark and spark-cassandra-connector with a customized Cassandra
> writer (spark-cassandra-connector doesn’t support DELETE). Basically the
> writer works as follows:
>
>    - Bind each row of the Spark RDD to either an INSERT or a DELETE
>    PreparedStatement
>    - Create a BatchStatement for multiple rows
>    - Write the batch to Cassandra.
>
> I know using CQLBulkOutputFormat would be better, but it doesn't support
> DELETE.
>
> On Thu, Sep 24, 2015 at 1:27 PM, Gerard Maas <gerard.m...@gmail.com>
> wrote:
>
>> How are you loading the data? I mean, what insert method are you using?
>>
>> On Thu, Sep 24, 2015 at 9:58 PM, Benyi Wang <bewang.t...@gmail.com>
>> wrote:
>>
>>> I have a Cassandra cluster that provides data to a web service, and a
>>> daily batch load writes data into the cluster.
>>>
>>>    - Without the batch loading, the service’s latency 99thPercentile is
>>>    3ms. But during the load, it jumps to 90ms.
>>>    - I checked the Cassandra keyspace’s ReadLatency.99thPercentile,
>>>    which jumps from 600 microseconds to 1ms.
>>>    - The service’s Cassandra Java driver request 99thPercentile was
>>>    90ms during the load.
>>>
>>> The Java driver accounts for most of the time. I know the Cassandra
>>> servers are busy writing, but I want to know which metrics can identify
>>> where the bottleneck is so that I can tune it.
>>>
>>> I’m using Cassandra 2.1.8 and Cassandra Java Driver 2.1.5.
>>>
>>
>>
>
> Regards,
>
> Ryan Svihla
>
>
