Query regarding CassandraJavaRDD while running spark job on cassandra

Siddharth Verma Fri, 11 Mar 2016 21:52:49 -0800

In Cassandra, I have a table with the following schema:

CREATE TABLE my_keyspace.my_table1 (
    col_1 text,
    col_2 text,
    col_3 text,
    col_4 text,
    col_5 text,
    col_6 text,
    col_7 text,
    PRIMARY KEY (col_1, col_2, col_3)
) WITH CLUSTERING ORDER BY (col_2 ASC, col_3 ASC);


For processing, I create a Spark job.

CassandraJavaRDD<CassandraRow> data1 =
    function.cassandraTable("my_keyspace", "my_table1");


1. Does it guarantee mutual exclusivity of fetched rows across all RDD
partitions on the worker nodes?
(To reiterate, at the cost of redundancy and verbosity:
suppose I have an entry in the table: ('1','2','3','4','5','6','7').
When I perform transformations/actions on the data1 RDD, can I be sure
that the above entry will be present on ONLY ONE worker node?)

2. Will all the data pertaining to one partition be on one node?
(Suppose I have the following entries in the table:
('p1','c2_1','c3_1','4','5','6','7')
('p1','c2_2','c3_2','4','5','6','7')
('p1','c2_3','c3_3','4','5','6','7')
('p1','c2_4','c3_4','4','5','6','7')
('p1' ........)
('p1' ........)
('p1' ........)
Will all the data for the same partition be present on only one node?
)
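
To make questions 1 and 2 concrete, here is a minimal plain-Java sketch of the property being asked about. It is not the connector's implementation. It assumes the read is split into disjoint, contiguous token ranges and that the token depends only on the partition key (col_1). The class TokenRangeSketch, its methods, and the use of String.hashCode() as a stand-in for the real partitioner are all hypothetical and for illustration only.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class TokenRangeSketch {
    // Hypothetical stand-in for the partitioner: the token is derived
    // from the partition key (col_1) only, never from clustering columns.
    static long token(String partitionKey) {
        return partitionKey.hashCode();
    }

    // Splits the 32-bit token space used here into `splits` disjoint,
    // contiguous ranges and returns the index of the range holding `token`.
    static int rangeIndex(long token, int splits) {
        long min = Integer.MIN_VALUE;
        long width = (1L << 32) / splits;          // size of each range
        int idx = (int) ((token - min) / width);
        return Math.min(idx, splits - 1);          // last range absorbs the remainder
    }

    // Assigns each row (col_1 first) to exactly one range index.
    static Map<Integer, List<String[]>> assign(List<String[]> rows, int splits) {
        Map<Integer, List<String[]>> bySplit = new HashMap<>();
        for (String[] row : rows) {
            int idx = rangeIndex(token(row[0]), splits);
            bySplit.computeIfAbsent(idx, k -> new ArrayList<>()).add(row);
        }
        return bySplit;
    }

    public static void main(String[] args) {
        List<String[]> rows = new ArrayList<>();
        rows.add(new String[] {"p1", "c2_1", "c3_1"});
        rows.add(new String[] {"p1", "c2_2", "c3_2"});
        rows.add(new String[] {"1", "2", "3"});
        // Print how many rows ended up in each range.
        for (Map.Entry<Integer, List<String[]>> e : assign(rows, 4).entrySet()) {
            System.out.println("range " + e.getKey() + ": " + e.getValue().size() + " row(s)");
        }
    }
}
```

Under these assumptions each row is assigned to exactly one range, and rows that share col_1 always share a range. Whether the real job behaves this way is exactly what the two questions above ask.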

3. If I have a DC specifically for analytics, and I place a Spark worker
on the same machine as each Cassandra node in that DC, can I make sure
that each Spark worker fetches only the data from the token range present
on its own node (i.e. the node doesn't fetch data present on a different
node)?
3.1 (as with the above statement which doesn't have a 'where' clause).
3.2 (as with the above statement which has a 'where' clause).
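
The locality asked about in question 3 can be pictured with the following plain-Java sketch. It only illustrates the idea of matching token ranges to the node that holds them. The class LocalRangeSketch, the method localRanges, the host names, and the range-to-replica map are hypothetical. It does not call any Spark or connector API, and it says nothing about how a scheduler would actually place tasks.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class LocalRangeSketch {
    // Returns the ids of the token ranges whose replica list contains
    // localHost, i.e. the ranges a co-located worker could read locally.
    static List<Integer> localRanges(Map<Integer, List<String>> replicasByRange,
                                     String localHost) {
        List<Integer> local = new ArrayList<>();
        for (Map.Entry<Integer, List<String>> e : replicasByRange.entrySet()) {
            if (e.getValue().contains(localHost)) {
                local.add(e.getKey());
            }
        }
        Collections.sort(local);
        return local;
    }

    public static void main(String[] args) {
        // Hypothetical ring: three token ranges, each replicated on two nodes.
        Map<Integer, List<String>> replicas = new TreeMap<>();
        replicas.put(0, Arrays.asList("node-a", "node-b"));
        replicas.put(1, Arrays.asList("node-b", "node-c"));
        replicas.put(2, Arrays.asList("node-c", "node-a"));
        System.out.println("node-a can read locally: " + localRanges(replicas, "node-a"));
    }
}
```

With a replication factor above 1, as in this made-up ring, a range is local to more than one node, so "the token range present on that node" is not necessarily unique to a single worker.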
