Replication factor determines where writes go: they will be replicated to
every data center that has replicas. If you want faster responses and don't
want the Spark job waiting on the REST data center, I suggest using a CQL
driver with the LOCAL_ONE or LOCAL_QUORUM consistency level (take a look at
the Spark Cassandra connector here
https://github.com/datastax/spark-cassandra-connector ). Write traffic will
still be replicated to the REST service data center, since you do want those
results available there, but you will not be waiting on the remote data
center to respond "successful".
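For example, with the connector you can pin the Spark job to the analytics
data center and set the write consistency level via its configuration
properties (the host below is a placeholder, and property names can vary
between connector versions, so check the docs for the version you're on):

```
spark.cassandra.connection.host            10.0.0.1
spark.cassandra.connection.local_dc        Analytics
spark.cassandra.output.consistency.level   LOCAL_QUORUM
```

With local_dc set, the connector's data-center-aware load balancing only
coordinates requests through nodes in that DC.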

Final point: bulk loading sends one copy per replica across the wire. Say
you have RF 3 in each data center; bulk loading will then send out 6 copies
from that client at once. Normal mutations via thrift or CQL, by contrast,
go out to the other data center as a single copy, and that node then
forwards it on to the other replicas. This means inter data center traffic
in this case would be 3x more with the bulk loader than with a traditional
cql or thrift based client.
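To make that arithmetic concrete, here's a small sketch of the copy counts
(my own illustration of the math above, not connector or Cassandra code):

```python
def cross_dc_copies(rf_remote, bulk_load):
    """Copies of one write that cross the WAN to a single remote data center.

    The bulk loader streams directly to every remote replica, while a
    normal CQL/thrift write sends one copy to a node in the remote DC,
    which forwards it to the other replicas within that DC.
    """
    return rf_remote if bulk_load else 1


def copies_from_client(rf_per_dc, num_dcs, bulk_load):
    """Total copies of one write sent out from the client itself."""
    if bulk_load:
        return rf_per_dc * num_dcs  # one stream per replica in every DC
    return 1  # one copy to the coordinator; replication is node-to-node


# RF 3 in each of two data centers:
print(copies_from_client(3, 2, bulk_load=True))   # 6 copies from the bulk loader
print(cross_dc_copies(3, bulk_load=True))         # 3 copies over the WAN
print(cross_dc_copies(3, bulk_load=False))        # 1 copy over the WAN
```

So the remote data center link carries 3 copies per write with the bulk
loader versus 1 with a normal client, i.e. 3x the inter-DC traffic.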



On Wed, Jan 7, 2015 at 6:32 PM, Benyi Wang <bewang.t...@gmail.com> wrote:

> I set up two virtual data centers, one for analytics and one for REST
> service. The analytics data center sits top on Hadoop cluster. I want to
> bulk load my ETL results into the analytics data center so that the REST
> service won't have the heavy load. I'm using CQLTableInputFormat in my
> Spark Application, and I gave the nodes in the analytics data center as
> the initial addresses.
>
> However, I found my jobs were connecting to the REST service data center.
>
> How can I specify the data center?
>



-- 

Thanks,
Ryan Svihla
