Replication factor works such that writes are replicated to wherever replicas exist. If you want faster responses that don't involve the REST data center in the Spark job, I suggest using a CQL driver with the LOCAL_ONE or LOCAL_QUORUM consistency level (look at the Spark Cassandra Connector here: https://github.com/datastax/spark-cassandra-connector ). Write traffic will still be replicated to the REST service data center, because you do want those results available there, but you will not be waiting on the remote data center to respond "successful".
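As a sketch, the connector can be pointed at the analytics data center and pinned to a local consistency level through its configuration properties. The property names below are from the Spark Cassandra Connector; the host name, data center name, and chosen consistency levels are assumptions for illustration (check the connector version you're on for which properties it supports):

```
# Contact a node in the analytics DC (hypothetical host and DC names)
spark.cassandra.connection.host=analytics-node-1
spark.cassandra.connection.local_dc=Analytics

# Only wait on replicas in the local (analytics) data center
spark.cassandra.input.consistency.level=LOCAL_ONE
spark.cassandra.output.consistency.level=LOCAL_QUORUM
```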
Final point: bulk loading sends a copy per replica across the wire. So let's say you have RF=3 in each data center; bulk loading will then send out 6 copies from that client at once, while normal mutations via thrift or CQL go out between data centers as 1 copy, which that node then forwards on to the other replicas. This means cross-data-center traffic in this case would be 3x more with the bulk loader than with a traditional CQL or thrift based client.

On Wed, Jan 7, 2015 at 6:32 PM, Benyi Wang <bewang.t...@gmail.com> wrote:

> I set up two virtual data centers, one for analytics and one for REST
> service. The analytics data center sits on top of the Hadoop cluster. I want
> to bulk load my ETL results into the analytics data center so that the REST
> service won't take the heavy load. I'm using CQLTableInputFormat in my
> Spark application, and I gave the nodes in the analytics data center as the
> initial address.
>
> However, I found my jobs were connecting to the REST service data center.
>
> How can I specify the data center?

--
Thanks,
Ryan Svihla
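The copy-count arithmetic above can be sketched as follows (RF=3 per data center and 2 data centers are assumptions matching the thread, not measured values):

```python
# One copy per replica streamed directly from the bulk-loading client:
rf_per_dc = 3
num_dcs = 2
bulk_total_copies = rf_per_dc * num_dcs   # 6 copies leave the client at once
bulk_cross_dc_copies = rf_per_dc          # 3 of them cross to the remote DC

# A normal CQL/thrift write sends 1 copy across; a node in the remote DC
# forwards it on to the other local replicas.
normal_cross_dc_copies = 1

print(bulk_total_copies)                                   # -> 6
print(bulk_cross_dc_copies // normal_cross_dc_copies)      # -> 3 (3x cross-DC traffic)
```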