We appear to have encountered an issue with Cassandra 0.7.5 after upgrading from 0.7.2. While doing a batch read using a get_range_slice against the ranges an individual node is master for, we can consistently reproduce the following: the last two nodes in the ring, regardless of ring size (we have a 60-node production cluster and a 12-node test cluster), perform this read over the network using replicas instead of executing locally. Every other node in the ring successfully reads locally.
To be sure there were no data consistency issues, we ran a nodetool repair against both of these nodes, and the issue persists. We also tried truncating the column family and repopulating it, but the issue remains. This seems to be related to CASSANDRA-2286 in 0.7.4.

We always want to read data locally if it is available there. We use Cassandra.Client.describe_ring() to figure out which machine in the ring is master for which TokenRange. I then compare the master for each TokenRange against the localhost to find out which token ranges are owned by the local machine (remote reads are too slow for this type of batch processing). Once I know which TokenRanges are local to each machine, I get evenly sized splits using Cassandra.Client.describe_splits().

Adam
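
A minimal sketch of the local-range selection described above, for readers trying to reproduce the setup. It does not call Cassandra: the `TokenRange` class is a stand-in for the Thrift struct returned by describe_ring(), the ring contents and addresses are hypothetical, and treating the first listed endpoint as the master for a range is an assumption about how the comparison is done.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TokenRange:
    """Stand-in for the Thrift TokenRange struct returned by describe_ring()."""
    start_token: str
    end_token: str
    endpoints: List[str]


def local_ranges(ring, local_address):
    """Return the ranges whose first listed endpoint is the local node.

    Assumption: the first endpoint of each range is treated as its master.
    """
    return [r for r in ring if r.endpoints and r.endpoints[0] == local_address]


# Hypothetical three-node ring with two replicas per range.
ring = [
    TokenRange("0", "100", ["10.0.0.1", "10.0.0.2"]),
    TokenRange("100", "200", ["10.0.0.2", "10.0.0.3"]),
    TokenRange("200", "0", ["10.0.0.3", "10.0.0.1"]),
]

# Ranges the node at 10.0.0.3 would read locally.
mine = local_ranges(ring, "10.0.0.3")
```

In a real client, the start and end tokens of each selected range would then be passed to describe_splits() to obtain the evenly sized splits for the batch read.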